The Google Cloud Storage connector for Hadoop lets you run MapReduce jobs directly on data in Google Cloud Storage, and offers a number of benefits over choosing Hadoop Distributed File System (HDFS) as your default file system.
- Benefits of using the connector
- Getting the connector
- Configuring the connector
- Accessing Google Cloud Storage data
- Manually installing the connector
Benefits of using the connector
- Direct data access.
Store your data in Google Cloud Storage and access it directly, with no need to transfer it into HDFS first.
- HDFS compatibility.
You can store data in HDFS, in addition to Google Cloud Storage, and access it with the connector by using a different file path.
Storing data in Google Cloud Storage enables seamless interoperability between Hadoop and other Google services.
- Data accessibility.
When you shut down a Hadoop cluster, unlike with HDFS, you still have access to your data in Google Cloud Storage.
- High data availability.
Data stored in Google Cloud Storage is highly available and globally replicated without a performance hit.
- No storage management overhead.
Unlike HDFS, Google Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc.
- Quick startup.
In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take anywhere from a few seconds to many minutes depending on the size and state of your data. If you use HDFS, you are billed for the Google Compute Engine cycles spent waiting for the NameNode to exit safe mode. With Google Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
Getting the connector
The Google Cloud Storage connector for Hadoop is included as part of the setup scripts, and is installed automatically when you unzip the archive and run the scripts.
If you want to use the connector without running the setup scripts, you can download the Hadoop 1.x-compatible connector or the Hadoop 2.x-compatible connector. The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub. To install the connector manually, see Manually installing the connector.
Configuring the connector
When you set up a Hadoop cluster by following the steps at setting up a Hadoop cluster, the cluster is automatically configured for optimal use with the connector. There is typically no need to configure the connector further.
To customize the connector, specify configuration values in core-site.xml, either in the Hadoop configuration directory on the machine on which the connector is installed, or in conf/hadoop<version>gcs-core-template.xml in the bdutil package directory before deploying a cluster with bdutil. For a complete list of configuration keys and their default values, see gcs-core-default.xml.
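As an illustration, a minimal set of core-site.xml overrides might look like the following sketch. The project ID and bucket name are placeholders, and you should confirm each key name against gcs-core-default.xml before relying on it:

```xml
<!-- Sketch of connector overrides for core-site.xml; values are placeholders. -->
<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
  <description>Google Cloud project ID associated with the bucket.</description>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>your-config-bucket</value>
  <description>Bucket the connector uses for system and temporary files.</description>
</property>
```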
Accessing Google Cloud Storage data
There are multiple ways to access your Google Cloud Storage data:
- The hadoop shell:
hadoop fs -ls gs://<CONFIGBUCKET>/dir/file (recommended), or
hadoop fs -ls /dir/file if Google Cloud Storage is configured as the default file system
- The Developers Console Cloud Storage browser
- The Google Cloud Storage JSON API
Manually installing the connector
To install the connector manually, complete the following steps.
Ensure authenticated Google Cloud Storage access
You must install the connector onto an authenticated Google Compute Engine VM configured to have access to the Google Cloud Storage scope you intend to use the connector for, or have an OAuth 2.0 private key. Installing the connector on a machine other than a Google Compute Engine VM might lead to higher Google Cloud Storage access costs. For more information, see Google Cloud Storage Pricing.
See the official documentation
Download the connector
Add the connector jar to Hadoop's classpath
Placing the connector jar in the proper subdirectory of the Hadoop installation may be sufficient for Hadoop to load it. However, to be safe, add
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:</path/to/gcs-connector-jar> to hadoop-env.sh in the Hadoop configuration directory.
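For example, the hadoop-env.sh addition can be sketched as follows. The jar path and file name here are assumptions; substitute the actual location where you placed the connector jar:

```shell
# hadoop-env.sh fragment (sketch). The path below is a placeholder --
# point it at the gcs-connector jar you downloaded.
GCS_CONNECTOR_JAR=/usr/local/hadoop/lib/gcs-connector-hadoop1.jar
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$GCS_CONNECTOR_JAR"
echo "$HADOOP_CLASSPATH"
```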
In addition to the configuration described above, you must add the following two properties to core-site.xml.
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>
    The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
  </description>
</property>
If you chose to use a private key for Google Cloud Storage authorization, make sure to set the necessary
google.cloud.auth values documented in gcs-core-default.xml.
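For instance, a service-account private-key configuration might look like the sketch below. The service account email and keyfile path are placeholders, and the exact key names should be verified against gcs-core-default.xml:

```xml
<!-- Sketch of service-account auth for core-site.xml; values are placeholders. -->
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <value>your-service-account@developer.gserviceaccount.com</value>
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <value>/path/to/privatekey.p12</value>
</property>
```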
Test the installation
On the command line, type
hadoop fs -ls gs://<some-bucket>, where
<some-bucket> is a Google Cloud Storage bucket to which the credentials you configured the connector with have read access. The command should output the top-level directories and objects contained in
<some-bucket>. If it does not, see Troubleshooting the installation.
Troubleshooting the installation
If the installation test gave the output
No FileSystem for scheme: gs, check that you correctly set the two properties described in the manual configuration in the correct core-site.xml. If you see the output
java.lang.ClassNotFoundException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, check that you added the connector to Hadoop's classpath. If you see a message about authorization, ensure that you have access to bucket
<some-bucket> with gsutil and that the credentials in your configuration are correct.