The Google Cloud Storage connector for Hadoop lets you run Hadoop or Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing Hadoop Distributed File System (HDFS) as your default file system.
Benefits of using the connector
Choosing Cloud Storage alongside the Hadoop Distributed File System (HDFS) has several benefits:
- Direct data access
Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first.
- HDFS compatibility
You can store data in HDFS in addition to Cloud Storage, and access it with the connector by using a different file path.
Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and other Google services.
- Data accessibility
When you shut down a Hadoop cluster, you retain access to your data in Cloud Storage, unlike data in HDFS, which is available only while the cluster is running.
- High data availability
Data stored in Cloud Storage is highly available and globally replicated without a performance hit.
- No storage management overhead
Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system or upgrading and rolling back to a previous version of the file system.
- Quick startup
In HDFS, a MapReduce job can't start until the NameNode is out of safe mode—a process that can take from a few seconds to many minutes depending on the size and state of your data. With Google Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
Getting the connector
There are several ways to get the Google Cloud Storage connector. The best method to choose depends on whether you are creating a new cluster or using an existing cluster.
There are two simple ways to create Spark and Hadoop clusters on Google Cloud Platform. The Cloud Storage connector is automatically installed with each method:
- Google Cloud Dataproc — Helps you create fast, reliable, and cost-effective managed Spark and Hadoop clusters. Cloud Storage is a great fit for Cloud Dataproc since Cloud Dataproc clusters only run when you need them. Since Cloud Dataproc is a managed service, it is the easiest way to create a Spark or Hadoop cluster.
- Command line setup scripts — Create Spark and Hadoop clusters with a set of command line utilities. The Cloud Storage Connector is automatically installed when you run the scripts.
If you have an existing cluster, you can download the Hadoop 1.x compatible connector or the Hadoop 2.x compatible connector. The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub. See Manually installing the connector for instructions.
Configuring the connector
When you set up a Hadoop cluster as described in Setting up a Hadoop cluster, the cluster is automatically configured for optimal use with the connector. Typically, no further configuration is needed.
To customize the connector, specify configuration values in:
- core-site.xml, either in:
- the Hadoop configuration directory on the machine on which the connector is installed, or
- conf/hadoop<version>gcs-core-template.xml in the bdutil package directory, before deploying a cluster with bdutil
For a complete list of configuration keys and their default values see gcs-core-default.xml.
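For example, to associate the connector with a specific project and default bucket, you can set values like the following in core-site.xml. This is a sketch: the property names fs.gs.project.id and fs.gs.system.bucket are assumed from gcs-core-default.xml, and the values shown are placeholders to replace with your own.

```xml
<property>
  <name>fs.gs.project.id</name>
  <!-- Placeholder: your Google Cloud project ID -->
  <value>your-project-id</value>
  <description>Project ID with access to the Cloud Storage buckets.</description>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <!-- Placeholder: the bucket used as the connector's default (system) bucket -->
  <value>your-config-bucket</value>
  <description>Bucket used by the connector as its default bucket.</description>
</property>
```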
Accessing Cloud Storage data
There are multiple ways to access your Cloud Storage data:
- The hadoop shell:
hadoop fs -ls gs://<CONFIGBUCKET>/dir/file (recommended), or
hadoop fs -ls /dir/file if Cloud Storage is your default file system
- The Cloud Platform Console Cloud Storage browser
- The Cloud Storage JSON API
Manually installing the connector
To install the connector manually, complete the following steps.
Ensure authenticated Cloud Storage access
You must install the connector onto an authenticated Google Compute Engine VM configured to have access to the Cloud Storage scope you intend to use the connector for, or have an OAuth 2.0 private key. Installing the connector on a machine other than a Compute Engine VM can lead to higher Cloud Storage access costs. For more information, see Cloud Storage Pricing.
Download the connector
Add the connector jar to Hadoop's classpath
Placing the connector jar in the appropriate subdirectory of the Hadoop installation may be enough for Hadoop to load it. However, to be certain that the jar is loaded, add it to the HADOOP_CLASSPATH environment variable in hadoop-env.sh in the Hadoop configuration directory.
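For example, you might append a line like the following to hadoop-env.sh. The jar path shown is a placeholder; substitute the actual location and file name of the connector jar you downloaded.

```shell
# hadoop-env.sh: put the Cloud Storage connector jar on Hadoop's classpath.
# /usr/local/lib/gcs-connector.jar is a placeholder path; use your own.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/lib/gcs-connector.jar
```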
Based on the steps in configuring the connector, you must add the following two properties to core-site.xml.
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>
    The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
  </description>
</property>
If you chose to use a private key for Cloud Storage authorization, make sure to set the necessary google.cloud.auth values documented in gcs-core-default.xml.
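As a sketch, a service-account key configuration in core-site.xml might look like the following. The google.cloud.auth.service.account.* property names are assumed from gcs-core-default.xml, and the email address and keyfile path are placeholders for your own values.

```xml
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <!-- Placeholder: your service account's email address -->
  <value>your-service-account@developer.gserviceaccount.com</value>
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <!-- Placeholder: path to the private key file on the cluster machine -->
  <value>/path/to/privatekey.p12</value>
</property>
```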
Test the installation
On the command line, type hadoop fs -ls gs://<some-bucket>, where <some-bucket> is the Google Cloud Storage bucket to which you gave the connector read access. The command should output the top-level directories and objects contained in <some-bucket>. If there is a problem, see Troubleshooting the installation.
Troubleshooting the installation
- If the installation test reported No FileSystem for scheme: gs, make sure that you correctly set the two properties from the manual configuration steps in the correct core-site.xml.
- If the test reported java.lang.ClassNotFoundException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, check that you added the connector to Hadoop's classpath.
- If the test issued a message related to authorization, make sure that you have access to <some-bucket> with gsutil (for example, by running gsutil ls gs://<some-bucket>), and that the credentials in your configuration are correct.