Google Cloud Storage Connector for Spark and Hadoop

The Google Cloud Storage connector for Hadoop lets you run Hadoop or Spark jobs directly on data in Cloud Storage, and offers a number of benefits over using the Hadoop Distributed File System (HDFS) as your default file system.

Benefits of using the connector

Choosing Cloud Storage alongside the Hadoop Distributed File System (HDFS) has several benefits:

  • Direct data access

    Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first.

  • HDFS compatibility

    You can store data in HDFS in addition to Cloud Storage, and access it with the connector by using a different file path.

  • Interoperability

    Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and other Google services.

  • Data accessibility

    Unlike data in HDFS, data in Cloud Storage remains accessible after you shut down a Hadoop cluster.

  • High data availability

    Data stored in Cloud Storage is highly available and globally replicated without a performance hit.

  • No storage management overhead

    Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system or upgrading and rolling back to a previous version of the file system.

  • Quick startup

    In HDFS, a MapReduce job can't start until the NameNode is out of safe mode—a process that can take from a few seconds to many minutes depending on the size and state of your data. With Google Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.


Getting the connector

There are several ways to get the Google Cloud Storage connector. The best method to choose depends on whether you are creating a new cluster or using an existing cluster.

New clusters

There are two simple ways to create Spark and Hadoop clusters on Google Cloud Platform. The Cloud Storage connector is automatically installed with each method:

  1. Google Cloud Dataproc — Helps you create fast, reliable, and cost-effective managed Spark and Hadoop clusters. Cloud Storage is a great fit for Cloud Dataproc since Cloud Dataproc clusters only run when you need them. Since Cloud Dataproc is a managed service, it is the easiest way to create a Spark or Hadoop cluster.
  2. Command line setup scripts — Create Spark and Hadoop clusters with a set of command line utilities. The Cloud Storage connector is automatically installed when you run the scripts.

Existing clusters

If you have an existing cluster, you can download the Hadoop 1.x compatible connector or the Hadoop 2.x compatible connector. The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub. See Manually installing the connector for instructions.
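For an existing cluster, downloading the jar can be scripted. The sketch below assumes the Hadoop 2.x jar is published at a public Cloud Storage URL; the exact URL is an assumption, so verify it against the download links above before use:

```shell
#!/bin/sh
# Assumed download location for the Hadoop 2.x connector jar;
# verify against the official download links before relying on it.
CONNECTOR_URL="https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar"

# On a machine with network access, fetch the jar with:
#   curl -o gcs-connector.jar "$CONNECTOR_URL"
echo "$CONNECTOR_URL"
```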


Configuring the connector

When you set up a Hadoop cluster by following the instructions in Setting up a Hadoop cluster, the cluster is automatically configured for optimal use with the connector. Typically, no further configuration is needed.

To customize the connector, specify configuration values in:

  • core-site.xml, either in:
    • the Hadoop configuration directory on the machine on which the connector is installed, or
    • conf/hadoop<version>gcs-core-template.xml in the bdutil package directory, before deploying a cluster with bdutil

For a complete list of configuration keys and their default values see gcs-core-default.xml.
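For example, a common customization is setting the Google Cloud project ID the connector should use. The key name below follows gcs-core-default.xml; the value is a placeholder:

```xml
<!-- core-site.xml fragment (sketch); the value is a hypothetical placeholder. -->
<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
  <description>Google Cloud project ID used by the connector.</description>
</property>
```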


Accessing Cloud Storage data

After the connector is installed and configured, you can access Cloud Storage data anywhere Hadoop accepts a file path by using the gs:// URI scheme: on the hadoop fs command line, in MapReduce input and output paths, and from Spark jobs.
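As a minimal sketch, the gs:// scheme can be exercised from the shell. The bucket name below is hypothetical, and the commands run only if hadoop is on the PATH:

```shell
#!/bin/sh
# "my-bucket" is a hypothetical bucket name; replace with your own.
BUCKET="my-bucket"

if command -v hadoop >/dev/null 2>&1; then
  # List the top-level contents of the bucket:
  hadoop fs -ls "gs://${BUCKET}/"
  # Copy data between HDFS and Cloud Storage:
  hadoop fs -cp "hdfs:///input/part-00000" "gs://${BUCKET}/input/part-00000"
fi
```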


Manually installing the connector

To install the connector manually, complete the following steps.

Ensure authenticated Cloud Storage access

You must install the connector on an authenticated Google Compute Engine VM configured with access to the Cloud Storage scope you intend to use the connector with, or have an OAuth 2.0 private key. Installing the connector on a machine other than a Compute Engine VM can lead to higher Cloud Storage access costs. For more information, see Cloud Storage Pricing.

Install Hadoop

See the documentation.

Download the connector

See Getting the connector.

Add the connector jar to Hadoop's classpath

Placing the connector jar in the appropriate subdirectory of the Hadoop installation may be enough for Hadoop to load the jar. However, to be certain that the jar is loaded, add HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/gcs-connector-jar to hadoop-env.sh in the Hadoop configuration directory.
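The classpath change can be sketched as follows; the jar path is hypothetical, so substitute the location where you actually placed the jar:

```shell
#!/bin/sh
# Sketch of the line to add to hadoop-env.sh.
# The jar path below is a hypothetical example.
GCS_CONNECTOR_JAR="/usr/local/lib/gcs-connector.jar"
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:${GCS_CONNECTOR_JAR}"
echo "$HADOOP_CLASSPATH"
```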

Configure Hadoop

Based on the steps in Configuring the connector, you must add the following two properties to core-site.xml:

    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
      <description>The FileSystem for gs: (GCS) uris.</description>
    </property>

    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
      <description>
        The AbstractFileSystem for gs: (GCS) uris.
        Only necessary for use with Hadoop 2.
      </description>
    </property>

If you chose to use a private key for Cloud Storage authorization, make sure to set the necessary values documented in gcs-core-default.xml.
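As a rough sketch, private-key (service account) authorization is typically configured with keys along these lines; treat the exact key names and values below as assumptions, and confirm them against gcs-core-default.xml for your connector version:

```xml
<!-- Sketch only: verify key names in gcs-core-default.xml.
     The email and keyfile values are hypothetical placeholders. -->
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/privatekey.p12</value>
</property>
```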

Test the installation

On the command line, type hadoop fs -ls gs://<some-bucket>, where <some-bucket> is the Google Cloud Storage bucket to which you gave the connector read access. The command should output the top-level directories and objects contained in <some-bucket>. If there is a problem, see Troubleshooting the installation.

Troubleshooting the installation

  • If the installation test reported No FileSystem for scheme: gs, make sure that you correctly set the two properties from the Configure Hadoop step in the core-site.xml used by your cluster.
  • If the test reported java.lang.ClassNotFoundException:, check that you added the connector to Hadoop's classpath.
  • If the test issued a message related to authorization, make sure that you have access to <some-bucket> with gsutil, and that the credentials in your configuration are correct.
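For the authorization case, access can be checked independently of Hadoop with gsutil (part of the Cloud SDK). A small sketch, using a hypothetical bucket name and running only if gsutil is installed:

```shell
#!/bin/sh
# "my-bucket" is a hypothetical bucket name; replace with your own.
BUCKET="my-bucket"

if command -v gsutil >/dev/null 2>&1; then
  # If this fails with a permission error, fix credentials before
  # debugging the connector itself.
  gsutil ls "gs://${BUCKET}"
else
  echo "gsutil not installed; skipping check"
fi
```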

