
Google Cloud Storage Connector for Hadoop Overview

The Google Cloud Storage connector for Hadoop lets you run MapReduce jobs directly on data in Google Cloud Storage, and offers a number of benefits over choosing Hadoop Distributed File System (HDFS) as your default file system.

Benefits of using the connector

When you first set up a Hadoop cluster, you choose a default file system: HDFS or Google Cloud Storage. Choosing Google Cloud Storage, together with the supplied connector, has several benefits.

  • Direct data access.

    Store your data in Google Cloud Storage and access it directly, with no need to transfer it into HDFS first.

  • HDFS compatibility.

    You can store data in HDFS in addition to Google Cloud Storage, and access it with the connector simply by using a different file path; see the example after this list.

  • Interoperability.

    Storing data in Google Cloud Storage enables seamless interoperability between Hadoop and other Google services.

  • Data accessibility.

    When you shut down a Hadoop cluster, you still have access to your data in Google Cloud Storage; data kept only in HDFS is no longer accessible once the cluster is shut down.

  • High data availability.

    Data stored in Google Cloud Storage is highly available and globally replicated without a performance hit.

  • No storage management overhead.

    Unlike HDFS, Google Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc.

  • Quick startup.

    In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take anywhere from a few seconds to many minutes depending on the size and state of your data. With HDFS you are billed for the Google Compute Engine time the cluster spends waiting for the NameNode to exit safe mode; with Google Cloud Storage, you can start your jobs as soon as the task nodes start, leading to significant cost savings over time.
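
For example, the HDFS compatibility point above means that the same shell command works against either file system; only the scheme of the path changes. The NameNode host, bucket, and directory names below are placeholders:

  hadoop fs -ls hdfs://<namenode>/user/hadoop/logs    # data kept in HDFS
  hadoop fs -ls gs://<some-bucket>/logs               # data kept in Google Cloud Storage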

Getting the connector

The Google Cloud Storage connector for Hadoop is included as part of the setup scripts, and is installed automatically when you unzip the archive and run the scripts.

If you want to use the connector without running the setup scripts, you can download the Hadoop 1.x compatible connector or the Hadoop 2.x compatible connector. The connector is also available as part of the Google Cloud Platform bigdata-interop project on GitHub. To install the connector manually, see Manually installing the connector.

Configuring the connector

When you set up a Hadoop cluster by following the steps at setting up a Hadoop cluster, the cluster is automatically configured for optimal use with the connector. There is typically no need to configure the connector further.

To customize the connector, specify configuration values in core-site.xml, either in the Hadoop configuration directory on the machine where the connector is installed, or in conf/hadoop<version>gcs-core-template.xml in the bdutil package directory before deploying a cluster with bdutil. For a complete list of configuration keys and their default values, see gcs-core-default.xml.
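
As an illustration, to associate the connector with a particular Cloud project you might add an entry such as the following. fs.gs.project.id is one of the keys listed in gcs-core-default.xml; the value shown is a placeholder you would replace with your own project ID:

  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-id</value>
    <description>Google Cloud project ID associated with the connector's buckets (placeholder value).</description>
  </property>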

Accessing Google Cloud Storage data

There are multiple ways to access your Google Cloud Storage data: anywhere Hadoop accepts a file path, you can supply a gs://<bucket>/<object-path> URI instead, whether on the hadoop fs command line or as the input or output of a MapReduce job.
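
For example (illustrative only; the bucket name, paths, and examples jar location are placeholders):

  # Read an object directly from Google Cloud Storage.
  hadoop fs -cat gs://<some-bucket>/input/part-00000

  # Run a MapReduce job whose input and output both live in Google Cloud Storage.
  hadoop jar /path/to/hadoop-examples.jar wordcount \
      gs://<some-bucket>/input gs://<some-bucket>/output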

Manually installing the connector

To install the connector manually, complete the following steps.

Ensure authenticated Google Cloud Storage access

You must install the connector either on a Google Compute Engine VM that is authorized for the Google Cloud Storage scope you intend to use with the connector, or on a machine with an OAuth 2.0 private key. Installing the connector on a machine other than a Google Compute Engine VM might lead to higher Google Cloud Storage access costs. For more information, see Google Cloud Storage Pricing.

Install Hadoop

See the official Hadoop documentation.

Download the connector

See Getting the connector, above.

Add the connector jar to Hadoop's classpath

Placing the connector jar in the proper subdirectory of the Hadoop installation may be sufficient for the Hadoop class loader to find it. However, to be safe, add HADOOP_CLASSPATH=$HADOOP_CLASSPATH:</path/to/gcs-connector-jar> to hadoop-env.sh in the Hadoop configuration directory.
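
For example, the line in hadoop-env.sh might look like the following; the jar path is a placeholder for wherever you placed the downloaded connector:

  # In <hadoop-conf-dir>/hadoop-env.sh
  export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/gcs-connector-<version>.jar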

Configure Hadoop

In addition to the configuration described above, you must add the following two properties to core-site.xml.

  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
      The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
  </property>

If you chose to use a private key for Google Cloud Storage authorization, make sure to set the necessary google.cloud.auth values documented in gcs-core-default.xml.
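
For example, a service-account configuration might look like the following sketch. Check gcs-core-default.xml for the authoritative key names and values; the email address and keyfile path below are placeholders:

  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.email</name>
    <value>your-service-account@developer.gserviceaccount.com</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.keyfile</name>
    <value>/path/to/private-key.p12</value>
  </property>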

Test the installation

On the command line, type hadoop fs -ls gs://<some-bucket>, where <some-bucket> is a Google Cloud Storage bucket to which the credentials you configured for the connector have read access. The command should output the top-level directories and objects contained in <some-bucket>. If it does not, see Troubleshooting the installation.
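
For example (with <some-bucket> as a placeholder; the second command is the same gsutil check suggested in the troubleshooting section below):

  hadoop fs -ls gs://<some-bucket>    # should list the bucket's top-level directories and objects
  gsutil ls gs://<some-bucket>        # confirms your credentials can reach the bucket outside Hadoop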

Troubleshooting the installation

If the installation test gives the output No FileSystem for scheme: gs, check that you correctly set the two properties from the configuration step above in the correct core-site.xml. If you see the output java.lang.ClassNotFoundException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, check that you added the connector jar to Hadoop's classpath. If you see a message about authorization, ensure that you can access the bucket <some-bucket> with gsutil and that the credentials in your configuration are correct.
