Installing the Cloud Storage connector

You can install and use the Cloud Storage connector on an Apache Hadoop/Spark cluster, for example to move on-premises HDFS data to Cloud Storage. Note that you can install the connector on a Spark standalone cluster, but you'll need to set up the configuration file as described in installation step 3.

Steps to install the connector

  1. Download the Cloud Storage connector.

    Hadoop 1.x

    1. Download the Cloud Storage connector for Hadoop 1.x.
    2. Copy the JAR file into your hadoop/lib directory (for Spark standalone mode, see the second command below).
      cp ~/Downloads/gcs-connector-hadoop1-latest.jar /your/hadoop/dir/lib/
      When running a Spark standalone cluster:
      cp ~/Downloads/gcs-connector-hadoop1-latest.jar $SPARK_HOME/jars/

    Hadoop 2.x

    1. Download the Cloud Storage connector for Hadoop 2.x.
    2. Copy the JAR file into your $HADOOP_COMMON_LIB_JARS_DIR directory (for Spark standalone mode, see the second command below).
      cp ~/Downloads/gcs-connector-hadoop2-latest.jar $HADOOP_COMMON_LIB_JARS_DIR/
      When running a Spark standalone cluster:
      cp ~/Downloads/gcs-connector-hadoop2-latest.jar $SPARK_HOME/jars/
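      To confirm the connector JAR is in place, you can list the target directory (adjust the path for a Hadoop 1.x or Spark standalone layout):
      ls -l $HADOOP_COMMON_LIB_JARS_DIR/gcs-connector-*.jar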

  2. Set up service-account "keyfile" authentication.

    1. Make sure you have enabled the Compute Engine API in your project.
    2. Visit Google Cloud Platform Console → APIs & Services → Credentials, and select Create credentials → Service account key.
    3. Select Service account → Compute Engine default service account and Key type → JSON, then click Create to download the key.
    4. Keep track of the downloaded .json file. You may want to rename it before placing it in a directory that is more easily accessible from Hadoop. For example:
      cp ~/Downloads/project-id-xxxxxxx.json /path/to/hadoop/conf/gcskey.json
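      Because this keyfile grants access to your project, you may also want to restrict its permissions so that only the account running Hadoop can read it, for example:
      chmod 600 /path/to/hadoop/conf/gcskey.json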
      
  3. Add the following entries to the core-site.xml file in your Hadoop configuration directory (conf/ for Hadoop 1.x, etc/hadoop/ for Hadoop 2.x) on each cluster node (master and worker VMs):

    Required entries:
    <property>
      <name>google.cloud.auth.service.account.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>google.cloud.auth.service.account.json.keyfile</name>
      <value>full path to the JSON keyfile downloaded for the service account</value>
    </property>
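
    Depending on your connector and Hadoop versions, Hadoop may not recognize the gs:// scheme until the connector's filesystem classes are registered. If the test in the next section fails with an error such as "No FileSystem for scheme: gs", you can try adding entries along the following lines (the class names are the connector's standard ones; this is a sketch and is not required in every setup):
    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>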
    

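If you are running a Spark standalone cluster without a full Hadoop installation, one common alternative to editing core-site.xml is to set the equivalent properties in $SPARK_HOME/conf/spark-defaults.conf using Spark's spark.hadoop. prefix. The lines below are a sketch; the keyfile path is the example path used in step 2:

spark.hadoop.google.cloud.auth.service.account.enable          true
spark.hadoop.google.cloud.auth.service.account.json.keyfile    /path/to/hadoop/conf/gcskey.json
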
Testing the installation

With the above settings on your master and worker nodes, you should be ready to test the Cloud Storage connector by listing a bucket in your project (replace bucket-name with the name of your bucket):

hadoop fs -ls gs://bucket-name
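
If the listing succeeds, you can also try moving data from HDFS into the bucket, which is the scenario mentioned in the introduction. For example, using DistCp (the HDFS source path and bucket name are placeholders):

hadoop distcp /user/example/data gs://bucket-name/data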

Troubleshooting

The output of the following command may reveal informative error messages that can help you debug problems connecting to Cloud Storage.

gsutil ls gs://bucket-name
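
If gsutil itself cannot reach the bucket, authentication may be the problem. One option is to activate the same service-account key for the Cloud SDK and retry (the key path below is the example path used in step 2):

gcloud auth activate-service-account --key-file=/path/to/hadoop/conf/gcskey.json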