Cloud Storage connector

The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage. It offers several benefits over storing that data in the Hadoop Distributed File System (HDFS).

Benefits of the Cloud Storage connector

  • Direct data access – Store your data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
  • HDFS compatibility – You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
  • Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
  • Data accessibility – Unlike data in HDFS, data in Cloud Storage remains accessible after you shut down a Hadoop cluster.
  • High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
  • No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
  • Quick startup – In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which can lead to significant cost savings over time.
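
As a rough illustration of the gs:// prefix mentioned in the benefits above, the following sketch rewrites an hdfs:// path as the equivalent gs:// path. The helper name to_gs_uri, the bucket name, and the NameNode address are hypothetical; the sketch assumes the object keeps the same path inside the bucket, which may not match how your data is laid out.

```python
from urllib.parse import urlsplit


def to_gs_uri(hdfs_uri: str, bucket: str) -> str:
    """Rewrite an hdfs:// URI as a gs:// URI rooted at `bucket`.

    Hypothetical helper: it keeps the file path unchanged and swaps
    only the scheme and authority (NameNode address -> bucket name).
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri!r}")
    # parts.path keeps its leading slash, e.g. "/logs/2024/part-00000".
    return f"gs://{bucket}{parts.path}"


# Example values only: the bucket and NameNode address are placeholders.
print(to_gs_uri("hdfs://namenode:8020/logs/2024/part-00000", "example-bucket"))
```

A job that previously read the hdfs:// path would then be pointed at the returned gs:// path instead.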

Dataproc cluster Cloud Storage connector setup

The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory.

The following sections describe steps you can take to set up the connector successfully on your Dataproc cluster.

Service account permissions

When the connector runs inside Compute Engine VMs, including Dataproc clusters, the google.cloud.auth.service.account.enable property is set to false by default. You therefore don't need to manually configure a service account for the connector; it gets service account credentials from the VM metadata server.

Non-default connector versions

The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the cluster image version pages (see Supported Dataproc versions). If your application depends on a non-default connector version deployed on your cluster, you must either:

  1. create a cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version, or
  2. include and relocate the connector classes and connector dependencies for the version you are using into your application's jar. Relocation is necessary to avoid a conflict between your deployed connector version and the default connector version installed on the Dataproc cluster (see the Maven dependencies relocation example).
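
For the first option, a minimal sketch of assembling the cluster creation command might look like the following. The helper name build_create_command, the cluster name, the region, and the version 2.2.0 are illustrative assumptions; only the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag comes from the text above, and the version check simply assumes the x.y.z pattern shown there.

```python
import re
import shlex
from typing import List

# Assumption: the version follows the x.y.z pattern shown in the flag above.
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def build_create_command(cluster: str, region: str, connector_version: str) -> List[str]:
    """Return the gcloud argument list for a cluster with a pinned connector."""
    if not _VERSION_RE.fullmatch(connector_version):
        raise ValueError(f"expected a version like x.y.z, got: {connector_version!r}")
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--metadata=GCS_CONNECTOR_VERSION={connector_version}",
    ]


# Placeholder values; substitute your own cluster name, region, and version.
cmd = build_create_command("example-cluster", "us-central1", "2.2.0")
print(shlex.join(cmd))
```

Returning a list rather than a single string lets you pass the result straight to subprocess.run without shell quoting concerns.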

Non-Dataproc clusters

  1. Download the connector.

    To download the Cloud Storage connector for Hadoop:

  2. Install the connector.

    See Installing the connector on GitHub to install, configure, and test the Cloud Storage connector.

Use the connector

There are multiple ways to access data stored in Cloud Storage:

Resources

Java version

The Cloud Storage connector requires Java 8.

Apache Maven Dependency Information

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>insert "hadoopX-X.X.X" connector version number here</version>
    <scope>provided</scope>
</dependency>

Or, for the shaded version:

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>insert "hadoopX-X.X.X" connector version number here</version>
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>
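
The two dependency blocks above differ only in the classifier element. The sketch below renders either form from a version string, which can help when the same build logic targets both. It assumes the "hadoopX-X.X.X" placeholder reads as a Hadoop major version followed by a connector release; the function names and the example version are hypothetical, so check the release notes for the actual version strings.

```python
from typing import Optional


def connector_version(hadoop_major: int, connector_release: str) -> str:
    """Build a "hadoopX-X.X.X" style version string (assumed format)."""
    return f"hadoop{hadoop_major}-{connector_release}"


def maven_dependency(version: str, classifier: Optional[str] = None) -> str:
    """Render the gcs-connector dependency, optionally with a classifier."""
    lines = [
        "<dependency>",
        "    <groupId>com.google.cloud.bigdataoss</groupId>",
        "    <artifactId>gcs-connector</artifactId>",
        f"    <version>{version}</version>",
        "    <scope>provided</scope>",
    ]
    if classifier:
        # The shaded variant adds only this one element.
        lines.append(f"    <classifier>{classifier}</classifier>")
    lines.append("</dependency>")
    return "\n".join(lines)


# Example version only; it is a placeholder, not a recommendation.
print(maven_dependency(connector_version(3, "2.2.0"), classifier="shaded"))
```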

For more detailed information, see the Cloud Storage connector release notes and Javadoc reference.

What's next