The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers a number of benefits over choosing the Hadoop Distributed File System (HDFS).
Benefits of the Cloud Storage connector
- Direct data access – Store your data in Cloud Storage and access it directly. You do not need to transfer it into HDFS first.
- HDFS compatibility – You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
- Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility – When you shut down a Hadoop cluster, unlike HDFS, you continue to have access to your data in Cloud Storage.
- High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
- Quick startup – In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.
Getting the connector
Dataproc clusters
The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory.
If your application depends on a connector version that is different from the default connector version deployed on your Dataproc cluster, you must do one of the following:
- Create a new cluster with the --metadata GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by your application to the specified connector version, or
- Include and relocate the connector classes and connector dependencies for the version you are using into your application's jar so that the connector version you are using will not conflict with the connector version deployed on your Dataproc cluster (see this example of dependency relocation in Maven).
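The relocation approach can be sketched with the Maven Shade plugin, which bundles the connector into your application jar and moves its classes under a private package prefix. This is a minimal sketch, not the referenced Maven example; the myapp.shaded prefix is an illustrative placeholder you would replace with your own package name:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move the bundled connector classes under a private prefix
               so they cannot clash with the connector version that is
               already deployed on the Dataproc cluster. -->
          <relocation>
            <pattern>com.google.cloud.hadoop</pattern>
            <shadedPattern>myapp.shaded.com.google.cloud.hadoop</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, your application loads its own relocated copy of the connector classes while the cluster's default connector continues to serve other jobs.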
Non-Dataproc clusters
Download the connector.
To download the Cloud Storage connector for Hadoop, choose one of the following:
- The latest version from the Cloud Storage bucket (not recommended for use in production).
- A specific version from the Cloud Storage bucket, substituting the Hadoop and Cloud Storage connector versions in the gcs-connector-HADOOP_VERSION-CONNECTOR_VERSION.jar name pattern. Example: gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar
- A specific version from the Apache Maven repository (you should download a shaded jar that has the -shaded suffix in its name).
Install the connector.
See Installing the connector on GitHub to install, configure, and test the Cloud Storage connector.
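On a non-Dataproc cluster, configuration typically means registering the connector's file system classes in core-site.xml so that Hadoop can resolve gs:// paths. The fragment below is a minimal sketch; the property names come from the connector's install instructions, while your-project-id is a placeholder, and your authentication setup (for example, a service account) may require additional properties:

```xml
<configuration>
  <!-- Register the connector as the handler for gs:// paths. -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <!-- Placeholder: the Google Cloud project to bill for requests. -->
  <property>
    <name>fs.gs.project.id</name>
    <value>your-project-id</value>
  </property>
</configuration>
```

The connector jar itself must also be on the Hadoop classpath of every node for these settings to take effect.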
Using the connector
There are multiple ways to access data stored in Cloud Storage:
- In a Spark (or PySpark) or Hadoop application, using the gs:// prefix.
- In the hadoop shell: hadoop fs -ls gs://bucket/dir/file.
- In the Google Cloud console Cloud Storage browser.
- Using the gsutil cp or gsutil rsync commands.
Resources
Java version
The Cloud Storage connector requires Java 8.
Apache Maven Dependency Information
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>insert "hadoopX-X.X.X" connector version number here</version>
  <scope>provided</scope>
</dependency>
Or, for the shaded version:
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>insert "hadoopX-X.X.X" connector version number here</version>
  <scope>provided</scope>
  <classifier>shaded</classifier>
</dependency>
For more detailed information, see the Cloud Storage connector release notes and Javadoc reference.
What's next
- Learn more about Cloud Storage