The Google Cloud Storage connector lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage, and offers several benefits over using the Hadoop Distributed File System (HDFS).
Benefits of the Cloud Storage connector
- Direct data access – Store your data in Cloud Storage and access it directly, with no need to transfer it into HDFS first.
- HDFS compatibility – You can easily access your data in Cloud Storage using the gs:// prefix instead of hdfs://.
- Interoperability – Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility – When you shut down a Hadoop cluster, you still have access to your data in Cloud Storage, unlike HDFS.
- High data availability – Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead – Unlike HDFS, Cloud Storage requires no routine maintenance such as checking the file system, upgrading or rolling back to a previous version of the file system, etc.
- Quick startup – In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Google Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
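Because the connector addresses data by URI scheme, moving a job's input and output paths from HDFS to Cloud Storage is largely a matter of changing the prefix. The following is a minimal sketch of that rewrite, not part of the connector itself: the helper name `to_gs_uri` and the bucket name `my-bucket` are hypothetical, and it assumes the object path in the bucket mirrors the original HDFS path.

```python
from urllib.parse import urlsplit


def to_gs_uri(hdfs_uri: str, bucket: str) -> str:
    """Rewrite an hdfs:// URI as a gs:// URI in `bucket` (hypothetical helper).

    Assumes the object layout in the bucket mirrors the HDFS directory layout.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri!r}")
    # The HDFS authority (NameNode host and port) is dropped;
    # the bucket name takes its place in the gs:// URI.
    return f"gs://{bucket}{parts.path}"


if __name__ == "__main__":
    # Illustrative values only; replace with your own NameNode and bucket.
    print(to_gs_uri("hdfs://namenode:8020/data/logs/part-0000", "my-bucket"))
```

The data itself still has to exist in the bucket; the sketch only shows how little a job's path configuration needs to change.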
Getting the connector
Cloud Dataproc clusters
The Cloud Storage connector is installed by default on all Google Cloud Dataproc clusters. It's available in both Spark and PySpark environments.
Other Spark/Hadoop clusters
You can download the Cloud Storage connector for Hadoop 1.x or the Cloud Storage connector for Hadoop 2.x.
To install and configure the connector, follow the README file in the project on GitHub.
Using the connector
There are multiple ways to access data stored in Google Cloud Storage:
- In a Spark (or PySpark) or Hadoop application using the gs:// prefix.
- The hadoop shell: hadoop fs -ls gs://CONFIGBUCKET/dir/file
- The Cloud Platform Console Cloud Storage browser.
- Using the
- Learn more about Cloud Storage
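For the application route in the list above, a PySpark job can pass a gs:// path wherever it would otherwise pass a file system path. The sketch below is illustrative and assumes a cluster where the connector is already installed (for example, Cloud Dataproc). The function names `count_lines` and `main` and the application name are hypothetical, and `gs://CONFIGBUCKET/dir/file` is the placeholder path from the shell example, to be replaced with a real object.

```python
def count_lines(spark, uri):
    """Count the lines of a text file at `uri`; `spark` is a SparkSession."""
    # spark.read.text loads each line of the file as one row of a DataFrame.
    return spark.read.text(uri).count()


def main():
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    # Hypothetical application name, chosen for illustration.
    spark = SparkSession.builder.appName("gcs-connector-sketch").getOrCreate()
    try:
        # Placeholder path; substitute your own bucket and object.
        print(count_lines(spark, "gs://CONFIGBUCKET/dir/file"))
    finally:
        spark.stop()
```

To run this as a job, call `main()` from the script's entry point and submit the script to the cluster. Nothing in the code is specific to Cloud Storage apart from the URI prefix.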