Storage & Data Transfer

How to connect Cloudera’s CDH to Cloud Storage

If you are running CDH, Cloudera’s distribution of Hadoop, we aim to provide you with first-class support on Google Cloud so you can run a CDH cluster with Cloud Storage integration.

In this post, we’ll help you get started deploying the Cloud Storage connector for your CDH clusters. The methods and steps we discuss here apply to both on-premises clusters and cloud-based clusters. Keep in mind that the Cloud Storage connector uses Java, so you’ll want to make sure that the appropriate Java 8 packages are installed on your CDH cluster; Java 8 should come pre-configured as your default Java Development Kit.

[Check out this post if you’re deciding how and when to use Cloud Storage over the Hadoop Distributed File System (HDFS).]

Here’s how to get started:

Distribute using the Cloudera parcel

If you’re running a large Hadoop cluster or more than one cluster, it can be hard to deploy libraries and configure Hadoop services to use those libraries without making mistakes. Fortunately, Cloudera Manager provides a way to install packages with parcels. A parcel is a binary distribution format that consists of a gzipped (compressed) tar archive file with metadata.

We recommend using a parcel to install the Cloud Storage connector. There are some big advantages to using a parcel instead of manual deployment and configuration to deploy the Cloud Storage connector on your Hadoop cluster:

  • Self-contained distribution: All related libraries, scripts and metadata are packaged into a single parcel file. You can host it at an internal location that is accessible to the cluster or even upload it directly to the Cloudera Manager node.

  • No need for sudo or root access: The parcel is not deployed under /usr or any of the system directories. Cloudera Manager deploys it through its agents, which eliminates the need for sudo or root access to deploy.

Create your own Cloud Storage connector parcel

To create the parcel for your clusters, download and use this script. You can do this on any machine with access to the internet.

This script will execute the following actions:

  1. Download Cloud Storage connector to a local drive

  2. Package the connector Java Archive (JAR) file into a parcel

  3. Place the parcel under the Cloudera Manager’s parcel repo directory
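The three steps above can be sketched in shell. This is a simplified illustration, not the actual script: the parcel name, version, OS suffix, and metadata are placeholder assumptions, and where the real script downloads the connector JAR over the network, this sketch stands in a placeholder file:

```shell
# Illustrative sketch of what the parcel-creation script does.
PARCEL_NAME=gcsconnector
VERSION=1.0.0
OS_SUFFIX=el7
PARCEL_DIR="${PARCEL_NAME}-${VERSION}"

mkdir -p "${PARCEL_DIR}/lib/hadoop/lib" "${PARCEL_DIR}/meta"

# 1. Download the Cloud Storage connector JAR to a local drive
#    (stand-in file here; the real script fetches it over the network).
touch "${PARCEL_DIR}/lib/hadoop/lib/gcs-connector-latest-hadoop2.jar"

# Minimal parcel metadata so Cloudera Manager can recognize the parcel.
cat > "${PARCEL_DIR}/meta/parcel.json" <<EOF
{"schema_version": 1, "name": "${PARCEL_NAME}", "version": "${VERSION}"}
EOF

# 2. Package the JAR and metadata into a parcel (a gzipped tar archive)
#    and generate the checksum file Cloudera Manager expects.
tar -czf "${PARCEL_DIR}-${OS_SUFFIX}.parcel" "${PARCEL_DIR}"
sha1sum "${PARCEL_DIR}-${OS_SUFFIX}.parcel" | awk '{print $1}' \
  > "${PARCEL_DIR}-${OS_SUFFIX}.parcel.sha"

# 3. The real script then places both files under the Cloudera Manager
#    parcel repo directory, typically /opt/cloudera/parcel-repo.
```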

If you’re connecting an on-premises CDH cluster or a cluster on a cloud provider other than Google Cloud Platform (GCP), follow the instructions from this page to create a service account and download its JSON key file.

Create the Cloud Storage parcel

Next, you’ll want to run the script to create the parcel file and checksum file and let Cloudera Manager find it with the following steps:

1. Place the service account JSON key file and the create_parcel.sh script under the same directory. Make sure that there are no other files under this directory.

2. Run the script, which will look something like this: $ ./create_parcel.sh -f <parcel_name> -v <version> -o <os_distro_suffix> [-d]

  • parcel_name is the name of the parcel as a single string without any spaces or special characters (e.g., gcsconnector)
  • version is the version of the parcel in the format x.x.x (e.g., 1.0.0)
  • os_distro_suffix is the OS distribution suffix; like RPM and deb packages, parcels are named per distribution. A full list of possible distribution suffixes can be found here.
  • -d is an optional flag that deploys the parcel to the Cloudera Manager parcel repo folder; if it’s not provided, the parcel file is created in the same directory where the script ran.
3. Logs of the script can be found in /var/log/build_script.log.

Distribute and activate the parcel

Once you’ve created the Cloud Storage parcel, Cloudera Manager has to recognize the parcel and install it on the cluster.

  1. The script you ran generated a .parcel file and a .parcel.sha checksum file. Put these two files on the Cloudera Manager node under directory /opt/cloudera/parcel-repo. If you already host Cloudera parcels somewhere, you can just place these files there and add an entry in the manifest.json file.

  2. On the Cloudera Manager interface, go to Hosts -> Parcels and click Check for New Parcels to refresh the list. The Cloud Storage connector parcel should show up like this:

[Screenshot: the Cloud Storage connector parcel listed in Cloudera Manager]

  3. On the Actions column of the new parcel, click Distribute. Cloudera Manager will start distributing the Cloud Storage connector JAR file to every node in the cluster.

  4. When distribution is finished, click Activate to enable the parcel.
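If you host parcels at an internal repo rather than copying them to the Cloudera Manager node, the manifest.json entry for the new parcel would look roughly like this (the file name is illustrative, and the hash is the contents of the generated .parcel.sha file):

```json
{
  "lastUpdated": 0,
  "parcels": [
    {
      "parcelName": "gcsconnector-1.0.0-el7.parcel",
      "hash": "<sha1 from the .parcel.sha file>"
    }
  ]
}
```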

Configure CDH clusters to use the Cloud Storage connector

After the Cloud Storage connector is distributed on the cluster, you’ll need to do a few additional configuration steps to let the cluster use the connector. These steps will be different depending on whether you’re using HDFS or Spark for your Hadoop jobs.

Configuration for the HDFS service

1. From the Cloudera Manager UI, go to HDFS service > Configuration. In the search bar, type core-site.xml. In the box titled “Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml,” add the following properties:

  Name: fs.gs.project.id
  Value: {GCP project ID}

  Name: fs.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

  Name: fs.AbstractFileSystem.gs.impl
  Value: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS

  *Name: google.cloud.auth.service.account.enable
  *Value: true

  *Name: google.cloud.auth.service.account.json.keyfile
  *Value: {full path to JSON keyfile downloaded for service account}
  *Example: /opt/cloudera/parcels/gcsconnector/lib/hadoop/lib/key.json

*: Not needed if your CDH cluster is deployed on GCP
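As they land in core-site.xml, the properties above look like the following snippet (the project ID and keyfile path are placeholders; omit the last two properties if your cluster runs on GCP):

```xml
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/opt/cloudera/parcels/gcsconnector/lib/hadoop/lib/key.json</value>
</property>
```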

2. Save the configuration changes and restart the required services.

3. Export Hadoop classpath to point to the Cloud Storage connector JAR file, as shown here:

  $ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/<Parcel name>/lib/hadoop/lib/gcs-connector-latest-hadoop2.jar

4. Run the “hdfs dfs -ls” command pointing to a bucket the service account has access to:

  [someone@cdh-parcel-automate1 ~]$ hdfs dfs -ls gs://dataproc-324dd107-1b93-4417-a5c1-c85b719e15e9-us-central1
Oct 04, 2018 5:16:19 PM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
Oct 04, 2018 5:16:19 PM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase configure
WARNING: No working directory configured, using default: 'gs://dataproc-324dd107-1b93-4417-a5c1-c85b719e15e9-us-central1/'
Found 1 items
drwx------   - someone someone          0 2018-10-04 15:27 gs://dataproc-324dd107-1b93-4417-a5c1-c85b719e15e9-us-central1/google-cloud-dataproc-metainfo

Configuration for the Spark service

To let Spark recognize Cloud Storage paths, you have to make Spark load the connector JAR. Here is how to configure it:

1. From the Cloudera Manager home page, go to Spark > Configuration > Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh. Add the configuration according to the Cloud Storage connector JAR path.

[Screenshot: Spark service configuration showing the spark-env.sh safety valve]
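One common way to do this, shown as a sketch rather than the exact setting from the screenshot, is to append the connector JAR to Spark’s distribution classpath in the spark-env.sh safety valve (the parcel name here is illustrative, and the exact variable may differ by CDH version):

```shell
# spark-env.sh safety-valve entry (parcel name is illustrative).
# Appends the Cloud Storage connector JAR to Spark's classpath.
SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH:-}:/opt/cloudera/parcels/gcsconnector/lib/hadoop/lib/gcs-connector-latest-hadoop2.jar"
export SPARK_DIST_CLASSPATH
```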

2. Next, use Cloudera Manager to deploy the configuration and restart the service if necessary.

3. Open Spark shell to validate that you can access Cloud Storage using Spark.

  $ spark-shell
scala> val src = spark.read.json("gs://<sample bucket>/sample.json")
scala> src.take(1)

Configuration for the Hive service

If you also need to store Hive table data in Cloud Storage, configure Hive to load the connector JAR file with the following steps:

1. From the Cloudera Manager home page, go to Hive Service > Configuration, search for “Hive Auxiliary JARs Directory” and enter the path to the Cloud Storage connector JAR, as shown here:

  /opt/cloudera/parcels/<GCS parcel name>/lib/hadoop/lib/
[Screenshot: the Hive Auxiliary JARs Directory setting in Cloudera Manager]

2. Validate that the JAR is picked up by connecting through the Hive CLI. Note that the IP address and the results will differ in your environment:

  $ beeline
  beeline> !connect jdbc:hive2://35.184.82.84:10000/default
Connecting to jdbc:hive2://35.184.82.84:10000/default
Enter username for jdbc:hive2://35.184.82.84:10000/default: 
Enter password for jdbc:hive2://35.184.82.84:10000/default: 
Connected to: Apache Hive (version 1.1.0-cdh5.10.1)
Driver: Hive JDBC (version 1.1.0-cdh5.10.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://35.184.82.84:10000/default> create table test1 (title string) location 'gs://roderickyao-sandbox/gcsconnector/test1/';
INFO  : Compiling command(queryId=hive_20170516155757_743b604a-03c0-4a0a-b8d2-f017f89928c1): create table test1 (title string) location 'gs://roderickyao-sandbox/gcsconnector/test1/'
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20170516155757_743b604a-03c0-4a0a-b8d2-f017f89928c1); Time taken: 1.268 seconds
INFO  : Executing command(queryId=hive_20170516155757_743b604a-03c0-4a0a-b8d2-f017f89928c1): create table test1 (title string) location 'gs://roderickyao-sandbox/gcsconnector/test1/'
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20170516155757_743b604a-03c0-4a0a-b8d2-f017f89928c1); Time taken: 4.204 seconds
INFO  : OK
No rows affected (5.888 seconds)

0: jdbc:hive2://35.184.82.84:10000/default> create external table testhivegcs(title string,comment_count int) Row format delimited fields terminated by ',' location 'gs://roderickyao-sandbox/gcsconnector/test1/' tblproperties ("skip.header.line.count"="1");
INFO  : Compiling command(queryId=hive_20170516160000_fa80b105-4d30-41ec-82e7-65d0ad4be548): create external table testhivegcs(title string,comment_count int) Row format delimited fields terminated by ',' location 'gs://roderickyao-sandbox/gcsconnector/test1/' tblproperties ("skip.header.line.count"="1")
…<more outputs>...
INFO  : Executing command(queryId=hive_20170516160000_fa80b105-4d30-41ec-82e7-65d0ad4be548): create external table testhivegcs(title string,comment_count int) Row format delimited fields terminated by ',' location 'gs://roderickyao-sandbox/gcsconnector/test1/' tblproperties ("skip.header.line.count"="1")
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20170516160000_fa80b105-4d30-41ec-82e7-65d0ad4be548); Time taken: 0.267 seconds
INFO  : OK
No rows affected (0.312 seconds)
0: jdbc:hive2://35.184.82.84:10000/default> select * from testhivegcs
. . . . . . . . . . . . . . . . . . . . . > ;
INFO  : Compiling command(queryId=hive_20170516160000_c45df2c9-4c3f-419b-8ec6-e942071bcd39): select * from testhivegcs
…<more outputs>...
+--------------------+----------------------------+--+
| testhivegcs.title  | testhivegcs.comment_count  |
+--------------------+----------------------------+--+
| rod1               | 2                          |
| rod2               | 5                          |
| rod5               | 6                          |
|                    | NULL                       |
+--------------------+----------------------------+--+
4 rows selected (2.613 seconds)
0: jdbc:hive2://35.184.82.84:10000/default>

That’s it! You’ve now connected Cloudera’s CDH to Google Cloud Storage, so you can store and access your data on Cloud Storage with high performance and scalability. Learn more here about running these workloads on Google Cloud.