Creating a Hadoop Cluster

Cloud Dataproc allows you to create one or more Compute Engine instances that can connect to a Cloud Bigtable instance and run Hadoop jobs. This page explains how to use Cloud Dataproc to automate all of the following tasks:

  • Installing Hadoop, HBase, and the Cloud Bigtable HBase adapter
  • Configuring Hadoop and Cloud Bigtable
  • Setting the correct authorization scopes for Cloud Bigtable

After you create your Cloud Dataproc cluster, you can use the cluster to run Hadoop jobs that read data from and write data to Cloud Bigtable.
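
Cloud Dataproc handles the third item in the list above for you: the cluster's Compute Engine instances are created with authorization scopes that allow them to access Cloud Bigtable. For reference only, the sketch below shows how you might request Cloud Bigtable scopes on a manually created cluster; the exact scope URLs shown here are an assumption, and you do not need to pass them when you follow this page:

gcloud dataproc clusters create [DATAPROC_CLUSTER_NAME] \
    --scopes https://www.googleapis.com/auth/bigtable.admin.table,https://www.googleapis.com/auth/bigtable.data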

This page assumes that you are already familiar with Hadoop. For additional information about Cloud Dataproc, see the Cloud Dataproc documentation.

Before you start

Before you start, you'll need to complete the following tasks:

  • Create a Cloud Bigtable instance. Be sure to note the project ID and Cloud Bigtable instance ID.
  • Enable the Cloud Bigtable, Cloud Bigtable Table Admin, Cloud Dataproc, and Cloud Storage JSON APIs. You can enable them in the Google Cloud Platform Console, or from the command line as shown in the sketch after this list.
  • Install the Cloud SDK, which includes the gcloud command-line tool. See the Cloud SDK setup instructions for details.
  • Install the gsutil tool by running the following command:
    gcloud components install gsutil
  • Install Apache Maven, which is used to run a sample Hadoop job.

    On Debian GNU/Linux or Ubuntu, run the following command:

    sudo apt-get install maven

    On RedHat Enterprise Linux or CentOS, run the following command:

    sudo yum install maven

    On OS X, install Homebrew, then run the following command:

    brew install maven
  • Clone the GitHub repository GoogleCloudPlatform/cloud-bigtable-examples, which contains an example of a Hadoop job that uses Cloud Bigtable:
    git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
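
If you prefer the command line over the Google Cloud Platform Console for enabling the APIs, as mentioned in the list above, you can typically enable them with gcloud services enable. The service names below are assumptions and may have changed; check the API Library in the console if the command reports an unknown service:

gcloud services enable bigtable.googleapis.com bigtableadmin.googleapis.com \
    dataproc.googleapis.com storage-api.googleapis.com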

Creating a Cloud Storage bucket

Cloud Dataproc uses a Cloud Storage bucket to store temporary files. To prevent file-naming conflicts, you should create a new bucket for Cloud Dataproc.

Cloud Storage bucket names must be globally unique across all of Cloud Storage. Choose a bucket name that is likely to be available, such as a name that incorporates your Cloud Platform project's name.

After you choose a name, use the following command to create a new bucket, replacing values in brackets with the appropriate values:

gsutil mb -p [PROJECT_ID] gs://[BUCKET_NAME]
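
Optionally, confirm that the bucket was created by listing the buckets in your project:

gsutil ls -p [PROJECT_ID]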

Creating the Cloud Dataproc cluster

Run the following command to create a Cloud Dataproc cluster with four worker nodes, replacing values in brackets with the appropriate values:

gcloud dataproc clusters create [DATAPROC_CLUSTER_NAME] --bucket [BUCKET_NAME] \
    --zone [ZONE] --num-workers 4 --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4

See the gcloud dataproc clusters create documentation for additional settings that you can configure. If you get an error message that includes the text Insufficient 'CPUS' quota, try setting the --num-workers flag to a lower value.
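
Cluster creation can take a few minutes. To check on the cluster's status, you can list your Cloud Dataproc clusters or describe the one you just created; depending on your version of the gcloud tool, you might also need to pass a --region flag to these commands:

gcloud dataproc clusters list
gcloud dataproc clusters describe [DATAPROC_CLUSTER_NAME]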

Testing the Cloud Dataproc cluster

After you set up your Cloud Dataproc cluster, you can test the cluster by running a sample Hadoop job that counts how many times each word appears in a text file. The sample job uses Cloud Bigtable to store the results of the operation. You can use this sample job as a reference when you set up your own Hadoop jobs.

Running the sample Hadoop job

  1. In the directory where you cloned the GitHub repository, change to the directory java/dataproc-wordcount.
  2. Run the following command to build the project, replacing values in brackets with the appropriate values:

    mvn clean package -Dbigtable.projectID=[PROJECT_ID] \
        -Dbigtable.instanceID=[BIGTABLE_INSTANCE_ID]
    
  3. Run the following command to start the Hadoop job, replacing values in brackets with the appropriate values:

    ./cluster.sh start [DATAPROC_CLUSTER_NAME]
    

When the job is complete, it displays the name of the output table, which is the word WordCount followed by a hyphen and a unique number:

Output table is: WordCount-1234567890

Verifying the results of the Hadoop job

Optionally, after you run the Hadoop job, you can use the HBase shell to verify that the job ran successfully:

  1. In the Google Cloud Platform Console, click the Cloud Shell icon in the upper right corner.
  2. When Cloud Shell is ready to use, download and unzip the quickstart files:
    curl -f -O https://storage.googleapis.com/cloud-bigtable/quickstart/GoogleCloudBigtable-Quickstart-0.9.4.zip
    unzip GoogleCloudBigtable-Quickstart-0.9.4.zip
  3. Change to the quickstart directory, then start the HBase shell:

    ./quickstart.sh
  4. Scan the output table to view the results of the Hadoop job, replacing [TABLE_NAME] with the name of your output table:
    scan '[TABLE_NAME]'
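
If you don't remember the output table's exact name, you can look it up from the HBase shell before scanning. The HBase shell's list command accepts an optional regular expression, so a command like the following should show only the WordCount output tables:

list 'WordCount.*'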

Now that you've verified that the cluster is set up correctly, you can use it to run your own Hadoop jobs.
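
For your own jobs, you don't have to use the sample's cluster.sh wrapper. One common way to submit a Hadoop job to a Cloud Dataproc cluster is gcloud dataproc jobs submit hadoop; the sketch below assumes your job is packaged as my-job.jar, which is a hypothetical file name:

gcloud dataproc jobs submit hadoop --cluster [DATAPROC_CLUSTER_NAME] \
    --jar my-job.jar -- [JOB_ARGS]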

Deleting the Cloud Dataproc cluster

When you are done using the Cloud Dataproc cluster, run the following command to shut down and delete the cluster:

gcloud dataproc clusters delete [DATAPROC_CLUSTER_NAME]
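
If you created the Cloud Storage bucket only for this cluster, you can optionally delete the bucket and its contents as well; this extra cleanup step is a suggestion rather than part of the steps above:

gsutil rm -r gs://[BUCKET_NAME]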
