This repository has been archived by the owner on Aug 10, 2023. It is now read-only.

title: Run a Cloud Bigtable Spark job on Dataproc
description: Run a Spark job on Dataproc that reads from and writes to Cloud Bigtable.
tags: bigtable, spark, database, big table, apache spark, hbase, dataproc

Billy Jacobson | Developer Relations Engineer | Google

Contributed by Google employees.

In this tutorial, you run a Spark job on Dataproc that reads from and writes to Cloud Bigtable.


This tutorial is a follow-up to Using Spark with Cloud Bigtable, so complete that tutorial before you begin this one. It walks you through setting the environment variables, creating the Bigtable instance and table, and running the Spark job locally.

The examples in this tutorial use Dataproc 1.4. For the list of available Dataproc image versions, see the Dataproc image version list.

Create the Dataproc cluster

  1. Set the environment variables for configuring your Dataproc cluster:

    BIGTABLE_SPARK_PROJECT_ID=your-project-id  # This can be the same as your Bigtable project

    For information about regions and zones, read Available regions and zones.

  2. Use the gcloud command-line tool to create a cluster:

    gcloud dataproc clusters create $BIGTABLE_SPARK_DATAPROC_CLUSTER \
      --project=$BIGTABLE_SPARK_PROJECT_ID \
  3. List the clusters:

    gcloud dataproc clusters list \

    Make sure that BIGTABLE_SPARK_DATAPROC_CLUSTER is among the clusters.
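
The check in step 3 can also be scripted instead of read by eye. The following is a minimal sketch and not part of the original tutorial: cluster_listed is a hypothetical helper that reads cluster names on standard input, one per line, and succeeds only when one of them matches the given name exactly. The REGION variable in the usage comment is likewise an illustrative name for wherever you stored your cluster's region.

```shell
# Hypothetical helper (not from the tutorial): succeeds if the cluster name
# passed as $1 appears as a whole line on standard input.
cluster_listed() {
  # -q: no output, -x: match the entire line, -F: treat the name literally
  grep -qxF -- "$1"
}

# Possible usage, assuming REGION holds your cluster's region:
#   gcloud dataproc clusters list \
#     --project="$BIGTABLE_SPARK_PROJECT_ID" \
#     --region="$REGION" \
#     --format='value(clusterName)' \
#     | cluster_listed "$BIGTABLE_SPARK_DATAPROC_CLUSTER" && echo "cluster found"
```

Matching the whole line avoids a false positive when one cluster name is a prefix of another.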

Upload the file to Cloud Storage

Because you're running the Spark job in the cloud, you need to upload the input file to Cloud Storage.

For information about gsutil, see Quickstart: Using the gsutil tool.
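
Because bucket names are shared across all Google Cloud projects, one way to avoid a name conflict in the steps below is to append a numeric suffix to a base name. This is a minimal sketch under stated assumptions: suffixed_bucket_name is a hypothetical helper, the base name is a placeholder, and the gs:// prefix in the usage comment is inferred from the later steps, which pass BIGTABLE_SPARK_BUCKET_NAME directly to gsutil cp and gsutil rm.

```shell
# Hypothetical helper (not from the tutorial): print "<base>-<suffix>".
# $1 = base bucket name
# $2 = optional suffix; defaults to the current Unix time in seconds
suffixed_bucket_name() {
  printf '%s-%s\n' "$1" "${2:-$(date +%s)}"
}

# Possible usage (my-wordcount-bucket is a placeholder name):
#   BIGTABLE_SPARK_BUCKET_NAME="gs://$(suffixed_bucket_name my-wordcount-bucket)"
```

A timestamp suffix makes a collision unlikely but does not guarantee uniqueness, so bucket creation can still fail if the name is taken.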

  1. Choose a bucket name and set it as an environment variable:


    Bucket names must be unique across all Google Cloud projects, so consider appending a few digits to avoid a name conflict when you create the bucket.

  2. Create the bucket:

    gsutil mb \
      -b on \
  3. Upload an input file into the bucket:

    gsutil cp src/test/resources/Romeo-and-Juliet-prologue.txt $BIGTABLE_SPARK_BUCKET_NAME
  4. List the contents of the bucket:


    The output should be the following:


Submit the Wordcount job

Submit the Wordcount job to the Dataproc cluster:

gcloud dataproc jobs submit spark \
  --class=example.Wordcount \
  --properties=spark.jars.packages='org.apache.hbase.connectors.spark:hbase-spark:1.0.0' \
  -- \

It may take some time before you see any progress. To get progress information sooner, add the global option --verbosity=debug to the command.

Eventually, you should see the following messages:

Job [jobId] submitted.
Waiting for job output...


Read the database:

cbt \

If you also ran the Wordcount job locally, you see more than one entry for each word, because Bigtable keeps multiple timestamped versions of a cell.
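
To see only the most recent value for each word, the cbt read command accepts a cells-per-column option. The following is a minimal sketch, not part of the original tutorial: read_latest_cells is a hypothetical wrapper, and its three arguments stand in for your own project ID, instance ID, and table name.

```shell
# Hypothetical wrapper (not from the tutorial): read a table but return only
# the newest cell in each column, hiding older versions.
# $1 = project ID, $2 = instance ID, $3 = table name
read_latest_cells() {
  cbt -project "$1" -instance "$2" read "$3" cells-per-column=1
}

# Possible usage; replace your-table-name with the table the job wrote to:
#   read_latest_cells "$BIGTABLE_SPARK_PROJECT_ID" "$BIGTABLE_SPARK_INSTANCE_ID" your-table-name
```

This only limits what is displayed. The older versions remain stored until they are removed by the table's garbage collection policy or the table is deleted.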

Cleaning up

  1. If you created a new instance to try this out, delete the instance:

    cbt \
      deleteinstance $BIGTABLE_SPARK_INSTANCE_ID
  2. If you created a table on an existing instance, delete only the table:

    cbt \
  3. Delete the Dataproc cluster:

    gcloud dataproc clusters delete $BIGTABLE_SPARK_DATAPROC_CLUSTER \
  4. Verify that the cluster is deleted:

    gcloud dataproc clusters list \
  5. Delete the input file and your bucket:

    gsutil rm $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt

What's next