Run a Cloud Bigtable Spark job on Dataproc

Author(s): @billyjacobson, Published: 2021-04-07

Billy Jacobson | Developer Relations Engineer | Google

Contributed by Google employees.

In this tutorial, you run a Spark job on Dataproc that reads from and writes to Cloud Bigtable.

Prerequisites

This is a follow-up to Using Spark with Cloud Bigtable, so complete the steps in that tutorial before beginning this one. The previous tutorial walks you through setting up the environment variables, creating the Bigtable instance and table, and running the Spark job locally.
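
This tutorial also reuses several environment variables that you set in the previous tutorial. Make sure they are still set in your shell; the values below are placeholders, so use whatever you chose there:

BIGTABLE_SPARK_INSTANCE_ID=your-bigtable-instance-id
BIGTABLE_SPARK_WORDCOUNT_TABLE=your-wordcount-table
BIGTABLE_SPARK_ASSEMBLY_JAR=/path/to/your/assembly.jar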

The examples in this tutorial use Dataproc 1.4. For the list of available Dataproc image versions, see the Dataproc image version list.

Create the Dataproc cluster

  1. Set the environment variables for configuring your Dataproc cluster:

    BIGTABLE_SPARK_DATAPROC_CLUSTER=your-dataproc-cluster
    BIGTABLE_SPARK_DATAPROC_REGION=your-dataproc-region
    BIGTABLE_SPARK_CLUSTER_ZONE=your-bigtable-cluster-zone
    BIGTABLE_SPARK_PROJECT_ID=your-project-id # This can be the same as your Bigtable project
    

    For information about regions and zones, read Available regions and zones.

  2. Use the gcloud command-line tool to create a cluster:

    gcloud dataproc clusters create $BIGTABLE_SPARK_DATAPROC_CLUSTER \
      --region=$BIGTABLE_SPARK_DATAPROC_REGION \
      --zone=$BIGTABLE_SPARK_CLUSTER_ZONE \
      --project=$BIGTABLE_SPARK_PROJECT_ID \
      --image-version=1.4
    
  3. List the clusters:

    gcloud dataproc clusters list \
      --region=$BIGTABLE_SPARK_DATAPROC_REGION
    

    Make sure that BIGTABLE_SPARK_DATAPROC_CLUSTER is among the clusters.
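
    If you prefer to check this from a script, one option is to filter the list by name with gcloud's standard --filter flag (an optional variation of the command above):

    gcloud dataproc clusters list \
      --region=$BIGTABLE_SPARK_DATAPROC_REGION \
      --filter="clusterName = $BIGTABLE_SPARK_DATAPROC_CLUSTER"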

Upload the file to Cloud Storage

Because you're running the Spark job in the cloud, you need to upload the input file to Cloud Storage.

For information about gsutil, see Quickstart: Using the gsutil tool.

  1. Choose a bucket name and set it as an environment variable:

    BIGTABLE_SPARK_BUCKET_NAME=gs://your-bucket-name-12345
    

    Bucket names must be globally unique across Cloud Storage, so you may want to append a few digits to avoid name conflicts during creation.

  2. Create the bucket:

    gsutil mb \
      -b on \
      -l $BIGTABLE_SPARK_DATAPROC_REGION \
      -p $BIGTABLE_SPARK_PROJECT_ID \
      $BIGTABLE_SPARK_BUCKET_NAME
    
  3. Upload the input file to the bucket:

    gsutil cp src/test/resources/Romeo-and-Juliet-prologue.txt $BIGTABLE_SPARK_BUCKET_NAME
    
  4. List the contents of the bucket:

    gsutil ls $BIGTABLE_SPARK_BUCKET_NAME
    

    The output should be the following:

    gs://[your-bucket-name]/Romeo-and-Juliet-prologue.txt
    

Submit the Wordcount job

Submit the Wordcount job to the Dataproc cluster:

gcloud dataproc jobs submit spark \
  --cluster=$BIGTABLE_SPARK_DATAPROC_CLUSTER \
  --region=$BIGTABLE_SPARK_DATAPROC_REGION \
  --class=example.Wordcount \
  --jars=$BIGTABLE_SPARK_ASSEMBLY_JAR \
  --properties=spark.jars.packages='org.apache.hbase.connectors.spark:hbase-spark:1.0.0' \
  -- \
  $BIGTABLE_SPARK_PROJECT_ID $BIGTABLE_SPARK_INSTANCE_ID \
  $BIGTABLE_SPARK_WORDCOUNT_TABLE $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt
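
The four values after the -- separator are passed as arguments to the Wordcount class: the project ID, the Bigtable instance ID, the output table, and the Cloud Storage path of the input file.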

It may take some time before you see any progress. To see progress information sooner, set the --verbosity global flag to debug.
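
For example, here is the same submission with the flag added; only --verbosity=debug is new, and everything else matches the command above:

gcloud dataproc jobs submit spark \
  --verbosity=debug \
  --cluster=$BIGTABLE_SPARK_DATAPROC_CLUSTER \
  --region=$BIGTABLE_SPARK_DATAPROC_REGION \
  --class=example.Wordcount \
  --jars=$BIGTABLE_SPARK_ASSEMBLY_JAR \
  --properties=spark.jars.packages='org.apache.hbase.connectors.spark:hbase-spark:1.0.0' \
  -- \
  $BIGTABLE_SPARK_PROJECT_ID $BIGTABLE_SPARK_INSTANCE_ID \
  $BIGTABLE_SPARK_WORDCOUNT_TABLE $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt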

Eventually, you should see the following messages:

Job [jobId] submitted.
Waiting for job output...
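
If you want to check on the job later, you can list the jobs on the cluster, or wait on a specific job using the job ID printed in the Job [...] submitted message (your-job-id below is a placeholder):

gcloud dataproc jobs list \
  --cluster=$BIGTABLE_SPARK_DATAPROC_CLUSTER \
  --region=$BIGTABLE_SPARK_DATAPROC_REGION

gcloud dataproc jobs wait your-job-id \
  --region=$BIGTABLE_SPARK_DATAPROC_REGION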

Verify

Read the wordcount table:

cbt \
  -project=$BIGTABLE_SPARK_PROJECT_ID \
  -instance=$BIGTABLE_SPARK_INSTANCE_ID \
  read $BIGTABLE_SPARK_WORDCOUNT_TABLE

If you already ran the Wordcount job locally in the previous tutorial, you will see multiple timestamped cells for each word, because Bigtable stores multiple versions of each value.
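
If you only want the most recent version of each value, you can limit the read. The cells-per-column option used here is available in recent versions of cbt; check cbt help read if your version differs:

cbt \
  -project=$BIGTABLE_SPARK_PROJECT_ID \
  -instance=$BIGTABLE_SPARK_INSTANCE_ID \
  read $BIGTABLE_SPARK_WORDCOUNT_TABLE cells-per-column=1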

Cleaning up

  1. If you created a new instance to try this out, delete the instance:

    cbt \
      -project=$BIGTABLE_SPARK_PROJECT_ID \
      deleteinstance $BIGTABLE_SPARK_INSTANCE_ID
    
  2. If you created a table on an existing instance, delete only the table:

    cbt \
      -project=$BIGTABLE_SPARK_PROJECT_ID \
      -instance=$BIGTABLE_SPARK_INSTANCE_ID \
      deletetable $BIGTABLE_SPARK_WORDCOUNT_TABLE
    
  3. Delete the Dataproc cluster:

    gcloud dataproc clusters delete $BIGTABLE_SPARK_DATAPROC_CLUSTER \
      --region=$BIGTABLE_SPARK_DATAPROC_REGION \
      --project=$BIGTABLE_SPARK_PROJECT_ID
    
  4. Verify that the cluster is deleted:

    gcloud dataproc clusters list \
      --region=$BIGTABLE_SPARK_DATAPROC_REGION
    
  5. Delete the input file and your bucket:

    gsutil rm $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt
    gsutil rb $BIGTABLE_SPARK_BUCKET_NAME
    

