title | description | author | tags | date_published |
---|---|---|---|---|
Run a Cloud Bigtable Spark job on Dataproc |
Run a Spark job on Dataproc that reads from and writes to Cloud Bigtable. |
billyjacobson |
bigtable, spark, database, big table, apache spark, hbase, dataproc |
2021-04-07 |
Billy Jacobson | Developer Relations Engineer | Google
Contributed by Google employees.
In this tutorial, you run a Spark job on Dataproc that reads from and writes to Cloud Bigtable.
This is a followup to Using Spark with Cloud Bigtable, so follow the steps in that tutorial before beginning this one. The previous tutorial walks you through setting up the environment variables, creating the Bigtable instance and table, and running the Spark job locally.
The examples in this tutorial use Dataproc 1.4. For the list of available Dataproc image versions see the Dataproc image version list.
-
Set the environment variables for configuring your Dataproc cluster:
BIGTABLE_SPARK_DATAPROC_CLUSTER=your-dataproc-cluster BIGTABLE_SPARK_DATAPROC_REGION=your-dataproc-region BIGTABLE_SPARK_CLUSTER_ZONE=your-bigtable-cluster-zone BIGTABLE_SPARK_PROJECT_ID=your-project-id //This can be the same as your Bigtable project
For information about regions and zones, read Available regions and zones.
-
Use the
gcloud
command-line tool to create a cluster:gcloud dataproc clusters create $BIGTABLE_SPARK_DATAPROC_CLUSTER \ --region=$BIGTABLE_SPARK_DATAPROC_REGION \ --zone=$BIGTABLE_SPARK_CLUSTER_ZONE \ --project=$BIGTABLE_SPARK_PROJECT_ID \ --image-version=1.4
-
List the clusters:
gcloud dataproc clusters list \ --region=$BIGTABLE_SPARK_DATAPROC_REGION
Make sure that
BIGTABLE_SPARK_DATAPROC_CLUSTER
is among the clusters.
Because you're running the Spark job in the cloud, you need to upload your the file to Cloud Storage.
For information about gsutil
, see Quickstart: Using the gsutil tool.
-
Choose a bucket name and set it as an environment variable:
BIGTABLE_SPARK_BUCKET_NAME=gs://your-bucket-name-12345
Bucket names must be unique across all Google Cloud projects, so you may want to append a few digits, so you don't run into name conflicts during creation.
-
Create the bucket:
gsutil mb \ -b on \ -l $BIGTABLE_SPARK_DATAPROC_REGION \ -p $BIGTABLE_SPARK_PROJECT_ID \ $BIGTABLE_SPARK_BUCKET_NAME
-
Upload an input file into the bucket:
gsutil cp src/test/resources/Romeo-and-Juliet-prologue.txt $BIGTABLE_SPARK_BUCKET_NAME
-
List the contents of the bucket:
gsutil ls $BIGTABLE_SPARK_BUCKET_NAME
The output should be the following:
gs://[your-bucket-name]/Romeo-and-Juliet-prologue.txt
Submit the Wordcount job to the Dataproc instance:
gcloud dataproc jobs submit spark \
--cluster=$BIGTABLE_SPARK_DATAPROC_CLUSTER \
--region=$BIGTABLE_SPARK_DATAPROC_REGION \
--class=example.Wordcount \
--jars=$BIGTABLE_SPARK_ASSEMBLY_JAR \
--properties=spark.jars.packages='org.apache.hbase.connectors.spark:hbase-spark:1.0.0' \
-- \
$BIGTABLE_SPARK_PROJECT_ID $BIGTABLE_SPARK_INSTANCE_ID \
$BIGTABLE_SPARK_WORDCOUNT_TABLE $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt
It may take some time to see any progress. You can use the --verbosity
global option with debug
to be told about progress earlier.
Eventually, you should see the following messages:
Job [joibId] submitted.
Waiting for job output...
Read the database:
cbt \
-project=$BIGTABLE_SPARK_PROJECT_ID \
-instance=$BIGTABLE_SPARK_INSTANCE_ID \
read $BIGTABLE_SPARK_WORDCOUNT_TABLE
If you ran the Wordcount job locally, you will see duplicate entries for words, since Bigtable supports data versioning.
-
If you created a new instance to try this out, delete the instance:
cbt \ -project=$BIGTABLE_SPARK_PROJECT_ID \ deleteinstance $BIGTABLE_SPARK_INSTANCE_ID
-
If you created a table on an existing instance, only delete the table:
cbt \ -project=$BIGTABLE_SPARK_PROJECT_ID \ -instance=$BIGTABLE_SPARK_INSTANCE_ID \ deletetable $BIGTABLE_SPARK_WORDCOUNT_TABLE
-
Delete the Dataproc cluster:
gcloud dataproc clusters delete $BIGTABLE_SPARK_DATAPROC_CLUSTER \ --region=$BIGTABLE_SPARK_DATAPROC_REGION \ --project=$BIGTABLE_SPARK_PROJECT_ID
-
Verify that the cluster is deleted:
gcloud dataproc clusters list \ --region=$BIGTABLE_SPARK_DATAPROC_REGION
-
Delete the input file and your bucket:
gsutil rm $BIGTABLE_SPARK_BUCKET_NAME/Romeo-and-Juliet-prologue.txt gsutil rb $BIGTABLE_SPARK_BUCKET_NAME
- Learn more about Cloud Bigtable.
- Learn more about Dataproc.