This page shows you how to use the Google Cloud SDK gcloud command-line tool to create a Google Cloud Dataproc cluster, run a simple Apache Spark job in the cluster, then modify the number of workers in the cluster.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- Select or create a GCP project.
- Make sure that billing is enabled for your Google Cloud Platform project.
- Enable the Cloud Dataproc API.
Create a cluster
Run the following command to create a cluster named example-cluster with default Cloud Dataproc settings:
gcloud dataproc clusters create example-cluster
...
Waiting for cluster creation operation...done.
Created [... example-cluster]
The default value of the --region flag is global. This is a special multi-region endpoint that can deploy instances into any user-specified Compute Engine zone. You can also specify a distinct region, such as europe-west1, to isolate resources (including VM instances and Cloud Storage) and the metadata storage locations used by Cloud Dataproc within that region. See Regional endpoints to learn more about the difference between global and regional endpoints.
See Available regions & zones
for information on selecting a region. You can also run the
gcloud compute regions list command to see a listing of available regions.
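As a minimal sketch of pinning the cluster to a regional endpoint, the script below passes --region explicitly. The region value and the run helper are illustrative assumptions, not part of this quickstart. With DRY_RUN left at its default, the script only prints the command so you can review it before running it for real.

```shell
#!/bin/sh
# Hypothetical values for illustration; replace with your own.
CLUSTER_NAME="example-cluster"
REGION="europe-west1"

# Dry-run helper (an assumption of this sketch): prints the command
# unless DRY_RUN=0 is set, in which case the command is executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the cluster in a specific region instead of the global default.
run gcloud dataproc clusters create "$CLUSTER_NAME" --region "$REGION"
```

Setting DRY_RUN=0 in the environment makes the same script call gcloud instead of printing.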
Submit a job
To submit a sample Spark job that calculates a rough value for pi, run the following command:
gcloud dataproc jobs submit spark --cluster example-cluster \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
This command specifies:
- That you want to run a spark job on the example-cluster cluster
- The class containing the main method for the job's pi-calculating application
- The location of the jar file containing your job's code
- Any parameters you want to pass to the job; in this case, the number of tasks, which is 1000
The job's progress and final output are displayed in the terminal window:
Waiting for job output...
...
Pi is roughly 3.14118528
...
Job finished successfully.
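To make the "roughly" in that output less opaque: the SparkPi example is generally understood to use a Monte Carlo estimate, drawing random points in a square and using the fraction that falls inside the inscribed circle to approximate pi/4. The sketch below reproduces that idea locally with awk. It is not the Spark example's code, and the function name, sample count, and seed are hypothetical choices.

```shell
#!/bin/sh
# estimate_pi SAMPLES SEED
# Draws SAMPLES random points in the square [-1, 1] x [-1, 1] and counts
# how many land inside the unit circle. That fraction approximates pi/4.
estimate_pi() {
  awk -v n="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    inside = 0
    for (i = 0; i < n; i++) {
      x = 2 * rand() - 1
      y = 2 * rand() - 1
      if (x * x + y * y <= 1) inside++
    }
    printf "%.4f\n", 4 * inside / n
  }'
}

# Hypothetical sample count and seed; more samples give a tighter estimate.
estimate_pi 100000 42
```

The exact digits depend on the awk implementation's random number generator, which is why repeated or larger runs of a job like this report slightly different values.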
Update a cluster
To change the number of workers in the cluster to five, run the following command:
gcloud dataproc clusters update example-cluster --num-workers 5
Your cluster's updated details are displayed in the command's output:
workerConfig:
...
  instanceNames:
  - example-cluster-w-0
  - example-cluster-w-1
  - example-cluster-w-2
  - example-cluster-w-3
  - example-cluster-w-4
  numInstances: 5
statusHistory:
...
- detail: Add 3 workers.
You can use the same command to decrease the number of worker nodes to the original value:
gcloud dataproc clusters update example-cluster --num-workers 2
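One way to confirm that a resize took effect is to read the worker count back from the cluster description. The sketch below assumes the count is exposed at the config.workerConfig.numInstances field path and uses gcloud's --format flag to print only that value. The run helper and the resize_and_check function are hypothetical names introduced for this example, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical value for illustration; replace with your own cluster name.
CLUSTER_NAME="example-cluster"

# Dry-run helper (an assumption of this sketch): prints the command
# unless DRY_RUN=0 is set, in which case the command is executed.
# Note that the printed form drops the original shell quoting.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Resize the cluster to $1 workers, then read the worker count back.
resize_and_check() {
  run gcloud dataproc clusters update "$CLUSTER_NAME" --num-workers "$1"
  run gcloud dataproc clusters describe "$CLUSTER_NAME" \
    --format "value(config.workerConfig.numInstances)"
}

resize_and_check 5
```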
To avoid incurring charges to your GCP account for the resources used in this quickstart:
- Run clusters delete to delete your example cluster:
gcloud dataproc clusters delete example-cluster
You are prompted to confirm that you want to delete the cluster. Type y to complete the deletion.
- You should also remove any Cloud Storage buckets that were created by the
cluster by running the following command:
gsutil rm gs://bucket/subdir/**
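For scripted cleanup, where nobody is present to answer the confirmation prompt, gcloud's global --quiet flag can be used to skip it. The sketch below only builds and prints the delete command so it can be reviewed first. The variable names are hypothetical, and the cluster name is the one used throughout this quickstart.

```shell
#!/bin/sh
# Cluster created earlier in this quickstart.
CLUSTER_NAME="example-cluster"

# --quiet is a global gcloud flag that suppresses interactive prompts,
# so the deletion proceeds without asking for confirmation.
DELETE_CMD="gcloud dataproc clusters delete $CLUSTER_NAME --quiet"

# Printed rather than executed so it can be checked first.
# To run it, paste the printed line into your terminal.
echo "$DELETE_CMD"
```

Because --quiet removes the safety prompt, check the cluster name carefully before running the printed command.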