Create a Dataproc cluster by using the gcloud CLI

This page shows you how to use the Google Cloud CLI (the gcloud command-line tool) to create a Dataproc cluster, run a simple Apache Spark job on the cluster, and then change the number of workers in the cluster.

An easy way to run the gcloud command-line tool is from Cloud Shell, which has the Google Cloud CLI preinstalled. Cloud Shell is free for Google Cloud customers, but you need a Google Cloud project to use it.

To do the same or similar tasks in other ways, see Quickstarts Using the API Explorer, Create a Dataproc cluster by using the Google Cloud console, and Create a Dataproc cluster by using client libraries.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Create a cluster

Run the following command to create a cluster called example-cluster, replacing region with the region where the cluster will run. For help selecting a region, see Available regions & zones; you can also run the gcloud compute regions list command to list the available regions. To learn about regional endpoints, see Regional endpoints.

```
gcloud dataproc clusters create example-cluster --region=region
```

Cluster creation is confirmed in the command output:

```
...
Waiting for cluster creation operation...done.
Created [... example-cluster]
```
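
After the create command returns, you may want to confirm the cluster's state before submitting work. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The name cluster_state is a hypothetical helper, not part of gcloud, and the format expression assumes the state appears under status.state in the describe output.

```shell
# Hypothetical helper (not part of gcloud): print a cluster's current state.
# $1 = cluster name, $2 = region
# Assumes the state is exposed as status.state in the describe output.
cluster_state() {
  gcloud dataproc clusters describe "$1" \
      --region="$2" \
      --format="value(status.state)"
}

# Usage (replace region with your region):
#   cluster_state example-cluster region
```

A cluster that is ready to accept jobs should report a state such as RUNNING.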

Submit a job

To submit a sample Spark job that calculates a rough value for pi, run the following command:

```
gcloud dataproc jobs submit spark --cluster example-cluster \
    --region=region \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
```

This command specifies:

That you want to run a Spark job on the example-cluster cluster in the specified region
The class containing the main method for the job's pi-calculating application
The location of the jar file containing your job's code
Any parameters you want to pass to the job; in this case, the number of tasks, which is 1000

Parameters passed to the job must follow a double dash (--). See the gcloud documentation for more information.
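
To make the role of the double dash concrete, here is a small sketch that wraps the same submit command so the task count can be varied. The function name submit_spark_pi is hypothetical; everything before `--` is interpreted by gcloud, and everything after it is handed to the job unchanged.

```shell
# Hypothetical helper (not part of gcloud): submit the SparkPi example
# with a caller-chosen number of tasks.
# $1 = cluster name, $2 = region, $3 = number of tasks (a job argument)
submit_spark_pi() {
  gcloud dataproc jobs submit spark --cluster "$1" \
      --region="$2" \
      --class org.apache.spark.examples.SparkPi \
      --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
      -- "$3"   # arguments after the double dash go to the job, not gcloud
}

# Usage (replace region with your region):
#   submit_spark_pi example-cluster region 1000
```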

The job's running output and final output are displayed in the terminal window:

```
Waiting for job output...
...
Pi is roughly 3.14118528
...
Job finished successfully.
```

Update a cluster

To change the number of workers in the cluster to five, run the following command:

```
gcloud dataproc clusters update example-cluster \
    --region=region \
    --num-workers 5
```

Your cluster's details are displayed in the command's output:

```
workerConfig:
...
  instanceNames:
  - example-cluster-w-0
  - example-cluster-w-1
  - example-cluster-w-2
  - example-cluster-w-3
  - example-cluster-w-4
  numInstances: 5
statusHistory:
...
- detail: Add 3 workers.
```

You can use the same command to decrease the number of worker nodes to the original value:

```
gcloud dataproc clusters update example-cluster \
    --region=region \
    --num-workers 2
```
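
To confirm that a resize took effect without reading the full cluster description, you could query only the worker count. This is a sketch, assuming the gcloud CLI is installed and authenticated. The name worker_count is a hypothetical helper, and the format expression assumes the count appears under config.workerConfig.numInstances in the describe output.

```shell
# Hypothetical helper (not part of gcloud): print the number of primary workers.
# $1 = cluster name, $2 = region
# Assumes the count is exposed as config.workerConfig.numInstances.
worker_count() {
  gcloud dataproc clusters describe "$1" \
      --region="$2" \
      --format="value(config.workerConfig.numInstances)"
}

# Usage (replace region with your region):
#   worker_count example-cluster region
```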

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

To delete your example cluster, run the clusters delete command:
```
gcloud dataproc clusters delete example-cluster \
    --region=region
```
You are prompted to confirm that you want to delete the cluster. Type y to complete the deletion.
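
In a script, the confirmation prompt would block execution. One option is the gcloud-wide --quiet flag, which disables interactive prompts and accepts the default answers. The sketch below assumes that behavior is what you want; delete_cluster is a hypothetical helper name, and the deletion proceeds without asking, so use it with care.

```shell
# Hypothetical helper (not part of gcloud): delete a cluster without prompting.
# $1 = cluster name, $2 = region
# --quiet is a gcloud-wide flag that suppresses interactive prompts.
delete_cluster() {
  gcloud dataproc clusters delete "$1" \
      --region="$2" \
      --quiet
}

# Usage (replace region with your region):
#   delete_cluster example-cluster region
```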

What's next

Learn how to write and run a Spark Scala job.