This page shows you how to use the Google Cloud Console to create a Dataproc cluster, run a simple Apache Spark job in the cluster, then modify the number of workers in the cluster.
Before you begin
- Sign in to your Google Account. If you don't already have one, sign up for a new account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- Enable the Dataproc API.
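If you prefer the command line to the Console, the API can also be enabled with the gcloud CLI. The following is a minimal sketch that only assembles the command as an argument list. It assumes the Google Cloud SDK is installed and authenticated, and `my-project-id` is a hypothetical project ID, not one from this page.

```python
import shlex


def enable_dataproc_api_command(project_id):
    """Return the gcloud argument list that enables the Dataproc API for a project."""
    return [
        "gcloud", "services", "enable", "dataproc.googleapis.com",
        f"--project={project_id}",
    ]


# "my-project-id" is a placeholder; substitute your own project ID.
command = enable_dataproc_api_command("my-project-id")

# Print the command for review before running it.
print(shlex.join(command))
```

To run the command rather than print it, the list can be passed to `subprocess.run(command, check=True)`.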
Create a cluster
- Go to the Dataproc Clusters page in the Cloud Console.
- Click Create cluster.
- Enter example-cluster in the Name field.
- Select a region and zone for the cluster from the Region and Zone drop-down menus. You can select a distinct region, such as europe-west1, to isolate resources (including VM instances and Cloud Storage) and the metadata storage locations used by Dataproc within that region. If you select a distinct region, you can select "No preference" for the zone to let Dataproc pick a zone within the selected region for your cluster (see Dataproc Auto Zone Placement). You can also select the global region, a special multi-region endpoint that can deploy instances into any user-specified Compute Engine zone; when you select the global region, you must also select a zone. See Regional endpoints to learn more about the difference between global and regional endpoints, and see Available regions & zones for information on selecting a region and zone. You can also run the gcloud compute regions list command to list the available regions.
- Use the provided defaults for all the other options.
- Click Create to create the cluster.
Your new cluster should appear in the Clusters list. Cluster status is listed as "Provisioning" until the cluster is ready to use, then changes to "Running."
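For comparison, the same cluster can be created from the gcloud CLI instead of the Console. The sketch below builds the command as an argument list and leaves every other option at its default, as the steps above do. The `europe-west1` region is taken from the example above. The optional `zone` parameter is an illustrative addition, and the sketch assumes the Google Cloud SDK is installed and authenticated.

```python
import shlex


def create_cluster_command(name, region, zone=None):
    """Return the gcloud argument list that creates a Dataproc cluster with default settings."""
    argv = ["gcloud", "dataproc", "clusters", "create", name, f"--region={region}"]
    if zone:
        # Pin a specific Compute Engine zone only when one is requested.
        argv.append(f"--zone={zone}")
    return argv


# Cluster name from this quickstart; the region is an example choice.
command = create_cluster_command("example-cluster", "europe-west1")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run it. Printing first makes it easy to confirm the name and region before any resources are created.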
Submit a job
To run a sample Spark job:
- Select Jobs in the left navigation pane to switch to Dataproc's jobs view.
- Click Submit job.
- You can accept the Job ID or provide your own, which must be unique within the project.
- Select the Region of your new example-cluster.
- Select example-cluster from the Cluster drop-down menu.
- Select Spark from the Job type drop-down menu.
- Enter org.apache.spark.examples.SparkPi in the Main class or jar field.
- Enter file:///usr/lib/spark/examples/jars/spark-examples.jar in the Jar files field.
- Enter 1000 in the Arguments field to set the number of tasks.
- Click Submit.
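The form values above map onto a single gcloud CLI call. As a rough sketch, the function below assembles that call as an argument list: the main class, jar path, and argument come from the steps above, while the `europe-west1` region is an assumed example. It assumes the Google Cloud SDK is installed and authenticated.

```python
import shlex


def submit_sparkpi_command(cluster, region, tasks=1000):
    """Return the gcloud argument list that submits the SparkPi example job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        "--class=org.apache.spark.examples.SparkPi",
        "--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar",
        "--",        # arguments after this separator are passed to the job itself
        str(tasks),  # the value entered in the Arguments field
    ]


command = submit_sparkpi_command("example-cluster", "europe-west1")
print(shlex.join(command))
```

The `--` separator matters: without it, gcloud would try to interpret `1000` as one of its own arguments rather than handing it to SparkPi.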
Your job should appear in the Jobs list, which shows your project's jobs with their cluster, type, and current status. Job status displays as "Running," and then "Succeeded" after it completes. To see your completed job's output:
- Click the job ID in the Jobs list.
- Select Line Wrapping to avoid scrolling.
You should see that your job has successfully calculated a rough value for pi!
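The value is only rough because the SparkPi example estimates pi by random sampling rather than computing it exactly: it scatters random points over a square and counts how many land inside the inscribed circle. The single-process sketch below illustrates that idea only. It is not the Spark implementation, and the sample count and seed are arbitrary choices made for illustration.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi by Monte Carlo sampling of points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count points that fall inside the unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers pi/4 of the square's area, so scale the hit ratio by 4.
    return 4.0 * inside / samples


print(estimate_pi(100_000))
```

More samples narrow the error, which is why the distributed job spreads the sampling across many tasks.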
Update a cluster
To change the number of worker instances in your cluster:
- Select Clusters in the left navigation pane to return to the Cloud Dataproc Clusters view.
- Click example-cluster in the Clusters list. By default, the page displays an overview of your cluster's CPU usage.
- Click Configuration to display your cluster's current settings.
- Click Edit. The number of worker nodes is now editable.
- Enter 5 in the Worker nodes field.
- Click Save.
Your cluster is now updated. You can follow the same procedure to decrease the number of worker nodes to the original value.
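The same resize can be expressed as a gcloud CLI call. The sketch below builds that command for the worker count used above. The `europe-west1` region is an assumed example, and the sketch assumes the Google Cloud SDK is installed and authenticated.

```python
import shlex


def resize_cluster_command(cluster, region, num_workers):
    """Return the gcloud argument list that sets the number of worker nodes."""
    return [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]


# Scale example-cluster to 5 workers, matching the Console steps.
command = resize_cluster_command("example-cluster", "europe-west1", 5)
print(shlex.join(command))
```

Calling the function again with the original worker count produces the command to scale back down.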
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this quickstart, follow these steps:
- On the example-cluster Cluster page, click Delete to delete the cluster. You are prompted to confirm that you want to delete the cluster. Click OK.
- You should also remove any Cloud Storage buckets that were created by the cluster by running the following command:
gsutil rm gs://bucket/subdir/**
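Both cleanup steps can also be scripted. As a hedged sketch, the function below returns the cluster deletion command followed by the gsutil command shown above. The `gs://bucket/subdir` path is the placeholder from this page and must be replaced with your own bucket path, and `europe-west1` is an assumed example region.

```python
import shlex


def cleanup_commands(cluster, region, bucket_path):
    """Return the argument lists that delete the cluster and remove its bucket objects."""
    return [
        # gcloud asks for confirmation before deleting, like the Console prompt.
        ["gcloud", "dataproc", "clusters", "delete", cluster, f"--region={region}"],
        # The ** wildcard is interpreted by gsutil and matches all objects under the path.
        ["gsutil", "rm", f"{bucket_path}/**"],
    ]


# "gs://bucket/subdir" is a placeholder path, not a real bucket.
commands = cleanup_commands("example-cluster", "europe-west1", "gs://bucket/subdir")
for argv in commands:
    print(shlex.join(argv))
```

Because `shlex.join` quotes the wildcard, the printed gsutil line can be pasted into a shell without the shell expanding `**` itself.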