After creating a Cloud Dataproc cluster, you can adjust ("scale") the cluster by increasing or decreasing the number of worker nodes in the cluster. You can scale a Cloud Dataproc cluster at any time, even when jobs are running on the cluster.
Why scale a Cloud Dataproc cluster?
- to increase the number of workers to make a job run faster
- to decrease the number of workers to save money (see Graceful Decommissioning for a way to downsize a cluster without losing work in progress)
- to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage
Because a cluster can be scaled more than once, you can increase the cluster size for a period and decrease it later, or the reverse.
There are three ways you can scale your Cloud Dataproc cluster:
- Use the gcloud command-line tool in the Google Cloud SDK.
- Edit the cluster configuration in the Google Cloud Platform Console.
- Use the REST API.
New workers added to a cluster will use the same machine type as existing workers. For example, if a cluster is created with workers that use the n1-standard-8 machine type, new workers will also use the n1-standard-8 machine type.
gcloud
To scale a cluster with gcloud dataproc clusters update, run the following command:
    gcloud dataproc clusters update cluster-name --num-workers new-number-of-workers
where cluster-name is the name of the cluster to update, and new-number-of-workers is the updated number of worker nodes. For example, to scale a cluster named "dataproc-1" to use five worker nodes, run the following command:
    gcloud dataproc clusters update dataproc-1 --num-workers 5
    Waiting on operation [operations/projects/project-id/operations/...].
    Waiting for cluster update operation...done.
    Updated [https://dataproc.googleapis.com/...].
    clusterName: dataproc-1
    ...
    masterDiskConfiguration:
      bootDiskSizeGb: 500
    masterName: dataproc-1-m
    numWorkers: 5
    ...
    workers:
    - dataproc-1-w-0
    - dataproc-1-w-1
    - dataproc-1-w-2
    - dataproc-1-w-3
    - dataproc-1-w-4
    ...
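The scaling command can also be wrapped in a small shell function that checks its arguments before calling gcloud. The following is a minimal sketch, not part of the Cloud SDK: the function name scale_cluster and the DRY_RUN variable are hypothetical, and the only validation assumed is that the worker count is a non-negative integer.

```shell
# Hypothetical helper: scale_cluster CLUSTER_NAME NUM_WORKERS
# Set DRY_RUN=1 to print the command instead of running it.
scale_cluster() {
  if [ -z "$1" ]; then
    echo "error: cluster name is required" >&2
    return 1
  fi
  # Accept only a non-negative integer for the worker count.
  case "$2" in
    ''|*[!0-9]*)
      echo "error: worker count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gcloud dataproc clusters update $1 --num-workers $2"
  else
    gcloud dataproc clusters update "$1" --num-workers "$2"
  fi
}
```

For example, running DRY_RUN=1 scale_cluster dataproc-1 5 prints the command without contacting the cluster, which can be useful for checking a script before it changes anything.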
Console
After a cluster is created, you can scale it by clicking the Edit button on the Configuration tab on the cluster detail page. Enter a new value for the number of Worker nodes (updated to "5" in the following screenshot). Click Save to update the cluster.
REST API
See clusters.patch.
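A clusters.patch request that changes the worker count might look like the sketch below. It assumes the v1 endpoint with an updateMask of config.worker_config.num_instances and a body that sets config.workerConfig.numInstances. The helper names patch_url and patch_body are hypothetical, and the project, region, and cluster values are placeholders.

```shell
# Hypothetical helpers that build the pieces of a clusters.patch request.
patch_url() {
  # $1 = project ID, $2 = region, $3 = cluster name
  echo "https://dataproc.googleapis.com/v1/projects/$1/regions/$2/clusters/$3?updateMask=config.worker_config.num_instances"
}

patch_body() {
  # $1 = new number of worker nodes
  printf '{"config": {"workerConfig": {"numInstances": %d}}}\n' "$1"
}

# Example call (assumes curl and an authenticated gcloud installation):
#   curl -X PATCH \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type: application/json" \
#     -d "$(patch_body 5)" \
#     "$(patch_url project-id region dataproc-1)"
```

Keeping the URL and body in separate functions lets you inspect the request before sending it.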
When you update a cluster using Cloud Dataproc version 1.2 or later, you can use Graceful Decommissioning, which incorporates graceful YARN decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.
Using Graceful Decommissioning
Cloud Dataproc Graceful Decommissioning incorporates graceful YARN decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster. By default, graceful decommissioning is disabled. You enable it by setting a timeout value when you update your cluster to remove one or more workers.
gcloud
When you update a cluster to remove one or more workers, use the gcloud beta dataproc clusters update command with the --graceful-decommission-timeout flag. The timeout is a string value: either "0s" (the default, which means forceful rather than graceful decommissioning) or a positive duration relative to the current time (for example, "3s"). The maximum duration is 1 day.
    gcloud dataproc clusters update \
        --graceful-decommission-timeout="timeout-value" other args ...
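Because the timeout has an upper bound of 1 day, a script can check the value before passing it to gcloud. The following is a minimal sketch under stated assumptions: the function name valid_timeout is hypothetical, and it only accepts a whole number of seconds followed by "s" (as in the "0s" and "3s" examples above). The gcloud tool may accept other duration formats that this check would reject.

```shell
# Hypothetical helper: valid_timeout VALUE
# Succeeds if VALUE is a whole number of seconds followed by "s"
# and does not exceed one day (86400 seconds).
valid_timeout() {
  case "$1" in
    *s) secs="${1%s}" ;;   # strip the trailing "s"
    *)  return 1 ;;        # no unit suffix
  esac
  case "$secs" in
    ''|*[!0-9]*) return 1 ;;   # empty or not purely digits
  esac
  [ "$secs" -le 86400 ]
}
```

A script could call valid_timeout "3600s" and run the update command only if the check succeeds.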