In addition to using standard Compute Engine VMs as Dataproc
workers (called "primary" workers), Dataproc clusters can use
secondary
workers.
The following characteristics apply to all secondary workers in a Dataproc cluster:
Processing only—Secondary workers do not store data. They only function as processing nodes. Therefore, you can use secondary workers to scale compute without scaling storage.
No secondary-worker-only clusters—Your cluster must have primary workers. If you create a cluster and you do not specify the number of primary workers, Dataproc adds two primary workers to the cluster.
Machine type—By default, secondary workers use the machine type of the cluster's primary workers. For example, if you create a cluster with primary workers that use
n1-standard-4
machine types, all secondary workers added to the cluster will also usen1-standard-4
machines.Instead of using the default primary worker machine type for secondary workers, you can specify one or more ranked lists of machine types for secondary workers. See Dataproc Flexible VMs for more information.
Persistent disk size—As a default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the
gcloud dataproc clusters create --secondary-worker-boot-disk-size
command at cluster creation. You can specify this flag even if the cluster won't have secondary workers when it is created.Asynchronous Creation—When you add secondary workers by creating or scaling up a cluster, the secondary workers may not be provisioned by the time the create or update operation finishes. This is because Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned (see Checking the status of managed instances ).
Preemptible and non-preemptible secondary workers
There are three types of secondary workers: spot VMs, standard preemptible VMs, and non-preemptible VMs. If you specify secondary workers for your cluster, they must be the same type. The default Dataproc secondary worker type is the standard preemptible VM.
Example: If you select three secondary workers when you create a cluster, you can specify that all three will be Spot VMs, or all three will be (standard) preemptible VMS, or all three will be non-preemptible VMs, but you cannot specify that each will be a different type.
A spot VM is the latest type of Compute Engine preemptible VM. It shares the lower-cost pricing model of standard preemptible VMs, but unlike the standard preemptible VM with a 24-hour maximum lifetime, the spot VM has no maximum lifetime. Both spot and standard preemptible VM workers are reclaimed and removed from a Dataproc cluster if they are required by Google Cloud for other tasks.
Preemptible workers
Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost (you can use the Google Cloud pricing calculator to estimate costs).
For best results, the number of preemptible workers in your cluster should be less than 50% of the total number of all workers (primary plus all secondary workers) in your cluster.
When using preemptible workers, your jobs will most likely experience a greater number of transient single-worker task failures compared to jobs run on non-preemptible workers. To increase job tolerance to low-level task failures, you can set cluster property values similar to the default property values used with autoscaling clusters to increase the maximum number of task retries and help avoid job failures.
A cost-savings consideration: Using preemptible VMs does not always save costs since preemptions can cause longer job execution with resulting higher job costs. Although using Enhanced Flexibility Mode (EFM) with preemptible VMs can help mitigate this result, the overall cost savings of preemptible VMs will vary with each use case. Generally, short-lived jobs are more suitable for preemptible VM use, since the probability of preemptions during job execution will be lower. Try different job options, such as no preemptible VMs and preemptible VMs with EFM, to estimate costs and arrive at the best solution.
Non-preemptible workers
- You can create a cluster with non-preemptible secondary workers to scale compute without sacrificing job stability. To do this, specify "non-preemptible" as the secondary worker type.
Using secondary workers
You can specify the number and type of secondary workers when you create a cluster using the Google Cloud console, gcloud CLI or the Dataproc API.
- Secondary workers must be the same type.
- You can update your cluster after it is created to change the number, but not the type, of secondary workers in your cluster.
- Label updates propagate to all preemptible secondary workers within 24 hours. Label updates do not propagate to existing non-preemptible secondary workers. Label updates propagate to all workers added to a cluster after a label update. For example, if you scale up the cluster, all new primary and secondary workers will have the new labels.
Console
You can specify the number of secondary workers when creating a Dataproc cluster from the Google Cloud console. After a cluster has been created, you can add and remove secondary workers by editing the cluster configuration from the Google Cloud console.
Creating a cluster with secondary workers
You can set the number and type of secondary workers to apply to a new cluster from the Secondary worker nodes section of the Configure nodes panel on the Dataproc Create a cluster page of the Google Cloud console. Specify the number and type of secondary workers in the Secondary worker nodes and Preemptibility fields, respectively.
Updating a cluster with secondary instances
To update the number of secondary workers in a cluster, click the name of the cluster on the Clusters page of the Google Cloud console. On the Cluster details page. Click the CONFIGURATION tab, then click EDIT and update the number in the Secondary worker nodes field.
Removing all secondary instances from a cluster
To remove all secondary workers from a cluster, update the cluster
configuration as explained earlier, specifying 0
in the
Secondary worker nodes field.
gcloud command
Use the
gcloud dataproc clusters create
command to add secondary workers to a cluster when the cluster is created.
After a cluster is created, you can add or remove secondary workers to or from the
cluster with the
gcloud dataproc clusters update
command (the number but not type of secondary workers can be updated).
Creating a cluster with secondary workers
To create a cluster with secondary workers, use the
gcloud dataproc clusters create
command with the --num-secondary-workers
argument. Note that
secondary workers are standard preemptible VMs by default.
You can specify non-preemptible secondary workers when you create a cluster
by setting --secondary-worker-type=non-preemptible
(the dataproc:secondary-workers.is-preemptible.override
property
is no longer used to specify the type of the secondary worker).
The following command creates "cluster1" with two standard preemptible (default type) secondary workers.
gcloud dataproc clusters create cluster1 \ --num-secondary-workers=2 \ --region=us-central1
The following command uses the secondary-worker-type
flag to create "cluster2" with two spot (preemptible) secondary workers.
gcloud dataproc clusters create cluster2 \ --num-secondary-workers=2 \ --secondary-worker-type=spot \ --region=us-central1
Example 3
The following command uses the secondary-worker-type
flag to create "cluster3" with two
non-preemptible secondary workers.
gcloud dataproc clusters create cluster3 \ --num-secondary-workers=2 \ --secondary-worker-type=non-preemptible \ --region=us-central1
Updating a cluster with secondary workers
To update a cluster to add or remove secondary workers, use the
gcloud dataproc clusters update
command with the --num-secondary-workers
argument.
The following command updates "example-cluster" to use four secondary workers (of the default type or type specified when you created the cluster).
gcloud dataproc clusters update example-cluster \ --num-secondary-workers=4 \ --region=us-central1
Removing all secondary workers from a cluster
To remove all secondary workers from a cluster, use the
gcloud dataproc clusters update
command with --num-secondary-workers
set to 0
.
The following command removes all secondary workers from "example-cluster".
gcloud dataproc clusters update example-cluster \ --num-secondary-workers=0 \ --region=us-central1
REST API
Creating a cluster with secondary workers
Use the Dataproc clusters.create API add secondary workers to a cluster when the cluster is created. Note that secondary workers are standard preemptible VMs by default.
Example 1The following POST request creates a "cluster1" with two standard preemptible (default type) VM workers.
POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters { "clusterName": "cluster1", "config": { "secondaryWorkerConfig": { "numInstances": 2 } } }
The following POST request creates a "cluster2" with two spot (preemptible) VM workers.
POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters { "clusterName": "cluster2", "config": { "secondaryWorkerConfig": { "numInstances": 2, "preemptibility": "SPOT" } } }
Example 3
The following POST request creates "cluster3" with two non-preemptible secondary workers.
POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters { "clusterName": "cluster3", "config": { "secondaryWorkerConfig": { "numInstances": 2, "preemptibility": "NON_PREEMPTIBLE" } } }
Updating a cluster with secondary workers
Use the Dataproc clusters.patch API to add and remove secondary workers.
ExampleThe following PATCH request updates a cluster to have four secondary workers (of the default type or type specified when you created the cluster).
PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances { "config": { "secondaryWorkerConfig": { "numInstances": 4 } } }
Troubleshooting secondary workers
Service account permission issues: Secondary workers are created through a managed instance group. If there is a permission issue, Dataproc logs won't report the failure to create secondary workers, but failed workers are listed under the VM Instances tab of the Cluster details page in the Google Cloud console without a green checkmark. To view the listing, open the Dataproc Clusters page, then click the cluster name to open the Cluster details page for the cluster.
Managed instance group permissions issues: To check if there is an issue with managed instance group permissions:
- Find the name of the managed instance group (
instanceGroupManagerName
).Console
- Open the Dataproc Clusters page, then click the cluster name to open the Cluster details page for the cluster.
- Click Equivalent REST at the bottom of the page, then view the
config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName
value.
Google Cloud CLI
Run thegcloud dataproc clusters describe
command with the--format
flag to display theinstanceGroupManagerName
.gcloud dataproc clusters describe CLUSTER_NAME \ --region=REGION \ --format='value(config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName)'
REST API
Submit aclusters.get
request to return the value ofconfig.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName
. - View the logs in Logs Explorer.
Select the
Google Compute Engine Instance Group
resource type, and filter for the managed instance group name.Alternatively, you can apply a logging filter for `resource.type="gce_instance_group" and
resource.labels.instance_group_name=INSTANCE_GROUP_MANAGER_NAME
.
- Find the name of the managed instance group (