In addition to using standard Compute Engine VMs as Dataproc workers (called "primary" workers), Dataproc clusters can use secondary workers.
The following characteristics apply to all secondary workers in a Dataproc cluster:
- Processing only: Secondary workers do not store data. They function only as processing nodes, so you can use secondary workers to scale compute without scaling storage.
- No secondary-worker-only clusters: Your cluster must have primary workers. If you create a cluster and you don't specify the number of primary workers, Dataproc adds two primary workers to the cluster.
- Machine type: By default, secondary workers use the machine type of the cluster's primary workers. For example, if you create a cluster with primary workers that use `n1-standard-4` machine types, all secondary workers added to the cluster also use `n1-standard-4` machines by default. Instead of using the default primary worker machine type for secondary workers, you can specify one or more ranked lists of machine types for secondary workers. See Dataproc Flexible VMs for more information.
- Persistent disk size: By default, secondary workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the `--secondary-worker-boot-disk-size` flag of the `gcloud dataproc clusters create` command at cluster creation. You can specify this flag even if the cluster won't have secondary workers when it is created.
- Asynchronous creation: When you add secondary workers by creating or scaling up a cluster, the secondary workers may not be provisioned by the time the create or update operation finishes. This is because Dataproc manages secondary workers using Managed Instance Groups (MIGs), which create VMs asynchronously as soon as they can be provisioned (see Checking the status of managed instances).
Preemptible and non-preemptible secondary workers
There are three types of secondary workers: spot VMs, standard preemptible VMs, and non-preemptible VMs. The default Dataproc secondary worker type is the standard preemptible VM. You can specify a mix of spot and non-preemptible secondary workers.
Example: If you select three secondary workers when you create a cluster, you can specify three spot VMs, three preemptible VMs, three non-preemptible VMs, or a mix of spot and non-preemptible workers.
Preemptible workers
Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost (you can use the Google Cloud pricing calculator to estimate costs).
For best results, the number of preemptible workers in your cluster should be less than 50% of the total number of all workers (primary plus all secondary workers) in your cluster.
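As a quick check of the guideline above, the following Python sketch computes the preemptible share of a planned cluster. The function names are hypothetical and are not part of any Dataproc API; the sketch only restates the "less than 50% of all workers" rule as arithmetic.

```python
def preemptible_share(primary: int, preemptible_secondary: int,
                      non_preemptible_secondary: int = 0) -> float:
    """Return preemptible workers as a fraction of all workers."""
    total = primary + preemptible_secondary + non_preemptible_secondary
    if total <= 0:
        raise ValueError("cluster must have at least one worker")
    return preemptible_secondary / total


def preemptible_within_guideline(primary: int, preemptible_secondary: int,
                                 non_preemptible_secondary: int = 0) -> bool:
    """True when preemptible workers are less than 50% of all workers."""
    share = preemptible_share(primary, preemptible_secondary,
                              non_preemptible_secondary)
    return share < 0.5


# 2 primary + 1 preemptible: 1/3 of workers are preemptible.
print(preemptible_within_guideline(2, 1))   # True
# 2 primary + 4 preemptible: 2/3 of workers are preemptible.
print(preemptible_within_guideline(2, 4))   # False
```

A cluster with exactly half of its workers preemptible does not pass this check, because the guideline asks for less than 50%.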
When using preemptible workers, your jobs will most likely experience a greater number of transient single-worker task failures compared to jobs run on non-preemptible workers. To increase job tolerance to low-level task failures, you can set cluster property values similar to the default property values used with autoscaling clusters to increase the maximum number of task retries and help avoid job failures.
A cost-savings consideration: Using preemptible VMs does not always save costs since preemptions can cause longer job execution with resulting higher job costs. Although using Enhanced Flexibility Mode (EFM) with preemptible VMs can help mitigate this result, the overall cost savings of preemptible VMs will vary with each use case. Generally, short-lived jobs are more suitable for preemptible VM use, since the probability of preemptions during job execution will be lower. Try different job options, such as no preemptible VMs and preemptible VMs with EFM, to estimate costs and arrive at the best solution.
Non-preemptible workers
- You can create a cluster with non-preemptible secondary workers to scale compute without sacrificing job stability. To do this, specify `non-preemptible` as the secondary worker type. You can mix non-preemptible with spot secondary workers.
Select secondary workers
You can specify the number and type of secondary workers when you create a cluster using the Google Cloud console, gcloud CLI or the Dataproc API.
- You can mix spot with non-preemptible secondary workers.
- You can update your cluster after it is created to change the number, but not the type, of secondary workers in your cluster.
- Label updates propagate to all preemptible secondary workers within 24 hours. Label updates don't propagate to existing non-preemptible secondary workers. Label updates propagate to all workers added to a cluster after a label update. For example, if you scale up the cluster, all new primary and secondary workers will have the new labels.
Console
You can specify the number of secondary workers when creating a Dataproc cluster from the Google Cloud console. After a cluster has been created, you can add and remove secondary workers by editing the cluster configuration from the Google Cloud console.
Create a cluster with secondary workers
You can set the number and type of secondary workers to apply to a new cluster from the Secondary worker nodes section of the Configure nodes panel on the Dataproc Create a cluster page of the Google Cloud console. Specify the number and type of secondary workers in the Secondary worker nodes and Preemptibility fields, respectively.
Update a cluster with secondary instances
To update the number of secondary workers in a cluster, click the name of the cluster on the Clusters page of the Google Cloud console. On the Cluster details page, click the **Configuration** tab, then click Edit and update the number in the Secondary worker nodes field.
Remove all secondary instances from a cluster
To remove all secondary workers from a cluster, update the cluster configuration as explained earlier, specifying `0` in the Secondary worker nodes field.
Google Cloud CLI command
Use the `gcloud dataproc clusters create` command to add secondary workers to a cluster when the cluster is created. After a cluster is created, you can add secondary workers to or remove them from the cluster with the `gcloud dataproc clusters update` command (the number but not the type of secondary workers can be updated).
Create a cluster with secondary workers
To create a cluster with secondary workers, use the `gcloud dataproc clusters create` command with the `--num-secondary-workers` flag. Secondary workers are standard preemptible VMs by default. You can specify non-preemptible or spot secondary workers when you create a cluster by setting the `--secondary-worker-type` flag to `non-preemptible` or `spot`. The following examples show how to create a cluster with each secondary worker type: `preemptible` (default), `spot` (preemptible), and `non-preemptible`. You can use additional flags to mix spot with non-preemptible secondary workers.
Example 1
The following command creates "cluster1" with two standard preemptible (default type) secondary workers.
    gcloud dataproc clusters create cluster1 \
        --num-secondary-workers=2 \
        --region=us-central1
Example 2
The following command uses the `--secondary-worker-type` flag to create "cluster2" with two spot (preemptible) secondary workers.
    gcloud dataproc clusters create cluster2 \
        --num-secondary-workers=2 \
        --secondary-worker-type=spot \
        --region=us-central1
Example 3
The following command uses the `--secondary-worker-type` flag to create "cluster3" with two non-preemptible secondary workers.
    gcloud dataproc clusters create cluster3 \
        --num-secondary-workers=2 \
        --secondary-worker-type=non-preemptible \
        --region=us-central1
Change the secondary worker boot disk size. By default, all secondary workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the `--secondary-worker-boot-disk-size` flag of the `gcloud dataproc clusters create` command at cluster creation. This flag can be specified even if the cluster does not have any secondary workers at creation time.
Let the Google Cloud console construct your cluster create request. You can click the Equivalent REST or command line links at the bottom of the left panel of the Dataproc Create a cluster page to have the Google Cloud console construct an equivalent API REST request or gcloud CLI command.
Update a cluster with secondary workers
To update a cluster to add or remove secondary workers, use the `gcloud dataproc clusters update` command with the `--num-secondary-workers` flag.
The following command updates "example-cluster" to use four secondary workers (of the default type or the type specified when you created the cluster).
    gcloud dataproc clusters update example-cluster \
        --num-secondary-workers=4 \
        --region=us-central1
Remove all secondary workers from a cluster
To remove all secondary workers from a cluster, use the `gcloud dataproc clusters update` command with `--num-secondary-workers` set to `0`.
The following command removes all secondary workers from "example-cluster".
    gcloud dataproc clusters update example-cluster \
        --num-secondary-workers=0 \
        --region=us-central1
REST API
Create a cluster with secondary workers
Use the Dataproc `clusters.create` API to add secondary workers to a cluster when the cluster is created. The following examples show how to create a cluster with each secondary worker type: `preemptible` (default), `spot` (preemptible), and `non-preemptible`. You can use additional fields to mix spot with non-preemptible secondary workers.
Example 1
The following POST request creates "cluster1" with two standard preemptible (default type) VM workers.
    POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters
    {
      "clusterName": "cluster1",
      "config": {
        "secondaryWorkerConfig": {
          "numInstances": 2
        }
      }
    }
Example 2
The following POST request creates "cluster2" with two spot (preemptible) VM workers.
    POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters
    {
      "clusterName": "cluster2",
      "config": {
        "secondaryWorkerConfig": {
          "numInstances": 2,
          "preemptibility": "SPOT"
        }
      }
    }
Example 3
The following POST request creates "cluster3" with two non-preemptible secondary workers.
    POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters
    {
      "clusterName": "cluster3",
      "config": {
        "secondaryWorkerConfig": {
          "numInstances": 2,
          "preemptibility": "NON_PREEMPTIBLE"
        }
      }
    }
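The three request bodies in this section differ only in the `preemptibility` field. The following Python sketch builds each body from one helper so the difference is easy to see. The helper name `secondary_worker_cluster_body` is hypothetical, and the sketch only prints JSON; sending the request, authentication, and the project and region in the URL are out of scope.

```python
import json
from typing import Optional


def secondary_worker_cluster_body(cluster_name: str, num_instances: int,
                                  preemptibility: Optional[str] = None) -> dict:
    """Build a clusters.create request body with secondary workers.

    When preemptibility is omitted, the field is left out of the body,
    which matches the default (standard preemptible) example above.
    """
    secondary = {"numInstances": num_instances}
    if preemptibility is not None:
        secondary["preemptibility"] = preemptibility
    return {
        "clusterName": cluster_name,
        "config": {"secondaryWorkerConfig": secondary},
    }


# One body per example in this section.
for name, kind in [("cluster1", None),
                   ("cluster2", "SPOT"),
                   ("cluster3", "NON_PREEMPTIBLE")]:
    print(json.dumps(secondary_worker_cluster_body(name, 2, kind), indent=2))
```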
Update a cluster with secondary workers
Use the Dataproc clusters.patch API to add and remove secondary workers.
Example
The following PATCH request updates a cluster to have four secondary workers (of the default type or the type specified when you created the cluster).
    PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances
    {
      "config": {
        "secondaryWorkerConfig": {
          "numInstances": 4
        }
      }
    }
Let the Google Cloud console construct your cluster create request. You can click the Equivalent REST or command line links at the bottom of the left panel of the Dataproc Create a cluster page to have the Google Cloud console construct an equivalent API REST request or gcloud CLI command.
Troubleshoot secondary workers
Service account permission issues: Secondary workers are created through a managed instance group. If there is a permission issue, Dataproc logs won't report the failure to create secondary workers, but failed workers are listed under the VM Instances tab of the Cluster details page in the Google Cloud console without a green checkmark. To view the listing, open the Dataproc Clusters page, then click the cluster name to open the Cluster details page for the cluster.
Managed instance group permissions issues: To check if there is an issue with managed instance group permissions:
- Find the name of the managed instance group (`instanceGroupManagerName`).
  Console
  - Open the Dataproc Clusters page, then click the cluster name to open the Cluster details page for the cluster.
  - Click Equivalent REST at the bottom of the page, then view the `config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName` value.
  Google Cloud CLI
  Run the `gcloud dataproc clusters describe` command with the `--format` flag to display the `instanceGroupManagerName`.
      gcloud dataproc clusters describe CLUSTER_NAME \
          --region=REGION \
          --format='value(config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName)'
  REST API
  Submit a `clusters.get` request to return the value of `config.secondaryWorkerConfig.managedGroupConfig.instanceGroupManagerName`.
- View the logs in Logs Explorer.
  Select the Google Compute Engine Instance Group resource type, and filter for the managed instance group name. Alternatively, you can apply a logging filter for `resource.type="gce_instance_group"` and `resource.labels.instance_group_name=INSTANCE_GROUP_MANAGER_NAME`.
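If you look up managed instance group logs often, the logging filter from the steps above can be assembled in code. The following Python sketch is illustrative: the function name `mig_log_filter` and the sample group name are hypothetical, and joining the two conditions with `AND` and quoting the group name are assumptions about how you want to write the filter.

```python
def mig_log_filter(instance_group_manager_name: str) -> str:
    """Build a Logs Explorer filter for one managed instance group."""
    if not instance_group_manager_name:
        raise ValueError("instance group manager name must not be empty")
    return (
        'resource.type="gce_instance_group" AND '
        f'resource.labels.instance_group_name="{instance_group_manager_name}"'
    )


# "example-mig-name" is a placeholder; use the instanceGroupManagerName
# value returned for your cluster.
print(mig_log_filter("example-mig-name"))
```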
Mix spot with non-preemptible secondary workers
You can specify a mix of spot and non-preemptible secondary workers when you create a Dataproc cluster.
Secondary worker settings to mix spot with non-preemptible secondary workers
Use the following secondary worker settings when you create a Dataproc cluster to obtain a minimum level of secondary worker capacity with the ability to increase capacity when spot VMs are available:
- secondary worker number: The total number of secondary workers to provision.
- secondary worker type: `spot` is the secondary worker type when mixing spot with non-preemptible secondary workers.
- standardCapacityBase: The number of non-preemptible (standard) secondary workers to provision. Non-preemptible secondary workers are provisioned before other types of secondary workers.
- standardCapacityPercentAboveBase: After the `standardCapacityBase` number of secondary workers is filled, the remaining number of secondary workers needed to meet the total number of secondary workers requested is filled with a mix of non-preemptible and spot VMs as follows:
  - `standardCapacityPercentAboveBase`: The percent of the remaining secondary workers to fill with non-preemptible VMs.
  - The remaining number needed to meet the total number of requested secondary workers is filled with spot VMs.
Example:
- Number of secondary workers: 15
- standardCapacityBase: 5
- standardCapacityPercentAboveBase: 30%
Result:
- Non-preemptible: 8 = 5 (standardCapacityBase) + 3 (30% of remaining 10)
- Spot: 7 (70% of remaining 10)
- Total = 15
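The arithmetic in this example can be written as a short function. The following Python sketch is a minimal illustration, not Dataproc's implementation: the function name `split_secondary_workers` is hypothetical, and rounding a fractional non-preemptible count down is an assumption, because the text above doesn't say how fractions are handled.

```python
from typing import Tuple


def split_secondary_workers(total: int, standard_capacity_base: int,
                            percent_above_base: int) -> Tuple[int, int]:
    """Return (non_preemptible, spot) counts for a mixed secondary group."""
    if total < 0 or standard_capacity_base < 0:
        raise ValueError("worker counts must not be negative")
    if not 0 <= percent_above_base <= 100:
        raise ValueError("percent_above_base must be between 0 and 100")

    # The base is filled with non-preemptible workers first.
    base = min(standard_capacity_base, total)
    remaining = total - base

    # Assumption: a fractional result is rounded down.
    extra_non_preemptible = remaining * percent_above_base // 100

    non_preemptible = base + extra_non_preemptible
    spot = total - non_preemptible
    return non_preemptible, spot


# The example above: 15 workers, base of 5, 30% above base.
print(split_secondary_workers(15, 5, 30))  # (8, 7)
```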
Create a cluster with a mix of spot and non-preemptible secondary workers
You can use the gcloud CLI or the Dataproc API to mix spot with non-preemptible secondary workers when you create a cluster.
gcloud
Run the following command locally or in Cloud Shell to create a cluster with a mix of spot and non-preemptible secondary workers.
    gcloud dataproc clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --secondary-worker-type=spot \
        --num-secondary-workers=NUMBER_SECONDARY_WORKERS \
        --secondary-worker-standard-capacity-base=STANDARD_CAPACITY_BASE \
        --secondary-worker-standard-capacity-percent-above-base=STANDARD_CAPACITY_PERCENT_ABOVE_BASE \
        OTHER_FLAGS_AS_NEEDED
Notes:
- CLUSTER_NAME: The name of the new cluster.
- PROJECT_ID: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: An available Compute Engine region to run the workload.
- `--secondary-worker-type`: When mixing spot and non-preemptible secondary workers, specify the secondary worker type as `spot`.
- STANDARD_CAPACITY_BASE and STANDARD_CAPACITY_PERCENT_ABOVE_BASE: See Secondary worker settings to mix spot with non-preemptible secondary workers.
- OTHER_FLAGS_AS_NEEDED: See gcloud dataproc clusters create.
API
To mix spot with non-preemptible secondary workers, set the Dataproc `preemptibility`, `standardCapacityBase`, and `standardCapacityPercentAboveBase` API fields as part of a `clusters.create` request, as shown in the following JSON sample:
    {
      "clusterName": "CLUSTER_NAME",
      "config": {
        "secondaryWorkerConfig": {
          "numInstances": 15,
          "preemptibility": "SPOT",
          "instanceFlexibilityPolicy": {
            "provisioningModelMix": {
              "standardCapacityBase": STANDARD_CAPACITY_BASE,
              "standardCapacityPercentAboveBase": STANDARD_CAPACITY_PERCENT_ABOVE_BASE
            }
          }
        }
      }
    }
Notes:
- CLUSTER_NAME: The name of the new cluster.
- `preemptibility`: When mixing spot and non-preemptible secondary workers, specify `SPOT`.
- STANDARD_CAPACITY_BASE and STANDARD_CAPACITY_PERCENT_ABOVE_BASE: See Secondary worker settings to mix spot with non-preemptible secondary workers.
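Because the JSON sample uses placeholders, it can help to generate a concrete body and confirm that it serializes as valid JSON. The following Python sketch does that with the field names from this section. The helper name `mixed_secondary_worker_body` and the cluster name are hypothetical, and the sketch only prints the body; it does not call the Dataproc API.

```python
import json


def mixed_secondary_worker_body(cluster_name: str, num_instances: int,
                                standard_capacity_base: int,
                                standard_capacity_percent_above_base: int) -> dict:
    """Build a clusters.create body that mixes spot and non-preemptible workers."""
    return {
        "clusterName": cluster_name,
        "config": {
            "secondaryWorkerConfig": {
                "numInstances": num_instances,
                # Mixing requires the spot secondary worker type.
                "preemptibility": "SPOT",
                "instanceFlexibilityPolicy": {
                    "provisioningModelMix": {
                        "standardCapacityBase": standard_capacity_base,
                        "standardCapacityPercentAboveBase":
                            standard_capacity_percent_above_base,
                    }
                },
            }
        },
    }


# Values taken from the 15-worker example earlier on this page.
body = mixed_secondary_worker_body("example-cluster", 15, 5, 30)
print(json.dumps(body, indent=2))
```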
Combine secondary worker mixing with flexible VMs
You can mix spot and non-preemptible secondary workers and specify flexible VM shapes for secondary workers when you create a cluster.
gcloud CLI example:
    gcloud dataproc clusters create cluster-name \
        --project=project-id \
        --region=us-central1 \
        --secondary-worker-type=spot \
        --num-secondary-workers=15 \
        --secondary-worker-standard-capacity-base=5 \
        --secondary-worker-standard-capacity-percent-above-base=30 \
        --secondary-worker-machine-types="type=n2-standard-8,rank=0" \
        --secondary-worker-machine-types="type=e2-standard-8,type=t2d-standard-8,rank=1"
        ...other flags as needed
Secondary worker mixing characteristics
This section describes some of the behavior and characteristics associated with mixing spot and non-preemptible secondary workers.
Secondary worker preference
Dataproc doesn't give preference to either spot or non-preemptible VMs when scheduling applications on secondary workers.
Secondary worker scaling
When secondary workers are scaled through autoscaling or manual scaling, Dataproc maintains the requested spot-to-non-preemptible ratio when it adds or deletes secondary workers.
Updating secondary worker mix settings
You specify the mix of spot and non-preemptible secondary workers when you create a Dataproc cluster. You can't change the secondary worker mix settings after you create the cluster.
Spot secondary worker preemption
- Dataproc doesn't control the timing of spot VM preemption (see Preemption of Spot VMs).
- When spot preemption occurs, the secondary worker group can run with reduced capacity temporarily until Compute Engine reprovisions the preempted VMs.
- Dataproc won't add capacity to a secondary worker group in excess of the group's initial settings.