Dataproc secondary workers

In addition to using standard Compute Engine VMs as Dataproc workers (called "primary" workers), Dataproc clusters can use secondary workers.

The following characteristics apply to all secondary workers in a Dataproc cluster:

  • Processing only—Secondary workers do not store data. They only function as processing nodes. Therefore, you can use secondary workers to scale compute without scaling storage.

  • No secondary-worker-only clusters—Your cluster must have primary workers. If you create a cluster and you do not specify the number of primary workers, Dataproc adds two primary workers to the cluster.

  • Machine type—By default, secondary workers use the machine type of the cluster's primary workers. For example, if you create a cluster with primary workers that use n1-standard-4 machine types, all secondary workers added to the cluster will also use n1-standard-4 machines.

    Instead of using the default primary worker machine type for secondary workers, you can specify one or more ranked lists of machine types for secondary workers. See Dataproc Flexible VMs for more information.

  • Persistent disk size—By default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the --secondary-worker-boot-disk-size flag of the gcloud dataproc clusters create command at cluster creation (see the example after this list). You can specify this flag even if the cluster won't have secondary workers when it is created.

  • Asynchronous creation—When you add secondary workers by creating or scaling up a cluster, the secondary workers may not be provisioned by the time the create or update operation finishes. This is because Dataproc manages secondary workers using managed instance groups (MIGs), which create VMs asynchronously as soon as they can be provisioned (see Checking the status of managed instances).
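
For example, the following sketch overrides the default secondary worker boot disk size at cluster creation; the cluster name, region, worker count, and 200GB size are illustrative placeholders, not recommendations:

# Sketch: give each secondary worker a 200GB boot disk instead of the default size.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-secondary-workers=2 \
    --secondary-worker-boot-disk-size=200GB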

Preemptible and non-preemptible secondary workers

There are three types of secondary workers: spot VMs, standard preemptible VMs, and non-preemptible VMs. All of the secondary workers that you specify for your cluster must be the same type. The default Dataproc secondary worker type is the standard preemptible VM.

Example: If you select three secondary workers when you create a cluster, you can specify that all three will be spot VMs, all three will be (standard) preemptible VMs, or all three will be non-preemptible VMs, but you cannot specify that each will be a different type.

A spot VM is the latest type of Compute Engine preemptible VM. It shares the lower-cost pricing model of standard preemptible VMs, but unlike standard preemptible VMs, which have a 24-hour maximum lifetime, spot VMs have no maximum lifetime. Both spot and standard preemptible VM workers are reclaimed and removed from a Dataproc cluster if Google Cloud requires them for other tasks.

Preemptible workers

  • Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost (you can use the Google Cloud pricing calculator to estimate costs).

  • For best results, the number of preemptible workers in your cluster should be less than 50% of the total number of all workers (primary plus all secondary workers) in your cluster.

  • When using preemptible workers, your jobs will most likely experience a greater number of transient single-worker task failures compared to jobs run on non-preemptible workers. To increase job tolerance to low-level task failures, you can set cluster property values similar to the default property values used with autoscaling clusters to increase the maximum number of task retries and help avoid job failures (see the example after this list).

  • A cost-savings consideration: Using preemptible VMs does not always save costs since preemptions can cause longer job execution with resulting higher job costs. Although using Enhanced Flexibility Mode (EFM) with preemptible VMs can help mitigate this result, the overall cost savings of preemptible VMs will vary with each use case. Generally, short-lived jobs are more suitable for preemptible VM use, since the probability of preemptions during job execution will be lower. Try different job options, such as no preemptible VMs and preemptible VMs with EFM, to estimate costs and arrive at the best solution.
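
As an illustrative sketch, the following command raises retry-related cluster properties to values similar to the autoscaling cluster defaults; the specific property names and the value of 10 are assumptions to verify against the cluster properties documentation for your image version and workload:

# Sketch: increase task and application retry limits to better tolerate preemptions.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-secondary-workers=2 \
    --properties="yarn:yarn.resourcemanager.am.max-attempts=10,mapred:mapreduce.map.maxattempts=10,mapred:mapreduce.reduce.maxattempts=10,spark:spark.task.maxFailures=10,spark:spark.stage.maxConsecutiveAttempts=10"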

Non-preemptible workers

  • You can create a cluster with non-preemptible secondary workers to scale compute without sacrificing job stability. To do this, specify "non-preemptible" as the secondary worker type.

Using secondary workers

You can specify the number and type of secondary workers when you create a cluster using the Google Cloud console, the gcloud CLI, or the Dataproc API.

  • All secondary workers in a cluster must be the same type.
  • You can update your cluster after it is created to change the number, but not the type, of secondary workers in your cluster.
  • Label updates propagate to all preemptible secondary workers within 24 hours, but they do not propagate to existing non-preemptible secondary workers. Label updates do propagate to all workers added to a cluster after the update: for example, if you scale up the cluster, all new primary and secondary workers will have the new labels (see the example after this list).
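
For example, the following sketch updates a label on an existing cluster; the cluster name, region, and label key and value are placeholders:

# Sketch: add or update a cluster label. Per the propagation rules above, the label
# reaches preemptible secondary workers within 24 hours, but existing non-preemptible
# secondary workers keep their current labels.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --update-labels=environment=dev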

Console

You can specify the number of secondary workers when creating a Dataproc cluster from the Google Cloud console. After a cluster has been created, you can add and remove secondary workers by editing the cluster configuration from the Google Cloud console.

Creating a cluster with secondary workers

You can set the number and type of secondary workers to apply to a new cluster from the Secondary worker nodes section of the Configure nodes panel on the Dataproc Create a cluster page of the Google Cloud console. Specify the number and type of secondary workers in the Secondary worker nodes and Preemptibility fields, respectively.

Updating a cluster with secondary instances

To update the number of secondary workers in a cluster, click the name of the cluster on the Clusters page of the Google Cloud console. On the Cluster details page, click the CONFIGURATION tab, then click EDIT and update the number in the Secondary worker nodes field.

Removing all secondary instances from a cluster

To remove all secondary workers from a cluster, update the cluster configuration as explained earlier, specifying 0 in the Secondary worker nodes field.

gcloud command

Use the gcloud dataproc clusters create command to add secondary workers to a cluster when the cluster is created. After a cluster is created, you can add or remove secondary workers with the gcloud dataproc clusters update command (you can update the number, but not the type, of secondary workers).

Creating a cluster with secondary workers

To create a cluster with secondary workers, use the gcloud dataproc clusters create command with the --num-secondary-workers argument. Note that secondary workers are standard preemptible VMs by default. You can specify non-preemptible secondary workers when you create a cluster by setting --secondary-worker-type=non-preemptible (the dataproc:secondary-workers.is-preemptible.override property is no longer used to specify the type of the secondary worker).

Example 1

The following command creates "cluster1" with two standard preemptible (default type) secondary workers.

gcloud dataproc clusters create cluster1 \
    --num-secondary-workers=2 \
    --region=us-central1

Example 2

The following command uses the --secondary-worker-type flag to create "cluster2" with two spot (preemptible) secondary workers.

gcloud dataproc clusters create cluster2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=spot \
    --region=us-central1

Example 3

The following command uses the --secondary-worker-type flag to create "cluster3" with two non-preemptible secondary workers.

gcloud dataproc clusters create cluster3 \
    --num-secondary-workers=2 \
    --secondary-worker-type=non-preemptible \
    --region=us-central1

Updating a cluster with secondary workers

To update a cluster to add or remove secondary workers, use the gcloud dataproc clusters update command with the --num-secondary-workers argument.

Example

The following command updates "example-cluster" to use four secondary workers (of the default type or type specified when you created the cluster).

gcloud dataproc clusters update example-cluster \
    --num-secondary-workers=4 \
    --region=us-central1

Removing all secondary workers from a cluster

To remove all secondary workers from a cluster, use the gcloud dataproc clusters update command with --num-secondary-workers set to 0.

Example

The following command removes all secondary workers from "example-cluster".

gcloud dataproc clusters update example-cluster \
    --num-secondary-workers=0 \
    --region=us-central1

REST API

Creating a cluster with secondary workers

Use the Dataproc clusters.create API to add secondary workers to a cluster when the cluster is created. Note that secondary workers are standard preemptible VMs by default.

Example 1

The following POST request creates "cluster1" with two standard preemptible (default type) secondary workers.

POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster1",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2
    }
  }
}

Example 2

The following POST request creates "cluster2" with two spot (preemptible) secondary workers.

POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster2",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2,
      "preemptibility": "SPOT"
    }
  }
}

Example 3

The following POST request creates "cluster3" with two non-preemptible secondary workers.

POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster3",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2,
      "preemptibility": "NON_PREEMPTIBLE"
    }
  }
}

Updating a cluster with secondary workers

Use the Dataproc clusters.patch API to add and remove secondary workers.

Example

The following PATCH request updates a cluster to have four secondary workers (of the default type or type specified when you created the cluster).

PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances
{
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 4
    }
  }
}

Troubleshooting secondary workers

Service account permission issues: Secondary workers are created through a managed instance group and Compute Engine uses your project's Google APIs Service Agent service account to perform managed instance group operations. This service account name is formatted as follows: project-id@cloudservices.gserviceaccount.com.

If there is a permission issue with this service account, Dataproc logs will not report the failure to create secondary workers, but failed workers will be listed under the VM INSTANCES tab of the Cluster details page in the Google Cloud console without a green checkmark (open the Dataproc Clusters page, then click the cluster name to open the Cluster details page for the cluster).
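
To check which roles this service account holds, one approach (a sketch; project-id is a placeholder) is to list the project's IAM bindings for that account:

# Sketch: list the roles granted to the Google APIs Service Agent.
gcloud projects get-iam-policy project-id \
    --flatten="bindings[].members" \
    --filter="bindings.members:project-id@cloudservices.gserviceaccount.com" \
    --format="table(bindings.role)"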

  • Managed instance group permission issues: To check whether there is an issue with managed instance group permissions, view the logs in Logs Explorer for the "Google Compute Engine Instance Group" resource type, and filter for the corresponding instance group ID. The instance group ID filter displays the instance group name in the format dataproc-CLUSTER NAME-sw, and the instance group ID is auto-populated in the logging query. Instead of using the dropdown filters, you can also apply a logging filter for resource.type="gce_instance_group" and resource.labels.instance_group_name="dataproc-CLUSTER NAME-sw" (see the example after this list).

  • Custom image permission issues: If Dataproc cluster VMs are created with custom images fetched from another project, the Compute Image User role must be granted to your project's project-id@cloudservices.gserviceaccount.com service account (see Grant a managed instance group access to images). If the correct role is not assigned, this error message appears in the logs: Required 'compute.images.useReadOnly' permission for 'projects/[IMAGE PROJECT]/global/images/[IMAGE NAME]'
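
The following sketch shows both checks from the gcloud CLI; the cluster name, project IDs, and log limit are placeholders:

# Sketch: read managed instance group logs for a cluster's secondary worker group.
gcloud logging read \
    'resource.type="gce_instance_group" AND resource.labels.instance_group_name="dataproc-example-cluster-sw"' \
    --limit=50

# Sketch: grant the Compute Image User role on the image project to the
# Google APIs Service Agent so that it can use the custom image.
gcloud projects add-iam-policy-binding image-project-id \
    --member="serviceAccount:project-id@cloudservices.gserviceaccount.com" \
    --role="roles/compute.imageUser"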