Secondary workers - preemptible and non-preemptible VMs

In addition to using standard Compute Engine VMs as Dataproc workers (called "primary" workers), Dataproc clusters can use "secondary" workers.

The following characteristics apply to all secondary workers in a Dataproc cluster:

  • Processing only—Secondary workers do not store data. They only function as processing nodes. Therefore, you can use secondary workers to scale compute without scaling storage.

  • No secondary-worker-only clusters—Your cluster must have primary workers. If you create a cluster and you do not specify the number of primary workers, Dataproc adds two primary workers to the cluster.

  • Machine type—Secondary workers use the machine type of the cluster's primary workers. For example, if you create a cluster with primary workers that use n1-standard-4 machine types, all secondary workers added to the cluster will also use n1-standard-4 machines.

  • Persistent disk size—By default, secondary workers are created with a boot disk that is the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the --secondary-worker-boot-disk-size flag of the gcloud dataproc clusters create command at cluster creation. You can specify this flag even if the cluster will not have secondary workers when it is created.
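For example, the following command overrides the default secondary worker boot disk size at cluster creation (the cluster name, region, and 200GB value are illustrative):

gcloud dataproc clusters create my-test-cluster \
    --secondary-worker-boot-disk-size=200GB \
    --region=us-central1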

Preemptible and non-preemptible secondary workers

There are two types of secondary workers: preemptible and non-preemptible. All secondary workers in your cluster must be of the same type, either preemptible or non-preemptible. The default is preemptible.

  • Preemptible secondary workers: Preemptible workers are the default secondary worker type. They are reclaimed and removed from the cluster if they are required by Google Cloud for other tasks. Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost (you can use the Google Cloud pricing calculator to estimate costs).

    • For best results, the number of preemptible workers in your cluster should be less than 50% of the total number of all workers (primary plus all secondary workers) in your cluster.

    • When using preemptible workers, your jobs will most likely experience a greater number of transient single-worker task failures compared to jobs run on non-preemptible workers. To increase job tolerance to low-level task failures, you can set cluster property values similar to the default property values used with autoscaling clusters to increase the maximum number of task retries and help avoid job failures.

  • Non-preemptible secondary workers: You can create a cluster with non-preemptible secondary workers to scale compute without sacrificing job stability. You must specify "non-preemptible" as the secondary worker type ("preemptible" is the default).
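To increase tolerance to preemption-related task failures, you can pass retry-related cluster properties at creation time. The following command is a sketch: the property names and values shown mirror the defaults that Dataproc applies to autoscaling clusters, but you should verify current names and values against the Dataproc cluster properties documentation:

gcloud dataproc clusters create my-test-cluster \
    --num-secondary-workers=2 \
    --region=us-central1 \
    --properties="yarn:yarn.resourcemanager.am.max-attempts=10,mapred:mapreduce.map.maxattempts=10,mapred:mapreduce.reduce.maxattempts=10,spark:spark.task.maxFailures=10,spark:spark.stage.maxConsecutiveAttempts=10"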

Using secondary workers

You specify the number and type of secondary workers (preemptible or non-preemptible) when you create a cluster with a Dataproc API request, with the Cloud SDK gcloud command-line tool, or from the Google Cloud Console.

  • A cluster can contain either preemptible secondary workers or non-preemptible secondary workers, but not both.
  • You can update your cluster after it is created to change the number, but not the type, of secondary workers in your cluster.
  • Label updates propagate to all preemptible secondary workers within 24 hours. Currently, label updates do not propagate to existing non-preemptible secondary workers. Label updates propagate to all workers added to a cluster after a label update. For example, if you scale up the cluster, all new primary and secondary workers will have the new labels.

gcloud command

Use the gcloud dataproc clusters create command to add secondary workers to a cluster when the cluster is created. After a cluster is created, you can add or remove secondary workers to or from the cluster with the gcloud dataproc clusters update command (the number but not type of secondary workers can be updated).

Creating a cluster with secondary workers

To create a cluster with secondary workers, use the gcloud dataproc clusters create command with the --num-secondary-workers argument. Note that secondary workers are preemptible by default, but you can add non-preemptible secondary workers when you create a cluster by setting --secondary-worker-type=non-preemptible (see Example 2).

Example 1

The following command creates a cluster named "my-test-cluster" with two preemptible secondary workers.

gcloud dataproc clusters create my-test-cluster \
    --num-secondary-workers=2 \
    --region=us-central1

Example 2

The following command uses the --secondary-worker-type flag to create a cluster named "my-test-cluster" with two non-preemptible secondary workers.

gcloud dataproc clusters create my-test-cluster \
    --num-secondary-workers=2 \
    --secondary-worker-type=non-preemptible \
    --region=us-central1

Updating a cluster with secondary workers

To update a cluster to add or remove secondary workers, use the gcloud dataproc clusters update command with the --num-secondary-workers argument.

Example

The following command updates a cluster named "my-test-cluster" to use two secondary workers.

gcloud dataproc clusters update my-test-cluster \
    --num-secondary-workers=2 \
    --region=us-central1

Removing all secondary workers from a cluster

To remove all secondary workers from a cluster, use the gcloud dataproc clusters update command with --num-secondary-workers set to 0.

Example

The following command removes all secondary workers from a cluster.

gcloud dataproc clusters update my-test-cluster \
    --num-secondary-workers=0 \
    --region=us-central1

REST API

Creating a cluster with secondary workers

Use the Dataproc clusters.create API to add secondary workers to a cluster when the cluster is created. Note that secondary workers are preemptible by default, but you can add non-preemptible secondary workers to your clusters as shown in Example 2.

Example 1

The following POST request creates a cluster with two preemptible secondary workers.


POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster-name",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2
    }
  }
}

Example 2

The following POST request creates a cluster with two non-preemptible secondary workers.


POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster-name",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2,
      "preemptibility": "NON_PREEMPTIBLE"
    }
  }
}

Updating a cluster with secondary workers

Use the Dataproc clusters.patch API to add or remove secondary workers.

Example

The following PATCH request updates a cluster to have two secondary workers.


PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances
{
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2
    }
  }
}
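Similarly, to remove all secondary workers, set numInstances to 0 in the PATCH request (a sketch following the same pattern as the update example above):

PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances
{
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 0
    }
  }
}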

Console

You can specify the number of secondary workers when creating a Dataproc cluster from the Cloud Console. After a cluster has been created, you can add and remove secondary workers by editing the cluster configuration from the Cloud Console.

Creating a cluster with secondary workers

You can set the number and type of secondary workers for a new cluster from the Secondary worker nodes section of the Configure nodes panel on the Dataproc Create a cluster page of the Cloud Console: specify the number of secondary workers in the Secondary worker nodes field and the worker type in the Preemptibility field.

Updating a cluster with secondary instances

To update the number of secondary workers in a cluster, click the name of the cluster on the Clusters page of the Cloud Console. On the Cluster details page, click the CONFIGURATION tab, then click EDIT and update the number in the Secondary worker nodes field.

Removing all secondary instances from a cluster

To remove all secondary workers from a cluster, update the cluster configuration as explained above, specifying 0 in the Secondary worker nodes field.