Preemptible VMs

In addition to using standard Compute Engine virtual machines (VMs), Dataproc clusters can use preemptible VM instances, also known as preemptible VMs. Preemptible workers are reclaimed—removed from the cluster—if they are required by Google Cloud for other tasks. Although the potential removal of preemptible workers can affect job stability, you may decide to use preemptible instances to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost. See the Dataproc pricing documentation for more information.

How preemptibles work with Dataproc

All secondary workers added to a cluster use the machine type of the cluster's primary worker nodes. For example, if you create a cluster with primary workers that use n1-standard-4 machine types, all secondary workers added to the cluster will also use n1-standard-4 machines.
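For example, if you create a cluster with a command like the following (a sketch; the cluster name is illustrative), the primary workers run on n1-standard-4 machines, so any secondary workers added to the cluster also run on n1-standard-4 machines:

# Sketch: secondary workers inherit the n1-standard-4 machine type of the primary workers.
gcloud dataproc clusters create example-cluster \
    --worker-machine-type n1-standard-4 \
    --num-workers 2 \
    --num-secondary-workers 2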

Preemptible workers are reclaimed if they are needed by Google Cloud for other tasks. They are added back to the cluster if and when capacity permits. For example, if two preemptible machines are reclaimed and removed from a cluster, these instances will be added back to the cluster if and when capacity is available to add them.

The following rules apply to all secondary workers in a Dataproc cluster:

  • Processing only— Secondary workers do not store data. They only function as processing nodes.

  • No secondary-worker-only clusters— Your cluster must have primary workers. If you create a cluster and you do not specify the number of primary workers, Dataproc adds two primary workers to the cluster.

  • Persistent disk size—As a default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS. You can override the default disk size with the --secondary-worker-boot-disk-size flag of the gcloud dataproc clusters create command at cluster creation (see the example that follows this list). You can specify this flag even if the cluster will not have secondary workers when it is created.
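For example, the following command (a sketch; the cluster name and disk size are illustrative) creates a cluster whose secondary workers use a 200GB boot disk instead of the default:

# Illustrative values; adjust the cluster name and disk size as needed.
gcloud dataproc clusters create example-cluster \
    --num-secondary-workers 2 \
    --secondary-worker-boot-disk-size 200GB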

Using preemptibles in a cluster

You can specify the number and type of secondary workers when you create a cluster with a Dataproc API request, with the Google Cloud SDK gcloud command-line tool, or from the Google Cloud Console.

gcloud command

Use the gcloud dataproc clusters create command to add preemptible instances to a cluster when the cluster is created. After a cluster is created, you can add or remove preemptible instances to or from the cluster with the gcloud dataproc clusters update command.

Creating a cluster with preemptible workers

To create a cluster with preemptible workers, use the gcloud dataproc clusters create command with the --num-secondary-workers argument.

Example

The following command creates a cluster named "my-test-cluster" with two preemptible workers.

gcloud dataproc clusters create my-test-cluster --num-secondary-workers 2
Waiting on operation [operations/projects/project-id/operations/...].
clusterName: my-test-cluster
  ...
secondaryWorkerConfiguration:
  instanceNames:
  - dataproc-1-sw-2skd
  - dataproc-1-sw-l20p
  isPreemptible: true
...

Updating a cluster with secondary workers

To update a cluster to add or remove secondary workers, use the gcloud dataproc clusters update command with the --num-secondary-workers argument.

Example

The following command updates a cluster named "my-test-cluster" to use two secondary workers.

gcloud dataproc clusters update my-test-cluster --num-secondary-workers 2
Waiting on operation [operations/projects/project-id/operations/...].
Waiting for cluster update operation...done.
Updated [https://dataproc.googleapis.com/...].
clusterName: my-test-cluster
  ...
secondaryWorkerConfiguration:
  instanceNames:
  - dataproc-1-sw-2skd
  - dataproc-1-sw-l20p
  isPreemptible: true
...

Removing all secondary workers from a cluster

To remove all secondary workers from a cluster, use the gcloud dataproc clusters update command with --num-secondary-workers set to 0.

Example

The following command removes all secondary workers from a cluster.

gcloud dataproc clusters update cluster-name --num-secondary-workers 0

REST API

Creating a cluster with preemptible workers

Use the Dataproc clusters.create API to add preemptible workers to a cluster when the cluster is created.

Example

The following POST request creates a cluster with two preemptible workers.


POST https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters

{
  "clusterName": "cluster-name",
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2
    }
  }
}
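
One way to send this request is with curl, using an access token from the gcloud tool for authentication; the project ID, region, and cluster name below are placeholders:

# Sketch: send the clusters.create request with curl, authenticating with a gcloud access token.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "clusterName": "cluster-name",
        "config": {
          "secondaryWorkerConfig": {
            "numInstances": 2
          }
        }
      }' \
  "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters"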

Updating a cluster with preemptible instances

Use the Dataproc clusters.patch API to add or remove secondary workers.

Example

The following PATCH request updates a cluster to have two secondary workers.


PATCH /v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances
{
  "config": {
    "secondaryWorkerConfig": {
      "numInstances": 2
    }
  }
}
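
The same pattern works for the PATCH request (again a sketch with placeholder project, region, and cluster values); note that updateMask is passed as a query parameter:

# Sketch: send the clusters.patch request with curl; updateMask is a query parameter.
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "config": {
          "secondaryWorkerConfig": {
            "numInstances": 2
          }
        }
      }' \
  "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/clusters/cluster-name?updateMask=config.secondary_worker_config.num_instances"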

Console

You can specify the number of preemptible workers when creating a Dataproc cluster from the Cloud Console. After a cluster has been created, you can add and remove preemptible workers by editing the cluster configuration from the Cloud Console.

Creating a cluster with preemptible instances

Open the expandable panel titled "Preemptible workers, bucket, network, version, initialization, & access options" on the Dataproc Create a cluster page in the Cloud Console.

Add preemptible workers to the new cluster by specifying a positive number in the Nodes field.

Updating a cluster with preemptible instances

After a cluster is created, you can edit the number of preemptible workers in a cluster by clicking the Edit button on the Configuration tab on the Cluster details page.

To change the number of preemptible workers, specify a new value in the Preemptible worker nodes field.

Removing all preemptible instances from a cluster

To remove all preemptible instances from a cluster, update the cluster configuration as explained above, specifying 0 in the Preemptible worker nodes field.