Use preemptible VMs to run fault-tolerant workloads


This page shows you how to use preemptible VMs in Google Kubernetes Engine (GKE).

Overview

Preemptible VMs are Compute Engine VM instances that provide no availability guarantees and last a maximum of 24 hours after creation. Preemptible VMs offer functionality similar to Spot VMs.

Comparison to Spot VMs

Preemptible VMs share many similarities with Spot VMs, including the following:

  • Terminated when Compute Engine requires the resources to run on-demand VMs.
  • Useful for running stateless, batch, or fault-tolerant workloads.
  • Lower pricing than on-demand VMs.
  • On clusters running GKE version 1.20 and later, graceful node shutdown is enabled by default.
  • Works with the cluster autoscaler and node auto-provisioning.

In contrast to Spot VMs, which have no maximum expiration time, preemptible VMs only last for up to 24 hours after creation.

You can enable preemptible VMs on new clusters and node pools, use nodeSelector or node affinity to control scheduling, and use taints and tolerations to avoid issues with system workloads when nodes are preempted.

Termination and graceful shutdown of preemptible VMs

When Compute Engine needs to reclaim the resources used by preemptible VMs, a preemption notice is sent to GKE. Preemptible VMs terminate 30 seconds after receiving a termination notice.

On clusters running GKE version 1.20 and later, the kubelet graceful node shutdown feature is enabled by default. The kubelet detects the termination notice and gracefully terminates Pods that are running on the node.

On a best-effort basis, the kubelet grants non-system Pods 25 seconds to gracefully terminate, after which system Pods (with the system-cluster-critical or system-node-critical priority classes) have 5 seconds to gracefully terminate.
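
If your workload needs the full window to shut down cleanly, you can cap its termination grace period so that it fits within the 25 seconds available to non-system Pods. The following Deployment snippet is a minimal sketch; the name, replica count, and preStop command are hypothetical, and the image is the hello-app sample used later on this page.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: graceful-app   # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: graceful-app
  template:
    metadata:
      labels:
        app: graceful-app
    spec:
      # Keep the grace period within the 25 seconds granted to non-system Pods.
      terminationGracePeriodSeconds: 25
      containers:
      - name: graceful-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        lifecycle:
          preStop:
            exec:
              # Illustrative only: give in-flight requests a moment to drain.
              command: ["sh", "-c", "sleep 10"]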

During graceful Pod termination, the kubelet assigns a Failed status and a Shutdown reason to the terminated Pods. When the number of terminated Pods reaches a threshold, garbage collection cleans up the Pods.

You can also delete shutdown Pods manually using the following command:

kubectl get pods --all-namespaces | grep -i shutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
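
As an alternative sketch, you can remove all Pods in the Failed phase in a given namespace with a field selector. Note that this deletes every Failed Pod in that namespace, not only Pods with the Shutdown reason, and NAMESPACE is a placeholder for your namespace:

kubectl delete pods --field-selector=status.phase=Failed -n NAMESPACE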

Modifications to Kubernetes behavior

Using preemptible VMs on GKE modifies some guarantees and constraints that Kubernetes provides, such as the following:

  • On clusters running GKE versions prior to 1.20, the kubelet graceful node shutdown feature is disabled by default. GKE shuts down preemptible VMs without a grace period for Pods, 30 seconds after receiving a preemption notice from Compute Engine.

  • Reclamation of preemptible VMs is involuntary and is not covered by the guarantees of PodDisruptionBudgets. You might experience greater unavailability than your configured PodDisruptionBudget.
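
You can still define a PodDisruptionBudget to limit voluntary disruptions such as node upgrades and drains; it just does not protect your workload from preemption. The following manifest is a minimal sketch for the hello-app workload used elsewhere on this page; the name and minAvailable value are illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-app-pdb   # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: hello-app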

Limitations

Create a cluster or node pool with preemptible VMs

You can use the gcloud command-line tool or Cloud Console to create a cluster or node pool with preemptible VMs.

gcloud

To create a cluster with preemptible VMs, run the following command:

gcloud container clusters create CLUSTER_NAME \
    --preemptible

Replace CLUSTER_NAME with the name of your new cluster.
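
For example, the following command creates a three-node preemptible cluster; the cluster name, zone, and node count are placeholder values to adjust for your environment:

gcloud container clusters create example-cluster \
    --zone=us-central1-a \
    --num-nodes=3 \
    --preemptible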

To create a node pool with preemptible VMs, run the following command:

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --preemptible

Replace POOL_NAME with the name of your new node pool.
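
As an optional check, you can describe the node pool and confirm that the output includes preemptible: true under the config section:

gcloud container node-pools describe POOL_NAME \
    --cluster=CLUSTER_NAME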

Console

  1. Go to the Google Kubernetes Engine page in the Cloud Console.

    Go to Google Kubernetes Engine

  2. Click Create.

  3. Configure your cluster as desired.

  4. From the navigation pane, under Node Pools, for the node pool you want to configure, click Nodes.

  5. Select the Enable preemptible nodes checkbox.

  6. Click Create.

Use nodeSelector to schedule Pods on preemptible VMs

GKE adds the cloud.google.com/gke-preemptible=true node label to nodes that use preemptible VMs. You can use a nodeSelector in your deployments to tell GKE to schedule Pods onto preemptible VMs.

For example, the following Deployment filters for preemptible VMs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
      - name: hello-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        resources:
          requests:
            cpu: 200m
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
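
To try out the example, you could apply the manifest and confirm where the Pods are scheduled; the file name below is a placeholder:

kubectl apply -f hello-app-deployment.yaml
kubectl get pods -l app=hello-app -o wide
kubectl get nodes -l cloud.google.com/gke-preemptible=true

The NODE column in the output of the second command should only list nodes returned by the third command.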

Use node taints for preemptible VMs

You can taint nodes that use preemptible VMs so that GKE can only place Pods with the corresponding toleration on those nodes.

To add a node taint to a node pool that uses preemptible VMs, use the --node-taints flag when creating the node pool, similar to the following command:

gcloud container node-pools create POOL2_NAME \
    --cluster=CLUSTER_NAME \
    --node-taints=cloud.google.com/gke-preemptible="true":NoSchedule \
    --preemptible

Now, only Pods that tolerate the node taint are scheduled to the node.
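
As an optional verification step, you can inspect one of the nodes in the pool and check its taints; NODE_NAME is a placeholder for a node returned by the first command:

kubectl get nodes -l cloud.google.com/gke-preemptible=true -o name
kubectl describe node NODE_NAME | grep Taints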

To add the relevant toleration to your Pods, modify your deployments and add the following to your Pod specification:

tolerations:
- key: cloud.google.com/gke-preemptible
  operator: Equal
  value: "true"
  effect: NoSchedule
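
Putting the pieces together, a Deployment that runs only on the tainted preemptible nodes combines this toleration with the nodeSelector from the previous section. The following manifest is a sketch that reuses the hello-app sample; the name is hypothetical:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app-preemptible   # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-app-preemptible
  template:
    metadata:
      labels:
        app: hello-app-preemptible
    spec:
      containers:
      - name: hello-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      tolerations:
      - key: cloud.google.com/gke-preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule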

Node taints for GPU preemptible VMs

Preemptible VMs support using GPUs. You should create at least one other node pool in your cluster that doesn't use preemptible VMs before adding a GPU node pool that uses preemptible VMs. Having an on-demand node pool ensures that GKE can safely place system components like DNS.

If you create a new cluster with GPU node pools that use preemptible VMs, or if you add a new GPU node pool that uses preemptible VMs to a cluster that does not already have an on-demand node pool, GKE does not automatically add the nvidia.com/gpu=present:NoSchedule taint to the nodes. GKE might schedule system Pods onto the preemptible VMs, which can lead to disruptions. This behavior also increases your costs, because GPU nodes are more expensive than non-GPU nodes.
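
One way to avoid this, sketched below with placeholder pool names, machine type, and GPU type and count, is to make sure an on-demand node pool exists first and then add the nvidia.com/gpu=present:NoSchedule taint yourself when creating the GPU node pool:

# On-demand node pool so that GKE can place system components (placeholder name).
gcloud container node-pools create on-demand-pool \
    --cluster=CLUSTER_NAME \
    --num-nodes=1

# GPU node pool with preemptible VMs, tainted manually (placeholder values).
gcloud container node-pools create gpu-preemptible-pool \
    --cluster=CLUSTER_NAME \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --preemptible \
    --node-taints=nvidia.com/gpu=present:NoSchedule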

What's next