Spot VMs

This page provides an overview of support for Spot VMs in Google Kubernetes Engine (GKE).

Overview

Spot VMs are Compute Engine virtual machine (VM) instances that are priced lower than on-demand Compute Engine VMs. Spot VMs offer the same machine types and options as on-demand VMs, but provide no availability guarantees.

You can use Spot VMs in your clusters and node pools to run stateless, batch, or fault-tolerant workloads that can tolerate disruptions caused by the ephemeral nature of Spot VMs.

Spot VMs remain available until Compute Engine requires the resources for on-demand VMs. To maximize your cost efficiency, combine Spot VMs with the practices in Best practices for running cost-optimized Kubernetes applications on GKE.

To learn more about Spot VMs, see Spot VMs in the Compute Engine documentation.

Benefits

Spot VMs and preemptible VMs share many benefits, including the following:

  • Lower pricing than on-demand Compute Engine VMs.
  • Useful for stateless, fault-tolerant workloads that are resilient to the ephemeral nature of these VMs.
  • Support for the cluster autoscaler and node auto-provisioning.

In contrast to preemptible VMs, which expire after 24 hours, Spot VMs have no expiration time. Spot VMs are only terminated when Compute Engine needs the resources elsewhere.

How Spot VMs work in GKE

When you create a cluster or node pool with Spot VMs, GKE creates underlying Compute Engine Spot VMs that behave like a managed instance group (MIG). Nodes that use Spot VMs behave like on-demand GKE nodes, but with no guarantee of availability. When the resources used by Spot VMs are required to run on-demand VMs, Compute Engine terminates those Spot VMs to use the resources elsewhere.

Termination and graceful shutdown of Spot VMs

When Compute Engine needs to reclaim the resources used by Spot VMs, a termination notice is sent to GKE. Spot VMs terminate 30 seconds after receiving a termination notice.

On clusters running GKE version 1.20 and later, the kubelet graceful node shutdown feature is enabled by default. The kubelet detects the termination notice and gracefully terminates Pods that are running on the node.

The kubelet grants non-system Pods 25 seconds to gracefully terminate, after which system Pods (with the system-cluster-critical or system-node-critical priority classes) have 5 seconds to gracefully terminate.
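Because non-system Pods get at most 25 seconds to shut down, it helps to keep each Pod's terminationGracePeriodSeconds at or below that window so cleanup logic isn't cut off mid-run. The following is a minimal sketch; the Pod name, image, and preStop command are placeholders, not values from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker  # example name
spec:
  terminationGracePeriodSeconds: 25  # stay within the 25-second window on Spot VMs
  containers:
  - name: worker
    image: us-docker.pkg.dev/my-project/my-repo/worker:latest  # placeholder image
    lifecycle:
      preStop:
        exec:
          # Placeholder cleanup hook; it must finish well within the grace period.
          command: ["/bin/sh", "-c", "touch /tmp/draining; sleep 5"]
```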

During graceful Pod termination, the kubelet assigns a Failed status and a Shutdown reason to the terminated Pods. When the number of terminated Pods reaches a threshold, garbage collection cleans up the Pods.

You can also delete shutdown Pods manually using the following command:

kubectl get pods --all-namespaces | grep -i shutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n

Scheduling workloads on Spot VMs

GKE automatically adds the cloud.google.com/gke-spot=true label to nodes that use Spot VMs. You can schedule specific Pods on nodes that use Spot VMs using the nodeSelector field in your Pod spec, like in the following example:

apiVersion: v1
kind: Pod
metadata:
  name: spot-pod  # example name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9  # placeholder image

Alternatively, you can use node affinity to tell GKE to schedule Pods on Spot VMs, similar to the following example:

apiVersion: v1
kind: Pod
spec:
...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-spot
            operator: In
            values:
            - "true"
...

You can also use nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution to prefer that GKE places Pods on nodes that use Spot VMs. Preferring Spot VMs is not recommended, because GKE might schedule the Pods onto existing viable nodes that use on-demand VMs instead.

Using taints and tolerations for scheduling

To avoid system disruptions, use a node taint to ensure that GKE doesn't schedule critical workloads onto Spot VMs. When you taint nodes that use Spot VMs, GKE only schedules Pods that have the corresponding toleration onto those nodes.

If you use node taints, ensure that your cluster also has at least one node pool that uses on-demand Compute Engine VMs. Node pools that use on-demand VMs provide a reliable place for GKE to schedule critical system components like DNS.

For information on using a node taint for Spot VMs, see Use taints and tolerations for Spot VMs.
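The referenced page covers the details. As a rough sketch, assuming the node pool is tainted with cloud.google.com/gke-spot=true:NoSchedule, a Pod that should be allowed onto those nodes would carry a matching toleration like the following (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-tolerant-batch  # example name
spec:
  tolerations:
  - key: cloud.google.com/gke-spot  # matches the taint on the Spot VM node pool
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: batch
    image: registry.k8s.io/pause:3.9  # placeholder image
```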

Using Spot VMs with GPU node pools

Spot VMs support using GPUs. When you create a new GPU node pool, GKE automatically adds the nvidia.com/gpu=present:NoSchedule taint to the new nodes. Only Pods with the corresponding toleration can run on these nodes. GKE automatically adds this toleration to Pods that request GPUs.
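As an illustration of the behavior described above, a Pod that requests a GPU only needs the resource request; GKE adds the nvidia.com/gpu toleration for it automatically. Combining the request with the gke-spot node selector targets GPU nodes on Spot VMs (the Pod name and image below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-spot-pod  # example name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1  # GKE adds the nvidia.com/gpu toleration for this request
```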

Your cluster must have at least one existing non-GPU node pool that uses on-demand VMs before you create a GPU node pool that uses Spot VMs. If your cluster only has a GPU node pool with Spot VMs, GKE doesn't add the nvidia.com/gpu=present:NoSchedule taint to those nodes. As a result, GKE might schedule system workloads onto the GPU node pools with Spot VMs, which can lead to disruptions because of the Spot VMs and can increase your resource consumption because GPU nodes are more expensive than non-GPU nodes.

Cluster autoscaler and node auto-provisioning

You can use the cluster autoscaler and node auto-provisioning to automatically scale your clusters and node pools based on the demands of your workloads. Both the cluster autoscaler and node auto-provisioning support using Spot VMs.

Spot VMs and node auto-provisioning

Node auto-provisioning automatically creates and deletes node pools in your cluster to meet the demands of your workloads. When node auto-provisioning creates new node pools to accommodate Pods that require Spot VMs, GKE automatically adds the cloud.google.com/gke-spot=true:NoSchedule taint to nodes in the new node pools. Only Pods with the corresponding toleration can run on nodes in those node pools. You must add the corresponding toleration to your deployments to allow GKE to place the Pods on Spot VMs.

You can ensure that GKE only schedules your Pods on Spot VMs by using both a toleration and either a nodeSelector or node affinity rule to filter for Spot VMs.
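A sketch of a Deployment that combines both, so that auto-provisioned Spot VM node pools accept the Pods and the Pods land only on Spot VMs (the Deployment name, labels, and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-only-workload  # example name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spot-only-workload
  template:
    metadata:
      labels:
        app: spot-only-workload
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # restrict scheduling to Spot VM nodes
      tolerations:
      - key: cloud.google.com/gke-spot     # tolerate the taint that node auto-provisioning adds
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
```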

You can also use only a toleration, without a nodeSelector or node affinity rule that filters for Spot VMs. In this case, GKE attempts to schedule the Pods on Spot VMs. If no Spot VMs are available but existing on-demand VMs have capacity, GKE schedules the Pods onto the on-demand VMs instead.

Spot VMs and cluster autoscaler

The cluster autoscaler automatically adds and removes nodes in your node pools based on demand. If your cluster has Pods that can't be placed on existing Spot VMs, the cluster autoscaler adds new nodes that use Spot VMs.

Modifications to Kubernetes behavior

Using Spot VMs on GKE modifies some guarantees and constraints that Kubernetes provides, such as the following:

  • On clusters running GKE versions prior to 1.20, the kubelet graceful node shutdown feature is disabled by default. GKE shuts down Spot VMs with no grace period for Pods, 30 seconds after Compute Engine sends the termination notice.

  • Reclamation of Spot VMs is involuntary and is not covered by the guarantees of PodDisruptionBudgets. You might experience greater unavailability than your configured PodDisruptionBudget.

Best practices

When designing a system that uses Spot VMs, you can avoid major disruptions by using the following guidelines:

  • Spot VMs have no availability guarantees. Design your systems under the assumption that GKE might reclaim any or all of your Spot VMs at any time, with no guarantee of when new instances become available.
  • There is no guarantee that Pods running on Spot VMs will shut down gracefully. GKE might not notice that the node was reclaimed until a few minutes after reclamation occurs, which delays the rescheduling of those Pods onto a new node.
  • To ensure that your workloads and Jobs are processed even when no Spot VMs are available, ensure that your clusters have a mix of node pools that use Spot VMs and node pools that use on-demand Compute Engine VMs.
  • Ensure that your cluster has at least one non-GPU node pool that uses on-demand VMs before you add a GPU node pool that uses Spot VMs.
  • Use the Kubernetes on GCP Node Termination Event Handler on clusters running GKE versions prior to 1.20, where the kubelet graceful node shutdown feature is disabled. The handler gracefully terminates your Pods when Spot VMs are preempted.
  • While the node names do not usually change when nodes are recreated, the internal and external IP addresses used by Spot VMs might change after recreation.
  • Use node taints and tolerations to ensure that critical Pods aren't scheduled onto node pools that use Spot VMs.
  • Do not use stateful Pods with Spot VMs. StatefulSets provide at-most-one-Pod-per-index semantics, which the preemption of Spot VMs can violate, leading to data loss.
  • Follow the Kubernetes Pod termination best practices.

What's next