Spot VMs


This page explains what Spot VMs are and how they work in Google Kubernetes Engine (GKE). To learn how to use Spot VMs, refer to Use Spot VMs.

Overview of Spot VMs in GKE

Spot VMs are Compute Engine virtual machine (VM) instances that are priced lower than standard Compute Engine VMs and provide no guarantee of availability. Spot VMs offer the same machine types and options as standard VMs.

You can use Spot VMs in your clusters and node pools to run stateless, batch, or fault-tolerant workloads that can tolerate disruptions caused by the ephemeral nature of Spot VMs.

Spot VMs remain available until Compute Engine requires the resources for standard VMs. To maximize cost efficiency, combine Spot VMs with the recommendations in Best practices for running cost-optimized Kubernetes applications on GKE.

To learn more about Spot VMs, see Spot VMs in the Compute Engine documentation.

Benefits of Spot VMs

Spot VMs and preemptible VMs share many benefits, including the following:

  • Lower pricing than standard Compute Engine VMs.
  • Useful for stateless, fault-tolerant workloads that are resilient to the ephemeral nature of these VMs.
  • Works with the cluster autoscaler and node auto-provisioning.

In contrast to preemptible VMs, which expire after 24 hours, Spot VMs have no expiration time. Spot VMs are only terminated when Compute Engine needs the resources elsewhere.

How Spot VMs work in GKE

When you create a cluster or node pool with Spot VMs, GKE creates underlying Compute Engine Spot VMs that behave like a managed instance group (MIG). Nodes that use Spot VMs behave like standard GKE nodes, but with no guarantee of availability. When the resources used by Spot VMs are required to run standard VMs, Compute Engine terminates those Spot VMs to use the resources elsewhere.

Termination and graceful shutdown of Spot VMs

When Compute Engine needs to reclaim the resources used by Spot VMs, it sends a termination notice to GKE. Spot VMs terminate 30 seconds after receiving a termination notice.

On clusters running GKE version 1.20 and later, the kubelet graceful node shutdown feature is enabled by default. The kubelet detects the termination notice and gracefully terminates Pods that are running on the node. If the Pods are part of a Deployment, the controller creates and schedules new Pods to replace the terminated Pods.

On a best-effort basis, the kubelet grants the following graceful termination period, based on the GKE version of the node pool:

  • Later than 1.22.8-gke.200: 15 seconds for non-system Pods, after which system Pods (with the system-cluster-critical or system-node-critical priority classes) have 15 seconds to gracefully terminate.
  • 1.22.8-gke.200 and earlier: 25 seconds for non-system Pods, after which system Pods (with the system-cluster-critical or system-node-critical priority classes) have 5 seconds to gracefully terminate.

During graceful node termination, the kubelet updates the status of the Pods, assigning a Failed phase and a Terminated reason to the terminated Pods.
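
As a rough illustration of fitting a workload into the termination window described above, the following Pod manifest is a minimal sketch. It assumes a node pool on a GKE version later than 1.22.8-gke.200, where non-system Pods get 15 seconds on a best-effort basis. The Pod name, container name, image, and preStop script path are hypothetical placeholders, not values defined by GKE.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-graceful-example          # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  # Keep the grace period within the window the kubelet grants on a
  # best-effort basis (assumed here: 15 seconds for non-system Pods).
  terminationGracePeriodSeconds: 15
  containers:
  - name: worker                       # hypothetical container name
    image: example.com/worker:latest   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Hypothetical script that flushes in-flight work. The time it
          # takes counts against terminationGracePeriodSeconds.
          command: ["/bin/sh", "-c", "/app/flush.sh"]
```

Setting a longer terminationGracePeriodSeconds does not extend the time available during a Spot VM reclamation, so the shutdown logic itself needs to finish inside the window.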

When the number of terminated Pods reaches a threshold of 1000 for clusters with fewer than 100 nodes or 5000 for clusters with 100 nodes or more, garbage collection cleans up the Pods.

You can also delete terminated Pods manually using the following commands:

  kubectl get pods --all-namespaces | grep -i NodeShutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
  kubectl get pods --all-namespaces | grep -i Terminated | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n

Scheduling workloads on Spot VMs

GKE automatically adds the cloud.google.com/gke-spot=true label to nodes that use Spot VMs. On nodes running GKE version 1.25.5-gke.2500 or later, GKE also adds the cloud.google.com/gke-provisioning=spot label. You can schedule specific Pods on nodes that use Spot VMs by using the nodeSelector field in your Pod spec. The following example uses the cloud.google.com/gke-spot label:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"

Alternatively, you can use node affinity to tell GKE to schedule Pods on Spot VMs, similar to the following example:

apiVersion: v1
kind: Pod
spec:
...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-spot
            operator: In
            values:
            - "true"
...

You can also use nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution to prefer that GKE places Pods on nodes that use Spot VMs. Preferring Spot VMs is not recommended, because GKE might schedule the Pods onto existing viable nodes that use standard VMs instead.
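
For reference, a preferred (soft) affinity rule might look like the following sketch. The weight value of 100 is an arbitrary illustrative choice, and the fragment belongs under the spec of a Pod or Pod template. Because the rule is only a preference, the scheduler can still place the Pod on a node that uses standard VMs.

```yaml
affinity:
  nodeAffinity:
    # Soft preference: the scheduler favors matching nodes but may
    # still choose nodes that use standard VMs.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # illustrative value (valid range: 1-100)
      preference:
        matchExpressions:
        - key: cloud.google.com/gke-spot
          operator: In
          values:
          - "true"
```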

Using taints and tolerations for scheduling

To avoid system disruptions, use a node taint to ensure that GKE doesn't schedule critical workloads onto Spot VMs. When you taint nodes that use Spot VMs, GKE only schedules Pods that have the corresponding toleration onto those nodes.

If you use node taints, ensure that your cluster also has at least one node pool that uses standard Compute Engine VMs. Node pools that use standard VMs provide a reliable place for GKE to schedule critical system components like DNS.

For information on using a node taint for Spot VMs, see Use taints and tolerations for Spot VMs.
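
In addition to taints, a critical workload can state explicitly that it must not run on nodes that use Spot VMs. The following fragment is a sketch of that idea using node affinity, not a configuration taken from the GKE documentation. It goes under the spec of a Pod or Pod template and relies on the cloud.google.com/gke-spot label described earlier on this page.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # NotIn matches nodes whose label value is not "true", and also
        # nodes that don't have the label at all.
        - key: cloud.google.com/gke-spot
          operator: NotIn
          values:
          - "true"
```

With this rule, the Pod stays Pending if no standard node has capacity, rather than landing on a Spot VM.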

Using Spot VMs with GPU node pools

Spot VMs support using GPUs. When you create a new GPU node pool, GKE automatically adds the nvidia.com/gpu=present:NoSchedule taint to the new nodes. Only Pods with the corresponding toleration can run on these nodes. GKE automatically adds this toleration to Pods that request GPUs.

Your cluster must have at least one existing non-GPU node pool that uses standard VMs before you create a GPU node pool that uses Spot VMs. If your cluster only has a GPU node pool with Spot VMs, GKE doesn't add the nvidia.com/gpu=present:NoSchedule taint to those nodes. As a result, GKE might schedule system workloads onto the GPU node pools with Spot VMs, which can lead to disruptions when the Spot VMs are reclaimed and can increase your costs because GPU nodes are more expensive than non-GPU nodes.
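
The following Pod manifest is a minimal sketch of a GPU workload targeted at Spot VMs. The Pod name, container name, and image are hypothetical. No nvidia.com/gpu toleration is written out, on the assumption stated above that GKE adds it automatically to Pods that request GPUs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-gpu-example               # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  containers:
  - name: trainer                      # hypothetical container name
    image: example.com/trainer:latest  # hypothetical image
    resources:
      limits:
        # Requesting a GPU is what makes this Pod eligible for the
        # automatically added GPU toleration.
        nvidia.com/gpu: 1
```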

Cluster autoscaler and node auto-provisioning

You can use the cluster autoscaler and node auto-provisioning to automatically scale your clusters and node pools based on the demands of your workloads. Both the cluster autoscaler and node auto-provisioning support using Spot VMs.

Spot VMs and node auto-provisioning

Node auto-provisioning automatically creates and deletes node pools in your cluster to meet the demands of your workloads. When you schedule workloads that require Spot VMs by using a nodeSelector or node affinity, node auto-provisioning creates new node pools to accommodate the workloads' Pods. GKE automatically adds the cloud.google.com/gke-spot=true:NoSchedule taint to nodes in the new node pools. Only Pods with the corresponding toleration can run on nodes in those node pools. You must add the corresponding toleration to your deployments to allow GKE to place the Pods on Spot VMs:

   tolerations:
   - key: cloud.google.com/gke-spot
     operator: Equal
     value: "true"
     effect: NoSchedule

You can ensure that GKE only schedules your Pods on Spot VMs by using both a toleration and either a nodeSelector or node affinity rule to filter for Spot VMs.

If you schedule a workload using only a toleration, GKE can schedule the Pods onto either Spot VMs or existing standard VMs with capacity. If you require a workload to be scheduled on Spot VMs, use a nodeSelector or a node affinity in addition to a toleration. To learn more, see Scheduling workloads on Spot VMs.
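
Putting the toleration and the nodeSelector together, a Deployment that runs only on Spot VMs could look like the following sketch. The Deployment name, app label, replica count, and image are hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-only-example              # hypothetical name
spec:
  replicas: 3                          # illustrative value
  selector:
    matchLabels:
      app: spot-only-example           # hypothetical label
  template:
    metadata:
      labels:
        app: spot-only-example
    spec:
      # The nodeSelector restricts placement to nodes that use Spot VMs.
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      # The toleration allows placement on nodes that carry the
      # cloud.google.com/gke-spot=true:NoSchedule taint.
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: app                      # hypothetical container name
        image: example.com/app:latest  # hypothetical image
```

Removing the nodeSelector from this sketch leaves only the toleration, in which case the Pods can also run on existing standard VMs with capacity.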

Spot VMs and cluster autoscaler

The cluster autoscaler automatically adds and removes nodes in your node pools based on demand. If your cluster has Pods that can't be placed on existing Spot VMs, the cluster autoscaler adds new nodes that use Spot VMs.

Default policy

Starting in GKE version 1.24.1-gke.800, you can define the autoscaler location policy. For node pools that use Spot VMs, the default location policy is ANY, and the cluster autoscaler attempts to provision Spot VMs when resources are available. With this policy, Spot VMs have a lower risk of being preempted. For other VM types, the default location policy is BALANCED.

Upgrade Standard node pools using Spot VMs

If your Standard cluster node pools using Spot VMs are configured to use surge upgrades, GKE creates surge nodes with Spot VMs. However, GKE doesn't wait for the Spot VMs to be ready before cordoning and draining the existing nodes, as Spot VMs provide no guarantee of availability. To learn more, see Surge upgrades.

Modifications to Kubernetes behavior

Using Spot VMs on GKE modifies some guarantees and constraints that Kubernetes provides, such as the following:

  • Reclamation of Spot VMs is involuntary and is not covered by the guarantees of PodDisruptionBudgets. You might experience greater unavailability than your configured PodDisruptionBudget.
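
To make the limitation concrete, consider a PodDisruptionBudget like the following sketch. The name, the app label, and the minAvailable value are hypothetical. The budget constrains voluntary disruptions such as node drains, but a Spot VM reclamation can still take the number of available Pods below minAvailable.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spot-workload-pdb              # hypothetical name
spec:
  # Honored for voluntary evictions only. Spot VM reclamation is
  # involuntary and is not limited by this value.
  minAvailable: 2                      # illustrative value
  selector:
    matchLabels:
      app: my-app                      # hypothetical label on the workload's Pods
```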

Best practices for Spot VMs

When designing a system that uses Spot VMs, you can avoid major disruptions by using the following guidelines:

  • Spot VMs have no availability guarantees. Design your systems under the assumption that Compute Engine might reclaim any or all of your Spot VMs at any time, with no guarantee of when new instances become available.
  • To ensure that your workloads and Jobs are processed even when no Spot VMs are available, ensure that your clusters have a mix of node pools that use Spot VMs and node pools that use standard Compute Engine VMs.
  • Ensure that your cluster has at least one non-GPU node pool that uses standard VMs before you add a GPU node pool that uses Spot VMs.
  • While the node names do not usually change when nodes are recreated, the internal and external IP addresses used by Spot VMs might change after recreation.
  • Use node taints and tolerations to ensure that critical Pods aren't scheduled onto node pools that use Spot VMs.
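  • To run stateful workloads on Spot VMs, test to ensure that your workloads can gracefully terminate within the graceful termination period for your GKE version (15 seconds for non-system Pods on versions later than 1.22.8-gke.200, 25 seconds on 1.22.8-gke.200 and earlier) to minimize the risk of persistent volume data corruption.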
  • Follow the Kubernetes Pod termination best practices.
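
As one way to apply these guidelines to batch work, the following Job is a minimal sketch that tolerates the Spot VM taint without requiring Spot VMs, so its Pods can run on either kind of node pool. The Job name, container name, image, and backoffLimit value are hypothetical choices for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-tolerant-batch            # hypothetical name
spec:
  # Pods that fail, including Pods that fail when their node is
  # reclaimed, count toward this retry limit.
  backoffLimit: 10                     # illustrative value
  template:
    spec:
      restartPolicy: Never
      # Toleration only, no nodeSelector: the Pods may run on Spot VMs
      # or on standard VMs that have capacity.
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: batch                    # hypothetical container name
        image: example.com/batch:latest  # hypothetical image
```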

What's next