Containers & Kubernetes

Announcing Spot Pods for GKE Autopilot—save on fault tolerant workloads

November 9, 2021

William Denniss

Group Product Manager, Google Kubernetes Engine


We launched GKE Autopilot back in February and since then, we’ve been hard at work adding functionality to deliver a fully featured, fully managed Kubernetes platform. Today, we’re excited to introduce Spot Pods.

(Not familiar with GKE Autopilot yet? Check out the Autopilot breakout session at Google Cloud Next ’21, which gives a rundown of everything this new Kubernetes platform can do. Customers like the Japanese healthcare startup Ubie are already realizing simpler operations thanks to Autopilot, allowing them to spend less time worrying about infrastructure, and more time building their core business.)

Back to Spot Pods… Autopilot is great for running stable, production-grade workloads thanks to its Pod-level SLA, a first for GKE. However, you might also have workloads that don’t need this level of reliability, for example fault-tolerant batch workloads or dev/test clusters that can tolerate some disruption. Spot Pods give you a convenient and cost-effective way to run these kinds of workloads on GKE Autopilot. (GKE Standard users can also take advantage of spot pricing by running their clusters and node pools on Spot VMs.)

When you run your workloads with Spot Pods, you receive a discount of between 60% and 91% off regular Autopilot Pod pricing (see our pricing page for current prices). There is no hard limit on how long a Spot Pod can run, but it may be preempted and evicted at any time if the platform needs to reclaim the resources during periods of high demand.
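To put that discount range in concrete terms, here is a small Python sketch of the arithmetic. The regular price is a made-up placeholder, not a real GKE rate; check the pricing page for actual numbers.

```python
# Placeholder on-demand price for a regular Autopilot Pod, in arbitrary
# currency units per hour. This is NOT a real GKE price.
REGULAR_PRICE = 1.00

# Discount band announced for Spot Pods, in percent off regular pricing.
SPOT_DISCOUNT_RANGE = (60, 91)


def discounted_price(regular_price: float, discount_pct: float) -> float:
    """Return the price after taking discount_pct percent (0-100) off."""
    if not 0 <= discount_pct <= 100:
        raise ValueError("discount_pct must be between 0 and 100")
    return regular_price * (1 - discount_pct / 100)


if __name__ == "__main__":
    low, high = SPOT_DISCOUNT_RANGE
    # A bigger discount means a lower price, so the 91% end is the cheapest.
    cheapest = discounted_price(REGULAR_PRICE, high)
    priciest = discounted_price(REGULAR_PRICE, low)
    print(f"Spot Pod price: {cheapest:.2f} to {priciest:.2f} per hour "
          f"(vs {REGULAR_PRICE:.2f} regular)")
```

In other words, at the announced discounts a Spot Pod costs roughly 9% to 40% of what the equivalent regular Pod costs.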

How Spot Pods work

Spot Pods run on spare compute capacity in Google Cloud, which is why you can use them at a lower price than regular Autopilot Pods, for as long as that capacity is available. If Google Cloud needs the resources for other tasks, GKE evicts your Spot Pods with a 25-second grace period. If you manage them with a Kubernetes workload API like Deployment or Job, replacement Spot Pods are scheduled automatically as soon as capacity is available again, and workloads that save their progress can pick up right where they left off.
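Replacement Pods start fresh, so resuming after an eviction depends on your application saving its progress somewhere durable. The Python sketch below shows one common pattern, checkpointing after each unit of work. It assumes that CHECKPOINT_PATH (a hypothetical setting, not a GKE API) points at storage that outlives the Pod, such as a mounted persistent volume; the work itself is a placeholder.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location. On GKE this must live on storage that
# survives Pod eviction (e.g. a mounted persistent volume); a Pod's local
# filesystem is discarded when the Pod is evicted.
CHECKPOINT_PATH = os.environ.get("CHECKPOINT_PATH", "/tmp/spot-demo-checkpoint.json")


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint exists)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write progress to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def process_items(items, path: str = CHECKPOINT_PATH):
    """Process items in order, skipping those finished before the last eviction."""
    results = []
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        results.append(items[i] * 2)  # placeholder for one unit of real work
        save_checkpoint(path, i + 1)  # record progress after each item
    return results
```

If a Pod is evicted mid-run, the replacement Pod calls process_items again and skips everything up to the saved index. An item that finished just before an eviction but was not yet checkpointed is processed again, so each unit of work should be safe to repeat. Renaming a temporary file keeps an interrupted write from leaving a half-written checkpoint behind.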

Spot Pods are available starting in GKE 1.21.4. To run a workload on Spot Pods, add the node selector cloud.google.com/gke-spot: "true" to its Pod template, for example in the spec.template.spec of a Deployment.
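Here is a minimal sketch of a Deployment that requests Spot Pods, written as a Python dict and serialized to JSON (kubectl apply -f accepts JSON as well as YAML). The Deployment name, image path, and replica count are hypothetical placeholders; the nodeSelector is the part that matters.

```python
import json


def spot_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Build a Deployment manifest whose Pods request Spot capacity on Autopilot."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # The node selector that asks Autopilot for Spot Pods.
                    # The value is the string "true", not a boolean.
                    "nodeSelector": {"cloud.google.com/gke-spot": "true"},
                    # Use the full 25 s window to shut down cleanly.
                    "terminationGracePeriodSeconds": 25,
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image; substitute your own workload.
    manifest = spot_deployment("spot-demo", "us-docker.pkg.dev/my-project/my-repo/worker:latest")
    print(json.dumps(manifest, indent=2))
```

For example, redirect the output to spot.json and run kubectl apply -f spot.json. If you write the same manifest in YAML instead, quote the value ("true") so it is parsed as a string rather than a boolean.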

When you ask for Spot Pods in this way, Autopilot automatically provisions nodes for them. Autopilot adds Kubernetes taints and tolerations so that your regular, critical Pods stay separated and don’t land on the same nodes as Spot Pods. All you need to do is request Spot Pods in your manifest — GKE handles the rest.
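Autopilot applies these taints and tolerations for you, but it can help to see why they keep workloads apart. The Python sketch below is a simplified model of the Kubernetes rule for matching a toleration to a taint. The taint shown on the Spot node (cloud.google.com/gke-spot=true:NoSchedule) is an assumption for illustration; the exact taints Autopilot applies may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration-matching rule."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator in ("Equal", "") and tol.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """A Pod may land on a node only if every hard (NoSchedule/NoExecute) taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# Assumed taint on a Spot node and the matching toleration, for illustration only.
SPOT_TAINT = Taint("cloud.google.com/gke-spot", "true", "NoSchedule")
SPOT_TOLERATION = Toleration(key="cloud.google.com/gke-spot", operator="Equal",
                             value="true", effect="NoSchedule")
```

A regular Pod has no matching toleration, so can_schedule([], [SPOT_TAINT]) is False and it stays off the Spot node, while a Pod carrying SPOT_TOLERATION is allowed there. Tolerations only permit placement; the gke-spot node selector is what steers Spot Pods onto Spot nodes.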

When GKE evicts a Spot Pod to reclaim capacity, your containers receive a SIGTERM signal and have up to 25 seconds to wrap up their work. Make the most of this window by setting terminationGracePeriodSeconds in your PodSpec and shutting your container down gracefully when it receives SIGTERM.
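Here is a minimal Python sketch of that pattern. The signal handler only sets a flag; the main loop checks the flag between short units of work and then runs its cleanup. The worker loop and function names are hypothetical, and the sketch assumes each unit of work is short enough that cleanup still fits inside the grace period.

```python
import signal
import time
from typing import Optional

# Set by the SIGTERM handler; read by the work loop between units of work.
_shutdown_requested = False


def handle_sigterm(signum, frame) -> None:
    """Signal handler: just record the request and let the main loop clean up."""
    global _shutdown_requested
    _shutdown_requested = True


def shutdown_requested() -> bool:
    return _shutdown_requested


def run_worker(max_units: Optional[int] = None) -> int:
    """Do short units of work until SIGTERM arrives (or max_units is reached)."""
    completed = 0
    while not _shutdown_requested:
        if max_units is not None and completed >= max_units:
            break
        time.sleep(0.01)  # placeholder for one short unit of real work
        completed += 1
    # Cleanup goes here: flush buffers, save a checkpoint, close connections.
    # It all has to finish inside the grace period (up to 25 s for Spot Pods).
    return completed


def main() -> None:
    """Container entrypoint: install the handler, then work until told to stop."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    completed = run_worker()
    print(f"Stopped cleanly after {completed} units of work")
```

Call main() as the container's entrypoint. If the container starts through a shell script, make sure the shell execs the Python process (or forwards SIGTERM to it); otherwise the handler never sees the signal.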

Use Spot Pods to maximize your savings when you run fault-tolerant workloads on Autopilot clusters. For your regular Pods, you can also take advantage of Autopilot committed use discounts (CUDs), which launched earlier this year and offer discounts of up to 45%. CUDs don’t apply to Spot Pods, which are already heavily discounted, but they do offer a convenient way to save money on Pods that need a more stable environment. Whatever your workload, GKE gives you a way to save.
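As a rough illustration of how the two discounts combine, the Python sketch below estimates a monthly bill for a mix of stable Pods covered by a CUD and fault-tolerant Pods running as Spot Pods. It assumes a hypothetical flat hourly price, that all regular usage is covered at the maximum 45% CUD rate, and the 60% low end of the Spot range. Actual CUD billing depends on the commitment you purchase, so treat this as back-of-the-envelope math only.

```python
def pod_cost(pod_hours: float, hourly_price: float, discount_pct: float = 0.0) -> float:
    """Cost of pod_hours at hourly_price with discount_pct percent off."""
    return pod_hours * hourly_price * (1 - discount_pct / 100)


def blended_monthly_cost(
    regular_pod_hours: float,
    spot_pod_hours: float,
    hourly_price: float,
    cud_discount_pct: float = 45.0,   # "up to 45%"; assumes full CUD coverage
    spot_discount_pct: float = 60.0,  # low end of the 60-91% Spot range
) -> float:
    """Estimate a bill where the CUD covers regular Pods and Spot Pods get the Spot discount."""
    regular = pod_cost(regular_pod_hours, hourly_price, cud_discount_pct)
    # CUDs don't apply to Spot Pods, so only the Spot discount is used here.
    spot = pod_cost(spot_pod_hours, hourly_price, spot_discount_pct)
    return regular + spot
```

With a placeholder price of 1.00 per hour and 1,000 Pod-hours of each kind, this comes to about 950, compared with 2,000 with no discounts at all.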

Spot Pods are in Preview and available starting with GKE version 1.21.4. To get started with Spot Pods for GKE Autopilot, read the Spot Pods documentation and create an Autopilot cluster in the Rapid release channel. To learn about more capabilities like this, register to join us live on Nov 18 for Kubernetes Tips and Tricks to Build and Run Cloud Native Apps.
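If you want to check an existing cluster before trying Spot Pods, a small Python helper can compare its version against the 1.21.4 minimum. It assumes version strings in GKE's usual form MAJOR.MINOR.PATCH-gke.N and treats any -gke.N build of 1.21.4 as meeting the minimum; it does not check the release channel.

```python
import re

# Minimum version from the announcement: Spot Pods need GKE 1.21.4 or later.
MIN_SPOT_VERSION = (1, 21, 4)


def parse_gke_version(version: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH' with an optional '-gke.N' suffix into a tuple of ints."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-gke\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    major, minor, patch, build = match.groups()
    return (int(major), int(minor), int(patch), int(build or 0))


def supports_spot_pods(version: str) -> bool:
    """True if the version is at least 1.21.4, compared numerically (so 1.21.10 > 1.21.4)."""
    return parse_gke_version(version)[:3] >= MIN_SPOT_VERSION
```

You can get a cluster's version string with, for example, gcloud container clusters describe CLUSTER_NAME --region REGION --format="value(currentMasterVersion)". Comparing integer tuples rather than raw strings avoids the classic bug where "1.21.10" sorts before "1.21.4".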
