Containers & Kubernetes

Running AI on fully managed GKE, now with new compute options, pricing and resource reservations

March 6, 2024

William Denniss

Group Product Manager, Google Kubernetes Engine

Kubernetes is a popular way to run AI workloads such as training and large language model (LLM) serving, including serving our new open model, Gemma. Google Kubernetes Engine (GKE) in Autopilot mode provides a fully managed Kubernetes platform that offers the power and flexibility of Kubernetes without the need to manage compute nodes, so you can focus on delivering your own business value through AI. Today we’re excited to announce the new Accelerator compute class in Autopilot, which improves GPU support with resource reservation capabilities and a lower price for most GPU workloads (you can opt in to this pricing today, and eventually all workloads will be migrated). In addition, a new Performance compute class enables high-performance workloads to run on Autopilot mode at scale. Both compute classes also offer more ephemeral storage directly on the boot disk, giving you more room to download AI models and other data before you need to configure additional storage via generic ephemeral volumes. With these enhancements, our fully managed Kubernetes platform is even better suited to inference and other compute-intensive workloads.

With GKE running in Autopilot mode you avoid the need to specify and provision nodes upfront, and can focus on building the workload and creating your own business value. As a fully managed platform, once your workload is built you can run it with less operational overhead. Today’s news sweetens the deal even further.

Lower-priced GPUs, better discounts

We’re lowering the price for the majority of GPU workloads running on GKE in Autopilot mode, and moving to a new billing model to improve compatibility with other products and experiences in Google Cloud. Now you can move workloads between the Standard and Autopilot modes of GKE, as well as to and from Compute Engine VMs, and keep your existing reservations and committed use discounts.

When you enable the new pricing model (by specifying the Accelerator compute class on your workload), resources are billed based on Compute Engine VM resources, plus a premium for the fully managed experience. Today the new pricing model is opt-in; after April 30, versions of GKE will be released that automatically migrate GPU workloads to this new model. These changes lower the price for most workloads (workloads on NVIDIA T4 GPUs with fewer than 2 vCPUs per GPU see a slight price increase).

Here’s a comparison of the hourly prices for several workload sizes in the us-central1 region, covering GPU, CPU and memory resources (storage is billed separately):

GPU	Pod resource requests	VM resources	Old hourly price (GPU Pod)	New hourly price (Accelerator compute class Pod)
NVIDIA A100 80GB	1 GPU, 11 vCPU, 148 GB memory	1 GPU, 12 vCPU, 170 GB memory	$6.09	$5.59
NVIDIA A100 40GB	1 GPU, 11 vCPU, 74 GB memory	1 GPU, 12 vCPU, 85 GB memory	$4.46	$4.09
NVIDIA L4	1 GPU, 11 vCPU, 40 GB memory	1 GPU, 12 vCPU, 48 GB memory	$1.61	$1.12
NVIDIA T4	1 GPU, 1 vCPU, 1 GB memory	1 GPU, 2 vCPU, 2 GB memory	$0.46	$0.47
NVIDIA T4	1 GPU, 20 vCPU, 40 GB memory	1 GPU, 22 vCPU, 48 GB memory	$1.96	$1.37

When using the Accelerator compute class, the workload is billed for (and can utilize) the complete node VM capacity, including bursting into resources allocated for system Pods.

To opt in to these changes today, upgrade to version 1.28.6-gke.1095000 or later, and add the compute-class selector to your existing GPU workloads.
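As a minimal sketch of what that opt-in might look like, the Pod below selects the Accelerator compute class together with a GPU type. The Pod name and container image are hypothetical placeholders, and the node selector keys follow GKE's documented `cloud.google.com/` labels as we understand them, so check them against the current GKE documentation before relying on this.

```yaml
# Sketch only: the Pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: l4-inference-pod
spec:
  nodeSelector:
    # Opts this workload in to the Accelerator compute class (and its pricing model).
    cloud.google.com/compute-class: "Accelerator"
    # Selects the GPU type for the node (assumed here: NVIDIA L4).
    cloud.google.com/gke-accelerator: "nvidia-l4"
  containers:
  - name: inference
    image: registry.example.com/inference-server:latest  # placeholder image
    resources:
      limits:
        # Requests one GPU for this container.
        nvidia.com/gpu: 1
```

Existing GPU workloads would only need the `cloud.google.com/compute-class` line added to their node selector; the rest of the spec stays as it is.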

High-performance CPU resources

If you need dedicated CPU resources for your workloads, Autopilot now takes an approach similar to the one it takes with GPUs. You can now run GKE Autopilot workloads on Compute Engine’s main machine families, including the new C3, C3D and H3 machines, as well as C2, C2D, and more! These resources can be requested as part of the Performance compute class.
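A request for the Performance compute class might look like the following sketch. The Deployment name, labels, image, and resource sizes are hypothetical, and the `cloud.google.com/machine-family` selector is assumed from GKE's documented node labels rather than taken from this post.

```yaml
# Sketch only: names, image, and resource sizes are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: performance-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: performance-workload
  template:
    metadata:
      labels:
        app: performance-workload
    spec:
      nodeSelector:
        # Requests dedicated, high-performance CPU hardware.
        cloud.google.com/compute-class: "Performance"
        # Picks the Compute Engine machine family (assumed here: C3).
        cloud.google.com/machine-family: "c3"
      containers:
      - name: app
        image: registry.example.com/compute-heavy-app:latest  # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
```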

Reservations

Reservations can help ensure that your project has resources for future increases in demand, but previously you weren't able to consume reservations in Autopilot mode. Good news: now you can! Using reservations is a breeze, and they can be used with both GPUs (when you opt in to the new model) and high-performance CPUs.
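One way to consume a specific reservation, sketched under assumptions: the reservation name `my-gpu-reservation` is hypothetical and must match a Compute Engine reservation that already exists in your project, and the two `reservation-*` selector keys are taken from GKE's documented labels as we understand them, not from this post.

```yaml
# Sketch only: the reservation name, Pod name, and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: reserved-gpu-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"
    cloud.google.com/gke-accelerator: "nvidia-l4"
    # Name of an existing Compute Engine reservation (hypothetical value).
    cloud.google.com/reservation-name: "my-gpu-reservation"
    # Consume only the named reservation rather than any matching one.
    cloud.google.com/reservation-affinity: "specific"
  containers:
  - name: inference
    image: registry.example.com/inference-server:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```

The reservation's machine shape and zone need to match what the Pod ends up requesting; otherwise the reserved capacity cannot be used for it.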

Larger boot disks

GKE already allows you to mount multiple persistent volumes, each up to 64TB, on any path in your container. Offering larger boot disks for Pods additionally lets you use ephemeral storage without mounting a separate volume. When using either the Performance or Accelerator compute-class labels above, your workload can now consume up to 122GiB of ephemeral storage. Need more? Persistent disks can be mounted to expand further.
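To illustrate, a Pod can ask for boot-disk ephemeral storage through the standard Kubernetes `ephemeral-storage` resource. In this sketch the 100Gi figure is an arbitrary value below the 122GiB ceiling mentioned above, and the Pod name, image, and other sizes are hypothetical.

```yaml
# Sketch only: names, image, and sizes are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: model-cache-pod
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Performance"
  containers:
  - name: app
    image: registry.example.com/model-server:latest  # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
        # Space on the boot disk for downloaded models and scratch data.
        ephemeral-storage: 100Gi
```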

Hardware when you need it, simplicity when you don’t

You may be wondering: where do regular Autopilot Pods fit in with this new model? Think about it like this: if you have a workload that requires dedicated, high-performance CPU hardware such as that offered by C3 machines, you can mark just that workload with those requirements using the compute-class node selector described above.

But what about supporting workloads that run alongside the primary ones but don’t need the same computing power? This is where Autopilot mode really excels: by default, all those other workloads will continue to run on the standard Pod model, offering great price/performance for workloads that don’t have high-performance CPU needs. In Autopilot mode, just annotate those workloads that need specialized hardware, like a specific GPU or machine family, and we’ll do the rest. Leave the other workloads blank, and rest assured that they won’t accidentally run on the specialized hardware. This way, you get the best value out of each of your execution environments: broadly applicable defaults in Autopilot, and specialized hardware when you need it.

Here’s what our customers are saying

“At Contextual AI, we are building the next generation of Retrieval Augmented Generation (RAG). Contextual Language Models (CLMs) are end-to-end optimized to address pain points of RAG 1.0 and help enterprise customers build production-grade workflows. To achieve this, we rely on GKE Autopilot, a fully managed Kubernetes service that handles the complexity of running our application. With GKE Autopilot, we can easily scale our pods, optimize our resource utilization, and ensure the security and availability of our nodes. We also take advantage of the new billing models that offer more cost-effective GPUs for our inference tasks, while using regular Autopilot pods for our non-GPU services. We are excited to use GKE Autopilot to power CLMs while saving us money and improving our performance.” - Soumitr Pandey, Member of Technical Staff, Contextual AI

“We opted for GKE Autopilot for our ML infrastructure as it empowers our team to concentrate on research and development instead of cluster management. This approach not only automates resource provisioning throughout the entire regional cluster but also streamlines our operations. The latest enhancements in Autopilot are particularly exciting. They not only provide a unified resource pool but also introduce reservation capabilities, giving us greater control in meeting project deadlines.” - Jon Mason, CEO, Hotspring

To learn more about all the new features that we launched for Autopilot this week, check out the following resources:
