About accelerator consumption options for AI/ML workloads in GKE


This page describes the techniques that you can use to obtain compute accelerators, such as GPUs or TPUs, based on the requirements of your AI/ML workloads. In GKE, these techniques are called accelerator consumption options. Understanding the different consumption options helps you avoid underutilized resources, increase the likelihood of obtaining capacity, and balance cost and performance.

This page is intended for Platform admins and operators who coordinate with Machine learning (ML) engineers to obtain the resources needed to successfully deploy AI/ML workloads.

To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Understand consumption options

You can select from the following options to consume accelerators on GKE:

  • On-demand: you consume TPUs or GPUs on GKE without arranging capacity in advance. Before requesting resources, you must have enough on-demand quota for the specific type and quantity of accelerators. On-demand is the most flexible consumption option; however, there is no guarantee that enough on-demand resources will be available to satisfy your request.
  • Reservations: you reserve resources for a set period. A reservation can be any of the following:
    • Future reservations: you reserve resources for a specific time in the future, typically for longer durations. You have exclusive access to your reserved resources for that period of time. Future reservations require engagement with a Technical Account Manager (TAM). For more information, see TPU and GPU guidance.
    • Future reservations for up to 90 days (in calendar mode): you request capacity for a specified time period, and a calendar advisor suggests available dates. This option offers more flexibility for shorter durations and a self-service capacity search. For more information, see Future reservation requests in calendar mode.
    • On-demand reservations: you can request an on-demand reservation to be provisioned as soon as the capacity is available, similar to the on-demand option. While the reservation is active, you pay for the resources whether you use them or not.
  • Flex-start: you secure densely allocated resources for short-duration workloads without a reservation. You request a specific number of GPUs or TPUs, and Compute Engine provisions them when capacity becomes available. The GPUs or TPUs run uninterrupted for up to seven days. For more information, see flex-start provisioning.
  • Spot: you provision Spot VMs at significant discounts, but Compute Engine can preempt Spot VMs at any time with a 30-second warning. For more information, see Spot VMs.
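To illustrate how these options surface in GKE, the following sketch shows node pool creation commands for the Spot and flex-start options. The cluster name, location, machine type, and accelerator shape are placeholders, and flag availability can vary by gcloud release (the --flex-start flag in particular may require a recent version):

```shell
# Spot: discounted capacity that Compute Engine can preempt at any time
# (cluster name, location, and machine shape are placeholders).
gcloud container node-pools create l4-spot-pool \
    --cluster=my-cluster --location=us-central1 \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --spot

# Flex-start: request densely allocated capacity that runs uninterrupted
# for up to seven days. Autoscaling from zero lets GKE add nodes only
# when Compute Engine delivers the requested capacity.
gcloud container node-pools create l4-flex-pool \
    --cluster=my-cluster --location=us-central1 \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --flex-start \
    --enable-autoscaling --num-nodes=0 --total-max-nodes=4 \
    --reservation-affinity=none --no-enable-autorepair
```

These commands only sketch the shape of each request; see the gcloud reference for the authoritative flag set for your version.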

Understand accelerator quota in GKE

Quotas specify the amount of a countable, shared resource that you can use, and they are defined by Google Cloud. By default, projects generally don't come with significant accelerator quota. You must request and receive approval for quota for specific accelerator types and regions.

Consider the following characteristics when managing the quotas that your workloads need:

  • You must request the quota that each consumption option needs. To identify the required quota, see the corresponding "Quota" details in the Choose a consumption option section. If you don't have enough quota, attempts to create clusters or node pools, or to deploy workloads that require accelerators, fail with a Quota exceeded error.

  • You must still request quota when you use custom compute classes in Autopilot. The nodes that GKE provisions to meet the compute class requirements consume your project's quota for the specified accelerators.

  • Google Cloud Free Trial accounts have limitations on requesting quota increases for high-value resources like GPUs and TPUs. To have access to accelerator quota, upgrade to a paid account.
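To illustrate the custom compute class point above, the following is a minimal sketch of a ComputeClass manifest. The field names and values shown are assumptions based on the ComputeClass CRD and may differ by GKE version; nodes provisioned for either priority still draw from the project's accelerator quota:

```shell
# Apply a minimal custom compute class (field names are assumptions;
# check the ComputeClass CRD for your GKE version). Both priorities
# consume the project's G2/L4 quota when nodes are provisioned.
kubectl apply -f - <<'EOF'
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: gpu-fallback
spec:
  priorities:
  - machineType: g2-standard-8   # preferred: Spot capacity
    spot: true
  - machineType: g2-standard-8   # fallback: on-demand capacity
  whenUnsatisfiable: DoNotScaleUp
EOF
```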

To check and request quota, go to the Quotas page in the Google Cloud console. You can filter for accelerator quotas and request increases.
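Quota checks can also be scripted. The following sketch lists accelerator-related quota metrics for a region with gcloud; the region and the metric pattern are examples, and the exact metric names (such as NVIDIA_L4_GPUS) vary by region and accelerator family:

```shell
# List GPU-related quota metrics, limits, and current usage for a region.
# The region and the ~GPUS filter pattern are illustrative examples.
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --filter="quotas.metric~GPUS" \
    --format="table(quotas.metric, quotas.limit, quotas.usage)"
```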

Choose a consumption option

Use the following considerations to choose the best consumption option for your AI/ML workload:

  • Workload type: consider the type of workload that you want to implement. GKE requirements vary if you are running a training or an inference workload:
    • Training: requires high-performance resources with significant memory. Training workloads typically have a well-defined lifespan. These workloads are commonly easier to plan for because they are less prone to sudden spikes in resource consumption.
    • Inference: typically requires accelerators that are optimized for scalability and lower cost. Inference workloads can require significant accelerator memory during sudden spikes in resource consumption.
  • Lifespan based on the implementation phase: consider your business goal, such as whether you are running a proof of concept (POC), a platform evaluation, application development or testing, a move to production, or an optimization effort.
  • Time to provision: determine if your workload requires immediate execution or if it can be run in the future. If future execution is possible, determine how flexible the start time can be.
  • Balance between cost and performance: evaluate your workload performance requirements and budget constraints to select the most cost-effective accelerator. Consider the trade-off between the cost of the accelerators and their performance characteristics. Remember that new accelerators might bring improved cost-performance ratios.

Use the following comparison to choose a consumption option. Each entry describes a workload profile, with its typical time to provision, its lifespan, and the recommended consumption option:

  • Long-running, large-scale workloads such as pre-training foundation models or multi-host inference, and production workloads.
    • Time to provision: immediate (with an approved reservation).
    • Lifespan: long-term (per reservation).
    • Recommended consumption option:
      • If you want to consume any GPU (except A4X, A4, or A3 Ultra), or any TPU, use On-demand reservations:
        • Cost: you are charged for the full reservation period.
        • Quota: quota is automatically increased before capacity is delivered.
      • If you want to consume G2, A2, A3 High, or A3 Mega accelerators, use Future reservations:
        • Cost: you are charged for the full reservation period.
        • Quota: quota is automatically increased before capacity is delivered.
  • Short-running distributed workloads like model fine-tuning, simulations, or batch inference where a precise start time is needed, and workloads for platform evaluation, benchmarking, or optimization testing.
    • Time to provision: immediate (with an approved reservation).
    • Lifespan: up to 90 days.
    • Recommended consumption option: Future reservations for up to 90 days (in calendar mode):
      • Cost: discounted (up to 53%). You are charged for the reservation period.
      • Quota: no quota is charged.
      • Supported accelerators: A4, A3 Ultra, TPU v5e, TPU v5p, TPU Trillium.
  • Batch workloads such as small model training, fine-tuning, or scalable inference where the start time is flexible, and workloads for POCs or integration testing.
    • Time to provision: on-demand (subject to availability).
    • Lifespan: up to 7 days per allocation.
    • Recommended consumption option: Flex-start provisioning mode.
  • Lower-priority, fault-tolerant workloads like CI/CD, data analytics, or high performance computing (HPC), and highly interruptible workloads.
    • Time to provision: on-demand (subject to availability).
    • Lifespan: variable; can be preempted with a 30-second warning.
    • Recommended consumption option: Spot VMs.
  • General purpose workloads requiring immediate execution.
    • Time to provision: immediate (subject to availability).
    • Lifespan: no limit.
    • Recommended consumption option: On-demand (GPUs or TPUs):
      • Cost: you pay as you go.
      • Quota: GPU or TPU on-demand quota is charged.
      • Supported accelerators: all GPU families except A4X, A4, or A3 Ultra; all TPU versions.
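For the reservation-backed options above, the reservation is typically consumed by pointing a node pool at it. The following sketch uses placeholder cluster, reservation, and machine-shape names; confirm the flags against the gcloud reference for your version:

```shell
# Create a node pool that consumes a specific, pre-approved reservation
# (cluster name, reservation name, and A3 High shape are placeholders).
gcloud container node-pools create a3-reserved-pool \
    --cluster=my-cluster --location=us-central1 \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --reservation-affinity=specific \
    --reservation=my-future-reservation
```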

What's next