This page describes how to plan your usage of Tensor Processing Units (TPUs) in Google Kubernetes Engine (GKE) to reduce the risk of TPU misconfiguration, non-availability errors, or out-of-quota interruptions.
Before you use TPUs in GKE, ensure that you are familiar with TPU definitions and terminology in GKE.
Plan your TPU configuration
To work with TPUs in GKE clusters, you must plan their configuration. We recommend that you follow these steps:
Choose a GKE mode of operation: Run your workloads on TPUs in a GKE Autopilot or Standard cluster.
Best practice: Use an Autopilot cluster for a fully managed Kubernetes experience.
Choose the TPU version: Different TPU types have different capabilities like price-performance ratios, training throughput, and serving latency. The TPU types affect the available CPU and memory capacities.
Validate TPU availability: TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE workload, your cluster must be in a supported region for that type.
Choose the TPU topology: The topology is the physical arrangement of the TPUs within a TPU slice. Select a topology that matches your model's parallelism requirements.
Use the reference tables on this page to identify if your node pools are single-host or multi-host TPU slice nodes.
Choose a GKE mode of operation
You can use TPUs in the available GKE modes of operation for clusters:
- Autopilot mode (recommended): GKE manages the underlying infrastructure such as node configuration, autoscaling, auto-upgrades, baseline security configurations, and baseline networking configuration. In Autopilot, you choose a TPU type and topology, then specify them in your Kubernetes manifest. GKE manages provisioning nodes with TPUs and scheduling your workloads.
- Standard mode: You manage the underlying infrastructure, including configuring the individual nodes.
To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Choose the TPU version
The VMs in a TPU slice have the following technical characteristics.
Autopilot
TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Maximum TPU chips in a TPU slice node |
---|---|---|---|---|---|
TPU v5p | tpu-v5p-slice | 208 | 448 | 2 | 6,144 |
TPU v5e | tpu-v5-lite-podslice | 24 to 224 | 48 to 384 | 1 | 256 |
TPU v5e (single-host only) | tpu-v5-lite-device | 24 to 224 | 48 to 384 | 1 to 2 | 8 |
TPU v4 | tpu-v4-podslice | 240 | 407 | 2 | 4,096 |
TPU v3 (single-host only) | tpu-v3-device | 96 | 340 | 2 | 8 |
TPU v3 | tpu-v3-slice | 48 | 340 | 1 | 256 |
Standard
TPU version | Machine type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Likelihood of being preempted |
---|---|---|---|---|---|
TPU Trillium (v6e) (Preview) | ct6e-standard-1t | 44 | 448 | 2 | Higher |
TPU v6e (Preview) | ct6e-standard-4t | 180 | 720 | 1 | Medium |
TPU v6e (Preview) | ct6e-standard-8t | 180 | 1440 | 2 | Lower |
TPU v5p | ct5p-hightpu-4t | 208 | 448 | 2 | |
TPU v5e | ct5l-hightpu-1t | 24 | 48 | 1 | Higher |
TPU v5e | ct5l-hightpu-4t | 112 | 192 | 1 | Medium |
TPU v5e | ct5l-hightpu-8t | 224 | 384 | 2 | Lower |
TPU v5e | ct5lp-hightpu-1t | 24 | 48 | 1 | Higher |
TPU v5e | ct5lp-hightpu-4t | 112 | 192 | 1 | Medium |
TPU v5e | ct5lp-hightpu-8t | 224 | 384 | 1 | Low |
TPU v4 | ct4p-hightpu-4t | 240 | 407 | 2 | |
TPU v3 (single-host only) | ct3-hightpu-4t | 96 | 340 | 2 | 8 |
TPU v3 | ct3p-hightpu-4t | 48 | 340 | 1 | 256 |
Consider the following configurations when evaluating which machine type to use based on your model:
- ct5l- machine types are suitable for serving small-to-medium size models, and are less suitable for large models. The ct5l- machine types are single-host and therefore don't have any high-speed interconnect links between multiple hosts.
- Multi-host ct5lp- machine types are more suitable for serving large models or training. Multi-host ct5lp- machines are interconnected with high-speed links.
Review the TPU specifications and pricing in the Cloud TPU pricing documentation to decide which TPU configuration to use.
Limitations
Consider these limitations when choosing the TPU to use:
- TPU v6e is in Preview and available in Standard clusters.
- TPU v6e doesn't support configuring SMT set to 2 on ct6e-standard-8t.
- TPU v6e doesn't support 2x4 topology on ct6e-standard-4t.
- GKE cost allocation and usage metering don't include any data about the usage or costs of reserved TPU v4.
- TPU v5p and v5e don't support riptide/image streaming in us-east5.
- TPU v5p autoscaling is supported on GKE clusters with control planes running at least version 1.29.2-gke.1035000 or 1.28.7-gke.1020000.
- For capacity reservations, use a specific reservation.
Validate TPU availability in GKE
TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE cluster, your cluster must be in a supported region for that type.
Autopilot
See TPU regions and zones in the Cloud TPU documentation.
Standard
The following table lists the TPU availability for each TPU version and machine type:
TPU version | Machine type beginning with | Minimum GKE version | Availability | Zone |
---|---|---|---|---|
TPU v6e | ct6e- | 1.31.1-gke.1846000 | Preview | us-east5-b, europe-west4-a, us-east1-d, asia-northeast1-b, us-south1-a |
TPU v5e | ct5l- | 1.27.2-gke.2100 | Generally Available | europe-west4-b, us-central1-a |
TPU v5e | ct5lp- | 1.27.2-gke.2100 | Generally Available | europe-west4-a, us-central1-a, us-east1-c, us-east5-b, us-west1-c, us-west4-a, us-west4-b |
TPU v5p | ct5p- | 1.28.3-gke.1024000 | Generally Available | us-east1-d, us-east5-a, us-east5-c |
TPU v4 | ct4p- | 1.26.1-gke.1500 | Generally Available | us-central2-b |
TPU v3 | ct3p- | 1.31.1-gke.1146000 | Generally Available | us-east1-d, europe-west4-a |
TPU v3 | ct3- | 1.31.0-gke.1500 | Generally Available | us-east1-d, europe-west4-a, us-central1-a, us-central1-b, us-central1-f |
Consider the following caveats when configuring a TPU:
- You can create a single-host TPU v5e node pool with a machine type beginning with ct5lp- but not beginning with ct5l- in certain zones (europe-west4-a, us-east5-b, and us-west4-b). You can use ct5lp-hightpu-4t with a topology of 2x4 or larger in those zones.
- To create a single-host TPU v5e in the us-west4 region, choose the zone us-west4-a and use machine types beginning with ct5lp-, such as ct5lp-hightpu-1t.
- To create a single-host TPU v5e in the other regions listed in the preceding table, use machine types beginning with ct5l- (such as ct5l-hightpu-1t, ct5l-hightpu-4t, or ct5l-hightpu-8t).
- Machine types beginning with ct5l- require different quota than machine types beginning with ct5lp-.
Choose a topology
After you decide on a TPU version, select a topology that's supported by that TPU type. Depending on the TPU type, the topology is two- or three-dimensional. Your model's parallelism requirements help you to decide on a topology. You can identify the number of TPU chips in the slice by calculating the product of each size in the topology. For example:
- 2x2x2 is an 8-chip multi-host TPU v4 slice
- 2x2 is a 4-chip single-host TPU v5e slice
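The chip count described above can be checked with a few lines of Python. This is a minimal sketch: the function name chips_in_topology is hypothetical, and it only multiplies the dimensions of a two- or three-dimensional topology string. It doesn't check whether a given TPU version supports that topology.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Return the number of TPU chips implied by a topology string such as '2x2x2'."""
    dims = [int(part) for part in topology.lower().split("x")]
    # Topologies on this page are two- or three-dimensional.
    if len(dims) not in (2, 3) or any(d < 1 for d in dims):
        raise ValueError(f"unsupported topology: {topology!r}")
    return prod(dims)


print(chips_in_topology("2x2x2"))  # 8
print(chips_in_topology("2x2"))    # 4
```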
If a specific topology supports both single-host and multi-host TPU slice nodes, the number of TPU chips that your workload requests determines the host type.
For example, TPU v5e (tpu-v5-lite-podslice) supports the 2x4 topology as both single- and multi-host. If you:
- Request 4 chips in your workload, you get a multi-host node that has 4 TPU chips.
- Request 8 chips in your workload, you get a single-host node that has 8 TPU chips.
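The rule above can be sketched as follows. The function name slice_layout is hypothetical, and the sketch assumes that the number of nodes equals the chips in the slice divided by the chips that each node provides, which is how the node counts in the tables on this page work out.

```python
from math import prod


def slice_layout(topology: str, chips_per_node: int) -> tuple[int, str]:
    """Return (number of nodes, host type) for a topology and a per-node chip request."""
    total_chips = prod(int(part) for part in topology.split("x"))
    if chips_per_node < 1 or total_chips % chips_per_node != 0:
        raise ValueError(
            f"{chips_per_node} chips per node doesn't evenly divide {total_chips} chips"
        )
    nodes = total_chips // chips_per_node
    # One node holding the whole slice is single-host; otherwise the slice spans hosts.
    host_type = "Single-host" if nodes == 1 else "Multi-host"
    return nodes, host_type


print(slice_layout("2x4", 4))  # (2, 'Multi-host')
print(slice_layout("2x4", 8))  # (1, 'Single-host')
```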
Use the following table to choose the TPU machine type and topology for your use case:
- For small-scale model training or inference, use TPU v4 or TPU v5e with single-host TPU slice node pools.
- For large-scale model training or inference, use TPU v4 or TPU v5e with multi-host TPU slice node pools.
Autopilot
TPU version | Machine type | Topology | Number of TPU chips in a slice | Number of nodes | Node pool type |
---|---|---|---|---|---|
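TPU v5p | tpu-v5p-slice | 2x2x1 | 4 | 1 | Single-host |
TPU v5p | tpu-v5p-slice | 2x2x2 | 8 | 2 | Multi-host |
TPU v5p | tpu-v5p-slice | 2x2x4 | 16 | 4 | Multi-host |
TPU v5p | tpu-v5p-slice | 2x4x4 | 32 | 8 | Multi-host |
TPU v5p | tpu-v5p-slice | 4x4x4 | 64 | 16 | Multi-host |
TPU v5p | tpu-v5p-slice | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 1x1 | 1 | 1 | Single-host |
TPU v5e | tpu-v5-lite-podslice² | 2x2 | 4 | 1 | Single-host |
TPU v5e | tpu-v5-lite-podslice² | 2x4 | 8 | 1 | Single-host |
TPU v5e | tpu-v5-lite-podslice² | 2x4 | 8 | 2 | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 4x4 | 16 | 4 | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 4x8 | 32 | 8 | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 8x8 | 64 | 16 | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 8x16 | 128 | 32 | Multi-host |
TPU v5e | tpu-v5-lite-podslice² | 16x16 | 256 | 64 | Multi-host |
TPU v5e (single-host only) | tpu-v5-lite-device | 1x1 | 1 | 1 | Single-host |
TPU v5e (single-host only) | tpu-v5-lite-device | 2x2 | 4 | 1 | Single-host |
TPU v5e (single-host only) | tpu-v5-lite-device | 2x4 | 8 | 1 | Single-host |
TPU v4 | tpu-v4-podslice² | 2x2x1 | 4 | 1 | Single-host |
TPU v4 | tpu-v4-podslice² | 2x2x2 | 8 | 2 | Multi-host |
TPU v4 | tpu-v4-podslice² | 2x2x4 | 16 | 4 | Multi-host |
TPU v4 | tpu-v4-podslice² | 2x4x4 | 32 | 8 | Multi-host |
TPU v4 | tpu-v4-podslice² | 4x4x4 | 64 | 16 | Multi-host |
TPU v4 | tpu-v4-podslice² | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
TPU v3 | tpu-v3-slice | 4x4 | 16 | 2 | Multi-host |
TPU v3 | tpu-v3-slice | 4x8 | 32 | 4 | Multi-host |
TPU v3 | tpu-v3-slice | 8x8 | 64 | 8 | Multi-host |
TPU v3 | tpu-v3-slice | 8x16 | 128 | 16 | Multi-host |
TPU v3 | tpu-v3-slice | 16x16 | 256 | 32 | Multi-host |
TPU v3 | tpu-v3-device | 2x2 | 4 | 1 | Single-host |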
¹ Calculated by the topology product divided by four. Custom topologies for more than 64 chips are supported. The following conditions apply:
- For more than 64 chips, {A}, {B}, and {C} must be multiples of 4
- The largest topology is 16x16x24
- The values must be {A} ≤ {B} ≤ {C}, like 8x12x16.
² Custom topologies aren't supported.
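The conditions for custom topologies with more than 64 chips can be sketched as a small checker. The function name custom_topology_problems is hypothetical. Reading "the largest topology is 16x16x24" as a per-dimension upper bound is an assumption, not something this page states. A topology that passes these checks might still be unavailable for other reasons.

```python
def custom_topology_problems(a: int, b: int, c: int) -> list[str]:
    """List which of the stated custom-topology conditions {A}x{B}x{C} violates."""
    if a * b * c <= 64:
        # The conditions on this page are only stated for more than 64 chips.
        return ["the stated conditions only cover topologies with more than 64 chips"]
    problems = []
    if any(dim % 4 != 0 for dim in (a, b, c)):
        problems.append("each dimension must be a multiple of 4")
    if not a <= b <= c:
        problems.append("dimensions must satisfy A <= B <= C")
    # Assumption: 16x16x24 bounds each dimension separately.
    if a > 16 or b > 16 or c > 24:
        problems.append("topology exceeds the largest supported size, 16x16x24")
    return problems


print(custom_topology_problems(8, 12, 16))  # []
print(custom_topology_problems(8, 16, 12))  # ['dimensions must satisfy A <= B <= C']
```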
After you choose a TPU type and topology, specify these in your workload manifest. For instructions, see Deploy TPU workloads on GKE Autopilot.
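As a rough illustration of what "specify these in your workload manifest" can look like, the following Python sketch builds a Pod manifest as a dictionary and prints it as JSON, which kubectl accepts as well as YAML. The function name, Pod name, and container image are hypothetical. The node selector keys and the google.com/tpu resource name are the ones commonly used in GKE TPU deployment guides. Treat them as assumptions and confirm them against Deploy TPU workloads on GKE Autopilot.

```python
import json


def tpu_pod_manifest(name, image, accelerator, topology, chips_per_node):
    """Build a Pod manifest that requests a TPU type and topology (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed label keys for selecting the TPU type and topology.
            "nodeSelector": {
                "cloud.google.com/gke-tpu-accelerator": accelerator,
                "cloud.google.com/gke-tpu-topology": topology,
            },
            "containers": [
                {
                    "name": "tpu-workload",
                    "image": image,  # hypothetical image
                    "resources": {
                        "requests": {"google.com/tpu": chips_per_node},
                        "limits": {"google.com/tpu": chips_per_node},
                    },
                }
            ],
        },
    }


manifest = tpu_pod_manifest(
    "tpu-example", "example.com/my-trainer:latest", "tpu-v5-lite-podslice", "2x4", 8
)
print(json.dumps(manifest, indent=2))
```

The values in the example correspond to the single-host 2x4 row for tpu-v5-lite-podslice in the preceding table.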
Standard
TPU version | Machine type | Topology | Number of TPU chips | Number of VMs | Node pool type |
---|---|---|---|---|---|
TPU v6e (Preview) | ct6e-standard-1t | 1x1 | 1 | 1 | Single-host |
TPU v6e (Preview) | ct6e-standard-8t | 2x4 | 8 | 1 | Single-host |
TPU v6e (Preview) | ct6e-standard-4t | 2x2 | 4 | 1 | Single-host |
TPU v6e (Preview) | ct6e-standard-4t | 2x4 | 8 | 2 | Multi-host |
TPU v6e (Preview) | ct6e-standard-4t | 4x4 | 16 | 4 | Multi-host |
TPU v6e (Preview) | ct6e-standard-4t | 4x8 | 32 | 8 | Multi-host |
TPU v6e (Preview) | ct6e-standard-4t | 8x8 | 64 | 16 | Multi-host |
TPU v6e (Preview) | ct6e-standard-4t | 8x16 | 128 | 32 | Multi-host |
TPU v6e (Preview) | ct6e-standard-4t | 16x16 | 256 | 64 | Multi-host |
TPU v5p | ct5p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host |
TPU v5p | ct5p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host |
TPU v5p | ct5p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host |
TPU v5p | ct5p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host |
TPU v5p | ct5p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
TPU v5e | ct5l-hightpu-1t | 1x1 | 1 | 1 | Single-host |
TPU v5e | ct5l-hightpu-4t | 2x2 | 4 | 1 | Single-host |
TPU v5e | ct5l-hightpu-8t | 2x4 | 8 | 1 | Single-host |
TPU v5e | ct5lp-hightpu-1t | 1x1 | 1 | 1 | Single-host |
TPU v5e | ct5lp-hightpu-4t | 2x2 | 4 | 1 | Single-host |
TPU v5e | ct5lp-hightpu-8t | 2x4 | 8 | 1 | Single-host |
TPU v5e | ct5lp-hightpu-4t | 2x4 | 8 | 2 | Multi-host |
TPU v5e | ct5lp-hightpu-4t | 4x4 | 16 | 4 | Multi-host |
TPU v5e | ct5lp-hightpu-4t | 4x8 | 32 | 8 | Multi-host |
TPU v5e | ct5lp-hightpu-4t | 8x8 | 64 | 16 | Multi-host |
TPU v5e | ct5lp-hightpu-4t | 8x16 | 128 | 32 | Multi-host |
TPU v5e | ct5lp-hightpu-4t | 16x16 | 256 | 64 | Multi-host |
TPU v4 | ct4p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host |
TPU v4 | ct4p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host |
TPU v4 | ct4p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host |
TPU v4 | ct4p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host |
TPU v4 | ct4p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
TPU v3 | ct3-hightpu-4t | 2x2 | 4 | 1 | Single-host |
TPU v3 | ct3p-hightpu-4t | 4x4 | 16 | 4 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 4x8 | 32 | 8 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 8x8 | 64 | 16 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 8x16 | 128 | 32 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 16x16 | 256 | 64 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 16x32 | 512 | 128 | Multi-host |
TPU v3 | ct3p-hightpu-4t | 32x32 | 1024 | 256 | Multi-host |
¹ Calculated by the topology product divided by four.
Advanced configurations
The following sections describe scheduling best practices for advanced TPU configurations.
TPU reservation
TPU reservations are available when purchasing a commitment. Any TPU reservation can be used with GKE.
When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to consume a reserved TPU instance.
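The following Python sketch assembles a node pool creation command that includes the two reservation flags named above. Only --reservation and --reservation-affinity=specific come from this page. The function name, the remaining flags (--cluster, --location, --machine-type, --tpu-topology), and all values are illustrative assumptions that you should verify against the gcloud reference. A real command might need further flags, such as a node count for multi-host slices.

```python
import shlex


def node_pool_create_command(pool, cluster, location, machine_type, topology, reservation):
    """Return an argv list for creating a TPU slice node pool that consumes a reservation."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",            # assumed flag, hypothetical value
        f"--location={location}",          # assumed flag, hypothetical value
        f"--machine-type={machine_type}",  # assumed flag
        f"--tpu-topology={topology}",      # assumed flag
        f"--reservation={reservation}",    # from this page
        "--reservation-affinity=specific", # from this page
    ]


command = node_pool_create_command(
    "tpu-pool", "my-cluster", "us-east5-a", "ct5p-hightpu-4t", "2x2x2", "my-reservation"
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to subprocess.run without shell quoting issues.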
Autoscaling TPUs in GKE
GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.
With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meet the requirements of pending workloads.
When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:
- Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool may contain any number of TPU nodes between zero and the maximum size of the node pool as determined by the --max-nodes and the --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn how to create a single-host TPU slice node pool, see Create a node pool.
- Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, with a TPU node pool with a machine type ct5lp-hightpu-4t and a topology of 16x16, the node pool contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn how to create a multi-host TPU slice node pool, see Create a node pool.
CPU for Standard clusters
This section doesn't apply to Autopilot clusters because GKE places each TPU slice on its own node. To learn more, see How TPUs work in Autopilot mode.
For Standard clusters, consider the following scheduling best practices.
To schedule a non-TPU workload on a VM in a TPU slice node, ensure that your GKE Pod can tolerate the google.com/tpu taint. If you want the workload to be deployed to specific nodes, use node selectors.
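A toleration for the google.com/tpu taint might look like the following sketch, which builds the Pod manifest as a Python dictionary. The function name, Pod name, image, and node selector value are hypothetical. The toleration uses operator Exists without an effect, so it matches the taint key regardless of the taint's value or effect.

```python
def non_tpu_pod_on_tpu_node(name, image, node_selector=None):
    """Build a Pod manifest for a non-TPU workload that tolerates the TPU taint."""
    spec = {
        # Tolerate any taint with the key google.com/tpu.
        "tolerations": [{"key": "google.com/tpu", "operator": "Exists"}],
        "containers": [{"name": "helper", "image": image}],
    }
    if node_selector:
        # Optional: pin the workload to specific nodes.
        spec["nodeSelector"] = dict(node_selector)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


pod = non_tpu_pod_on_tpu_node(
    "log-shipper",
    "example.com/log-shipper:latest",  # hypothetical image
    {"kubernetes.io/hostname": "my-tpu-node"},  # hypothetical node name
)
print(pod["spec"]["tolerations"])
```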
Kubernetes resource management and priority treat VMs in TPU slices the same as other VM types. To give scheduling priority to Pods that require TPUs over other Pods on the same nodes, request the maximum CPU or memory for those TPU slices. Low-priority Pods should do the following:
- Set low CPU and memory requests to ensure that the node has enough allocatable resources for the TPU workloads. To learn more, see How Kubernetes applies resource requests and limits.
- Set no CPU limit (unlimited) to ensure that Pods can burst to use all unused cycles.
- Set appropriate memory limits to ensure Pods can function correctly without risking node-pressure eviction.
If a Kubernetes Pod doesn't request CPU and memory (even if it is requesting TPUs), then Kubernetes considers it a best-effort Pod, and there is no guarantee that it gets any CPU or memory. Only Pods that explicitly request CPU and memory have such guarantees. For specific Kubernetes scheduling, configure the Pod with explicit CPU and memory requests. For more information, see Resource Management for Pods and Containers.
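The resource settings for a low-priority Pod described above can be sketched as a container resources section, here built as a Python dictionary. The function name and the quantities are hypothetical placeholders. Choose values that fit your workload and node size.

```python
def low_priority_resources(cpu_request="100m", memory_request="128Mi", memory_limit="512Mi"):
    """Build a resources section for a low-priority container on a TPU slice node."""
    return {
        # Low requests leave allocatable CPU and memory for the TPU workload.
        "requests": {"cpu": cpu_request, "memory": memory_request},
        # No "cpu" key under limits: the container can burst into unused cycles.
        # The memory limit bounds usage to reduce the risk of node-pressure eviction.
        "limits": {"memory": memory_limit},
    }


container = {
    "name": "background-task",
    "image": "example.com/background-task:latest",  # hypothetical image
    "resources": low_priority_resources(),
}
print(container["resources"])
```

Because the container sets explicit CPU and memory requests, Kubernetes doesn't classify the Pod as best-effort.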
To learn about more best practices, see Kubernetes best practices: Resource requests and limits.
Reduce workload interruption
If you are using TPUs to train a machine learning model and your workload is interrupted, all work performed since the last checkpoint is lost. To decrease the probability that your workload is interrupted, do the following:
- Set a higher priority for this Job than for all other Jobs: If resources are scarce, the GKE scheduler preempts lower priority Jobs to schedule a higher priority Job. This also ensures that your higher priority workload receives all the resources that it needs (up to the total resources available in the cluster). To learn more, see Pod priority and preemption.
- Configure maintenance exclusion: A maintenance exclusion is a non-repeating window of time during which automatic maintenance is forbidden. To learn more, see Maintenance exclusions.
- Use extended run time Pods in Autopilot: Use extended run time Pods for a grace period of up to seven days before GKE terminates your Pods for scale-downs or node upgrades.
These recommendations help to minimize interruptions, but not to prevent them. For example, a preemption due to a hardware failure or preemption for defragmentation can still occur. Similarly, setting a GKE maintenance exclusion doesn't prevent Compute Engine maintenance events.
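To illustrate the first recommendation, the following sketch builds a Kubernetes PriorityClass and attaches it to a Job's Pod template, both as Python dictionaries. The helper names, the class name, and the priority value are hypothetical. The manifest fields follow the standard scheduling.k8s.io/v1 PriorityClass shape.

```python
import copy


def priority_class(name, value, description=""):
    """Build a PriorityClass manifest; higher values mean higher priority."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def with_priority(job_manifest, class_name):
    """Return a copy of a Job manifest whose Pods use the given PriorityClass."""
    job = copy.deepcopy(job_manifest)
    pod_spec = job.setdefault("spec", {}).setdefault("template", {}).setdefault("spec", {})
    pod_spec["priorityClassName"] = class_name
    return job


high = priority_class("tpu-training-high", 1000000, "Long-running TPU training Jobs")
job = {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": "train"},
       "spec": {"template": {"spec": {"containers": [], "restartPolicy": "Never"}}}}
print(with_priority(job, high["metadata"]["name"])["spec"]["template"]["spec"])
```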
Save checkpoints frequently and add code to your training script to start from the last checkpoint when resumed.
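A minimal sketch of this save-and-resume pattern follows. It isn't tied to any ML framework: the state is a plain dictionary stored as JSON, the function names are hypothetical, and the training step is a placeholder. A real training script would use its framework's own checkpoint format and write to durable storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interruption can't leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0}


def train(path, total_steps, checkpoint_every):
    """Run (or resume) training, checkpointing every `checkpoint_every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # Placeholder for one real training step.
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the workload is interrupted and restarted, calling train again with the same path continues from the last saved step rather than from step 0.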
Handle disruption due to node maintenance
The GKE nodes that host the TPUs are subject to maintenance events or other disruptions that might cause node shutdown. In GKE clusters with the control plane running version 1.29.1-gke.1425000 and later, you can reduce disruption to workloads by configuring GKE to terminate your workloads gracefully.
To understand, configure, and monitor disruption events that might occur on GKE nodes running AI/ML workloads, see Manage GKE node disruption for GPUs and TPUs.
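On the workload side, graceful termination usually reaches a container as a SIGTERM signal, which is standard Kubernetes behavior when a Pod is terminated. The sketch below shows one way a Python training process could react. The names are hypothetical, and how much time the process has before it is killed depends on the configured grace period.

```python
import signal
import threading

# Set when the process has been asked to shut down.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; the training loop checks it between steps."""
    stop_requested.set()


def install_handler():
    """Register the handler. Call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_steps(total_steps, do_step, save):
    """Run steps until done or until shutdown is requested, then save once."""
    completed = 0
    for _ in range(total_steps):
        if stop_requested.is_set():
            break
        do_step()
        completed += 1
    save(completed)
    return completed
```

Checking the flag between steps, instead of saving inside the signal handler, keeps the handler short and lets the loop finish the current step before it writes a final checkpoint.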
Maximize TPU utilization
To maximize your investment in TPUs, schedule a mix of Job priorities and queue them to maximize the amount of time that your TPUs are operating. For Job-level scheduling and preemption, you need to use an add-on to Kubernetes that orchestrates Jobs into queues.
Use Kueue to orchestrate Jobs into queues.
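As a hedged illustration, a Job is typically handed to Kueue by labeling it with a queue name and creating it suspended, so that Kueue decides when it starts. The sketch below builds such a Job as a Python dictionary. The function name, Job name, queue name, and image are hypothetical. Check the label key kueue.x-k8s.io/queue-name and the suspend field against the Kueue documentation for the version you install.

```python
def queued_job(name, queue_name, image, tpu_chips):
    """Build a Job manifest that is submitted to a Kueue queue (illustrative only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Assumed Kueue label that names the queue to submit to.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            "suspend": True,  # created suspended; the queue controller starts it
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",
                            "image": image,  # hypothetical image
                            "resources": {
                                "requests": {"google.com/tpu": tpu_chips},
                                "limits": {"google.com/tpu": tpu_chips},
                            },
                        }
                    ],
                }
            },
        },
    }


job = queued_job("tpu-train", "tpu-queue", "example.com/my-trainer:latest", 4)
print(job["metadata"]["labels"])
```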
What's next
- Follow Deploy TPU workloads in GKE to set up Cloud TPU with GKE.
- Learn about best practices for using Cloud TPU for your machine learning tasks.
- Build large-scale machine learning on Cloud TPUs with GKE.
- Serve Large Language Models with KubeRay on TPUs.