This page introduces Cloud TPU and helps you to plan your Cloud TPU configuration with Google Kubernetes Engine (GKE), including reserving TPU instances, autoscaling, TPU limitations, and workload scheduling considerations.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning (ML) workloads that use frameworks such as TensorFlow, PyTorch, and JAX.
Before you use TPUs in GKE, we recommend that you complete the following learning path:
- Learn how machine learning accelerators work with the Introduction to Cloud TPU.
- Learn about current TPU version availability with the Cloud TPU system architecture.
To learn how to set up Cloud TPU in GKE, see the following resources:
Benefits of using TPUs in GKE
GKE provides full support for TPU node and node pool lifecycle management, including creating, configuring, and deleting TPU VMs. GKE also supports Spot VMs and using reserved Cloud TPU. The benefits of using TPUs in GKE include:
- Consistent operational environment: You can use a single platform for all machine learning and other workloads.
- Automatic upgrades: GKE automates version updates, which reduces operational overhead.
- Load balancing: GKE distributes the load, thus reducing latency and improving reliability.
- Responsive scaling: GKE automatically scales TPU resources to meet the needs of your workloads.
- Resource management: With Kueue, a Kubernetes-native job queuing system, you can manage resources across multiple tenants within your organization using queuing, preemption, prioritization, and fair sharing.
Terminology related to TPU in GKE
This document uses the following terminology related to TPUs:
- TPU type: the Cloud TPU type, like v5e.
- TPU slice node: a Kubernetes node that contains a set of VMs with interconnected TPU chips.
- TPU slice node pool: a group of Kubernetes nodes within a cluster that all have the same TPU configuration.
- TPU topology: the number and physical arrangement of the TPU chips in a TPU slice.
- Single-host TPU slice nodes: one or more independent TPU slice nodes. The VMs in a single-host TPU slice node aren't connected to each other by high-speed interconnects.
- Multi-host TPU slice nodes: two or more interconnected TPU slice nodes. The VMs in a multi-host TPU slice node are connected by high-speed interconnects. Multi-host TPU slice nodes have the following characteristics:
  - Atomic: GKE treats all the interconnected nodes as a single unit. During scaling operations, GKE scales the entire set of nodes to 0 and creates new nodes. If a machine in the group fails or terminates, GKE recreates the entire set of nodes as a new unit.
  - Immutable: You can't manually add new nodes to the set of interconnected nodes. However, you can create a new node pool with the desired TPU topology and schedule workloads on the new node pool.
How TPUs in GKE work
Kubernetes resource management and priority treat VMs on TPUs the same as other VM types. You request TPU chips by using the resource name `google.com/tpu`:

```yaml
resources:
  requests:
    google.com/tpu: 4
  limits:
    google.com/tpu: 4
```
When you use TPUs in GKE, you must consider the following TPU characteristics:
- A VM can access up to 8 TPU chips.
- A TPU slice contains a fixed number of TPU chips, with the number depending on the TPU machine type you choose.
- The number of requested `google.com/tpu` chips must be equal to the total number of available TPU chips on the TPU slice node. Any container in a GKE Pod that requests TPUs must consume all the TPU chips in the node. Otherwise, your Deployment fails, because GKE can't partially consume TPU resources. For example, see the following scenarios:
  - The machine type `ct5l-hightpu-8t` has a single TPU slice node with 8 TPU chips, so on a node you:
    - Can deploy one GKE Pod that requires eight TPU chips.
    - Can't deploy two GKE Pods that require four TPU chips each.
  - The machine type `ct5lp-hightpu-4t` with a `2x4` topology contains two TPU slice nodes with four TPU chips each, for a total of eight TPU chips. With this machine type, you:
    - Can't deploy a GKE Pod that requires eight TPU chips on the nodes in this node pool.
    - Can deploy two Pods that require four TPU chips each, each Pod on one of the two nodes in this node pool.
  - TPU v5e with topology `4x4` has 16 TPU chips in four nodes. A GKE Autopilot workload that selects this configuration must request four TPU chips in each replica, for one to four replicas.
- In Standard clusters, multiple Kubernetes Pods can be scheduled on a VM, but only one container in each Pod can access the TPU chips.
- To create kube-system Pods, such as kube-dns, each Standard cluster must have at least one non-TPU slice node pool.
- By default, TPU slice nodes have the `google.com/tpu` taint, which prevents non-TPU workloads from being scheduled on the TPU slice nodes. Workloads that don't use TPUs run on non-TPU nodes, freeing up compute on TPU slice nodes for code that uses TPUs. Note that the taint doesn't guarantee that TPU resources are fully utilized.
- GKE collects the logs emitted by containers running on TPU slice nodes. To learn more, see Logging.
- TPU utilization metrics, such as runtime performance, are available in Cloud Monitoring. To learn more, see Observability and metrics.
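As a minimal sketch of the all-chips rule above, the following Pod consumes all eight chips of a single-host `ct5l-hightpu-8t` slice node. The container image is a placeholder, and the node selector labels are assumptions based on GKE's TPU scheduling labels; adjust them to your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-full-node          # illustrative name
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-device  # assumed label value
    cloud.google.com/gke-tpu-topology: 2x4
  containers:
  - name: trainer
    image: your-training-image # placeholder
    resources:
      requests:
        google.com/tpu: 8      # must equal all TPU chips on the node
      limits:
        google.com/tpu: 8
```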
Plan your TPU configuration
To work with TPUs in GKE clusters, you must decide the following parameters:
- TPU type: the machine type, such as `ct5l-hightpu-8t`. Different TPU types have different capabilities, like price-performance ratios, training throughput, and serving latency. The TPU type affects the available CPU and memory capacities.
- Topology: the physical arrangement of the TPUs within a TPU slice. Each TPU type supports a 2D or 3D TPU topology. Select a topology that matches your model's parallelism requirements.
- TPU interconnectivity: whether the nodes have high-speed interconnects. The TPU type and topology determine whether you can get multi-host TPU slice nodes, which are TPUs in multiple nodes connected by high-speed interconnects. We recommend the following:
  - For large-scale models, use multi-host TPU slice nodes.
  - For small-scale models, use single-host TPU slice nodes.
- Privileged mode: privileged mode overrides many of the other security settings in the `securityContext`. To access TPUs, containers running on GKE nodes in version 1.27 and earlier need to enable privileged mode. Containers on nodes in version 1.28 and later don't need privileged mode.
Choose a TPU configuration for GKE Autopilot mode
In Autopilot mode, you choose a TPU type and a topology, and then specify these in your Kubernetes manifest. GKE manages provisioning nodes with TPUs and scheduling your workloads.
TPU availability in GKE Autopilot
TPUs are available in specific Google Cloud regions. To use a TPU type in your GKE workload, your cluster must be in a supported region for that type. For details, see TPU regions and zones in the Cloud TPU documentation.
Choose a TPU type in Autopilot
| TPU type | Number of vCPUs | Memory (GiB) | Number of NUMA nodes | Maximum TPU chips in a slice |
|---|---|---|---|---|
| TPU v5p<br>`tpu-v5p-slice` | 208 | 448 | 2 | 6,144 |
| TPU v5e<br>`tpu-v5-lite-podslice` | 24 to 224 | 48 to 384 | 1 | 256 |
| TPU v5e (single-host only)<br>`tpu-v5-lite-device` | 24 to 224 | 48 to 384 | 1 to 2 | 8 |
| TPU v4<br>`tpu-v4-podslice` | 240 | 407 | 2 | 4,096 |
Review the TPU chip specifications and pricing in the Cloud TPU pricing documentation to help you decide which TPU type to use.
Choose a topology for Autopilot
After you decide on a TPU type, select a topology that's supported by that TPU type. Depending on the TPU type, the topology is two- or three-dimensional. Your model's parallelism requirements help you to decide on a topology. You can identify the number of TPU chips in the slice by calculating the product of each size in the topology. For example:
- `2x2x2` is an 8-chip multi-host TPU v4 slice.
- `2x2` is a 4-chip single-host TPU v5e slice.

If a specific topology supports both single-host and multi-host TPU slice nodes, the number of TPU chips that your workload requests determines the host type that you get. For example, TPU v5e (`tpu-v5-lite-podslice`) supports the `2x4` topology as both single-host and multi-host. If you:

- Request 4 chips in your workload, you get a multi-host node that has 4 TPU chips.
- Request 8 chips in your workload, you get a single-host node that has 8 TPU chips.
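As a sketch of the `2x4` example above, requesting 8 chips selects the single-host variant. The node selector labels are assumptions based on GKE's TPU scheduling labels, and the container details are placeholders:

```yaml
nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # assumed label value
  cloud.google.com/gke-tpu-topology: 2x4
containers:
- name: trainer                # illustrative
  image: your-training-image   # placeholder
  resources:
    requests:
      google.com/tpu: 8        # 8 chips on one host -> single-host slice node
    limits:
      google.com/tpu: 8
```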
The following table lists each TPU type, its supported topologies, and usage notes. For each of those topologies, the table lists the number of TPU chips, the number of nodes, and the host type:
| TPU type | Topology | TPU chips in a slice | Number of nodes | Host type | Notes |
|---|---|---|---|---|---|
| TPU v5p<br>`tpu-v5p-slice` | 2x2x1 | 4 | 1 | Single-host | Custom topologies for more than 64 chips are supported. |
| | 2x2x2 | 8 | 2 | Multi-host | |
| | 2x2x4 | 16 | 4 | Multi-host | |
| | 2x4x4 | 32 | 8 | Multi-host | |
| | 4x4x4 | 64 | 16 | Multi-host | |
| | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host | |
| TPU v5e<br>`tpu-v5-lite-podslice` | 1x1 | 1 | 1 | Single-host | Custom topologies aren't supported. |
| | 2x2 | 4 | 1 | Single-host | |
| | 2x4 | 8 | 1 | Single-host | |
| | 2x4 | 8 | 2 | Multi-host | |
| | 4x4 | 16 | 4 | Multi-host | |
| | 4x8 | 32 | 8 | Multi-host | |
| | 8x8 | 64 | 16 | Multi-host | |
| | 8x16 | 128 | 32 | Multi-host | |
| | 16x16 | 256 | 64 | Multi-host | |
| TPU v5e (single-host only)<br>`tpu-v5-lite-device` | 1x1 | 1 | 1 | Single-host | Custom topologies aren't supported. |
| | 2x2 | 4 | 1 | Single-host | |
| | 2x4 | 8 | 1 | Single-host | |
| TPU v4<br>`tpu-v4-podslice` | 2x2x1 | 4 | 1 | Single-host | Custom topologies for more than 64 chips are supported. |
| | 2x2x2 | 8 | 2 | Multi-host | |
| | 2x2x4 | 16 | 4 | Multi-host | |
| | 2x4x4 | 32 | 8 | Multi-host | |
| | 4x4x4 | 64 | 16 | Multi-host | |
| | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host | |

¹ Calculated by the topology product divided by four.
After you choose a TPU type and topology, specify these in your workload manifest. For instructions, see Deploy TPU workloads on GKE Autopilot.
Choose a TPU configuration for GKE Standard mode
The following sections describe the TPU characteristics to consider when planning and setting up your TPU workloads in GKE. For details about available versions, machine types, valid topologies, and their number of TPU chips, refer to Mapping of TPU configurations in this document.
TPU availability in GKE Standard mode
The following table lists the TPU availability for each TPU version and machine type:
| TPU version | Machine type beginning with | Minimum GKE version | Availability | Zone |
|---|---|---|---|---|
| TPU v4 | `ct4p-` | 1.26.1-gke.1500 | Generally Available | `us-central2-b` |
| TPU v5e | `ct5l-` | 1.27.2-gke.2100 | Generally Available | `europe-west4-b`<br>`us-central1-a` |
| TPU v5e | `ct5lp-` | 1.27.2-gke.2100 | Generally Available | `europe-west4-a`¹<br>`us-central1-a`¹<br>`us-east1-c`<br>`us-east5-b`¹<br>`us-west1-c`<br>`us-west4-a`<br>`us-west4-b`¹ |
| TPU v5p | `ct5p-` | 1.28.3-gke.1024000 | Generally Available | `us-east1-d`<br>`us-east5-a`<br>`us-east5-c` |
¹ You can create a single-host TPU v5e node pool with a machine type beginning with `ct5lp-` but not with `ct5l-` in certain zones (`europe-west4-a`, `us-central1-a`, `us-east5-b`, and `us-west4-b`). You can use `ct5lp-hightpu-4t` with a topology of at least `2x4` or larger in those zones. To create a single-host TPU v5e in the `us-west4` region, choose the zone `us-west4-a` and use machine types beginning with `ct5lp-`, such as `ct5lp-hightpu-1t`. To create a single-host TPU v5e in the other regions listed in this paragraph, use machine types beginning with `ct5l-` (such as `ct5l-hightpu-1t`, `ct5l-hightpu-4t`, or `ct5l-hightpu-8t`) and choose the `us-central1-a` or `europe-west4-b` zone. Note that machine types beginning with `ct5l-` require different quota than machine types beginning with `ct5lp-`.
Machine type
Machine types that support TPU resources follow a naming convention that includes the TPU version and the number of TPU chips per node, such as `ct<version>-hightpu-<node-chip-count>t`. For example, the machine type `ct5lp-hightpu-1t` supports TPU v5e and contains just one TPU chip.
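The naming convention above can be split programmatically. The following sketch is illustrative (the helper function is not part of any GKE API):

```python
import re

def parse_tpu_machine_type(machine_type: str) -> tuple[str, int]:
    """Split a TPU machine type into its version part and chips per node.

    Follows the documented pattern ct<version>-hightpu-<node-chip-count>t.
    This helper is illustrative only.
    """
    match = re.fullmatch(r"ct([a-z0-9]+)-hightpu-(\d+)t", machine_type)
    if match is None:
        raise ValueError(f"not a TPU machine type: {machine_type!r}")
    return match.group(1), int(match.group(2))

# ct5lp-hightpu-1t -> version part "5lp" (TPU v5e), 1 chip per node
print(parse_tpu_machine_type("ct5lp-hightpu-1t"))
```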
Topology
The topology defines the physical arrangement of TPUs within a TPU slice. GKE provisions a TPU slice in two- or three-dimensional topologies, depending on the TPU version. You specify a topology as the number of TPU chips in each dimension.

For TPU v4 and v5p scheduled in multi-host TPU slice node pools, you define the topology in 3-tuples (`{A}x{B}x{C}`), for example `4x4x4`. The product of `{A}x{B}x{C}` defines the number of TPU chips in the node pool. For example, you can define topologies smaller than 64 TPU chips with forms such as `2x2x2`, `2x2x4`, or `2x4x4`. If you use topologies larger than 64 TPU chips, the values that you assign to {A}, {B}, and {C} must meet the following conditions:

- {A}, {B}, and {C} are multiples of four.
- The largest topology supported for v4 is `12x16x16`, and for v5p it's `16x16x24`.
- The assigned values keep the A ≤ B ≤ C pattern, for example, `4x4x8` or `8x8x8`.
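The conditions above can be checked with a short sketch (the helper is illustrative; each multi-host v4/v5p node hosts four TPU chips):

```python
def v4_v5p_slice_shape(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (chip count, node count) for an {A}x{B}x{C} TPU v4/v5p topology.

    Applies the documented conditions for topologies larger than 64 chips.
    Illustrative helper, not a GKE API.
    """
    chips = a * b * c
    if chips > 64:
        if any(dim % 4 != 0 for dim in (a, b, c)):
            raise ValueError("each dimension must be a multiple of four")
        if not (a <= b <= c):
            raise ValueError("dimensions must keep the A <= B <= C pattern")
    nodes = max(chips // 4, 1)  # e.g. 2x2x1 is one single-host node
    return chips, nodes

print(v4_v5p_slice_shape(4, 4, 4))  # 64 chips across 16 nodes
```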
Mapping of TPU configurations
Use the following table to choose the TPU machine type and topology for your use case:
- For small-scale model training or inference, use TPU v4 or TPU v5e with single-host TPU slice node pools.
- For large-scale model training or inference, use TPU v4 or TPU v5e with multi-host TPU slice node pools.
| TPU version | Machine type | Topology | Number of TPU chips | Number of VMs | Node pool type |
|---|---|---|---|---|---|
| TPU v4 | `ct4p-hightpu-4t` | 2x2x1 | 4 | 1 | Single-host |
| | | 2x2x2 | 8 | 2 | Multi-host |
| | | 2x2x4 | 16 | 4 | Multi-host |
| | | 2x4x4 | 32 | 8 | Multi-host |
| | | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
| TPU v5p | `ct5p-hightpu-4t` | 2x2x1 | 4 | 1 | Single-host |
| | | 2x2x2 | 8 | 2 | Multi-host |
| | | 2x2x4 | 16 | 4 | Multi-host |
| | | 2x4x4 | 32 | 8 | Multi-host |
| | | {A}x{B}x{C} | A*B*C | (A*B*C/4)¹ | Multi-host |
| TPU v5e | `ct5l-hightpu-1t` | 1x1 | 1 | 1 | Single-host |
| | `ct5l-hightpu-4t` | 2x2 | 4 | 1 | Single-host |
| | `ct5l-hightpu-8t` | 2x4 | 8 | 1 | Single-host |
| | `ct5lp-hightpu-1t` | 1x1 | 1 | 1 | Single-host |
| | `ct5lp-hightpu-4t` | 2x2 | 4 | 1 | Single-host |
| | `ct5lp-hightpu-8t` | 2x4 | 8 | 1 | Single-host |
| | `ct5lp-hightpu-4t` | 2x4 | 8 | 2 | Multi-host |
| | | 4x4 | 16 | 4 | Multi-host |
| | | 4x8 | 32 | 8 | Multi-host |
| | | 8x8 | 64 | 16 | Multi-host |
| | | 8x16 | 128 | 32 | Multi-host |
| | | 16x16 | 256 | 64 | Multi-host |

¹ Calculated by the topology product divided by four.
TPU v5e characteristics
TPU v5e machines have the following technical characteristics:
| Machine type | Number of vCPUs | Memory (GB) | Number of NUMA nodes | Likelihood of being preempted |
|---|---|---|---|---|
| `ct5l-hightpu-1t` | 24 | 48 | 1 | Higher |
| `ct5l-hightpu-4t` | 112 | 192 | 1 | Medium |
| `ct5l-hightpu-8t` | 224 | 384 | 2 | Lower |
| `ct5lp-hightpu-1t` | 24 | 48 | 1 | Higher |
| `ct5lp-hightpu-4t` | 112 | 192 | 1 | Medium |
| `ct5lp-hightpu-8t` | 224 | 384 | 1 | Lower |
TPU v4 and v5p characteristics
TPU v4 and v5p machines have the following technical characteristics:

| Machine type | Number of vCPUs | Memory (GB) | Number of NUMA nodes |
|---|---|---|---|
| `ct4p-hightpu-4t` | 240 | 407 | 2 |
| `ct5p-hightpu-4t` | 208 | 448 | 2 |
TPU reservation
TPU reservations are available when purchasing a commitment. Any TPU reservation can be used with GKE.
When you create a TPU slice node pool, use the `--reservation` and `--reservation-affinity=specific` flags to consume a reserved TPU instance.
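For example, a node pool creation command might look like the following sketch. The cluster, zone, and reservation names are placeholders:

```shell
# Sketch: consume a specific TPU reservation for a TPU slice node pool.
gcloud container node-pools create tpu-pool \
    --cluster=my-cluster \
    --zone=us-central2-b \
    --machine-type=ct4p-hightpu-4t \
    --num-nodes=1 \
    --reservation-affinity=specific \
    --reservation=my-tpu-reservation
```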
Autoscaling TPUs in GKE
Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.
With the `--enable-autoprovisioning` flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meets the requirements of pending workloads.

When you use `--enable-autoscaling`, GKE scales the node pool based on its type, as follows:

- Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool can contain any number of TPU nodes between zero and the maximum size of the node pool, as determined by the `--max-nodes` and `--total-max-nodes` flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn more about how to create a single-host TPU slice node pool, see Create a node pool.
- Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, with a TPU node pool that has the machine type `ct5lp-hightpu-4t` and a topology of `16x16`, the node pool contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn more about how to create a multi-host TPU slice node pool, see Create a node pool.
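The all-or-nothing node count for a multi-host slice follows directly from the topology. A sketch, assuming `ct5lp-hightpu-4t` hosts 4 chips per node:

```python
def multi_host_node_count(topology: str, chips_per_node: int = 4) -> int:
    """Nodes GKE provisions for a multi-host slice: the product of the
    topology dimensions divided by the chips on each node."""
    chips = 1
    for dim in topology.split("x"):
        chips *= int(dim)
    return chips // chips_per_node

# 16x16 = 256 chips / 4 chips per node: all 64 nodes created or deleted together
print(multi_host_node_count("16x16"))
```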
Limitations
Consider the following limitations when planning how to use TPUs on your platform:
- For capacity reservations, you must use a specific reservation.
- GKE cost allocation and usage metering doesn't include any data about the usage or costs of reserved TPU v4.
- TPU v5p and v5e don't support riptide/image streaming in us-east5.
- TPU v5p autoscaling is supported on GKE clusters with control planes running at least 1.29.2-gke.1035000 or at least 1.28.7-gke.1020000.
Workload scheduling considerations
TPUs have unique characteristics that require special workload scheduling and management in Kubernetes. The following sections describe scheduling best practices.
CPU for Standard clusters
This section doesn't apply to Autopilot clusters because GKE places each TPU slice on its own node. To learn more, see How TPUs work in Autopilot mode.
For Standard clusters, consider the following scheduling best practices.
To schedule a non-TPU workload on a VM in a TPU slice node, ensure that your GKE Pod can tolerate the `google.com/tpu` taint. If you want the workload to be deployed to specific nodes, use node selectors.
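For example, such a toleration might look like the following sketch (the `NoSchedule` effect is an assumption about the default TPU taint):

```yaml
tolerations:
- key: "google.com/tpu"
  operator: "Exists"
  effect: "NoSchedule"   # assumption: the effect used by the default TPU taint
```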
Kubernetes resource management and priority treat VMs in TPUs the same as other VM types. To give scheduling priority to Pods that require TPUs over other Pods on the same nodes, request the maximum CPU or memory for those TPU slices. Low-priority workloads should do the following:
- Set low CPU and memory requests to ensure that the node has enough allocatable resources for the TPU workloads. To learn more, see How Kubernetes applies resource requests and limits.
- Set no CPU limit (unlimited) to ensure that Pods can burst to use all unused cycles.
- Set appropriate memory limits to ensure Pods can function correctly without risking node-pressure eviction.
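A low-priority Pod's resources section that follows these guidelines might look like the following sketch (the values are illustrative):

```yaml
resources:
  requests:
    cpu: 100m        # small request leaves allocatable CPU for TPU workloads
    memory: 128Mi
  limits:
    memory: 256Mi    # memory limit set to avoid node-pressure eviction
    # no CPU limit, so the Pod can burst into unused cycles
```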
If a Kubernetes Pod doesn't request CPU and memory (even if it requests TPUs), then Kubernetes considers it a best-effort Pod, and there is no guarantee that it gets any CPU and memory. Only Pods that explicitly request CPU and memory have such guarantees. For more information, see Resource Management for Pods and Containers.
To learn more best practices, see Kubernetes best practices: Resource requests and limits.
Reduce workload interruption
If you are using TPUs to train a machine learning model and your workload is interrupted, all work performed since the last checkpoint is lost. To decrease the probability that your workload is interrupted, do the following:
- Set a higher priority for this Job than for all other Jobs: If resources are scarce, the GKE scheduler preempts lower priority Jobs to schedule a higher priority Job. This also ensures that your higher priority workload receives all the resources that it needs (up to the total resources available in the cluster). To learn more, see Pod priority and preemption.
- Configure maintenance exclusion: A maintenance exclusion is a non-repeating window of time during which automatic maintenance is forbidden. To learn more, see Maintenance exclusions.
- Use extended run time Pods in Autopilot: Use extended run time Pods for a grace period of up to seven days before GKE terminates your Pods for scale-downs or node upgrades.
Handle disruption due to node maintenance
All GKE nodes, including those that contain TPUs, are subject to
maintenance events or other disruptions that might cause node shutdown. You can
reduce disruption to workloads running in GKE clusters with the control plane
running version 1.29.1-gke.1425000 or later.
GKE alerts the nodes of an imminent shutdown by sending a `SIGTERM` signal to the node up to five minutes before eviction. If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the `SIGTERM` signal and initiate a checkpointing process.
You can configure GKE to terminate your ML workloads gracefully with the maximum notification time. In your Pod manifest, set the `spec.terminationGracePeriodSeconds` field to `300` seconds (five minutes). GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE respects any configuration of up to five minutes for the `PodDisruptionBudget` or `terminationGracePeriodSeconds` settings.
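In a Pod manifest, that setting looks like the following sketch (the container details are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tpu-training-pod                # illustrative name
spec:
  terminationGracePeriodSeconds: 300    # five minutes, the maximum GKE honors here
  containers:
  - name: trainer
    image: your-training-image          # placeholder
```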
The `spec.terminationGracePeriodSeconds` field only handles disruptions due to maintenance and defragmentation events that occur on non-preemptible VMs. GKE doesn't handle involuntary disruptions, such as hardware failures.
To learn more, see Configure TPU slice node graceful termination.
Maximize TPU utilization
To maximize your investment in TPUs, schedule a mix of Job priorities and queue them to maximize the amount of time your TPUs are operating. If you want Job level scheduling and preemption, then you need to use an add-on to Kubernetes that orchestrates Jobs into queues. We recommend using Kueue for that use case.
What's next
- Follow Deploy TPU workloads in GKE to set up Cloud TPU with GKE.
- Learn about best practices for using Cloud TPU for your machine learning tasks.
- Build large-scale machine learning on Cloud TPUs with GKE.
- Serve Large Language Models with KubeRay on TPUs.