About TPUs in GKE


This page introduces Cloud TPU and shows you where to find information on using Cloud TPU with Google Kubernetes Engine (GKE). Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads that use frameworks such as TensorFlow, PyTorch, and JAX.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

  1. Learn how machine learning accelerators work with the Introduction to Cloud TPU.
  2. Learn about current TPU version availability with the Cloud TPU system architecture.

To learn how to set up Cloud TPU in GKE, see Deploy TPU workloads in GKE.

Benefits of using TPUs in GKE

GKE provides full support for TPU VM lifecycle management, including creating, configuring, and deleting TPU VMs. GKE also supports Spot VMs and the use of reserved Cloud TPU. The benefits of using TPUs in GKE include:

  • Consistent operational environment: A single platform for all machine learning and other workloads.
  • Automatic upgrades: GKE automates version updates, which reduces operational overhead.
  • Load balancing: GKE distributes the load, which reduces latency and improves reliability.
  • Responsive scaling: GKE automatically scales TPU resources to meet the needs of your workloads.
  • Resource management: With Kueue, a Kubernetes-native job queuing system, you can manage resources across multiple tenants within your organization using queuing, preemption, prioritization, and fair sharing.

Terminology related to TPU in GKE

TPU and Kubernetes use some similar terms that you need to consider and differentiate as you read this document:

  • TPU VM: A Compute Engine VM running on a physical machine with TPU hardware.
  • TPU Pod: A collection of interconnected TPU chips. The number of TPU chips in a TPU Pod varies by TPU version.
  • TPU slice: A subset of a full TPU Pod.
  • TPU topology: The number and physical arrangement of the TPU chips in a TPU slice.
  • Atomicity: The property of a multi-host TPU node pool where the node pool is treated as a single unit. You cannot resize a multi-host node pool. GKE scales a multi-host node pool by creating or removing all nodes in the node pool in a single step.

How TPUs in GKE work

Kubernetes resource management and priority treat TPU VMs the same as other VM types. You request TPU chips using the resource name google.com/tpu:

    resources:
      requests:
        google.com/tpu: 4
      limits:
        google.com/tpu: 4

When you use TPUs in GKE, you must consider the following TPU characteristics:

  • A TPU VM can access up to 8 TPU chips.
  • A TPU slice contains a fixed number of TPU chips that depends on the TPU machine type you choose.
  • The number of requested google.com/tpu must be equal to the total number of available chips on the TPU node. Any container in a GKE Pod that requests TPUs must consume all the TPU chips in the node. Otherwise, your Deployment fails, because GKE can't partially consume TPU resources. For example, see the following scenarios:
    • The machine type ct5l-hightpu-8t has a single TPU node with 8 TPU chips. One GKE Pod that requires 8 TPU chips can be deployed on the node, but 2 GKE Pods that require 4 TPU chips each cannot be deployed on a node.
    • The machine type ct5lp-hightpu-4t with a 2x4 topology contains two TPU nodes with 4 chips each, a total of 8 TPU chips. A GKE Pod that requires 8 TPU chips cannot be deployed in any of the nodes in this node pool, but 2 Pods that require 4 TPU chips each can be deployed on the 2 nodes in the node pool.
  • Multiple Kubernetes Pods can be scheduled on a TPU VM, but only one container in each Pod can access the TPU chips.
  • Each cluster must have at least one non-TPU node pool to create kube-system Pods, such as kube-dns.
  • By default, TPU nodes have the google.com/tpu taint, which prevents non-TPU Pods from being scheduled on them. The taint doesn't guarantee that TPU resources are fully utilized. Instead, it keeps workloads that don't use TPUs on non-TPU nodes, which frees up compute on TPU nodes for code that uses TPUs.
  • GKE collects the logs emitted by containers running on TPU VMs. To learn more, see Logging.
  • TPU utilization metrics, such as runtime performance, are available in Cloud Monitoring. To learn more, see Observability and metrics.
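
The all-or-nothing rule for google.com/tpu requests can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not how the GKE scheduler is implemented. The function name unschedulable_requests is hypothetical, and a request of 0 stands for a Pod that uses no TPUs.

```python
def unschedulable_requests(chips_per_node: int, pod_requests: list[int]) -> list[int]:
    """Return the TPU requests that break the all-or-nothing rule.

    A Pod that requests TPUs must request exactly the number of chips on
    the node. A request of 0 means the Pod does not use TPUs at all.
    """
    return [r for r in pod_requests if r != 0 and r != chips_per_node]


# ct5l-hightpu-8t: a single node with 8 chips.
print(unschedulable_requests(8, [8]))     # one Pod asking for 8 chips
print(unschedulable_requests(8, [4, 4]))  # two Pods asking for 4 chips each

# ct5lp-hightpu-4t with a 2x4 topology: two nodes with 4 chips each.
print(unschedulable_requests(4, [8]))     # one Pod asking for 8 chips
print(unschedulable_requests(4, [4, 4]))  # two Pods asking for 4 chips each
```

An empty list means every request matches the node's chip count. A non-empty list contains the requests that cannot be placed on a node of that size.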

Machine type

Machine types that support TPU resources follow a naming convention that includes the TPU version and the number of chips per node, in the form ct<version>-hightpu-<node-chip-count>t. For example, the machine type ct5lp-hightpu-1t supports TPU v5e and contains one TPU chip.
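
As a rough illustration of the naming convention, the following Python sketch splits a machine type name into its TPU version and chip count. The helper parse_machine_type and the lookup table are hypothetical. The version tokens (4p, 5l, 5lp, 5p) are limited to the machine types that appear on this page.

```python
import re

# Pattern for the naming convention ct<version>-hightpu-<node-chip-count>t.
MACHINE_TYPE = re.compile(r"^ct(?P<version>[0-9a-z]+)-hightpu-(?P<chips>\d+)t$")

# Version tokens taken from the machine types listed on this page.
TPU_VERSIONS = {"4p": "TPU v4", "5l": "TPU v5e", "5lp": "TPU v5e", "5p": "TPU v5p"}


def parse_machine_type(name: str) -> tuple[str, int]:
    """Return (TPU version, chips per node) for a TPU machine type name."""
    match = MACHINE_TYPE.match(name)
    if match is None or match["version"] not in TPU_VERSIONS:
        raise ValueError(f"not a known TPU machine type: {name!r}")
    return TPU_VERSIONS[match["version"]], int(match["chips"])


print(parse_machine_type("ct5lp-hightpu-1t"))  # ('TPU v5e', 1)
print(parse_machine_type("ct4p-hightpu-4t"))   # ('TPU v4', 4)
```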

Types of TPU node pool

Based on the topology you define, GKE places the TPU workloads in one of the following node pool types:

  • Single-host TPU slice node pool: A node pool that contains one or more independent TPU VMs. In single-host TPU slice node pools, the TPUs attached to the VMs aren't interconnected by high-speed interconnects.
  • Multi-host TPU slice node pool: A node pool that contains two or more interconnected TPU VMs. This type of node pool is atomic and immutable, which means that you can't manually add nodes to the node pool. In case of machine failure or shutdown, GKE recreates the entire node pool as a new atomic unit.

Topology

The topology defines the physical arrangement of TPUs within a TPU slice. GKE provisions a TPU slice in two- or three-dimensional topologies depending on the TPU version. You specify a topology as the number of TPU chips in each dimension:

  • For TPU v4 and v5p scheduled in multi-host TPU slice node pools, you define the topology as a 3-tuple ({A}x{B}x{C}), for example 4x4x4. The product of {A}, {B}, and {C} is the number of chips in the node pool. For example, you can define topologies smaller than 64 chips with forms such as 2x2x2, 2x2x4, or 2x4x4. If you use topologies larger than 64 chips, the values you assign to {A}, {B}, and {C} must meet the following conditions:

    • {A}, {B}, and {C} are multiples of four.
    • The largest supported topology is 12x16x16 for v4 and 16x16x24 for v5p.
    • The assigned values keep the A ≤ B ≤ C pattern. For example, 4x4x8 or 8x8x8.
  • For TPU v5e, topologies follow a 2-tuple ({A}x{B}) format, for example 2x2. To define your TPU configuration, refer to the table in Mapping of TPU configuration.
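
The 3-tuple rules for TPU v4 and v5p can be expressed as a small check. The Python sketch below is an illustration under stated assumptions, not GKE's own validation. The function names are hypothetical, and treating the largest supported topology as a per-dimension upper bound is an assumption of this example.

```python
from math import prod

# Largest topologies quoted above. Treating them as per-dimension upper
# bounds is an assumption of this sketch.
MAX_TOPOLOGY = {"v4": (12, 16, 16), "v5p": (16, 16, 24)}


def parse_topology(topology: str) -> tuple[int, ...]:
    """Turn a string such as '4x4x8' into (4, 4, 8)."""
    return tuple(int(dim) for dim in topology.split("x"))


def check_3d_topology(version: str, topology: str) -> int:
    """Return the chip count of a v4 or v5p topology, or raise ValueError."""
    dims = parse_topology(topology)
    if len(dims) != 3 or min(dims) < 1:
        raise ValueError("expected a 3-tuple such as 4x4x4")
    chips = prod(dims)
    if chips > 64:  # the extra conditions apply only above 64 chips
        if any(dim % 4 for dim in dims):
            raise ValueError("dimensions must be multiples of four")
        if list(dims) != sorted(dims):
            raise ValueError("dimensions must satisfy A <= B <= C")
        if any(d > m for d, m in zip(dims, MAX_TOPOLOGY[version])):
            raise ValueError(f"larger than the {version} maximum")
    return chips


print(check_3d_topology("v4", "2x2x4"))   # 16
print(check_3d_topology("v5p", "4x4x8"))  # 128
```

A topology such as 8x4x4 has 128 chips but violates the A ≤ B ≤ C pattern, so the sketch rejects it with a ValueError.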

TPU availability in GKE

The following table lists the TPU availability depending on the machine type and version:

| TPU version | Machine type beginning with | Minimum GKE version | Availability | Zone |
|---|---|---|---|---|
| TPU v4 | ct4p- | 1.26.1-gke.1500 | Generally Available | us-central2-b |
| TPU v5e | ct5l- | 1.27.2-gke.2100 | Generally Available | europe-west4-b, us-central1-a |
| TPU v5e | ct5lp- | 1.27.2-gke.2100 | Generally Available | europe-west4-a [1], us-central1-a [1], us-east1-c, us-east5-b [1], us-west1-c, us-west4-a, us-west4-b [1] |
| TPU v5p | ct5p- | 1.28.3-gke.1024000 | Preview | us-east1-d, us-east5-a, us-east5-c |

  [1] When creating a TPU v5e with a machine type beginning with ct5lp- in any of the zones europe-west4-a, us-central1-a, us-east5-b, or us-west4-b, single-host TPU v5e node pools are not supported. In other words, when creating a TPU v5e node pool in any of these zones, only the machine type ct5lp-hightpu-4t with a topology of 2x4 or larger is supported. To create a single-host TPU v5e in us-central1 or europe-west4, choose the zones us-central1-a or europe-west4-b, respectively, and use machine types beginning with ct5l-, such as ct5l-hightpu-1t, ct5l-hightpu-4t, or ct5l-hightpu-8t. To create a single-host TPU v5e in the us-west4 region, choose the zone us-west4-a and use machine types beginning with ct5lp-, such as ct5lp-hightpu-1t. Note that machine types beginning with ct5l- require different quota than machine types beginning with ct5lp-.

Mapping of TPU configuration

Use the following table to define the TPU machine type and topology to use based on your use case:

  • For small-scale model training or inference, use TPU v4 or TPU v5e with single-host TPU slice node pools.
  • For large-scale model training or inference, use TPU v4 or TPU v5e with multi-host TPU slice node pools.
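
| TPU version | Machine type | Topology | Number of TPU chips | Number of VMs | Node pool type |
|---|---|---|---|---|---|
| TPU v4 | ct4p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host |
| TPU v4 | ct4p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host |
| TPU v4 | ct4p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host |
| TPU v4 | ct4p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host |
| TPU v4 | ct4p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host |
| TPU v5p | ct5p-hightpu-4t | 2x2x1 | 4 | 1 | Single-host |
| TPU v5p | ct5p-hightpu-4t | 2x2x2 | 8 | 2 | Multi-host |
| TPU v5p | ct5p-hightpu-4t | 2x2x4 | 16 | 4 | Multi-host |
| TPU v5p | ct5p-hightpu-4t | 2x4x4 | 32 | 8 | Multi-host |
| TPU v5p | ct5p-hightpu-4t | {A}x{B}x{C} | A*B*C | (A*B*C/4) [1] | Multi-host |
| TPU v5e | ct5l-hightpu-1t | 1x1 | 1 | 1 | Single-host |
| TPU v5e | ct5l-hightpu-4t | 2x2 | 4 | 1 | Single-host |
| TPU v5e | ct5l-hightpu-8t | 2x4 | 8 | 1 | Single-host |
| TPU v5e | ct5lp-hightpu-1t | 1x1 | 1 | 1 | Single-host |
| TPU v5e | ct5lp-hightpu-4t | 2x2 | 4 | 1 | Single-host |
| TPU v5e | ct5lp-hightpu-8t | 2x4 | 8 | 1 | Single-host |
| TPU v5e | ct5lp-hightpu-4t | 2x4 | 8 | 2 | Multi-host |
| TPU v5e | ct5lp-hightpu-4t | 4x4 | 16 | 4 | Multi-host |
| TPU v5e | ct5lp-hightpu-4t | 4x8 | 32 | 8 | Multi-host |
| TPU v5e | ct5lp-hightpu-4t | 8x8 | 64 | 16 | Multi-host |
| TPU v5e | ct5lp-hightpu-4t | 8x16 | 128 | 32 | Multi-host |
| TPU v5e | ct5lp-hightpu-4t | 16x16 | 256 | 64 | Multi-host |

  [1] Calculated by the topology product divided by four.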
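
The chip and VM counts in this mapping follow from simple arithmetic: the chip count is the product of the topology dimensions, and the VM count is that product divided by the number of chips per VM. A minimal Python sketch, with the hypothetical helper slice_size, reproduces a few rows:

```python
from math import prod


def slice_size(topology: str, chips_per_vm: int) -> tuple[int, int]:
    """Return (TPU chips, VMs) for a topology such as '2x4' or '2x2x4'."""
    chips = prod(int(dim) for dim in topology.split("x"))
    if chips % chips_per_vm:
        raise ValueError("topology does not divide evenly across VMs")
    return chips, chips // chips_per_vm


print(slice_size("2x4x4", 4))  # ct4p-hightpu-4t:  (32, 8)
print(slice_size("16x16", 4))  # ct5lp-hightpu-4t: (256, 64)
print(slice_size("2x4", 8))    # ct5l-hightpu-8t:  (8, 1)
```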

TPU v5e characteristics

TPU v5e machines have the following technical characteristics:

| Machine type | Number of vCPUs | Memory (GB) | Number of NUMA nodes | Likelihood of being preempted |
|---|---|---|---|---|
| ct5l-hightpu-1t | 24 | 48 | 1 | Higher |
| ct5l-hightpu-4t | 112 | 192 | 1 | Medium |
| ct5l-hightpu-8t | 224 | 384 | 2 | Lower |
| ct5lp-hightpu-1t | 24 | 48 | 1 | Higher |
| ct5lp-hightpu-4t | 112 | 192 | 1 | Medium |
| ct5lp-hightpu-8t | 224 | 384 | 1 | Low |

TPU v4 and v5p characteristics

TPU v4 and v5p machines have the following technical characteristics:

| Machine type | Number of vCPUs | Memory (GB) | Number of NUMA nodes |
|---|---|---|---|
| ct4p-hightpu-4t | 240 | 407 | 2 |
| ct5p-hightpu-4t | 208 | 448 | 2 |

TPU reservation

To make sure that TPU resources are available when you need them, you can use TPU reservations in the following scenarios:

  • If you have existing TPU reservations, you must work with your Google Cloud account team to migrate your TPU reservation to a new Compute Engine-based reservation system.

  • If you don't have an existing TPU reservation, you can create a TPU reservation and no migration is needed.

Autoscaling TPUs in GKE

GKE supports Tensor Processing Units (TPUs) to accelerate machine learning workloads. Both single-host and multi-host TPU slice node pools support autoscaling and auto-provisioning.

With the --enable-autoprovisioning flag on a GKE cluster, GKE creates or deletes single-host or multi-host TPU slice node pools with a TPU version and topology that meet the requirements of pending workloads.

When you use --enable-autoscaling, GKE scales the node pool based on its type, as follows:

  • Single-host TPU slice node pool: GKE adds or removes TPU nodes in the existing node pool. The node pool may contain any number of TPU nodes between zero and the maximum size of the node pool as determined by the --max-nodes and the --total-max-nodes flags. When the node pool scales, all the TPU nodes in the node pool have the same machine type and topology. To learn how to create a single-host TPU slice node pool, see Create a node pool.

  • Multi-host TPU slice node pool: GKE atomically scales up the node pool from zero to the number of nodes required to satisfy the TPU topology. For example, with a TPU node pool with a machine type ct5lp-hightpu-4t and a topology of 16x16, the node pool contains 64 nodes. The GKE autoscaler ensures that this node pool has exactly 0 or 64 nodes. When scaling back down, GKE evicts all scheduled Pods and drains the entire node pool to zero. To learn how to create a multi-host TPU slice node pool, see Create a node pool.
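
The difference between the two scaling behaviors can be pictured as the set of node counts each node pool type is allowed to have. The following Python sketch is a simplified model of the description above, not the autoscaler's actual logic. Both function names are hypothetical.

```python
from math import prod


def single_host_sizes(max_nodes: int) -> range:
    """Node counts a single-host node pool may hold: 0 up to max_nodes."""
    return range(0, max_nodes + 1)


def multi_host_sizes(topology: str, chips_per_node: int) -> tuple[int, int]:
    """Node counts a multi-host node pool may hold: 0 or the full slice."""
    chips = prod(int(dim) for dim in topology.split("x"))
    return (0, chips // chips_per_node)


print(list(single_host_sizes(3)))    # [0, 1, 2, 3]
print(multi_host_sizes("16x16", 4))  # (0, 64)
```

The second call mirrors the ct5lp-hightpu-4t example with a 16x16 topology: the node pool is either empty or holds all 64 nodes.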

Limitations

  • When using TPUs in GKE, SPECIFIC is the only supported value for the --reservation-affinity flag of gcloud container node-pools create.
  • TPUs aren't available in GKE Autopilot clusters.
  • GKE cost allocation and usage metering don't include any data about the usage or costs of reserved TPU v4.
  • TPU v5p and v5e don't support riptide/image streaming in us-east5.

Workload scheduling considerations

TPUs have unique characteristics that require special workload scheduling and management in Kubernetes. The following sections describe scheduling best practices.

CPU

To schedule a workload on the onboard CPU on a TPU VM, ensure that your GKE Pod can tolerate the google.com/tpu taint. If you want the workload to be deployed to specific nodes, use node selectors.

Kubernetes resource management and priority treat TPU VMs the same as other VM types. To give Pods that require TPUs scheduling priority over other Pods on the same nodes, request the maximum CPU or memory for those TPU Pods. Low-priority Pods should do the following:

  1. Set low CPU and memory requests to ensure that the node has enough allocatable resources for the TPU workloads. To learn more, see How Kubernetes applies resource requests and limits.
  2. Set no CPU limit (unlimited) to ensure that Pods can burst to use all unused cycles.
  3. Set a high memory limit to ensure that Pods can use most unused memory while maintaining node stability.

If a Kubernetes Pod doesn't request CPU and memory (even if it requests TPUs), then Kubernetes considers it a best-effort Pod, and there is no guarantee that it gets any CPU or memory. Only Pods that explicitly request CPU and memory have such guarantees. For more information, see Resource Management for Pods and Containers.

To learn more, see the Kubernetes best practices: Resource requests and limits.
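
To make the best-effort rule concrete, the following Python sketch approximates how Kubernetes assigns a quality of service class from CPU and memory settings only. It is a simplified model, not the kubelet's code: the function qos_class is hypothetical, quantities are compared as plain strings, and only the cpu and memory resources are considered, which is why a Pod that requests only google.com/tpu ends up as best-effort.

```python
QOS_RESOURCES = ("cpu", "memory")


def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' 'requests' and 'limits' dicts."""
    guaranteed = True
    any_set = False
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in QOS_RESOURCES:
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit, if one is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


tpu_only = [{"requests": {"google.com/tpu": "4"}, "limits": {"google.com/tpu": "4"}}]
low_priority = [{"requests": {"cpu": "100m", "memory": "128Mi"}, "limits": {"memory": "4Gi"}}]

print(qos_class(tpu_only))      # BestEffort
print(qos_class(low_priority))  # Burstable
```

The low_priority example follows the guidance for low-priority Pods above: low CPU and memory requests, no CPU limit, and a high memory limit. The specific quantities are illustrative values, not recommendations.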

Maximize uninterrupted workload runtime

If you are using TPUs to train a machine learning model and your workload is interrupted, all work performed since the last checkpoint is lost. To decrease the probability that your workload is interrupted, do the following:

  • Set a higher priority for this Job than for all other Jobs: If resources are scarce, the GKE scheduler preempts lower priority Jobs to schedule a higher priority Job. This also ensures that your higher priority workload receives all the resources that it needs (up to the total resources available in the cluster). To learn more, see Pod priority and preemption.
  • Configure maintenance exclusion: A maintenance exclusion is a non-repeating window of time during which automatic maintenance is forbidden. To learn more, see Maintenance exclusions.

Maximize TPU utilization

To maximize your investment in TPUs, schedule a mix of Job priorities and queue them to maximize the amount of time your TPUs are operating. If you want Job level scheduling and preemption, then you need to use an add-on to Kubernetes that orchestrates Jobs into queues. We recommend using Kueue for that use case.
