Deploy TPU workloads in GKE Standard


This page provides a foundation for learning how to accelerate machine learning (ML) workloads using TPUs in Google Kubernetes Engine (GKE). TPUs are designed for matrix processing, such as large-scale deep learning model training. Because TPUs are optimized to handle the enormous datasets and complex models of ML, their performance makes them more cost-effective and energy-efficient for ML workloads. In this guide, you learn how to deploy ML workloads by using Cloud TPU accelerators, configure quotas for TPUs, configure upgrades for node pools that run TPUs, and monitor TPU workload metrics.

This tutorial is intended for machine learning (ML) engineers and platform admins and operators who are interested in using Kubernetes container orchestration to manage large-scale model training, tuning, and inference workloads with TPUs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Plan your TPU configuration

Plan your TPU configuration based on your model and how much memory it requires. Before you use this guide to deploy your workloads on TPU, complete the planning steps in Plan your TPU configuration.

Ensure that you have TPU quota

The following sections help you ensure that you have enough quota when using TPUs in GKE.

Quota for on-demand or Spot VMs

If you are creating a TPU slice node pool with on-demand or Spot VMs, you must have sufficient TPU quota available in the region that you want to use.

Creating a TPU slice node pool that consumes a TPU reservation does not require any TPU quota.¹ You can safely skip this section for reserved TPUs.

Creating an on-demand or Spot TPU slice node pool in GKE requires Compute Engine API quota. Compute Engine API quota (compute.googleapis.com) is not the same as Cloud TPU API quota (tpu.googleapis.com), which is needed when creating TPUs with the Cloud TPU API.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console:

    Go to Quotas

  2. In the Filter box, do the following:

    1. Select the Service property, enter Compute Engine API, and press Enter.

    2. Select the Type property and choose Quota.

    3. Select the Name property and enter the name of the quota based on the TPU version and machine type. For example, if you plan to create on-demand TPU v5e nodes whose machine type begins with ct5lp-, enter TPU v5 Lite PodSlice chips.

      TPU version | Machine type begins with | Quota name for on-demand instances | Quota name for Spot² instances
      TPU v3 | ct3- | TPU v3 Device chips | Preemptible TPU v3 Device chips
      TPU v3 | ct3p- | TPU v3 PodSlice chips | Preemptible TPU v3 PodSlice chips
      TPU v4 | ct4p- | TPU v4 PodSlice chips | Preemptible TPU v4 PodSlice chips
      TPU v5e | ct5l- | TPU v5 Lite Device chips | Preemptible TPU v5 Lite Device chips
      TPU v5e | ct5lp- | TPU v5 Lite PodSlice chips | Preemptible TPU v5 Lite PodSlice chips
      TPU v5p | ct5p- | TPU v5p chips | Preemptible TPU v5p chips
      TPU Trillium | ct6e- | TPU v6e Slice chips | Preemptible TPU v6e Lite PodSlice chips
    4. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota increase.

When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose machine type begins with ct5lp-, then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.

  1. When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to create a reserved instance. TPU reservations are available when purchasing a commitment.

  2. When creating a TPU slice node pool, use the --spot flag to create a Spot instance.

Quotas for additional GKE resources

You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.

  • Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100GB (nodes * 100GB).
  • In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
  • Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node value of 32 requires 64 IP addresses, which translates to a /26 subnet per node. This range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods allowed to be scheduled on a node, and ensure that the subnet's secondary ranges can accommodate that many Pods on the maximum number of GKE nodes you anticipate creating.
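As a rough sizing aid, the quota minimums above can be computed from the node count you anticipate. A minimal sketch (the 100 GB default boot disk size and the two-IPs-per-Pod rule come from the text above; the helper name is illustrative):

```python
import math

def gke_quota_estimates(max_nodes: int, max_pods_per_node: int = 32,
                        boot_disk_gb: int = 100) -> dict:
    """Estimate regional quota minimums for a GKE Standard cluster.

    Each node needs one in-use IP address and a 100 GB boot disk by
    default, and reserves twice max-pods-per-node secondary IPs for Pods
    (rounded up to a power of two for the per-node subnet).
    """
    pod_ips_per_node = 2 * max_pods_per_node            # e.g. 32 Pods -> 64 IPs
    # Smallest prefix that holds pod_ips_per_node addresses: /26 for 64.
    pod_range_prefix = 32 - math.ceil(math.log2(pod_ips_per_node))
    return {
        "persistent_disk_ssd_gb": max_nodes * boot_disk_gb,
        "in_use_ip_addresses": max_nodes,
        "per_node_pod_range": f"/{pod_range_prefix}",
    }

print(gke_quota_estimates(max_nodes=16))
```

For 16 anticipated nodes this yields 1,600 GB of Persistent Disk SSD quota, 16 in-use IP addresses, and a /26 Pod range per node.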

To request an increase in quota, see Request higher quota.

Ensure reservation availability

Creating a reserved TPU slice node pool, which consumes a reservation, does not require any TPU quota. However, the reservation must have enough available or unused TPU chips when the node pool is created.

To see which reservations exist within a project, view a list of your reservations.

To view how many TPU chips within a TPU reservation are available, view the details of a reservation.
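Both checks can be performed from the gcloud CLI by using the Compute Engine reservations commands; this is a sketch in which RESERVATION_NAME and ZONE are placeholders you replace with your own values:

```shell
# List the reservations in the current project.
gcloud compute reservations list

# View a reservation's details, including how many instances are in use,
# to determine how much capacity remains available.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE
```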

Options for provisioning TPUs in GKE

GKE lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors in your workload manifest or by creating Standard mode node pools with TPUs.

Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.

For instructions, see the Provision TPUs using custom compute classes section.

Create a cluster

Create a GKE cluster in Standard mode in a region with available TPUs.

Best practice:

Use regional clusters, which provide high availability of the Kubernetes control plane.

gcloud container clusters create CLUSTER_NAME \
  --location LOCATION \
  --cluster-version VERSION

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • LOCATION: the region with your TPU capacity available.
  • VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. To learn the minimum GKE version available for each TPU machine type, see TPU availability in GKE.

Create a node pool

Single-host TPU slice

You can create a single-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create NODE_POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONES \
    --machine-type=MACHINE_TYPE

Replace the following:

  • NODE_POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
  • CLUSTER_NAME: The name of the cluster.
  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of machine to use for nodes. For more information about TPU compatible machine types, use the table in Choose the TPU version.

Optionally, you can also use the following flags:

  • --num-nodes=NUM_NODES: The initial number of nodes in the node pool in each zone. If you omit this flag, GKE assigns the default of 3.

    Best practice:

    If you use the --enable-autoscaling flag for the node pool, set --num-nodes to 0 so that the autoscaler provisions additional nodes as soon as your workloads demand them.

  • --reservation=RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.

  • --node-labels cloud.google.com/gke-workload-type=HIGH_AVAILABILITY: Tells GKE that the single-host TPU slice node pool is part of a collection. Use this flag if the following conditions apply:

    • The node pool runs inference workloads.
    • The node pool uses TPU Trillium.
    • The node pool doesn't use Spot VMs.

    To learn more about collection scheduling management, see Manage collection scheduling in single-host TPU slices.

  • --enable-autoscaling: Create a node pool with autoscaling enabled. Requires the following additional flags:

    • --total-min-nodes=TOTAL_MIN_NODES: Minimum number of all nodes in the node pool.
    • --total-max-nodes=TOTAL_MAX_NODES: Maximum number of all nodes in the node pool.
    • --location-policy=ANY: Prioritize usage of unused reservations and reduce the preemption risk of Spot VMs.
  • --spot: Sets the node pool to use Spot VMs for the nodes in the node pool. This cannot be changed after node pool creation.

For a full list of all the flags that you can specify, see the gcloud container node-pools create reference.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:
resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
  provider           = google
  project            = PROJECT_ID
  cluster            = CLUSTER_NAME
  name               = POOL_NAME
  location           = CLUSTER_LOCATION
  node_locations     = [NODE_ZONES]

  node_config {
    machine_type = MACHINE_TYPE
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION"
      key = "compute.googleapis.com/reservation-name"
      values = [RESERVATION_LABEL_VALUES]
    }
    spot = true
  }
}

Replace the following:

  • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
  • PROJECT_ID: Your project ID.
  • CLUSTER_NAME: The name of the existing cluster.
  • POOL_NAME: The name of the node pool to create.
  • CLUSTER_LOCATION: The compute location of the cluster. Specify the region where the TPU version is available. To learn more, see Select a TPU version and topology.
  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of TPU machine to use. To see TPU compatible machine types, use the table in Choose the TPU version.

Optionally, you can also use the following variables:

  • autoscaling: Create a node pool with autoscaling enabled. For single-host TPU slices, GKE scales between the TOTAL_MIN_NODES and TOTAL_MAX_NODES values.
    • TOTAL_MIN_NODES: Minimum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
    • TOTAL_MAX_NODES: Maximum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
  • RESERVATION_NAME: If you use TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
  • spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use. To identify an available zone, see TPU availability in GKE.

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3: TPU v3, single host device
    • CT3P: TPU v3, multi host pod slice
    • CT4P: TPU v4
    • CT5LP: TPU v5e
    • CT5P: TPU v5p
    • CT6E: TPU Trillium (v6e)
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a single-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Multi-host TPU slice

You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONE \
    --machine-type=MACHINE_TYPE \
    --tpu-topology=TPU_TOPOLOGY \
    --num-nodes=NUM_NODES \
    [--spot] \
    [--enable-autoscaling --max-nodes=MAX_NODES] \
    [--reservation-affinity=specific --reservation=RESERVATION_NAME]

Replace the following:

  • POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
  • CLUSTER_NAME: The name of the cluster.
  • NODE_ZONE: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of machine to use for nodes. To learn more about the available machine types, see Choose the TPU version.
  • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more about TPU topologies, use the table in Choose a topology.

    To learn more, see Topology.

  • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.
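The NUM_NODES arithmetic above can be sketched as a small helper (the function name is illustrative; the default of four chips per VM applies to multi-host TPU v4 and TPU v5e, as stated above):

```python
from math import prod

def num_nodes(tpu_topology: str, chips_per_vm: int = 4) -> int:
    """Compute the node count for a multi-host TPU slice node pool.

    The product of the topology dimensions ({A}x{B}x{C}) is the total
    chip count; dividing by the chips in each VM gives the node count.
    """
    dims = [int(d) for d in tpu_topology.split("x")]
    total_chips = prod(dims)
    if total_chips % chips_per_vm:
        raise ValueError("topology chip count must divide evenly by chips per VM")
    return total_chips // chips_per_vm

print(num_nodes("2x4x4"))  # TPU v4 topology from the example above -> 8
```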

Optionally, you can also use the following flags:

  • --reservation=RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.
  • --spot: Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
  • --enable-autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
    • MAX_NODES: The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:

    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider           = google
      project            = PROJECT_ID
      cluster            = CLUSTER_NAME
      name               = POOL_NAME
      location           = CLUSTER_LOCATION
      node_locations     = [NODE_ZONES]
      initial_node_count = NUM_NODES
    
      autoscaling {
        max_node_count = MAX_NODES
        location_policy      = "ANY"
      }
      node_config {
        machine_type = MACHINE_TYPE
        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key = "compute.googleapis.com/reservation-name"
          values = [RESERVATION_LABEL_VALUES]
        }
        spot = true
      }
    
      placement_policy {
        type = "COMPACT"
        tpu_topology = TPU_TOPOLOGY
      }
    }
    

    Replace the following:

    • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
    • PROJECT_ID: Your project ID.
    • CLUSTER_NAME: The name of the existing cluster to add the node pool to.
    • POOL_NAME: The name of the node pool to create.
    • CLUSTER_LOCATION: Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology.
    • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
    • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the number of TPU chips divided by four, because in multi-host TPU slices each TPU slice node has four chips. For example, if TPU_TOPOLOGY is 4x8, then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Choose the TPU version.
    • TPU_TOPOLOGY: This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using. To learn more about TPU topologies, use the table in Choose a topology.

    Optionally, you can also use the following variables:

    • RESERVATION_NAME: If you use a TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
    • autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
      • MAX_NODES: The maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
    • spot: Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3P: For TPU v3.
    • CT4P: For TPU v4.
    • CT5LP: For TPU v5e.
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a multi-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Provisioning state

If GKE cannot create your TPU slice node pool due to insufficient TPU capacity available, GKE returns an error message indicating the TPU slice nodes cannot be created due to lack of capacity.

If you are creating a single-host TPU slice node pool, the error message looks similar to this:

2 nodes cannot be created due to lack of capacity. The missing nodes will be
created asynchronously once capacity is available. You can either wait for the
nodes to be up, or delete the node pool and try re-creating it again later.

If you are creating a multi-host TPU slice node pool, the error message looks similar to this:

The nodes (managed by ...) cannot be created now due to lack of capacity. They
will be created asynchronously once capacity is available. You can either wait
for the nodes to be up, or delete the node pool and try re-creating it again
later.

Your TPU provisioning request can stay in the queue for a long time and remains in the "Provisioning" state while in the queue.

Once capacity is available, GKE creates the remaining nodes that were not created.

If you need capacity sooner, consider trying Spot VMs, though note that Spot VMs consume different quota than on-demand instances.

You can delete the queued TPU request by deleting the TPU slice node pool.

Run your workload on TPU slice nodes

Workload preparation

TPU workloads have the following preparation requirements.

  1. Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To use TPUs in GKE, ensure that you use a compatible version for your TPU type:
    • TPU Trillium (v6e): tpu-v6e-slice
    • TPU v5e: tpu-v5-lite-podslice, tpu-v5-lite-device
    • TPU v5p: tpu-v5p-slice
    • TPU v4: tpu-v4-podslice
    • TPU v3: tpu-v3-slice, tpu-v3-device

    Recommended library versions:
    • jax[tpu]: version 0.4.19 or later.
    • torchxla[tpuvm]: a nightly version build from October 23, 2023.
  2. Set the following environment variables for the container requesting the TPU resources:
    • TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker-id in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
    • TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list is ordered and zero-indexed by TPU_WORKER_ID.
    • GKE automatically injects these environment variables by using a mutating webhook when a Job is created with completionMode: Indexed, a subdomain, parallelism greater than 1, and google.com/tpu resource requests. GKE adds a headless Service so that DNS records are added for the Pods backing the Service.

      When deploying TPU multi-host resources with KubeRay, GKE provides a deployable webhook as part of the Terraform templates for running Ray on GKE. Instructions for running Ray on GKE with TPUs can be found in the TPU User Guide. The mutating webhook injects these environment variables into Ray clusters requesting google.com/tpu properties and a multi-host cloud.google.com/gke-tpu-topology node selector.

    • In your workload manifest, add Kubernetes node selectors to ensure that GKE schedules your TPU workload on the TPU machine type and TPU topology you defined:

        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
          cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
        

      Replace the following:

      • TPU_ACCELERATOR: The name of the TPU accelerator.
      • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more, see Plan TPUs in GKE.
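Inside the container, a worker can read the injected variables to find its ID and its peers. A minimal sketch (TPU_WORKER_ID and TPU_WORKER_HOSTNAMES are the variables GKE injects, as described above; the parsing helper itself is illustrative):

```python
import os

def tpu_slice_peers(environ=os.environ):
    """Return (worker_id, ordered peer hostnames) from GKE-injected variables.

    TPU_WORKER_HOSTNAMES is a comma-separated list, zero-indexed by
    TPU_WORKER_ID, so hostnames[worker_id] is this worker's own entry.
    """
    worker_id = int(environ["TPU_WORKER_ID"])
    hostnames = environ["TPU_WORKER_HOSTNAMES"].split(",")
    if not 0 <= worker_id < len(hostnames):
        raise ValueError("TPU_WORKER_ID out of range for TPU_WORKER_HOSTNAMES")
    return worker_id, hostnames

# Example with hypothetical values, as they would appear inside a Pod:
wid, hosts = tpu_slice_peers({"TPU_WORKER_ID": "1",
                              "TPU_WORKER_HOSTNAMES": "tpu-0,tpu-1,tpu-2,tpu-3"})
print(wid, hosts[wid])  # this worker's own hostname
```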

After you complete the workload preparation, you can run a Job that uses TPUs.

The following sections show examples of how to run a Job that performs a simple computation with TPUs.

Example 1: Run a workload that displays the number of available TPU chips in a TPU slice node pool

The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:

  • TPU version: TPU v4
  • Topology: 2x2x4

This combination of version and topology results in a multi-host slice.

  1. Save the following manifest as available-chips-multihost.yaml:
    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-available-chips
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-available-chips
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x4
          containers:
          - name: tpu-job
            image: python:3.10
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            securityContext:
              privileged: true
            command:
            - bash
            - -c
            - |
              pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
              limits:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
  2. Deploy the manifest:
    kubectl create -f available-chips-multihost.yaml
    

    GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.

  3. Verify that the Job created four Pods:
    kubectl get pods
    

    The output is similar to the following:

    NAME                       READY   STATUS      RESTARTS   AGE
    tpu-job-podslice-0-5cd8r   0/1     Completed   0          97s
    tpu-job-podslice-1-lqqxt   0/1     Completed   0          97s
    tpu-job-podslice-2-f6kwh   0/1     Completed   0          97s
    tpu-job-podslice-3-m8b5c   0/1     Completed   0          97s
    
  4. Get the logs of one of the Pods:
    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of one of the created Pods. For example, tpu-job-podslice-0-5cd8r.

    The output is similar to the following:

    TPU cores: 16
    

Example 2: Run a workload that displays the number of available TPU chips in the TPU slice

The following workload is a static Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:

  • TPU version: TPU v5e
  • Topology: 2x4

This combination of version and topology results in a single-host slice.

  1. Save the following manifest as available-chips-singlehost.yaml:
    apiVersion: v1
    kind: Pod
    metadata:
      name: tpu-job-jax-v5
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x4
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("Total TPU chips:", jax.device_count())'
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
  2. Deploy the manifest:
    kubectl create -f available-chips-singlehost.yaml
    

    GKE provisions a single-host TPU slice node that uses TPU v5e. The node has eight TPU chips (single-host TPU slice).

  3. Get the logs of the Pod:
    kubectl logs tpu-job-jax-v5
    

    The output is similar to the following:

    Total TPU chips: 8
    

Upgrade node pools using accelerators (GPUs and TPUs)

GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels, maintenance windows and exclusions, and rollout sequencing.

You can also configure a node upgrade strategy for your node pool, such as surge upgrades or blue-green upgrades. By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools, instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE.

Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources—for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs—see Upgrade in a resource-constrained environment.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this guide, consider deleting the TPU slice node pools that no longer have scheduled workloads. If running workloads must be gracefully terminated, use kubectl drain to clean up the workloads before you delete the node.
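The drain-before-delete flow can be sketched as follows; POOL_NAME and NODE_NAME are placeholders, and cloud.google.com/gke-nodepool is the standard GKE node label for the pool a node belongs to:

```shell
# List the nodes that belong to the TPU slice node pool.
kubectl get nodes -l cloud.google.com/gke-nodepool=POOL_NAME

# Gracefully evict the workloads from each node before deleting the pool.
kubectl drain NODE_NAME \
    --ignore-daemonsets \
    --delete-emptydir-data
```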

  1. Delete a TPU slice node pool:

    gcloud container node-pools delete POOL_NAME \
        --location=LOCATION \
        --cluster=CLUSTER_NAME
    

    Replace the following:

    • POOL_NAME: The name of the node pool.
    • CLUSTER_NAME: The name of the cluster.
    • LOCATION: The compute location of the cluster.

Additional configurations

The following sections describe the additional configurations you can apply to your TPU workloads.

Manage collection scheduling

In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.

Use the following tasks to manage single-host TPU slice node pools.

  • To check if a single-host TPU slice pool has collection scheduling enabled, run the following command:

    gcloud container node-pools describe NODE_POOL_NAME \
        --cluster CLUSTER_NAME \
        --project PROJECT_NAME \
        --format="json" | jq -r '.config.labels["cloud.google.com/gke-workload-type"]'
    

    The output is similar to the following:

    HIGH_AVAILABILITY
    

    If the single-host TPU slice node pool is part of a collection, it has the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label, and the command prints the label's value.

  • To scale up the collection, resize the node pool manually or automatically with node auto-provisioning.

  • To scale down the collection, delete the node pool.

  • To delete the collection, remove all of the attached node pools. You can delete the node pool or delete the cluster. Deleting the cluster removes all of the collections in it.
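The manual scale-up task above can be sketched with the standard node pool resize command (all values are placeholders):

```shell
gcloud container node-pools resize NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --num-nodes=NUM_NODES
```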

Multislice

You can aggregate smaller slices together in a Multislice to handle larger training workloads. For more information, see Multislice TPUs in GKE.

Migrate your TPU reservation

If you have existing TPU reservations, you must first migrate them to the Compute Engine-based reservation system. Alternatively, you can create new Compute Engine-based reservations, which require no migration. To learn how to migrate your TPU reservations, see TPU reservation.

Logging

Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent and sent to Cloud Logging, where you can view them.

Use GKE node auto-provisioning

You can configure GKE to automatically create and delete node pools to meet the resource demands of your TPU workloads. For more information, see Configuring Cloud TPUs.

Provision TPUs by using custom compute classes

You can also use custom compute classes to configure GKE to request TPUs during scaling operations that create new nodes.

You can specify TPU configuration options in your custom compute class specification. When a GKE workload uses that custom compute class, GKE attempts to provision TPUs that use your specified configuration when scaling up.

To provision TPUs with a custom compute class, do the following:

  1. Ensure that your cluster has an available custom compute class that selects TPUs. To learn how to specify TPUs in custom compute classes, see TPU rules.

  2. Save the following manifest as tpu-job.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-job
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-job
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/compute-class: TPU_CLASS_NAME
          containers:
          - name: tpu-job
            image: python:3.10
            ports:
            - containerPort: 8471 # Default port that TPU VMs use to communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            command:
            - bash
            - -c
            - |
              pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: 500Gi
                google.com/tpu: NUMBER_OF_CHIPS
              limits:
                cpu: 10
                memory: 500Gi
                google.com/tpu: NUMBER_OF_CHIPS
    

    Replace the following:

    • TPU_CLASS_NAME: the name of the existing custom compute class that specifies TPUs.
    • NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests, equal to the value in the tpu.count field in the selected custom compute class.
  3. Deploy the Job:

    kubectl create -f tpu-job.yaml
    

When you create this Job, GKE automatically does the following:

  • Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. If TPU resources aren't available in the top-priority configuration, GKE might fall back to lower priorities to maximize obtainability.
  • Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.

To learn more, see About custom compute classes.
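For reference, a custom compute class that selects TPUs might look like the following sketch. The field values are illustrative; see TPU rules for the supported fields:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-class                  # matches TPU_CLASS_NAME in the Job manifest
spec:
  priorities:
  - tpu:
      type: tpu-v5-lite-podslice   # illustrative TPU type
      count: 4                     # must equal google.com/tpu in the Job
      topology: 2x2                # illustrative topology
  nodePoolAutoCreation:
    enabled: true
```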

TPU slice node auto repair

If a TPU slice node in a multi-host TPU slice node pool is unhealthy, GKE recreates the entire node pool. In contrast, in a single-host TPU slice node pool, only the unhealthy TPU node is auto-repaired.

Conditions that result in unhealthy TPU slice nodes include the following:

  • Any TPU slice node with common node conditions.
  • Any TPU slice node with an unallocatable TPU count larger than zero.
  • Any VM instance in a TPU slice that is stopped (due to preemption) or is terminated.
  • Node maintenance: If any TPU slice node within a multi-host TPU slice node pool goes down for host maintenance, GKE recreates the entire TPU slice node pool.

You can see the repair status (including the failure reason) in the operation history. If the failure is caused by insufficient quota, contact your Google Cloud account representative to increase the corresponding quota.

Configure TPU slice node graceful termination

In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, TPU slice nodes support SIGTERM signals that alert the node of an imminent shutdown. You can configure this imminent-shutdown notification to be sent up to five minutes before the node shuts down.

To configure GKE to terminate your workloads gracefully within this notification timeframe, follow the steps in Manage GKE node disruption for GPUs and TPUs.
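Inside the workload, the training process can trap SIGTERM and checkpoint before exiting. The following is a minimal Python sketch; the loop and the checkpoint step are illustrative, not part of any GKE API:

```python
import signal

# Set when GKE signals an imminent node shutdown.
shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True

# Register the handler so SIGTERM flips the flag instead of killing the process.
signal.signal(signal.SIGTERM, handle_sigterm)

def training_loop(max_steps):
    """Run training steps, stopping early if a shutdown was requested."""
    for step in range(max_steps):
        if shutdown_requested:
            # Save a checkpoint here, then exit cleanly.
            return step
        # ... one training step ...
    return max_steps
```

Make sure the Pod's terminationGracePeriodSeconds is long enough for the checkpoint to complete.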

Run containers without privileged mode

Containers running in nodes in GKE version 1.28 or later don't need to have privileged mode enabled to access TPUs. Nodes running GKE versions earlier than 1.28 require privileged mode.

If your TPU slice node runs a version earlier than 1.28, the following applies:

A container running on a VM in a TPU slice needs access to higher limits on locked memory so the driver can communicate with the TPU chips over direct memory access (DMA). To enable this, you must configure a higher ulimit. If you want to reduce the permission scope on your container, complete the following steps:

  1. Edit the securityContext to include the following fields:

    securityContext:
      capabilities:
        add: ["SYS_RESOURCE"]
    
  2. Increase ulimit by running the following command inside the container before setting up your workloads to use TPU resources:

    ulimit -l 68719476736
    

For TPU v5e, running containers without privileged mode is available in clusters in version 1.27.4-gke.900 and later.
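Putting both steps together, the container spec might look like the following sketch. The image and the workload command are illustrative:

```yaml
containers:
- name: tpu-container
  image: python:3.10         # illustrative image
  securityContext:
    capabilities:
      add: ["SYS_RESOURCE"]  # allows raising the locked-memory ulimit
  command:
  - bash
  - -c
  - |
    ulimit -l 68719476736    # raise locked memory before the workload starts
    python train.py          # hypothetical workload entry point
```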

Observability and metrics

Dashboard

In the Kubernetes Clusters page in the Google Cloud console, the Observability tab displays the TPU observability metrics. For more information, see GKE observability metrics.

The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.

Runtime metrics

In GKE version 1.27.4-gke.900 or later, TPU workloads that use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:

  • Duty cycle: Percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. Larger percentage means better TPU utilization.
  • Memory used: Amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
  • Memory total: Total accelerator memory in bytes. Sampled every 60 seconds.
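As an arithmetic sketch of how these percentages relate to the raw values (illustrative only, not the exporter's actual code):

```python
def duty_cycle_percent(active_seconds, sample_period_seconds=60):
    """Share of the sampling period the TensorCores were actively processing."""
    return 100.0 * active_seconds / sample_period_seconds

def memory_used_percent(memory_used_bytes, memory_total_bytes):
    """Share of total accelerator memory currently allocated."""
    return 100.0 * memory_used_bytes / memory_total_bytes

# Example: TensorCores busy for 45 of the last 60 seconds on a 16 GiB chip
# with 12 GiB allocated.
print(duty_cycle_percent(45))                       # 75.0
print(memory_used_percent(12 * 2**30, 16 * 2**30))  # 75.0
```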

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/duty_cycle
  • kubernetes.io/container/accelerator/memory_used
  • kubernetes.io/container/accelerator/memory_total

Kubernetes node:

  • kubernetes.io/node/accelerator/duty_cycle
  • kubernetes.io/node/accelerator/memory_used
  • kubernetes.io/node/accelerator/memory_total

Host metrics

In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:

  • TensorCore utilization: Current percentage of the TensorCore that is utilized. The TensorCore comprises the matrix-multiply units (MXUs) and the vector unit. The TensorCore utilization value is computed by dividing the number of TensorCore operations performed over the past sample period (60 seconds) by the maximum number of TensorCore operations supported over the same period. A larger value means better utilization.
  • Memory bandwidth utilization: Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60s) by the maximum supported bandwidth over the same sample period.

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/tensorcore_utilization
  • kubernetes.io/container/accelerator/memory_bandwidth_utilization

Kubernetes node:

  • kubernetes.io/node/accelerator/tensorcore_utilization
  • kubernetes.io/node/accelerator/memory_bandwidth_utilization

For more information, see Kubernetes metrics and GKE system metrics.

Known issues

  • Cluster autoscaler might incorrectly calculate capacity for new TPU slice nodes before those nodes report available TPUs. Cluster autoscaler might then perform additional scale-up and create more nodes than needed. If the extra nodes aren't needed, cluster autoscaler removes them during a regular scale-down operation.
  • Cluster autoscaler cancels the scale-up of TPU slice node pools that remain in a waiting status for more than 10 hours, and retries such scale-up operations later. This behavior might reduce TPU obtainability for customers who don't use reservations.
  • Non-TPU workloads that tolerate the TPU taint can prevent scale-down of the node pool if they are recreated while the TPU slice node pool is being drained.
  • Memory bandwidth utilization metric is not available for v5e TPUs.

What's next