Deploy TPU workloads in GKE Standard


This page shows you how to request and deploy workloads that use Cloud TPU accelerators (TPUs) in Google Kubernetes Engine (GKE).

Before you configure and deploy TPU workloads in GKE, you should be familiar with the following concepts:

  1. Introduction to Cloud TPU.
  2. Cloud TPU system architecture.
  3. About TPUs in GKE.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

TPU availability in GKE

Use GKE to create and manage node pools with TPUs. You can use these purpose-built accelerators to perform large-scale AI model training, tuning, and inference.

See a list of supported TPU versions in GKE.

Plan your TPU configuration

Plan your TPU configuration based on your machine learning model and how much memory it requires. Planning involves the following steps:

  1. Select a TPU version and topology.
  2. Select the type of node pool to use.

Ensure sufficient quota for on-demand or Spot VMs

If you are creating a TPU node pool with on-demand or Spot VMs, you must have sufficient TPU quota available in the region that you want to use.

Creating a TPU node pool that consumes a TPU reservation does not require any TPU quota (see note 1 at the end of this section). You may safely skip this section for reserved TPUs.

Creating an on-demand or Spot TPU node pool in GKE requires Compute Engine API quota. Compute Engine API quota (compute.googleapis.com) is not the same as Cloud TPU API quota (tpu.googleapis.com), which is needed when creating TPUs with the Cloud TPU API.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console:

    Go to Quotas

  2. In the Filter box, do the following:

    1. Select the Service property, enter Compute Engine API, and press Enter.

    2. Select the Type property and choose Quota.

    3. Select the Name property and enter the name of the quota based on the TPU version and machine type. For example, if you plan to create on-demand TPU v5e nodes whose machine type begins with ct5lp-, enter TPU v5 Lite PodSlice chips.

      | TPU version | Machine type begins with | Name of the quota for on-demand instances | Name of the quota for Spot instances (see note 2) |
      | TPU v4      | ct4p-                    | TPU v4 PodSlice chips                     | Preemptible TPU v4 PodSlice chips                 |
      | TPU v5e     | ct5l-                    | TPU v5 Lite Device chips                  | Preemptible TPU v5 Lite Device chips              |
      | TPU v5e     | ct5lp-                   | TPU v5 Lite PodSlice chips                | Preemptible TPU v5 Lite PodSlice chips            |
      | TPU v5p     | ct5p-                    | TPU v5p chips                             | Preemptible TPU v5p chips                         |

    4. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the desired region, and you must request a TPU quota increase.
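The filter values used in the preceding steps can also be derived from a machine type and zone. The following is a minimal Python sketch, not part of any Google Cloud tooling: the quota names come from the table above, while the helper names (`quota_name`, `quota_dimension`) and the example machine type are illustrative assumptions.

```python
# Quota names copied from the table above, keyed by machine type prefix.
# The helper functions are hypothetical conveniences, not a Google Cloud API.
QUOTA_NAMES = {
    "ct4p-": "TPU v4 PodSlice chips",
    "ct5l-": "TPU v5 Lite Device chips",
    "ct5lp-": "TPU v5 Lite PodSlice chips",
    "ct5p-": "TPU v5p chips",
}


def quota_name(machine_type: str, spot: bool = False) -> str:
    """Return the quota name to enter in the Name filter for a machine type."""
    for prefix, name in QUOTA_NAMES.items():
        if machine_type.startswith(prefix):
            # Spot quotas use the same name with a "Preemptible" prefix.
            return f"Preemptible {name}" if spot else name
    raise ValueError(f"unknown TPU machine type prefix: {machine_type!r}")


def quota_dimension(zone: str) -> str:
    """TPU quota is regional, so drop the zone suffix from the zone name."""
    region = zone.rsplit("-", 1)[0]
    return f"region:{region}"


if __name__ == "__main__":
    # Example machine type name is an assumption used for illustration.
    print(quota_name("ct5lp-hightpu-4t"))   # TPU v5 Lite PodSlice chips
    print(quota_dimension("us-west4-a"))    # region:us-west4
```

Because the quota is regional, any zone in us-west4 maps to the same `region:us-west4` filter value.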

  1. When creating a TPU node pool, use the --reservation and --reservation-affinity=specific flags to create a reserved instance. TPU reservations are available when purchasing a commitment.

  2. When creating a TPU node pool, use the --spot flag to create a Spot instance.

Ensure reservation availability

Creating a reserved TPU node pool, which means a TPU node pool that consumes a reservation, does not require any TPU quota. However, the reservation must have sufficient available or unused chips at the time the node pool is created.

To see which reservations exist within a project, view a list of your reservations.

To view how many chips within a TPU reservation are available, view the details of a reservation.

Create a cluster

Create a GKE cluster in Standard mode in a region with available TPUs. We recommend that you use regional clusters, which provide high availability of the Kubernetes control plane. You can use the Google Cloud CLI or the Google Cloud console.

gcloud container clusters create CLUSTER_NAME \
  --location LOCATION \
  --cluster-version VERSION

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • LOCATION: the region with your TPU capacity available.
  • VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. To learn the minimum GKE version available for each TPU machine type, see TPU availability in GKE.

Create a node pool

Single-host TPU slice

You can create a single-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONES \
    --machine-type=MACHINE_TYPE \
    [--num-nodes=NUM_NODES \]
    [--spot \]
    [--enable-autoscaling \]
    [--reservation-affinity=specific \
    --reservation=RESERVATION_NAME \]
    [--total-min-nodes TOTAL_MIN_NODES \]
    [--total-max-nodes TOTAL_MAX_NODES \]
    [--location-policy=ANY]

Replace the following:

  • POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • For TPU v5e machine types beginning with ct5l-, use us-central1-a or europe-west4-b.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p, use us-east1-d, us-east5-a, or us-east5-c.

    To learn more, see TPU availability in GKE.

  • CLUSTER_NAME: The name of the cluster.

  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.

  • MACHINE_TYPE: The type of machine to use for nodes. For more information about TPU compatible machine types, use the table in Mapping of TPU configuration.

Optionally, you can also use the following flags:

  • NUM_NODES: The initial number of nodes in the node pool in each zone. If you omit this flag, the default is 3. If autoscaling is enabled for the node pool using the --enable-autoscaling flag, we recommend that you set NUM_NODES to 0, since the autoscaler provisions additional nodes as soon as your workload demands them.
  • RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.
  • --enable-autoscaling: Create a node pool with autoscaling enabled.
    • TOTAL_MIN_NODES: Minimum number of all nodes in the node pool. Omit this field unless autoscaling is also specified.
    • TOTAL_MAX_NODES: Maximum number of all nodes in the node pool. Omit this field unless autoscaling is also specified.
  • --spot: Sets the node pool to use Spot VMs for the nodes in the node pool. This cannot be changed after node pool creation.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:
resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
  provider           = google
  project            = PROJECT_ID
  cluster            = CLUSTER_NAME
  name               = POOL_NAME
  location           = CLUSTER_LOCATION
  node_locations     = [NODE_ZONES]
  initial_node_count = NUM_NODES
  autoscaling {
    total_min_node_count = TOTAL_MIN_NODES
    total_max_node_count = TOTAL_MAX_NODES
    location_policy      = "ANY"
  }

  node_config {
    machine_type = MACHINE_TYPE
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION"
      key = "compute.googleapis.com/reservation-name"
      values = [RESERVATION_LABEL_VALUES]
    }
    spot = true
  }
}

Replace the following:

  • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
  • PROJECT_ID: Your project ID.
  • CLUSTER_NAME: The name of the existing cluster.
  • POOL_NAME: The name of the node pool to create.
  • CLUSTER_LOCATION: The compute zone(s) of the cluster. Specify the region where the TPU version is available. To learn more, see Select a TPU version and topology.
  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
  • NUM_NODES: The initial number of nodes in the node pool in each of the node pool's zones. If omitted, the default is 3. If autoscaling is enabled for the node pool using the autoscaling block, we recommend that you set NUM_NODES to 0, since GKE provisions additional TPU nodes as soon as your workload demands them.
  • MACHINE_TYPE: The type of TPU machine to use. To see TPU compatible machine types, use the table in Mapping of TPU configuration.

Optionally, you can also use the following variables:

  • autoscaling: Create a node pool with autoscaling enabled. For single-host TPU slice, GKE scales between the TOTAL_MIN_NODES and TOTAL_MAX_NODES values.
    • TOTAL_MIN_NODES: Minimum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
    • TOTAL_MAX_NODES: Maximum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
  • RESERVATION_LABEL_VALUES: If you use a TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
  • spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • For TPU v5e machine types beginning with ct5l-, use us-central1-a or europe-west4-b.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p, use us-east1-d, us-east5-a, or us-east5-c.
  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT4P: TPU v4
    • CT5LP: TPU v5e
    • CT5P: TPU v5p
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Mapping of TPU configuration table to learn how to define the machine type and TPU topology that create a single-host TPU node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Multi-host TPU slice

You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONE \
    --machine-type=MACHINE_TYPE \
    --tpu-topology=TPU_TOPOLOGY \
    --num-nodes=NUM_NODES \
    [--spot \]
    [--enable-autoscaling \
      --max-nodes MAX_NODES]
    [--reservation-affinity=specific \
    --reservation=RESERVATION_NAME]

Replace the following:

  • POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • TPU v5e machine types beginning with ct5l- are never multi-host.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p machine types beginning with ct5p-, use us-east1-d, us-east5-a, or us-east5-c.

    To learn more, see TPU availability in GKE.

  • CLUSTER_NAME: The name of the cluster.

  • NODE_ZONE: The comma-separated list of one or more zones where GKE creates the node pool.

  • MACHINE_TYPE: The type of machine to use for nodes. To learn more about the available machine types, see Mapping of TPU configuration.

  • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version as follows:

    • TPU v4 or v5p: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.
    • TPU v5e: Define the topology in 2-tuples ({A}x{B}), for example 2x2.

    To learn more, see Topology.

  • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. For example, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.
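The NUM_NODES rule can be checked with a few lines of Python before you run the command. This is a minimal sketch of the arithmetic only; the function names are illustrative, and the default of four chips per VM reflects the multi-host TPU v4 and TPU v5e case described above.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Return the total chip count for a topology string such as 2x4x4 or 4x8."""
    dims = [int(d) for d in topology.lower().split("x")]
    if len(dims) not in (2, 3) or any(d <= 0 for d in dims):
        raise ValueError(f"invalid TPU topology: {topology!r}")
    return prod(dims)


def num_nodes(topology: str, chips_per_vm: int = 4) -> int:
    """Return the node count: total chips divided by the chips in each VM."""
    chips = chips_in_topology(topology)
    if chips % chips_per_vm != 0:
        raise ValueError(
            f"{chips} chips cannot be split evenly across VMs "
            f"with {chips_per_vm} chips each"
        )
    return chips // chips_per_vm


if __name__ == "__main__":
    print(num_nodes("2x4x4"))  # 8 (32 chips / 4 chips per VM)
    print(num_nodes("4x8"))    # 8 (32 chips / 4 chips per VM)
```

The same value applies to MAX_NODES when autoscaling is enabled, because a multi-host slice scales atomically from zero to its full size.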

Optionally, you can also use the following flags:

  • RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.
  • --spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
  • --enable-autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
    • MAX_NODES: The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:

    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider           = google
      project            = PROJECT_ID
      cluster            = CLUSTER_NAME
      name               = POOL_NAME
      location           = CLUSTER_LOCATION
      node_locations     = [NODE_ZONES]
      initial_node_count = NUM_NODES
    
      autoscaling {
        max_node_count  = MAX_NODES
        location_policy = "ANY"
      }
      node_config {
        machine_type = MACHINE_TYPE
        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key = "compute.googleapis.com/reservation-name"
          values = [RESERVATION_LABEL_VALUES]
        }
        spot = true
      }
    
      placement_policy {
        type = "COMPACT"
        tpu_topology = TPU_TOPOLOGY
      }
    }
    

    Replace the following:

    • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
    • PROJECT_ID: Your project ID.
    • CLUSTER_NAME: The name of the existing cluster to add the node pool to.
    • POOL_NAME: The name of the node pool to create.
    • CLUSTER_LOCATION: Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology.
    • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
    • NUM_NODES: The number of nodes in the node pool. It must be zero or the total number of TPU chips in the slice divided by four, because in multi-host TPU slices each TPU node has four chips. For example, if TPU_TOPOLOGY is 4x8, then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Mapping of TPU configuration.
    • TPU_TOPOLOGY: This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using:
      • For TPU v4: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.
      • For TPU v5e: Define the topology in 2-tuples ({A}x{B}), for example 2x2.

    Optionally, you can also use the following variables:

    • RESERVATION_LABEL_VALUES: If you use a TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
    • autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
      • MAX_NODES: The maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
    • spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use:

    • For TPU v4, use us-central2-b.
    • TPU v5e machine types beginning with ct5l- are never multi-host.
    • For TPU v5e machine types beginning with ct5lp-, use us-west1-c, us-west4-a, us-west4-b, us-central1-a, us-east1-c, us-east5-b, or europe-west4-a.
    • For TPU v5p machine types beginning with ct5p-, use us-east1-d, us-east5-a, or us-east5-c.
  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT4P: For TPU v4.
    • CT5LP: For TPU v5e.
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Mapping of TPU configuration table to learn how to define the machine type and TPU topology that create a multi-host TPU node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Provisioning state

If GKE cannot create your TPU slice node pool because of insufficient TPU capacity, GKE returns an error message indicating that the TPU nodes cannot be created due to lack of capacity.

If you are creating a single-host TPU slice node pool, the error message will look similar to this:

2 nodes cannot be created due to lack of capacity. The missing nodes will be
created asynchronously once capacity is available. You can either wait for the
nodes to be up, or delete the node pool and try re-creating it again later.

If you are creating a multi-host TPU slice node pool, the error message will look similar to this:

The nodes (managed by ...) cannot be created now due to lack of capacity. They
will be created asynchronously once capacity is available. You can either wait
for the nodes to be up, or delete the node pool and try re-creating it again
later.

Your TPU provisioning request may stay in the queue for a long time and will remain in the "Provisioning" state while in the queue.

Once capacity is available, GKE creates the remaining nodes that were not created.

If you need capacity sooner, consider trying Spot VMs, though note that Spot VMs consume different quota than on-demand instances.

You may delete the queued TPU request by deleting the TPU slice node pool.

Run your workload on TPU nodes

Workload preparation

TPU workloads have the following preparation requirements.

  1. Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To use TPUs in GKE, ensure that you use the following versions:
    | TPU type | libtpu.so version |
    | TPU v5e (tpu-v5-lite-podslice, tpu-v5-lite-device) | |
    | TPU v5p | Recommended jax[tpu] version: 0.4.19 or later. Recommended torch_xla[tpuvm] version: a nightly build from October 23, 2023. |
    | TPU v4 (tpu-v4-podslice) | |
  2. Set the following environment variables for the container requesting the TPU resources:
    • TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker-id in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
    • TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list of IP addresses or hostnames are ordered and zero indexed by the TPU_WORKER_ID.
    GKE automatically injects these environment variables by using a mutating webhook when a Job is created with completionMode: Indexed, a subdomain, parallelism greater than 1, and a request for google.com/tpu properties. GKE adds a headless Service so that the DNS records are added for the Pods backing the Service.

    When deploying TPU multi-host resources with KubeRay, GKE provides a deployable webhook as part of the Terraform templates for running Ray on GKE. Instructions for running Ray on GKE with TPUs can be found in the TPU User Guide. The mutating webhook injects these environment variables into Ray clusters that request google.com/tpu properties and a multi-host cloud.google.com/gke-tpu-topology node selector.

  3. In your workload manifest, add Kubernetes node selectors to ensure that GKE schedules your TPU workload on the TPU machine type and TPU topology you defined:

        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
          cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
        

      Replace the following:

      • TPU_ACCELERATOR: The name of the TPU accelerator:
        • For TPU v4, use tpu-v4-podslice
        • For TPU v5e machine types beginning with ct5l-, use tpu-v5-lite-device
        • For TPU v5e machine types beginning with ct5lp-, use tpu-v5-lite-podslice
        • For TPU v5p, use tpu-v5p-slice
      • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version as follows:
        • TPU v4: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.
        • TPU v5e: Define the topology in 2-tuples ({A}x{B}), for example 2x2.
        • TPU v5p: Define the topology in 3-tuples ({A}x{B}x{C}), for example 4x4x4.

After you complete the workload preparation, you can run a Job that uses TPUs.

The following sections show examples on how to run a Job that performs simple computation with TPUs.

Example 1: Run a workload that displays the number of available TPU chips in a TPU node pool

The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:

  • TPU version: TPU v4
  • Topology: 2x2x4

This version and topology selection results in a multi-host slice.

  1. Save the following manifest as available-chips-multihost.yaml:
    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-available-chips
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-available-chips
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x4
          containers:
          - name: tpu-job
            image: python:3.10
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            securityContext:
              privileged: true
            command:
            - bash
            - -c
            - |
              pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
              limits:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
    
  2. Deploy the manifest:
    kubectl create -f available-chips-multihost.yaml
    

    GKE runs a TPU v4 slice with four TPU VMs (multi-host TPU slice). The slice has 16 interconnected chips.

  3. Verify that the Job created four Pods:
    kubectl get pods
    

    The output is similar to the following:

    NAME                       READY   STATUS      RESTARTS   AGE
    tpu-available-chips-0-5cd8r   0/1     Completed   0          97s
    tpu-available-chips-1-lqqxt   0/1     Completed   0          97s
    tpu-available-chips-2-f6kwh   0/1     Completed   0          97s
    tpu-available-chips-3-m8b5c   0/1     Completed   0          97s
    
  4. Get the logs of one of the Pods:
    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of one of the created Pods. For example, tpu-available-chips-0-5cd8r.

    The output is similar to the following:

    TPU cores: 16
    

Example 2: Run a workload that displays the number of available TPU chips in the TPU VM

The following workload is a static Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:

  • TPU version: TPU v5e
  • Topology: 2x4

This version and topology selection results in a single-host slice.

  1. Save the following manifest as available-chips-singlehost.yaml:
    apiVersion: v1
    kind: Pod
    metadata:
      name: tpu-job-jax-v5
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x4
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("Total TPU chips:", jax.device_count())'
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
    
  2. Deploy the manifest:
    kubectl create -f available-chips-singlehost.yaml
    

    GKE provisions a node with a single-host TPU slice that uses TPU v5e. The TPU VM has eight chips (single-host TPU slice).

  3. Get the logs of the Pod:
    kubectl logs tpu-job-jax-v5
    

    The output is similar to the following:

    Total TPU chips: 8
    

Upgrade node pools using accelerators (GPUs and TPUs)

GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels, maintenance windows and exclusions, and rollout sequencing.

You can also configure a node upgrade strategy for your node pool, such as surge upgrades or blue-green upgrades. By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools, instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE.

Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources (for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs), see Upgrade in a resource-constrained environment.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this guide, consider deleting the TPU node pools that no longer have scheduled workloads. If running workloads must be terminated gracefully, use kubectl drain to clean up the workloads before you delete the node pool.

  1. Delete a TPU node pool:

    gcloud container node-pools delete POOL_NAME \
        --location=LOCATION \
        --cluster=CLUSTER_NAME
    

    Replace the following:

    • POOL_NAME: The name of the node pool.
    • CLUSTER_NAME: The name of the cluster.
    • LOCATION: The compute location of the cluster.

Additional configurations

The following sections describe the additional configurations you can apply to your TPU workloads.

Multislice

You can aggregate smaller slices together in a Multislice to handle larger training workloads. For more information, see Multislice TPUs in GKE.

Migrate your TPU reservation

If you have existing TPU reservations, you must first migrate your TPU reservation to a new Compute Engine-based reservation system. You can also create a new Compute Engine-based reservation, in which case no migration is needed. To learn how to migrate your TPU reservations, see TPU reservation.

Logging

Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent, sent to Logging, and are visible in Logging.

Use GKE node auto-provisioning

You can configure GKE to automatically create and delete node pools to meet the resource demands of your TPU workloads. For more information, see Configuring Cloud TPUs.

TPU node auto repair

If a TPU node in a multi-host TPU slice node pool is unhealthy, the entire node pool is recreated. Conditions that result in unhealthy TPU nodes include the following:

  • Any TPU node with common node conditions.
  • Any TPU node with an unallocatable TPU count larger than zero.
  • Any TPU VM instance that is stopped (due to preemption) or is terminated.
  • Node maintenance: If any TPU node (VM) within a multi-host TPU slice node pool goes down for host maintenance, GKE recreates the entire TPU slice.

You can see the repair status (including the failure reason) in the operation history. If the failure is caused by insufficient quota, contact your Google Cloud account representative to increase the corresponding quota.

Configure TPU node graceful termination

In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, TPU nodes support SIGTERM signals that alert the node of an imminent shutdown. The imminent shutdown notification is configurable up to five minutes in TPU nodes.

You can configure GKE to terminate your ML workloads gracefully within this notification timeframe. During the graceful termination, your workloads can perform clean-up processes, such as storing workload data to reduce data loss. To get the maximum notification time, in your Pod manifest, set the spec.terminationGracePeriodSeconds field to 300 seconds (five minutes) as follows:

    spec:
      terminationGracePeriodSeconds: 300

GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state.
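
The following Python sketch shows one way a training process might react to the SIGTERM notification described above: the signal handler only sets a flag, and the training loop stops at the next step boundary and saves its state once. The names handle_sigterm, install_handler, train, and save_checkpoint are hypothetical and are not part of any GKE or TPU API. The actual checkpoint logic depends on your framework.

```python
import signal
import threading

# Set when the node signals an imminent shutdown.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record that SIGTERM arrived; the training loop does the real work."""
    shutdown_requested.set()


def install_handler():
    """Register the handler. Call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def train(total_steps, save_checkpoint, run_step=lambda step: None):
    """Run steps until done or until SIGTERM arrives, then checkpoint once.

    save_checkpoint and run_step are placeholders for framework-specific code.
    Returns the number of completed steps.
    """
    completed = 0
    for step in range(total_steps):
        if shutdown_requested.is_set():
            break  # Stop at a step boundary so the saved state is consistent.
        run_step(step)
        completed = step + 1
    save_checkpoint(completed)
    return completed
```

Keeping the handler minimal and saving state from the main loop avoids running heavy I/O inside a signal handler. The checkpoint must finish within the configured terminationGracePeriodSeconds.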

Run containers without privileged mode

If your TPU node runs a GKE version earlier than 1.28, read the following section:

A container running on a TPU VM needs access to higher limits on locked memory so the driver can communicate with the TPU chips over direct memory access (DMA). To enable this, you must configure a higher ulimit. If you want to reduce the permission scope on your container, complete the following steps:

  1. Edit the securityContext to include the following fields:

    securityContext:
      capabilities:
        add: ["SYS_RESOURCE"]
    
  2. Increase ulimit by running the following command inside the container before you set up your workloads to use TPU resources:

    ulimit -l 68719476736
    

Note: For TPU v5e, running containers without privileged mode is available in clusters in version 1.27.4-gke.900 and later.
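
To confirm from inside the container that the locked-memory limit was raised, a workload could read the limit back before it initializes the TPU. The following Python sketch uses the standard library resource module. It assumes that the shell interprets the ulimit -l value in units of 1024 bytes, as bash does, while the resource module reports bytes. The function names are hypothetical.

```python
import resource


def soft_limit_allows(soft_limit_bytes, required_bytes):
    """Return True if a RLIMIT_MEMLOCK soft limit covers required_bytes."""
    if soft_limit_bytes == resource.RLIM_INFINITY:
        return True  # An unlimited soft limit always suffices.
    return soft_limit_bytes >= required_bytes


def memlock_ok(required_kib):
    """Check the current process's locked-memory soft limit.

    required_kib is the value passed to `ulimit -l` (assumed to be in KiB).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft_limit_allows(soft, required_kib * 1024)
```

For example, calling memlock_ok(68719476736) checks the limit against the value used in the ulimit command above.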

Observability and metrics

Dashboard

On the Kubernetes Clusters page in the Google Cloud console, the Observability tab displays the TPU observability metrics. For more information, see GKE observability metrics.

The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.

Runtime metrics

In GKE version 1.27.4-gke.900 or later, TPU workloads that use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:

  • Duty cycle: Percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. A larger percentage means better TPU utilization.
  • Memory used: Amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
  • Memory total: Total accelerator memory in bytes. Sampled every 60 seconds.

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/duty_cycle
  • kubernetes.io/container/accelerator/memory_used
  • kubernetes.io/container/accelerator/memory_total

Kubernetes node:

  • kubernetes.io/node/accelerator/duty_cycle
  • kubernetes.io/node/accelerator/memory_used
  • kubernetes.io/node/accelerator/memory_total
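
When you query these metrics in Cloud Monitoring, you typically select a metric type and a monitored resource type with a filter expression. The following Python sketch assembles such a filter string from the metric names listed above. The helper build_filter and the cluster name in the usage note are hypothetical, and the sketch assumes the standard Monitoring filter syntax (metric.type, resource.type, and resource.labels). It only builds the string and does not call any API.

```python
# Metric type prefixes from the lists above, keyed by monitored resource type.
RUNTIME_METRIC_PREFIXES = {
    "k8s_container": "kubernetes.io/container/accelerator/",
    "k8s_node": "kubernetes.io/node/accelerator/",
}


def build_filter(resource_type, metric_name, cluster_name=None):
    """Build a Monitoring filter for one TPU runtime metric.

    metric_name is the last path segment, for example "duty_cycle".
    """
    prefix = RUNTIME_METRIC_PREFIXES[resource_type]
    parts = [
        f'metric.type = "{prefix}{metric_name}"',
        f'resource.type = "{resource_type}"',
    ]
    if cluster_name is not None:
        # Optionally narrow the query to a single cluster.
        parts.append(f'resource.labels.cluster_name = "{cluster_name}"')
    return " AND ".join(parts)
```

For example, build_filter("k8s_container", "duty_cycle", "my-cluster") produces a filter for the container-level duty cycle in a cluster with the placeholder name my-cluster.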

Host metrics

In GKE version 1.28.1-gke.1066000 or later, TPU VMs export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:

  • TensorCore utilization: Current percentage of the TensorCore that is utilized. The TensorCore value equals the sum of the matrix-multiply units (MXUs) plus the vector unit. The TensorCore utilization value is computed by dividing the number of TensorCore operations performed over the past sample period (60 seconds) by the supported number of TensorCore operations over the same period. A larger value means better utilization.
  • Memory bandwidth utilization: Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60 seconds) by the maximum supported bandwidth over the same sample period.
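
Both host metrics are ratios of an observed quantity to a supported maximum over the same sample period, expressed as a percentage. The following Python sketch illustrates that arithmetic only. The function name and the sample numbers are made up for illustration and do not correspond to any specific TPU version.

```python
def utilization_percent(used, supported):
    """Return used / supported as a percentage for one sample period.

    used:      operations performed (or bandwidth used) during the period.
    supported: maximum operations (or bandwidth) supported over that period.
    """
    if supported <= 0:
        raise ValueError("supported must be positive")
    return 100.0 * used / supported


# Illustrative values only: 450 units of work out of a supported 600.
tensorcore = utilization_percent(450, 600)  # 75.0
# Illustrative values only: 300 units of bandwidth out of a supported 1200.
bandwidth = utilization_percent(300, 1200)  # 25.0
```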

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/tensorcore_utilization
  • kubernetes.io/container/accelerator/memory_bandwidth_utilization

Kubernetes node:

  • kubernetes.io/node/accelerator/tensorcore_utilization
  • kubernetes.io/node/accelerator/memory_bandwidth_utilization

For more information, see Kubernetes metrics and GKE system metrics.

Known issues

  • Cluster autoscaler might incorrectly calculate capacity for new TPU nodes before those nodes report available TPUs. Cluster autoscaler might then perform an additional scale-up and, as a result, create more nodes than needed. If the additional nodes are not needed, cluster autoscaler removes them during a regular scale-down operation.
  • Cluster autoscaler cancels the scale-up of TPU node pools that remain in waiting status for more than 15 minutes. Cluster autoscaler retries such scale-up operations later. This behavior might reduce TPU obtainability for customers who don't use reservations.
  • Non-TPU workloads that have a toleration for the TPU taint might prevent scale-down of the node pool if they are being recreated while the TPU node pool is draining.

What's next