Deploy TPU workloads in GKE Standard


This page provides a foundation for learning how to accelerate machine learning (ML) workloads using TPUs in Google Kubernetes Engine (GKE). TPUs are designed for matrix processing, such as large-scale deep learning model training. Because TPUs are optimized to handle the enormous datasets and complex models of ML, their performance makes them more cost-effective and energy-efficient for ML workloads. In this guide, you learn how to deploy ML workloads by using Cloud TPU accelerators, configure quotas for TPUs, configure upgrades for node pools that run TPUs, and monitor TPU workload metrics.

This tutorial is intended for machine learning (ML) engineers and platform admins and operators who are interested in using Kubernetes container orchestration to manage large-scale model training, tuning, and inference workloads with TPUs. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before reading this page, ensure that you're familiar with the following:

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Plan your TPU configuration

Plan your TPU configuration based on your model and how much memory it requires. Before you use this guide to deploy your workloads on TPU, complete the planning steps in Plan your TPU configuration.

Ensure that you have TPU quota

The following sections help you ensure that you have enough quota when using TPUs in GKE.

Quota for on-demand or Spot VMs

If you are creating a TPU slice node pool with on-demand or Spot VMs, you must have sufficient TPU quota available in the region that you want to use.

Creating a TPU slice node pool that consumes a TPU reservation does not require any TPU quota.¹ You can safely skip this section for reserved TPUs.

Creating an on-demand or Spot TPU slice node pool in GKE requires Compute Engine API quota. Compute Engine API quota (compute.googleapis.com) is not the same as Cloud TPU API quota (tpu.googleapis.com), which is needed when creating TPUs with the Cloud TPU API.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console:

    Go to Quotas

  2. In the Filter box, do the following:

    1. Select the Service property, enter Compute Engine API, and press Enter.

    2. Select the Type property and choose Quota.

    3. Select the Name property and enter the name of the quota based on the TPU version and machine type. For example, if you plan to create on-demand TPU v5e nodes whose machine type begins with ct5lp-, enter TPU v5 Lite PodSlice chips.

      TPU version | Machine type begins with | Quota name for on-demand instances | Quota name for Spot² instances
      TPU v3 | ct3- | TPU v3 Device chips | Preemptible TPU v3 Device chips
      TPU v3 | ct3p- | TPU v3 PodSlice chips | Preemptible TPU v3 PodSlice chips
      TPU v4 | ct4p- | TPU v4 PodSlice chips | Preemptible TPU v4 PodSlice chips
      TPU v5e | ct5l- | TPU v5 Lite Device chips | Preemptible TPU v5 Lite Device chips
      TPU v5e | ct5lp- | TPU v5 Lite PodSlice chips | Preemptible TPU v5 Lite PodSlice chips
      TPU v5p | ct5p- | TPU v5p chips | Preemptible TPU v5p chips
      TPU Trillium | ct6e- | TPU v6e Slice chips | Preemptible TPU v6e Lite PodSlice chips
    4. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota increase.

When a TPU reservation is created, both the limit and current use values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips whose machine type begins with ct5lp-, then both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.

  1. When creating a TPU slice node pool, use the --reservation and --reservation-affinity=specific flags to create a reserved instance. TPU reservations are available when purchasing a commitment.

  2. When creating a TPU slice node pool, use the --spot flag to create a Spot instance.

Quotas for additional GKE resources

You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.

  • Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100GB (nodes * 100GB).
  • In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
  • Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, a max-pods-per-node value of 32 requires 64 IP addresses, which translates to a /26 subnet per node. This range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of Pods allowed to be scheduled on a node, and ensure that the subnet's secondary ranges can accommodate that many Pods on the maximum number of GKE nodes you anticipate creating.
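As a rough sizing aid, the quota minimums above can be computed from the node count you anticipate. A minimal sketch (the 100 GB default boot disk size and the two-IPs-per-Pod rule come from the text above; the helper name is illustrative):

```python
import math

def gke_quota_estimates(max_nodes: int, max_pods_per_node: int = 32,
                        boot_disk_gb: int = 100) -> dict:
    """Estimate regional quota minimums for a GKE Standard cluster.

    Each node needs one in-use IP address and a 100 GB boot disk by
    default, and reserves twice max-pods-per-node secondary IPs for Pods
    (rounded up to a power of two for the per-node subnet).
    """
    pod_ips_per_node = 2 * max_pods_per_node            # e.g. 32 Pods -> 64 IPs
    # Smallest prefix that holds pod_ips_per_node addresses: /26 for 64.
    pod_range_prefix = 32 - math.ceil(math.log2(pod_ips_per_node))
    return {
        "persistent_disk_ssd_gb": max_nodes * boot_disk_gb,
        "in_use_ip_addresses": max_nodes,
        "per_node_pod_range": f"/{pod_range_prefix}",
    }

print(gke_quota_estimates(max_nodes=16))
```

For 16 anticipated nodes this yields 1,600 GB of Persistent Disk SSD quota, 16 in-use IP addresses, and a /26 Pod range per node.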

To request an increase in quota, see Request higher quota.

Ensure reservation availability

Creating a reserved TPU slice node pool, which consumes a reservation, does not require any TPU quota. However, the reservation must have enough available or unused TPU chips when the node pool is created.

To see which reservations exist within a project, view a list of your reservations.

To view how many TPU chips within a TPU reservation are available, view the details of a reservation.
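Both checks can be performed from the gcloud CLI by using the Compute Engine reservations commands; this is a sketch in which RESERVATION_NAME and ZONE are placeholders you replace with your own values:

```shell
# List the reservations in the current project.
gcloud compute reservations list

# View a reservation's details, including how many instances are in use,
# to determine how much capacity remains available.
gcloud compute reservations describe RESERVATION_NAME \
    --zone=ZONE
```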

Options for provisioning TPUs in GKE

GKE lets you use TPUs directly in individual workloads by using Kubernetes nodeSelectors in your workload manifest or by creating Standard mode node pools with TPUs.

Alternatively, you can request TPUs by using custom compute classes. Custom compute classes let platform administrators define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware.

For instructions, see the Provision TPUs using custom compute classes section.

Create a cluster

Create a GKE cluster in Standard mode in a region with available TPUs.

Best practice:

Use regional clusters, which provide high availability of the Kubernetes control plane.

gcloud container clusters create CLUSTER_NAME \
  --location LOCATION \
  --cluster-version VERSION

Replace the following:

  • CLUSTER_NAME: the name of the new cluster.
  • LOCATION: the region with your TPU capacity available.
  • VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. To learn the minimum GKE version available for each TPU machine type, see TPU availability in GKE.

Create a node pool

Single-host TPU slice

You can create a single-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create NODE_POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONES \
    --machine-type=MACHINE_TYPE

Replace the following:

  • NODE_POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
  • CLUSTER_NAME: The name of the cluster.
  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of machine to use for nodes. For more information about TPU compatible machine types, use the table in Choose the TPU version.

Optionally, you can also use the following flags:

  • --num-nodes=NUM_NODES: The initial number of nodes in the node pool in each zone. If you omit this flag, GKE assigns the default of 3.

    Best practice:

    If you use the --enable-autoscaling flag for the node pool, set --num-nodes to 0 so that the autoscaler provisions additional nodes as soon as your workloads demand them.

  • --reservation=RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.

  • --node-labels cloud.google.com/gke-workload-type=HIGH_AVAILABILITY: Tells GKE that the single-host TPU slice node pool is part of a collection. Use this flag if the following conditions apply:

    • The node pool runs inference workloads.
    • The node pool uses TPU Trillium.
    • The node pool doesn't use Spot VMs.

    To learn more about collection scheduling management, see Manage collection scheduling in single-host TPU slices.

  • --enable-autoscaling: Create a node pool with autoscaling enabled. Requires the following additional flags:

    • --total-min-nodes=TOTAL_MIN_NODES: Minimum number of all nodes in the node pool.
    • --total-max-nodes=TOTAL_MAX_NODES: Maximum number of all nodes in the node pool.
    • --location-policy=ANY: Prioritize usage of unused reservations and reduce the preemption risk of Spot VMs.
  • --spot: Sets the node pool to use Spot VMs for the nodes in the node pool. This cannot be changed after node pool creation.

For a full list of all the flags that you can specify, see the gcloud container node-pools create reference.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:
resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
  provider           = google
  project            = PROJECT_ID
  cluster            = CLUSTER_NAME
  name               = POOL_NAME
  location           = CLUSTER_LOCATION
  node_locations     = [NODE_ZONES]

  node_config {
    machine_type = MACHINE_TYPE
    reservation_affinity {
      consume_reservation_type = "SPECIFIC_RESERVATION"
      key = "compute.googleapis.com/reservation-name"
      values = [RESERVATION_LABEL_VALUES]
    }
    spot = true
  }
}

Replace the following:

  • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
  • PROJECT_ID: Your project ID.
  • CLUSTER_NAME: The name of the existing cluster.
  • POOL_NAME: The name of the node pool to create.
  • CLUSTER_LOCATION: The compute location of the cluster. Specify the region where the TPU version is available. To learn more, see Select a TPU version and topology.
  • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of TPU machine to use. To see TPU compatible machine types, use the table in Choose the TPU version.

Optionally, you can also use the following variables:

  • autoscaling: Create a node pool with autoscaling enabled. For single-host TPU slices, GKE scales between the TOTAL_MIN_NODES and TOTAL_MAX_NODES values.
    • TOTAL_MIN_NODES: Minimum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
    • TOTAL_MAX_NODES: Maximum number of all nodes in the node pool. This field is optional unless autoscaling is also specified.
  • RESERVATION_NAME: If you use TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
  • spot: Sets the node pool to use Spot VMs for the TPU nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the zone based on the TPU version you want to use. To identify an available zone, see TPU availability in GKE.

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3: TPU v3, single host device
    • CT3P: TPU v3, multi host pod slice
    • CT4P: TPU v4
    • CT5LP: TPU v5e
    • CT5P: TPU v5p
    • CT6E: TPU Trillium (v6e)
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a single-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Multi-host TPU slice

You can create a multi-host TPU slice node pool using the Google Cloud CLI, Terraform, or the Google Cloud console.

gcloud

gcloud container node-pools create POOL_NAME \
    --location=LOCATION \
    --cluster=CLUSTER_NAME \
    --node-locations=NODE_ZONE \
    --machine-type=MACHINE_TYPE \
    --tpu-topology=TPU_TOPOLOGY \
    --num-nodes=NUM_NODES \
    [--spot] \
    [--enable-autoscaling --max-nodes=MAX_NODES] \
    [--reservation-affinity=specific --reservation=RESERVATION_NAME]

Replace the following:

  • POOL_NAME: The name of the new node pool.
  • LOCATION: The name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.
  • CLUSTER_NAME: The name of the cluster.
  • NODE_ZONE: The comma-separated list of one or more zones where GKE creates the node pool.
  • MACHINE_TYPE: The type of machine to use for nodes. To learn more about the available machine types, see Choose the TPU version.
  • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more about TPU topologies, use the table in Choose a topology.

    To learn more, see Topology.

  • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM. For multi-host TPU v4 and TPU v5e, the number of chips in each VM is four. Therefore, if your TPU_TOPOLOGY is 2x4x4 (TPU v4 with four chips in each VM), then NUM_NODES is 32/4, which equals 8.
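The NUM_NODES arithmetic above can be sketched as a small helper (the function name is illustrative; the default of four chips per VM applies to multi-host TPU v4 and TPU v5e, as stated above):

```python
from math import prod

def num_nodes(tpu_topology: str, chips_per_vm: int = 4) -> int:
    """Compute the node count for a multi-host TPU slice node pool.

    The product of the topology dimensions ({A}x{B}x{C}) is the total
    chip count; dividing by the chips in each VM gives the node count.
    """
    dims = [int(d) for d in tpu_topology.split("x")]
    total_chips = prod(dims)
    if total_chips % chips_per_vm:
        raise ValueError("topology chip count must divide evenly by chips per VM")
    return total_chips // chips_per_vm

print(num_nodes("2x4x4"))  # TPU v4 topology from the example above -> 8
```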

Optionally, you can also use the following flags:

  • --reservation=RESERVATION_NAME: The name of the reservation GKE uses when creating the node pool. If you omit this flag, GKE uses available TPUs. To learn more about TPU reservations, see TPU reservation.
  • --spot: Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.
  • --enable-autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
    • MAX_NODES: The maximum size of the node pool. The --max-nodes flag is required if --enable-autoscaling is supplied and must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.

Terraform

  1. Ensure that you use version 4.84.0 or later of the google provider.
  2. Add the following block to your Terraform configuration:

    resource "google_container_node_pool" "NODE_POOL_RESOURCE_NAME" {
      provider           = google
      project            = PROJECT_ID
      cluster            = CLUSTER_NAME
      name               = POOL_NAME
      location           = CLUSTER_LOCATION
      node_locations     = [NODE_ZONES]
      initial_node_count = NUM_NODES
    
      autoscaling {
        max_node_count = MAX_NODES
        location_policy      = "ANY"
      }
      node_config {
        machine_type = MACHINE_TYPE
        reservation_affinity {
          consume_reservation_type = "SPECIFIC_RESERVATION"
          key = "compute.googleapis.com/reservation-name"
          values = [RESERVATION_LABEL_VALUES]
        }
        spot = true
      }
    
      placement_policy {
        type = "COMPACT"
        tpu_topology = TPU_TOPOLOGY
      }
    }
    

    Replace the following:

    • NODE_POOL_RESOURCE_NAME: The name of the node pool resource in the Terraform template.
    • PROJECT_ID: Your project ID.
    • CLUSTER_NAME: The name of the existing cluster to add the node pool to.
    • POOL_NAME: The name of the node pool to create.
    • CLUSTER_LOCATION: Compute location for the cluster. We recommend having a regional cluster for higher reliability of the Kubernetes control plane. You can also use a zonal cluster. To learn more, see Select a TPU version and topology.
    • NODE_ZONES: The comma-separated list of one or more zones where GKE creates the node pool.
    • NUM_NODES: The number of nodes in the node pool. It must be zero or the product of the number of TPU chips divided by four, because in multi-host TPU slices each TPU slice node has four chips. For example, if TPU_TOPOLOGY is 4x8, then there are 32 chips, which means NUM_NODES must be 8. To learn more about TPU topologies, use the table in Choose the TPU version.
    • TPU_TOPOLOGY: This indicates the desired physical topology for the TPU slice. The format of the topology depends on the TPU version you are using. To learn more about TPU topologies, use the table in Choose a topology.

    Optionally, you can also use the following variables:

    • RESERVATION_NAME: If you use a TPU reservation, this is the list of labels of the reservation resources to use when creating the node pool. To learn more about how to populate the RESERVATION_LABEL_VALUES in the reservation_affinity field, see Terraform Provider.
    • autoscaling: Create a node pool with autoscaling enabled. When GKE scales a multi-host TPU slice node pool, it atomically scales up the node pool from zero to the maximum size.
      • MAX_NODES: The maximum size of the node pool. It must be equal to the product of the values defined in TPU_TOPOLOGY ({A}x{B}x{C}) divided by the number of chips in each VM.
    • spot: Sets the node pool to use Spot VMs for the TPU slice nodes. This cannot be changed after node pool creation. For more information, see Spot VMs.

Console

To create a node pool with TPUs:

  1. Go to the Google Kubernetes Engine page in the Google Cloud console.

    Go to Google Kubernetes Engine

  2. In the cluster list, click the name of the cluster you want to modify.

  3. Click Add node pool.

  4. In the Node pool details section, check the Specify node locations box.

  5. Select the name of the zone based on the TPU version you want to use. To identify an available location, see TPU availability in GKE.

  6. From the navigation pane, click Nodes.

  7. In the Machine Configuration section, select TPUs.

  8. In the Series drop-down menu, select one of the following:

    • CT3P: For TPU v3.
    • CT4P: For TPU v4.
    • CT5LP: For TPU v5e.
  9. In the Machine type drop-down menu, select the name of the machine to use for nodes. Use the Choose the TPU version table to learn how to define the machine type and TPU topology that create a multi-host TPU slice node pool.

  10. In the TPU Topology drop-down menu, select the physical topology for the TPU slice.

  11. In the Changes needed dialog, click Make changes.

  12. Ensure that Boot disk type is either Standard persistent disk or SSD persistent disk.

  13. Optionally, select the Enable nodes on spot VMs checkbox to use Spot VMs for the nodes in the node pool.

  14. Click Create.

Provisioning state

If GKE cannot create your TPU slice node pool due to insufficient TPU capacity available, GKE returns an error message indicating the TPU slice nodes cannot be created due to lack of capacity.

If you are creating a single-host TPU slice node pool, the error message looks similar to this:

2 nodes cannot be created due to lack of capacity. The missing nodes will be
created asynchronously once capacity is available. You can either wait for the
nodes to be up, or delete the node pool and try re-creating it again later.

If you are creating a multi-host TPU slice node pool, the error message looks similar to this:

The nodes (managed by ...) cannot be created now due to lack of capacity. They
will be created asynchronously once capacity is available. You can either wait
for the nodes to be up, or delete the node pool and try re-creating it again
later.

Your TPU provisioning request can stay in the queue for a long time and remains in the "Provisioning" state while in the queue.

Once capacity is available, GKE creates the remaining nodes that were not created.

If you need capacity sooner, consider trying Spot VMs, though note that Spot VMs consume different quota than on-demand instances.

You can delete the queued TPU request by deleting the TPU slice node pool.

Run your workload on TPU slice nodes

Workload preparation

TPU workloads have the following preparation requirements.

  1. Frameworks like JAX, PyTorch, and TensorFlow access TPU VMs using the libtpu shared library. libtpu includes the XLA compiler, TPU runtime software, and the TPU driver. Each release of PyTorch and JAX requires a certain libtpu.so version. To use TPUs in GKE, ensure that you use a compatible version for your TPU type:
    • TPU Trillium (v6e): tpu-v6e-slice
    • TPU v5e: tpu-v5-lite-podslice, tpu-v5-lite-device
    • TPU v5p: tpu-v5p-slice
    • TPU v4: tpu-v4-podslice
    • TPU v3: tpu-v3-slice, tpu-v3-device

    Recommended library versions:
    • jax[tpu]: version 0.4.19 or later.
    • torchxla[tpuvm]: a nightly version build from October 23, 2023.
  2. Set the following environment variables for the container requesting the TPU resources:
    • TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique worker-id in the TPU slice. The supported values for this field range from zero to the number of Pods minus one.
    • TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses that need to communicate with each other within the slice. There should be a hostname or IP address for each TPU VM in the slice. The list is ordered and zero-indexed by TPU_WORKER_ID.
    • GKE automatically injects these environment variables by using a mutating webhook when a Job is created with completionMode: Indexed, a subdomain, parallelism greater than 1, and google.com/tpu resource requests. GKE adds a headless Service so that DNS records are added for the Pods backing the Service.

      When deploying TPU multi-host resources with KubeRay, GKE provides a deployable webhook as part of the Terraform templates for running Ray on GKE. Instructions for running Ray on GKE with TPUs can be found in the TPU User Guide. The mutating webhook injects these environment variables into Ray clusters requesting google.com/tpu properties and a multi-host cloud.google.com/gke-tpu-topology node selector.

    • In your workload manifest, add Kubernetes node selectors to ensure that GKE schedules your TPU workload on the TPU machine type and TPU topology you defined:

        nodeSelector:
          cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
          cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
        

      Replace the following:

      • TPU_ACCELERATOR: The name of the TPU accelerator.
      • TPU_TOPOLOGY: The physical topology for the TPU slice. The format of the topology depends on the TPU version. To learn more, see Plan TPUs in GKE.
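Inside the container, a worker can read the injected variables to find its ID and its peers. A minimal sketch (TPU_WORKER_ID and TPU_WORKER_HOSTNAMES are the variables GKE injects, as described above; the parsing helper itself is illustrative):

```python
import os

def tpu_slice_peers(environ=os.environ):
    """Return (worker_id, ordered peer hostnames) from GKE-injected variables.

    TPU_WORKER_HOSTNAMES is a comma-separated list, zero-indexed by
    TPU_WORKER_ID, so hostnames[worker_id] is this worker's own entry.
    """
    worker_id = int(environ["TPU_WORKER_ID"])
    hostnames = environ["TPU_WORKER_HOSTNAMES"].split(",")
    if not 0 <= worker_id < len(hostnames):
        raise ValueError("TPU_WORKER_ID out of range for TPU_WORKER_HOSTNAMES")
    return worker_id, hostnames

# Example with hypothetical values, as they would appear inside a Pod:
wid, hosts = tpu_slice_peers({"TPU_WORKER_ID": "1",
                              "TPU_WORKER_HOSTNAMES": "tpu-0,tpu-1,tpu-2,tpu-3"})
print(wid, hosts[wid])  # this worker's own hostname
```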

After you complete the workload preparation, you can run a Job that uses TPUs.

The following sections show examples of how to run a Job that performs a simple computation with TPUs.

Example 1: Run a workload that displays the number of available TPU chips in a TPU slice node pool

The following workload returns the number of TPU chips across all of the nodes in a multi-host TPU slice. To create a multi-host slice, the workload has the following parameters:

  • TPU version: TPU v4
  • Topology: 2x2x4

This combination of version and topology results in a multi-host slice.

  1. Save the following manifest as available-chips-multihost.yaml:
    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-available-chips
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-available-chips
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x4
          containers:
          - name: tpu-job
            image: python:3.10
            ports:
            - containerPort: 8471 # Default port using which TPU VMs communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            securityContext:
              privileged: true
            command:
            - bash
            - -c
            - |
              pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
              limits:
                cpu: 10
                memory: 500Gi
                google.com/tpu: 4
  2. Deploy the manifest:
    kubectl create -f available-chips-multihost.yaml
    

    GKE runs a TPU v4 slice with four VMs (multi-host TPU slice). The slice has 16 interconnected TPU chips.

  3. Verify that the Job created four Pods:
    kubectl get pods
    

    The output is similar to the following:

    NAME                       READY   STATUS      RESTARTS   AGE
    tpu-job-podslice-0-5cd8r   0/1     Completed   0          97s
    tpu-job-podslice-1-lqqxt   0/1     Completed   0          97s
    tpu-job-podslice-2-f6kwh   0/1     Completed   0          97s
    tpu-job-podslice-3-m8b5c   0/1     Completed   0          97s
    
  4. Get the logs of one of the Pods:
    kubectl logs POD_NAME
    

    Replace POD_NAME with the name of one of the created Pods. For example, tpu-job-podslice-0-5cd8r.

    The output is similar to the following:

    TPU cores: 16
    

Example 2: Run a workload that displays the number of available TPU chips in the TPU slice

The following workload is a static Pod that displays the number of TPU chips that are attached to a specific node. To create a single-host node, the workload has the following parameters:

  • TPU version: TPU v5e
  • Topology: 2x4

This combination of version and topology results in a single-host slice.

  1. Save the following manifest as available-chips-singlehost.yaml:
    apiVersion: v1
    kind: Pod
    metadata:
      name: tpu-job-jax-v5
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        cloud.google.com/gke-tpu-topology: 2x4
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("Total TPU chips:", jax.device_count())'
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
  2. Deploy the manifest:
    kubectl create -f available-chips-singlehost.yaml
    

    GKE provisions a single-host TPU slice node that uses TPU v5e. The node has eight TPU chips (single-host TPU slice).

  3. Get the logs of the Pod:
    kubectl logs tpu-job-jax-v5
    

    The output is similar to the following:

    Total TPU chips: 8
    

Upgrade node pools using accelerators (GPUs and TPUs)

GKE automatically upgrades Standard clusters, including node pools. You can also manually upgrade node pools if you want your nodes on a later version sooner. To control how upgrades work for your cluster, use release channels, maintenance windows and exclusions, and rollout sequencing.

You can also configure a node upgrade strategy for your node pool, such as surge upgrades or blue-green upgrades. By configuring these strategies, you can ensure that the node pools are upgraded in a way that achieves the optimal balance between speed and disruption for your environment. For multi-host TPU slice node pools, instead of using the configured node upgrade strategy, GKE atomically recreates the entire node pool in a single step. To learn more, see the definition of atomicity in Terminology related to TPU in GKE.

Using a node upgrade strategy temporarily requires GKE to provision additional resources, depending on the configuration. If Google Cloud has limited capacity for your node pool's resources—for example, you're seeing resource availability errors when trying to create more nodes with GPUs or TPUs—see Upgrade in a resource-constrained environment.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this guide, consider deleting the TPU slice node pools that no longer have scheduled workloads. If running workloads must be gracefully terminated, use kubectl drain to clean up the workloads before you delete the node.
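The drain-before-delete flow can be sketched as follows; POOL_NAME and NODE_NAME are placeholders, and cloud.google.com/gke-nodepool is the standard GKE node label for the pool a node belongs to:

```shell
# List the nodes that belong to the TPU slice node pool.
kubectl get nodes -l cloud.google.com/gke-nodepool=POOL_NAME

# Gracefully evict the workloads from each node before deleting the pool.
kubectl drain NODE_NAME \
    --ignore-daemonsets \
    --delete-emptydir-data
```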

  1. Delete a TPU slice node pool:

    gcloud container node-pools delete POOL_NAME \
        --location=LOCATION \
        --cluster=CLUSTER_NAME
    

    Replace the following:

    • POOL_NAME: The name of the node pool.
    • CLUSTER_NAME: The name of the cluster.
    • LOCATION: The compute location of the cluster.

Additional configurations

The following sections describe the additional configurations you can apply to your TPU workloads.

Manage collection scheduling

In TPU Trillium, you can use collection scheduling to group TPU slice nodes. Grouping these TPU slice nodes makes it easier to adjust the number of replicas to meet the workload demand. Google Cloud controls software updates to ensure that sufficient slices within the collection are always available to serve traffic.

Use the following tasks to manage single-host TPU slice node pools.

  • To check if a single-host TPU slice pool has collection scheduling enabled, run the following command:

    gcloud container node-pools describe NODE_POOL_NAME \
        --cluster CLUSTER_NAME \
        --project PROJECT_NAME \
        --format="json" | jq -r '.config.labels["cloud.google.com/gke-workload-type"]'
    

    The output is similar to the following:

    HIGH_AVAILABILITY
    

    If the single-host TPU slice node pool is part of a collection, it has the cloud.google.com/gke-workload-type: HIGH_AVAILABILITY label, and the command prints the label's value.

  • To scale up the collection, resize the node pool manually or automatically with node auto-provisioning.

  • To scale down the collection, delete the node pool.

  • To delete the collection, remove all of the attached node pools. You can delete the node pool or delete the cluster. Deleting the cluster removes all of the collections in it.
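The manual scale-up task above can be sketched with the standard node pool resize command (all values are placeholders):

```shell
gcloud container node-pools resize NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --num-nodes=NUM_NODES
```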

Multislice

You can aggregate smaller slices together in a Multislice to handle larger training workloads. For more information, see Multislice TPUs in GKE.

Migrate your TPU reservation

If you have existing TPU reservations, you must first migrate them to the Compute Engine-based reservation system. Alternatively, you can create new Compute Engine-based reservations, which require no migration. To learn how to migrate your TPU reservations, see TPU reservation.

Logging

Logs emitted by containers running on GKE nodes, including TPU VMs, are collected by the GKE logging agent and sent to Cloud Logging, where you can view them.

Use GKE node auto-provisioning

You can configure GKE to automatically create and delete node pools to meet the resource demands of your TPU workloads. For more information, see Configuring Cloud TPUs.

Provision TPUs by using custom compute classes

You can also use custom compute classes to configure GKE to request TPUs during scaling operations that create new nodes.

You can specify TPU configuration options in your custom compute class specification. When a GKE workload uses that custom compute class, GKE attempts to provision TPUs that use your specified configuration when scaling up.

To provision TPUs with a custom compute class, do the following:

  1. Ensure that your cluster has an available custom compute class that selects TPUs. To learn how to specify TPUs in custom compute classes, see TPU rules.

  2. Save the following manifest as tpu-job.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: headless-svc
    spec:
      clusterIP: None
      selector:
        job-name: tpu-job
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-job
    spec:
      backoffLimit: 0
      completions: 4
      parallelism: 4
      completionMode: Indexed
      template:
        spec:
          subdomain: headless-svc
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/compute-class: TPU_CLASS_NAME
          containers:
          - name: tpu-job
            image: python:3.10
            ports:
            - containerPort: 8471 # Default port that TPU VMs use to communicate
            - containerPort: 8431 # Port to export TPU runtime metrics, if supported.
            command:
            - bash
            - -c
            - |
              pip install 'jax[tpu]' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
              python -c 'import jax; print("TPU cores:", jax.device_count())'
            resources:
              requests:
                cpu: 10
                memory: 500Gi
                google.com/tpu: NUMBER_OF_CHIPS
              limits:
                cpu: 10
                memory: 500Gi
                google.com/tpu: NUMBER_OF_CHIPS
    

    Replace the following:

    • TPU_CLASS_NAME: the name of the existing custom compute class that specifies TPUs.
    • NUMBER_OF_CHIPS: the number of TPU chips for the container to use. Must be the same value for limits and requests, equal to the value in the tpu.count field in the selected custom compute class.
  3. Deploy the Job:

    kubectl create -f tpu-job.yaml
    

When you create this Job, GKE automatically does the following:

  • Provisions nodes to run the Pods. Depending on the TPU type, topology, and resource requests that you specified, these nodes are either single-host slices or multi-host slices. If TPU resources aren't available in the top-priority configuration, GKE might fall back to lower priorities to maximize obtainability.
  • Adds taints to the nodes and tolerations to the Pods to prevent any of your other workloads from running on the same nodes as TPU workloads.

To learn more, see About custom compute classes.
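For reference, a custom compute class that selects TPUs might look like the following sketch. The field values are illustrative; see TPU rules for the supported fields:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: tpu-class                  # matches TPU_CLASS_NAME in the Job manifest
spec:
  priorities:
  - tpu:
      type: tpu-v5-lite-podslice   # illustrative TPU type
      count: 4                     # must equal google.com/tpu in the Job
      topology: 2x2                # illustrative topology
  nodePoolAutoCreation:
    enabled: true
```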

TPU slice node auto repair

If a TPU slice node in a multi-host TPU slice node pool is unhealthy, GKE recreates the entire node pool. In contrast, in a single-host TPU slice node pool, only the unhealthy TPU node is auto-repaired.

Conditions that result in unhealthy TPU slice nodes include the following:

  • Any TPU slice node with common node conditions.
  • Any TPU slice node with an unallocatable TPU count larger than zero.
  • Any VM instance in a TPU slice that is stopped (due to preemption) or is terminated.
  • Node maintenance: If any TPU slice node within a multi-host TPU slice node pool goes down for host maintenance, GKE recreates the entire TPU slice node pool.

You can see the repair status (including the failure reason) in the operation history. If the failure is caused by insufficient quota, contact your Google Cloud account representative to increase the corresponding quota.

Configure TPU slice node graceful termination

In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, TPU slice nodes support SIGTERM signals that alert the node of an imminent shutdown. You can configure this imminent-shutdown notification to be sent up to five minutes before the node shuts down.

To configure GKE to terminate your workloads gracefully within this notification timeframe, follow the steps in Manage GKE node disruption for GPUs and TPUs.
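Inside the workload, the training process can trap SIGTERM and checkpoint before exiting. The following is a minimal Python sketch; the loop and the checkpoint step are illustrative, not part of any GKE API:

```python
import signal

# Set when GKE signals an imminent node shutdown.
shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True

# Register the handler so SIGTERM flips the flag instead of killing the process.
signal.signal(signal.SIGTERM, handle_sigterm)

def training_loop(max_steps):
    """Run training steps, stopping early if a shutdown was requested."""
    for step in range(max_steps):
        if shutdown_requested:
            # Save a checkpoint here, then exit cleanly.
            return step
        # ... one training step ...
    return max_steps
```

Make sure the Pod's terminationGracePeriodSeconds is long enough for the checkpoint to complete.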

Run containers without privileged mode

Containers running in nodes in GKE version 1.28 or later don't need to have privileged mode enabled to access TPUs. Nodes running GKE versions earlier than 1.28 require privileged mode.

If your TPU slice node runs a version earlier than 1.28, the following applies:

A container running on a VM in a TPU slice needs access to higher limits on locked memory so the driver can communicate with the TPU chips over direct memory access (DMA). To enable this, you must configure a higher ulimit. If you want to reduce the permission scope on your container, complete the following steps:

  1. Edit the securityContext to include the following fields:

    securityContext:
      capabilities:
        add: ["SYS_RESOURCE"]
    
  2. Increase ulimit by running the following command inside the container before setting up your workloads to use TPU resources:

    ulimit -l 68719476736
    

For TPU v5e, running containers without privileged mode is available in clusters in version 1.27.4-gke.900 and later.
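Putting both steps together, the container spec might look like the following sketch. The image and the workload command are illustrative:

```yaml
containers:
- name: tpu-container
  image: python:3.10         # illustrative image
  securityContext:
    capabilities:
      add: ["SYS_RESOURCE"]  # allows raising the locked-memory ulimit
  command:
  - bash
  - -c
  - |
    ulimit -l 68719476736    # raise locked memory before the workload starts
    python train.py          # hypothetical workload entry point
```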

Observability and metrics

Dashboard

In the Kubernetes Clusters page in the Google Cloud console, the Observability tab displays the TPU observability metrics. For more information, see GKE observability metrics.

The TPU dashboard is populated only if you have system metrics enabled in your GKE cluster.

Runtime metrics

In GKE version 1.27.4-gke.900 or later, TPU workloads that use JAX version 0.4.14 or later and specify containerPort: 8431 export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU workload's runtime performance:

  • Duty cycle: Percentage of time over the past sampling period (60 seconds) during which the TensorCores were actively processing on a TPU chip. Larger percentage means better TPU utilization.
  • Memory used: Amount of accelerator memory allocated in bytes. Sampled every 60 seconds.
  • Memory total: Total accelerator memory in bytes. Sampled every 60 seconds.
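As an arithmetic sketch of how these percentages relate to the raw values (illustrative only, not the exporter's actual code):

```python
def duty_cycle_percent(active_seconds, sample_period_seconds=60):
    """Share of the sampling period the TensorCores were actively processing."""
    return 100.0 * active_seconds / sample_period_seconds

def memory_used_percent(memory_used_bytes, memory_total_bytes):
    """Share of total accelerator memory currently allocated."""
    return 100.0 * memory_used_bytes / memory_total_bytes

# Example: TensorCores busy for 45 of the last 60 seconds on a 16 GiB chip
# with 12 GiB allocated.
print(duty_cycle_percent(45))                       # 75.0
print(memory_used_percent(12 * 2**30, 16 * 2**30))  # 75.0
```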

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/duty_cycle
  • kubernetes.io/container/accelerator/memory_used
  • kubernetes.io/container/accelerator/memory_total

Kubernetes node:

  • kubernetes.io/node/accelerator/duty_cycle
  • kubernetes.io/node/accelerator/memory_used
  • kubernetes.io/node/accelerator/memory_total

Host metrics

In GKE version 1.28.1-gke.1066000 or later, VMs in a TPU slice export TPU utilization metrics as GKE system metrics. The following metrics are available in Cloud Monitoring to monitor your TPU host's performance:

  • TensorCore utilization: Current percentage of the TensorCore that is utilized. The TensorCore comprises the matrix-multiply units (MXUs) and the vector unit. The TensorCore utilization value is computed by dividing the number of TensorCore operations performed over the past sample period (60 seconds) by the maximum number of TensorCore operations supported over the same period. A larger value means better utilization.
  • Memory bandwidth utilization: Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period (60s) by the maximum supported bandwidth over the same sample period.

These metrics are located in the Kubernetes node (k8s_node) and Kubernetes container (k8s_container) schema.

Kubernetes container:

  • kubernetes.io/container/accelerator/tensorcore_utilization
  • kubernetes.io/container/accelerator/memory_bandwidth_utilization

Kubernetes node:

  • kubernetes.io/node/accelerator/tensorcore_utilization
  • kubernetes.io/node/accelerator/memory_bandwidth_utilization

For more information, see Kubernetes metrics and GKE system metrics.

Known issues

  • Cluster autoscaler might incorrectly calculate capacity for new TPU slice nodes before those nodes report available TPUs. Cluster autoscaler might then perform additional scale-up and create more nodes than needed. If the extra nodes aren't needed, cluster autoscaler removes them during a regular scale-down operation.
  • Cluster autoscaler cancels the scale-up of TPU slice node pools that remain in a waiting status for more than 10 hours, and retries such scale-up operations later. This behavior might reduce TPU obtainability for customers who don't use reservations.
  • Non-TPU workloads that tolerate the TPU taint can prevent scale-down of the node pool if they are recreated while the TPU slice node pool is being drained.
  • Memory bandwidth utilization metric is not available for v5e TPUs.

What's next