Troubleshoot TPUs in GKE


This page shows you how to resolve issues related to TPUs in Google Kubernetes Engine (GKE).

If you need additional assistance, reach out to Cloud Customer Care.

Insufficient quota to satisfy the TPU request

An error similar to Insufficient quota to satisfy the request indicates that your Google Cloud project doesn't have enough quota available to satisfy the request.

To resolve this issue, check your project's quota limit and current usage. If needed, request an increase to your TPU quota.

Check quota limit and current usage

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console.

  2. In the Filter box, do the following:

    1. Select the Service property, enter Compute Engine API, and press Enter.

    2. Select the Type property and choose Quota.

    3. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the desired region, and you must request a TPU quota increase.

Error when enabling node auto-provisioning in a TPU slice node pool

The following error occurs when you enable node auto-provisioning in a GKE cluster that doesn't support TPUs.

The error message is similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=400,
  message=Invalid resource: tpu-v4-podslice.

To resolve this issue, upgrade your GKE cluster to version 1.27.6 or later.

GKE doesn't automatically provision TPU slice nodes

The following sections describe the cases where GKE doesn't automatically provision TPU slice nodes and how to fix them.

Limit misconfiguration

GKE doesn't automatically provision TPU slice nodes if the auto-provisioning limits you defined for a cluster are too low. You may observe the following errors in such scenarios:

  • If a TPU slice node pool exists, but GKE can't scale up the nodes due to violating resource limits, you can see the following error message when running the kubectl get events command:

    11s Normal NotTriggerScaleUp pod/tpu-workload-65b69f6c95-ccxwz pod didn't
    trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 max
    cluster cpu, memory limit reached
    

    Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:

    "Your cluster has one or more unschedulable Pods"
    
  • When GKE attempts to auto-provision a TPU slice node pool that exceeds resource limits, the cluster autoscaler visibility logs will display the following error message:

    messageId: "no.scale.up.nap.pod.zonal.resources.exceeded"
    

    Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:

    "Can't scale up because node auto-provisioning can't provision a node pool for
    the Pod if it would exceed resource limits"
    

To resolve these issues, increase the maximum number of TPU chips, CPU cores, and memory in the cluster.

To do so, complete the following steps:

  1. Calculate the resource requirements for a given TPU machine type and count. Note that you need to add resources for non-TPU slice node pools, like system workloads.
  2. Obtain a description of the available TPU, CPU, and memory for a specific machine type and zone. Use the gcloud CLI:

    gcloud compute machine-types describe MACHINE_TYPE \
        --zone COMPUTE_ZONE
    

    Replace the following:

    • MACHINE_TYPE: The machine type to describe.
    • COMPUTE_ZONE: The name of the compute zone.

    The output includes a description line similar to the following:

    description: 240 vCPUs, 407 GB RAM, 4 Google TPUs
    
  3. Calculate the total number of CPU cores and the total amount of memory by multiplying these amounts by the required number of nodes. For example, the ct4p-hightpu-4t machine type has 240 CPU cores and 407 GB of RAM with 4 TPU chips. If you require 20 TPU chips, which corresponds to five nodes, you must define the following values:

    • --max-accelerator=type=tpu-v4-podslice,count=20
    • CPU = 1200 (240 times 5)
    • memory = 2035 (407 times 5)

    You should define the limits with some margin to accommodate non-TPU slice nodes such as system workloads.

  4. Update the cluster limits:

    gcloud container clusters update CLUSTER_NAME \
        --max-accelerator type=TPU_ACCELERATOR,count=MAXIMUM_ACCELERATOR \
        --max-cpu=MAXIMUM_CPU \
        --max-memory=MAXIMUM_MEMORY
    

    Replace the following:

    • CLUSTER_NAME: The name of the cluster.
    • TPU_ACCELERATOR: The name of the TPU accelerator.
    • MAXIMUM_ACCELERATOR: The maximum number of TPU chips in the cluster.
    • MAXIMUM_CPU: The maximum number of cores in the cluster.
    • MAXIMUM_MEMORY: The maximum number of gigabytes of memory in the cluster.
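The arithmetic in step 3 can be sketched as a small helper. This is only an illustration: the function name autoprovisioning_limits and the margin_percent parameter are hypothetical and not part of any Google Cloud tool. The per-node figures come from the description line that gcloud compute machine-types describe prints for your machine type.

```python
import math


def autoprovisioning_limits(tpu_chips, chips_per_node, vcpus_per_node,
                            memory_gb_per_node, margin_percent=0):
    """Estimate cluster-wide limits for a TPU slice node pool.

    margin_percent (hypothetical parameter) adds headroom for non-TPU
    node pools, such as the ones that run system workloads.
    """
    # Number of nodes needed to provide the requested TPU chips.
    nodes = math.ceil(tpu_chips / chips_per_node)

    def with_margin(amount):
        # Integer ceiling of amount * (100 + margin_percent) / 100.
        return -(-amount * (100 + margin_percent) // 100)

    return {
        "nodes": nodes,
        "max_accelerator_count": nodes * chips_per_node,
        "max_cpu": with_margin(nodes * vcpus_per_node),
        "max_memory": with_margin(nodes * memory_gb_per_node),
    }


# ct4p-hightpu-4t: 240 vCPUs, 407 GB RAM, 4 TPU chips per node.
print(autoprovisioning_limits(20, 4, 240, 407))
print(autoprovisioning_limits(20, 4, 240, 407, margin_percent=10))
```

With no margin, the result matches the worked example: five nodes, 20 TPU chips, 1200 CPU cores, and 2035 GB of memory. How much margin you need depends on the other node pools in your cluster.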

Not all instances running

ERROR: nodes cannot be created due to lack of capacity. The missing nodes
will be created asynchronously once capacity is available. You can either
wait for the nodes to be up, or delete the node pool and try re-creating it
again later.

This error might appear when the GKE operation times out, or when the request can't be fulfilled and is queued for provisioning single-host or multi-host TPU slice node pools. To mitigate capacity issues, you can use reservations or consider Spot VMs.

Workload misconfiguration

GKE doesn't automatically provision TPU slice nodes if the workload is misconfigured. The following are some of the most common causes:

  • The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels are incorrect or missing in the Pod spec. In this case, GKE doesn't provision TPU slice node pools and node auto-provisioning can't scale up the cluster.
  • The Pod spec doesn't specify google.com/tpu in its resource requirements.

To resolve this issue, do one of the following:

  1. Check that there are no unsupported labels in your workload node selector. For example, a node selector for the cloud.google.com/gke-nodepool label prevents GKE from creating additional node pools for your Pods.
  2. Ensure that the Pod template specification where your TPU workload runs includes the following values:
    • The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels in its nodeSelector.
    • google.com/tpu in its resource requests.

To learn how to deploy TPU workloads in GKE, see Run a workload that displays the number of available TPU chips in a TPU slice node pool.
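As a rough illustration of these checks, the following sketch inspects a Pod spec that has already been loaded into a Python dictionary (for example, from a parsed manifest). The function name find_tpu_spec_problems is hypothetical, and the topology value and container name in the sample spec are placeholders chosen for illustration, not recommendations.

```python
REQUIRED_SELECTORS = (
    "cloud.google.com/gke-tpu-accelerator",
    "cloud.google.com/gke-tpu-topology",
)
NODE_POOL_SELECTOR = "cloud.google.com/gke-nodepool"
TPU_RESOURCE = "google.com/tpu"


def find_tpu_spec_problems(pod_spec):
    """Return a list of human-readable problems found in a Pod spec dict."""
    problems = []
    selector = pod_spec.get("nodeSelector") or {}

    # Both TPU labels must be present and non-empty in the nodeSelector.
    for label in REQUIRED_SELECTORS:
        if not selector.get(label):
            problems.append(f"nodeSelector is missing {label}")

    # Pinning Pods to a node pool keeps GKE from creating new node pools.
    if NODE_POOL_SELECTOR in selector:
        problems.append(f"nodeSelector pins the Pod with {NODE_POOL_SELECTOR}")

    # At least one container must ask for the TPU resource.
    asks_for_tpu = any(
        TPU_RESOURCE in ((container.get("resources") or {}).get(kind) or {})
        for container in pod_spec.get("containers", [])
        for kind in ("requests", "limits")
    )
    if not asks_for_tpu:
        problems.append(f"no container requests {TPU_RESOURCE}")
    return problems


# Sample spec; the topology and container name are illustrative placeholders.
sample_spec = {
    "nodeSelector": {
        "cloud.google.com/gke-tpu-accelerator": "tpu-v4-podslice",
        "cloud.google.com/gke-tpu-topology": "2x2x1",
    },
    "containers": [
        {
            "name": "tpu-workload",
            "resources": {
                "requests": {"google.com/tpu": 4},
                "limits": {"google.com/tpu": 4},
            },
        }
    ],
}
print(find_tpu_spec_problems(sample_spec))
```

An empty list means that none of the misconfigurations described in this section were found. It doesn't guarantee that the label values themselves are valid for your cluster.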

Scheduling errors when deploying Pods that consume TPUs in GKE

The following issue occurs when GKE can't schedule Pods requesting TPUs on TPU slice nodes. For example, this might occur if some non-TPU Pods were already scheduled on TPU slice nodes.

The error message, emitted as a FailedScheduling event on the Pod, is similar to the following:

Cannot schedule pods: Preemption is not helpful for scheduling.

Error message: 0/2 nodes are available: 2 node(s) had untolerated taint
{google.com/tpu: present}. preemption: 0/2 nodes are available: 2 Preemption is
not helpful for scheduling

To resolve this issue, do the following:

Check that you have at least one CPU node pool in your cluster so that system-critical Pods can run on non-TPU nodes. To learn more, see Deploy a Pod to a specific node pool.

Troubleshooting common issues with JobSets in GKE

For common issues with JobSet, and troubleshooting suggestions, see the JobSet Troubleshooting page. That page covers common issues such as the "Webhook not available" error, child Jobs or Pods that aren't created, and problems resuming preempted workloads that use JobSet and Kueue.

TPU initialization failed

The following issue occurs when GKE can't provision new TPU workloads due to lack of permission to access TPU devices.

The error message is similar to the following:

TPU platform initialization failed: FAILED_PRECONDITION: Couldn't mmap: Resource
temporarily unavailable.; Unable to create Node RegisterInterface for node 0,
config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: ""
dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true;
could not create driver instance

To resolve this issue, make sure you either run your TPU container in privileged mode or you increase the ulimit inside your container.

Scheduling deadlock

The scheduling of two or more Jobs might fail in a deadlock. For example, consider a scenario where all of the following occur:

  • You have two Jobs (Job A and Job B) with Pod affinity rules. GKE schedules the TPU slices for both Jobs with a TPU topology of v4-32.
  • You have two v4-32 TPU slices in the cluster.
  • Your cluster has ample capacity to schedule both Jobs and, in theory, each Job can be quickly scheduled on each TPU slice.
  • The Kubernetes scheduler schedules one Pod from Job A on one slice, and then schedules one Pod from Job B on the same slice.

In this case, given the Pod affinity rules for Job A, the scheduler attempts to schedule all remaining Pods for Job A and for Job B, on a single TPU slice each. As a result, GKE won't be able to fully schedule either Job A or Job B. Hence, the status of both Jobs will remain Pending.

To resolve this issue, use Pod anti-affinity with cloud.google.com/gke-nodepool as the topologyKey, as shown in the following example:

apiVersion: batch/v1
kind: Job
metadata:
 name: pi
spec:
 parallelism: 2
 template:
   metadata:
     labels:
       job: pi
   spec:
     affinity:
       podAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: In
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: NotIn
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
           namespaceSelector:
             matchExpressions:
             - key: kubernetes.io/metadata.name
               operator: NotIn
               values:
               - kube-system
     containers:
     - name: pi
       image: perl:5.34.0
       command: ["sleep",  "60"]
     restartPolicy: Never
 backoffLimit: 4

Permission denied during cluster creation in us-central2

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you may encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=403,
message=Permission denied on 'locations/us-central2' (or it may not exist).

This error occurs because the region us-central2 is a private region.

To resolve this issue, file a support case or reach out to your account team to ask for us-central2 to be made visible within your Google Cloud project.

Insufficient quota during TPU node pool creation in us-central2

If you are attempting to create a TPU slice node pool in us-central2 (the only region where TPU v4 is available), then you may need to increase the following GKE-related quotas when you first create TPU v4 node pools:

  • Persistent Disk SSD (GB) quota in us-central2: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating in us-central2 and 100 GB (maximum_nodes X 100 GB).
  • In-use IP addresses quota in us-central2: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating in us-central2.
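The two lower bounds above reduce to simple multiplication. The following sketch makes that explicit. The function name minimum_node_quotas is hypothetical, and the 100 GB default boot disk size is the figure stated in the list above.

```python
def minimum_node_quotas(max_nodes, boot_disk_gb=100):
    """Return lower bounds for the quotas that GKE nodes consume in a region."""
    return {
        # Each node's boot disk counts against Persistent Disk SSD (GB).
        "persistent_disk_ssd_gb": max_nodes * boot_disk_gb,
        # Each node consumes one in-use IP address.
        "in_use_ip_addresses": max_nodes,
    }


# Example: planning for at most 64 GKE nodes in the region.
print(minimum_node_quotas(64))
```

These values are minimums for GKE nodes only. Other resources in the same region that use SSD persistent disks or IP addresses count against the same quotas.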

Missing subnet during GKE cluster creation

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you may encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=404,
message=Not found: project <PROJECT> does not have an auto-mode subnetwork
for network "default" in region <REGION>.

A subnet is required in your VPC network to provide connectivity to your GKE nodes. However, in certain regions such as us-central2, a default subnet may not be created, even when you use the default VPC network in auto-mode (for subnet creation).

To resolve this issue, ensure that you have created a custom subnet in the region before creating your GKE cluster. This subnet must not overlap with other subnets created in other regions in the same VPC network.
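Before you create the subnet, you can check a candidate IP range against the ranges already used in the VPC network. The following sketch uses Python's standard ipaddress module. The function name overlapping_subnets and all CIDR ranges shown are made up for illustration; substitute the ranges from your own network.

```python
import ipaddress


def overlapping_subnets(candidate_cidr, existing_cidrs):
    """Return the existing ranges that overlap with the candidate range."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]


# Illustrative ranges of subnets that already exist in other regions.
existing = ["10.128.0.0/20", "10.132.0.0/20"]

print(overlapping_subnets("10.200.0.0/20", existing))  # no conflicts expected
print(overlapping_subnets("10.128.8.0/21", existing))  # conflicts with the first range
```

An empty result means that the candidate range doesn't collide with any of the listed ranges. The check only covers the ranges that you pass in, so include the secondary ranges of existing subnets if your network uses them.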

View GKE TPU logs

When GKE system and workload logging are enabled, Cloud Logging offers a centralized location to query all TPU-related logs for a specific workload. In Cloud Logging, logs are organized into log entries, and each log entry has a structured format. The following is an example of a TPU training job log entry:

{
  insertId: "gvqk7r5qc5hvogif"
  labels: {
    compute.googleapis.com/resource_name: "gke-tpu-9243ec28-wwf5"
    k8s-pod/batch_kubernetes_io/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/batch_kubernetes_io/job-name: "mnist-training-job"
    k8s-pod/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/job-name: "mnist-training-job"
  }
  logName: "projects/gke-tpu-demo-project/logs/stdout"
  receiveTimestamp: "2024-06-26T05:52:39.652122589Z"
  resource: {
    labels: {
      cluster_name: "tpu-test"
      container_name: "tensorflow"
      location: "us-central2-b"
      namespace_name: "default"
      pod_name: "mnist-training-job-l74l8"
      project_id: "gke-tpu-demo-project"
    }
    type: "k8s_container"
  }
  severity: "INFO"
  textPayload: "
  1/938 [..............................] - ETA: 13:36 - loss: 2.3238 - accuracy: 0.0469
  6/938 [..............................] - ETA: 9s - loss: 2.1227 - accuracy: 0.2995
 13/938 [..............................] - ETA: 8s - loss: 1.7952 - accuracy: 0.4760
 20/938 [..............................] - ETA: 7s - loss: 1.5536 - accuracy: 0.5539
 27/938 [..............................] - ETA: 7s - loss: 1.3590 - accuracy: 0.6071
 36/938 [>.............................] - ETA: 6s - loss: 1.1622 - accuracy: 0.6606
 44/938 [>.............................] - ETA: 6s - loss: 1.0395 - accuracy: 0.6935
 51/938 [>.............................] - ETA: 6s - loss: 0.9590 - accuracy: 0.7160
……
937/938 [============================>.] - ETA: 0s - loss: 0.2184 - accuracy: 0.9349"
  timestamp: "2024-06-26T05:52:38.962950115Z"
}

Each log entry from the TPU slice nodes has the label compute.googleapis.com/resource_name, with its value set to the node name. If you want to view the logs from a particular node and you know the node name, you can filter the logs by that node in your query. For example, the following query shows the logs from the TPU node gke-tpu-9243ec28-wwf5:

resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" = "gke-tpu-9243ec28-wwf5"

GKE attaches the labels cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology to all nodes that contain TPUs. If you aren't sure about the node name, or you want to list all the TPU slice nodes, you can run the following command:

kubectl get nodes -l cloud.google.com/gke-tpu-accelerator

Sample output:

NAME                    STATUS   ROLES    AGE     VERSION
gke-tpu-9243ec28-f2f1   Ready    <none>   25m     v1.30.1-gke.1156000
gke-tpu-9243ec28-wwf5   Ready    <none>   7d22h   v1.30.1-gke.1156000

You can do additional filtering based on the node labels and their values. For example, the following command lists TPU slice nodes with a specific type and topology:

kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice,cloud.google.com/gke-tpu-topology=1x1

To view all the logs across the TPU slice nodes, you can use a query that matches the label against the common prefix of the TPU slice node names. For example, use the following query:

resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*"
log_id("stdout")

To view the logs associated with a particular TPU workload using a Kubernetes Job, you can filter the logs using the batch.kubernetes.io/job-name label. For example, for the job mnist-training-job, you can run the following query for the STDOUT logs:

resource.type="k8s_container"
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job"
log_id("stdout")

To view the logs for a TPU workload using a Kubernetes JobSet, you can filter the logs using the k8s-pod/jobset_sigs_k8s_io/jobset-name label. For example:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"

To drill down further, you can filter based on other workload labels. For example, to view the logs for a multislice workload from worker 0 and slice 1, you can filter based on the job-completion-index and job-index labels:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
labels."k8s-pod/batch_kubernetes_io/job-completion-index"="0"
labels."k8s-pod/jobset_sigs_k8s_io/job-index"="1"

You can also filter using the Pod name pattern:

resource.labels.pod_name:<jobSetName>-<replicateJobName>-<job-index>-<worker-index>

For example, in the following query the jobSetName is multislice-job, and the replicateJobName is slice. Both job-index and worker-index are 0:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
resource.labels.pod_name:"multislice-job-slice-0-0"
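The Pod name pattern can also be assembled programmatically, which helps when you check many slices or workers. The following is a minimal sketch based only on the pattern shown above. The function names are hypothetical, and the sketch assumes that none of the values contain double quotes.

```python
def jobset_pod_name_prefix(jobset_name, replicated_job_name, job_index, worker_index):
    """Build the Pod name prefix for one worker of one slice."""
    return f"{jobset_name}-{replicated_job_name}-{job_index}-{worker_index}"


def jobset_pod_log_filter(jobset_name, replicated_job_name, job_index, worker_index):
    """Build a Logging query that selects the logs of that worker's Pod."""
    prefix = jobset_pod_name_prefix(
        jobset_name, replicated_job_name, job_index, worker_index
    )
    return "\n".join([
        'resource.type="k8s_container"',
        f'labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="{jobset_name}"',
        # The ":" operator matches Pod names that contain the prefix.
        f'resource.labels.pod_name:"{prefix}"',
    ])


print(jobset_pod_log_filter("multislice-job", "slice", 0, 0))
```

For the values used in the example above, the generated query is identical to the one shown.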

For other TPU workloads, such as a single GKE Pod workload, you can filter the logs by Pod name. For example:

resource.type="k8s_container"
resource.labels.pod_name="tpu-job-jax-demo"

If you want to check if the TPU device plugin is running correctly, you can use the following query to check its container logs:

resource.type="k8s_container"
labels."k8s-pod/k8s-app"="tpu-device-plugin"
resource.labels.namespace_name="kube-system"

Run the following query to check the related events:

jsonPayload.involvedObject.name=~"tpu-device-plugin.*"
log_id("events")

For all queries, you can add additional filters, such as cluster name, location, and project ID. You can also combine conditions to narrow down the results. For example:

resource.type="k8s_container" AND
resource.labels.project_id="gke-tpu-demo-project" AND
resource.labels.location="us-west1" AND
resource.labels.cluster_name="tpu-demo" AND
resource.labels.namespace_name="default" AND
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" AND
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job" AND
log_id("stdout")

The AND operator is optional between comparisons and can be omitted. For more information about the query language, see the Logging query language specification. For more query examples, see Kubernetes-related log queries.

If you prefer SQL, you can find query examples for Log Analytics at SQL query with Log Analytics. You can also run the queries using the Google Cloud CLI instead of the Logs Explorer. For example:

gcloud logging read 'resource.type="k8s_container" labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" log_id("stdout")' --limit 10 --format json
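Because the queries in this section are plain text, you can generate them from a few parameters and then paste the result into the Logs Explorer or pass it to gcloud logging read. The following sketch is one possible way to do that. The function name build_tpu_log_filter and its parameters are hypothetical, and the sketch assumes that the values contain no double quotes.

```python
def build_tpu_log_filter(resource_labels=None, entry_labels=None,
                         node_name_regex=None, log_id="stdout"):
    """Compose a Logging query for TPU workload container logs.

    Comparisons are joined with newlines, which relies on the AND
    operator being optional between comparisons.
    """
    clauses = ['resource.type="k8s_container"']
    # Resource labels, such as cluster_name, location, or namespace_name.
    for key, value in (resource_labels or {}).items():
        clauses.append(f'resource.labels.{key}="{value}"')
    # Log entry labels, such as the Job name label.
    for key, value in (entry_labels or {}).items():
        clauses.append(f'labels."{key}"="{value}"')
    # Optional regular expression over the TPU slice node names.
    if node_name_regex:
        clauses.append(
            f'labels."compute.googleapis.com/resource_name" =~ "{node_name_regex}"'
        )
    if log_id:
        clauses.append(f'log_id("{log_id}")')
    return "\n".join(clauses)


print(build_tpu_log_filter(
    resource_labels={"cluster_name": "tpu-demo", "namespace_name": "default"},
    entry_labels={"k8s-pod/batch_kubernetes_io/job-name": "mnist-training-job"},
    node_name_regex="gke-tpu-9243ec28.*",
))
```

The printed text uses the same comparisons as the combined query shown earlier, without the explicit AND operators.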
