Troubleshoot TPUs in GKE

Standard

This page shows you how to resolve issues related to TPUs in Google Kubernetes Engine (GKE).

Insufficient quota to satisfy the TPU request

An error similar to Insufficient quota to satisfy the request indicates your Google Cloud project has insufficient quota available to satisfy the request.

To resolve this issue, check your project's quota limit and current usage. If needed, request an increase to your TPU quota.

Check quota limit and current usage

The following sections help you ensure that you have enough quota when using TPUs in GKE.

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

Go to the Quotas page in the Google Cloud console:

Go to Quotas

In the Filter box, do the following:

Use the following table to select and copy the property of the quota based on the TPU version and on whether you plan to use on-demand or Spot instances. For example, if you plan to create on-demand TPU v5e nodes, enter Name: TPU v5 Lite PodSlice chips.

| TPU version | Property and name of the quota for on-demand instances | Property and name of the quota for Spot instances |
| --- | --- | --- |
| TPU v3 | Dimensions (e.g. location): tpu_family:CT3 | Not applicable |
| TPU v3 | Dimensions (e.g. location): tpu_family:CT3P | Not applicable |
| TPU v4 | Name: TPU v4 PodSlice chips | Name: Preemptible TPU v4 PodSlice chips |
| TPU v5e | Name: TPU v5 Lite PodSlice chips | Name: Preemptible TPU v5 Lite Podslice chips |
| TPU v5p | Name: TPU v5p chips | Name: Preemptible TPU v5p chips |
| TPU Trillium | Dimensions (e.g. location): tpu_family:CT6E | Name: Preemptible TPU slices v6e |

Select the Dimensions (e.g. location) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU slice nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the region that you need, and you must request a TPU quota adjustment.

When a TPU reservation is created, both the limit and current usage values for the corresponding quota increase by the number of chips in the TPU reservation. For example, when a reservation is created for 16 TPU v5e chips, both the Limit and Current usage for the TPU v5 Lite PodSlice chips quota in the relevant region increase by 16.
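The relationship between limit, current usage, and a reservation can be sketched in a few lines of Python. The `Quota` class and its methods are hypothetical names used only for illustration, not part of any Google Cloud API, and the numbers are made up; the real values come from the Quotas page.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Illustrative stand-in for one regional TPU quota row (not a real API type)."""
    limit: int  # chips granted in the region
    usage: int  # chips currently counted against the limit

    def headroom(self) -> int:
        # Chips still available for new TPU slice nodes in the region.
        return self.limit - self.usage

    def with_reservation(self, chips: int) -> "Quota":
        # As described above, a reservation raises both values by its chip count.
        return Quota(self.limit + chips, self.usage + chips)

    def can_fit(self, requested_chips: int) -> bool:
        return requested_chips <= self.headroom()


before = Quota(limit=32, usage=8)
after = before.with_reservation(16)
# The reservation changes both columns but leaves the headroom unchanged.
print(before.headroom(), after.headroom())
```

Because the limit and usage move together, a rise in Current usage after creating a reservation doesn't by itself mean that less on-demand quota is available.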

Quotas for additional GKE resources

You may need to increase the following GKE-related quotas in the regions where GKE creates your resources.

Persistent Disk SSD (GB) quota: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating and 100 GB (nodes * 100 GB).
In-use IP addresses quota: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating.
Ensure that max-pods-per-node aligns with the subnet range: Each Kubernetes node uses secondary IP ranges for Pods. For example, max-pods-per-node of 32 requires 64 IP addresses which translates to a /26 subnet per node. Note that this range shouldn't be shared with any other cluster. To avoid exhausting the IP address range, use the --max-pods-per-node flag to limit the number of pods allowed to be scheduled on a node. The quota for max-pods-per-node should be set at least as high as the maximum number of GKE nodes you anticipate creating.
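The sizing rules in the list above can be sketched as simple arithmetic. The function names are hypothetical, and the per-node Pod range rule (twice max-pods-per-node, rounded up to a power of two) is an assumption generalized from the single 32 to 64 to /26 example in the text.

```python
def boot_disk_quota_gb(max_nodes: int, disk_gb_per_node: int = 100) -> int:
    # Persistent Disk SSD (GB): one default 100 GB boot disk per node.
    return max_nodes * disk_gb_per_node


def in_use_ip_quota(max_nodes: int) -> int:
    # In-use IP addresses: one address per node.
    return max_nodes


def pod_range_prefix(max_pods_per_node: int) -> int:
    # Assumption: the per-node range holds 2 * max-pods-per-node addresses,
    # rounded up to the next power of two (32 Pods -> 64 addresses -> /26).
    needed = 2 * max_pods_per_node
    addresses = 1 << (needed - 1).bit_length()
    return 32 - (addresses.bit_length() - 1)


nodes = 20  # illustrative node count
print(boot_disk_quota_gb(nodes), in_use_ip_quota(nodes), pod_range_prefix(32))
```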

To request an increase in quota, see Request a quota adjustment.

Insufficient TPU resources to satisfy the TPU request

An error that contains GCE_STOCKOUT indicates that TPU resources are temporarily unavailable to satisfy the request. GKE fulfills the provisioning request when TPU resources become available.

To resolve this issue, you can use any of the following consumption options:

Future reservation for up to 90 days (in calendar mode): to provision TPU resources for up to 90 days, for a specified time period. For more information, see Future reservation for up to 90 days (in calendar mode).
Flex-start: to secure resources for up to seven days, with GKE automatically allocating the hardware on a best-effort basis based on availability. For more information, see About GPU and TPU provisioning with flex-start.
Spot VMs: to get significant discounts on capacity that can be preempted at any time, with a 30-second warning. For more information, see Spot VMs.
TPU reservations: to request a future reservation for one year or longer. For more information, see TPU reservations.

Error when enabling node auto-provisioning in a TPU slice node pool

The following error occurs when you are enabling node auto-provisioning in a GKE cluster that doesn't support TPUs.

The error message is similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=400,
  message=Invalid resource: tpu-v4-podslice.

To resolve this issue, upgrade your GKE cluster to version 1.27.6 or later.

GKE doesn't automatically provision TPU slice nodes

The following sections describe the cases where GKE doesn't automatically provision TPU slice nodes and how to fix them.

Limit misconfiguration

If your cluster's auto-provisioning limits are missing or too low, GKE won't automatically provision TPU slice nodes. You might observe the following errors in such scenarios:

When GKE attempts to auto-provision a TPU slice node pool that doesn't have defined limits, the cluster autoscaler visibility logs display the following error message:
```
messageId: "no.scale.up.nap.pod.tpu.no.limit.defined"
```
If a TPU slice node pool exists, but GKE can't scale up the nodes due to violating resource limits, you can see the following error message when running the kubectl get events command:
```
11s Normal NotTriggerScaleUp pod/tpu-workload-65b69f6c95-ccxwz pod didn't
trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 max
cluster cpu, memory limit reached
```
Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:
```
"Your cluster has one or more unschedulable Pods"
```
When GKE attempts to auto-provision a TPU slice node pool that exceeds resource limits, the cluster autoscaler visibility logs will display the following error message:
```
messageId: "no.scale.up.nap.pod.zonal.resources.exceeded"
```
Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:
```
"Can't scale up because node auto-provisioning can't provision a node pool for
the Pod if it would exceed resource limits"
```

To resolve these issues, increase the maximum number of TPU chips, CPU cores, and memory in the cluster.

To complete these steps:

Calculate the resource requirements for a given TPU machine type and count. Note that you need to add resources for non-TPU slice node pools, like system workloads.
Obtain a description of the available TPU, CPU, and memory for a specific machine type and zone. Use the gcloud CLI:
```
gcloud compute machine-types describe MACHINE_TYPE \
    --zone COMPUTE_ZONE
```
Replace the following:
- MACHINE_TYPE: The type of machine to search.
- COMPUTE_ZONE: The name of the compute zone.
The output includes a description line similar to the following:
```
description: 240 vCPUs, 407 GB RAM, 4 Google TPUs
```
Calculate the total CPU and memory by multiplying these amounts by the required number of nodes. For example, the ct4p-hightpu-4t machine type uses 240 CPU cores and 407 GB RAM with 4 TPU chips. Assuming that you require 20 TPU chips, which corresponds to five nodes, you must define the following values:
- --max-accelerator=type=tpu-v4-podslice,count=20
- CPU = 1200 (240 times 5)
- memory = 2035 (407 times 5)
You should define the limits with some margin to accommodate non-TPU slice nodes such as system workloads.
Update the cluster limits:
```
gcloud container clusters update CLUSTER_NAME \
    --max-accelerator type=TPU_ACCELERATOR,count=MAXIMUM_ACCELERATOR \
    --max-cpu=MAXIMUM_CPU \
    --max-memory=MAXIMUM_MEMORY
```
Replace the following:
- CLUSTER_NAME: The name of the cluster.
- TPU_ACCELERATOR: The name of the TPU accelerator.
- MAXIMUM_ACCELERATOR: The maximum number of TPU chips in the cluster.
- MAXIMUM_CPU: The maximum number of cores in the cluster.
- MAXIMUM_MEMORY: The maximum number of gigabytes of memory in the cluster.
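As a rough sketch, the calculation and the resulting command from the steps above can be expressed in Python. The function names and the `extra_cpu` and `extra_memory_gb` margin parameters are hypothetical, and `my-cluster` is a placeholder; only the flags shown in the steps are used.

```python
import math


def autoprovisioning_limits(chips_needed, chips_per_node, vcpus_per_node,
                            memory_gb_per_node, extra_cpu=0, extra_memory_gb=0):
    # Round up to whole nodes, then multiply the per-node amounts.
    nodes = math.ceil(chips_needed / chips_per_node)
    return {
        "nodes": nodes,
        "max_accelerator": nodes * chips_per_node,
        # The extras are the margin for non-TPU nodes such as system workloads.
        "max_cpu": nodes * vcpus_per_node + extra_cpu,
        "max_memory": nodes * memory_gb_per_node + extra_memory_gb,
    }


def update_command(cluster_name, accelerator, limits):
    # Builds the argument list for the update command shown above.
    return [
        "gcloud", "container", "clusters", "update", cluster_name,
        "--max-accelerator",
        f"type={accelerator},count={limits['max_accelerator']}",
        f"--max-cpu={limits['max_cpu']}",
        f"--max-memory={limits['max_memory']}",
    ]


# ct4p-hightpu-4t: 240 vCPUs, 407 GB RAM, 4 TPU chips per node.
limits = autoprovisioning_limits(20, 4, 240, 407)
print(limits)
print(" ".join(update_command("my-cluster", "tpu-v4-podslice", limits)))
```

With no margin this reproduces the TPU-only figures from the example (1200 cores, 2035 GB); in practice you would pass non-zero extras to leave room for system workloads.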

Not all instances running

ERROR: nodes cannot be created due to lack of capacity. The missing nodes
will be created asynchronously once capacity is available. You can either
wait for the nodes to be up, or delete the node pool and try re-creating it
again later.

This error might appear when a GKE operation times out, or when a request to provision single-host or multi-host TPU node pools can't be fulfilled immediately and is queued instead. To mitigate capacity issues, you can use reservations or consider Spot VMs.

Workload misconfiguration

This error occurs due to misconfiguration of the workload. The following are some of the most common causes of the error:

The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels are incorrect or missing in the Pod spec. In this case, GKE doesn't provision TPU slice node pools and node auto-provisioning can't scale up the cluster.
The Pod spec doesn't specify google.com/tpu in its resource requirements.

To resolve this issue, do one of the following:

Check that there are no unsupported labels in your workload node selector. For example, a node selector for cloud.google.com/gke-nodepool label will prevent GKE from creating additional node pools for your Pods.
Ensure the Pod template specifications, where your TPU workload runs, include the following values:
- cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels in its nodeSelector.
- google.com/tpu in its request.
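The checks above can be sketched as a small validator over a Pod spec loaded as a Python dictionary. The function name is hypothetical, and the example values (the 2x2x1 topology and the chip count) are illustrative assumptions rather than recommended settings.

```python
REQUIRED_NODE_SELECTORS = (
    "cloud.google.com/gke-tpu-accelerator",
    "cloud.google.com/gke-tpu-topology",
)
NODEPOOL_LABEL = "cloud.google.com/gke-nodepool"
TPU_RESOURCE = "google.com/tpu"


def find_tpu_spec_problems(pod_spec):
    """Return a list of human-readable problems found in a Pod spec dict."""
    problems = []
    selector = pod_spec.get("nodeSelector") or {}
    for label in REQUIRED_NODE_SELECTORS:
        if not selector.get(label):
            problems.append(f"nodeSelector is missing {label}")
    if NODEPOOL_LABEL in selector:
        problems.append(f"nodeSelector sets {NODEPOOL_LABEL}, which blocks new node pools")
    requests_tpu = False
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):
            if TPU_RESOURCE in (resources.get(section) or {}):
                requests_tpu = True
    if not requests_tpu:
        problems.append(f"no container requests {TPU_RESOURCE}")
    return problems


good_spec = {
    "nodeSelector": {
        "cloud.google.com/gke-tpu-accelerator": "tpu-v4-podslice",
        "cloud.google.com/gke-tpu-topology": "2x2x1",  # illustrative value
    },
    "containers": [{"name": "tpu-job", "resources": {"limits": {TPU_RESOURCE: 4}}}],
}
print(find_tpu_spec_problems(good_spec))
```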

To learn how to deploy TPU workloads in GKE, see Run a workload that displays the number of available TPU chips in a TPU slice node pool.

Scheduling errors when deploying Pods that consume TPUs in GKE

The following issue occurs when GKE can't schedule Pods requesting TPUs on TPU slice nodes. For example, this might occur if some non-TPU Pods were already scheduled on TPU slice nodes.

The error message, emitted as a FailedScheduling event on the Pod, is similar to the following:

Cannot schedule pods: Preemption is not helpful for scheduling.

Error message: 0/2 nodes are available: 2 node(s) had untolerated taint
{google.com/tpu: present}. preemption: 0/2 nodes are available: 2 Preemption is
not helpful for scheduling

To resolve this issue, do the following:

Check that you have at least one CPU node pool in your cluster so the system critical Pods can run in the non-TPU nodes. To learn more, see Deploy a Pod to a specific node pool.
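To see why Pods without a matching toleration can't land on TPU slice nodes, the following sketch mirrors the general Kubernetes taint and toleration matching rules in plain Python. The function names are hypothetical, and the `NoSchedule` effect on the `google.com/tpu` taint is an assumption, because the error message doesn't show the effect.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if not key:
        # An empty key is only valid with Exists and tolerates every taint.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    # Every taint on the node must be tolerated by at least one toleration.
    return all(any(tolerates(t, taint) for t in pod_tolerations) for taint in node_taints)


# Effect is assumed; the key and value come from the error message above.
tpu_taint = {"key": "google.com/tpu", "value": "present", "effect": "NoSchedule"}
tpu_toleration = {"key": "google.com/tpu", "operator": "Equal", "value": "present"}

print(schedulable([], [tpu_taint]))                # Pod without a toleration
print(schedulable([tpu_toleration], [tpu_taint]))  # Pod that tolerates the taint
```

A cluster that has only tainted TPU slice nodes leaves untolerating Pods with nowhere to run, which is why a CPU node pool is needed.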

Troubleshooting common issues with JobSets in GKE

For common issues with JobSet and troubleshooting suggestions, see the JobSet Troubleshooting page. That page covers issues such as the "Webhook not available" error, child Jobs or Pods that aren't created, and problems resuming preempted workloads that use JobSet and Kueue.

TPU initialization failed

The following issue occurs when GKE can't provision new TPU workloads due to lack of permission to access TPU devices.

The error message is similar to the following:

TPU platform initialization failed: FAILED_PRECONDITION: Couldn't mmap: Resource
temporarily unavailable.; Unable to create Node RegisterInterface for node 0,
config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: ""
dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true;
could not create driver instance

To resolve this issue, make sure you either run your TPU container in privileged mode or you increase the ulimit inside your container.
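The text above doesn't say which ulimit to raise. As a diagnostic sketch only, and assuming that the locked-memory limit (`ulimit -l`, `RLIMIT_MEMLOCK`) is the relevant one because the failure is an mmap error, you could print the current limit from inside the container with Python's standard `resource` module (Unix only). The helper names are hypothetical.

```python
import resource


def memlock_limit():
    # Returns the (soft, hard) locked-memory limits of the current process.
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(value):
    # RLIM_INFINITY means the limit is not restricted.
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


soft, hard = memlock_limit()
print("soft:", describe(soft), "hard:", describe(hard))
```

If the soft limit is small, that is consistent with (but doesn't prove) the cause described above.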

Scheduling deadlock

Scheduling of two or more Jobs might fail in a deadlock. For example, consider a scenario where all of the following occur:

You have two Jobs (Job A and Job B) with Pod affinity rules. GKE schedules the TPU slices for both Jobs with a TPU topology of v4-32.
You have two v4-32 TPU slices in the cluster.
Your cluster has ample capacity to schedule both Jobs and, in theory, each Job can be quickly scheduled on each TPU slice.
The Kubernetes scheduler schedules one Pod from Job A on one slice, and then schedules one Pod from Job B on the same slice.

In this case, given the Pod affinity rules for Job A, the scheduler attempts to schedule all remaining Pods for Job A and for Job B, on a single TPU slice each. As a result, GKE won't be able to fully schedule either Job A or Job B. Hence, the status of both Jobs will remain Pending.

To resolve this issue, use Pod anti-affinity with cloud.google.com/gke-nodepool as the topologyKey, as shown in the following example:

apiVersion: batch/v1
kind: Job
metadata:
 name: pi
spec:
 parallelism: 2
 template:
   metadata:
     labels:
       job: pi
   spec:
     affinity:
       podAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: In
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: NotIn
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
           namespaceSelector:
             matchExpressions:
             - key: kubernetes.io/metadata.name
               operator: NotIn
               values:
               - kube-system
     containers:
     - name: pi
       image: perl:5.34.0
       command: ["sleep",  "60"]
     restartPolicy: Never
 backoffLimit: 4
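The effect of the anti-affinity rule in this manifest can be sketched as a simplified placement check. This is a toy model with hypothetical function names, not the Kubernetes scheduler: it treats each node pool as a list of Pods and evaluates only the anti-affinity term.

```python
def blocks(existing_pod, job_label):
    """True if an existing Pod in the node pool triggers the anti-affinity term."""
    # The namespaceSelector excludes kube-system from the rule.
    if existing_pod["namespace"] == "kube-system":
        return False
    # Operator NotIn matches Pods whose job label differs, or is missing.
    return existing_pod["labels"].get("job") != job_label


def can_place(job_label, node_pool_pods):
    # The new Pod fits only if no existing Pod in the pool blocks it.
    return not any(blocks(pod, job_label) for pod in node_pool_pods)


slice_1 = [
    {"namespace": "default", "labels": {"job": "job-a"}},
    {"namespace": "kube-system", "labels": {"k8s-app": "some-agent"}},
]

print(can_place("job-a", slice_1))  # same Job may join the slice
print(can_place("job-b", slice_1))  # a different Job is kept off the slice
```

Because a Pod from Job B can never join a slice that already hosts Job A, the interleaving that causes the deadlock can't occur.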

Permission denied during cluster creation in us-central2

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you might encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=403,
message=Permission denied on 'locations/us-central2' (or it may not exist).

This error occurs because the region us-central2 is a private region.

To resolve this issue, file a support case or reach out to your account team to ask for us-central2 to be made visible within your Google Cloud project.

Insufficient quota during TPU node pool creation in us-central2

If you are attempting to create a TPU slice node pool in us-central2 (the only region where TPU v4 is available), then you might need to increase the following GKE-related quotas when you first create TPU v4 node pools:

Persistent Disk SSD (GB) quota in us-central2: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating in us-central2 and 100 GB (maximum_nodes X 100 GB).
In-use IP addresses quota in us-central2: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating in us-central2.

Missing subnet during GKE cluster creation

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you might encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=404,
message=Not found: project <PROJECT> does not have an auto-mode subnetwork
for network "default" in region <REGION>.

A subnet is required in your VPC network to provide connectivity to your GKE nodes. However, in certain regions such as us-central2, a default subnet might not be created, even when you use the default VPC network in auto-mode (for subnet creation).

To resolve this issue, ensure that you have created a custom subnet in the region before creating your GKE cluster. This subnet must not overlap with other subnets created in other regions in the same VPC network.

Enable the read-only kubelet port

If you use a GKE cluster version that's earlier than 1.32, make sure that the insecureKubeletReadonlyPortEnabled field is set to true.

You can check the value of the insecureKubeletReadonlyPortEnabled field by describing your node pool:

gcloud container node-pools describe NODEPOOL_NAME --cluster=CLUSTER_NAME

If the output includes insecureKubeletReadonlyPortEnabled: false, then enable the port by running the following command:

gcloud container node-pools update NODEPOOL_NAME --cluster CLUSTER_NAME --enable-insecure-kubelet-readonly-port

The following sample errors mention a TCP connection error to port 10255, which indicates that you might need to enable the port.

error sending request: Get "http://gke-tpu-d32e5ca6-f4gp:10255/pods": GET http://gke-tpu-d32e5ca6-f4gp:10255/pods giving up after 5 attempt(s): Get "http://gke-tpu-d32e5ca6-f4gp:10255/pods": dial tcp [2600:1901:8130:662:0:19c::]:10255: connect: connection refused

failed to get TPU container Info: failed to call kubelet: Get "http://gke-tpu-d32e5ca6-f4gp:10255/pods": GET http://gke-tpu-d32e5ca6-f4gp:10255/pods giving up after 5 attempt(s): Get "http://gke-tpu-d32e5ca6-f4gp:10255/pods": dial tcp [2600:1901:8130:662:0:19c::]:10255: connect: connection refused

Connection error when running a training workload with JAX

If you're attempting to initialize the JAX framework to run a training workload on TPU machines, then you might find an error message similar to the following:

E0115 19:06:10.727412 340 master.cc:246] Initialization of slice failed with
error status: INVALID_ARGUMENT: When linking node TPU_ID:pe0:0
to TPU_ID:pe0:3 with link TPU_ID:pe0:0:p5:x couldn't find opposite link in destination node.; Failed to create the mesh (xW, xW, xW); Please make sure the topology is correct.;
Failed to discover ICI network topology

This error occurs when GKE fails to establish the high-speed inter-chip interconnect (ICI) network topology across large TPU slices.

To mitigate this issue, complete the following steps:

Identify the TPU slices that experience the connectivity error. To see the event logs, use the following query:
```
resource.type="k8s_container"
resource.labels.project_id=PROJECT_ID
severity>=DEFAULT
SEARCH("`[/dev/vfio/0` `TPU_ID` Driver `opened.`")
```
Replace the following:
- PROJECT_ID: your project ID.
- TPU_ID: the ID of the TPU experiencing errors. You can see the TPU ID in the error message.
Taint the node pool or one of the nodes included in the error message. To learn more, see Taint and label a node pool for your workloads.
Rerun the Job on another node pool.

If the issue persists, file a support case or reach out to your account team.

View GKE TPU logs

When GKE system and workload logging are enabled, Cloud Logging offers a centralized location to query all TPU-related logs for a specific workload. In Cloud Logging, logs are organized into log entries, and each individual log entry has a structured format. The following is an example of a TPU training job log entry.

{
  insertId: "gvqk7r5qc5hvogif"
  labels: {
    compute.googleapis.com/resource_name: "gke-tpu-9243ec28-wwf5"
    k8s-pod/batch_kubernetes_io/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/batch_kubernetes_io/job-name: "mnist-training-job"
    k8s-pod/controller-uid: "443a3128-64f3-4f48-a4d3-69199f82b090"
    k8s-pod/job-name: "mnist-training-job"
  }
  logName: "projects/gke-tpu-demo-project/logs/stdout"
  receiveTimestamp: "2024-06-26T05:52:39.652122589Z"
  resource: {
    labels: {
      cluster_name: "tpu-test"
      container_name: "tensorflow"
      location: "us-central2-b"
      namespace_name: "default"
      pod_name: "mnist-training-job-l74l8"
      project_id: "gke-tpu-demo-project"
    }
    type: "k8s_container"
  }
  severity: "INFO"
  textPayload: "
  1/938 [..............................] - ETA: 13:36 - loss: 2.3238 - accuracy: 0.0469
  6/938 [..............................] - ETA: 9s - loss: 2.1227 - accuracy: 0.2995
 13/938 [..............................] - ETA: 8s - loss: 1.7952 - accuracy: 0.4760
 20/938 [..............................] - ETA: 7s - loss: 1.5536 - accuracy: 0.5539
 27/938 [..............................] - ETA: 7s - loss: 1.3590 - accuracy: 0.6071
 36/938 [>.............................] - ETA: 6s - loss: 1.1622 - accuracy: 0.6606
 44/938 [>.............................] - ETA: 6s - loss: 1.0395 - accuracy: 0.6935
 51/938 [>.............................] - ETA: 6s - loss: 0.9590 - accuracy: 0.7160
……
937/938 [============================>.] - ETA: 0s - loss: 0.2184 - accuracy: 0.9349"
  timestamp: "2024-06-26T05:52:38.962950115Z"
}

Each log entry from the TPU slice nodes has the label compute.googleapis.com/resource_name with the value set to the node name. If you want to view the logs from a particular node and you know the node name, you can filter the logs by that node in your query. For example, the following query shows the logs from the TPU node gke-tpu-9243ec28-wwf5:

resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" = "gke-tpu-9243ec28-wwf5"

GKE attaches the cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels to all nodes containing TPUs. So, if you're not sure about the node name or you want to list all the TPU slice nodes, you can run the following command:

kubectl get nodes -l cloud.google.com/gke-tpu-accelerator

Sample output:

NAME                    STATUS   ROLES    AGE     VERSION
gke-tpu-9243ec28-f2f1   Ready    <none>   25m     v1.30.1-gke.1156000
gke-tpu-9243ec28-wwf5   Ready    <none>   7d22h   v1.30.1-gke.1156000

You can do additional filtering based on the node labels and their values. For example, the following command lists TPU slice nodes with a specific type and topology:

kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v5-lite-podslice,cloud.google.com/gke-tpu-topology=1x1

To view all the logs across the TPU slice nodes, you can use a query that matches the label against the TPU slice node name prefix. For example, use the following query:

resource.type="k8s_container"
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*"
log_id("stdout")

To view the logs associated with a particular TPU workload using a Kubernetes Job, you can filter the logs using the batch.kubernetes.io/job-name label. For example, for the job mnist-training-job, you can run the following query for the STDOUT logs:

resource.type="k8s_container"
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job"
log_id("stdout")

To view the logs for a TPU workload using a Kubernetes JobSet, you can filter the logs using the k8s-pod/jobset_sigs_k8s_io/jobset-name label. For example:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"

To drill down further, you can filter based on the other workload labels. For example, to view the logs for a multislice workload from worker 0 and slice 1, you can filter based on the job-completion-index and job-index labels:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
labels."k8s-pod/batch_kubernetes_io/job-completion-index"="0"
labels."k8s-pod/jobset_sigs_k8s_io/job-index"="1"

You can also filter using the Pod name pattern:

resource.labels.pod_name:<jobSetName>-<replicatedJobName>-<job-index>-<worker-index>

For example, in the following query the jobSetName is multislice-job, and the replicatedJobName is slice. Both job-index and worker-index are 0:

resource.type="k8s_container"
labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="multislice-job"
resource.labels.pod_name:"multislice-job-slice-0-0"
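If you build these queries in scripts, the Pod name pattern can be assembled programmatically. The following is a minimal sketch with hypothetical helper names; it only formats strings and doesn't call any Cloud Logging API.

```python
def jobset_pod_name(jobset_name, replicated_job_name, job_index, worker_index):
    # Follows the <jobSetName>-<replicatedJobName>-<job-index>-<worker-index> pattern.
    return f"{jobset_name}-{replicated_job_name}-{job_index}-{worker_index}"


def jobset_worker_filter(jobset_name, replicated_job_name, job_index, worker_index):
    pod_name = jobset_pod_name(jobset_name, replicated_job_name, job_index, worker_index)
    lines = [
        'resource.type="k8s_container"',
        f'labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="{jobset_name}"',
        # The ":" operator is a substring match on the Pod name.
        f'resource.labels.pod_name:"{pod_name}"',
    ]
    return "\n".join(lines)


print(jobset_worker_filter("multislice-job", "slice", 0, 0))
```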

For other TPU workloads, such as a single GKE Pod workload, you can filter the logs by Pod name. For example:

resource.type="k8s_container"
resource.labels.pod_name="tpu-job-jax-demo"

If you want to check if the TPU device plugin is running correctly, you can use the following query to check its container logs:

resource.type="k8s_container"
labels."k8s-pod/k8s-app"="tpu-device-plugin"
resource.labels.namespace_name="kube-system"

Run the following query to check the related events:

jsonPayload.involvedObject.name=~"tpu-device-plugin.*"
log_id("events")

For all queries, you can add additional filters, such as cluster name, location, and project ID. You can also combine conditions to narrow down the results. For example:

resource.type="k8s_container" AND
resource.labels.project_id="gke-tpu-demo-project" AND
resource.labels.location="us-west1" AND
resource.labels.cluster_name="tpu-demo" AND
resource.labels.namespace_name="default" AND
labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" AND
labels."k8s-pod/batch_kubernetes_io/job-name" = "mnist-training-job" AND
log_id("stdout")

The AND operator is optional between comparisons and can be omitted. For more information about the query language, you can read the Logging query language specification. You can also read Kubernetes related log queries for more query examples.

If you prefer SQL using Log Analytics, you can find query examples at SQL query with Log Analytics. Alternatively, you can also run the queries using the Google Cloud CLI instead of in the Logs Explorer. For example:

gcloud logging read 'resource.type="k8s_container" labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*" log_id("stdout")' --limit 10 --format json
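When a filter contains both single and double quotes, shell quoting is easy to get wrong. As a sketch, you could assemble the same command in Python and let `shlex` handle the quoting. The `logging_read_argv` helper is a hypothetical name; the flags are the ones used in the command above.

```python
import shlex


def logging_read_argv(conditions, limit=10):
    # Conditions are joined with spaces because AND is optional between comparisons.
    log_filter = " ".join(conditions)
    return ["gcloud", "logging", "read", log_filter,
            "--limit", str(limit), "--format", "json"]


conditions = [
    'resource.type="k8s_container"',
    'labels."compute.googleapis.com/resource_name" =~ "gke-tpu-9243ec28.*"',
    'log_id("stdout")',
]
argv = logging_read_argv(conditions)

# shlex.join produces a shell-safe, copy-pasteable command line.
print(shlex.join(argv))
```

The resulting list can also be passed directly to `subprocess.run`, which avoids the shell quoting step altogether.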

What's next

If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on Stack Overflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.
