Troubleshoot TPUs in GKE


This page shows you how to resolve issues related to TPUs in Google Kubernetes Engine (GKE).

If you need additional assistance, reach out to Cloud Customer Care.

Insufficient quota to satisfy the TPU request

An error similar to Insufficient quota to satisfy the request indicates your Google Cloud project has insufficient quota available to satisfy the request.

To resolve this issue, check your project's quota limit and current usage. If needed, request an increase to your TPU quota.

Check quota limit and current usage

To check the limit and current usage of your Compute Engine API quota for TPUs, follow these steps:

  1. Go to the Quotas page in the Google Cloud console.

  2. In the Filter box, do the following:

    1. Select the Service property, enter Compute Engine API, and press Enter.

    2. Select the Type property and choose Quota.

    3. Select the Dimensions (e.g. locations) property and enter region: followed by the name of the region in which you plan to create TPUs in GKE. For example, enter region:us-west4 if you plan to create TPU nodes in the zone us-west4-a. TPU quota is regional, so all zones within the same region consume the same TPU quota.

If no quotas match the filter you entered, then the project has not been granted any of the specified quota for the desired region, and you must request a TPU quota increase.
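
Alternatively, you can list the Compute Engine quota limits and current usage for a region with the gcloud CLI and look for TPU-related metrics in the output. The following is a sketch; us-west4 is an example region:

# List every Compute Engine quota metric, limit, and current usage
# for the region where you plan to create TPU nodes.
gcloud compute regions describe us-west4 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)"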

Error when enabling node auto-provisioning in a TPU node pool

The following error occurs when you enable node auto-provisioning in a GKE cluster that runs a version that doesn't support TPUs.

The error message is similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=400,
  message=Invalid resource: tpu-v4-podslice.

To resolve this issue, upgrade your GKE cluster to version 1.27.6 or later.
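
For example, you might upgrade the cluster control plane with the gcloud CLI, as in the following sketch. CLUSTER_NAME, VERSION (1.27.6 or later), and LOCATION are placeholders for your own values:

# Upgrade the control plane to a GKE version that supports TPUs
# with node auto-provisioning (1.27.6 or later).
gcloud container clusters upgrade CLUSTER_NAME \
    --master \
    --cluster-version=VERSION \
    --location=LOCATION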

GKE doesn't automatically provision TPU nodes

The following sections describe the cases where GKE doesn't automatically provision TPU nodes and how to fix them.

Limit misconfiguration

GKE doesn't automatically provision TPU nodes if the auto-provisioning limits you defined for a cluster are too low. You may observe the following errors in such scenarios:

  • If a TPU node pool exists, but GKE can't scale up the nodes due to violating resource limits, you can see the following error message when running the kubectl get events command:

    11s Normal NotTriggerScaleUp pod/tpu-workload-65b69f6c95-ccxwz pod didn't
    trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 max
    cluster cpu, memory limit reached
    

    Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:

    "Your cluster has one or more unschedulable Pods"
    
  • When GKE attempts to auto-provision a TPU node pool that exceeds resource limits, the cluster autoscaler visibility logs will display the following error message:

    messageId: "no.scale.up.nap.pod.zonal.resources.exceeded"
    

    Also, in this scenario, you can see warning messages similar to the following in the Google Cloud console:

    "Can't scale up because node auto-provisioning can't provision a node pool for
    the Pod if it would exceed resource limits"
    

To resolve these issues, increase the cluster-wide maximums for TPU chips, CPU cores, and memory.

To do so, follow these steps:

  1. Calculate the resource requirements for a given TPU machine type and count. Note that you also need to add resources for non-TPU node pools, such as those that run system workloads.
  2. Obtain a description of the available TPU, CPU, and memory for a specific machine type and zone. Use the gcloud CLI:

    gcloud compute machine-types describe MACHINE_TYPE \
        --zone COMPUTE_ZONE
    

    Replace the following:

    • MACHINE_TYPE: The machine type to describe.
    • COMPUTE_ZONE: The name of the compute zone.

    The output includes a description line similar to the following:

      description: 240 vCPUs, 407 GB RAM, 4 Google TPUs
    
  3. Calculate the total amount of CPU and memory by multiplying the per-node amounts by the required number of nodes. For example, the ct4p-hightpu-4t machine type provides 240 CPU cores and 407 GB RAM with 4 TPU chips. Assuming that you require 20 TPU chips, which corresponds to five nodes, you must define the following values:

    • --max-accelerator=type=tpu-v4-podslice,count=20
    • CPU = 1200 (240 x 5)
    • Memory = 2035 GB (407 x 5)

    You should define the limits with some margin to accommodate non-TPU nodes, such as those that run system workloads. For a full worked example of the update command, see the sketch after this procedure.

  4. Update the cluster limits:

    gcloud container clusters update CLUSTER_NAME \
        --enable-autoprovisioning \
        --max-accelerator=type=TPU_ACCELERATOR,count=MAXIMUM_ACCELERATOR \
        --max-cpu=MAXIMUM_CPU \
        --max-memory=MAXIMUM_MEMORY
    

    Replace the following:

    • CLUSTER_NAME: The name of the cluster.
    • TPU_ACCELERATOR: The TPU accelerator type (for example, tpu-v4-podslice).
    • MAXIMUM_ACCELERATOR: The maximum number of TPU chips in the cluster.
    • MAXIMUM_CPU: The maximum number of cores in the cluster.
    • MAXIMUM_MEMORY: The maximum number of gigabytes of memory in the cluster.
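
Continuing the ct4p-hightpu-4t example, the update command might look like the following sketch. It keeps node auto-provisioning enabled and raises the limits; the CPU and memory values add some margin above the calculated 1200 cores and 2035 GB to leave room for non-TPU nodes. Adjust all values to your own cluster:

# Illustrative values for a cluster that needs 20 TPU v4 chips
# (five ct4p-hightpu-4t nodes) plus headroom for non-TPU nodes.
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --max-accelerator=type=tpu-v4-podslice,count=20 \
    --max-cpu=1300 \
    --max-memory=2200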

Workload misconfiguration

This error occurs due to misconfiguration of the workload. The following are some of the most common causes of the error:

  • The cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels are incorrect or missing in the Pod spec. Without these labels, GKE doesn't provision TPU node pools, and node auto-provisioning can't scale up the cluster.
  • The Pod spec doesn't specify google.com/tpu in its resource requests.

To resolve this issue, do the following:

  1. Check that there are no unsupported labels in your workload's node selector. For example, a node selector for the cloud.google.com/gke-nodepool label prevents GKE from creating additional node pools for your Pods.
  2. Ensure that the Pod template specification for your TPU workload includes the following values, as shown in the example manifest after this list:
    • cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology labels in its nodeSelector.
    • google.com/tpu in its resource requests.
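
For reference, a minimal Pod manifest with these settings might look like the following sketch. The accelerator type, topology, chip count, and image are illustrative values for a single-host TPU v4 slice; replace them with the values that match your node pool:

apiVersion: v1
kind: Pod
metadata:
  name: tpu-workload-example
spec:
  nodeSelector:
    # Match the TPU accelerator type and topology of your node pool.
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x1
  containers:
  - name: tpu-container
    image: python:3.10
    command: ["bash", "-c", "sleep 600"]
    resources:
      requests:
        # Request the TPU chips that the workload needs.
        google.com/tpu: 4
      limits:
        google.com/tpu: 4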

To learn how to deploy TPU workloads in GKE, see Run a workload that displays the number of available TPU chips in a TPU node pool.

Scheduling errors when deploying Pods that consume TPUs in GKE

The following issue occurs when GKE can't schedule Pods requesting TPUs on TPU nodes. For example, this might occur if some non-TPU Pods were already scheduled on TPU nodes.

The error message, emitted as a FailedScheduling event on the Pod, is similar to the following:

Cannot schedule pods: Preemption is not helpful for scheduling.

Error message: 0/2 nodes are available: 2 node(s) had untolerated taint
{google.com/tpu: present}. preemption: 0/2 nodes are available: 2 Preemption is
not helpful for scheduling

To resolve this issue, do the following:

Check that you have at least one CPU node pool in your cluster so that system-critical Pods can run on the non-TPU nodes. To learn more, see Deploy a Pod to a specific node pool.
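
For example, you can list the node pools in your cluster and check their machine types to confirm that at least one non-TPU (CPU) node pool exists. The following is a sketch; CLUSTER_NAME and LOCATION are placeholders for your own values:

# List the node pools in the cluster, including their machine types.
gcloud container node-pools list \
    --cluster=CLUSTER_NAME \
    --location=LOCATION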

TPU initialization failed

The following issue occurs when new TPU workloads can't start because they lack permission to access the TPU devices.

The error message is similar to the following:

TPU platform initialization failed: FAILED_PRECONDITION: Couldn't mmap: Resource
temporarily unavailable.; Unable to create Node RegisterInterface for node 0,
config: device_path: "/dev/accel0" mode: KERNEL debug_data_directory: ""
dump_anomalies_only: true crash_in_debug_dump: false allow_core_dump: true;
could not create driver instance

To resolve this issue, either run your TPU container in privileged mode or increase the ulimit inside your container.
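
For example, in the following sketch the TPU container runs in privileged mode so that it can access the TPU devices. The name, image, accelerator type, topology, and chip count are illustrative placeholders; alternatively, keep the container unprivileged and raise the ulimit inside it.

apiVersion: v1
kind: Pod
metadata:
  name: tpu-privileged-example
spec:
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x1
  containers:
  - name: tpu-container
    image: python:3.10
    command: ["bash", "-c", "sleep 600"]
    securityContext:
      # Privileged mode gives the container access to the TPU devices.
      privileged: true
    resources:
      requests:
        google.com/tpu: 4
      limits:
        google.com/tpu: 4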

Scheduling deadlock

Scheduling of two or more Jobs might fail in a deadlock. For example, consider a scenario where all of the following occurs:

  • You have two Jobs (Job A and Job B) with Pod affinity rules. GKE schedules the TPU slices for both Jobs with a TPU topology of v4-32.
  • You have two v4-32 TPU slices in the cluster.
  • Your cluster has ample capacity to schedule both Jobs and, in theory, each Job can be quickly scheduled on each TPU slice.
  • The Kubernetes scheduler schedules one Pod from Job A on one slice, and then schedules one Pod from Job B on the same slice.

In this case, given the Pod affinity rules, the scheduler attempts to schedule all remaining Pods for Job A and for Job B on a single TPU slice each. As a result, GKE can't fully schedule either Job A or Job B, so the status of both Jobs remains Pending.

To resolve this issue, use Pod anti-affinity with cloud.google.com/gke-nodepool as the topologyKey, as shown in the following example:

apiVersion: batch/v1
kind: Job
metadata:
 name: pi
spec:
 parallelism: 2
 template:
   metadata:
     labels:
       job: pi
   spec:
     affinity:
       podAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: In
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: job
               operator: NotIn
               values:
               - pi
           topologyKey: cloud.google.com/gke-nodepool
           namespaceSelector:
             matchExpressions:
             - key: kubernetes.io/metadata.name
               operator: NotIn
               values:
               - kube-system
     containers:
     - name: pi
       image: perl:5.34.0
       command: ["sleep",  "60"]
     restartPolicy: Never
 backoffLimit: 4

Permission denied during cluster creation in us-central2

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you may encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=403,
message=Permission denied on 'locations/us-central2' (or it may not exist).

This error occurs because us-central2 is a private region.

To resolve this issue, file a support case or reach out to your account team to ask for us-central2 to be made visible within your Google Cloud project.

Insufficient quota during TPU node pool creation in us-central2

If you are attempting to create a TPU node pool in us-central2 (the only region where TPU v4 is available), then you may need to increase the following GKE-related quotas when you first create TPU v4 node pools:

  • Persistent Disk SSD (GB) quota in us-central2: The boot disk of each Kubernetes node requires 100 GB by default. Therefore, this quota should be set at least as high as the product of the maximum number of GKE nodes you anticipate creating in us-central2 and 100 GB (maximum_nodes X 100 GB).
  • In-use IP addresses quota in us-central2: Each Kubernetes node consumes one IP address. Therefore, this quota should be set at least as high as the maximum number of GKE nodes you anticipate creating in us-central2.
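
For example, if you anticipate creating at most 10 GKE nodes in us-central2, set the Persistent Disk SSD quota to at least 1,000 GB (10 nodes x 100 GB) and the in-use IP addresses quota to at least 10.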

Missing subnet during GKE cluster creation

If you are attempting to create a cluster in us-central2 (the only region where TPU v4 is available), then you may encounter an error message similar to the following:

ERROR: (gcloud.container.clusters.create) ResponseError: code=404,
message=Not found: project <PROJECT> does not have an auto-mode subnetwork
for network "default" in region <REGION>.

A subnet is required in your VPC network to provide connectivity to your GKE nodes. However, in certain regions such as us-central2, a default subnet might not be created, even when you use the default VPC network in auto mode for subnet creation.

To resolve this issue, ensure that you have created a custom subnet in the region before creating your GKE cluster. This subnet must not overlap with other subnets created in other regions in the same VPC network.
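
For example, the following sketch creates a custom subnet in us-central2 on the default network before cluster creation. SUBNET_NAME and IP_RANGE are placeholders, and the range must not overlap with existing subnets in the VPC network:

# Create a custom subnet for GKE nodes in us-central2.
gcloud compute networks subnets create SUBNET_NAME \
    --network=default \
    --region=us-central2 \
    --range=IP_RANGE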

What's next

If you need additional assistance, reach out to Cloud Customer Care.