Manage AI-optimized GKE clusters

This page shows you how to manage AI-optimized Google Kubernetes Engine (GKE) clusters that use Cluster Director for GKE, including the following common events relevant to GKE clusters and AI workloads:

  • Host maintenance
  • Cluster upgrades
  • Faulty host reporting

Manage host maintenance for AI workloads

GKE nodes run on virtual machines (VMs) that periodically experience host events, which can disrupt AI workloads. Because host events occur on the underlying Google Cloud infrastructure, they bypass GKE maintenance windows and exclusions. Most Compute Engine VMs have their host maintenance policy set to live migrate, which minimizes disruption to workloads, but GPUs and TPUs don't support live migration. When a host event affects a GKE node running AI workloads, GKE has to terminate the node and the Pods running on it. If the Pods are deployed as part of a larger workload, such as a Job or Deployment, GKE attempts to restart the Pods on the affected node.

To learn more about managing host maintenance of the underlying VMs, see Manage GKE node disruption for GPUs and TPUs.

Monitor host maintenance events

For clusters running GKE version 1.31.1-gke.2008000 or later, you can view the scheduled start time of an upcoming host maintenance event. For all GPU and TPU nodes, GKE surfaces the start time as Kubernetes node labels on the corresponding GKE node.

For details, see Monitor maintenance notifications.
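
For example, you can list your nodes together with this label by using kubectl. The label key is the same one used in the affinity examples later in this section, and nodes that show no value for it have no scheduled maintenance event:

  kubectl get nodes \
      -L cloud.google.com/scheduled-maintenance-time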

With these node labels, you can manually start a host maintenance event, or use the maintenance information when scheduling your workloads, as described in the following sections.

Manually start a host maintenance event

After Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your schedule. For example, you can choose to perform maintenance during periods of reduced activity.

If you don't manually start a host maintenance event, then Compute Engine will automatically complete regularly scheduled maintenance.

Follow the instructions in Manually start a host maintenance event. Then, continue reading this section to learn how to use host maintenance information when scheduling your workloads.
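
For reference only, and not as a substitute for the linked instructions, starting maintenance early typically means asking Compute Engine to perform maintenance on the VM that backs the node. A minimal sketch with gcloud, where the VM name and zone are placeholders:

  # Start the scheduled host maintenance now instead of waiting for the
  # automatically scheduled time.
  gcloud compute instances perform-maintenance VM_NAME \
      --zone=ZONE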

Use host maintenance information while scheduling your workloads

You can use the maintenance information surfaced through GKE node labels along with node affinity and anti-affinity to minimize disruption to your workloads.

See the following sections for examples of how to use this information.

Schedule Pods to nodes that have no future scheduled maintenance events

You can instruct GKE to schedule Pods only to nodes that have no future scheduled maintenance events, as shown in the following snippet:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: DoesNotExist
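
For context, the preceding snippet belongs under a Pod's spec. The following is a minimal sketch of a complete Pod manifest that uses it; the Pod name, container name, and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: training-pod              # placeholder Pod name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: DoesNotExist
  containers:
  - name: trainer                 # placeholder container name
    image: TRAINING_IMAGE         # replace with your container image
    resources:
      limits:
        nvidia.com/gpu: 1         # request a GPU so the Pod lands on a GPU node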

Schedule Pods to nodes that have maintenance scheduled after a certain date

You can instruct GKE to schedule Pods only to nodes that have maintenance scheduled after a certain date by providing the time as a Unix epoch timestamp (in seconds):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: Gt
            values:
            - "1733296000"  # node label values are strings, so quote the epoch timestamp
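
For example, assuming GNU date is available, you can compute an epoch value for a cutoff date as follows (the date shown is illustrative):

  # Convert a UTC date to Unix epoch seconds for use as the label value.
  date -u -d "2024-12-04 00:00:00" +%s
  # Prints: 1733270400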

Manage GKE cluster upgrades for AI workloads

AI workloads are sensitive to disruption.

During the lifecycle of a GKE cluster, AI workloads must be prepared for disruption to both the underlying Compute Engine VMs and the GKE cluster itself.

We recommend that you keep your cluster enrolled in a release channel. GKE clusters, by default, are enrolled in the Regular release channel. To learn more about the benefits of release channels, see the Comparison between clusters enrolled and not enrolled in a release channel.

With release channels, you get access to more features, including additional maintenance exclusion scopes. We recommend the "no minor or node upgrades" scope for AI workloads.
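
For example, you can add a maintenance exclusion with that scope by using gcloud. This is a sketch only; the cluster name, location, exclusion name, and time window are placeholders, and the window you choose must fit the limits that apply to the scope:

  gcloud container clusters update CLUSTER_NAME \
      --location=LOCATION \
      --add-maintenance-exclusion-name=protect-ai-training-run \
      --add-maintenance-exclusion-start=2025-01-01T00:00:00Z \
      --add-maintenance-exclusion-end=2025-02-01T00:00:00Z \
      --add-maintenance-exclusion-scope=no_minor_or_node_upgrades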

Report faulty hosts through GKE

This section outlines how, through GKE, you can report a faulty host that has VMs provisioned by using the reservation-bound provisioning model. If you want to report a faulty host for a node that was provisioned by using the flex-start provisioning model (Preview), then contact your account team instead.

A host is a single physical server machine in the data center that runs the VM hosting your GKE node. You can report faulty hosts by applying a fault-behavior node label to the affected GKE node. After you apply the node label to a particular GKE node, GKE performs the following steps:

  1. Gracefully evicts workloads from the node.
  2. Prevents new Pods from being scheduled on the node.
  3. Calls the Compute Engine API to mark the VM's host as faulty.
  4. Waits for the VM to be brought back up on a healthy host machine. For reservations that use the all capacity reservation operational mode, Compute Engine brings the VM back up on the same host after the repair operation completes.
  5. Removes the taint and the fault-behavior label from the node.

After these steps complete, the node is ready to serve workloads again.

Requirements

To report a faulty host, your GKE node must meet the following requirements:

  • You must be running GKE patch version 1.32.3-gke.1057001 or later.
  • You must be running one of the following GPU machine types: A4X, A4, or A3 Ultra.
  • You must be running your GKE nodes on a VM instance that is reservation-bound.
  • The VM that backs your GKE node must be in the RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine isn't marked as faulty.
  • You might be rate-limited on the number of calls to this API per reservation per month, based on an evaluation of the health of your blocks. Rate limits don't apply if your reservation uses the all capacity reservation operational mode.
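
To check some of these requirements on existing nodes, you can list the GKE node version (the VERSION column) and the machine type, which is exposed through the standard node.kubernetes.io/instance-type node label:

  kubectl get nodes \
      -L node.kubernetes.io/instance-type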

Report a faulty host

To report a faulty host:

  1. Use the GKE observability tools, your own monitoring tools, or logs to identify the GKE nodes that are experiencing performance issues. Save the NODE_NAME.
  2. Report the node as faulty:

      kubectl label nodes NODE_NAME cloud.google.com/fault-behavior=FAULT_REASON
    

    Replace the following:

    • NODE_NAME: the name of the faulty node.
    • FAULT_REASON: the appropriate fault reason using one or more of the following values:
      • PERFORMANCE: use this value if GPUs on a VM are performing more slowly than other GPUs in the cluster, you don't see any XID errors in the logs, and you don't detect any of the other usual failure patterns, such as silent data corruption.
      • SDC: use this value for silent data corruption, if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
      • XID: use this value if you identified an unrecoverable GPU error with an XID for a VM.
      • unspecified: use this value if you are not sure what behavior is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
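
For example, if you confirmed an unrecoverable GPU XID error on a node, the command looks like the following, where the node name is a placeholder:

  kubectl label nodes gke-a3-ultra-pool-1a2b cloud.google.com/fault-behavior=XID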

After you report a faulty host for a node, the time when the node restarts varies based on the reservation operational mode that is specified in the reservation that the node uses. The two available modes compare as follows:

  • All capacity mode (ALL_CAPACITY):
    • Supported machine types: A4X
    • Faulty host report API rate limiting: no rate limits apply.
    • Faulty host report process: see the all capacity mode steps that follow.
  • Managed mode (HIGHLY_AVAILABLE_CAPACITY):
    • Supported machine types: A4 and A3 Ultra
    • Faulty host report API rate limiting: calls to the API may be rate-limited.
    • Faulty host report process: see the managed mode steps that follow.

To verify the reservation operational mode for a reservation, view the reservationOperationalMode field in the reservation.
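
To do so, you can describe the reservation with gcloud and look for the reservationOperationalMode field in the output. This is a sketch that assumes a zonal reservation; the reservation name and zone are placeholders:

  gcloud compute reservations describe RESERVATION_NAME \
      --zone=ZONE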

When you report a faulty host for a node that runs in the all capacity mode, the following occurs:

  1. Evict Pods: After the label is applied to the faulty node, GKE taints the node to block scheduling new Pods. GKE also starts to gracefully evict the running Pods on the node. GKE respects the Pod Disruption Budgets (PDBs) and the spec.terminationGracePeriodSeconds field of your Pod manifests. For more details, see Configure GKE to terminate your workloads gracefully.
  2. Report and repair the faulty host: GKE automatically reports the faulty host by calling the Compute Engine API, which starts a sequence of operations. Reporting the faulty host usually takes 10-12 minutes, and repairing the host can then take 3-14 days, or sometimes longer.
  3. Restart the VM: After the host repair operation completes (usually 3-14 days), one of the following occurs:

    • If the VM is in the REPAIRING state and the resources are available when the repair completes, then Compute Engine automatically restarts the VM on the repaired host.
    • Otherwise, if the VM is in the TERMINATED state or if resources aren't available when the repair completes, then the VM state stays in or changes to TERMINATED. You must manually restart the VM when you want it to run. However, the restart might fail if resources aren't available at that time; for example, if other VMs are already using the repaired host.

When you report a faulty host for a node that runs in the managed mode, the following occurs:

  1. Evict Pods: After the label is applied to the faulty node, GKE taints the node to block scheduling new Pods. GKE also starts to gracefully evict the running Pods on the node. GKE respects the Pod Disruption Budgets (PDBs) and the spec.terminationGracePeriodSeconds field of your Pod manifests. For more details, see Configure GKE to terminate your workloads gracefully.
  2. Report and start repairing the faulty host: GKE automatically reports the faulty host by calling the Compute Engine API, which starts a sequence of operations. Reporting the faulty host usually takes 10-12 minutes, and repairing the host can then take 3-14 days, or sometimes longer.
  3. Migrate and restart the VM: After the host repair operation starts (usually after 10-12 minutes), Compute Engine attempts to reserve another host to replace the reported faulty host in your reserved capacity. If Compute Engine successfully replaces the faulty host, or otherwise finds a matching healthy host in your reserved capacity, then Compute Engine migrates the VM to that healthy host. The VM is then restarted in one of the following ways:

    • If the VM is in the REPAIRING state and resources are available before or when the repair completes, then Compute Engine automatically restarts the VM on a healthy host.
    • Otherwise, if the VM is in the TERMINATED state or if resources aren't available before or when the repair completes, then the VM state stays in or changes to TERMINATED. You must manually restart the VM when you want it to run. However, the restart might fail if resources aren't available at that time; for example, if other VMs are already using the repaired host.
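
In either mode, if the VM ends up TERMINATED, you can check its state and restart it manually with gcloud. This is a sketch; the VM name and zone are placeholders:

  # Check whether the VM is RUNNING, REPAIRING, or TERMINATED.
  gcloud compute instances describe VM_NAME \
      --zone=ZONE \
      --format="value(status)"

  # Restart a TERMINATED VM. The restart can fail if resources aren't
  # available, for example if other VMs already use the repaired host.
  gcloud compute instances start VM_NAME \
      --zone=ZONE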

Monitor the operation progress

You can monitor the progress of GKE's operation using the cloud.google.com/report-and-replace-status node label on your GKE node, which has one of the following values:

  • PodsEvicted: GKE has finished evicting Pods from the affected node.
  • OperationRUNNING: the operation to report the faulty host is running.
  • OperationDone: the underlying host has been reported as faulty, and the GKE node is ready to be moved to a new host.
  • Error: the API call failed, for example because one of the requirements described in the Requirements section isn't met.

You can also check the cloud.google.com/report-and-replace-operation node label to get the Compute Engine operation ID, which you can use to monitor the status of the operation.

You can view both these node labels using the following command:

  kubectl get nodes NODE_NAME \
      -L cloud.google.com/report-and-replace-status,cloud.google.com/report-and-replace-operation
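
After you have the operation ID from the cloud.google.com/report-and-replace-operation label, you can check the Compute Engine operation directly. This is a sketch that assumes a zonal operation; the operation ID and zone are placeholders:

  gcloud compute operations describe OPERATION_ID \
      --zone=ZONE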

In case of any API errors, GKE sets the node label cloud.google.com/report-and-replace-status=Error. GKE then clears the node taints and removes the cloud.google.com/fault-behavior node label.