This page shows you how to manage Hypercompute Clusters with GKE (Preview), including the following common events relevant to GKE clusters and AI workloads:
- Host maintenance
- Cluster upgrades
- Faulty host reporting
Manage host maintenance for AI workloads
GKE nodes run on virtual machines (VMs) that periodically experience host events that can be disruptive to AI workloads. Because host events occur on the underlying Google Cloud infrastructure, they bypass GKE maintenance windows and exclusions. While most Compute Engine VMs have their host maintenance policy set to live migrate, which minimizes workload disruption, VMs with GPUs and TPUs don't support live migration. When a host event affects a GKE node running AI workloads, GKE has to terminate the node and the Pods running on it. If the Pods are deployed as part of a larger workload, such as a Job or Deployment, GKE attempts to restart the Pods on the affected node.
To learn more about managing host maintenance of the underlying VMs, see Manage GKE node disruption for GPUs and TPUs.
Monitor host maintenance events
For clusters running GKE version 1.31.1-gke.2008000 or later, you can view the scheduled start time of a host maintenance event through Kubernetes node labels that GKE sets on the corresponding GKE node for all GPUs and TPUs.
For details, see Monitor maintenance notifications.
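For example, you can list the nodes in your cluster along with this label by using a command like the following (a minimal sketch; it uses the same label key that appears in the scheduling examples later on this page):
kubectl get nodes -L cloud.google.com/scheduled-maintenance-time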
With these node labels, you can do the following:
- Manually start a host maintenance event
- Use host maintenance event information while scheduling your workloads
Manually start a host maintenance event
After Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your schedule. For example, you can choose to perform maintenance during periods of reduced activity.
If you don't manually start a host maintenance event, then Compute Engine will automatically complete regularly scheduled maintenance.
Follow the instructions to Manually start a host maintenance event. Also, continue reading this section to learn the following:
- Configure GKE to terminate your workloads gracefully
- Process of graceful termination
- Monitor the progress of an active graceful termination
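For reference, manually starting the maintenance event is a Compute Engine operation on the VM that backs the affected node. A minimal sketch with the gcloud CLI, where the instance name and zone are placeholders:
gcloud compute instances perform-maintenance VM_NAME --zone=ZONE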
Use host maintenance information while scheduling your workloads
You can use the maintenance information surfaced through GKE node labels along with node affinity and anti-affinity to minimize disruption to your workloads.
See the following sections for examples of how to use this information.
Schedule Pods to nodes that have no future scheduled maintenance events
You can instruct GKE to only schedule Pods to nodes that have no future scheduled maintenance events, such as with the following snippet:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: DoesNotExist
Schedule Pods to nodes that have maintenance scheduled after a certain date
You can instruct GKE to only schedule Pods to nodes that have maintenance scheduled after a certain date by providing the Unix epoch time:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: Gt
            values:
            - "1733296000"
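The value in this example, 1733296000, corresponds to roughly December 4, 2024 (UTC). To generate the Unix epoch timestamp for your own cutoff date, you can use a command like the following (GNU date, available on most Linux systems):
date -u -d "2024-12-04T00:00:00Z" +%s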
Manage GKE cluster upgrades for AI workloads
AI workloads are sensitive to disruption.
During the lifecycle of a GKE cluster, AI workloads must be prepared for disruption both to the underlying Compute Engine VMs and to the GKE cluster itself:
- Host maintenance: To manage host maintenance of the underlying VMs, see Manage GKE node disruption for GPUs and TPUs. This is also described in the previous sections.
- Cluster upgrades: To manage disruption from cluster upgrades, you can use the following tools:
  - Maintenance windows: Schedule when GKE can perform cluster upgrades and other types of cluster operations.
  - Maintenance exclusions: Prevent cluster upgrades and other types of cluster operations during a specific time period.
We recommend that you keep your cluster enrolled in a release channel. GKE clusters, by default, are enrolled in the Regular release channel. To learn more about the benefits of release channels, see the Comparison between clusters enrolled and not enrolled in a release channel.
With release channels, you get access to more features, including additional maintenance exclusion scopes. We recommend the "no minor or node upgrades" scope for AI workloads.
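For example, you can add a maintenance exclusion with this scope to an existing cluster by using a command like the following (a minimal sketch; the cluster name, location, and time window are placeholders):
gcloud container clusters update CLUSTER_NAME \
    --location=LOCATION \
    --add-maintenance-exclusion-name=ai-training-run \
    --add-maintenance-exclusion-start=2025-01-01T00:00:00Z \
    --add-maintenance-exclusion-end=2025-01-14T00:00:00Z \
    --add-maintenance-exclusion-scope=no_minor_or_node_upgrades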
Report faulty hosts through GKE
This section outlines how you can use GKE to report a faulty A3 accelerator-optimized host that is running your AI/ML or HPC workloads. A host is a single physical server machine in the data center that runs the VM hosting your GKE node. You can report faulty hosts by applying a fault-behavior node label to the affected GKE node. After you apply the node label to a particular GKE node, GKE does the following steps:
- Gracefully evicts workloads from the node.
- Prevents new Pods from being scheduled on the node.
- Calls the API on the VM instance to mark the host as faulty.
- Waits for the VM to be brought back up on a healthy host machine.
- Removes the taint and the fault-behavior label from the node.
After this, the node will be ready to serve workloads again.
Requirements
To report a faulty host, your GKE node must meet the following requirements:
- You must be running GPUs in the A3 machine series, which includes A3 Ultra, A3 Mega, and A3 High.
- You must be running GKE patch version 1.31.2-gke.1384000 or higher.
- You must be running your GKE nodes on a VM instance that is part of a reserved block of capacity.
- Your GKE node must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
- You might be rate-limited on the number of calls to this API per reservation per month, based on an evaluation of the health of your blocks.
- You must be allow-listed to use the faulty host API. To request access, complete the API enablement request form.
Report a faulty host
To report a faulty host:
- Use the GKE observability tools, your own monitoring tools, or logs to identify the GKE nodes that are experiencing performance issues. Save the NODE_NAME.
- Report the node as faulty:
kubectl label nodes NODE_NAME cloud.google.com/fault-behavior=FAULT_REASON
Replace the following:
- NODE_NAME: the name of the faulty node.
- FAULT_REASON: the appropriate fault reason, using one of the following values:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster, you don't see any XID errors in the logs, and none of the other usual failure patterns, such as silent data corruption, are detected.
  - SDC: use this value for silent data corruption, where you see data corruption but no system crash. This corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term refers to hardware-induced defects.
  - XID: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - unspecified: use this value if you aren't sure what behavior is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
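For example, to report a node on which you identified an unrecoverable XID error (the node name here is hypothetical):
kubectl label nodes gke-a3-pool-1-abcd1234 cloud.google.com/fault-behavior=XID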
After you report a faulty host, GKE does the following:
- After the label is applied to the faulty node, GKE taints the node to block scheduling new Pods. GKE also starts to gracefully evict the running Pods on the node. GKE respects the Pod Disruption Budgets (PDBs) and the spec.terminationGracePeriodSeconds field of your Pod manifests. For more details, see Configure GKE to terminate your workloads gracefully.
- GKE then automatically reports the faulty host by calling the Compute Engine API, which results in a sequence of operations that takes around 10 to 12 minutes. Throughout this time, the VM is in the RUNNING state. Based on the VM's maintenance configuration for automaticRestart, Compute Engine either attempts to restart the VM on another host machine or leaves the VM in the TERMINATED state.
- After the operation is complete, GKE removes the fault-behavior node label from the node. The node is ready to serve workloads again.
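Whether the VM is restarted on another host machine or left in the TERMINATED state depends on the VM's automaticRestart setting. If you want to confirm that setting for the VM backing your node, you can describe the instance, for example (instance name and zone are placeholders):
gcloud compute instances describe VM_NAME --zone=ZONE --format="value(scheduling.automaticRestart)"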
Monitor the operation progress
You can monitor the progress of GKE's operation using the cloud.google.com/report-and-replace-status node label on your GKE node, which has one of the following values:
- PodsEvicted: GKE has finished evicting Pods from the affected node.
- OperationRUNNING: the operation to report the faulty host is running.
- OperationDone: the underlying host has been reported as faulty and the GKE node is ready to be moved to a new host.
- Error: the API call failed, for example because one of the requirements described in the previous section isn't met.
You can also check the cloud.google.com/report-and-replace-operation node label to get the Compute Engine operation ID, which you can use to monitor the status of the operation.
You can view both of these node labels using the following command:
kubectl get nodes NODE_NAME \
-L cloud.google.com/report-and-replace-status,cloud.google.com/report-and-replace-operation
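If you want to follow the labels as the operation progresses, you can add the watch flag to the same command, for example:
kubectl get nodes NODE_NAME \
-L cloud.google.com/report-and-replace-status,cloud.google.com/report-and-replace-operation \
--watch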
In case of any API errors, GKE sets the node label cloud.google.com/report-and-replace-status=Error. GKE then clears the node taints and removes the cloud.google.com/fault-behavior node label.