This page shows you how to manage Hypercompute Clusters with GKE, including the following common events relevant to GKE clusters and AI workloads:
- Host maintenance
- Cluster upgrades
- Faulty host reporting
Manage host maintenance for AI workloads
GKE nodes run on virtual machines (VMs) that periodically experience host events that can be disruptive to AI workloads. Since host events occur on the underlying Google Cloud infrastructure, they bypass GKE maintenance windows and exclusions. While most Compute Engine VMs have their host maintenance policy set to live migrate, which minimizes the disruption of workloads, GPUs and TPUs don't support live migration. When these host events affect your GKE nodes running AI workloads, GKE has to terminate the node and the Pods running on the node. If the Pods are deployed as part of a larger workload like a Job or Deployment, GKE attempts to restart the Pods on the affected node.
To learn more about managing host maintenance of the underlying VMs, see Manage GKE node disruption for GPUs and TPUs.
Monitor host maintenance events
For clusters running GKE version 1.31.1-gke.2008000 or later, you can view the scheduled start time of an upcoming host maintenance event. For all GPU and TPU nodes, the start time is exposed as a Kubernetes node label on the corresponding GKE node.
For details, see Monitor maintenance notifications.
With these node labels, you can do the following:
- Manually start a host maintenance event
- Use host maintenance event information while scheduling your workloads
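As a minimal sketch of reading this information programmatically, the following Python helper extracts the scheduled maintenance time of each node from the output of `kubectl get nodes -o json`. It assumes the label key is `cloud.google.com/scheduled-maintenance-time` (the key used in the scheduling examples on this page) and that its value is an integer Unix epoch time. The function name is hypothetical.

```python
import json

# Label key taken from the node affinity examples on this page.
MAINTENANCE_LABEL = "cloud.google.com/scheduled-maintenance-time"


def scheduled_maintenance(node_list_json: str) -> dict[str, int]:
    """Map node name -> scheduled maintenance start (Unix epoch seconds).

    Expects the JSON printed by `kubectl get nodes -o json`.
    Nodes without the label are omitted from the result.
    Assumption: the label value is an integer epoch timestamp.
    """
    result = {}
    for item in json.loads(node_list_json).get("items", []):
        metadata = item.get("metadata", {})
        value = metadata.get("labels", {}).get(MAINTENANCE_LABEL)
        if value is not None:
            result[metadata["name"]] = int(value)
    return result
```

You might feed it the captured output of `kubectl get nodes -o json` and sort the result by timestamp to see which nodes are affected first.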
Manually start a host maintenance event
After Compute Engine issues a notification about a scheduled maintenance event, you can manually start maintenance at a time that aligns with your schedule. For example, you can choose to perform maintenance during periods of reduced activity.
If you don't manually start a host maintenance event, then Compute Engine will automatically complete regularly scheduled maintenance.
Follow the instructions to Manually start a host maintenance event. Also, continue reading this section to learn the following:
- Configure GKE to terminate your workloads gracefully
- Process of graceful termination
- Monitor the progress of an active graceful termination
Use host maintenance information while scheduling your workloads
You can use the maintenance information surfaced through GKE node labels along with node affinity and anti-affinity to minimize disruption to your workloads.
See the following sections for examples of how to use this information.
Schedule Pods to nodes that have no future scheduled maintenance events
You can instruct GKE to only schedule Pods to nodes that have no future scheduled maintenance events, such as with the following snippet:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: DoesNotExist
Schedule Pods to nodes that have maintenance scheduled after a certain date
You can instruct GKE to only schedule Pods to nodes that have maintenance scheduled after a certain date by providing the Unix epoch time:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/scheduled-maintenance-time
            operator: Gt
            values:
            # Unix epoch seconds, quoted because selector values are strings.
            - "1733296000"
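To produce the epoch value for a specific date, a small conversion helper can be useful. The following sketch assumes the input timestamp is expressed in UTC. The function name is hypothetical, and the result is returned as a string because node affinity `values` entries are strings.

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_utc: str) -> str:
    """Convert a timestamp such as '2024-12-04T07:06:40' to epoch seconds.

    Assumption: the input is a naive timestamp that represents UTC.
    """
    moment = datetime.fromisoformat(iso_utc).replace(tzinfo=timezone.utc)
    return str(int(moment.timestamp()))


# The epoch value from the manifest above corresponds to this UTC time.
print(to_epoch_seconds("2024-12-04T07:06:40"))  # prints 1733296000
```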
Manage GKE cluster upgrades for AI workloads
AI workloads are sensitive to disruption.
During the lifecycle of a GKE cluster, AI workloads must be prepared for disruption to both the underlying Compute Engine VMs and the GKE cluster itself:
- Host maintenance: To manage host maintenance of the underlying VMs, see Manage GKE node disruption for GPUs and TPUs. This is also described in the previous sections.
- Cluster upgrades: To manage disruption from cluster upgrades, you can use the following tools:
  - Maintenance windows: Schedule when GKE can perform cluster upgrades and other types of cluster operations.
  - Maintenance exclusions: Prevent cluster upgrades and other types of cluster operations during a specific time period.
We recommend that you keep your cluster enrolled in a release channel. GKE clusters, by default, are enrolled in the Regular release channel. To learn more about the benefits of release channels, see the Comparison between clusters enrolled and not enrolled in a release channel.
With release channels, you get access to more features, including additional maintenance exclusion scopes. We recommend the "no minor or node upgrades" scope for AI workloads.
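As an illustration of applying the recommended scope, the following sketch builds a `gcloud container clusters update` command that adds a maintenance exclusion. The flag names and the `no_minor_or_node_upgrades` value reflect the gcloud CLI as we understand it, so verify them against your installed gcloud version. The function name and all argument values are hypothetical, and you might also need to pass a location flag for your cluster.

```python
def exclusion_command(cluster: str, name: str, start: str, end: str) -> list[str]:
    """Build a gcloud command that adds a maintenance exclusion.

    Assumption: `start` and `end` are RFC 3339 timestamps, for example
    '2025-01-01T00:00:00Z'. The scope is fixed to the one this page
    recommends for AI workloads.
    """
    return [
        "gcloud", "container", "clusters", "update", cluster,
        f"--add-maintenance-exclusion-name={name}",
        f"--add-maintenance-exclusion-start={start}",
        f"--add-maintenance-exclusion-end={end}",
        "--add-maintenance-exclusion-scope=no_minor_or_node_upgrades",
    ]
```

You could pass the returned list to `subprocess.run(..., check=True)`, or join it with spaces to review the command before running it by hand.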
Report faulty hosts through GKE
This section outlines how, through GKE, you can report a faulty A3 accelerator-optimized host that is running your AI/ML or HPC workloads. A host is a single physical server machine in the data center running a VM which hosts your GKE node. You can report faulty hosts by applying a fault-behavior node label to the affected GKE node. After you apply the node label to a particular GKE node, GKE does the following steps:
- Gracefully evicts workloads from the node.
- Prevents new Pods from being scheduled on the node.
- Calls the API on the VM instance to mark the host as faulty.
- Waits for the VM to be brought back up on a healthy host machine.
- Removes the taint and the fault-behavior label from the node.
After this, the node will be ready to serve workloads again.
Requirements
To report a faulty host, your GKE node must meet the following requirements:
- You must be running GPUs in the A3 machine series, which includes A3 Ultra, A3 Mega, and A3 High.
- You must be running GKE patch version 1.31.2-gke.1384000 or higher.
- You must be running your GKE nodes on a VM instance that is part of a reserved block of capacity.
- Your GKE node must be in a RUNNING state. If you try to report a faulty host after deleting the VM, an error message is returned, and the host machine won't be marked as faulty.
- You might be rate-limited on the number of calls to this API per reservation per month based on an evaluation of the health of your blocks.
- You must be allow-listed to use the faulty host API. To request access, complete the API enablement request form.
Report a faulty host
To report a faulty host:
- Use the GKE observability tools, your own monitoring tools, or logs to identify the GKE nodes that are experiencing performance issues. Save the NODE_NAME.
- Report the node as faulty:
kubectl label nodes NODE_NAME cloud.google.com/fault-behavior=FAULT_REASON
Replace the following:
- NODE_NAME: the name of the faulty node.
- FAULT_REASON: the appropriate fault reason using one or more of the following values:
  - PERFORMANCE: use this value if GPUs on a VM are performing slower than other GPUs in the cluster and you don't see any XID errors in the logs, and none of the other usual failure patterns such as silent data corruption are detected.
  - SDC: use this value for silent data corruption, if you see data corruption but no system crash. This data corruption can be caused by CPU defects, software bugs such as use-after-free or memory stomping, kernel issues, or other defects. Most often, this term is used to refer to hardware-induced defects.
  - XID: use this value if you identified an unrecoverable GPU error with an XID for a VM.
  - unspecified: use this value if you are not sure what behavior is causing the issue with your VM. This is the default value. However, we recommend specifying one of the other values, if applicable.
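If you script fault reporting, it can help to validate the fault reason before calling kubectl. The following sketch builds the `kubectl label` command shown above and rejects values that are not in the list. The function name is hypothetical, and the sketch handles a single fault reason only, because this page does not specify how multiple reasons are combined.

```python
# Fault reasons listed on this page.
FAULT_REASONS = ("PERFORMANCE", "SDC", "XID", "unspecified")
FAULT_LABEL = "cloud.google.com/fault-behavior"


def fault_label_command(node_name: str, reason: str = "unspecified") -> list[str]:
    """Build the kubectl command that reports a node's host as faulty.

    Raises ValueError for an empty node name or an unknown fault reason.
    """
    if not node_name:
        raise ValueError("node_name must not be empty")
    if reason not in FAULT_REASONS:
        raise ValueError(
            f"unknown fault reason {reason!r}; expected one of {FAULT_REASONS}"
        )
    return ["kubectl", "label", "nodes", node_name, f"{FAULT_LABEL}={reason}"]
```

For example, `subprocess.run(fault_label_command("NODE_NAME", "XID"), check=True)` would apply the label, where `NODE_NAME` is the node you identified earlier.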
After you report a faulty host, GKE does the following:
- After the label is applied to the faulty node, GKE taints the node to block scheduling new Pods. GKE also starts to gracefully evict the running Pods on the node. GKE respects the Pod Disruption Budgets (PDBs) and the spec.terminationGracePeriodSeconds field of your Pod manifests. For more details, see Configure GKE to terminate your workloads gracefully.
- GKE then automatically reports the faulty host by calling the Compute Engine API, which results in a sequence of operations that takes around 10 to 12 minutes. Throughout this time, the VM is in the RUNNING state. Based on the VM's maintenance configuration for automaticRestart, Compute Engine either attempts to restart the VM on another host machine or leaves the VM in the TERMINATED state.
- After the operation is complete, GKE removes the fault-behavior node label from the node. The node is ready to serve workloads again.
Monitor the operation progress
You can monitor the progress of GKE's operation using the cloud.google.com/report-and-replace-status node label on your GKE node, which has one of the following values:
- PodsEvicted: GKE has finished evicting Pods from the affected node.
- OperationRUNNING: the operation to report the faulty host is running.
- OperationDone: the underlying host has been reported as faulty and the GKE node is ready to be moved to a new host.
- Error: the API call failed, for reasons that include not meeting one of the requirements described in the Requirements section.
You can also use the cloud.google.com/report-and-replace-operation node label to find the Compute Engine operation ID and monitor the status of the operation.
You can view both these node labels using the following command:
kubectl get nodes NODE_NAME \
-L cloud.google.com/report-and-replace-status,cloud.google.com/report-and-replace-operation
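To poll these labels from a script, you can parse the JSON form of the node instead of the table output. The following sketch reads both labels from the output of `kubectl get node NODE_NAME -o json`. Treating `OperationDone` and `Error` as terminal states is an assumption based on the value descriptions on this page, and the comparison is case-insensitive because the error value appears here in more than one casing. The function name is hypothetical.

```python
import json
from typing import Optional, Tuple

STATUS_LABEL = "cloud.google.com/report-and-replace-status"
OPERATION_LABEL = "cloud.google.com/report-and-replace-operation"

# Assumption: these statuses mean GKE has stopped working on the report.
TERMINAL_STATUSES = ("operationdone", "error")


def report_progress(node_json: str) -> Tuple[Optional[str], Optional[str], bool]:
    """Return (status, operation_id, finished) for a single node.

    Expects the JSON printed by `kubectl get node NODE_NAME -o json`.
    status and operation_id are None when the label is absent.
    """
    labels = json.loads(node_json).get("metadata", {}).get("labels", {})
    status = labels.get(STATUS_LABEL)
    finished = status is not None and status.lower() in TERMINAL_STATUSES
    return status, labels.get(OPERATION_LABEL), finished
```

A polling loop could call this function every minute or so until `finished` is true, and then inspect the status to decide whether to retry.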
In case of any API errors, GKE sets the node label cloud.google.com/report-and-replace-status=ERROR. GKE clears the node taints and removes the cloud.google.com/fault-behavior node label.