This page helps you understand, configure, and monitor disruption events that might occur on Google Kubernetes Engine (GKE) nodes running artificial intelligence (AI) or machine learning (ML) workloads, including:
- Why GPUs and TPUs require disruption management
- Handle workload disruption in GKE
- Configure GKE to terminate your workloads gracefully
- Monitor the progress of an active graceful termination
During the lifecycle of a long-running GKE cluster, periodic disruptions to workloads occur due to automatic maintenance events issued by Google Cloud for the Compute Engine resources underlying the GKE infrastructure. When these disruptions affect your GKE nodes running AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.
Why GPUs and TPUs require disruption management
Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs. Compute Engine VMs periodically experience host events for a variety of reasons, such as hardware or software updates, maintenance, and hardware failures. Host events are issued for the underlying Google Cloud infrastructure and bypass GKE maintenance policies and exclusions.
Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that there's little to no disruption to running workloads. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs, which is where your AI/ML workloads run. In addition, GKE might restart reserved and on-demand TPUs by using preemption for defragmentation reasons, which lets GKE provision larger TPUs.
When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload, such as a Job or Deployment, GKE restarts the Pods on the affected node. It's up to you, or the frameworks that you use to manage the workloads or Jobs, to react appropriately to the disruption. For example, you can save the state of your AI training job to reduce data loss.
Process of graceful termination
The following workflow shows how GKE executes graceful node termination after a disruption is issued by Compute Engine:
- Compute Engine issues an updated value of `TERMINATE_ON_HOST_MAINTENANCE` for the VM metadata key `maintenance-event`. Within 60 seconds, the following occurs:
  - The system components apply the node label `cloud.google.com/active-node-maintenance` set to `true` to indicate that maintenance is in progress.
  - GKE applies the node taint `cloud.google.com/impending-node-termination:NoSchedule` to prevent new Pods from being scheduled on the node. We recommend that you modify workloads to tolerate this taint because of the known termination that occurs (see the example toleration after this list).
- The `maintenance-handler` component begins to evict Pods, workload Pods first and then system Pods (for example, in `kube-system`).
- GKE sends a SIGTERM shutdown signal to alert running workload Pods on the node of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
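The following is a minimal sketch of a toleration that matches this taint; add it to the Pod spec of workloads that you want to allow onto nodes that carry the taint:

spec:
  tolerations:
  # Tolerate the taint that GKE applies when node termination is imminent.
  - key: cloud.google.com/impending-node-termination
    operator: Exists
    effect: NoSchedule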
The `maintenance-event` notifications occur when the underlying Compute Engine VM of a GKE node undergoes a disruptive host event that leads to node termination. When this occurs, Compute Engine updates the `maintenance-event` metadata key.
The advance notification window before the node is terminated is as follows:
- GPU: 60 minutes.
- TPU: 5 minutes.
Afterwards, the node termination occurs and a replacement node is allocated. GKE clears the labels and taints when the process is finished. To increase the termination window for your workloads using GPUs or TPUs, complete the steps in the Configure GKE to terminate your workloads gracefully section.
Handle workload disruption in GKE
To manage GKE termination events and reduce disruptions to workloads in your clusters, GKE monitors these notifications for you and does the following:
- GKE notifies your workloads in advance of an imminent shutdown: When a GKE node needs to stop for a host maintenance event, GKE sends a SIGTERM signal to running Pods on the node at the beginning of the advance notice period. OS signals such as SIGTERM can be handled natively by most standard libraries, for example, in Python and Go. Frameworks that can capture SIGTERM include MaxText, Pax, and JAX with Orbax.
- GKE gracefully terminates your workloads: You can configure GKE to gracefully terminate your workloads with a Pod termination grace period. Pods can react to the SIGTERM signal to finish any ongoing tasks and execute any termination action that you define, such as storing workload data to reduce data loss or saving a training state. During the graceful termination, GKE makes a best effort to terminate the Pods gracefully and to run the clean-up processes or termination action that you define in your application. A minimal sketch of a container that reacts to SIGTERM follows this list.
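For illustration, a container can react to SIGTERM even without an ML framework, for example by trapping the signal in its entrypoint. The following Pod manifest is a minimal sketch, assuming a training script that saves its state when it receives SIGTERM; the name, image, script, and grace period are hypothetical placeholders for your own workload:

apiVersion: v1
kind: Pod
metadata:
  name: training-pod                          # hypothetical name
spec:
  terminationGracePeriodSeconds: 600          # grace period; see the next section
  containers:
  - name: trainer
    image: example-registry/trainer:latest    # hypothetical image
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Start training in the background so that the shell can trap SIGTERM.
      ./train.sh &                             # hypothetical training script
      child=$!
      # When GKE sends SIGTERM, forward it to the training process and wait
      # for it to finish saving its state before the container exits.
      trap 'kill -TERM "$child"; wait "$child"' TERM
      wait "$child"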
Configure GKE to terminate your workloads gracefully
In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.
GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE sends a SIGTERM signal to Pods at the beginning of the grace period. If Pods don't exit by the end of the grace period, GKE sends a follow-up SIGKILL signal to any processes still running in any container in the Pod.
To configure the graceful termination period for workloads, follow the instructions for GPUs or TPUs.
GPU
In your Pod manifest, set the `spec.terminationGracePeriodSeconds` field to a value up to a maximum of 3600 seconds (60 minutes). For example, to get a notification time of 10 minutes, set the `spec.terminationGracePeriodSeconds` field to 600 seconds, as follows:

spec:
  terminationGracePeriodSeconds: 600
We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.
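If your GPU workload runs as part of a Job rather than as a standalone Pod, set the field in the Job's Pod template. The following is a minimal sketch under that assumption; the Job name, image, and GPU count are hypothetical placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job                      # hypothetical name
spec:
  template:
    spec:
      # Applies to each Pod that the Job creates.
      terminationGracePeriodSeconds: 600
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: example-registry/trainer:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1                 # adjust the GPU count for your workload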
TPU
To allocate the maximum time to perform your clean-up processes, set the `spec.terminationGracePeriodSeconds` field to 300 seconds (five minutes) in your Pod manifest. For example:

spec:
  terminationGracePeriodSeconds: 300
We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.
If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the shutdown SIGTERM signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.
Monitor the progress of an active graceful termination
In GKE clusters with the control plane running version 1.29.1-gke.1425000 or later, GKE deploys a node-level component called `gpu-maintenance-handler`. This component runs on all GPU and TPU nodes along with a corresponding control plane component. These components do the following:
- Process graceful termination events.
- Respond to imminent disruption events on the GKE VM by forwarding a SIGTERM signal to a node's running workloads. These disruptions are logged as Pod Eviction and Deletion requests.
GKE adds a label and taint to nodes with an imminent shutdown status. GKE monitors host event notifications, such as maintenance, by using the system component `maintenance-handler` running on each GPU and TPU node.
GKE logs the following graceful termination events:
- When it detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event: GKE adds the node label `cloud.google.com/active-node-maintenance` set to `true`.
- When restricting new workloads from being scheduled: GKE applies the `cloud.google.com/impending-node-termination:NoSchedule` taint.
When GKE finishes the graceful termination, the labels and taints are cleared.
To monitor the status of an active graceful termination caused by a disruption, you can view the `gpu-maintenance-handler` logs by using the Google Cloud console or the Google Cloud CLI.
gcloud
Locate the names of the nodes and Pods that are running instances of `gpu-maintenance-handler` by running the following command:

kubectl get pods -l name=maintenance-handler -A -o wide

Each row of the output includes the node name, Pod name, and status.

Check the logs:

kubectl logs -n=kube-system MAINTENANCE_HANDLER_POD_NAME

Replace `MAINTENANCE_HANDLER_POD_NAME` with the name of the handler instance. If a maintenance event is detected, the Pod records a message, applies the labels, and the evictions start.
Check the node labels and taints:

kubectl describe node NODE_NAME

Replace `NODE_NAME` with the name of the node that you want to view. The output shows the list of node labels and taints to watch.
Console
Go to the Logs Explorer page in the Google Cloud console.
In the Query field, specify the following query:

resource.type="k8s_container" resource.labels.namespace_name="kube-system" resource.labels.container_name="maintenance-handler" resource.labels.cluster_name="CLUSTER_NAME"

Replace `CLUSTER_NAME` with the name of your cluster.