This page helps you understand, configure, and monitor disruption events that might occur on Google Kubernetes Engine (GKE) nodes running artificial intelligence (AI) or machine learning (ML) workloads, including:
- Why GPUs and TPUs require disruption management
- Handle workload disruption in GKE
- Configure GKE to terminate your workloads gracefully
- Monitor the progress of an active graceful termination
During the lifecycle of a long-running GKE cluster, periodic disruptions to workloads occur due to automatic maintenance events issued by Google Cloud for the Compute Engine resources underlying the GKE infrastructure. When these disruptions affect your GKE nodes running AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.
Why GPUs and TPUs require disruption management
Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs. Compute Engine VMs periodically experience host events for a variety of reasons, such as hardware or software updates, maintenance, and hardware failures. Host events are issued for the underlying Google Cloud infrastructure and bypass GKE maintenance policies and exclusions.
Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that there's little to no disruption to running workloads. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs, which is where your AI/ML workloads run. In addition, GKE might restart reserved and on-demand TPUs by using preemption for defragmentation reasons, which lets GKE provision larger TPUs.
When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload, such as a Job or Deployment, GKE restarts the Pods on the affected node. It's up to you, or the frameworks that you use to manage the workloads or Jobs, to react appropriately to the disruption. For example, you can save the state of your AI training job to reduce data loss.
Process of graceful termination
The following workflow shows how GKE executes graceful node termination after a disruption is issued by Compute Engine:
- Compute Engine issues an updated value of `TERMINATE_ON_HOST_MAINTENANCE` for the VM metadata key `maintenance-event`. Within 60 seconds, the following occurs:
  - The system components apply the node label `cloud.google.com/active-node-maintenance` set to `true` to indicate that maintenance is in progress.
  - GKE applies the node taint `cloud.google.com/impending-node-termination:NoSchedule` to prevent new Pods from being scheduled on the node. We recommend that you modify workloads to tolerate this taint because of the known termination that occurs (see the example toleration after this list).
- The `maintenance-handler` component begins to evict Pods, workload Pods first and then system Pods (for example, in `kube-system`).
- GKE sends a SIGTERM shutdown signal to alert running workload Pods on the node of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
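The following is a minimal sketch of a toleration that matches this taint; add it to the Pod spec of workloads that you want to allow onto nodes that carry the taint:

spec:
  tolerations:
  # Tolerate the taint that GKE applies when node termination is imminent.
  - key: cloud.google.com/impending-node-termination
    operator: Exists
    effect: NoSchedule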
The `maintenance-event` notifications occur when the underlying Compute Engine VM of a GKE node undergoes a disruptive host event that leads to node termination. When this occurs, Compute Engine updates the `maintenance-event` metadata key.
The advance notification window before the node is terminated is as follows:
- GPU: 60 minutes.
- TPU: 5 minutes.
Afterwards, the node termination occurs and a replacement node is allocated. GKE clears the labels and taints when the process is finished. To increase the termination window for your workloads using GPUs or TPUs, complete the steps in the Configure GKE to terminate your workloads gracefully section.
Handle workload disruption in GKE
To manage GKE termination events and reduce disruptions to workloads in your clusters, GKE monitors these notifications for you and does the following:
- GKE notifies your workloads in advance of an imminent shutdown: When a GKE node needs to stop for a host maintenance event, GKE sends a SIGTERM signal to running Pods on the node at the beginning of the advance notice period. OS signals such as SIGTERM can be handled natively by most standard libraries, for example, in Python and Go. Frameworks that can capture SIGTERM include MaxText, Pax, and JAX with Orbax.
- GKE gracefully terminates your workloads: You can configure GKE to gracefully terminate your workloads with a Pod termination grace period. Pods can react to the SIGTERM signal to finish any ongoing tasks and execute any termination action that you define, such as storing workload data to reduce data loss or saving a training state. During the graceful termination, GKE makes a best effort to terminate the Pods gracefully and to run the clean-up processes or termination action that you define in your application. A minimal sketch of a container that reacts to SIGTERM follows this list.
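For illustration, a container can react to SIGTERM even without an ML framework, for example by trapping the signal in its entrypoint. The following Pod manifest is a minimal sketch, assuming a training script that saves its state when it receives SIGTERM; the name, image, script, and grace period are hypothetical placeholders for your own workload:

apiVersion: v1
kind: Pod
metadata:
  name: training-pod                          # hypothetical name
spec:
  terminationGracePeriodSeconds: 600          # grace period; see the next section
  containers:
  - name: trainer
    image: example-registry/trainer:latest    # hypothetical image
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Start training in the background so that the shell can trap SIGTERM.
      ./train.sh &                             # hypothetical training script
      child=$!
      # When GKE sends SIGTERM, forward it to the training process and wait
      # for it to finish saving its state before the container exits.
      trap 'kill -TERM "$child"; wait "$child"' TERM
      wait "$child"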
Configure GKE to terminate your workloads gracefully
In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.
GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE sends a SIGTERM signal to Pods at the beginning of the grace period. If Pods don't exit by the end of the grace period, GKE sends a follow-up SIGKILL signal to any processes still running in any container in the Pod.
To configure the graceful termination period for workloads, follow the instructions for GPUs or TPUs.
GPU
In your Pod manifest, set the `spec.terminationGracePeriodSeconds` field to a value up to a maximum of 3600 seconds (60 minutes). For example, to get a notification time of 10 minutes, set the `spec.terminationGracePeriodSeconds` field to 600 seconds, as follows:

spec:
  terminationGracePeriodSeconds: 600
We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.
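If your GPU workload runs as part of a Job rather than as a standalone Pod, set the field in the Job's Pod template. The following is a minimal sketch under that assumption; the Job name, image, and GPU count are hypothetical placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job                      # hypothetical name
spec:
  template:
    spec:
      # Applies to each Pod that the Job creates.
      terminationGracePeriodSeconds: 600
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: example-registry/trainer:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1                 # adjust the GPU count for your workload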
TPU
To allocate the maximum time to perform your clean-up processes, set the `spec.terminationGracePeriodSeconds` field to 300 seconds (five minutes) in your Pod manifest. For example:

spec:
  terminationGracePeriodSeconds: 300
We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.
If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, the workload can capture the shutdown SIGTERM signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.
Monitor the progress of an active graceful termination
In GKE clusters with the control plane running version 1.29.1-gke.1425000 or later, GKE deploys a node-level component called `gpu-maintenance-handler`. This component runs on all GPU and TPU nodes along with a corresponding control plane component. These components do the following:
- Process graceful termination events.
- Respond to imminent disruption events on the GKE VM by forwarding a SIGTERM signal to a node's running workloads. These disruptions are logged as Pod Eviction and Deletion requests.
GKE adds a label and taint to nodes with an imminent shutdown status. GKE monitors host event notifications, such as maintenance, by using the system component `maintenance-handler` running on each GPU and TPU node.
GKE logs the following graceful termination events:
- When it detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event: GKE adds the node label `cloud.google.com/active-node-maintenance` set to `true`.
- When restricting new workloads from being scheduled: GKE applies the `cloud.google.com/impending-node-termination:NoSchedule` taint.
When GKE finishes the graceful termination, the labels and taints are cleared.
To monitor the status of an active graceful termination caused by a disruption, you can view the `gpu-maintenance-handler` logs by using the Google Cloud console or the Google Cloud CLI.
gcloud
Locate the names of the nodes and Pods that are running instances of `gpu-maintenance-handler` by running the following command:

kubectl get pods -l name=maintenance-handler -A -o wide

Each row of the output includes the node name, Pod name, and status.

Check the logs:

kubectl logs -n=kube-system MAINTENANCE_HANDLER_POD_NAME

Replace `MAINTENANCE_HANDLER_POD_NAME` with the name of the handler instance. If a maintenance event is detected, the Pod records a message, applies the labels, and the evictions start.
Check the node labels and taints:

kubectl describe node NODE_NAME

Replace `NODE_NAME` with the name of the node that you want to view. The output shows the list of node labels and taints to watch.
Console
Go to the Logs Explorer page in the Google Cloud console.
In the Query field, specify the following query:

resource.type="k8s_container" resource.labels.namespace_name="kube-system" resource.labels.container_name="maintenance-handler" resource.labels.cluster_name="CLUSTER_NAME"

Replace `CLUSTER_NAME` with the name of your cluster.