Manage GKE node disruption for GPUs and TPUs


This page helps you understand, configure, and monitor disruption events that might occur on Google Kubernetes Engine (GKE) nodes running artificial intelligence (AI) or machine learning (ML) workloads on GPUs and TPUs.

During the lifecycle of a long-running GKE cluster, periodic disruptions to workloads occur due to automatic maintenance events issued by Google Cloud for the Compute Engine resources underlying the GKE infrastructure. When these disruptions affect your GKE nodes running AI/ML workloads, GKE needs to restart both the running workloads and the underlying node.

Why GPUs and TPUs require disruption management

Your GKE clusters manage the lifecycle of the GKE nodes. These nodes are provisioned on Compute Engine VMs. Compute Engine VMs periodically experience host events for a variety of reasons, such as hardware or software updates, maintenance, and hardware failures. Host events are issued for the underlying Google Cloud infrastructure and bypass GKE maintenance policies and exclusions.

Most Compute Engine VMs, with some exceptions, have their host maintenance policy set to live migrate, which means that there's little to no disruption of running workloads. However, certain classes of VMs don't support live migration, including VMs with attached GPUs and TPUs, where your AI/ML workloads run. In addition, GKE might restart reserved and on-demand TPUs by using preemption for defragmentation, which allows GKE to provision larger TPUs.

When a host event occurs, GKE terminates the node and its Pods. If the Pods are deployed as part of a larger workload like a Job or Deployment, GKE restarts the Pods on the affected node. It is up to you, or the frameworks that you use to manage your workloads or Jobs, to react appropriately to the disruption. For example, you can save the state of your AI training job to reduce data loss.

Process of graceful termination

The following workflow shows how GKE executes graceful node termination after a disruption is issued by Compute Engine:

  1. Compute Engine issues an updated value of TERMINATE_ON_HOST_MAINTENANCE for the VM metadata key maintenance-event.
  2. Within 60 seconds, the following occurs:

    1. GKE system components apply the cloud.google.com/active-node-maintenance node label to the node and set it to true, to indicate that maintenance is in progress.

    2. GKE applies the cloud.google.com/impending-node-termination:NoSchedule taint to the node to prevent new Pods from being scheduled on it. Because this termination is expected, we recommend that you modify your workloads to tolerate this taint, as shown in the example after this list.

  3. The maintenance-handler component begins to evict Pods, starting with workload Pods and then system Pods (for example, Pods in the kube-system namespace).

  4. GKE sends a shutdown signal SIGTERM to alert running workload Pods on the node of an imminent shutdown. Pods can use this alert to finish any ongoing tasks. GKE makes a best effort to terminate these Pods gracefully.
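
For example, a minimal toleration for this taint, which you could add to your existing Pod specification, looks like the following sketch:

    spec:
      tolerations:
      # Tolerate the taint that GKE applies when node termination is imminent.
      - key: "cloud.google.com/impending-node-termination"
        operator: "Exists"
        effect: "NoSchedule"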

A maintenance-event notification occurs when the underlying Compute Engine VM of a GKE node undergoes a disruptive host event that leads to node termination. When this happens, Compute Engine updates the maintenance-event metadata key. The advance maintenance notification window before the node is terminated is as follows:

  • GPU: 60 minutes.
  • TPU: 5 minutes.

Afterward, the node is terminated and a replacement node is allocated. When the process is finished, GKE clears the labels and taints. To increase the termination window for your workloads that use GPUs or TPUs, complete the steps in the Configure GKE to terminate your workloads gracefully section.
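
Optionally, if you have shell access to a node (for example, over SSH), you can check the underlying VM's metadata value directly by querying the Compute Engine metadata server:

    # Returns NONE normally, and TERMINATE_ON_HOST_MAINTENANCE when a disruptive
    # host event is scheduled for the VM.
    curl -s -H "Metadata-Flavor: Google" \
        "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"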

Handle workload disruption in GKE

To manage GKE termination events and reduce disruptions to workloads in your clusters, GKE monitors these notifications for you and does the following:

  • GKE notifies your workloads in advance of an imminent shutdown: When a GKE node needs to stop for a host maintenance event, GKE sends a SIGTERM signal to the Pods running on the node at the beginning of the advance notice period. OS signals such as SIGTERM can be handled natively by the standard libraries of most languages, for example Python and Go. Frameworks that can capture SIGTERM include MaxText, Pax, and JAX with Orbax.
  • GKE gracefully terminates your workloads: You can configure GKE to terminate your workloads gracefully by setting a Pod termination grace period. Pods can react to the SIGTERM signal to finish any ongoing tasks and execute any termination action that you define, such as saving a training state or storing workload data to reduce data loss. GKE makes a best effort to complete these clean-up processes within the grace period; a minimal sketch of this pattern follows this list.
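
The following is a minimal sketch of this pattern for a container entrypoint script; run_training.sh and save_checkpoint.sh are hypothetical placeholders for your own training launcher and checkpointing logic:

    #!/bin/bash
    # Run the training process in the background so that the shell can react to
    # SIGTERM, then forward the signal, checkpoint, and exit when it arrives.
    trap 'kill -TERM "$TRAIN_PID"; ./save_checkpoint.sh; exit 0' TERM

    ./run_training.sh &      # hypothetical training launcher
    TRAIN_PID=$!
    wait "$TRAIN_PID"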

Configure GKE to terminate your workloads gracefully

In this section, you configure GKE to manage your application lifecycle and minimize the disruption to your workload. If you don't configure a grace period, the grace period defaults to 30 seconds.

GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE sends a SIGTERM signal to Pods at the beginning of the grace period. If Pods don't exit by the end of the grace period, GKE sends a follow-up SIGKILL signal to any processes still running in any container in the Pod.

To configure the graceful termination period for workloads, follow the instructions for GPUs or TPUs.

GPU

In your Pod manifest, set the spec.terminationGracePeriodSeconds field to a value up to a maximum of 3600 seconds (60 minutes). For example, to give your workload 10 minutes to terminate gracefully, set the spec.terminationGracePeriodSeconds field in your Pod manifest to 600 seconds, as follows:

    spec:
      terminationGracePeriodSeconds: 600

We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.
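
Putting these pieces together, a GPU Pod manifest might look like the following sketch. The Pod name, container name, and TRAINING_IMAGE value are placeholders, and the toleration matches the taint described in the Process of graceful termination section:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-training-pod
    spec:
      terminationGracePeriodSeconds: 600
      tolerations:
      # Tolerate the taint that GKE applies when node termination is imminent.
      - key: "cloud.google.com/impending-node-termination"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: trainer
        image: TRAINING_IMAGE
        resources:
          limits:
            # Request one GPU for the container.
            nvidia.com/gpu: 1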

TPU

To allocate the maximum time to perform your clean-up processes, set the spec.terminationGracePeriodSeconds field to 300 seconds (five minutes) in your Pod manifest. For example:

    spec:
      terminationGracePeriodSeconds: 300

We recommend that you set a termination grace period that is long enough for any ongoing tasks to finish within the notification timeframe.

If your workload uses an ML framework such as MaxText, Pax, or JAX with Orbax, it can capture the SIGTERM shutdown signal and initiate a checkpointing process. To learn more, see TPU Autocheckpoint.

Monitor the progress of an active graceful termination

In GKE clusters with the control plane running 1.29.1-gke.1425000 or later, GKE deploys a node-level component called gpu-maintenance-handler. This component runs on all GPU and TPU nodes along with a corresponding control plane component. These components do the following:

  • Process graceful termination events.
  • Respond to imminent disruption events on the GKE VM by forwarding a SIGTERM signal to a node's running workloads. These disruptions are logged as Pod Eviction and Deletion requests.

GKE adds a label and a taint to nodes that have an imminent shutdown status. GKE monitors host event notifications, such as maintenance, by using the maintenance-handler system component that runs on each GPU and TPU node.

GKE logs the following graceful termination events:

  • When it detects a disruption due to an impending node termination, such as a Compute Engine host maintenance event: GKE adds the cloud.google.com/active-node-maintenance node label, set to true.
  • When it restricts new workloads from being scheduled on the node: GKE applies the cloud.google.com/impending-node-termination:NoSchedule taint.

When GKE finishes the graceful termination, the labels and taints are cleared.
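
For example, to list any nodes where GKE has marked maintenance as in progress, you can filter on the label:

    kubectl get nodes -l cloud.google.com/active-node-maintenance=true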

To monitor the status of an active graceful termination caused by a disruption, you can check the gpu-maintenance-handler logs by using the Google Cloud console or the Google Cloud CLI.

gcloud

  1. Locate the names of the nodes and Pods that are running instances of gpu-maintenance-handler by running the following command:

    kubectl get pods -l name=maintenance-handler -A -o wide
    

    Each row of the output includes the Node name, Pod name, and status.

  2. Check the logs:

    kubectl logs -n=kube-system MAINTENANCE_HANDLER_POD_NAME
    

    Replace MAINTENANCE_HANDLER_POD_NAME with the name of the handler instance.

    If a maintenance event is detected, the Pod records a log message, GKE applies the labels, and the evictions start.

  3. Check the node labels and taints:

    kubectl describe node NODE_NAME
    

    Replace NODE_NAME with the name of the node that you want to view.

    The output shows the list of node labels and taints to watch.
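
    To narrow the output to the relevant label and taint, you can filter it, for example:

    kubectl describe node NODE_NAME | grep -E "active-node-maintenance|impending-node-termination"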

Console

  1. Go to the Logs Explorer page in the Google Cloud console:

    Go to Logs Explorer

  2. In the Query field, specify the following query:

    resource.type="k8s_container"
    resource.labels.namespace_name="kube-system"
    resource.labels.container_name="maintenance-handler"
    resource.labels.cluster_name="CLUSTER_NAME"
    

    Replace CLUSTER_NAME with the name of your cluster.

What's next