Manage maintenance events with Cloud TPU Pods

Overview

TPU Nodes and TPU VMs are Compute Engine VM instances with attached TPU hardware. Compute Engine VMs are subject to Compute Engine VM maintenance events. Each TPU is connected to a Compute Engine VM, so using more TPUs (for example, in a TPU Pod) increases the likelihood that one of your VMs will encounter a maintenance event.

This document discusses various approaches to handle maintenance events for long-running training jobs on Cloud TPUs.

Using checkpoints for fast recovery from maintenance events

Checkpoints are key to quick recovery from maintenance events and should be saved frequently: a good rule of thumb is to save a checkpoint approximately every hour. Checkpointing too infrequently risks losing a lot of training progress to maintenance events or other training interruptions.

A checkpoint generally contains all of the saved parameters used in training (such as model weights). The time it takes to save a checkpoint can range from seconds to minutes.

Although most maintenance events are recovered from automatically and training jobs continue without manual intervention, there might be edge cases where a job does not restart and continue automatically. When this happens, you need to delete and re-create the TPU resources and restart the training job from a saved checkpoint. For information about how to detect and recover from automatic recovery failures, see Detect and recover from TPU failures.

The mechanisms used to save and load checkpoints are different for each ML framework. Supported Cloud TPU models generally have checkpointing built in. For more information about checkpointing, see the documentation for TensorFlow 2.x, PyTorch, or JAX/Flax.
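
For example, if your training job writes checkpoints to a Cloud Storage bucket, a quick way to confirm that checkpoints are being written about once an hour is to list the most recent ones. This is a minimal sketch; the bucket and path are placeholders, so substitute your own:

$ gsutil ls -l gs://your-bucket/your-model/checkpoints/ | grep gs:// | sort -k2 | tail -n 5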

Detecting maintenance events

You can detect if and when a maintenance event occurs on your TPU using the following gcloud describe command:

TPU VMs

$ gcloud compute tpus tpu-vm describe tpu-name --zone=zone  | grep 'health'

TPU Nodes

$ gcloud compute tpus describe tpu-name --zone=zone | grep 'health'

The output from this command displays the current state of the TPU and a description of the most recent maintenance event. The output should look similar to the following:

health: HEALTHY
healthDescription: The TPU had a maintenance event at 2022-01-26T03:44:36.265703305Z
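
If you want to notice a maintenance event soon after it happens, you can poll this field from a shell. The following is a minimal sketch, assuming a TPU VM and that TPU_NAME and ZONE are set to your own values; the exact health values reported can vary:

while true; do
  # Read the current health of the TPU VM (for TPU Nodes, use "gcloud compute tpus describe").
  health=$(gcloud compute tpus tpu-vm describe ${TPU_NAME} --zone=${ZONE} --format="value(health)")
  if [[ "${health}" != "HEALTHY" ]]; then
    echo "$(date): TPU ${TPU_NAME} reported health '${health}'"
  fi
  sleep 300
done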

Maintenance event logs

You can view historical logs of maintenance events on your TPU in system event audit logs.

In the Google Cloud console navigation menu, click Compute Engine > VM instances, and then search for entries such as:

"tpu.nodes.terminate" OR "tpu.nodes.restart"

Any interruptions and repairs of your TPU workers within your search timeframe are displayed. The logs show the date and time of each event, the type of event, and, for "terminate" events, the reason for the termination in protoPayload.metadata.terminateReason.
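
You can also query the same audit-log entries from the command line with gcloud logging read. The following is a sketch: PROJECT_ID is a placeholder, the free-text filter assumes the method names above appear in the log entries, and the freshness window is arbitrary:

$ gcloud logging read '"tpu.nodes.terminate" OR "tpu.nodes.restart"' \
    --project=${PROJECT_ID} --freshness=7d --limit=20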

Handle maintenance events

There are several ways you can mitigate maintenance event disruptions.

  1. Periodically save checkpoints

    In the ideal scenario, when an "interruption event" happens, training resumes from the latest checkpoint.

  2. Training script retries

    The training script might stop as a result of an "interruption event". You can use a bash script to continuously retry the training script until training is complete. Each retry should continue from the latest checkpoint, so retry scripts should always be used in conjunction with checkpoints.

    Production-ready training pipelines should use a resource management system such as Google Kubernetes Engine (GKE). For more information about using Google Kubernetes Engine with the TPU VM architecture, see Deploy TPU workloads. For more information about using Google Kubernetes Engine with the TPU Node architecture, see Run TPU applications on Google Kubernetes Engine. Otherwise, you can implement the retry loop as a simple bash script. For example:

    With TPU Node (run this command from your VM):

    while ! python3 [training command]; do sleep 1; done

    With TPU VM:

    while ! gcloud compute tpus tpu-vm ssh ${TPU_NAME} --command "python3 [training command]"; do sleep 1; done
    

    (Note that you need to run the TPU VM command from Cloud Shell or a local terminal, not from the TPU VM itself.)

  3. Detect and recover from TPU failures

    When a TPU does not recover from a maintenance event, you can use a recovery script that detects the TPU state and then deletes and re-creates the TPU; a rough sketch is shown at the end of this section. An example of this script can be found here. See Managing TPUs for details on manually deleting and re-creating TPUs.

    When creating or re-creating a TPU VM, you can specify a startup script with the --metadata startup-script parameter. A startup script runs whenever a TPU VM is created. Refer to Run standard installation scripts for more information.
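
    The following bash sketch combines these two ideas: it polls the TPU state, and when the TPU is unrecoverable it deletes and re-creates it, relying on a startup script to relaunch training from the latest checkpoint. The state values, accelerator type, runtime version, and start_training.sh script are assumptions; adapt them to your own configuration:

    while true; do
      # Query the current lifecycle state of the TPU VM.
      state=$(gcloud compute tpus tpu-vm describe ${TPU_NAME} --zone=${ZONE} --format="value(state)")
      if [[ "${state}" == "TERMINATED" || "${state}" == "PREEMPTED" ]]; then
        echo "$(date): TPU ${TPU_NAME} is ${state}; deleting and re-creating it"
        gcloud compute tpus tpu-vm delete ${TPU_NAME} --zone=${ZONE} --quiet
        # Re-create the TPU VM; the startup script relaunches training from the latest checkpoint.
        gcloud compute tpus tpu-vm create ${TPU_NAME} \
          --zone=${ZONE} \
          --accelerator-type=v4-8 \
          --version=tpu-ubuntu2204-base \
          --metadata startup-script="$(cat start_training.sh)"
      fi
      sleep 600
    done

    If start_training.sh wraps the training command in a retry loop like the one shown in step 2, the re-created TPU VM resumes training from the latest checkpoint without manual intervention.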