[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Preserve training progress using Autocheckpoint\n===============================================\n\n|\n| **Preview**\n|\n|\n| This product or feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA products and features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nHistorically, when a TPU VM requires\n[maintenance](/tpu/docs/maintenance-events),\nthe procedure is initiated immediately, without leaving time for users to\nperform progress-preserving actions such as saving a checkpoint. This is\nshown in Figure 1(a).\n\n**Fig. 1.** Illustration of the Autocheckpoint feature:\n(a) Without Autocheckpoint, the training progress from the last checkpoint\nis lost when there is an upcoming maintenance event. (b) With Autocheckpoint,\nthe training progress since the last\ncheckpoint can be preserved when there is an upcoming maintenance event.\n\nYou can use Autocheckpoint (Figure 1(b)) to preserve training progress by\nconfiguring your code to save a non-scheduled checkpoint when a maintenance\nevent occurs. When a maintenance event occurs, progress since the last\ncheckpoint is automatically saved. The feature works on both single slices\nand Multislice.\n\nThe Autocheckpoint feature works with frameworks that can capture\nSIGTERM signals and subsequently save a checkpoint. The supported frameworks\ninclude:\n\n- [MaxText](https://github.com/google/maxtext),\n- [Pax](https://github.com/google/paxml),\n- JAX with [Orbax](https://github.com/google/orbax).\n\nUsing Autocheckpoint\n--------------------\n\nThe Autocheckpoint feature is disabled by default. When you create a\nTPU or a request a [queued resource](/tpu/docs/queued-resources),\nyou can enable Autocheckpoint by adding the `--autocheckpoint-enabled` flag when provisioning\nthe TPU.\nWith the feature enabled, Cloud TPU\nperforms the following steps once it receives notification of a\nmaintenance event:\n\n1. Capture SIGTERM signal sent to the process using the TPU device\n2. Wait until the process exits, or 5 minutes have elapsed, whichever comes first\n3. Perform maintenance on the impacted slices\n\nThe infrastructure used by Autocheckpoint is ML framework-independent.\nAny ML framework\ncan support Autocheckpoint if it can capture the SIGTERM signal\nand initiate a checkpointing process.\n\nIn the application code, you need to enable the Autocheckpoint\ncapabilities provided by the ML framework. In Pax, for example,\nthis means enabling command-line flags when launching the\ntraining. 
In the application code, you need to enable the Autocheckpoint
capabilities provided by the ML framework. In Pax, for example,
this means enabling command-line flags when launching the
training. For more information, see [the Autocheckpoint quickstart with Pax](#pax-single-slice).
Behind the scenes, the frameworks save a
non-scheduled checkpoint when a SIGTERM signal is received,
and the impacted TPU VM goes through maintenance when the TPU is no longer
in use.

Quickstart: Autocheckpoint with MaxText
---------------------------------------

[MaxText](https://github.com/google/maxtext) is a high-performance,
arbitrarily scalable, open source, well-tested LLM written in pure Python/JAX
targeting Cloud TPUs.
MaxText contains all the necessary setup to use the Autocheckpoint
feature.

The [MaxText `README`
file](https://github.com/AI-Hypercomputer/maxtext/blob/main/README.md) describes
two ways to run MaxText at scale:

- Using [multihost_runner.py](https://github.com/google/maxtext/blob/main/multihost_runner.py), recommended for experimentation
- Using [multihost_job.py](https://github.com/google/maxtext/blob/main/multihost_job.py), recommended for production

When using `multihost_runner.py`, enable Autocheckpoint by setting
the `--autocheckpoint-enabled` flag when provisioning the queued resource.

When using
`multihost_job.py`, enable Autocheckpoint by specifying the
`ENABLE_AUTOCHECKPOINT=true` command line flag when launching the job.

Quickstart: Autocheckpoint with Pax on a single slice
-----------------------------------------------------

This section provides an example of how to set up and use Autocheckpoint
with Pax on a single slice. With the appropriate setup:

- A checkpoint is saved when a maintenance event occurs.
- Cloud TPU performs maintenance on the affected TPU VM(s) after the checkpoint is saved.
- When Cloud TPU completes maintenance, you can use the TPU VM as usual.

1. Use the `--autocheckpoint-enabled` flag when creating the TPU VM or requesting a
   queued resource.

   For example:

   1. Set environment variables:

      ```bash
      export PROJECT_ID=your-project-id
      export TPU_NAME=your-tpu-name
      export ZONE=zone-you-want-to-use
      export ACCELERATOR_TYPE=your-accelerator-type
      export RUNTIME_VERSION=tpu-ubuntu2204-base
      ```

      Environment variable descriptions:

      - `PROJECT_ID`: the ID of your Google Cloud project.
      - `TPU_NAME`: the name of the TPU to create.
      - `ZONE`: the zone in which to create the TPU.
      - `ACCELERATOR_TYPE`: the type of TPU accelerator to create.
      - `RUNTIME_VERSION`: the TPU software version.

   2. Set your project ID and zone in your active configuration:

      ```bash
      gcloud config set project $PROJECT_ID
      gcloud config set compute/zone $ZONE
      ```

   3. Create a TPU:

      ```bash
      gcloud alpha compute tpus tpu-vm create $TPU_NAME \
          --accelerator-type $ACCELERATOR_TYPE \
          --version $RUNTIME_VERSION \
          --autocheckpoint-enabled
      ```

      | **Note:** The Pax version that supports Autocheckpoint requires the runtime version `tpu-ubuntu2204-base`, which comes with Python 3.10.

2. Connect to the TPU using SSH:

   ```bash
   gcloud compute tpus tpu-vm ssh $TPU_NAME
   ```

3. Install Pax on a single slice.

   The Autocheckpoint feature works on Pax versions 1.1.0 and later. On the TPU VM,
   install `jax[tpu]` and the latest `paxml`:

   ```bash
   pip install paxml && pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
   ```

4. Configure the [`LmCloudSpmd2B`](https://github.com/google/paxml/blob/b18a8d109ec45bbd7e4bcab04e3e53c2e65f3035/paxml/tasks/lm/params/lm_cloud.py#L166)
   model.
   Before running the training script, change `ICI_MESH_SHAPE` to
   `[1, 8, 1]`:

   ```python
   @experiment_registry.register
   class LmCloudSpmd2B(LmCloudSpmd):
     """SPMD model with 2B params.

     Global batch size = 1 * 8 * 1 * 8 = 64.
     """
     PERCORE_BATCH_SIZE = 8

     NUM_LAYERS = 18
     MODEL_DIMS = 3072
     HIDDEN_DIMS = MODEL_DIMS * 4

     CHECKPOINT_POLICY = layers.AutodiffCheckpointType.SAVE_NOTHING
     ICI_MESH_SHAPE = [1, 8, 1]
   ```

5. Launch the training with the appropriate configuration.

   The following example shows how to configure the [LmCloudSpmd2B](https://github.com/google/paxml/blob/b18a8d109ec45bbd7e4bcab04e3e53c2e65f3035/paxml/tasks/lm/params/lm_cloud.py#L166)
   model to save checkpoints triggered by Autocheckpoint to a Cloud Storage
   bucket. Replace `your-storage-bucket` with the name of an existing
   bucket, or [create a new bucket](/storage/docs/creating-buckets).

   ```bash
   export JOB_LOG_DIR=gs://your-storage-bucket

   { python3 .local/lib/python3.10/site-packages/paxml/main.py \
       --jax_fully_async_checkpoint=1 \
       --exit_after_ondemand_checkpoint=1 \
       --exp=tasks.lm.params.lm_cloud.LmCloudSpmd2B \
       --job_log_dir=$JOB_LOG_DIR; } 2>&1 | tee pax_logs.txt
   ```

   Note the two flags that are passed to the command:

   - `jax_fully_async_checkpoint`: With this flag on, [orbax.checkpoint.AsyncCheckpointer](https://github.com/google/orbax/blob/986f23ff728c0ed5273f17662fa49011a08342bc/checkpoint/orbax/checkpoint/async_checkpointer.py#L43) is used. The `AsyncCheckpointer` class automatically saves a checkpoint when the training script receives a SIGTERM signal.
   - `exit_after_ondemand_checkpoint`: With this flag on, the TPU process exits after the Autocheckpoint is successfully saved, which triggers maintenance to be performed immediately. If you don't use this flag, training continues after the checkpoint is saved, and Cloud TPU waits for a timeout to occur (5 minutes) before performing the required maintenance.

Autocheckpoint with Orbax
-------------------------

The Autocheckpoint feature is not limited to MaxText or Pax. Any framework
that can capture the SIGTERM signal and initiate a
checkpointing process works with the infrastructure provided by Autocheckpoint.
[Orbax](https://github.com/google/orbax), a namespace that provides
common utility libraries for JAX users, provides these capabilities.

As explained in the [Orbax documentation](https://github.com/google/orbax/blob/986f23ff728c0ed5273f17662fa49011a08342bc/docs/preemption_checkpointing.ipynb),
these capabilities are enabled by default for users
of `orbax.checkpoint.CheckpointManager`. The `save` method
that is called after every step automatically checks whether a maintenance
event is impending and, if so, saves a checkpoint even if the step number
is not a multiple of `save_interval_steps`.
The same documentation
also illustrates how to make the training exit after saving an
Autocheckpoint, with a modification in the user code.
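That user-code pattern boils down to a training loop along the following
lines. This is a minimal sketch, not the Orbax documentation's exact code:
the checkpoint directory, `num_steps`, `state`, and `train_step` are
hypothetical placeholders, and the constructor shown uses the legacy-style
`CheckpointManager` interface, whose signature varies across Orbax releases.

```python
import sys

import orbax.checkpoint as ocp

# Placeholders standing in for a real training setup.
num_steps = 10_000
state = {'step': 0}  # the pytree to checkpoint

def train_step(state):
    # Placeholder for one optimization step.
    return {'step': state['step'] + 1}

manager = ocp.CheckpointManager(
    'gs://your-storage-bucket/checkpoints',  # hypothetical path
    ocp.PyTreeCheckpointer(),
    options=ocp.CheckpointManagerOptions(save_interval_steps=1000),
)

for step in range(num_steps):
    state = train_step(state)
    # save() writes a checkpoint on multiples of save_interval_steps, and
    # also when a maintenance event is impending.
    manager.save(step, state)
    if manager.reached_preemption(step):
        # Wait for the (possibly asynchronous) save to complete, then exit
        # so that maintenance can start immediately.
        manager.wait_until_finished()
        sys.exit(0)
```

After maintenance completes, relaunching the same loop can resume from the
most recent checkpoint, including a non-scheduled Autocheckpoint:

```python
latest = manager.latest_step()
if latest is not None:
    state = manager.restore(latest)
```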