Manually start a host maintenance event
This page explains how to manually start a host maintenance event on supported TPU VMs. This is useful for workloads that might be impacted by degraded performance or downtime, for which you need the maintenance window to start at a specific time.
When you manually start a maintenance event, the host maintenance event starts immediately. You can't specify a date or time for the maintenance event to start. If you don't use this feature, then the maintenance event occurs at the time indicated in the upcoming maintenance notification.
For information about manually starting a maintenance event for TPUs in GKE, see Manage GKE node disruption for GPUs and TPUs.
Limitations
You can manually start a host maintenance event only for the following configurations:
- TPU v6e VMs with the 2x4 topology configuration (v6e-8 if using the accelerator type field in the Cloud TPU API) or larger
- GKE multi-host node pools with TPU v6e VMs that are 2x4 or larger
Starting a host maintenance event immediately for larger slices might result in slice unavailability of up to a few hours. Normally, a host maintenance event causes the slice to be rescheduled as soon as possible onto another eligible set of hosts, but for larger host maintenance requests there might not be sufficient capacity to reschedule the slice immediately, leading to a longer wait time.
Additionally, initiating maintenance on the Cloud TPU slice starts maintenance for all underlying TPU VMs. If you perform maintenance directly on one of the instances using the Instances API, all of the instances within the Cloud TPU slice go into maintenance. Instead, use the queued-resources Cloud TPU API to specify which nodes should have maintenance performed.
Manually start a host maintenance event
You can use maintenance notifications to determine when you can manually start a maintenance event on a TPU.
Check the notification information
You can find notifications for upcoming maintenance events using the Cloud TPU API or by querying the metadata server on your VM. For more information, see View maintenance notifications.
You can start a maintenance event ahead of time when there is an upcoming host maintenance notification present on the TPU. To start the maintenance event ahead of time, the upcoming host maintenance notification must have canReschedule set to true and maintenanceStatus set to PENDING.
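The two eligibility conditions above can be sketched as a small helper. This is a minimal illustration, assuming the notification has been parsed into a dict whose keys mirror the upcomingMaintenance fields shown later on this page; the function name is hypothetical.

```python
# Hypothetical helper: decide whether a host maintenance event can be started
# early, based on the two conditions described above. The dict keys mirror
# the upcomingMaintenance fields (canReschedule, maintenanceStatus).

def can_start_maintenance_early(notification):
    """Return True if the notification allows manually starting maintenance."""
    return (
        notification.get("canReschedule") is True
        and notification.get("maintenanceStatus") == "PENDING"
    )

# Illustrative notifications:
pending = {"canReschedule": True, "maintenanceStatus": "PENDING"}
ongoing = {"canReschedule": True, "maintenanceStatus": "ONGOING"}

print(can_start_maintenance_early(pending))  # True
print(can_start_maintenance_early(ongoing))  # False
```

A notification that is missing either field is treated as not eligible, which is the safe default for a pre-flight check.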
Start the maintenance event
To start a host maintenance event, you can use the Cloud TPU API with the
perform-maintenance
command:
gcloud alpha compute tpus tpu-vm perform-maintenance TPU_NAME \
    --zone=ZONE
When the operation completes, the windowEndTime and windowStartTime fields change to the time at which you initiated the maintenance event, and the maintenanceStatus field changes to ONGOING. The host maintenance event begins soon after.
Use the gcloud alpha compute tpus tpu-vm describe command to view the status of the maintenance event:
gcloud alpha compute tpus tpu-vm describe TPU_NAME \
    --zone=ZONE
The output contains a section similar to the following:
upcomingMaintenance:
  canReschedule: true
  latestWindowStartTime: "2025-12-01T19:00:00Z"
  maintenanceStatus: ONGOING
  type: SCHEDULED
  windowEndTime: "2025-12-01T22:00:00Z"
  windowStartTime: "2025-12-01T19:00:00Z"
Maintenance is complete when the TPU VM's state is READY and the output from the gcloud alpha compute tpus tpu-vm describe command no longer contains an upcomingMaintenance field.
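The completion check can be sketched as a small predicate over the parsed describe output. This is an illustrative sketch, assuming the output has been loaded into a dict (for example from the command's JSON output format); the state field name follows the Cloud TPU API, but the helper itself is hypothetical.

```python
# Hypothetical helper: given the describe output parsed into a dict, report
# whether maintenance is complete. Per the rule above, maintenance is done
# when the state is READY and no upcomingMaintenance field remains.

def maintenance_complete(describe_output):
    """Return True once maintenance has finished on the TPU VM."""
    return (
        describe_output.get("state") == "READY"
        and "upcomingMaintenance" not in describe_output
    )

# Illustrative describe outputs:
in_progress = {
    "state": "READY",
    "upcomingMaintenance": {"maintenanceStatus": "ONGOING"},
}
done = {"state": "READY"}

print(maintenance_complete(in_progress))  # False
print(maintenance_complete(done))         # True
```

In practice you would poll the describe command with this check until it returns True.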
For Multislice environments, you can manually start a host maintenance event on specific slices using the following command:
gcloud alpha compute tpus queued-resources perform-maintenance QR_NAME \
    --zone=ZONE \
    --node-names=NODE_NAMES
NODE_NAMES is a comma-separated list of slices (nodes) in the queued resource for which you want to start a host maintenance event. For example, if the queued resource has nodes named my-qr-0, my-qr-1, and my-qr-2, a valid input to the perform-maintenance command would be --node-names=my-qr-0,my-qr-1.
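When driving this from a script, the comma-separated flag is easy to get wrong; the following sketch assembles the command's argument list from a Python list of node names. The helper name and the example queued-resource name and zone are illustrative, not from the original page.

```python
# Hypothetical helper: build the queued-resources perform-maintenance
# command as an argument list, joining the node names with commas as the
# --node-names flag requires.

def build_perform_maintenance_cmd(qr_name, zone, node_names):
    return [
        "gcloud", "alpha", "compute", "tpus", "queued-resources",
        "perform-maintenance", qr_name,
        f"--zone={zone}",
        f"--node-names={','.join(node_names)}",
    ]

# Illustrative values, matching the example above:
cmd = build_perform_maintenance_cmd(
    "my-qr", "us-central2-b", ["my-qr-0", "my-qr-1"]
)
print(" ".join(cmd))
```

Passing the result to subprocess.run as a list (rather than a shell string) avoids quoting issues with the node names.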