Handle GPU host maintenance events

When Compute Engine performs maintenance on a virtual machine (VM) with attached graphics processing units (GPUs), the VM must be stopped. This is because VMs with attached GPUs can't be live migrated.

You must configure these VMs to stop during host maintenance events. You can also configure them to restart automatically after the maintenance event completes.
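As a minimal sketch of these two settings, the following Python snippet builds a request body of the kind that the Compute Engine API's instances.setScheduling method accepts. It assumes the Scheduling fields onHostMaintenance and automaticRestart. The function name gpu_scheduling_body is hypothetical, and the snippet only builds and prints the JSON. Sending it requires an authenticated API client, which is not shown here.

```python
import json


def gpu_scheduling_body(automatic_restart: bool = True) -> dict:
    """Build a scheduling body for a VM with attached GPUs (hypothetical helper)."""
    return {
        # VMs with attached GPUs can't be live migrated, so they must stop
        # (terminate) during host maintenance.
        "onHostMaintenance": "TERMINATE",
        # Restart the stopped VM automatically after maintenance completes.
        "automaticRestart": automatic_restart,
    }


if __name__ == "__main__":
    print(json.dumps(gpu_scheduling_body(), indent=2))
```

Keeping the body in one small function makes it easy to reuse the same policy for every GPU VM that you manage.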

Host maintenance events typically occur once every two weeks, but might occasionally run more frequently.

This document discusses how you can minimize disruptions to your workloads during a maintenance event.

Receive advance notice before maintenance events

You can monitor the maintenance schedule for your VM and prepare your workloads to transition through the system restart.

To receive advance notice of host maintenance events, monitor the /computeMetadata/v1/instance/maintenance-event metadata value. If the request to the metadata server returns NONE, then the VM isn't scheduled to stop. For example, run the following command from within a VM:

curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"

NONE
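An application can make the same request itself instead of shelling out to curl. The following is a minimal sketch that uses only the Python standard library. The function names build_request, fetch_maintenance_event, and no_maintenance_scheduled are hypothetical, and the 5-second timeout is an arbitrary assumption. The metadata hostname resolves only from within a Compute Engine VM.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def build_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Create the request with the header that the metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_maintenance_event(url: str = METADATA_URL, timeout: float = 5.0) -> str:
    """Return the current maintenance-event value as a stripped string."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def no_maintenance_scheduled(event: str) -> bool:
    """Return True when the metadata value is NONE."""
    return event.strip() == "NONE"


# Example usage from within a VM:
#   event = fetch_maintenance_event()
#   if no_maintenance_scheduled(event):
#       print("No host maintenance is scheduled.")
```

Calling fetch_maintenance_event on a timer, for example once every few seconds, is one simple way to monitor the value.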

If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your VM is scheduled to stop. Compute Engine gives GPU VMs a 1-hour notice before stopping them, while other VMs receive only a 60-second notice. Configure your application to transition through the maintenance event. For example, you might use one of the following techniques:

Configure your application to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the VM restarts.
Write data to a secondary Persistent Disk. After the VM automatically restarts, you can reattach the Persistent Disk so that your application can resume work.
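The second technique can be sketched as a work loop that checks the maintenance value between units of work, saves its progress when a stop is scheduled, and resumes from the saved state on the next start. This is an illustration under stated assumptions, not a definitive implementation: the mount point /mnt/disks/checkpoints, the file name checkpoint.json, and all function names are hypothetical, and get_event stands in for whatever function your application uses to read the maintenance-event metadata value.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical mount point of the secondary Persistent Disk; adjust as needed.
CHECKPOINT_DIR = Path("/mnt/disks/checkpoints")
CHECKPOINT_NAME = "checkpoint.json"


def save_checkpoint(state: dict, checkpoint_dir: Path) -> Path:
    """Write state as JSON, atomically replacing any previous checkpoint."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    target = checkpoint_dir / CHECKPOINT_NAME
    # Write to a temporary file in the same directory, then rename it, so a
    # stop in the middle of a write never leaves a truncated checkpoint.
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, target)
    return target


def load_checkpoint(checkpoint_dir: Path) -> Optional[dict]:
    """Return the saved state, or None if no checkpoint exists yet."""
    target = checkpoint_dir / CHECKPOINT_NAME
    if not target.exists():
        return None
    with target.open(encoding="utf-8") as handle:
        return json.load(handle)


def run_with_checkpoints(
    total_steps: int, get_event: Callable[[], str], checkpoint_dir: Path
) -> dict:
    """Run steps until done, or checkpoint and return when a stop is scheduled."""
    state = load_checkpoint(checkpoint_dir) or {"next_step": 0}
    while state["next_step"] < total_steps:
        if get_event() == "TERMINATE_ON_HOST_MAINTENANCE":
            save_checkpoint(state, checkpoint_dir)
            return state
        # Placeholder for one unit of real work.
        state["next_step"] += 1
    save_checkpoint(state, checkpoint_dir)
    return state
```

For example, an application might call run_with_checkpoints with CHECKPOINT_DIR at startup. The same pattern applies to the Cloud Storage technique if save_checkpoint and load_checkpoint are replaced with upload and download calls.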

What's next?

Learn more about GPU platforms.
To learn more about managing and scaling groups of VMs, see Set the group's target size.
To monitor GPU performance, see Monitoring GPU performance.
To improve network performance, see Use higher network bandwidth.
Learn how to troubleshoot VM shutdowns and reboots.