GPU instances cannot be live migrated. You must set your GPU instances to stop for host maintenance events. If needed, you can set your stopped instances to automatically restart after the maintenance event completes. On Compute Engine, host maintenance events typically occur once every two weeks, but they might occasionally occur more often.
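As a minimal sketch of the configuration described above, the following helper sets an instance to stop during host maintenance and to restart automatically afterward. The function name, the `GCLOUD` override variable, and the instance and zone names in the usage note are hypothetical; the `gcloud compute instances set-scheduling` flags are the standard ones for these two settings.

```shell
# Hypothetical helper: configure an instance to stop (rather than live migrate)
# during host maintenance and to restart automatically afterward.
# Arguments: instance name, zone.
set_gpu_maintenance_policy() {
  local instance="$1"
  local zone="$2"
  # GCLOUD is an assumed override hook; set GCLOUD=echo to preview the command
  # instead of running it.
  "${GCLOUD:-gcloud}" compute instances set-scheduling "$instance" \
    --zone="$zone" \
    --maintenance-policy=TERMINATE \
    --restart-on-failure
}
```

For example, `set_gpu_maintenance_policy my-gpu-vm us-central1-a` would apply the settings to an instance named `my-gpu-vm`, assuming that instance exists in that zone.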
To minimize disruptions to your workloads during a maintenance event, you can monitor the maintenance schedule for your instance, and prepare your workloads to transition through the system restart.
To receive advance notice of host maintenance events, monitor the /computeMetadata/v1/instance/maintenance-event metadata value.
If the request to the metadata server returns NONE, the instance isn't scheduled to stop. For example, run the following command from within an instance:
curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"
NONE
If the metadata server returns TERMINATE_ON_HOST_MAINTENANCE, then your instance is scheduled to stop. Compute Engine gives GPU instances a 1-hour stopping notice, while normal instances receive only a 60-second notice. Configure your application to transition through the maintenance event. For example, you might use one of the following techniques:
- Configure your application to temporarily move work in progress to a Cloud Storage bucket, then retrieve that data after the instance restarts.
- Write data to a secondary persistent disk. When the instance automatically restarts, the persistent disk can be reattached and your application can resume work.
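The first technique might be sketched as follows: poll the metadata value, and when it reports TERMINATE_ON_HOST_MAINTENANCE, copy work in progress to a Cloud Storage bucket. This is an illustration under stated assumptions, not a complete implementation. The function names, the `checkpoints/` prefix, and the 60-second polling interval are hypothetical, and the script assumes that your application writes its in-progress state to a local directory.

```shell
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Fetch the current maintenance-event value from the metadata server.
get_maintenance_event() {
  curl -s "$METADATA_URL" -H "Metadata-Flavor: Google"
}

# Map a metadata value to an action name (the action names are hypothetical).
maintenance_action() {
  case "$1" in
    NONE) echo "continue" ;;
    TERMINATE_ON_HOST_MAINTENANCE) echo "checkpoint" ;;
    *) echo "unknown" ;;
  esac
}

# Copy work in progress to a Cloud Storage bucket.
# Arguments: local checkpoint directory, bucket name (both assumed).
save_checkpoint() {
  local checkpoint_dir="$1"
  local bucket="$2"
  gcloud storage cp --recursive "$checkpoint_dir" "gs://${bucket}/checkpoints/"
}

# Poll until the instance is scheduled to stop, then save a checkpoint.
# Arguments: checkpoint directory, bucket name, polling interval in seconds.
watch_for_maintenance() {
  local checkpoint_dir="$1"
  local bucket="$2"
  local interval="${3:-60}"
  local event
  while true; do
    event="$(get_maintenance_event)"
    if [ "$(maintenance_action "$event")" = "checkpoint" ]; then
      save_checkpoint "$checkpoint_dir" "$bucket"
      return 0
    fi
    sleep "$interval"
  done
}
```

For example, you might run `watch_for_maintenance /var/tmp/job-state my-checkpoint-bucket` in the background alongside your workload, where both the directory and the bucket name are placeholders. After the instance restarts, your application would copy the data back from the bucket before resuming.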
You can also receive notification of changes in this metadata value without polling. For examples of how to receive advance notice of host maintenance events without polling, see Getting live migration notices.
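As one possible sketch of waiting without polling, the metadata server accepts a `wait_for_change=true` query parameter, which keeps the request open until the value changes and then returns the new value. The function names below are hypothetical.

```shell
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Build the URL for a request that blocks until the metadata value changes.
maintenance_wait_url() {
  printf '%s?wait_for_change=true\n' "$METADATA_URL"
}

# Block until the maintenance-event value changes, then print the new value.
wait_for_maintenance_change() {
  curl -s "$(maintenance_wait_url)" -H "Metadata-Flavor: Google"
}
```

Calling `wait_for_maintenance_change` from within an instance returns only when the value changes, for example from NONE to TERMINATE_ON_HOST_MAINTENANCE, so a script can react immediately instead of sleeping between requests.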
What's next?
- Learn more about GPUs on Compute Engine.
- To learn more about managing and scaling groups of instances, see Manually resizing a managed instance group.
- To monitor GPU performance, see Monitoring GPU performance.
- To optimize GPU performance, see Optimizing GPU performance.