You can choose how your virtual machine (VM) instances respond during or after a host event by setting the host maintenance policy. A host event can be regular maintenance of Compute Engine infrastructure or a host error on a VM. By default, VMs are set to live migrate during host system events, but you can set them to terminate and optionally restart.
The following host events lead to either the live migration or termination of your VM depending on the host maintenance policy that you set:
A maintenance event is when Compute Engine stops a VM to perform a hardware or software update. If you enable the live migration host maintenance policy, Compute Engine moves the VM to a new host, and there is no disruption to your application.
VM behavior during a maintenance event can vary depending on the tenancy of the VM. The following table shows some differences between the behavior of multi-tenant and sole-tenant VMs during maintenance events.
| Host tenancy | Approximate frequency* | Live migration to new host | Host selection |
|---|---|---|---|
| Multi-tenant | Every 2 weeks | Yes | Compute Engine |
| Sole-tenant | Every 4 to 6 weeks | Depends on the host maintenance policy | Depends on the host maintenance policy |
Compute Engine also applies some lightweight hypervisor and network upgrades in the background nondisruptively.
A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. A host error which involves total hardware failure or other hardware issues might prevent live migration of your VM.
If your VM is set to automatically restart, which is the default setting, Google restarts your VM, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes.
VMs with Local SSD disks
If a host error occurs on a VM that has one or more Local SSD disks attached, Compute Engine makes a best effort to reconnect to the VM and preserve the Local SSD data. While Compute Engine is recovering your VM and Local SSD disk, the host system and the underlying disk are unresponsive.
You can specify how much time Compute Engine spends trying to recover Local SSD data by setting the Local SSD recovery timeout.
For more information about how Local SSD disks behave when a host error occurs, see Local SSD data persistence.
Occasionally, a VM might become unresponsive before a host error is detected. You can reduce the time Compute Engine waits to restart or terminate the VM by setting the host error recovery timeout (Preview). For more information, see Set availability policies.
Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:
Host maintenance policy
A VM's host maintenance policy determines how it behaves during the following events:
- When there is a maintenance event where Google must move a VM to another host machine
- When there is a host error where Google must terminate or restart a VM
You can configure VMs to continue running during host maintenance while Compute Engine live migrates them to another host, or you can choose to stop your VMs instead. You can update a VM's host maintenance policy at any time to control how you want your VMs to behave.
You can change a VM's host maintenance policy by configuring the following settings:
- Maintenance behavior: whether the VM is live migrated or stopped when there is a maintenance event.
- Restart behavior: whether Compute Engine restarts or terminates the VM if the VM crashes or experiences a host error.
- Host error detection time: the maximum amount of time Compute Engine waits to restart or terminate a VM after detecting that the VM is unresponsive.
- Local SSD recovery time: the maximum amount of time Compute Engine spends recovering the data on Local SSD disks after detecting a host error. The Local SSD data is lost if the specified time elapses without a successful recovery.
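The four settings above can be pictured as one small configuration object. The following Python sketch is illustrative only: the class and attribute names are hypothetical, and the output keys (`onHostMaintenance`, `localSsdRecoveryTimeout`, `hostErrorTimeoutSeconds`) are assumptions about how a VM's scheduling settings are shaped. Only `automaticRestart` is named on this page. The defaults and the 0 to 168 hour range follow the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostMaintenancePolicy:
    """Hypothetical container for the four settings listed above."""

    live_migrate: bool = True        # maintenance behavior; live migration is the default
    automatic_restart: bool = True   # restart behavior; restarting is the default
    host_error_timeout_seconds: Optional[int] = None  # None leaves the platform default
    local_ssd_recovery_timeout_hours: int = 1          # the text gives 1 hour as the default

    def to_scheduling(self) -> dict:
        """Build a dict shaped like a VM's scheduling settings (key names assumed)."""
        if not 0 <= self.local_ssd_recovery_timeout_hours <= 168:
            raise ValueError("Local SSD recovery timeout must be 0 to 168 hours")
        scheduling = {
            "onHostMaintenance": "MIGRATE" if self.live_migrate else "TERMINATE",
            "automaticRestart": self.automatic_restart,
            "localSsdRecoveryTimeout": {
                "seconds": self.local_ssd_recovery_timeout_hours * 3600
            },
        }
        if self.host_error_timeout_seconds is not None:
            scheduling["hostErrorTimeoutSeconds"] = self.host_error_timeout_seconds
        return scheduling
```

For example, `HostMaintenancePolicy(live_migrate=False).to_scheduling()` describes a VM that is stopped for maintenance and then restarted, which is the alternative to live migration described in this section.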
By default, standard VMs are set to live migrate, where Compute Engine automatically migrates your VM away from an infrastructure maintenance event, and your VM remains running during the migration. Your VM might experience a short period of decreased performance, but in general, most VMs should not perform noticeably differently. This is ideal for VMs that require constant uptime and can tolerate a short period of decreased performance.
When Compute Engine migrates your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:
Stop and (optionally) restart
If you do not want your VM to live migrate, you can choose to stop and optionally restart your VM. For VMs set to stop and optionally restart, Compute Engine sends a soft power-off signal to shut down the VM and waits up to 60 seconds for the VM to shut down cleanly. If the VM doesn't shut down cleanly within 60 seconds, Compute Engine terminates it. Compute Engine then restarts the VM away from the maintenance event.
This option is ideal if your VMs demand constant, maximum performance, and if your overall application is built to handle VM failures or reboots.
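The stop sequence described above (soft power-off, a 60-second grace period, then termination and an optional restart) can be modeled as a short function. This is a minimal sketch for reasoning about the order of events, not Compute Engine's implementation. The function name, its parameters, and the step strings are hypothetical.

```python
from typing import List, Optional

SHUTDOWN_GRACE_SECONDS = 60  # grace period stated in the text


def stop_for_maintenance(
    shutdown_seconds: Optional[float], automatic_restart: bool = True
) -> List[str]:
    """Return the ordered steps for a VM set to stop during maintenance.

    `shutdown_seconds` is how long the guest takes to shut down after the
    soft power-off signal, or None if it never finishes on its own.
    """
    steps = ["send soft power-off signal"]
    clean = (
        shutdown_seconds is not None
        and shutdown_seconds <= SHUTDOWN_GRACE_SECONDS
    )
    if clean:
        steps.append("guest shut down cleanly")
    else:
        steps.append("grace period elapsed, VM terminated")
    if automatic_restart:
        steps.append("restart away from the maintenance event")
    return steps
```

A guest that needs longer than 60 seconds to flush state is cut off in this model, which is why applications on such VMs should be built to handle abrupt termination.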
When Compute Engine stops and reboots your VMs, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Stopped events have the following operation type:
When your VM reboots, it uses the same persistent boot disk and reattaches any secondary persistent disks that you configured. The data on those disks persists through VM migration and restart.
Local SSD data does not persist when a VM is stopped due to a maintenance event. When the VM restarts, it creates a new Local SSD that you must format and mount.
If your VM is set to stop when there is a maintenance event, or your VM crashes because of an underlying hardware issue, you can set Compute Engine to automatically restart the VM by setting the automaticRestart field to true. This setting does not apply if the VM is taken offline through a user action, such as calling sudo shutdown, or during a zone outage.
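The rule above reduces to a small predicate: automatic restart applies only when the setting is enabled and the VM went offline because of a maintenance stop or an underlying hardware issue. The sketch below encodes that rule. The enum and function names are hypothetical and exist only for illustration.

```python
from enum import Enum


class OfflineCause(Enum):
    """Reasons a VM can go offline, as listed in the text."""

    MAINTENANCE_STOP = "stopped for a maintenance event"
    HOST_ERROR = "crashed because of an underlying hardware issue"
    USER_ACTION = "taken offline by a user action, such as sudo shutdown"
    ZONE_OUTAGE = "zone outage"


# Causes for which the automaticRestart setting applies.
_RESTARTABLE = {OfflineCause.MAINTENANCE_STOP, OfflineCause.HOST_ERROR}


def restarts_automatically(automatic_restart: bool, cause: OfflineCause) -> bool:
    """Return True if the VM is expected to be restarted automatically."""
    return automatic_restart and cause in _RESTARTABLE
```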
When Compute Engine automatically restarts your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:
Local SSD recovery timeout
When a host error occurs, Compute Engine tries to recover any Local SSD disks attached to the VM. You can control how much time Compute Engine spends trying to recover the data with the Local SSD recovery timeout. By default, Compute Engine spends 1 hour recovering the data. Valid values range from 0 to 168 hours, in increments of 1 hour.
If the timeout expires and the data still can't be recovered, Compute Engine restarts the VM without the Local SSD disk. Compute Engine attaches a new, blank Local SSD disk to the restarted VM.
If the timeout is 1 hour or more, the VM is in REPAIRING state while Compute Engine recovers any attached Local SSD disks. The VM and Local SSD disks are unresponsive during recovery.
If the timeout is 0, Compute Engine will not attempt to recover the Local SSD disks and the data is unrecoverable. You can set the recovery timeout to 0 if resuming the workload is more important than recovering the Local SSD data.
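The timeout rules above can be summarized in one function: validate that the timeout is a whole number of hours from 0 to 168, skip recovery entirely at 0, and otherwise keep the data only if recovery finishes before the timeout expires. This is a sketch under stated assumptions. The function name, the `hours_needed` parameter, and the returned keys are hypothetical, and how long a real recovery takes cannot be known in advance.

```python
from typing import Optional

MAX_TIMEOUT_HOURS = 168  # upper bound stated in the text


def local_ssd_recovery_outcome(
    timeout_hours: int, hours_needed: Optional[float]
) -> dict:
    """Model what happens to Local SSD data after a host error.

    `hours_needed` is how long recovery would take, or None if the data
    cannot be recovered at all.
    """
    if not isinstance(timeout_hours, int) or not 0 <= timeout_hours <= MAX_TIMEOUT_HOURS:
        raise ValueError("timeout must be a whole number of hours from 0 to 168")
    if timeout_hours == 0:
        # No recovery attempt: the VM restarts with a new, blank Local SSD disk.
        return {"repairing_state": False, "data_preserved": False}
    recovered = hours_needed is not None and hours_needed <= timeout_hours
    # With a timeout of 1 hour or more, the VM is in REPAIRING state during recovery.
    return {"repairing_state": True, "data_preserved": recovered}
```

Choosing the value is a trade-off: a longer timeout gives recovery more time to succeed but keeps the VM unresponsive for longer, while 0 resumes the workload soonest at the cost of the data.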
Stop Local SSD disk recovery
You can interrupt the recovery process before the Local SSD recovery timeout expires.
To do so, use the
gcloud compute instances stop command with
This will stop the recovery process, stop the VM, and discard the Local SSD data. You can restart the VM afterward. See Stop a VM with Local SSD for more information.
To set the Local SSD recovery timeout, see Set VM host maintenance policy.
- Learn more about live migration.
- Learn more about setting VM host maintenance policy.
- Learn more about getting live migration notices.
- Learn more about simulating host maintenance.
- Learn more about handling GPU host maintenance events.
- Learn more about manually live migrating sole-tenant VMs.