You can choose how your virtual machine (VM) instances respond during or after a host event by setting the host maintenance policy. A host event can include the regular maintenance of Compute Engine infrastructure, or a host error on a VM. By default, VMs are set to live migrate during host system events, but you can set them to terminate and optionally restart.
Host events
The following host events lead to either the live migration or termination of your VM depending on the host maintenance policy that you set:
Maintenance events
A maintenance event occurs when Compute Engine stops a VM to perform a hardware or software update. If you enable the live migration host maintenance policy, Compute Engine instead moves the VM to a new host, and there is no disruption to your application.
VM behavior during a maintenance event can vary depending on the tenancy of the VM. The following table shows some differences between the behavior of multi-tenant and sole-tenant VMs during maintenance events.
Host tenancy | Approximate frequency* | Live migration to new host | Host selection |
---|---|---|---|
Multi-tenant | Every 2 weeks | Yes | Compute Engine |
Sole-tenant | Every 4 to 6 weeks | Depends on the host maintenance policy | Depends on the host maintenance policy |
Compute Engine also applies some lightweight hypervisor and network upgrades in the background nondisruptively.
Host errors
A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. A host error which involves total hardware failure or other hardware issues might prevent live migration of your VM. If your VM is set to automatically restart, which is the default setting, Google restarts your VM, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes.
Certain resources behave differently, such as local SSDs. If there is a host error, Compute Engine makes a best effort to reconnect to the VM and preserve the local SSD data, but if the underlying drive does not recover within 60 minutes, the VM restarts without the local SSD data. While Compute Engine is recovering your VM and local SSD, which can take up to 60 minutes, the host system and the underlying drive are unresponsive. For more information about how local SSD disks behave in the event of host errors, see Local SSD data persistence.
If hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically, and a host error is logged.
Occasionally, a VM might become unresponsive before a host error is detected. You can reduce the time Compute Engine waits to restart or terminate the VM by using the --host-error-timeout-seconds flag (Preview). This flag sets the maximum amount of time Compute Engine waits to restart or terminate a VM after detecting that the VM is unresponsive. For more information, see Set availability policies.
Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:
Google also offers managed services such as App Engine and the App Engine flexible environment.
Host maintenance policy
A VM's host maintenance policy determines how it behaves during the following events:
- When there is a maintenance event where Google must move a VM to another host machine
- When there is a host error where Google must terminate or restart a VM
You can configure VMs to continue running during host maintenance while Compute Engine live migrates them to another host, or you can choose to stop your VMs instead. You can update a VM's host maintenance policy at any time to control how your VMs behave.
You can change a VM's host maintenance policy by configuring the following settings:
- Maintenance behavior: whether the VM is live migrated or stopped when there is a maintenance event.
- Restart behavior: whether Compute Engine restarts or terminates the VM if the VM crashes or experiences a host error.
- Host error detection time: the maximum amount of time Compute Engine waits to restart or terminate a VM after detecting that the VM is unresponsive.
The default maintenance behavior for VMs is to live migrate, but you can change the behavior to stop your VM during maintenance events instead.
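The three settings above can be pictured as one small configuration object. The following is a minimal sketch, not an official client: the helper name `build_scheduling` is hypothetical, the `automaticRestart` field is the one named on this page, `onHostMaintenance` with the values `MIGRATE` and `TERMINATE` follows the Compute Engine API's `scheduling` object, and `hostErrorTimeoutSeconds` is assumed to be the API counterpart of the --host-error-timeout-seconds flag.

```python
from typing import Optional


def build_scheduling(live_migrate: bool = True,
                     automatic_restart: bool = True,
                     host_error_timeout_seconds: Optional[int] = None) -> dict:
    """Return a scheduling fragment covering the three policy settings."""
    scheduling = {
        # Maintenance behavior: live migrate (the default) or stop the VM.
        "onHostMaintenance": "MIGRATE" if live_migrate else "TERMINATE",
        # Restart behavior after a crash or host error.
        "automaticRestart": automatic_restart,
    }
    if host_error_timeout_seconds is not None:
        if host_error_timeout_seconds <= 0:
            raise ValueError("host_error_timeout_seconds must be positive")
        # Assumed field name, mirroring the --host-error-timeout-seconds flag.
        scheduling["hostErrorTimeoutSeconds"] = host_error_timeout_seconds
    return scheduling


# Defaults match the page: live migrate, with automatic restart enabled.
default_policy = build_scheduling()

# A VM that stops during maintenance and is not restarted automatically.
stop_policy = build_scheduling(live_migrate=False, automatic_restart=False)
```

The host error detection time is left out of the fragment unless you pass it, so an unset value keeps whatever default Compute Engine applies.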
Live migrate
By default, standard VMs are set to live migrate, where Compute Engine automatically migrates your VM away from an infrastructure maintenance event, and your VM remains running during the migration. Your VM might experience a short period of decreased performance, but in general, most VMs should not perform noticeably differently. This is ideal for VMs that require constant uptime and can tolerate a short period of decreased performance.
When Compute Engine migrates your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:
compute.instances.migrateOnHostMaintenance
Stop and (optionally) restart
If you do not want your VM to live migrate, you can choose to stop and optionally restart your VM. For VMs set to stop and optionally restart, Compute Engine sends a soft power-off signal to shut down the VM and waits 60 seconds for the VM to shut down cleanly. It then terminates the VM, even if the shutdown has not finished, and restarts it away from the maintenance event.
This option is ideal if your VMs demand constant, maximum performance, and if your overall application is built to handle VM failures or reboots.
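An application that is built to handle reboots can use the 60-second shutdown window to save its state. The following is a minimal sketch under stated assumptions: it assumes the guest operating system turns the soft power-off into a SIGTERM delivered to your process (common for services managed by an init system, but not something this page guarantees), and the class name `ShutdownBudget` and the 10-second safety margin are hypothetical.

```python
import signal
import time


class ShutdownBudget:
    """Tracks how much of the shutdown window remains after SIGTERM."""

    def __init__(self, window_seconds=60.0, margin_seconds=10.0,
                 clock=time.monotonic):
        self._window = window_seconds    # 60 seconds, per the text above
        self._margin = margin_seconds    # hypothetical safety margin
        self._clock = clock
        self._requested_at = None

    def handle_signal(self, signum, frame):
        # Record only the first signal so repeats do not reset the budget.
        if self._requested_at is None:
            self._requested_at = self._clock()

    @property
    def requested(self):
        return self._requested_at is not None

    def seconds_left(self):
        """Cleanup time left in seconds, or None if no shutdown was requested."""
        if self._requested_at is None:
            return None
        elapsed = self._clock() - self._requested_at
        return max(0.0, self._window - self._margin - elapsed)


def install(budget):
    """Register the budget as the SIGTERM handler (call from the main thread)."""
    signal.signal(signal.SIGTERM, budget.handle_signal)
```

A worker loop could call `install(budget)` at startup, check `budget.requested` between units of work, and stop flushing state once `budget.seconds_left()` reaches zero.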
When Compute Engine stops and reboots your VMs, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Stopped events have the following operation type:
compute.instances.terminateOnHostMaintenance
When your VM reboots, it uses the same persistent boot disk and reattaches any secondary persistent disks that you configured. The data on those disks persists through VM migration and restart.
Local SSD data does not persist when a VM is stopped. When the VM restarts, it creates a new Local SSD that you must format and mount.
Automatic restart
If your VM is set to stop when there is a maintenance event, or your VM crashes because of an underlying hardware issue, you can set Compute Engine to automatically restart the VM by setting the automaticRestart field to true. This setting does not apply if the VM is taken offline through a user action, such as calling sudo shutdown, or during a zone outage.
When Compute Engine automatically restarts your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:
compute.instances.automaticRestart
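When you review zone operations, it can help to pick out only the host event entries. The following is a minimal sketch that assumes you have already fetched the operations as a list of dictionaries; the `operationType` and `targetLink` keys are assumed to match the operation resources returned by the API, the operation type strings are the ones named on this page, and the helper name and sample entries are hypothetical.

```python
# Operation types named on this page, mapped to short labels.
HOST_EVENT_OPERATION_TYPES = {
    "compute.instances.migrateOnHostMaintenance": "live migration",
    "compute.instances.terminateOnHostMaintenance": "stopped for maintenance",
    "compute.instances.automaticRestart": "automatic restart",
    "compute.instances.hostError": "host error",
}


def summarize_host_events(operations):
    """Return (label, target) pairs for the operations that are host events."""
    events = []
    for op in operations:
        label = HOST_EVENT_OPERATION_TYPES.get(op.get("operationType"))
        if label is not None:
            events.append((label, op.get("targetLink", "")))
    return events


# Hypothetical sample data shaped like zone operation entries.
sample_operations = [
    {"operationType": "compute.instances.migrateOnHostMaintenance",
     "targetLink": "zones/example-zone/instances/vm-a"},
    {"operationType": "insert",
     "targetLink": "zones/example-zone/instances/vm-b"},
    {"operationType": "compute.instances.automaticRestart",
     "targetLink": "zones/example-zone/instances/vm-c"},
]

host_events = summarize_host_events(sample_operations)
```

Entries with any other operation type, such as the user-initiated insert in the sample, are skipped.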
What's next
- Learn more about live migration.
- Learn more about setting VM host maintenance policy.
- Learn more about getting live migration notices.
- Learn more about simulating host maintenance.
- Learn more about handling GPU host maintenance events.
- Learn more about manually live migrating sole-tenant VMs.