About host events


You can choose how your virtual machine (VM) instances respond during or after a host event by setting the host maintenance policy during VM creation. A host event can be the regular maintenance of Compute Engine infrastructure or a host error on a VM. By default, VMs are set to live migrate during host system events, but you can set them to terminate and optionally restart. Z3 VMs are the exception to live migration, because they restart in place by default.

The following host events lead to either the live migration or termination of your VM depending on the host maintenance policy that you set:

Maintenance events

A maintenance event occurs when Compute Engine needs to stop a VM to perform a hardware or software update. If you enable the live migration host maintenance policy, Compute Engine instead moves the VM to a new host, and your application isn't disrupted.

VM behavior during a maintenance event can vary depending on the tenancy of the VM. The following table shows some differences between the behavior of multi-tenant and sole-tenant VMs during maintenance events.

Host tenancy | Approximate frequency* | Live migration to new host             | Host selection
Multi-tenant | Every 2 weeks          | Yes                                    | Compute Engine
Sole-tenant  | Every 4 to 6 weeks     | Depends on the host maintenance policy | Depends on the host maintenance policy

*These frequencies are approximations; Compute Engine might occasionally perform maintenance more frequently.

Compute Engine also applies some lightweight hypervisor and network upgrades in the background nondisruptively.

Host maintenance policy

A VM's host maintenance policy determines how it behaves during the following events:

  • When there is a maintenance event where Google must move a VM to another host machine
  • When there is a host error where Google must terminate or restart a VM

You can configure VMs to continue running during host maintenance while Compute Engine live migrates them to another host, or you can choose to stop your VMs instead. You can update a VM's host maintenance policy at any time to control how you want your VMs to behave.

You can change a VM's host maintenance policy by configuring the following settings:

  • Maintenance behavior: whether the VM is live migrated or stopped when there is a maintenance event.
  • Restart behavior: whether Compute Engine restarts or terminates the VM if the VM crashes or experiences a host error.
  • Host error detection time: the maximum amount of time Compute Engine waits to restart or terminate a VM after detecting that the VM is unresponsive.
  • Local SSD recovery time: the maximum amount of time Compute Engine spends recovering the data on Local SSD disks after detecting a host error. The Local SSD data is lost if the specified time elapses without a successful recovery.
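
As a rough illustration, the four settings above can be collected into one scheduling structure. The sketch below is not an official client. The helper name build_scheduling is hypothetical. The field name automaticRestart appears later on this page, while onHostMaintenance, hostErrorTimeoutSeconds, and localSsdRecoveryTimeout are assumed to match the scheduling fields of the Compute Engine API, so verify them against the API reference before relying on them.

```python
def build_scheduling(live_migrate=True, automatic_restart=True,
                     host_error_timeout_seconds=None,
                     local_ssd_recovery_timeout_hours=None):
    """Return a dict shaped like a VM's scheduling settings (assumed field names)."""
    scheduling = {
        # Maintenance behavior: live migrate or stop the VM.
        "onHostMaintenance": "MIGRATE" if live_migrate else "TERMINATE",
        # Restart behavior after a crash or host error.
        "automaticRestart": automatic_restart,
    }
    # Host error detection time; left out so the service default applies.
    if host_error_timeout_seconds is not None:
        scheduling["hostErrorTimeoutSeconds"] = host_error_timeout_seconds
    # Local SSD recovery time, expressed here as a duration in seconds.
    if local_ssd_recovery_timeout_hours is not None:
        scheduling["localSsdRecoveryTimeout"] = {
            "seconds": local_ssd_recovery_timeout_hours * 3600
        }
    return scheduling


# A VM that stops for maintenance, restarts automatically, and allows
# two hours for Local SSD recovery.
print(build_scheduling(live_migrate=False, local_ssd_recovery_timeout_hours=2))
```

Optional settings are left out of the dictionary when they are not specified, so that whatever defaults the service applies stay in effect.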

Maintenance scheduling

Google Cloud provides features that give you tighter control over maintenance. With certain VM families, you can specify maintenance preferences and receive multi-day advance notifications through Cloud Logging. After you receive a notification, you can trigger the maintenance at any time before the scheduled event.

You can use these features in combination with your host maintenance policy to customize a schedule that fits your workload.

Live migrate

By default, all VMs except Z3 VMs are set to live migrate, where Compute Engine automatically migrates your VM away from an infrastructure maintenance event, and your VM remains running during the migration. Your VM might experience a short period of decreased performance, but in general, most VMs shouldn't perform noticeably differently. This option is ideal for VMs that require constant uptime and can tolerate a short period of decreased performance.

When Compute Engine migrates your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:

    compute.instances.migrateOnHostMaintenance

Stop and (optionally) restart

If you don't want your VM to live migrate, you can choose to stop and optionally restart your VM. For VMs set to stop and optionally restart, Compute Engine sends a soft power-off signal to the VM and waits up to 60 seconds for it to shut down cleanly. If the VM hasn't shut down cleanly within 60 seconds, Compute Engine terminates it. Compute Engine then restarts the VM away from the maintenance event, if the VM is set to restart.

This option is ideal if your VMs demand constant, maximum performance, and if your overall application is built to handle VM failures or reboots.

When Compute Engine stops and reboots your VMs, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Stopped events have the following operation type:

    compute.instances.terminateOnHostMaintenance

When your VM reboots, it uses the same persistent boot disk and reattaches any secondary persistent disks that you configured. The data on those disks persists through VM migration and restart.

Local SSD data does not persist when a VM is stopped due to a maintenance event. When the VM restarts, it gets a new Local SSD disk that you must format and mount.

Local SSD data does persist on storage-optimized Z3 VMs (Preview). When there is a maintenance event, the Z3 VM restarts in place at the end of routine maintenance instead of migrating to a new host. Google Cloud makes a best effort to keep your Local SSD data intact. However, in some cases, such as a timeout, the data can't be recovered.

Automatic restart

If your VM is set to stop when there is a maintenance event, or your VM crashes because of an underlying hardware issue, you can set Compute Engine to automatically restart the VM by setting the automaticRestart field to true. This setting does not apply if the VM is taken offline through a user action, such as calling sudo shutdown, or during a zone outage.
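
The rule in the preceding paragraph can be written as a small decision function. This is only a sketch of the logic as described on this page. The cause labels (maintenance_stop, host_error, user_action, zone_outage) and the function name are hypothetical and are not Compute Engine identifiers.

```python
# Causes for which the automaticRestart setting applies, per the text above.
RESTART_ELIGIBLE_CAUSES = {"maintenance_stop", "host_error"}


def will_automatically_restart(automatic_restart, cause):
    """Return True if the VM would be restarted automatically.

    automatic_restart: value of the VM's automaticRestart field.
    cause: hypothetical label for why the VM went offline.
    """
    return bool(automatic_restart) and cause in RESTART_ELIGIBLE_CAUSES


for cause in ("maintenance_stop", "host_error", "user_action", "zone_outage"):
    print(cause, will_automatically_restart(True, cause))
```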

When Compute Engine automatically restarts your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:

    compute.instances.automaticRestart
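
To show how these operation types might be used when reviewing zone operations, the sketch below picks host system events out of a list of operation records. It assumes each record is a dictionary with operationType and targetLink keys, as in the operation resources that the Compute Engine API returns. The sample records and the helper name host_events are made up for illustration.

```python
# Operation types named on this page, mapped to short labels.
HOST_EVENT_TYPES = {
    "compute.instances.migrateOnHostMaintenance": "live migration",
    "compute.instances.terminateOnHostMaintenance": "stop for maintenance",
    "compute.instances.automaticRestart": "automatic restart",
}


def host_events(operations):
    """Return (target, label) pairs for operations that are host system events."""
    events = []
    for op in operations:
        label = HOST_EVENT_TYPES.get(op.get("operationType"))
        if label is not None:
            events.append((op.get("targetLink", ""), label))
    return events


# Made-up sample records; real ones would come from listing zone operations.
sample = [
    {"operationType": "compute.instances.migrateOnHostMaintenance",
     "targetLink": "instances/example-vm-1"},
    {"operationType": "insert", "targetLink": "instances/example-vm-2"},
    {"operationType": "compute.instances.automaticRestart",
     "targetLink": "instances/example-vm-3"},
]

for target, label in host_events(sample):
    print(target, label)
```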

Host errors

A host error (compute.instances.hostError) means that a hardware or software issue on the physical machine hosting your VM caused your VM to crash. A host error that involves total hardware failure or other hardware issues might prevent live migration of your VM. If your VM is set to automatically restart, which is the default setting, Google restarts your VM, typically within three minutes of detecting the error. Depending on the issue, the restart might take up to 5.5 minutes.

VMs with Local SSD disks

If a host error occurs on a VM that has one or more Local SSD disks attached, Compute Engine makes a best effort to reconnect to the VM and preserve the Local SSD data. While Compute Engine is recovering your VM and Local SSD disk, the host system and the underlying disk are unresponsive.

You can specify how much time Compute Engine spends trying to recover Local SSD data by setting the Local SSD recovery timeout.

For more information about how Local SSD disks behave when a host error occurs, see Local SSD data persistence.

Unresponsive VMs

Occasionally, a VM might become unresponsive before a host error is detected. You can reduce the time Compute Engine waits to restart or terminate the VM by setting the host error recovery timeout (Preview). For more information, see Set availability policies.

Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:

Google also offers managed services such as App Engine and the App Engine flexible environment.

Local SSD recovery timeout

When a host error occurs, Compute Engine tries to recover any Local SSD disks attached to the VM. You can control how much time Compute Engine spends trying to recover the data with the Local SSD recovery timeout. By default, Compute Engine spends 1 hour recovering the data. Valid values range from 0 to 168 hours, in 1-hour increments. The exception is Z3, which has a default recovery time of up to 6 hours.

If the timeout expires and the data still can't be recovered, Compute Engine restarts the VM without the Local SSD disk. Compute Engine attaches a new, blank Local SSD disk to the restarted VM.

If the timeout is 1 hour or more, the VM is in a REPAIRING state while Compute Engine recovers any attached Local SSD disks. The VM and Local SSD disks are unresponsive during recovery.

If the timeout is 0, Compute Engine won't attempt to recover the Local SSD disks and the data is unrecoverable. You can set the recovery timeout to 0 if resuming the workload is more important than recovering the Local SSD data.
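
The timeout rules in this section can be summarized in a short validation sketch. It is an illustration of the rules as stated here, not Compute Engine code. The function name and the returned keys are hypothetical, and the Z3 default is taken as 6 hours, which is the upper bound given above.

```python
def plan_local_ssd_recovery(timeout_hours=None, is_z3=False):
    """Describe recovery behavior for a given Local SSD recovery timeout."""
    if timeout_hours is None:
        # Defaults from the text: 1 hour, or up to 6 hours on Z3.
        timeout_hours = 6 if is_z3 else 1
    # Only whole hours are valid; bool is rejected because it subclasses int.
    if isinstance(timeout_hours, bool) or not isinstance(timeout_hours, int):
        raise ValueError("timeout must be a whole number of hours")
    if not 0 <= timeout_hours <= 168:
        raise ValueError("timeout must be between 0 and 168 hours")
    if timeout_hours == 0:
        # No recovery attempt; the Local SSD data is unrecoverable.
        return {"timeout_hours": 0, "attempt_recovery": False,
                "vm_state_during_recovery": None}
    # With 1 hour or more, the VM is in the REPAIRING state during recovery.
    return {"timeout_hours": timeout_hours, "attempt_recovery": True,
            "vm_state_during_recovery": "REPAIRING"}


print(plan_local_ssd_recovery())
print(plan_local_ssd_recovery(0))
```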

Stop Local SSD disk recovery

You can interrupt the recovery process before the Local SSD recovery timeout expires. To do so, use the gcloud compute instances stop command with the --discard-local-ssd=True flag.

This stops the recovery process, stops the VM, and discards the Local SSD data. You can restart the VM afterward. For more information, see Stop a VM with Local SSD.
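
For scripted use, the command described above could be assembled as follows. This sketch only builds and prints the command line and does not run it. The instance name example-vm and the zone us-central1-a are placeholders, and including the --zone flag is an assumption about how you would target the VM.

```python
import shlex


def stop_and_discard_command(instance, zone):
    """Build the gcloud command line that stops a VM and discards Local SSD data."""
    argv = [
        "gcloud", "compute", "instances", "stop", instance,
        f"--zone={zone}",
        # Flag named in the text above; the Local SSD data is discarded.
        "--discard-local-ssd=True",
    ]
    # shlex.join quotes arguments so the line is safe to paste into a shell.
    return shlex.join(argv)


print(stop_and_discard_command("example-vm", "us-central1-a"))
```

Because this command discards the Local SSD data, review the printed line before you run it.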

To set the Local SSD recovery timeout, see Set VM host maintenance policy.
