About host events


During the lifespan of a virtual machine (VM) instance or bare metal instance, the host machine that your instance runs on can experience multiple host events. A host event can be regular maintenance of Compute Engine infrastructure or, in rare cases, a host error. You can choose how your VM and bare metal instances respond during or after a host event by configuring the host maintenance policy.

By default, most instances are set to live migrate during host events. You can override this behavior and explicitly set the instances to terminate and optionally restart. Some machine types don't support live migration, such as bare metal instances or VMs with attached GPUs. These instances terminate during host events. For more information, see Maintenance and restart behaviors.

Types of host events

There are two types of host events, which are described in more detail in the following sections:

Maintenance events
Host errors

If your instance becomes unresponsive, this can also trigger a restart or termination of the instance.

Maintenance events

A maintenance event occurs when Compute Engine has to perform a maintenance or repair activity that requires VMs to be moved off the host server. If you enable the live migration host maintenance policy for a supported instance type, Compute Engine moves the instance to a new host, and there is minimal disruption to your application.

Instance behavior during a maintenance event can vary depending on the tenancy of the instance as well as the machine type. The following table summarizes the behavior for planned maintenance events.

Host tenancy	Approximate maintenance event frequency	Live migration supported	Host selection
Multi-tenant (shared)	Every 2 weeks	Yes	Compute Engine
Sole-tenant	Every 4 to 6 weeks	Depends on the host maintenance policy	Depends on the host maintenance policy
X4	Minimum of 90 days	No	Compute Engine
C3	Minimum of 30 days	No	Compute Engine

Compute Engine also applies some lightweight hypervisor and network upgrades in the background nondisruptively by retaining the instance on the same host.

Host errors

A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine or the data center infrastructure hosting your compute instance that caused your instance to crash. A host error involving a total hardware failure or other hardware issues might prevent the live migration of your instance. If your instance is set to automatically restart, which is the default setting, Compute Engine restarts your instance, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes.

Occasionally, a compute instance might become unresponsive before a host error is signaled. You can reduce the amount of time Compute Engine waits to restart or terminate the instance by setting the host error recovery timeout (Preview). For more information, see Set availability policies.

Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:

Google also offers managed services such as App Engine and the App Engine flexible environment.

Host maintenance policy overview

An instance's host maintenance policy determines how it behaves during the following host events:

Maintenance event
Host error event or instance not responding

You can configure instances to continue running during host maintenance while Compute Engine live migrates them to another host, or you can choose to stop your instances instead.

You can change an instance's host maintenance policy by configuring the following settings:

Maintenance behavior: whether the instance is live migrated or stopped when there is a maintenance event.
Restart behavior: whether Compute Engine restarts or terminates the instance if the instance crashes, experiences a host error, or becomes unresponsive.
Host error detection time: the maximum amount of time that Compute Engine waits to restart or terminate an instance after detecting that the instance is unresponsive.
Local SSD recovery time: the maximum amount of time that Compute Engine spends recovering the data on Local SSD disks after detecting a host error. The Local SSD data is lost if the specified time elapses without a successful recovery.

You can update an instance's host maintenance policy at any time to control how you want your instances to behave.
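As a rough illustration, the four settings listed above can be pictured as one small policy object. The sketch below is plain Python, not a Compute Engine client: the class `HostMaintenancePolicy` and all of its attribute names are hypothetical, and the defaults only restate what this page describes (live migration for most instances, automatic restart enabled, a 1 hour Local SSD recovery time).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostMaintenancePolicy:
    """Hypothetical container for the four settings described in the text."""

    live_migrate: bool = True              # maintenance behavior: migrate (True) or stop (False)
    automatic_restart: bool = True         # restart behavior after a crash or host error
    host_error_detection_seconds: Optional[int] = None  # None means "use the platform default"
    local_ssd_recovery_hours: int = 1      # Local SSD recovery time, in hours

    def summary(self) -> str:
        """Return a one-line, human-readable description of the policy."""
        on_maintenance = "live migrate" if self.live_migrate else "stop"
        on_error = "restart" if self.automatic_restart else "stay terminated"
        return (
            f"maintenance: {on_maintenance}; host error: {on_error}; "
            f"Local SSD recovery: up to {self.local_ssd_recovery_hours} h"
        )
```

For example, `HostMaintenancePolicy(live_migrate=False).summary()` describes an instance that stops on maintenance but still restarts after a host error. How these values map onto real API fields is covered in Set instance host maintenance policy.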

Maintenance and restart behaviors

When a host event occurs, the compute instance can either use live migration, or the instance can be terminated. If an instance is terminated, you can choose to restart the instance yourself or have Compute Engine automatically restart it.

The following machine series don't support live migration and instead terminate during host events:

Z3 and X4 instances restart in place.
C3 bare metal instances terminate and restart, meaning they might restart on a different host.
Instances with GPUs
Instances with TPUs

Live migrate

By default, most instance types are set to live migrate, excluding:

Instances with attached GPUs and TPUs
C3 bare metal or X4 instances
Z3 instances
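The exclusions above can be expressed as a small predicate. This is a minimal sketch that encodes only the exceptions named on this page; the function name and its parameters are hypothetical, and the result for any other machine series simply follows the "most instance types" default, so it is not a substitute for checking the documentation of a specific machine type.

```python
def supports_live_migration(
    series: str,
    *,
    bare_metal: bool = False,
    has_gpu: bool = False,
    has_tpu: bool = False,
) -> bool:
    """Return False for the instance types this page lists as unable to live migrate."""
    series = series.strip().upper()
    if has_gpu or has_tpu:
        return False  # instances with attached GPUs or TPUs
    if series in {"Z3", "X4"}:
        return False  # Z3 and X4 instances
    if series == "C3" and bare_metal:
        return False  # C3 bare metal instances
    return True       # everything else follows the default described in the text
```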

During live migration, Compute Engine automatically migrates your instance away from an infrastructure maintenance event, and your instance remains running during the migration. Your instance might experience a short period of decreased performance, but in general, most instances shouldn't perform noticeably differently. This is ideal for instances that require constant uptime and can tolerate a short period of decreased performance.

When Compute Engine migrates your instance, it reports a system event that is published to the list of zone operations and to the System Events logs. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:

compute.instances.migrateOnHostMaintenance

Terminate and restart

If you don't want your instance to live migrate, or if your instance type doesn't support live migration, then you can instead choose to allow Google Cloud to stop the instance when a host event occurs. With this configuration, if a host event occurs, Compute Engine sends a soft power-off signal to shut down the instance. It then waits 60 seconds for the instance to shut down cleanly, and sets the instance status to TERMINATED. If the instance doesn't shut down cleanly in 60 seconds, it is forcibly terminated.
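Because the shutdown window is fixed at 60 seconds, an application that wants to exit cleanly has to fit its cleanup work inside that window. The following sketch shows one way to budget that time. It assumes your process is notified of the shutdown (for example through a signal handler or a shutdown script, which is not shown), and the function name, the task list format, and the 10 second safety margin are all hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple

# From the text: Compute Engine waits 60 seconds after the soft power-off signal.
SHUTDOWN_WINDOW_SECONDS = 60.0


def run_cleanup(
    tasks: Iterable[Tuple[str, Callable[[], None]]],
    budget_seconds: float = SHUTDOWN_WINDOW_SECONDS - 10.0,  # hypothetical safety margin
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run cleanup tasks in order and return the names of the tasks that ran.

    A running task is never interrupted; once the budget is spent,
    the remaining tasks are skipped.
    """
    deadline = clock() + budget_seconds
    completed: List[str] = []
    for name, task in tasks:
        if clock() >= deadline:
            break  # out of time: skip the remaining, lower-priority tasks
        task()
        completed.append(name)
    return completed
```

Ordering the task list from most to least important means that the work you care about most, such as flushing buffers, runs first if the budget runs out.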

This option is ideal if your instances demand constant, maximum performance, and if your overall application is built to handle instance failures or reboots.

When Compute Engine stops an instance because of a host event, it reports a system event that is published to the list of zone operations and to the System Events logs. You can review this event by viewing the Compute Engine operations for a specific zone. Instance termination events have the following operation type:

compute.instances.terminateOnHostMaintenance

Automatic restart

If your instance is configured to stop when there is a maintenance event, or your instance crashes because of an underlying hardware issue, Compute Engine can automatically restart the instance. The instance is either restarted on the same host server, or moved to another server in the same zone that isn't participating in the maintenance event.

By default, Compute Engine tries to recover instances with attached Local SSD disks for one hour. If the time limit is reached, Compute Engine attempts to restart the instance on a different host server in the same zone. Z3 and X4 instances have different default wait times. These instance types restart on the same host server after instance termination.

To configure automatic restart, set the host maintenance policy field automaticRestart to true. This setting does not apply if the instance is taken offline due to a zonal outage or through a user action, such as calling sudo shutdown within the guest OS.

When Compute Engine automatically restarts your instance, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:

compute.instances.automaticRestart
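If you export or list zone operations, the operation types named on this page can be used to pick out host events. The sketch below works on plain dictionaries and makes no API calls. It assumes each operation record carries `operationType` and `targetLink` keys, and the function name and labels are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Operation types taken from this page, mapped to short labels.
HOST_EVENT_OPERATION_TYPES: Dict[str, str] = {
    "compute.instances.migrateOnHostMaintenance": "live migration",
    "compute.instances.terminateOnHostMaintenance": "termination on host maintenance",
    "compute.instances.automaticRestart": "automatic restart",
    "compute.instances.hostError": "host error",
}


def host_events(operations: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (label, target) pairs for operations that are host events."""
    events: List[Tuple[str, str]] = []
    for op in operations:
        label = HOST_EVENT_OPERATION_TYPES.get(op.get("operationType", ""))
        if label is not None:
            events.append((label, op.get("targetLink", "")))
    return events
```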

Disk persistence following instance termination

Because Persistent Disk and Hyperdisk are network-attached storage, when your instance restarts, Compute Engine reattaches the boot disk and any secondary disks to the instance. The data on those disks persists through live migration and instance restarts.

Compute Engine preserves the data on Local SSD disks following a host event when possible. However, Compute Engine does not guarantee Local SSD data persistence.

Local SSD disks are preserved if:
- You configure your instance for live migration and the instance goes through a host maintenance event.
- A host error occurs and Compute Engine reconnects the instance to the Local SSD disks within the timeout limit.
- A compute instance with attached Local SSD disks that supports only termination and automatic restart undergoes a maintenance event. The instance restarts in place, preserving the Local SSD data, instead of migrating to a new host.

Local SSD disks are not preserved if:
- You shut down the guest operating system and force the instance to stop.
- You configure the instance to stop on host maintenance events and the instance goes through a host maintenance event.
- A host error occurs and Compute Engine can't reconnect the disks to the instance before the timeout expires. In this case, the instance is restarted without recovering the Local SSD disks. When the instance restarts, Compute Engine attaches blank Local SSD disks to the restarted instance. You must format and mount these disks before the instance can use them. The data on the original Local SSD disks is unrecoverable.

Google Cloud makes a best effort to keep your Local SSD data intact. However, there are cases where the data can't be recovered, such as when the recovery timeout expires. For more information about when Local SSD disks are preserved, see Local SSD data persistence.
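The preservation rules in this section can be summarized as a small decision function. This is a sketch of the documented cases only: the enum, the function, and the parameter names are hypothetical, and a result of `True` means "expected to be preserved", not a guarantee, because Local SSD persistence is best effort.

```python
from enum import Enum


class Event(Enum):
    MAINTENANCE = "maintenance"  # host maintenance event
    HOST_ERROR = "host_error"    # host error
    USER_STOP = "user_stop"      # guest OS shut down and instance forced to stop


def local_ssd_expected_preserved(
    event: Event,
    *,
    live_migrate: bool = False,         # instance is configured for live migration
    terminate_only_type: bool = False,  # instance type supports only terminate and restart
    reconnected_in_time: bool = False,  # disks reconnected before the recovery timeout
) -> bool:
    """Apply the preservation rules listed in this section."""
    if event is Event.USER_STOP:
        return False
    if event is Event.HOST_ERROR:
        return reconnected_in_time
    # Maintenance: data survives live migration, or an in-place restart on
    # instance types that support only termination and automatic restart.
    return live_migrate or terminate_only_type
```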

Local SSD recovery timeout

When a host error occurs, Compute Engine tries to recover any Local SSD disks attached to the instance. You can control how much time Compute Engine spends trying to recover the data with the host policy localSsdRecoveryTimeout setting.

By default, Compute Engine spends up to 1 hour trying to recover the data. Valid values for this setting are whole numbers of hours from 0 to 168. For Z3 instances, the default value is 6, which means that Compute Engine tries to recover the Local SSD data on Z3 instances for up to 6 hours before reaching the timeout limit.

If you set the Local SSD recovery timeout to 0, then Compute Engine doesn't attempt to recover any attached Local SSD disks. The instance is restarted as soon as possible and the Local SSD data is unrecoverable. Use this configuration if resuming the workload is more important than recovering the Local SSD data.

If the recovery timeout is not set to 0, but the time limit is reached before the Local SSD data is recovered, then Compute Engine restarts the instance without the Local SSD disk. Compute Engine attaches new, blank Local SSD disks to the restarted instance. You must format and mount these disks before the instance can use them.

The instance is in a REPAIRING state while Compute Engine attempts to recover the Local SSD disks. The instance and Local SSD disks are unavailable during this time.

If you set the Local SSD recovery timeout to the maximum value of 168, then the instance remains in the REPAIRING state for up to 7 days while Compute Engine attempts to recover the Local SSD disks.
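The limits described in this section are easy to check before you apply a value. The following sketch validates a timeout in hours and converts it to the longest time an instance could spend in the REPAIRING state. The function and constant names are hypothetical. The 0 to 168 range, the 1 hour default, and the 6 hour Z3 default come from the text; the X4 default is not stated on this page, so the sketch refuses to guess it.

```python
from typing import Optional

MIN_HOURS, MAX_HOURS = 0, 168
DEFAULT_HOURS = 1
SERIES_DEFAULT_HOURS = {"Z3": 6}       # defaults stated on this page
SERIES_WITH_UNSTATED_DEFAULT = {"X4"}  # differs from 1 hour, value not given here


def recovery_timeout_hours(hours: Optional[int] = None, *, series: str = "") -> int:
    """Return a validated timeout in hours, filling in the default if none is given."""
    if hours is None:
        key = series.strip().upper()
        if key in SERIES_WITH_UNSTATED_DEFAULT:
            raise LookupError(f"default for {key} is not stated in this text")
        return SERIES_DEFAULT_HOURS.get(key, DEFAULT_HOURS)
    if isinstance(hours, bool) or not isinstance(hours, int):
        raise ValueError("timeout must be a whole number of hours")
    if not MIN_HOURS <= hours <= MAX_HOURS:
        raise ValueError(f"timeout must be between {MIN_HOURS} and {MAX_HOURS} hours")
    return hours


def max_repairing_days(hours: int) -> float:
    """Longest time the instance can stay in REPAIRING, in days."""
    return hours / 24
```

For the maximum value, `max_repairing_days(168)` gives 7.0, which matches the 7 days mentioned above.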

Stop Local SSD disk recovery

You can interrupt the Local SSD disk recovery process before Compute Engine reaches the recovery timeout limit. To do so, use the gcloud compute instances stop command with the --discard-local-ssd=True flag.

This command stops the recovery process, stops the compute instance, and discards the Local SSD data. You can then restart the instance. See Stop an instance with Local SSD for more information.
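If you script this step, one option is to assemble the command as an argument list so that it can be logged or reviewed before it runs. This is a minimal sketch: the helper name is hypothetical, the instance name and zone in the usage line are placeholders, and the `--zone` flag is included on the assumption that you want to name the zone explicitly. The helper only builds the command; it does not execute it.

```python
import shlex
from typing import List


def stop_recovery_command(instance: str, zone: str) -> List[str]:
    """Build the gcloud command that stops the instance and discards Local SSD data."""
    return [
        "gcloud", "compute", "instances", "stop", instance,
        f"--zone={zone}",
        "--discard-local-ssd=True",  # flag as written in the text above
    ]


# Placeholder values; print the command for review instead of running it.
print(shlex.join(stop_recovery_command("example-vm", "us-central1-a")))
```

Passing the list to `subprocess.run(..., check=True)` would execute it. Keep in mind that the command discards the Local SSD data, so it is worth gating behind an explicit confirmation.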

To set the Local SSD recovery timeout, see Set instance host maintenance policy.

Maintenance scheduling

Google Cloud provides features that allow tighter control over maintenance. With certain machine families, you can specify maintenance preferences and get notifications of upcoming maintenance events through Cloud Logging, the instance's metadata server, the gcloud CLI compute instances describe command, or the REST instances.describe method. After you receive a notification, you have a period of time in which you can start the scheduled maintenance at a time you choose. If you don't trigger the scheduled maintenance, then the maintenance event occurs at the end of the notification period, which is the scheduled time listed in the notification.
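The timing rule in the previous paragraph can be sketched as a small function: maintenance runs at the time you choose if that time falls inside the notification period, and otherwise at the scheduled time. The function and parameter names are hypothetical, and the sketch models only the timing logic described here, not how notifications are retrieved or how maintenance is triggered.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def effective_maintenance_time(
    notified_at: datetime,
    scheduled_time: datetime,
    preferred_time: Optional[datetime] = None,
) -> datetime:
    """Return when maintenance would start under the rule described above."""
    if preferred_time is not None and notified_at <= preferred_time < scheduled_time:
        return preferred_time  # you trigger maintenance early, inside the window
    return scheduled_time      # otherwise it occurs at the end of the period


# Example with made-up dates: a notification that allows a 7 day window.
notified = datetime(2024, 1, 1, tzinfo=timezone.utc)
scheduled = notified + timedelta(days=7)
chosen = effective_maintenance_time(notified, scheduled, notified + timedelta(days=2))
```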

You can use these features in combination with your host maintenance policy to customize a maintenance schedule that fits your workload.

What's next

Learn more about live migration.
Learn more about setting instance host maintenance policy.
Learn more about getting live migration notices.
Learn more about simulating host maintenance.
Learn more about handling GPU host maintenance events.
Learn more about manually live migrating sole-tenant VMs.