You can choose how your virtual machines (VMs) respond during or after a host system event by setting their availability policies. A host system event can be regular maintenance of Compute Engine infrastructure or a host error that affects a VM. By default, VMs are set to live migrate during host system events, but you can set them to terminate and optionally restart instead.
The following host events lead to either the live migration or termination of your VM depending on the availability policies that you set:
Compute Engine maintenance events entail hardware and software updates. Some of these maintenance events require Google to move your VM away from the host that is undergoing maintenance. Compute Engine automatically manages the scheduling behavior of these VMs. If you configured the VM's availability policy to use live migration, Compute Engine live migrates your VMs, which prevents your applications from experiencing disruptions during these events. Alternatively, you can stop your VMs during these events rather than live migrating them.
The following table groups Compute Engine maintenance events into two broad categories, gives examples of each, and indicates which category requires live migration of your VM to a different host.
| Maintenance event type | Examples | Approximate frequency * | Requires live migration to new host |
| --- | --- | --- | --- |
| Host maintenance | Host kernel upgrade, hardware repair or upgrade | Once every two weeks | Yes |
| Lightweight | Hypervisor-level upgrade, networking stack upgrade | 1-2 times per week | No |
* These frequencies are approximations, not guarantees. Compute Engine may occasionally perform maintenance more frequently than mentioned here.
A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine hosting your VM that caused your VM to crash. A host error which involves total hardware failure or other hardware issues might prevent live migration of your VM. If your VM is set to automatically restart, which is the default setting, Google restarts your VM, typically within three minutes from the time the error was detected. In cases with certain hardware issues, the attempt to restart your VM might get delayed by 5.5 minutes to 16.5 minutes.
Certain resources behave differently, such as local SSDs. If there is a host error, Compute Engine makes a best effort to reconnect to the VM and preserve the local SSD data, but if the underlying drive does not recover within 60 minutes, the VM restarts without the local SSD data. While Compute Engine is recovering your VM and local SSD, which can take up to 60 minutes, the host system and the underlying drive are unresponsive. For more information about how local SSD disks behave in the event of host errors, see Local SSD data persistence.
If hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically, and a host error is logged.
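To see which VMs were affected by host errors, you can filter a list of operation records on the compute.instances.hostError operation type. The following is a minimal sketch in plain Python that assumes you have already fetched the operations as dictionaries carrying an `operationType` key; the sample records and the `host_error_operations` helper are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, List

# Operation type that Compute Engine logs for a host error (from the text above).
HOST_ERROR = "compute.instances.hostError"


def host_error_operations(operations: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the operation records whose type marks a host error."""
    return [op for op in operations if op.get("operationType") == HOST_ERROR]


# Hypothetical sample data; real records contain many more fields.
sample_operations = [
    {"operationType": "compute.instances.hostError", "targetLink": "instances/vm-1"},
    {"operationType": "insert", "targetLink": "instances/vm-2"},
    {"operationType": "compute.instances.hostError", "targetLink": "instances/vm-3"},
]

affected = [op["targetLink"] for op in host_error_operations(sample_operations)]
```

Here `affected` holds the targets of the two host-error records, and the unrelated `insert` operation is skipped.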
Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:
A VM's availability policy determines how the VM behaves when there is a maintenance event where Google must move your VM to another host machine. You can configure your VM to continue running while Compute Engine live migrates the VM to another host, or you can choose to stop your VMs instead. You can update a VM's availability policy at any time to control how you want your VMs to behave.
You can change a VM's availability policy by configuring the following two settings:
- The VM's maintenance behavior, which determines whether the VM is live migrated or stopped when there is a maintenance event.
- The VM's restart behavior, which determines whether the VM automatically restarts in case it crashes or gets stopped.
The default maintenance behavior for VMs is to live migrate, but you can change the behavior to stop your VM during maintenance events instead.
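The two settings can be pictured as a small configuration object. The sketch below builds such an object in plain Python and makes no API calls. It assumes the field name `onHostMaintenance` with the values `MIGRATE` and `TERMINATE`, which is how the Compute Engine API's scheduling resource expresses maintenance behavior but is not named in the text above; `automaticRestart` is the restart field this page refers to. The `build_scheduling` helper itself is hypothetical.

```python
from typing import Any, Dict

# Assumed values for the maintenance behavior setting (not named on this page).
VALID_MAINTENANCE_BEHAVIORS = {"MIGRATE", "TERMINATE"}


def build_scheduling(on_host_maintenance: str = "MIGRATE",
                     automatic_restart: bool = True) -> Dict[str, Any]:
    """Return a dict describing a VM's availability policy.

    The defaults mirror the documented defaults: live migrate during
    maintenance, and restart automatically after a crash or stop.
    """
    behavior = on_host_maintenance.upper()
    if behavior not in VALID_MAINTENANCE_BEHAVIORS:
        raise ValueError(f"unsupported maintenance behavior: {on_host_maintenance!r}")
    return {"onHostMaintenance": behavior, "automaticRestart": automatic_restart}


default_policy = build_scheduling()
stop_and_restart_policy = build_scheduling("TERMINATE", automatic_restart=True)
```

Validating the behavior string before use catches typos locally rather than in a rejected request.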
By default, standard VMs are set to live migrate, where Compute Engine automatically migrates your VM away from an infrastructure maintenance event, and your VM remains running during the migration. Your VM might experience a short period of decreased performance, but in general, most VMs should not perform noticeably differently. This setting is ideal for VMs that require constant uptime and can tolerate a short period of decreased performance.
When Compute Engine migrates your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:
Stop and (optionally) restart
If you do not want your VM to live migrate, you can choose to stop and optionally restart your VM. For VMs set to stop and optionally restart, Compute Engine sends a soft power-off signal to the VM and waits 60 seconds for it to shut down cleanly. If the VM hasn't shut down cleanly within 60 seconds, Compute Engine terminates it anyway. Compute Engine then restarts the VM away from the maintenance event.
This option is ideal if your VMs demand constant, maximum performance, and if your overall application is built to handle VM failures or reboots.
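Because the VM has only 60 seconds to shut down cleanly, it can help to give cleanup work an explicit time budget. The following sketch shows one way to do that in Python. The `run_cleanup` helper, the task names, and the 80% safety margin are assumptions made for illustration, and how you invoke it at shutdown depends on your guest operating system.

```python
import time
from typing import Callable, List, Sequence, Tuple

# From the text above: Compute Engine waits 60 seconds for a clean shutdown.
SHUTDOWN_WINDOW_SECONDS = 60.0
# Assumed safety margin so cleanup finishes before the window closes.
DEFAULT_BUDGET_SECONDS = SHUTDOWN_WINDOW_SECONDS * 0.8


def run_cleanup(tasks: Sequence[Tuple[str, Callable[[], None]]],
                budget_seconds: float = DEFAULT_BUDGET_SECONDS) -> List[str]:
    """Run cleanup tasks in order until the time budget is spent.

    Returns the names of the tasks that were started and completed.
    A task that is already running is not interrupted, so keep each one short.
    """
    deadline = time.monotonic() + budget_seconds
    completed: List[str] = []
    for name, task in tasks:
        if time.monotonic() >= deadline:
            break  # Out of time: skip the remaining, lower-priority tasks.
        task()
        completed.append(name)
    return completed


# Hypothetical cleanup steps, ordered from most to least important.
flushed: List[str] = []
cleanup_tasks = [
    ("flush_buffers", lambda: flushed.append("buffers")),
    ("close_connections", lambda: flushed.append("connections")),
]
done = run_cleanup(cleanup_tasks)
```

Ordering tasks by importance means that if the budget runs out, the work that is skipped is the work you can best afford to lose.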
When Compute Engine stops and reboots your VMs, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Stopped events have the following operation type:
When your VM reboots, it uses the same persistent boot disk and reattaches any secondary persistent disks that you configured. The data on those disks persists through VM migration and restart.
Local SSD data does not persist when a VM is stopped. When the VM restarts, it creates a new Local SSD that you must format and mount.
If your VM is set to stop when there is a maintenance event, or your VM crashes because of an underlying hardware issue, you can set Compute Engine to automatically restart the VM by setting the automaticRestart field to true. This setting does not apply if the VM is taken offline through a user action, such as calling sudo shutdown, or during a zone outage.
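The rules in this paragraph can be summarized as a small decision function. This is an illustrative model of the documented behavior, not Compute Engine code; the `StopCause` enum and the `restarts_automatically` helper are hypothetical names.

```python
from enum import Enum


class StopCause(Enum):
    """Reasons a VM might go offline, as described in the text above."""
    MAINTENANCE_STOP = "maintenance_stop"  # VM set to stop on maintenance
    HOST_ERROR = "host_error"              # crash from an underlying hardware issue
    USER_ACTION = "user_action"            # for example, calling sudo shutdown
    ZONE_OUTAGE = "zone_outage"


# Only these causes are covered by the automaticRestart setting.
_COVERED_BY_AUTOMATIC_RESTART = {StopCause.MAINTENANCE_STOP, StopCause.HOST_ERROR}


def restarts_automatically(automatic_restart: bool, cause: StopCause) -> bool:
    """Return True if the VM is expected to be restarted automatically."""
    return automatic_restart and cause in _COVERED_BY_AUTOMATIC_RESTART
```

For example, `restarts_automatically(True, StopCause.USER_ACTION)` evaluates to `False`, because a user-initiated shutdown is outside the scope of the setting.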
When Compute Engine automatically restarts your VM, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:
- Learn more about live migration.
- Learn more about setting VM availability policies.
- Learn more about getting live migration notices.
- Learn more about simulating host maintenance.
- Learn more about handling GPU host maintenance events.
- Learn more about manually live migrating sole-tenant VMs.