Live migration process during maintenance events


During a planned maintenance event on a virtual machine (VM) instance's underlying hardware, Compute Engine might move the VM to another host. To keep a VM running during a host event, Compute Engine performs a live migration of the VM to another host in the same zone. For more information about host events, see About host events.

Live migration lets Google Cloud perform maintenance without interrupting a workload, rebooting a VM, or modifying any of the VM's properties, such as IP addresses, metadata, block storage data, application state, and network settings.

In addition to keeping VMs running during planned host events, live migration keeps VMs running during the following situations:

  • Infrastructure maintenance. Infrastructure maintenance includes host hardware, network and power grids in data centers, and host OSes and BIOSes.

  • Security-related updates and system configuration changes. These include events such as installing security patches and changing the size of the host root partition for storage of the host OS image and packages.

  • Hardware failures. This includes failures in memory, CPUs, network interface cards, and disks. If the hardware fails completely or otherwise prevents live migration, the VM terminates, restarts automatically, and Compute Engine logs a hostError.

Compute Engine only performs a live migration of VMs that have the host maintenance policy set to migrate. For information about how to change the host maintenance policy, see Set VM host maintenance policy.
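
As a rough sketch of what this policy looks like at the API level, the following Python builds the scheduling body used by the Compute Engine API's instances.setScheduling method. The field names onHostMaintenance and automaticRestart and the values MIGRATE and TERMINATE come from that API. The helper name build_scheduling is hypothetical, and sending the request (authentication, project, zone, instance) is deliberately left out.

```python
import json

# Values the Compute Engine API accepts for scheduling.onHostMaintenance.
VALID_POLICIES = {"MIGRATE", "TERMINATE"}


def build_scheduling(on_host_maintenance: str, automatic_restart: bool = True) -> dict:
    """Return a scheduling body describing a VM's host maintenance policy.

    Hypothetical helper: it only builds the dictionary and does not call any API.
    """
    policy = on_host_maintenance.upper()
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown host maintenance policy: {on_host_maintenance!r}")
    return {"onHostMaintenance": policy, "automaticRestart": automatic_restart}


# Build the body for a VM that should live migrate during host events.
body = build_scheduling("migrate")
print(json.dumps(body))
```

Validating the value locally catches typos before the request is sent, but the API remains the authority on which combinations a given VM type accepts.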

Live migration process and local SSDs

Compute Engine can live migrate VMs with local SSDs attached, moving the VMs along with the data on their local SSDs to a new machine in advance of any planned maintenance.

Limitations

Live migration is not supported for the following VM types:

  • Some Confidential VM instances. Live migration is only supported on N2D machine types with AMD EPYC Milan CPU platforms running AMD SEV. All other Confidential VM types must be set to stop and optionally restart. See Live migration for more details.

  • VMs with GPUs attached. VM instances with GPUs attached must be set to stop and optionally restart. Compute Engine offers a 60-minute notice before a VM instance with a GPU attached is stopped. To learn more about these maintenance event notices, read Getting live migration notices.

    To learn more about handling host maintenance with GPUs, read Handling host maintenance on the GPUs documentation.

  • Cloud TPUs. Cloud TPUs don't support live migration.

  • Preemptible VMs. You can't configure a preemptible VM to live migrate. The maintenance behavior for preemptible instances is always set to TERMINATE by default, and you can't change this option. You can't set the automatic restart option for preemptible instances, but you can manually start preemptible VMs again from the VM Instances details page after they are preempted.

    If you need to change your instance to no longer be preemptible, detach the boot disk from your preemptible instance and attach it to a new instance that is not configured to be preemptible. You can also create a snapshot of the boot disk and use it to create a new instance without preemptibility.

  • Spot VMs. A running Spot VM can't live migrate to become a standard VM, and Spot VMs can't be set to restart automatically when there is a host event.

  • Storage-optimized VMs. Z3 VMs (Preview) don't support live migration. The maintenance behavior for Z3 VMs is set to TERMINATE.
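
The limitations above can be summarized as a simple eligibility check. The following sketch is illustrative only: the VmProfile fields and the live_migration_blockers function are hypothetical names, not part of any Google Cloud API, and the rules encode only what the list above states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmProfile:
    """Hypothetical summary of the VM properties the limitations refer to."""
    machine_series: str = "n2"
    has_gpu: bool = False
    is_tpu: bool = False
    preemptible: bool = False
    spot: bool = False
    confidential: bool = False
    # True only for Confidential VMs on N2D with AMD EPYC Milan running AMD SEV.
    n2d_milan_sev: bool = False


def live_migration_blockers(vm: VmProfile) -> List[str]:
    """Return the reasons, if any, why this VM can't be set to live migrate."""
    blockers = []
    if vm.confidential and not vm.n2d_milan_sev:
        blockers.append("Confidential VM not on N2D with AMD EPYC Milan and AMD SEV")
    if vm.has_gpu:
        blockers.append("GPU attached")
    if vm.is_tpu:
        blockers.append("Cloud TPU")
    if vm.preemptible:
        blockers.append("preemptible VM")
    if vm.spot:
        blockers.append("Spot VM")
    if vm.machine_series.lower() == "z3":
        blockers.append("Z3 storage-optimized VM")
    return blockers


# A standard VM has no blockers; a Spot VM with a GPU has two.
print(live_migration_blockers(VmProfile()))
print(live_migration_blockers(VmProfile(has_gpu=True, spot=True)))
```

An empty list means none of the listed limitations apply. It does not guarantee that the API accepts the migrate policy for that VM.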

How does the live migration process work?

When a VM is scheduled to live migrate, Google Cloud provides a notification. During live migration, Google Cloud keeps disruption to a minimum, typically much less than 1 second. If a VM is not set to live migrate, Compute Engine terminates the VM during host maintenance. VMs that are set to terminate during a host event stop and (optionally) restart.

When Google Cloud migrates a running VM from one host to another, it moves the complete state of the VM from the source to the destination in a way that is transparent to the guest OS and anything communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are shown in the following illustration:

[Figure: Live migration components. Migrating a VM and each of its resources to a new host system without requiring the guest operating system to restart.]

The process begins with a notification that VMs need to be moved from their current host machine. The notification might start with a file change indicating that a new BIOS version is available, a hardware operation scheduling maintenance, or an automatic signal from an impending hardware failure.

Google Cloud's cluster management software constantly watches for these events and schedules them based on policies that control the data centers, such as capacity utilization rates and the number of VMs that a single customer can migrate at once.

After a VM is selected for migration, Google Cloud provides a notification to the guest that a migration is happening soon. After a waiting period, a target host is selected and the host is asked to set up a new, empty "target" VM to receive the migrating "source" VM. Authentication is used to establish a connection between the source and the target.
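
A guest can watch for this notification by reading the maintenance-event key on the metadata server. The sketch below assumes the metadata server endpoint and the Metadata-Flavor: Google header that Compute Engine documents, along with the values NONE, MIGRATE_ON_HOST_MAINTENANCE, and TERMINATE_ON_HOST_MAINTENANCE for this key. The function names are hypothetical, and the fetch only works from inside a Compute Engine VM.

```python
import urllib.request

# Metadata server key that reports upcoming host maintenance for this VM.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(timeout: float = 5.0) -> str:
    """Read the current maintenance-event value (only works on a Compute Engine VM)."""
    request = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def describe_event(value: str) -> str:
    """Translate a maintenance-event value into a short human-readable message."""
    if value == "NONE":
        return "no maintenance event pending"
    if value == "MIGRATE_ON_HOST_MAINTENANCE":
        return "live migration is about to happen"
    if value == "TERMINATE_ON_HOST_MAINTENANCE":
        return "VM is about to be stopped for host maintenance"
    return f"unrecognized value: {value}"
```

On a VM, calling describe_event(fetch_maintenance_event()) in a loop, or before starting a long non-resumable task, gives the workload a chance to checkpoint before the migration begins.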

There are three stages involved in the VM's migration:

  1. Source brownout. The VM is still executing on the source, while most state is sent from the source to the target. For example, Google Cloud copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in source brownout is a function of the size of the guest memory and the rate at which pages are being changed.

  2. Blackout. A very brief moment when the VM is not running anywhere: the VM is paused, and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state during source brownout reaches a point of diminishing returns. An algorithm balances the number of bytes of memory being sent against the rate at which the guest VM is making changes.

    During blackout events, the system clock appears to jump forward by up to 5 seconds. If a blackout event exceeds 5 seconds, Google Cloud stops and resynchronizes the clock using a daemon that is included as part of the VM guest packages.

  3. Target brownout. The VM executes on the target VM. The source VM is present and might provide supporting functionality for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.

Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in your VM logs.
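
To make the trade-off between source brownout and blackout concrete, here is a toy model of iterative pre-copy memory transfer. It is not Google Cloud's algorithm, which this page does not specify. It assumes a constant page-dirtying rate and a constant transfer bandwidth, and the function name, parameters, and thresholds (blackout_budget, min_gain) are all illustrative.

```python
def simulate_precopy(total_pages, dirty_rate, bandwidth,
                     blackout_budget=0.05, min_gain=0.1, max_rounds=30):
    """Toy pre-copy model. Rates are in pages per second, times in seconds.

    Returns (rounds, brownout_seconds, blackout_seconds).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    to_send = float(total_pages)  # the first round copies all guest memory
    brownout = 0.0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        round_time = to_send / bandwidth
        brownout += round_time
        # Pages the still-running guest changed while this round was being copied.
        dirtied = min(float(total_pages), dirty_rate * round_time)
        # Stop when the backlog fits in the blackout budget, or when another
        # round would shrink it by less than min_gain (diminishing returns).
        small_enough = dirtied / bandwidth <= blackout_budget
        poor_gain = dirtied > to_send * (1.0 - min_gain)
        to_send = dirtied
        if small_enough or poor_gain:
            break

    # Blackout: the VM is paused while the remaining dirty pages are sent.
    blackout = to_send / bandwidth
    return rounds, brownout, blackout


# A guest that dirties pages slowly relative to the transfer bandwidth.
rounds, brownout, blackout = simulate_precopy(1_000_000, 10_000, 100_000)
print(rounds, round(brownout, 3), round(blackout, 3))
```

In this model, a guest that dirties memory almost as fast as it can be copied hits the diminishing-returns condition after the first round and ends up with a long blackout. This matches the statement above that brownout time depends on memory size and the rate at which pages change.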

Manual live migration process

As your workload runs, you might want to move VMs to a different node or node group. Sole-tenancy lets you move VMs to a specific sole-tenant node or to a group of nodes. If you move a VM to a group of nodes, Compute Engine determines which node to place it on. For information about sole-tenancy, see Sole-tenancy overview.

To move sole-tenant VMs to a different node or node group, you can manually initiate a live migration. You can also manually initiate a live migration to move a multi-tenant VM into sole-tenancy. For more information, see Manually live migrate VMs.

What's next