Live migration during maintenance events

During a planned maintenance event on a virtual machine (VM) instance's underlying hardware, Compute Engine might move the VM to another host. To keep a VM running during a host event, Compute Engine performs a live migration of the VM to another host in the same zone. For more information about host events, see About host events.

Live migration lets Google perform maintenance without interrupting a workload, rebooting a VM, or modifying any of the VM's properties, such as IP addresses, metadata, block storage data, application state, and network settings.

In addition to keeping VMs running during planned host events, live migration keeps VMs running during the following situations:

  • Infrastructure maintenance. Infrastructure maintenance includes host hardware, network and power grids in data centers, and host OSes and BIOSes.

  • Security-related updates and system configuration changes. These include events such as installing security patches and changing the size of the host root partition for storage of the host OS image and packages.

  • Hardware failures. This includes failures in memory, CPUs, network interface cards, and disks. If the hardware fails completely or otherwise prevents live migration, the VM terminates, restarts automatically, and Compute Engine logs a hostError.

Compute Engine only performs a live migration of VMs that have the host maintenance policy set to migrate. For information about how to change the host maintenance policy, see Set VM host maintenance policy.

Manual live migration

As your workload runs, you might want to move VMs to a different node or node group. Sole-tenancy lets you move VMs to a specific sole-tenant node or to a group of nodes. If you move a VM to a group of nodes, Compute Engine determines which node to place it on. For information about sole-tenancy, see Sole-tenancy overview.

To move sole-tenant VMs to a different node or node group, you can manually initiate a live migration. You can also manually initiate a live migration to move a multi-tenant VM into sole-tenancy. For more information, see Manually live migrate VMs.

How does the live migration process work?

When a VM is scheduled to live migrate, Google provides a notification. During live migration, Google keeps the disruption to a minimum, typically much less than 1 second. If a VM is not set to live migrate, Compute Engine terminates the VM during host maintenance. VMs that are set to terminate during a host event stop and (optionally) restart.
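As a minimal sketch of the rule above, the following function maps a VM's host maintenance settings to the outcome of a host event. It assumes the two inputs mirror the VM's `onHostMaintenance` and `automaticRestart` scheduling settings. The function name and the returned strings are hypothetical and exist only for illustration.

```python
def host_event_outcome(on_host_maintenance: str, automatic_restart: bool) -> str:
    """Describe what happens to a VM during a host event (illustrative only)."""
    policy = on_host_maintenance.upper()
    if policy == "MIGRATE":
        # The VM keeps running and is moved to another host in the same zone.
        return "live migrate"
    if policy == "TERMINATE":
        # The VM stops; it comes back only if automatic restart is enabled.
        return "stop and restart" if automatic_restart else "stop"
    raise ValueError(f"unknown host maintenance policy: {on_host_maintenance!r}")


if __name__ == "__main__":
    for policy, restart in [("MIGRATE", True), ("TERMINATE", True), ("TERMINATE", False)]:
        print(policy, restart, "->", host_event_outcome(policy, restart))
```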

When Google migrates a running VM from one host to another, it moves the complete state of the VM from the source to the destination in a way that is transparent to the guest OS and anything communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are shown here:

[Figure: Live migration components. Migrating a VM and each of its resources to a new host system without requiring the guest operating system to restart.]

The process begins with a notification that VMs need to be moved from their current host machine. The notification might start with a file change indicating that a new BIOS version is available, a hardware operation scheduling maintenance, or an automatic signal from an impending hardware failure.

Google's cluster management software constantly watches for these events and schedules them based on policies that control the data centers, such as capacity utilization rates and the number of VMs that a single customer can migrate at once.

After a VM is selected for migration, Google provides a notification to the guest that a migration is happening soon. After a waiting period, a target host is selected and the host is asked to set up a new, empty "target" VM to receive the migrating "source" VM. Authentication is used to establish a connection between the source and the target.
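One way a guest can observe this notification is to query the metadata server. The sketch below assumes the `maintenance-event` metadata key and its values (`NONE`, `MIGRATE_ON_HOST_MAINTENANCE`, `TERMINATE_ON_HOST_MAINTENANCE`), which come from the Compute Engine metadata server documentation rather than from this page. The function names are hypothetical. The fetch only works from inside a VM, so the interpretation step is kept separate and can be run anywhere.

```python
import urllib.request

# Assumption: metadata server path for the pending host maintenance event.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(timeout: float = 5.0) -> str:
    """Read the current maintenance event value (works only inside a VM)."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def describe_event(value: str) -> str:
    """Turn a maintenance event value into a human-readable description."""
    if value == "NONE":
        return "no host event pending"
    if value == "MIGRATE_ON_HOST_MAINTENANCE":
        return "live migration is happening soon"
    if value == "TERMINATE_ON_HOST_MAINTENANCE":
        return "VM will be stopped soon"
    return f"unrecognized value: {value}"
```

Inside a VM, `describe_event(fetch_maintenance_event())` reports the current state. A workload could call it periodically and, for example, checkpoint its work when a host event is pending.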

There are three stages involved in the VM's migration:

  1. Source brownout. The VM is still executing on the source, while most state is sent from the source to the target. For example, Google copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in source brownout is a function of the size of the guest memory and the rate at which pages are being changed.

  2. Blackout. A very brief moment when the VM is not running anywhere. The VM is paused, and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state during source brownout reaches a point of diminishing returns. An algorithm balances the number of bytes of memory being sent against the rate at which the guest VM is making changes.

    During blackout events, the system clock appears to jump forward, up to 5 seconds. If a blackout event exceeds 5 seconds, Google stops and resynchronizes the clock using a daemon that is included as part of the VM guest packages.

  3. Target brownout. The VM executes on the target host. The source VM is still present and might provide supporting functionality for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.
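The trade-off between source brownout and blackout can be illustrated with a toy model of iterative memory copying. This is not Google's algorithm. It is a simplified sketch that assumes a constant page-dirtying rate and a constant transfer bandwidth, and every name and threshold in it (`simulate_precopy`, `target_blackout_s`, `min_gain`) is hypothetical. Each round resends the memory dirtied during the previous round. The model switches to blackout when the remaining state is small enough to send quickly, or when another round would no longer shrink it meaningfully.

```python
def simulate_precopy(memory_mb, dirty_rate_mb_s, bandwidth_mb_s,
                     target_blackout_s=0.05, min_gain=0.1, max_rounds=30):
    """Return (rounds, brownout_seconds, blackout_seconds) for a toy model."""
    if memory_mb <= 0 or bandwidth_mb_s <= 0:
        raise ValueError("memory_mb and bandwidth_mb_s must be positive")
    if dirty_rate_mb_s < 0 or target_blackout_s < 0:
        raise ValueError("dirty_rate_mb_s and target_blackout_s must be >= 0")

    remaining = float(memory_mb)  # state that still has to be sent
    brownout_s = 0.0
    rounds = 0
    while rounds < max_rounds:
        send_time = remaining / bandwidth_mb_s
        brownout_s += send_time
        rounds += 1
        # Memory changed by the guest while this round was in flight
        # has to be sent again (capped at the total guest memory).
        dirtied = min(float(memory_mb), dirty_rate_mb_s * send_time)
        gain = (remaining - dirtied) / remaining
        remaining = dirtied
        if remaining / bandwidth_mb_s <= target_blackout_s:
            break  # small enough to send while the VM is paused
        if gain < min_gain:
            break  # diminishing returns: further rounds barely help
    return rounds, brownout_s, remaining / bandwidth_mb_s


if __name__ == "__main__":
    rounds, brownout, blackout = simulate_precopy(4096, 100, 1000)
    print(f"rounds={rounds} brownout={brownout:.3f}s blackout={blackout:.3f}s")
```

With 4096 MB of guest memory, 100 MB/s of dirtying, and 1000 MB/s of bandwidth, the model stops after two rounds. If the guest dirties memory faster than it can be sent, the model gives up after the first round, which shows why brownout time depends on both memory size and the rate of change.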

Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in your VM logs.

Live migration is a critical component of our platform, so Google continuously tests live migration with a very high level of scrutiny. During testing, we use fault-injection to trigger failures at all of the interesting points in the migration algorithm. We generate both active and passive failures for each component. Achieving this complex and multifaceted process requires deep integration throughout the infrastructure and a powerful set of scheduling, orchestration, and automation processes.

Live migration and Confidential VMs

Confidential VMs do not support live migration. They must be set to stop and optionally restart. Compute Engine offers a 60-second notice before a Confidential VM is stopped. To learn more about these maintenance event notices, read Getting live migration notices.

Live migration and GPUs

Instances with GPUs attached cannot be live migrated. They must be set to stop and optionally restart. Compute Engine offers a 60-minute notice before a VM instance with a GPU attached is stopped. To learn more about these maintenance event notices, read Getting live migration notices.

To learn more about handling host maintenance with GPUs, read Handling host maintenance on the GPUs documentation.

Live migration and local SSDs

Compute Engine can also live migrate instances with local SSDs attached, moving the VMs along with their local SSD to a new machine in advance of any planned maintenance.

Live migration and Cloud TPUs

Cloud TPUs do not support live migration.

Live migration for preemptible instances

You can't configure a preemptible instance to live migrate. The maintenance behavior for preemptible instances is always set to TERMINATE, and you can't change this option. You also can't set the automatic restart option for preemptible instances, but you can manually restart a preemptible instance from the VM instance details page after it is preempted:

  1. Go to the VM instances page.
  2. Select your preemptible instance.
  3. At the top of the VM Instance details page, click Start.

If you need to change your instance to no longer be preemptible, detach the boot disk from your preemptible instance and attach it to a new instance that is not configured to be preemptible. You can also create a snapshot of the boot disk and use it to create a new instance without preemptibility.
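The restrictions described in the preceding sections can be collected into a small lookup, shown here as a sketch. The feature keys, the `MigrationSupport` type, and the `can_live_migrate` helper are hypothetical names chosen for illustration. The values restate what this page says, and the helper assumes the VM's host maintenance policy is otherwise set to migrate.

```python
from typing import Iterable, NamedTuple, Optional


class MigrationSupport(NamedTuple):
    live_migration: bool   # whether a VM with this feature can live migrate
    notice: Optional[str]  # advance notice before a stop, if this page states one


# Values restate this page; the dictionary keys are illustrative labels.
SUPPORT = {
    "confidential_vm": MigrationSupport(False, "60 seconds"),
    "gpu": MigrationSupport(False, "60 minutes"),
    "local_ssd": MigrationSupport(True, None),
    "cloud_tpu": MigrationSupport(False, None),
    "preemptible": MigrationSupport(False, None),
}


def can_live_migrate(features: Iterable[str]) -> bool:
    """Return True only if every listed feature supports live migration."""
    return all(SUPPORT[feature].live_migration for feature in features)


if __name__ == "__main__":
    print(can_live_migrate(["local_ssd"]))
    print(can_live_migrate(["local_ssd", "gpu"]))
```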

What's next