Compute Engine offers live migration to keep your virtual machine instances running even when a host system event, such as a software or hardware update, occurs. Compute Engine live migrates your running instances to another host in the same zone instead of requiring your VMs to be rebooted. This allows Google to perform maintenance that is integral to keeping infrastructure protected and reliable without interrupting any of your VMs. When a VM is scheduled to be live migrated, Google provides a notification to the guest that a migration is imminent.
Live migration keeps your instances running during:
- Regular infrastructure maintenance and upgrades.
- Network and power grid maintenance in the data centers.
- Failed hardware such as memory, CPU, network interface cards, disks, power, and so on. This is done on a best-effort basis; if hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically.
- Host OS and BIOS upgrades.
- Security-related updates that require a quick response.
- System configuration changes, including changing the size of the host root partition, which stores the host image and packages.
Live migration does not change any attributes or properties of the VM itself; it only transfers a running VM from one host machine to another within the same zone. Internal and external IP addresses, instance metadata, block storage data and volumes, OS and application state, network settings, network connections, and so on all remain unchanged.
How does the live migration process work?
When Google migrates a running VM instance from one host to another, it moves the complete instance state from the source to the destination in a way that is transparent to the guest OS and anyone communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are as follows:
The process begins with a notification that VMs need to be moved from their current host machine. The notification might come from a file change indicating that a new BIOS version is available, a hardware operation scheduling maintenance, or an automatic signal of an impending hardware failure.
Google's cluster management software constantly watches for these events and schedules them based on policies that control the data centers, such as capacity utilization rates and the number of VMs that a single customer can migrate at once.
After a VM is selected for migration, Google provides a notification to the guest that a migration is happening soon. After a waiting period, a target host is selected and the host is asked to set up a new, empty "target" VM to receive the migrating "source" VM. Authentication is used to establish a connection between the source and the target.
There are three stages involved in the VM’s migration:
During pre-migration brownout, the VM is still executing on the source, while most state is sent from the source to the target. For example, Google copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in pre-migration brownout is a function of the size of the guest memory and the rate at which pages are being changed.
During blackout, a very brief moment when the VM is not running anywhere, the VM is paused and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state during pre-migration brownout reaches a point of diminishing returns. To decide when that point is reached, an algorithm balances the number of bytes of memory being sent against the rate at which the guest VM is making changes.
During post-migration brownout, the VM executes on the target VM. The source VM is present and might provide supporting functionality for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.
Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in your VM logs.
Live migration is a critical component of our platform, so Google continuously tests live migration with a very high level of scrutiny. During testing, we use fault injection to trigger failures at all of the interesting points in the migration algorithm. We generate both active and passive failures for each component. Achieving this complex and multifaceted process requires deep integration throughout the infrastructure and a powerful set of scheduling, orchestration, and automation processes.
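The trade-off between pre-migration brownout and blackout described above can be explored with a toy model. The following sketch is not Google's algorithm; it is a simplified iterative pre-copy loop under stated assumptions: the guest dirties pages at a constant rate, pages are sent at a constant rate, and the loop gives up (enters blackout) once a round shrinks the set of pages still to send by fewer than `min_gain_pages`. The function name and all parameters are hypothetical.

```python
def simulate_precopy(total_pages, dirty_rate, copy_rate,
                     min_gain_pages=5_000, max_rounds=30):
    """Toy model of iterative pre-copy (illustrative only).

    total_pages: size of guest memory, in pages
    dirty_rate:  pages the guest changes per second (assumed constant)
    copy_rate:   pages sent to the target per second (assumed constant)
    """
    remaining = total_pages      # pages that still differ on the target
    pages_sent = 0               # pages sent while the VM keeps running
    rounds = 0
    while remaining > 0 and rounds < max_rounds:
        pages_sent += remaining
        rounds += 1
        # Sending `remaining` pages takes remaining / copy_rate seconds;
        # the guest dirties this many pages in that time.
        dirtied = min(total_pages, remaining * dirty_rate // copy_rate)
        gain = remaining - dirtied
        remaining = dirtied
        if gain < min_gain_pages:
            # Diminishing returns: pause the VM and send the rest.
            break
    return {
        "rounds": rounds,
        "brownout_pages_sent": pages_sent,
        "blackout_pages": remaining,
    }


if __name__ == "__main__":
    print(simulate_precopy(1_000_000, dirty_rate=10_000, copy_rate=100_000))
    print(simulate_precopy(1_000_000, dirty_rate=200_000, copy_rate=100_000))
```

In this model, a guest that changes memory slowly relative to the copy rate leaves only a small number of pages for the blackout stage, while a guest that dirties pages faster than they can be sent never converges and the whole memory must be sent during blackout.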
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. Instead, they must be set to stop and, optionally, restart. Compute Engine provides a 60-minute notice before a VM instance with a GPU attached is stopped. To learn more about these maintenance event notices, read Getting live migration notices.
To learn more about handling host maintenance with GPUs, read Handling host maintenance in the GPUs documentation.
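As a rough illustration of how a guest might consume these notices, the sketch below reads the `maintenance-event` value from the Compute Engine metadata server. The endpoint, the `Metadata-Flavor: Google` header, the `wait_for_change` query parameter, and the three event values come from the metadata server's documented interface; the function names and the action labels returned by `describe_event` are hypothetical and only stand in for whatever your workload needs to do (for example, checkpointing before a GPU instance is stopped).

```python
import urllib.request

# Metadata server path for host maintenance notices.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def describe_event(value):
    """Map a maintenance-event value to a hypothetical action label."""
    value = value.strip()
    if value == "NONE":
        return "no-op"
    if value == "MIGRATE_ON_HOST_MAINTENANCE":
        return "prepare-for-migration"
    if value == "TERMINATE_ON_HOST_MAINTENANCE":
        return "checkpoint-and-shutdown"
    return "unknown"


def fetch_event(wait_for_change=False, timeout=None):
    """Read the current value; only works from inside a Compute Engine VM."""
    url = METADATA_URL + ("?wait_for_change=true" if wait_for_change else "")
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def watch(handler, fetch=fetch_event, max_events=None):
    """Block until the value changes, then pass the action label to handler."""
    seen = 0
    while max_events is None or seen < max_events:
        handler(describe_event(fetch(wait_for_change=True)))
        seen += 1
```

Calling `watch(print)` on a VM would print an action label each time the maintenance event value changes. The `fetch` parameter exists so that the loop can be exercised off-VM with a stub.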
Live migration and local SSDs
Compute Engine can also live migrate instances with local SSDs attached, moving the VMs along with their local SSD to a new machine in advance of any planned maintenance.
Live migration for preemptible instances
You can't configure a preemptible instance to live migrate. The maintenance behavior for preemptible instances is always TERMINATE, and you can't change this option. You also can't set the automatic restart option for preemptible instances, but you can manually restart them from the VM Instance details page after they are preempted:
- Go to the VM instances page.
- Select your preemptible instance.
- At the top of the VM Instance details page, click Start.
If you need to change your instance to no longer be preemptible, detach the boot disk from your preemptible instance and attach it to a new instance that is not configured to be preemptible. You can also create a snapshot of the boot disk and use it to create a new instance without preemptibility.
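The constraints on preemptible instances can be checked before you submit an instance configuration. The sketch below assumes the configuration is available as a dictionary shaped like the `scheduling` block of the Compute Engine instance resource (`preemptible`, `onHostMaintenance`, `automaticRestart`); the `validate_scheduling` helper itself is hypothetical and runs locally, it is not part of any Google API.

```python
def validate_scheduling(scheduling):
    """Return a list of problems found in a scheduling configuration.

    Only the rules for preemptible instances described above are checked.
    """
    problems = []
    if scheduling.get("preemptible"):
        # Preemptible instances always terminate on host maintenance.
        if scheduling.get("onHostMaintenance", "TERMINATE") != "TERMINATE":
            problems.append(
                "preemptible instances cannot live migrate; "
                "onHostMaintenance must be TERMINATE"
            )
        # Automatic restart cannot be enabled for preemptible instances.
        if scheduling.get("automaticRestart", False):
            problems.append(
                "preemptible instances cannot use automaticRestart"
            )
    return problems


if __name__ == "__main__":
    bad = {
        "preemptible": True,
        "onHostMaintenance": "MIGRATE",
        "automaticRestart": True,
    }
    for problem in validate_scheduling(bad):
        print(problem)
```

An empty list means the configuration does not violate either rule; a non-preemptible instance is never flagged by this check.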
What's next
- Set availability policies to configure your instances to live migrate.
- Read tips for designing a robust system that can handle service disruptions.