Designing resilient systems

This document describes best practices for designing resilient systems on Compute Engine. It provides general advice and covers some features in Compute Engine that can help mitigate instance downtime and prepare for times when your virtual machine (VM) instances unexpectedly fail.

A resilient system is a system that can withstand a certain amount of failures or disruptions without interrupting your service or affecting your users' experience using your service. While Compute Engine makes every effort to prevent such disruptions, certain events are unpredictable, and it's best to be prepared for these events.

Types of failures

At some point, one or more of your VMs might be lost due to system or hardware failures. The following list contains some types of failure scenarios that you can mitigate:

Unexpected single VM failure

Unexpected single VM failures can be due to hardware or system failure. You can mitigate these events by using persistent disks and startup scripts to save your data and re-enable software after you restart the VM.
Unexpected single VM reboot

At some point in time, you might experience an unexpected single VM failure and reboot. Unlike an unexpected single VM failure, Compute Engine automatically reboots your VM after it fails. To help mitigate these events, backup your data, use persistent disks, and use startup scripts to quickly re-configure software.
Zone or region failures

Zone and region failures are rare failures that can cause all of your VMs in a given zone or region to be inaccessible or fail. To mitigate these failures, create diversity across regions and zones and implement load balancing. You should also back up your data or replicate your persistent disks across multiple zones.

Tips for designing resilient systems

To help mitigate VM failures, design your application to be resilient against failures, network interruptions, and unexpected disasters. A resilient system gracefully handles failures, for example, by redirecting traffic from an inaccessible VM to a live VM or by automating tasks on reboot.

Here are some general tips to help you design a resilient system against failures.

Use live migration

Google periodically performs maintenance on its infrastructure by patching systems with the latest software, performing routine tests and preventative maintenance, and generally ensuring that its infrastructure is as secure, fast, and efficient as possible. Compute Engine employs live migration to ensure that this infrastructure maintenance is transparent by default to your VMs.

Live migration is a technology that moves your running VMs away from systems that are about to undergo maintenance work. Compute Engine does this automatically.

During live migration, your VM might experience a decrease in performance for a short period of time. For VMs that demand constant, maximum performance, you can use a configuration where they stop and restart on a host not involved in a maintenance event instead. This option is suitable for overall applications that are also built to handle VM failures or reboots.

To configure your VMs for live migration or to configure them to restart instead of migrate, see Setting instance scheduling options.

Distribute your VMs

Create VMs across more than one region and zone so that you have alternative VMs to point to if a zone or region containing one of your VMs is disrupted. If you host all your VMs in the same zone or region, you won't be able to access any of those VMs if that zone or region becomes unreachable.

Use zone-specific internal DNS names

Set the default internal DNS type for your project or organization to zonal DNS. In your applications, use zonal DNS names when accessing other VMs. Internal DNS servers are distributed across all zones, so you can rely on zonal DNS names to resolve even if there are failures in other locations.

Global DNS is less resilient, due to single point failures. Zonal DNS mitigates the risk of cross-regional outages. Zonal DNS does not require VM name uniqueness across all regions in a project, which allows for faster VM creation.

To check if a VM uses zonal DNS names or global DNS names, see Determining the internal DNS name for a VM instance.

If your project uses global DNS names, you can switch to using zonal DNS names. For more information, see Migrating to zonal DNS names.

Create groups of VMs

Use managed instance groups to create homogeneous groups of VMs so that load balancers can direct traffic to more than one VM in case a single VM becomes unhealthy.

Managed instance groups (MIGs) also offer features like autoscaling and autohealing. Autoscaling lets you deal with spikes in traffic by scaling the number of VMs up or down based on specific signals. Autohealing performs health checking and, if necessary, automatically recreates unhealthy VMs.

MIGs are also available for regions, so you can create a group of VMs distributed across multiple zones within a single region. For more information, see Creating and managing regional MIGs.

Use load balancing

Google Cloud offers a load balancing service that helps you support periods of heavy traffic so that you don't overload your VMs. With Cloud Load Balancing, you can do the following:

Deploy your application on VMs within multiple zones using regional MIGs. Then, you can configure a forwarding rule that can spread traffic across all VMs in all zones within the region. Each forwarding rule can define one entry point to your application using an external IP address.
Deploy VMs across multiple regions using global load balancing. HTTP(S) load balancing enables your traffic to enter the Google Cloud system at the location nearest the client. Cross-regional load balancing provides redundancy so that if a region is unreachable, traffic is automatically diverted to another region. In this way, your service remains reachable using the same external IP address.
Use autoscaling to automatically add or delete VMs from a MIG based on increases or decreases in load.

Additionally, Cloud Load Balancing offers VM health checking, providing support in detecting and handling VM failures.

Use startup and shutdown scripts

Compute Engine offers startup and shutdown scripts that run when a VM boots up or shuts down, respectively. Startup and shutdown scripts can automate tasks like installing software, running updates, making backups, and logging data.

Both startup and shutdown scripts are an efficient and invaluable way to bootstrap or cleanly shut down your VMs. Instead of configuring your VMs using custom images, it can be beneficial to configure VMs using startup scripts.

Startup scripts run whenever the VM reboots or restarts due to failures, and can be used to install software and updates. You can also use startup scripts to ensure that services are running within the VM. Coding the changes to configure a VM in a startup script is often easier than trying to figure out what files or bytes have changed on a custom image.

Shutdown scripts run when your VM shuts down, either intentionally or not. They can perform last minute tasks like backing up data, saving logs, and gracefully closing connections before you stop a VM.

For more information, see Running startup scripts and Running shutdown scripts.

Backup your data

Backup your data regularly and in multiple locations. You can upload your files to Cloud Storage, create persistent disk snapshots, or replicate your data to a persistent disk in another region or zone.