Designing a robust system helps mitigate instance downtime and prepares you for times when your instances fail unexpectedly.
Google periodically performs maintenance on its infrastructure: patching systems with the latest software, performing routine tests and preventative maintenance, and generally ensuring that its infrastructure is as fast and efficient as possible. Compute Engine employs live migration to ensure that this infrastructure maintenance is transparent by default to your virtual machine instances.
Live migration is a technology that Google has built to move your running instances away from systems that are about to undergo maintenance work. Compute Engine does this automatically.
During live migration, your instance may experience a decrease in performance for a short period of time. You also have the option to configure your virtual machine instances to terminate and reboot away from the maintenance event. This option is suitable for instances that demand constant, maximum performance, and when your overall application is built to handle instance failures or reboots.
For more information about configuring your virtual machines to terminate and reboot instead of migrate, see Setting Instance Scheduling Options.
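As a minimal sketch of that option, the function below prints a gcloud command that sets an instance to terminate during maintenance and restart automatically afterwards. The instance name and zone are placeholder assumptions, and the function only prints the command so that you can review it before running it.

```shell
# Print the command for review; pipe the output to `sh` to run it.
# Arguments: instance name, zone (both are placeholders here).
scheduling_command() {
  echo "gcloud compute instances set-scheduling $1 --zone $2 --maintenance-policy TERMINATE --restart-on-failure"
}

scheduling_command example-instance us-central1-a
```

Setting the maintenance policy back to MIGRATE restores the default live migration behavior.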
For more information about transparent maintenance in general, see the Transparent maintenance documentation.
Types of failures
At some point, one or more of your instances might be lost due to system or hardware failures. Some of the failures include:
Unexpected single instance failure
Unexpected single instance failures can be due to hardware or system failure. To mitigate these events, use persistent disks and startup scripts to save your data and re-enable software after you restart the instance.
Unexpected single instance reboot
At some point, you will experience an unexpected single instance failure followed by a reboot. Unlike an unexpected single instance failure, in this case your instance fails and is automatically rebooted by the Google Compute Engine service. To help mitigate these events, back up your data, use persistent disks, and use startup scripts to quickly re-configure software.
Zone or region failures
Zone and region failures are rare failures that can cause all of your instances in a given zone or region to be inaccessible or fail.
How to design robust systems
To help mitigate instance failures, you should design your application on the Google Compute Engine service to be robust against failures, network interruptions, and unexpected disasters. A robust system should be able to gracefully handle failures, including redirecting traffic from a downed instance to a live instance or automating tasks on reboot.
Here are some general tips to help you design a robust system against failures.
Distribute your instances
Create instances across more than one region and zone so that you have alternative virtual machine instances to point to if a zone or region containing one of your instances is disrupted. If you host all your instances in the same zone or region, you won’t be able to access any of these instances if that zone or region becomes unreachable.
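One way to spread instances is to create one per zone. The sketch below prints a create command for each zone passed to it. The zone names and the example-instance-N naming scheme are illustrative assumptions, not recommendations.

```shell
# Print one create command per zone; pipe the output to `sh` to run them.
# Instance names are numbered in the order the zones are given.
print_create_commands() {
  i=1
  for zone in "$@"; do
    echo "gcloud compute instances create example-instance-$i --zone $zone"
    i=$((i + 1))
  done
}

# Two zones in one region plus one zone in a second region.
print_create_commands us-central1-a us-central1-b europe-west1-b
```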
Use Google Compute Engine load balancing
Google Compute Engine offers a load balancing service that helps you support periods of heavy traffic so that you don't overload your instances. With the load balancing service, you can:
Deploy your application on instances within multiple zones. Then, you can configure a forwarding rule that can spread traffic across all virtual machine instances in all zones within the region. Each forwarding rule can define one entry point to your application using an external IP address.
Deploy instances across multiple regions. Cross-regional load balancing provides redundancy so that if a region is unreachable, traffic is automatically diverted to another region and your service remains reachable using the same external IP address.
The load balancing service also offers instance health checking, which helps detect and handle instance failures.
For more information, see Google Compute Engine load balancing.
Use startup scripts
Startup scripts are an efficient and invaluable way to bootstrap your instances. If an instance fails, it can bring itself back up using startup scripts, then install and access the appropriate resources as if it never went down. Instead of configuring your instances via custom images, it can be beneficial to configure them using startup scripts. Startup scripts run whenever the instance is rebooted or restarted due to failures, and can be used to install software and updates, and to ensure that services are running within the VM. Coding the changes to configure an instance in a startup script is easier than trying to figure out what files or bytes have changed on a custom image.
gcloud compute instances create example-instance --metadata-from-file startup-script=install-apache.sh
For more information, see startup scripts.
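The contents of install-apache.sh are not shown in this guide. As a hedged sketch, the snippet below writes one possible version of that file. It assumes a Debian-based image where apt-get, dpkg, and the apache2 package are available; adjust the package commands for other images.

```shell
# Write a hypothetical install-apache.sh for use with --metadata-from-file.
# Assumes a Debian-based image where apt-get and the apache2 package exist.
cat > install-apache.sh <<'EOF'
#!/bin/bash
# Startup scripts run on every boot, so keep each step safe to repeat.

# Install Apache only if it is not already present.
if ! dpkg -s apache2 >/dev/null 2>&1; then
  apt-get update
  apt-get install -y apache2
fi

# Make sure the web server is running after a reboot or restart.
service apache2 start
EOF
```

Because the script checks for the package before installing it, repeated boots skip the installation step and only confirm that the service is running.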
Back up your data
If you need access to data on a virtual machine instance or persistent disk that is in a zone scheduled to be taken offline, you can back up your files to Google Cloud Storage, your local computer, or migrate your data to another persistent disk in another region or zone.
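For the persistent disk case, one possible approach is to snapshot the disk and create a new disk from that snapshot in another zone. The sketch below prints the two commands involved. All disk, snapshot, and zone names are placeholder assumptions.

```shell
# Print the two commands for review; pipe the output to `sh` to run them.
# Arguments: source disk, source zone, snapshot name, new disk, target zone.
disk_copy_commands() {
  echo "gcloud compute disks snapshot $1 --zone $2 --snapshot-names $3"
  echo "gcloud compute disks create $4 --zone $5 --source-snapshot $3"
}

disk_copy_commands example-disk us-central1-a example-snapshot example-disk-copy europe-west1-b
```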
To copy files from an instance to Google Cloud Storage:
Log in to your instance using the gcloud compute tool:
gcloud compute ssh example-instance
If you have never used gsutil on this instance, set up your credentials.
Alternatively, if you have set up your instance to use a service account with a Google Cloud Storage scope, you can skip this and the next step.
Follow the instructions to authenticate to Google Cloud Storage.
Copy your data to Google Cloud Storage by using the following command:
gsutil cp <file1> <file2> <file3> ... gs://<your bucket>
You can also use the gcloud compute tool to copy files to a local computer. For more information, see Copying files to or from an instance.
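As a rough sketch of copying to a local computer, the function below prints a gcloud compute scp command that downloads one file from an instance. The instance name, remote path, and zone are placeholder assumptions.

```shell
# Print the command for review; pipe the output to `sh` to run it.
# Arguments: instance name, remote path, local destination, zone.
scp_command() {
  echo "gcloud compute scp $1:$2 $3 --zone $4"
}

# The remote path is quoted so that the local shell does not expand "~".
scp_command example-instance '~/backup.tar.gz' . us-central1-a
```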