Disaster recovery building blocks

Last reviewed 2022-06-10 UTC

This document is the second part of a series that discusses disaster recovery (DR) in Google Cloud. This part discusses services and products that you can use as building blocks for your DR plan—both Google Cloud products and products that work across platforms.

The series consists of these parts:

Disaster recovery planning guide
Disaster recovery building blocks (this article)
Disaster recovery scenarios for data
Disaster recovery scenarios for applications
Architecting disaster recovery for locality-restricted workloads
Disaster recovery use cases: locality-restricted data analytic applications
Architecting disaster recovery for cloud infrastructure outages

Introduction

Google Cloud has a wide range of products that you can use as part of your disaster recovery (DR) architecture. This section discusses DR-related features of the products that are most commonly used as Google Cloud DR building blocks.

Many of these services have high availability (HA) features. HA doesn't entirely overlap with DR, but many of the goals of HA also apply to designing a DR plan. For example, by taking advantage of HA features, you can design architectures that optimize uptime and that can mitigate the effects of small-scale failures, such as a single VM failing. For more about the relationship of DR and HA, see the Disaster recovery planning guide.

The following sections describe these Google Cloud DR building blocks and how they help you implement your DR goals.

Compute and storage

Compute Engine: Scalable compute resources; Predefined and custom machine types; Fast boot times; Snapshots; Instance templates; Managed instance groups; Reservations; Persistent disks; Live migration
Cloud Storage: Highly durable object store; Redundancy across regions; Storage classes; Object lifecycle management; Data transfer from other sources; Encryption at rest by default; Soft deletion
GKE: Managed environment for deploying and scaling containerized applications; Node auto-repair; Liveness and readiness probes; Persistent volumes; Multi-zone and regional clusters; Command-line tool for managing cross-regional clusters

Compute Engine

Compute Engine provides virtual machine (VM) instances; it's the workhorse of Google Cloud. In addition to configuring, launching, and monitoring Compute Engine instances, you typically use a variety of related features in order to implement a DR plan.

For DR scenarios, you can prevent accidental deletion of VMs by setting the delete protection flag. This is particularly useful where you are hosting stateful services such as databases. To help meet low RTO and RPO values, follow the best practices for designing robust systems.

You can configure an instance with your application preinstalled, and then save that configuration as a custom image. How much you preinstall in the custom image affects how quickly a new instance becomes operational, so design the image around the RTO you want to achieve.

Instance templates

You can use Compute Engine instance templates to save the configuration details of the VM and then create instances from existing instance templates. You can use the template to launch as many instances as you need, configured exactly the way you want when you need to stand up your DR target environment. Instance templates are globally replicated, so you can recreate the instance anywhere in Google Cloud with the same configuration.

You can create instance templates using a custom image or based on existing VM instances.

We provide more details about using Compute Engine images in the balancing image configuration and deployment speed section later in this document.

Managed instance groups

Managed instance groups work with Cloud Load Balancing (discussed later in this document) to distribute traffic to groups of identically configured instances that are copied across zones. Managed instance groups allow for features like autoscaling and autohealing, where the managed instance group can delete and recreate instances automatically.

Reservations

Compute Engine lets you reserve VM instances in a specific zone, using custom or predefined machine types, with or without additional GPUs or local SSDs. To ensure capacity for your mission-critical workloads for DR, create reservations in your DR target zones. Without reservations, you might not get the on-demand capacity that you need to meet your recovery time objective. Reservations can be useful in cold, warm, or hot DR scenarios. They let you keep recovery resources available for failover to meet lower RTO needs, without having to fully configure and deploy them in advance.

Persistent disks and snapshots

Persistent disks are durable network storage devices that your instances can access. They are independent of your instances, so you can detach and move persistent disks to keep your data even after you delete your instances.

You can take incremental backups or snapshots of Compute Engine VMs that you can copy across regions and use to recreate persistent disks in the event of a disaster. Additionally, you can create snapshots of persistent disks to protect against data loss due to user error. Snapshots are incremental, and take only minutes to create even if your snapshot disks are attached to running instances.

Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Persistent disks are either zonal or regional. Regional persistent disks replicate writes across two zones in a region. In the event of a zonal outage, a backup VM instance can force-attach a regional persistent disk in the secondary zone. To learn more, see High availability options using regional persistent disks.

Live Migration

Live Migration keeps your VM instances running even when a host system event occurs, such as a software or hardware update. Compute Engine live migrates your running instances to another host in the same zone rather than requiring your VMs to be rebooted. This allows Google to perform maintenance that is integral to keeping infrastructure protected and reliable without interrupting any of your VMs.

Virtual disk import tool

The virtual disk import tool lets you import file formats including VMDK, VHD, and RAW to create new Compute Engine virtual machines. Using this tool, you can create Compute Engine virtual machines that have the same configuration as your on-premises virtual machines. This is a good approach for when you are not able to configure Compute Engine images from the source binaries of software that's already installed on your images.

Cloud Storage

Cloud Storage is an object store that's ideal for storing backup files. It provides different storage classes that are suited for specific use cases, as outlined in the following diagram.

Diagram showing Standard storage for high-frequency access, Nearline and Coldline for low-frequency access, and Archive for lowest-frequency access

In DR scenarios, Nearline, Coldline, and Archive storage are of particular interest. These storage classes reduce your storage cost compared to Standard storage. However, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for. Nearline is designed for backup scenarios where access is at most once a month, which is ideal for allowing you to undertake regular DR stress tests while keeping costs low.

Nearline, Coldline, and Archive are optimized for infrequent access, and the pricing model is designed with this in mind. You are charged for a minimum storage duration in each class, so deleting or replacing an object before that duration has elapsed incurs an early deletion charge, and retrieving data stored in these classes incurs an additional cost.
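To make the effect of minimum storage durations concrete, the following is a minimal sketch of how many days of storage are billed for an object. It assumes the minimum durations published for each class at the time of writing (30 days for Nearline, 90 for Coldline, 365 for Archive); the function name is illustrative, and you should confirm the current values on the Cloud Storage pricing page.

```python
# Assumed minimum storage durations, in days, for each Cloud Storage class.
MIN_STORAGE_DAYS = {
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}


def billable_storage_days(storage_class: str, days_stored: int) -> int:
    """Return the number of days of storage billed for a single object.

    An object deleted before the class minimum is still billed for the
    full minimum duration.
    """
    if days_stored < 0:
        raise ValueError("days_stored must be non-negative")
    minimum = MIN_STORAGE_DAYS[storage_class.upper()]
    return max(days_stored, minimum)


if __name__ == "__main__":
    # A backup that is deleted after 10 days in each class.
    for name in MIN_STORAGE_DAYS:
        print(name, billable_storage_days(name, 10))
```

In this model, a backup that you delete after 10 days is billed for 10 days in Standard but for 30 days in Nearline, which is why short-lived backups often belong in Standard storage.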

To protect your data in a Cloud Storage bucket against accidental or malicious deletion, you can use the Soft Delete feature to preserve deleted and overwritten objects for a specified period.

Storage Transfer Service lets you import data from Amazon S3 or HTTP-based sources into Cloud Storage. In DR scenarios, you can use Storage Transfer Service to:

Back up data from other storage providers to a Cloud Storage bucket.
Move data from a bucket in a dual-region or multi-region to a bucket in a region to lower your costs for storing backups.

Filestore

Filestore instances are fully managed NFS file servers for use with applications running on Compute Engine instances or GKE clusters.

Filestore instances are zonal and don't support replication across zones. A Filestore instance is unavailable if the zone it resides in is down. We recommend that you periodically back up your data by syncing your Filestore volume to a Filestore instance in another region using the gsutil rsync command. This requires a job to be scheduled to run on Compute Engine instances or GKE clusters.

In DR scenarios, applications can resume access to Filestore volumes quickly by switching to the Filestore instance in the failover region, without needing to wait for any restore process to complete. The RPO value of this DR solution is largely dependent on the frequency of the scheduled job, because data written after the most recent sync is not yet present in the failover region.
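The scheduled sync job can be sketched as follows. This is an illustrative outline only: it assumes that both Filestore shares are already mounted on the machine that runs the job, the mount paths are hypothetical, and the helper names are not part of any Google Cloud library. The second function estimates how much recent data could be missing from the failover copy for a given schedule.

```python
from datetime import timedelta
from typing import List


def build_sync_command(source_dir: str, target_dir: str,
                       mirror_deletes: bool = False) -> List[str]:
    """Build a gsutil rsync command that copies source_dir into target_dir."""
    command = ["gsutil", "rsync", "-r"]
    if mirror_deletes:
        # -d also deletes files in the target that no longer exist in the source.
        command.append("-d")
    command.extend([source_dir, target_dir])
    return command


def worst_case_data_loss(sync_interval: timedelta,
                         sync_duration: timedelta) -> timedelta:
    """Estimate the longest window of writes missing from the failover copy.

    A write made just after one sync starts is not copied until the next
    sync finishes, one interval plus one run duration later.
    """
    return sync_interval + sync_duration


if __name__ == "__main__":
    # Hypothetical mount points for the primary and failover shares.
    print(" ".join(build_sync_command("/mnt/filestore-primary", "/mnt/filestore-dr")))
    print(worst_case_data_loss(timedelta(hours=1), timedelta(minutes=10)))
```

You could pass the command list to subprocess.run from a cron job or a Kubernetes CronJob. Use the mirror_deletes option with care, because it propagates accidental deletions to the failover copy.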

GKE

GKE is a managed, production-ready environment for deploying containerized applications. GKE lets you orchestrate HA systems, and includes the following features:

Node auto repair. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), GKE initiates a repair process for that node.
Liveness probe. You can specify a liveness probe, which periodically tells GKE that the pod is running. If the pod fails the probe, it can be restarted.
Persistent volumes. Databases must be able to persist beyond the life of a container. By using the persistent volume abstraction, which maps to a Compute Engine persistent disk, you can maintain storage availability independently of the individual containers.
Multi-zone and regional clusters. You can distribute Kubernetes resources across multiple zones within a region.
Multi-cluster Gateway. You can configure shared load balancing resources across multiple GKE clusters in different regions.
Backup for GKE. You can back up and restore workloads in GKE clusters.
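As an illustration of the liveness probe feature in the preceding list, the following sketch shows an HTTP endpoint that a containerized Python service might expose for an HTTP liveness probe. The /healthz path, the port, and the app_is_alive check are assumptions chosen for the example. Kubernetes treats an HTTP status from 200 through 399 as success and any other status as failure.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def liveness_response(is_alive: Callable[[], bool]) -> Tuple[int, bytes]:
    """Map an application-level check to an HTTP status code and body."""
    if is_alive():
        return 200, b"ok"
    return 503, b"unhealthy"


def app_is_alive() -> bool:
    """Hypothetical check; replace with a real test such as a worker heartbeat."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    """Serve the liveness result on /healthz and 404 for other paths."""

    def do_GET(self) -> None:
        if self.path == "/healthz":
            status, body = liveness_response(app_is_alive)
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the health endpoint until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

To use the endpoint, call serve() from the container's entry point and point the pod's livenessProbe httpGet at the same path and port.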

Networking and data transfer

Cloud Load Balancing: Health checks; Single Anycast IP; Cross-region; Cloud CDN integration; Autoscaling integration
Traffic Director: Google-managed Global L7 ILB; Control plane for xDSv2-compliant open service proxies; Supports VMs and containers; Health check offloading; Rapid autoscaling; Advanced request routing and rich traffic-control policies
Cloud DNS: Programmatic DNS management; Access control; Anycast to serve zones; DNS policies
Cloud Interconnect: Cloud VPN (IPsec VPN); Direct peering

Cloud Load Balancing

Cloud Load Balancing provides HA for Compute Engine by distributing user requests among a set of instances. You can configure Cloud Load Balancing with health checks that determine whether instances are available to do work so that traffic is not routed to failing instances.

Cloud Load Balancing provides a single globally accessible IP address to front your Compute Engine instances. Your application can have instances running in different regions (for example, in Europe and in the US), and your end users are directed to the closest set of instances. In addition to providing load balancing for services that are exposed to the internet, you can configure internal load balancing for your services behind a private load-balancing IP address. This IP address is accessible only to VM instances that are internal to your Virtual Private Cloud (VPC).
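The routing behavior described here can be pictured with a small conceptual model. This sketch is not how Cloud Load Balancing is implemented; it only illustrates the idea that a request goes to the closest region that still has healthy instances. The function name and the region preference list are assumptions for the example.

```python
from typing import Dict, List, Optional


def route_request(preferred_regions: List[str],
                  healthy_instances: Dict[str, int]) -> Optional[str]:
    """Return the first region, in order of proximity, with healthy instances.

    preferred_regions is ordered from closest to farthest for the client.
    healthy_instances maps a region name to its count of healthy instances.
    """
    for region in preferred_regions:
        if healthy_instances.get(region, 0) > 0:
            return region
    # No region can serve the request.
    return None


if __name__ == "__main__":
    # A European client during an outage that affects europe-west1.
    health = {"europe-west1": 0, "us-central1": 3}
    print(route_request(["europe-west1", "us-central1"], health))
```

In this model, clients keep using the same IP address during a regional outage, and only the serving region changes.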

Traffic Director

With Traffic Director, you can deploy a fully managed traffic control plane for your service mesh. Traffic Director manages the configuration of service proxies running in both Compute Engine and GKE. To make a service highly available, deploy it in multiple regions. Traffic Director offloads service health checks and initiates a failover configuration of the service proxies, which redirects traffic to healthy instances.

Traffic Director also supports advanced traffic control concepts, including circuit breaking and fault injection. With circuit breaking, you can enforce limits on requests to a particular service; after the limit is reached, further requests are prevented from reaching the service, which keeps the service from degrading further. With fault injection, Traffic Director can introduce delays or abort a fraction of requests to a service, so that you can test your service's ability to survive request delays or aborted requests.
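The two concepts can be illustrated with a short sketch. This is a conceptual model written for this discussion, not Traffic Director's implementation or configuration format; in practice you configure these behaviors declaratively and the service proxies enforce them. The class and function names are hypothetical.

```python
import random
from typing import Optional


class CircuitBreaker:
    """Reject new requests once the number of in-flight requests hits a limit."""

    def __init__(self, max_requests: int) -> None:
        self.max_requests = max_requests
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit a request if capacity remains; otherwise reject it."""
        if self.in_flight >= self.max_requests:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


def should_abort(abort_percent: float,
                 rng: Optional[random.Random] = None) -> bool:
    """Fault injection: return True for roughly abort_percent of calls."""
    source = rng if rng is not None else random.Random()
    return source.random() * 100 < abort_percent


if __name__ == "__main__":
    breaker = CircuitBreaker(max_requests=2)
    # The third request is rejected because two are already in flight.
    print([breaker.try_acquire() for _ in range(3)])
```

Rejecting excess requests quickly lets callers fall back or retry elsewhere instead of queuing behind an overloaded service.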

Cloud DNS

Cloud DNS provides a programmatic way to manage your DNS entries as part of an automated recovery process. Cloud DNS uses Google's global network of Anycast name servers to serve your DNS zones from redundant locations around the world, providing high availability and lower latency for your users.

If you choose to manage DNS entries on-premises, you can enable VMs in Google Cloud to resolve these addresses through Cloud DNS forwarding.

Cloud DNS supports policies to configure how it responds to DNS requests. For example, you can configure routing policies to enable failover to a backup configuration to provide high availability, or to route DNS requests based on their geographic location.

Cloud Interconnect

Cloud Interconnect provides ways to move information from other sources to Google Cloud. We discuss this product later under Transferring data to and from Google Cloud.

Management and monitoring

Cloud Status Dashboard: Status of Google Cloud services
Google Cloud Observability: Uptime monitoring; Alerts; Logging; Error reporting

Cloud Status Dashboard

The Cloud Status Dashboard shows you the current availability of Google Cloud services. You can view the status on the page, and you can subscribe to an RSS feed that is updated whenever there is news about a service.

Cloud Monitoring

Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, application instrumentation, and a variety of other application components. You can configure alerting to send notifications to third-party tools such as Slack or PagerDuty to provide timely updates to administrators. Another way to use Cloud Monitoring for DR is to configure a Pub/Sub sink and use Cloud Functions to trigger an automated process in response to a Cloud Monitoring alert.
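The following is a minimal sketch of such a function. It assumes that the alert is delivered to a Pub/Sub topic, that the function uses the background-function signature (event, context) with the message body base64-encoded in event["data"], and that the JSON payload contains an incident object with state and policy_name fields. Verify the payload format against a real notification before relying on these field names. The start_failover helper is a hypothetical placeholder for your recovery automation.

```python
import base64
import json
from typing import Any, Dict, Optional


def start_failover(policy_name: str) -> str:
    """Hypothetical placeholder for the automated recovery step."""
    return f"failover started for alert policy: {policy_name}"


def handle_alert(event: Dict[str, Any],
                 context: Optional[Any] = None) -> Optional[str]:
    """Entry point for a Pub/Sub-triggered function.

    Decodes the alert notification and starts recovery only when the
    incident is open, so that "closed" notifications are ignored.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})
    if incident.get("state") != "open":
        return None
    return start_failover(incident.get("policy_name", "unknown"))


if __name__ == "__main__":
    # Simulate the message that Pub/Sub would deliver to the function.
    sample = {"incident": {"state": "open", "policy_name": "db-uptime"}}
    message = {"data": base64.b64encode(json.dumps(sample).encode("utf-8"))}
    print(handle_alert(message))
```

Keeping the recovery step in a separate function makes it easier to test the automation without publishing real alerts.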

Cross-platform DR building blocks

When you run workloads across more than one platform, a way to reduce the operational overhead is to select tooling that works with all of the platforms you're using. This section discusses some tools and services that are platform-independent and therefore support cross-platform DR scenarios.

Declarative templating tools

Declarative templating tools let you automate the deployment of infrastructure across platforms. Terraform is a popular declarative templating tool.

Configuration management tools

For large or complex DR infrastructure, we recommend platform-agnostic software management tools like Chef and Ansible. These tools ensure that reproducible configurations can be applied no matter where your compute workload is.

Object storage

A common DR pattern is to have copies of objects in object stores in different cloud providers. One cross-platform tool for this is boto, which is an open source Python library that lets you interface with both Amazon S3 and Cloud Storage.

Orchestrator tools

Containers can also be considered a DR building block. Containers are a way to package services and introduce consistency across platforms.

If you work with containers, you typically use an orchestrator. Kubernetes not only manages containers within Google Cloud (using GKE), but also provides a way to orchestrate container-based workloads across multiple platforms. Google Cloud, AWS, and Microsoft Azure all provide managed versions of Kubernetes.

To distribute traffic to Kubernetes clusters running in different cloud platforms, you can use a DNS service that supports weighted records and incorporates health checking.
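As a conceptual illustration of weighted records combined with health checking, the following sketch picks an endpoint in proportion to its weight and skips endpoints that fail their health check. It models the behavior only and does not call any DNS provider's API; the endpoint names and weights are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_endpoint(weights: Dict[str, int],
                  healthy: Dict[str, bool],
                  rng: Optional[random.Random] = None) -> Optional[str]:
    """Choose an endpoint by weight, considering only healthy endpoints."""
    source = rng if rng is not None else random.Random()
    candidates = {
        name: weight
        for name, weight in weights.items()
        if weight > 0 and healthy.get(name, False)
    }
    if not candidates:
        return None
    names = list(candidates)
    return source.choices(names, weights=[candidates[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical clusters: 80% of traffic to one platform, 20% to another.
    weights = {"gke.example.com": 80, "other-cloud.example.com": 20}
    health = {"gke.example.com": False, "other-cloud.example.com": True}
    print(pick_endpoint(weights, health))
```

When one cluster becomes unhealthy, all traffic shifts to the remaining healthy cluster, which is the failover behavior that you want from the DNS layer.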

You also need to ensure you can pull the image to the target environment. This means you need to be able to access your image registry in the event of a disaster. A good option that's also platform-independent is Artifact Registry.

Data transfer

Data transfer is a critical component of cross-platform DR scenarios. Make sure that you design, implement, and test your cross-platform DR scenarios using realistic mockups of what the DR data transfer scenario calls for. We discuss data transfer scenarios in the next section.

Patterns for DR

This section discusses some of the most common patterns for DR architectures based on the building blocks discussed earlier.

Transferring data to and from Google Cloud

An important aspect of your DR plan is how quickly data can be transferred to and from Google Cloud. This is critical if your DR plan is based on moving data from on-premises to Google Cloud or from another cloud provider to Google Cloud. This section discusses networking and Google Cloud services that can ensure good throughput.

When you are using Google Cloud as the recovery site for workloads that are on-premises or on another cloud environment, consider the following key items:

How do you connect to Google Cloud?
How much bandwidth is there between you and the interconnect provider?
What is the bandwidth provided by the provider directly to Google Cloud?
What other data will be transferred using that link?

If you use a public internet connection to transfer data, network throughput is unpredictable, because you're limited by the ISP's capacity and routing. The ISP might offer a limited SLA, or none at all. On the other hand, these connections have relatively low costs.

Cloud Interconnect provides several options to connect to Google and Google Cloud:

Cloud VPN enables the creation of IPsec VPN tunnels between a Google Cloud VPC network and a target network. Traffic traveling between the two networks is encrypted by one VPN gateway, then decrypted by the other VPN gateway. HA VPN lets you create high-availability VPN connections with an SLA of 99.99%, plus a simplified setup compared to creating redundant VPNs.
Direct peering provides minimal network hops to Google's public IP addresses. You can use direct peering to exchange internet traffic between your network and Google's edge points of presence (PoPs).
Dedicated Interconnect provides a direct physical connection between your on-premises network and Google's network. It provides an SLA along with more consistent throughput for large data transfers. Circuits are either 10 Gbps or 100 Gbps and are terminated at one of Google's colocation facilities. With larger bandwidth, you can reduce the time it takes to transfer data from on-premises to Google Cloud. The following table illustrates the speed gains when upgrading from 10 Gbps to 100 Gbps.
Partner Interconnect provides capabilities similar to those of Dedicated Interconnect, but at circuit speeds of less than 10 Gbps. See Supported service providers.

The following diagram provides guidance about which transfer method to use, depending on how much data you need to transfer to Google Cloud.

Chart showing amount of data on the Y axis (0 to past 100 TB) and data location categories on the X axis (for example, 'In Google Cloud', 'On-premises with good connectivity', etc.), with different transfer solutions in each category

You can use the transfer time calculator to understand how much time a transfer might take, given the size of the dataset you're moving and the bandwidth available for the transfer. For more information about data transfer as part of your DR planning, see Transferring big datasets.
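For a rough estimate, you can compute the transfer time directly from the dataset size and the link speed. The following sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and uses an illustrative utilization parameter for the fraction of the link that the transfer actually achieves. Real transfers are also affected by protocol overhead, latency, and other traffic on the link.

```python
def transfer_time_hours(data_terabytes: float, link_gbps: float,
                        utilization: float = 1.0) -> float:
    """Estimate the hours needed to move data_terabytes over a link.

    utilization is the fraction (0 to 1] of the nominal link speed that
    the transfer achieves.
    """
    if data_terabytes < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("invalid size, link speed, or utilization")
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


if __name__ == "__main__":
    # Compare a 100 TB transfer over 10 Gbps and 100 Gbps circuits.
    for gbps in (10, 100):
        print(f"{gbps} Gbps: {transfer_time_hours(100, gbps):.1f} hours")
```

Under these assumptions, a 100 TB dataset takes about 22 hours at 10 Gbps and about 2 hours at 100 Gbps, before accounting for any overhead.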

Balancing image configuration and deployment speed

When you configure a machine image for deploying new instances, consider the effect that your configuration will have on the speed of deployment. There is a tradeoff between the amount of image preconfiguration, the costs of maintaining the image, and the speed of deployment. For example, if a machine image is minimally configured, the instances that use it will require more time to launch, because they need to download and install dependencies. On the other hand, if your machine image is highly configured, the instances that use it launch more quickly, but you must update the image more frequently. The time taken to launch a fully operational instance will have a direct correlation to your RTO.

Diagram showing 3 levels of bundling (unbundled to fully bundled) mapped against image boot time (the most bundled is the fastest to boot)

Maintaining machine image consistency across hybrid environments

If you implement a hybrid solution (on-premises-to-cloud or cloud-to-cloud), you need to find a way to maintain VM consistency across production environments.

If a fully configured image is required, consider something like Packer, which can create identical machine images for multiple platforms. You can use the same scripts with platform-specific configuration files. In the case of Packer, you can put the configuration file in version control to keep track of what version is deployed in production.

As another option, you can use configuration management tools such as Chef, Puppet, Ansible, or SaltStack to configure instances with finer granularity, creating base images, minimally configured images, or fully configured images as needed. For a discussion of how to use these tools effectively, see Zero-to-Deploy with Chef on Google Cloud.

You can also manually convert and import existing images such as Amazon AMIs, VirtualBox images, and RAW disk images to Compute Engine.

Implementing tiered storage

The tiered storage pattern is typically used for backups: the most recent backup is kept on faster storage, and older backups are gradually migrated to lower-cost (but slower) storage. There are two ways to implement the pattern using Cloud Storage, depending on where your data originates: on Google Cloud or on-premises. In both cases, you migrate objects between buckets of different storage classes, typically from Standard to a lower-cost storage class such as Nearline.

Diagram showing data migrating from a persistent disk to Standard storage to Nearline storage

If your source data is generated on-premises, the implementation looks similar to the following diagram:

Diagram showing data migrating from on-premises through Cloud Interconnect to Cloud Storage

Alternatively, you can change the storage class of the objects in a bucket using object lifecycle rules to automate the change in object class.
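A lifecycle configuration for this kind of tiering might look like the following sketch, which builds the JSON document in Python. The SetStorageClass action and the age and matchesStorageClass conditions are part of the Cloud Storage lifecycle configuration format; the 30-day and 90-day thresholds and the helper names are assumptions chosen for illustration.

```python
import json
from typing import Any, Dict, List


def tiering_rule(age_days: int, from_classes: List[str],
                 to_class: str) -> Dict[str, Any]:
    """Build one rule that moves objects to a colder storage class."""
    return {
        "action": {"type": "SetStorageClass", "storageClass": to_class},
        # age is measured in days since the object was created.
        "condition": {"age": age_days, "matchesStorageClass": from_classes},
    }


def lifecycle_config() -> Dict[str, Any]:
    """Return a two-step tiering policy: Standard, then Nearline, then Coldline."""
    return {
        "rule": [
            tiering_rule(30, ["STANDARD"], "NEARLINE"),
            tiering_rule(90, ["NEARLINE"], "COLDLINE"),
        ]
    }


if __name__ == "__main__":
    # Print the document so that it can be saved to a file and applied.
    print(json.dumps(lifecycle_config(), indent=2))
```

You can save the output to a file and apply it to a bucket, for example with the gsutil lifecycle set command. Because rules are evaluated against object age, pick thresholds that respect the minimum storage duration of each class.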

Maintaining the same IP address for private instances

A common pattern is to maintain a single serving instance of a VM. If the VM has to be replaced, the replacement needs to appear as if it was the original VM. Therefore, the IP address that clients use to connect with the new instance should remain the same.

The simplest configuration is to set up a managed instance group that maintains exactly one instance. This managed instance group is integrated with an internal (private) load balancer that ensures that the same IP address is used to front the instance regardless of whether it's the original image or a replacement.

Technology partners

Google has a robust partner ecosystem that supports backup and DR use cases with Google Cloud. In particular, we see customers using partner solutions to do the following:

Back up data from on-premises to Google Cloud. In these cases, Cloud Storage is integrated as a storage target for most on-premises backup platforms. You can use this approach to replace tape and other storage appliances.
Implement a DR plan that goes from on-premises to Google Cloud. Our partners can help eliminate secondary data centers and use Google Cloud as the DR site.
Implement DR and backup for cloud-based workloads.

For more information about partner solutions, see the Partners page on the Google Cloud website.

What's next

Read about Google Cloud geography and regions.
Read the other articles in this DR series, which are listed at the beginning of this document.
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.