This article is the second part of a series that discusses disaster recovery (DR) in Google Cloud. This part discusses services and products that you can use as building blocks for your DR plan—both Google Cloud products and products that work across platforms.
The series consists of these parts:
- Disaster recovery planning guide
- Disaster recovery building blocks (this article)
- Disaster recovery scenarios for data
- Disaster recovery scenarios for applications
- Architecting disaster recovery for locality-restricted workloads
- Architecting disaster recovery for cloud infrastructure outages
Google Cloud has a wide range of products that you can use as part of your disaster recovery (DR) architecture. This section discusses DR-related features of the products that are most commonly used as Google Cloud DR building blocks.
Many of these services have high availability (HA) features. HA doesn't entirely overlap with DR, but many of the goals of HA also apply to designing a DR plan. For example, by taking advantage of HA features, you can design architectures that optimize uptime and that can mitigate the effects of small-scale failures, such as a single VM failing. For more about the relationship of DR and HA, see the Disaster recovery planning guide.
The following sections describe these Google Cloud DR building blocks and how they help you implement your DR goals.
Compute and storage
Compute Engine provides virtual machine (VM) instances; it's the workhorse of Google Cloud. In addition to configuring, launching, and monitoring Compute Engine instances, you typically use a variety of related features in order to implement a DR plan.
For DR scenarios, you can prevent accidental deletion of VMs by setting the deletion protection flag. This is particularly useful where you are hosting stateful services such as databases. To help meet low RTO and RPO values, follow the best practices for designing robust systems.
You can configure an instance with your application preinstalled, and then save that configuration as a custom image. Your custom image can reflect the RTO you want to achieve.
You can use Compute Engine instance templates to save the configuration details of the VM and then create instances from existing instance templates. You can use the template to launch as many instances as you need, configured exactly the way you want when you need to stand up your DR target environment. Instance templates are globally replicated, so you can easily recreate the instance anywhere in Google Cloud with the same configuration.
We provide more details about using Compute Engine images in the balancing image configuration and deployment speed section later in this document.
Managed instance groups
Managed instance groups work with Cloud Load Balancing (discussed later in this document) to distribute traffic to groups of identically configured instances that are copied across zones. Managed instance groups allow for features like autoscaling and autohealing, where the managed instance group can delete and recreate instances automatically.
Compute Engine lets you reserve VM instances in a specific zone, using custom or predefined machine types, with or without additional GPUs or local SSDs. To ensure capacity for your mission-critical workloads during DR, create reservations in your DR target zones. Without reservations, you might not get the on-demand capacity you need to meet your recovery time objective. Reservations can be useful in cold, warm, or hot DR scenarios. They let you keep recovery resources available for failover to meet lower RTO needs, without having to fully configure and deploy them in advance.
Persistent disks and snapshots
Persistent disks are durable network storage devices that your instances can access. They are independent of your instances, so you can detach and move persistent disks to keep your data even after you delete your instances.
You can take incremental backups or snapshots of Compute Engine VMs that you can copy across regions and use to recreate persistent disks in the event of a disaster. Additionally, you can create snapshots of persistent disks to protect against data loss due to user error. Snapshots are incremental, and take only minutes to create even if your snapshot disks are attached to running instances.
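To make the idea of incremental snapshots concrete, the following is a toy sketch in Python, not a description of how Compute Engine implements snapshots internally. It assumes a disk can be modeled as a dictionary of numbered blocks; the function names `take_snapshot` and `restore` are hypothetical. The first snapshot stores every block, each later snapshot stores only the blocks that changed, and a restore overlays the chain from oldest to newest.

```python
from typing import Dict, List

Disk = Dict[int, bytes]  # block index -> block contents (toy model of a disk)


def take_snapshot(disk: Disk, previous_state: Disk) -> Disk:
    """Return only the blocks that differ from the previously captured state."""
    return {
        index: data
        for index, data in disk.items()
        if previous_state.get(index) != data
    }


def restore(snapshot_chain: List[Disk]) -> Disk:
    """Rebuild a disk by overlaying snapshots from oldest to newest."""
    disk: Disk = {}
    for delta in snapshot_chain:
        disk.update(delta)
    return disk


disk = {0: b"boot", 1: b"data-v1", 2: b"logs-v1"}
full = take_snapshot(disk, previous_state={})  # first snapshot holds every block

disk[1] = b"data-v2"  # only one block changes before the next snapshot
incremental = take_snapshot(disk, previous_state=restore([full]))

restored = restore([full, incremental])
print(len(full), len(incremental), restored == disk)
```

The sketch ignores deleted blocks and snapshot deletion, which a real implementation has to handle. It only illustrates why later snapshots are small and quick to create.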
Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events. Persistent disks are either zonal or regional. Regional persistent disks replicate writes across two zones in a region. In the event of a zonal outage, a backup VM instance can force-attach a regional persistent disk in the secondary zone. To learn more, see High availability options using regional persistent disks.
Live Migration keeps your VM instances running even when a host system event occurs, such as a software or hardware update. Compute Engine live migrates your running instances to another host in the same zone rather than requiring your VMs to be rebooted. This allows Google to perform maintenance that is integral to keeping infrastructure protected and reliable without interrupting any of your VMs.
Virtual disk import tool
The virtual disk import tool lets you import file formats including VMDK, VHD, and RAW to create new Compute Engine virtual machines. Using this tool, you can create Compute Engine virtual machines that have the same configuration as your on-premises virtual machines. This is a good approach for when you are not able to configure Compute Engine images from the source binaries of software that's already installed on your images.
In DR scenarios, Nearline, Coldline, and Archive Storage are of particular interest. These storage classes reduce your storage cost compared to Standard storage. However, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for. Nearline is designed for backup scenarios where access is at most once a month, which is ideal for allowing you to undertake regular DR stress tests while keeping costs low.
Nearline, Coldline, and Archive are optimized for infrequent access, and the pricing model is designed with this in mind. Therefore, you are charged for the minimum storage duration of the class even if you delete or replace an object earlier, and retrieving data or metadata stored in these classes incurs additional costs.
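The trade-off between lower storage prices, retrieval costs, and minimum storage durations can be sketched with a small calculation. Every number in the following Python example is an assumption made up for illustration, including the prices and the minimum durations; check the current Cloud Storage pricing page before relying on any of them. The names `StorageClass` and `backup_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    name: str
    price_per_gb_month: float  # hypothetical rate, for illustration only
    retrieval_per_gb: float    # hypothetical rate, for illustration only
    min_duration_days: int     # assumed minimum billed storage duration


# Illustrative values only; these are not real prices.
STANDARD = StorageClass("STANDARD", 0.020, 0.00, 0)
NEARLINE = StorageClass("NEARLINE", 0.010, 0.01, 30)


def backup_cost(cls: StorageClass, size_gb: float, days_stored: int,
                restores: int = 0) -> float:
    """Estimate the cost of keeping one backup and restoring it `restores` times."""
    # You pay for at least the minimum duration, even if you delete earlier.
    billed_days = max(days_stored, cls.min_duration_days)
    storage = size_gb * cls.price_per_gb_month * billed_days / 30
    retrieval = size_gb * cls.retrieval_per_gb * restores
    return storage + retrieval


for days, restores in [(10, 1), (90, 0)]:
    for cls in (STANDARD, NEARLINE):
        cost = backup_cost(cls, 1000, days, restores)
        print(f"{cls.name:9s} {days:3d} days, {restores} restore(s): {cost:.2f}")
```

Under these assumed rates, a short-lived backup that is restored once is cheaper in the standard class, while a backup kept for 90 days and never read is cheaper in the colder class.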
Storage Transfer Service lets you import data from Amazon S3 or HTTP-based sources into Cloud Storage. In DR scenarios, you can use Storage Transfer Service to:
- Back up data from other storage providers to a Cloud Storage bucket.
- Move data from a bucket in a multiregion to a bucket in a region to lower your costs for storing backups.
Filestore instances are fully managed NFS file servers for use with applications running on Compute Engine instances or GKE clusters.
Filestore instances are zonal and do not support replication across zones. A Filestore instance is unavailable if the zone it resides in is down. We recommend that you periodically back up your data by syncing your Filestore volume to a Filestore instance in another region using the gsutil rsync command. This requires a job to be scheduled to run on Compute Engine instances or GKE clusters.
In DR scenarios, applications can resume access to Filestore volumes quickly by switching to Filestore in failover regions without needing to wait for any restore process to complete. The RPO of this DR solution depends largely on the frequency of the scheduled job, because any data written after the last completed sync is not present in the failover region.
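How the schedule of the sync job bounds the amount of data you can lose can be worked out with simple arithmetic. The following Python sketch assumes that each sync captures the volume as it was when the sync started, and that the interval is measured from one start to the next. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_age(sync_interval: timedelta,
                        sync_duration: timedelta) -> timedelta:
    """Oldest possible age of the replica's data at the moment of a failure.

    If the source fails just before a sync finishes, the replica still holds
    the state captured when the previous sync started.
    """
    return sync_interval + sync_duration


def meets_rpo(sync_interval: timedelta, sync_duration: timedelta,
              rpo: timedelta) -> bool:
    """Check a schedule against a recovery point objective."""
    return worst_case_data_age(sync_interval, sync_duration) <= rpo


hourly = worst_case_data_age(timedelta(hours=1), timedelta(minutes=10))
print(hourly)  # 1:10:00
print(meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo=timedelta(hours=1)))
```

In this model, an hourly job that takes ten minutes cannot meet a one-hour recovery point objective; the job would need to run more often or finish faster.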
GKE is a managed, production-ready environment for deploying containerized applications. GKE lets you orchestrate HA systems, and includes the following features:
- Node auto repair. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), GKE initiates a repair process for that node.
- Liveness probe. You can specify a liveness probe, which periodically tells GKE that the pod is running. If the pod fails the probe, it can be restarted.
- Persistent volumes. Databases must be able to persist beyond the life of a container. By using the persistent volume abstraction, which maps to a Compute Engine persistent disk, you can maintain storage availability independently of the individual containers.
- Multi-zone and regional clusters. You can distribute Kubernetes resources across multiple zones within a region.
- Multi-cluster Ingress. You can configure shared load balancing resources across multiple GKE clusters in different regions.
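As an illustration of the liveness probe item in the list above, the following Python sketch builds a Pod manifest with an HTTP liveness probe and prints it as JSON, which Kubernetes accepts alongside YAML. The Pod name, image path, `/healthz` endpoint, port, and timing values are all assumptions chosen for the example; the helper `pod_with_liveness_probe` is hypothetical.

```python
import json


def pod_with_liveness_probe(name: str, image: str, port: int = 8080) -> dict:
    """Return a Pod manifest whose container is restarted if the probe fails."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    "livenessProbe": {
                        # The kubelet sends GET /healthz to the container.
                        "httpGet": {"path": "/healthz", "port": port},
                        "initialDelaySeconds": 10,  # wait before first probe
                        "periodSeconds": 5,         # probe every 5 seconds
                        "failureThreshold": 3,      # restart after 3 failures
                    },
                }
            ]
        },
    }


# Placeholder image path; substitute your own registry and tag.
manifest = pod_with_liveness_probe("dr-demo", "example.com/dr-demo:1.0")
print(json.dumps(manifest, indent=2))
```

The application in the container is assumed to answer on `/healthz` with a success status while it is healthy. Tune the delay and threshold so that slow startups are not mistaken for failures.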
Networking and data transfer
Cloud Load Balancing
Cloud Load Balancing provides HA for Compute Engine by distributing user requests among a set of instances. You can configure Cloud Load Balancing with health checks that determine whether instances are available to do work so that traffic is not routed to failing instances.
Cloud Load Balancing provides a single globally accessible IP address to front your Compute Engine instances. Your application can have instances running in different regions (for example, in Europe and in the US), and your end users are directed to the closest set of instances. In addition to providing load balancing for services that are exposed to the internet, you can configure internal load balancing for your services behind a private load-balancing IP address. This IP address is accessible only to VM instances that are internal to your Virtual Private Cloud (VPC).
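The effect of health checks on traffic distribution can be shown with a toy model. The following Python sketch is not the Cloud Load Balancing algorithm; it is a minimal round-robin picker, with the hypothetical class name `HealthAwareBalancer`, that skips any backend whose health check currently fails.

```python
from typing import Callable, Iterable, List


class HealthAwareBalancer:
    """Round-robin over backends, skipping those that fail the health check."""

    def __init__(self, backends: Iterable[str],
                 health_check: Callable[[str], bool]):
        self._backends: List[str] = list(backends)
        self._health_check = health_check
        self._next = 0  # index where the next search starts

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none is healthy."""
        for offset in range(len(self._backends)):
            index = (self._next + offset) % len(self._backends)
            candidate = self._backends[index]
            if self._health_check(candidate):
                self._next = (index + 1) % len(self._backends)
                return candidate
        raise RuntimeError("no healthy backends available")


# Simulated health state; a real check would probe each instance.
status = {"vm-a": True, "vm-b": False, "vm-c": True}
balancer = HealthAwareBalancer(status, lambda name: status[name])

print([balancer.pick() for _ in range(4)])  # vm-b is skipped while unhealthy
```

When `vm-b` starts passing its check again, it rejoins the rotation without any change to clients, which is the behavior that makes health checks useful during partial failures.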
With Traffic Director, you can deploy a fully managed traffic control plane for your service mesh. Traffic Director manages the configuration of service proxies running in both Compute Engine and GKE. To make a service highly available, deploy it in multiple regions. Traffic Director offloads service health checks and initiates a failover configuration of service proxies, which redirects traffic to healthy instances.
Traffic Director also supports advanced traffic control concepts, circuit breaking, and fault injection. With circuit breaking, you can enforce limits on requests to a particular service, after which requests are prevented from reaching the service, preventing the service from degrading further. With fault injection, Traffic Director can introduce delays or abort a fraction of requests to a service, enabling you to easily test your service's ability to survive request delays or aborted requests.
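The request-limiting form of circuit breaking described above can be sketched in a few lines. The following Python example is a simplified stand-in, not Traffic Director's implementation or configuration: it caps the number of in-flight requests and rejects anything beyond the cap. The names `RequestLimitBreaker` and `CircuitOpenError` are hypothetical.

```python
import threading
from contextlib import contextmanager


class CircuitOpenError(Exception):
    """Raised when the request limit is reached and the call is rejected."""


class RequestLimitBreaker:
    """Reject new requests once `max_requests` are already in flight."""

    def __init__(self, max_requests: int):
        self._max_requests = max_requests
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def request(self):
        with self._lock:
            if self._in_flight >= self._max_requests:
                # Fail fast instead of adding load to a saturated service.
                raise CircuitOpenError("request limit reached")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1


breaker = RequestLimitBreaker(max_requests=1)

with breaker.request():
    try:
        with breaker.request():  # second concurrent request is rejected
            pass
    except CircuitOpenError as err:
        print("rejected:", err)

with breaker.request():  # capacity is free again after the first one finishes
    print("accepted")
```

Rejecting excess requests early keeps a struggling service from degrading further, at the cost of callers needing a fallback or retry policy.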
Cloud DNS provides a programmatic way to manage your DNS entries as part of an automated recovery process. Cloud DNS uses Google's global network of Anycast name servers to serve your DNS zones from redundant locations around the world, providing high availability and lower latency for your users.
If you choose to manage DNS entries on-premises, you can enable VMs in Google Cloud to resolve these addresses through Cloud DNS forwarding.
Cloud Interconnect provides ways to move information from other sources to Google Cloud. We discuss this product later under Transferring data to and from Google Cloud.
Management and monitoring
Cloud Status Dashboard
The Cloud Status Dashboard shows you the current availability of Google Cloud services. You can view the status on the page, and you can subscribe to an RSS feed that is updated whenever there is news about a service.
Cloud Monitoring collects metrics, events, and metadata from Google Cloud, AWS, hosted uptime probes, application instrumentation, and a variety of other application components. You can configure alerting to send notifications to third-party tools such as Slack or PagerDuty in order to provide timely updates to administrators. Another way to use Cloud Monitoring for DR is to configure a Pub/Sub sink and use Cloud Functions to trigger an automated process in response to a Cloud Monitoring alert.
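The alert-to-automation path can be sketched as a Pub/Sub-triggered function. The following Python example is a minimal sketch under stated assumptions: it assumes the function receives the message base64-encoded in `event["data"]`, and that the alert payload is JSON with an `incident` object carrying `state` and `policy_name` fields. Verify both against the current Cloud Functions and Cloud Monitoring documentation. `start_failover` is a hypothetical placeholder for your own recovery automation.

```python
import base64
import json


def start_failover(policy_name: str) -> str:
    """Hypothetical placeholder; replace with your recovery automation."""
    return f"failover started for {policy_name}"


def handle_alert(event: dict, context=None) -> str:
    """Entry point for a Pub/Sub-triggered function (assumed signature)."""
    # Pub/Sub message bodies arrive base64-encoded.
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})

    # Act only when an incident opens; ignore the "closed" notification.
    if incident.get("state") != "open":
        return "ignored"
    return start_failover(incident.get("policy_name", "unknown"))


# Local simulation of a delivered message (field names are assumptions).
message = {"incident": {"state": "open", "policy_name": "db-primary-down"}}
event = {"data": base64.b64encode(json.dumps(message).encode("utf-8"))}
print(handle_alert(event))
```

Because alerts can fire more than once for the same incident, the real recovery step should be idempotent so that a repeated notification does not start a second failover.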
Deployment Manager lets you define your Google Cloud environment in a set of templates. You can then use the templates to create environments repeatedly and consistently with a single command. Similarly, you can tear down the environment with a single command. This makes Deployment Manager well suited to defining a recovery environment that you can reliably create in whichever region you want.
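Deployment Manager templates can be written in Python, so a recovery environment can be parameterized by zone. The following is a minimal sketch of such a template, assuming a single VM on the default network. The property names `zone`, `machineType`, and `sourceImage` are choices made for this example, and the project and image names are placeholders. The `SimpleNamespace` object at the end only stands in for the context that Deployment Manager would pass, so that the sketch can be run locally.

```python
from types import SimpleNamespace

COMPUTE_URL_BASE = "https://www.googleapis.com/compute/v1/"


def GenerateConfig(context):
    """Return the resources for one VM in the zone given by the properties."""
    project = context.env["project"]
    zone = context.properties["zone"]
    machine_type = context.properties["machineType"]

    instance = {
        "name": context.env["name"] + "-vm",
        "type": "compute.v1.instance",
        "properties": {
            "zone": zone,
            "machineType": (
                f"{COMPUTE_URL_BASE}projects/{project}"
                f"/zones/{zone}/machineTypes/{machine_type}"
            ),
            "disks": [{
                "deviceName": "boot",
                "type": "PERSISTENT",
                "boot": True,
                "autoDelete": True,
                # Custom image with the application preinstalled (placeholder).
                "initializeParams": {
                    "sourceImage": COMPUTE_URL_BASE
                    + context.properties["sourceImage"],
                },
            }],
            "networkInterfaces": [{
                "network": (
                    f"{COMPUTE_URL_BASE}projects/{project}"
                    "/global/networks/default"
                ),
            }],
        },
    }
    return {"resources": [instance]}


# Stand-in for the context object, used only to exercise the template locally.
fake_context = SimpleNamespace(
    env={"name": "dr-env", "project": "my-project"},
    properties={
        "zone": "us-east1-b",
        "machineType": "e2-medium",
        "sourceImage": "projects/my-project/global/images/dr-app-image",
    },
)
config = GenerateConfig(fake_context)
print(config["resources"][0]["properties"]["machineType"])
```

Changing only the `zone` property is enough to describe the same environment in a different location, which is the property that matters when the primary region is unavailable.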
Cross-platform DR building blocks
When you run workloads across more than one platform, a way to reduce the operational overhead is to select tooling that works with all of the platforms you're using. This section discusses some tools and services that are platform agnostic and therefore support cross-platform DR scenarios.
Declarative templating tools
Declarative templating tools make it easy to deploy infrastructure across platforms. As noted earlier, for Google Cloud-only deployments, you can use Deployment Manager. For cross-platform deployments, Terraform is one of the most popular declarative templating tools.
Configuration management tools
For large or complex DR infrastructure, we recommend platform-agnostic software management tools like Chef and Ansible. These tools ensure that reproducible configurations can be applied no matter where your compute workload is. For examples of using these types of tools, see the Ansible with Spinnaker tutorial and Zero-to-Deploy with Chef on Google Cloud.
A common DR pattern is to have copies of objects in object stores in different cloud providers. One cross-platform tool for this is boto, which is an open source Python library that lets you interface with both Amazon S3 and Cloud Storage.
Containers can also be considered a DR building block. Containers are a way to package services and introduce consistency across platforms.
If you work with containers, you typically use an orchestrator. Kubernetes not only manages containers within Google Cloud (using GKE), but also provides a way to orchestrate container-based workloads across multiple platforms. Google Cloud, AWS, and Microsoft Azure all provide managed versions of Kubernetes.
To distribute traffic to Kubernetes clusters running in different cloud platforms, you can use a DNS service that supports weighted records and incorporates health checking.
You also need to ensure you can pull the image to the target environment. This means you need to be able to access your image registry in the event of a disaster. A good option that's also platform agnostic is Container Registry.
Data transfer is a critical component of cross-platform DR scenarios. Make sure that you design, implement, and test your cross-platform DR scenarios using realistic mockups of what the DR data transfer scenario calls for. We discuss data transfer scenarios in the next section.
Patterns for DR
This section discusses some of the most common patterns for DR architectures based on the building blocks discussed earlier.
Transferring data to and from Google Cloud
An important aspect of your DR plan is how quickly data can be transferred to and from Google Cloud. This is critical if your DR plan is based on moving data from on-premises to Google Cloud or from another cloud provider to Google Cloud. This section discusses networking and Google Cloud services that can ensure good throughput.
When you are using Google Cloud as the recovery site for workloads that are on-premises or on another cloud environment, consider the following key items:
- How do you connect to Google Cloud?
- How much bandwidth is there between you and the interconnect provider?
- What is the bandwidth provided by the provider directly to Google Cloud?
- What other data will be transferred using that link?
If you use a public internet connection to transfer data, network throughput is unpredictable, because you're limited by the ISP's capacity and routing. The ISP might offer a limited SLA, or none at all. On the other hand, these connections have relatively low costs.
Cloud Interconnect provides several options to connect to Google and Google Cloud:
- Cloud VPN enables the creation of IPsec VPN tunnels between a Google Cloud VPC network and a target network. Traffic traveling between the two networks is encrypted by one VPN gateway, then decrypted by the other VPN gateway. HA VPN enables you to create high-availability VPN connections with an SLA of 99.99%, plus a simplified setup compared to creating redundant VPNs.
- Direct peering provides minimal network hops to Google's public IP addresses. You can use direct peering to exchange internet traffic between your network and Google's edge points of presence (PoPs).
- Dedicated Interconnect provides a direct physical connection between your on-premises network and Google's network. It provides an SLA along with more consistent throughput for large data transfers. Circuits are either 10 Gbps or 100 Gbps and are terminated at one of Google's colocation facilities. With larger bandwidth, you can reduce the time it takes to transfer data from on-premises to Google Cloud. The following table illustrates the speed gains when upgrading from 10 Gbps to 100 Gbps.
- Partner Interconnect provides capabilities similar to Dedicated Interconnect, but at circuit speeds of less than 10 Gbps. See Supported service providers.
The following diagram provides guidance about which transfer method to use, depending on how much data you need to transfer to Google Cloud.
You can use the transfer time calculator to understand how much time a transfer might take, given the size of the dataset you're moving and the bandwidth available for the transfer. For more information about data transfer as part of your DR planning, see Transferring big datasets.
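The relationship between dataset size, bandwidth, and transfer time is simple enough to estimate by hand. The following Python sketch is a back-of-the-envelope calculation, not the transfer time calculator itself. It uses decimal units (1 TB = 8,000 gigabits), and the `utilization` parameter is an assumption added here to account for protocol overhead and other traffic that shares the link.

```python
def transfer_time_hours(dataset_tb: float, link_gbps: float,
                        utilization: float = 1.0) -> float:
    """Estimate the hours needed to move `dataset_tb` terabytes over a link.

    `utilization` is the fraction of the link's nominal bandwidth that the
    transfer actually gets (1.0 means the full line rate).
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    gigabits = dataset_tb * 8_000  # 1 TB = 1,000 GB = 8,000 gigabits
    seconds = gigabits / (link_gbps * utilization)
    return seconds / 3600


for gbps in (10, 100):
    hours = transfer_time_hours(100, gbps)
    print(f"100 TB over {gbps} Gbps: about {hours:.1f} hours")
```

At full line rate, 100 TB takes about 22.2 hours over a 10 Gbps circuit and about 2.2 hours over a 100 Gbps circuit. Real transfers are slower, so compare the estimate against your RTO with a realistic `utilization` value.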
Balancing image configuration and deployment speed
When you configure a machine image for deploying new instances, consider the effect that your configuration will have on the speed of deployment. There is a tradeoff between the amount of image preconfiguration, the costs of maintaining the image, and the speed of deployment. For example, if a machine image is minimally configured, the instances that use it will require more time to launch, because they need to download and install dependencies. On the other hand, if your machine image is highly configured, the instances that use it launch more quickly, but you must update the image more frequently. The time taken to launch a fully operational instance directly affects your RTO.
Maintaining machine image consistency across hybrid environments
If you implement a hybrid solution (on-premises-to-cloud or cloud-to-cloud), you need to find a way to maintain VM consistency across production environments.
If a fully configured image is required, consider something like Packer, which can create identical machine images for multiple platforms. You can use the same scripts with platform-specific configuration files. In the case of Packer, you can put the configuration file in version control to keep track of what version is deployed in production. For a discussion of how to create an automated pipeline for continuously building images with Packer and other open source utilities, see Automated Image Builds with Jenkins, Packer, and Kubernetes.
As another option, you can use configuration management tools such as Chef, Puppet, Ansible, or SaltStack to configure instances with finer granularity, creating base images, minimally configured images, or fully configured images as needed. For a discussion of how to use these tools effectively, see the Ansible with Spinnaker tutorial and Zero-to-Deploy with Chef on Google Cloud.
You can also manually convert and import existing images such as Amazon AMIs, VirtualBox images, and RAW disk images to Compute Engine.
Implementing tiered storage
The tiered storage pattern is typically used for backups where the most recent backup is on faster storage, and you gradually migrate older backups to cheaper, slower storage. There are two ways to implement the pattern using Cloud Storage, depending on where your data originates: on Google Cloud or on-premises. In both cases, you migrate objects between buckets of different storage classes, typically from Standard to the cheaper Nearline storage class.
If your source data is generated on-premises, the implementation looks similar to the following diagram:
Alternatively, you can change the storage class of the objects in a bucket using object lifecycle rules to automate the change in object class.
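A lifecycle configuration for this kind of tiering can be generated programmatically. The following Python sketch builds a JSON lifecycle configuration with two `SetStorageClass` rules and prints it. The day thresholds and the helper name `tiering_lifecycle` are assumptions for illustration; confirm the rule format against the current Cloud Storage lifecycle documentation before applying it to a bucket.

```python
import json


def tiering_lifecycle(nearline_after_days: int,
                      coldline_after_days: int) -> dict:
    """Build lifecycle rules that move aging objects to colder classes."""
    if not 0 < nearline_after_days < coldline_after_days:
        raise ValueError("expected 0 < nearline_after_days < coldline_after_days")
    return {
        "rule": [
            {
                # Move objects still in Standard once they reach this age.
                "action": {"type": "SetStorageClass",
                           "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days,
                              "matchesStorageClass": ["STANDARD"]},
            },
            {
                # Later, move objects in Nearline to Coldline.
                "action": {"type": "SetStorageClass",
                           "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days,
                              "matchesStorageClass": ["NEARLINE"]},
            },
        ]
    }


# Example thresholds (assumed): 30 days to Nearline, 120 days to Coldline.
lifecycle = tiering_lifecycle(nearline_after_days=30, coldline_after_days=120)
print(json.dumps(lifecycle, indent=2))
```

The `age` condition counts days since the object was created, so the second threshold has to be larger than the first. Keep the minimum storage duration of each class in mind when you choose the thresholds.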
Maintaining the same IP address for private instances
A common pattern is to maintain a single serving instance of a VM. If the VM has to be replaced, the replacement needs to appear as if it was the original VM. Therefore, the IP address that clients use to connect with the new instance should remain the same.
The simplest configuration is to set up a managed instance group that maintains exactly one instance. This managed instance group is integrated with an internal (private) load balancer that ensures that the same IP address is used to front the instance regardless of whether it's the original instance or a replacement.
Google has a robust partner ecosystem that supports backup and DR use cases with Google Cloud. In particular, we see customers using partner solutions to do the following:
- Back up data from on-premises to Google Cloud. In these cases, Cloud Storage is integrated as a storage target for most on-premises backup platforms. You can use this approach to replace tape and other storage appliances.
- Implement a DR plan that goes from on-premises to Google Cloud. Our partners can help eliminate secondary data centers and use Google Cloud as the DR site.
- Implement DR and backup for cloud-based workloads.
For more information about partner solutions, see the Partners page on the Google Cloud website.
- Read about Google Cloud geography and regions.
Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.