Disaster Recovery Building Blocks

This article is the second part of a multi-part series that discusses disaster recovery (DR) in Google Cloud Platform (GCP). This part discusses services and products that you can use as building blocks for your DR plan—both GCP products and products that work across platforms.

Introduction

GCP has a wide range of products that you can use as part of your disaster recovery (DR) architecture. In this section, we discuss DR-related features of the products that are most commonly used as GCP DR building blocks.

Many of these services have high availability (HA) features. HA doesn't entirely overlap with DR, but many of the goals of HA also apply to designing a DR plan. For example, by taking advantage of HA features, you can design architectures that optimize uptime and that can mitigate the effects of small-scale failures, such as a single VM failing. For more about the relationship of DR and HA, see the Disaster Recovery Planning Guide.

The following sections describe these GCP DR building blocks and how they help you implement your DR goals.

Compute and storage

Compute Engine
  • Scalable compute resources
  • Predefined and custom machine types
  • Fast boot times
  • Snapshots
  • Instance templates
  • Managed instance groups
  • Persistent disks
  • Live migration
Cloud Storage
  • Highly durable object store
  • Geo-redundant storage
  • Storage classes
  • Object lifecycle management
  • Data transfer from other sources
  • Encryption at rest by default
GKE
  • Managed environment for deploying and scaling containerized applications
  • Persistent volumes
  • Node auto-repair
  • Liveness and readiness probes
  • Multi-zone and regional clusters
  • Command-line tool for managing cross-regional clusters

Compute Engine

Compute Engine provides virtual machine (VM) instances; it's the workhorse of GCP. In addition to configuring, launching, and monitoring Compute Engine instances, you typically use a variety of related features in order to implement a DR plan.

For DR scenarios, you can prevent accidental deletion of VMs by setting the delete protection flag. This is particularly useful where you are hosting stateful services such as databases. To help meet low RTO and RPO values, follow the best practices for designing robust systems.

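You can configure an instance with your application preinstalled, and then save that configuration as a custom image. How much you preinstall in the image affects how quickly a new instance becomes operational, so choose a level of preconfiguration that matches the RTO you want to achieve.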

Instance templates

You can use Compute Engine instance templates to save the configuration details of the VM and then create instances from existing instance templates. You can use the template to launch as many instances as you need, configured exactly the way you want when you need to stand up your DR target environment. Instance templates are globally replicated, so you can easily recreate the instance anywhere in GCP with the same configuration.

You can create instance templates using a custom image or based on existing VM instances.

We provide more details about using Compute Engine images in the balancing image configuration and deployment speed section later in this document.

Managed instance groups

Managed instance groups work with Cloud Load Balancing (discussed later in this document) to distribute traffic to groups of identically configured instances that are copied across zones. Managed instance groups allow for features like autoscaling and autohealing, where the managed instance group can delete and recreate instances automatically.

Persistent disks and snapshots

Persistent disks are durable network storage devices that your instances can access. They are independent of your instances, so you can detach and move persistent disks to keep your data even after you delete your instances.

You can take incremental backups, or snapshots, of the persistent disks attached to your Compute Engine VMs. You can copy snapshots across regions and use them to recreate persistent disks in the event of a disaster. You can also create snapshots to protect against data loss due to user error. Snapshots are incremental and take only minutes to create, even if the disks you are snapshotting are attached to running instances.

Persistent disks have built-in redundancy to protect your data against equipment failure and to ensure data availability through data center maintenance events.

Live Migration

Live Migration keeps your VM instances running even when a host system event occurs, such as a software or hardware update. Compute Engine live migrates your running instances to another host in the same zone rather than requiring your VMs to be rebooted. This allows Google to perform maintenance that is integral to keeping infrastructure protected and reliable without interrupting any of your VMs.

Virtual disk import tool

The virtual disk import tool lets you import disk files in formats including VMDK, VHD, and RAW to create new Compute Engine virtual machines. Using this tool, you can create Compute Engine virtual machines that have the same configuration as your on-premises virtual machines. This is a good approach when you can't build Compute Engine images from the source binaries of software that's already installed on your images.

Cloud Storage

Cloud Storage is an object store that's ideal for storing backup files. It provides different storage classes that are suited for specific use cases, as outlined in the following diagram.

Diagram showing multi-regional and regional storage for high-frequency access, Nearline for low-frequency access, and Coldline for lowest-frequency access

In DR scenarios, Nearline and Coldline Storage are of particular interest. Both Nearline and Coldline reduce your storage cost compared to Multi-Regional or Regional storage. However, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for. Nearline is designed for backup scenarios where access is at most once a month, which is ideal for allowing you to undertake regular DR stress tests while keeping costs low.

Both Nearline and Coldline are optimized for infrequent access, and the pricing model is designed with this in mind. You pay a retrieval cost when you read data in these classes, and if you delete or replace an object before the minimum storage duration for its class has elapsed, you are still charged for the full minimum duration.
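To make the effect of minimum storage durations concrete, the following is a minimal sketch of how you might estimate what a backup costs in each class. The 30-day (Nearline) and 90-day (Coldline) minimums reflect the published durations; the function names and every price passed in are placeholders for illustration, not actual GCP pricing.

```python
# Minimum storage durations in days. The prices used with these functions
# are assumptions supplied by the caller, not real GCP prices.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days billed for an object that is deleted after days_stored days."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


def backup_cost(storage_class: str, size_gb: float, days_stored: int,
                restores: int, price_gb_month: float,
                retrieval_price_gb: float) -> float:
    """Rough cost of keeping one backup and restoring it `restores` times.

    price_gb_month and retrieval_price_gb are hypothetical inputs; a month
    is approximated as 30 days.
    """
    days = billable_days(storage_class, days_stored)
    storage = size_gb * price_gb_month * days / 30
    retrieval = size_gb * retrieval_price_gb * restores
    return storage + retrieval
```

A backup kept in Nearline for only 10 days is billed as if it had been stored for 30, so short-lived backups can cost more than the per-gigabyte price suggests. Plug in the current prices for your location before comparing classes.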

Storage Transfer Service allows you to import data from Amazon S3 or HTTP-based sources into Cloud Storage. In DR scenarios, you can use Storage Transfer Service to:

  • Back up data from other storage providers to a Cloud Storage bucket.
  • Move data from a multi-regional storage bucket to a Nearline storage bucket to lower your costs for storing backups.

GKE

GKE is a managed, production-ready environment for deploying containerized applications. GKE lets you orchestrate HA systems, and includes the following features:

  • Node auto-repair. If a node fails consecutive health checks over an extended time period (approximately 10 minutes), GKE initiates a repair process for that node.
  • Liveness probe. You can specify a liveness probe, which periodically tells GKE that the pod is running. If the pod fails the probe, it can be restarted.
  • Persistent volumes. Databases must be able to persist beyond the life of a container. By using the persistent volume abstraction, which maps to a Compute Engine persistent disk, you can maintain storage availability independently of the individual containers.
  • Multi-zone and regional clusters. You can distribute Kubernetes resources across multiple zones within a region.
  • Kubemci. This tool lets you configure a global load balancer to load-balance traffic across multiple GKE clusters in different regions.
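The node auto-repair rule described in the list above can be modeled in a few lines. This is only an illustration of the "consecutive failures for about 10 minutes" logic; it is not how GKE implements the feature, and the class name, method name, and exact 600-second threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# "Approximately 10 minutes" from the text, expressed in seconds (assumed exact
# here only to keep the sketch simple).
REPAIR_AFTER_SECONDS = 600


@dataclass
class NodeHealth:
    """Tracks how long a node has been failing consecutive health checks."""
    failing_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True when a repair should start."""
        if healthy:
            # Any passing check resets the streak of consecutive failures.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= REPAIR_AFTER_SECONDS
```

The important detail is the reset on a passing check: a node that fails intermittently never accumulates a long enough streak to trigger a repair.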

Networking and data transfer

Cloud Load Balancing
  • Cross-region
  • Single Anycast IP
  • Health checks
  • Cloud CDN integration
  • Autoscaling integration
Cloud Interconnect
  • Cloud VPN (IPsec VPN)
  • Direct peering
Cloud DNS
  • Programmatic DNS management
  • Access control
  • Anycast to serve zones

Cloud Load Balancing

Cloud Load Balancing provides HA for Compute Engine by distributing user requests among a set of instances. You can configure Cloud Load Balancing with health checks that determine whether instances are available to do work so that traffic is not routed to failing instances.

Cloud Load Balancing provides a single globally accessible IP address to front your Compute Engine instances. Your application can have instances running in different regions (for example, in Europe and in the US), and your end users are directed to the closest set of instances. In addition to providing load balancing for services that are exposed to the internet, you can configure internal load balancing for your services behind a private load-balancing IP address. This IP address is accessible only to VM instances that are internal to your Virtual Private Cloud (VPC).

Cloud DNS

Cloud DNS provides a programmatic way to manage your DNS entries as part of an automated recovery process. Cloud DNS uses Google's global network of Anycast name servers to serve your DNS zones from redundant locations around the world, providing high availability and lower latency for your users.

Cloud Interconnect

Cloud Interconnect provides ways to move information from other sources to GCP. We discuss this product later under Transferring data to and from GCP.

Management and monitoring

Cloud Status Dashboard
  • Status of GCP services
Stackdriver
  • Uptime monitoring
  • Alerts
  • Logging
  • Error reporting
Cloud Deployment Manager
  • Repeatable and consistent deployment process
  • Parallel deployment
  • Templates
  • Infrastructure as code

Cloud Status Dashboard

The Cloud Status Dashboard shows you the current availability of GCP services. You can view the status on the page, and you can subscribe to an RSS feed that is updated whenever there is news about a service.

Monitoring

Stackdriver Monitoring collects metrics, events, and metadata from GCP, AWS, hosted uptime probes, application instrumentation, and a variety of other application components. You can configure alerting to send notifications to third-party tools such as Slack or PagerDuty to provide timely updates to administrators. Another way to use Stackdriver for DR is to configure a Cloud Pub/Sub sink and use Cloud Functions to trigger an automated process in response to a Stackdriver alert.
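As a rough sketch of the Cloud Pub/Sub and Cloud Functions approach, the following Python function is written in the style of a Pub/Sub-triggered background function, where the message body arrives base64-encoded in event["data"]. The payload field names (incident, state, policy_name) are assumptions about the alert notification format, and start_failover is a hypothetical hook; check the actual payload your alerts produce before relying on any of this.

```python
import base64
import json


def start_failover(policy_name: str) -> str:
    """Hypothetical hook: replace with your real recovery step."""
    return "failover started for {}".format(policy_name)


def handle_alert(event: dict, context=None):
    """Decode a Pub/Sub message and react only to newly opened incidents."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})  # assumed field name
    if incident.get("state") != "open":     # assumed field name and value
        return None  # ignore closed or malformed notifications
    return start_failover(incident.get("policy_name", "unknown"))
```

Keeping the recovery step behind a single function makes it easier to exercise the handler in a DR test without triggering a real failover.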

Cloud Deployment Manager

Cloud Deployment Manager allows you to define your GCP environment in a set of templates. You can then use the templates to create environments repeatedly and consistently with a single command. Similarly, you can tear down the environment with a single command. This makes Cloud Deployment Manager well suited to defining a DR environment that you can reliably create in whichever region you want.
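Deployment Manager templates can be written in Python. The following is a minimal sketch of a template that describes a single VM whose zone is a parameter, so the same template can stand up the environment in a different region. The property names (zone, machineType, sourceImage) and the "-vm" suffix are choices made for this example, not required names, and a real DR template would describe many more resources.

```python
COMPUTE_URL_BASE = "https://www.googleapis.com/compute/v1/"


def GenerateConfig(context):
    """Return the resources for one VM in the zone given by the config."""
    project = context.env["project"]
    zone = context.properties["zone"]  # example property, set in your config
    instance = {
        "name": context.env["deployment"] + "-vm",
        "type": "compute.v1.instance",
        "properties": {
            "zone": zone,
            "machineType": "{}projects/{}/zones/{}/machineTypes/{}".format(
                COMPUTE_URL_BASE, project, zone,
                context.properties["machineType"]),
            "disks": [{
                "deviceName": "boot",
                "type": "PERSISTENT",
                "boot": True,
                "autoDelete": True,
                # sourceImage is assumed to be a path such as
                # "projects/<project>/global/images/<image>".
                "initializeParams": {
                    "sourceImage": COMPUTE_URL_BASE
                    + context.properties["sourceImage"],
                },
            }],
            "networkInterfaces": [{
                "network": "{}projects/{}/global/networks/default".format(
                    COMPUTE_URL_BASE, project),
            }],
        },
    }
    return {"resources": [instance]}
```

Because the zone is a property rather than a hard-coded value, recreating the environment elsewhere means changing one value in the configuration that imports the template.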

Cross-platform DR building blocks

When you run workloads across more than one platform, a way to reduce the operational overhead is to select tooling that works with all of the platforms you're using. In this section, we discuss some tools and services that are platform agnostic and therefore support cross-platform DR scenarios.

Declarative templating tools

Declarative templating tools make it easy to deploy infrastructure across platforms. As noted earlier, for GCP-only deployments, you can use Cloud Deployment Manager. For cross-platform deployments, Terraform is one of the most popular declarative templating tools.

Configuration management tools

For large or complex DR infrastructure, we recommend platform-agnostic software management tools like Chef and Ansible. These tools ensure that reproducible configurations can be applied no matter where your compute workload is. For examples of using these types of tools, see the Ansible with Spinnaker tutorial and Zero-to-Deploy with Chef on GCP.

Object storage

A common DR pattern is to have copies of objects in object stores in different cloud providers. One cross-platform tool for this is boto, which is an open source Python library that allows you to interface with both Amazon S3 and Cloud Storage.

Orchestrator tools

Containers can also be considered a DR building block. Containers are a way to package services and introduce consistency across platforms.

If you work with containers, you typically use an orchestrator. Kubernetes not only manages containers within GCP (using GKE), but also provides a way to orchestrate container-based workloads across multiple platforms. GCP, AWS, and Microsoft Azure all provide managed versions of Kubernetes.

To distribute traffic to Kubernetes clusters running in different cloud platforms, you can use a DNS service that supports weighted records and incorporates health checking.
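To illustrate what weighted records combined with health checking mean in practice, here is a minimal sketch of the selection logic. It does not reproduce any particular DNS provider's algorithm; the function name, the record format, and the endpoint names in the usage note are hypothetical.

```python
import random


def pick_endpoint(records, is_healthy, rng=random):
    """Pick one endpoint by weight, considering only healthy endpoints.

    records: list of (endpoint, weight) pairs.
    is_healthy: callable that returns True if the endpoint passes its check.
    Returns None when no endpoint is healthy.
    """
    healthy = [(ep, w) for ep, w in records if w > 0 and is_healthy(ep)]
    if not healthy:
        return None
    endpoints, weights = zip(*healthy)
    return rng.choices(endpoints, weights=weights, k=1)[0]
```

With records such as [("gke.example.com", 80), ("other-cloud.example.com", 20)], most traffic goes to the first cluster while it is healthy. If its health check fails, all traffic shifts to the remaining cluster without any change to the records.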

You also need to ensure you can pull the image to the target environment. This means you need to be able to access your image registry in the event of a disaster. A good option that's also platform agnostic is Container Registry.

Data transfer

Data transfer is a critical component of cross-platform DR scenarios. Make sure that you design, implement, and test your cross-platform DR scenarios using realistic mockups of what the DR data transfer scenario calls for. We discuss data transfer scenarios in the next section.

Patterns for DR

This section discusses some of the most common patterns for DR architectures based on the building blocks discussed earlier.

Transferring data to and from GCP

An important aspect of your DR plan is how quickly data can be transferred to and from GCP. This is critical if your DR plan is based on moving data from on-premises to GCP or from another cloud provider to GCP. In this section, we discuss networking and GCP services that can ensure good throughput.

When you are using GCP as the recovery site for workloads that are on premises or on another cloud environment, you must consider the following key items:

  • How do you connect to GCP?
  • How much bandwidth is there between you and the interconnect provider?
  • What is the actual bandwidth provided by the provider directly to GCP?
  • What other data will be transferred using that link?
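Once you have answers to the questions above, a back-of-the-envelope calculation shows whether a link can move your data within your recovery objectives. The following sketch assumes decimal units (1 TB = 10^12 bytes) and a usable_fraction parameter that stands in for protocol overhead and other traffic sharing the link; both the function name and the default of 0.8 are illustrative assumptions.

```python
def transfer_hours(data_tb: float, link_mbps: float,
                   usable_fraction: float = 0.8) -> float:
    """Estimate hours needed to move data_tb terabytes over a link.

    link_mbps is the nominal bandwidth in megabits per second, and
    usable_fraction is the share of it you expect to actually get.
    """
    if link_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("bandwidth and usable_fraction must be positive")
    bits = data_tb * 1e12 * 8                       # terabytes to bits
    effective_bps = link_mbps * 1e6 * usable_fraction
    return bits / effective_bps / 3600              # seconds to hours
```

Under these assumptions, 10 TB over a 1 Gbps link at 80% usable throughput takes roughly 28 hours, which is worth comparing against your RTO before you commit to a network-only transfer.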

If you use a public internet connection to transfer data, network throughput is unpredictable, because you're limited by the ISP's capacity and routing. The ISP might offer a limited SLA, or none at all. On the other hand, these connections have relatively low costs.

Other options for moving data to and from GCP include the following:

  • Cloud VPN. When you transfer data to and from GCP, you can use Cloud VPN to encrypt data in transit. An IPsec VPN connection is initiated between the source and target networks. Traffic traveling between the two networks is encrypted by one VPN gateway, then decrypted by the other VPN gateway.
  • Direct peering. To minimize network hops, you can use direct peering to exchange internet traffic between your network and Google's edge points of presence (PoPs).
  • Cloud Interconnect. This service offers a direct GCP connection through one of several service providers. Cloud Interconnect provides more consistent throughput for large data transfers, and typically includes an SLA for network availability and performance. To learn more, contact a service provider directly.

The following diagram provides guidance about which transfer method to use, depending on how much data you need to transfer to GCP.

Chart showing amount of data on the Y axis (0 to past 100TB) and data location categories on the X axis (for example, 'In GCP', 'On-premises with good connectivity', etc.), with different transfer solutions in each category

For more information about data transfer as part of your DR planning, see Transferring Big Data Sets.

Balancing image configuration and deployment speed

When you configure a machine image for deploying new instances, consider the effect that your configuration will have on the speed of deployment. There is a tradeoff between the amount of image preconfiguration, the costs of maintaining the image, and the speed of deployment. For example, if a machine image is minimally configured, the instances that use it will require more time to launch, because they need to download and install dependencies. On the other hand, if your machine image is highly configured, the instances that use it launch more quickly, but you must update the image more frequently. The time taken to launch a fully operational instance will have a direct correlation to your RTO.
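One way to reason about this tradeoff is to pick the least preconfigured image that still becomes operational within the time your RTO allows, because less bundling means fewer image rebuilds. The sketch below assumes you have measured how long each image takes to reach a serving state; the profile names and timings are made-up placeholders, not measurements.

```python
from typing import List, NamedTuple, Optional


class ImageProfile(NamedTuple):
    name: str
    ready_s: float  # seconds from instance creation until fully operational


# Ordered from least to most preconfigured. All values are placeholders.
PROFILES = [
    ImageProfile("base-os", 940),
    ImageProfile("os-plus-dependencies", 280),
    ImageProfile("fully-configured", 40),
]


def least_bundled_within(budget_s: float,
                         profiles: List[ImageProfile] = PROFILES
                         ) -> Optional[ImageProfile]:
    """Return the least preconfigured image that is ready within budget_s."""
    for profile in profiles:
        if profile.ready_s <= budget_s:
            return profile
    return None  # no image is fast enough; the RTO needs another approach
```

Here budget_s is the part of your RTO that you can spend on launching instances, after subtracting detection time, data restoration, and any other recovery steps.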

Diagram showing 3 levels of bundling (unbundled to fully bundled) mapped against image boot time (the most bundled is the fastest to boot)

Maintaining machine image consistency across hybrid environments

If you implement a hybrid solution (on-premises-to-cloud or cloud-to-cloud), you need to find a way to maintain VM consistency across production environments.

If a fully configured image is required, consider something like Packer, which can create identical machine images for multiple platforms. You can use the same scripts with platform-specific configuration files. In the case of Packer, you can put the configuration file in version control to keep track of what version is deployed in production. For a discussion of how to create an automated pipeline for continuously building images with Packer and other open source utilities, see Automated Image Builds with Jenkins, Packer, and Kubernetes.

As another option, you can use configuration management tools such as Chef, Puppet, Ansible, or SaltStack to configure instances with finer granularity, creating base images, minimally-configured images, or fully-configured images as needed. For a discussion of how to use these tools effectively, see the Ansible with Spinnaker tutorial and Zero-to-Deploy with Chef on GCP.

You can also manually convert and import existing images such as Amazon AMIs, VirtualBox images, and RAW disk images to Compute Engine.

Implementing tiered storage

The tiered storage pattern is typically used for backups: the most recent backup is kept on faster storage, and older backups are gradually migrated to cheaper, slower storage. There are two ways to implement the pattern using Cloud Storage, depending on whether your data originates on GCP or on-premises. In both cases, you migrate objects between buckets of different storage classes, typically from standard to Nearline (cheaper) storage classes.

Diagram showing data migrating from a persistent disk to standard storage to Nearline storage

If your source data is generated on-premises, the implementation looks similar to the following diagram:

Diagram showing data migrating from on-premises through Cloud Interconnect to Cloud Storage

Alternatively, you can change the storage class of the objects in a bucket using object lifecycle rules to automate the change in object class.
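As an illustration of the lifecycle approach, the following sketch builds a lifecycle configuration as a Python dictionary that you could serialize to JSON and apply to a bucket. The structure follows the Cloud Storage lifecycle configuration format as we understand it (a list of rules, each with an action and a condition); the function name and the age thresholds are assumptions, so verify the result against the current lifecycle documentation before applying it.

```python
import json


def tiering_rules(nearline_after_days: int = 30, delete_after_days=None):
    """Build lifecycle rules that move aging objects to Nearline.

    Optionally adds a rule that deletes objects after delete_after_days.
    The thresholds are illustrative defaults, not recommendations.
    """
    rules = [{
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {
            "age": nearline_after_days,  # days since the object was created
            "matchesStorageClass": ["MULTI_REGIONAL", "REGIONAL", "STANDARD"],
        },
    }]
    if delete_after_days is not None:
        rules.append({
            "action": {"type": "Delete"},
            "condition": {"age": delete_after_days},
        })
    return {"rule": rules}


def to_json(config: dict) -> str:
    """Serialize the configuration for saving to a file."""
    return json.dumps(config, indent=2)
```

For example, to_json(tiering_rules(30, 365)) produces a configuration that moves backups to Nearline after 30 days and deletes them after a year. Keep the minimum storage duration of the target class in mind when you choose the delete threshold.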

Maintaining the same IP address for private instances

A common pattern is to maintain a single serving instance of a VM. If the VM has to be replaced, the replacement needs to appear as if it were the original VM. Therefore, the IP address that clients use to connect to the new instance should remain the same.

The simplest configuration is to set up a managed instance group that maintains exactly one instance. This managed instance group is integrated with an internal (private) load balancer that ensures that the same IP address is used to front the instance regardless of whether it's the original image or a replacement.

Technology partners

Google has a robust partner ecosystem that supports backup and DR use cases with GCP. In particular, we see customers using partner solutions to do the following:

  • Back up data from on-premises to GCP. In these cases, Cloud Storage is integrated as a storage target for most on-premises backup platforms. You can use this approach to replace tape and other storage appliances.
  • Implement a DR plan that goes from on-premises to GCP. Our partners can help eliminate secondary data centers and use GCP as the DR site.
  • Implement DR and backup for cloud-based workloads.

For more information about partner solutions, see the Partners page on the GCP website.
