How to Design a Disaster Recovery Plan

Service-interrupting events can happen at any time. Your network could have an outage, your latest application push might introduce a critical bug, or—in rare cases—you might even have to contend with a natural disaster.

When things go awry, it's important to have a robust, targeted, and well-tested disaster recovery plan. This article discusses general principles for designing and testing a disaster recovery plan with Google Cloud Platform.

Basics of disaster recovery planning

As a subset of business continuity planning, disaster recovery planning begins with a business impact analysis. The idea behind this analysis is to work out two key metrics:

  • A recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).
  • A recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost from your application due to a major incident. This metric will vary based on the ways that the data is used; for example, frequently modified user data could have an RPO of just a few minutes, whereas less critical, infrequently modified data could have an RPO of several hours. Note that this metric describes only the length of time; it does not address the amount or quality of the data lost. (A short worked example follows this list.)
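As that worked example, consider how a backup schedule bounds the achievable RPO. The sketch below uses hypothetical values (a four-hour backup interval and a one-hour RPO):

    from datetime import timedelta

    # Hypothetical schedule: backups are taken every 4 hours.
    backup_interval = timedelta(hours=4)

    # Worst case: a major incident strikes just before the next backup runs,
    # so everything written since the last successful backup is lost.
    worst_case_data_loss = backup_interval

    # To meet an RPO, the backup interval must not exceed it.
    rpo = timedelta(hours=1)

    print("Worst-case data loss window: %s" % worst_case_data_loss)  # 4:00:00
    print("Meets 1-hour RPO: %s" % (backup_interval <= rpo))         # False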

Taken together, these metrics have a significant impact on your bottom line. Typically, the smaller your RTO and RPO values are, the more your application will cost to run, with costs climbing steeply as those values approach zero:

Figure 1: Ratio of cost to RTO/RPO

Because smaller RTO and RPO values often come with an increase in complexity, the associated administrative overhead follows a similar curve. Running a high-availability application, for example, might require you to manage distribution between two physically separated data centers, manage replication between them, and more.

Why Google Cloud Platform?

Google Cloud Platform can greatly reduce the costs associated with both RTO and RPO. For example, traditional disaster recovery planning requires you to account for any number of requirements, including:

  • Capacity: acquiring enough resources to scale as needed
  • Security: physical security to protect assets
  • Network infrastructure: software components such as firewalls and load balancers
  • Support: skilled technicians to perform maintenance and address issues
  • Bandwidth: suitable bandwidth for peak load
  • Facilities: physical infrastructure, including equipment and power

By providing a highly managed solution on a world-class production platform, Google Cloud Platform allows you to bypass most or all of these complicating factors, removing many business costs in the process. In addition, Google Cloud Platform's focus on administrative simplicity means that the costs of managing a complex application are reduced as well.

Google Cloud Platform offers several advantages relevant to disaster recovery planning, including:

  • A global network

    Google has one of the largest and most advanced computer networks. Google’s backbone network has thousands of miles of fiber-optic cable, uses advanced software-defined networking, and has edge-caching services to deliver fast, consistent, and scalable performance.

  • Redundancy

    Multiple points of presence across the globe mean strong redundancy. Your data is automatically mirrored across storage devices in multiple locations.

  • Scalability

    Cloud Platform is designed to scale like Google’s own products, even when you experience a huge traffic spike. Managed services such as App Engine, Compute Engine Autoscaler, and Cloud Datastore give you automatic scaling that enables your application to grow and shrink as needed.

  • Security

    The Google security model is an end-to-end process, built on over 15 years of experience focused on keeping customers safe on Google applications like Gmail and G Suite. In addition, Google’s site reliability engineering teams oversee operations of the platform systems to ensure high availability and prevent abuse of platform resources.

  • Compliance

    Google undergoes regular independent third-party audits to verify that Google Cloud Platform is in alignment with security, privacy, and compliance regulations and best practices. Cloud Platform complies with top certifications such as ISO 27001, SOC 2/3, and PCI DSS 3.0.

  • Low cost

    Google Cloud Platform applies Moore's Law to pricing, so when the cost of hardware decreases, the cost of the various Google Cloud Platform components will accurately reflect those decreases.

Designing your disaster recovery plan

This section outlines best practices for designing a disaster recovery plan for your service.

Design according to your recovery goals

When designing your disaster recovery plan, choose targeted strategies that address your specific use cases. For example, in the case of historical compliance-oriented data, you probably don't need speedy access to the data; however, in the event that your service experiences an interruption, you'll most likely want to be able to recover both the data and the application as quickly as possible.

For guidance on addressing common disaster recovery scenarios using Google Cloud Platform, check out the Disaster Recovery Cookbook, which provides targeted disaster recovery strategies for a variety of use cases and offers example implementations on Google Cloud Platform for each.

Design for end-to-end recovery

It isn't enough to simply have a plan for backing up or archiving your data. Make sure your disaster recovery plan addresses the full recovery process, from backups to restores to cleanup.

Make your tasks specific

When it's time to run your disaster recovery plan, you don't want to be stuck guessing what each step means. Each task in your disaster recovery plan should consist of one or more concrete, unambiguous commands or actions. For example, "Run the restore script" is too general; by contrast, "Open a Bash shell and run /home/foo/restore.sh" is precise and concrete.
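To illustrate, a recovery task can be captured as a short script so the operator has nothing left to interpret. Here is a minimal sketch in Python; the bucket, file paths, and database name are hypothetical stand-ins for your own:

    #!/usr/bin/env python
    """Restore the orders database from the most recent nightly dump.

    Run as: python /home/foo/restore.py
    """
    import subprocess

    # Hypothetical locations; substitute your own bucket and paths.
    BACKUP_URI = 'gs://example-dr-backups/mysql/orders-latest.sql.gz'
    LOCAL_DUMP = '/tmp/orders-latest.sql.gz'

    # Step 1: fetch the latest dump from Cloud Storage.
    subprocess.check_call(['gsutil', 'cp', BACKUP_URI, LOCAL_DUMP])

    # Step 2: load the dump into the standby MySQL instance.
    subprocess.check_call(
        'gunzip -c %s | mysql --host=127.0.0.1 orders' % LOCAL_DUMP,
        shell=True)

    print('Restore complete.')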

Implement control measures

Implement measures to minimize the probability of a disaster occurring. Add controls that prevent disaster events from happening and that detect issues before they escalate into disasters. For example, you could add a monitor that sends an alert when a data-destructive flow, such as a deletion pipeline, exhibits unexpected spikes or other unusual activity. This monitor could also kill the pipeline processes if a certain deletion threshold is reached, preventing a catastrophic situation.
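As a sketch of such a control, the loop below polls a deletion-rate signal, alerts on an unusual spike, and halts the pipeline once a hard threshold is crossed. The thresholds and the three helper functions are hypothetical placeholders for your own monitoring and pipeline-control hooks:

    import time

    # Hypothetical thresholds; tune these to your pipeline's normal behavior.
    SPIKE_THRESHOLD = 1000      # deletions per minute that warrant an alert
    HARD_STOP_THRESHOLD = 5000  # deletions per minute that warrant a halt

    def deletions_in_last_minute():
        """Placeholder: read the deletion rate from your monitoring system."""
        raise NotImplementedError

    def send_alert(message):
        """Placeholder: page the on-call engineer."""
        raise NotImplementedError

    def kill_pipeline():
        """Placeholder: stop the deletion pipeline's worker processes."""
        raise NotImplementedError

    while True:
        rate = deletions_in_last_minute()
        if rate > HARD_STOP_THRESHOLD:
            kill_pipeline()
            send_alert('Deletion pipeline halted at %d deletions/min' % rate)
            break
        elif rate > SPIKE_THRESHOLD:
            send_alert('Unusual deletion spike: %d deletions/min' % rate)
        time.sleep(60)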

Integrate your standard security mechanisms

Ensure that the security controls that apply to your production environment are factored into your recovery plan as well.

Keep your software licenses current

To avoid unpleasant surprises when executing a recovery, ensure that you are properly licensed for any software that you will be deploying as part of your disaster recovery plan. Check with the supplier of the software for guidance.

Configure your machine images to reflect your RTO

When configuring the machine image you will be using to deploy new instances, consider the effect your configuration will have on speed of deployment. There is a trade-off between the amount of image preconfiguration, the cost of maintaining the image, and deployment speed. For example, if your machine image is minimally configured, the instances that use it will require more time to spin up, because they need to download and install any dependencies. If your machine image is highly preconfigured, however, the instances that use it will spin up more quickly, but you will have to update the image more regularly.

Figure 2: Continuum of image configurations

For most customers, the right approach lies somewhere in between. Choose the configuration that makes sense for you, your application, and your required RTO. Typically, the smaller your RTO is, the more preconfigured you will want your image to be.
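To make the trade-off concrete, here is a sketch that creates an instance with the Compute Engine API client library (google-api-python-client). The project, zone, and instance names are hypothetical; the comments contrast a minimally configured public image, which installs dependencies at boot time, with a preconfigured custom image:

    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')

    # Option A: a minimally configured public image. Dependencies are
    # installed by a startup script, which adds time to every spin-up.
    startup_script = (
        '#!/bin/bash\n'
        'apt-get update && apt-get install -y nginx\n')

    body = {
        'name': 'dr-web-1',
        'machineType': 'zones/us-central1-f/machineTypes/n1-standard-1',
        'disks': [{
            'boot': True,
            'autoDelete': True,
            'initializeParams': {
                # Option B: reference a preconfigured custom image here
                # instead (e.g. 'projects/my-project/global/images/web-v42')
                # and drop the startup script, trading image maintenance
                # for faster spin-up.
                'sourceImage':
                    'projects/debian-cloud/global/images/family/debian-9',
            },
        }],
        'networkInterfaces': [{'network': 'global/networks/default'}],
        'metadata': {
            'items': [{'key': 'startup-script', 'value': startup_script}],
        },
    }

    compute.instances().insert(
        project='my-project', zone='us-central1-f', body=body).execute()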

Maintain more than one data recovery path

Don't settle for a single backup and restore plan. It's important to have more than one data recovery path in the event that your main path fails.

Test your plan regularly

Once you have a disaster recovery plan in place, you should test it regularly, noting any issues that come up and adjusting your plan accordingly. Using Google Cloud Platform, it’s easy to test recovery scenarios at minimal cost:

  • Automate infrastructure provisioning with Google Cloud Deployment Manager

    You can use Google Cloud Deployment Manager to automate the provisioning of virtual machine instances and other Google Cloud Platform infrastructure. If your production environment runs on-premises, make sure you have a monitoring solution in place that can trigger the appropriate recovery actions; a brief provisioning sketch follows this list.

  • Monitor and debug your tests with Stackdriver Logging and Stackdriver Monitoring

    Google Cloud Platform has excellent logging and monitoring tools that you can access via an API call, allowing you to easily automate the deployment of recovery scenarios by reacting to metrics. When designing tests, ensure that you have monitoring and alerting in place that can trigger the appropriate recovery actions.
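As a sketch of automating such a provisioning-and-test run, the snippet below spins up a disposable recovery environment with the Deployment Manager v2 API (via google-api-python-client). The project name and config file are hypothetical:

    from googleapiclient import discovery

    dm = discovery.build('deploymentmanager', 'v2')

    # Hypothetical: a config file that describes the recovery environment.
    with open('recovery-env.yaml') as f:
        config_content = f.read()

    # Create a short-lived deployment for the test run.
    operation = dm.deployments().insert(
        project='my-project',
        body={
            'name': 'dr-test-run',
            'target': {'config': {'content': config_content}},
        }).execute()
    print('Started deployment operation: %s' % operation['name'])

    # After the test, delete the deployment so the test resources stop
    # incurring costs:
    # dm.deployments().delete(project='my-project', deployment='dr-test-run')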

Utilizing Google Cloud Platform in your disaster recovery plan

Google Cloud Platform provides many products and features that can be utilized when designing and testing a disaster recovery plan. This section highlights a selection of the most relevant products and features and briefly discusses how they might be integrated into a disaster recovery plan.

Data backup and recovery

Google Cloud Platform includes features to help you back up and recover your data.

Cloud Storage

Google Cloud Storage provides a durable backup and archive endpoint for unstructured and binary data. Cloud Storage offers four storage classes—Multi-Regional Storage, Regional Storage, Nearline Storage, and Coldline Storage—with different availability and latency characteristics for your specific backup or archive needs (a brief backup sketch follows this list). All storage classes offer the same high level of durability:

  • Multi-Regional storage is useful for storing data that requires low-latency access or that is frequently accessed ("hot" objects), such as serving website content, interactive workloads, or gaming and mobile applications. This storage class maintains geo-redundant copies of your data: that is, data is stored in multiple, geographically distinct areas.
  • Regional storage is also useful for storing data that requires low-latency access or that is frequently accessed. While this class, like all classes, stores redundant copies of your data, only Multi-Regional storage does so geo-redundantly.
  • Nearline storage is useful for data that is typically accessed less than once a month, especially if the highest levels of availability are not required. This storage class can be useful as part of a multitiered storage solution, where each successive tier represents a colder form of storage.
  • Coldline storage is an inexpensive, low-availability solution that is ideal for archiving data that is typically accessed less than once a year, such as compliance-related data or data that is stored chiefly for future historical analysis. More generally, Coldline storage is analogous to tape backup and can be used for any use case that would traditionally use tape. The primary difference between Coldline storage and tape backup is that Coldline is immediately accessible when needed and uses the same interfaces as the other Cloud Storage classes.
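Here is the backup sketch mentioned above, a minimal example using the google-cloud-storage Python library; the bucket and object names are hypothetical:

    from google.cloud import storage

    client = storage.Client()

    # Hypothetical bucket, created in advance with the storage class that
    # matches your recovery needs (for example, Nearline for backups).
    bucket = client.bucket('example-dr-backups')

    # Back up a database dump as an object.
    blob = bucket.blob('mysql/orders-20170601.sql.gz')
    blob.upload_from_filename('/tmp/orders-20170601.sql.gz')

    # During recovery, download the same object.
    blob.download_to_filename('/tmp/restore/orders-20170601.sql.gz')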

Cloud SQL

Google Cloud SQL is a fully managed MySQL database service. It offers most of the capabilities and functionality of MySQL, as well as standard Google Cloud Platform benefits such as built-in data replication across zones. Cloud SQL is particularly useful as a hot standby for your MySQL database. If you also enable automated backups and binary logging on your Cloud SQL instance, you can perform point-in-time restores of your MySQL data with ease.
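As a sketch of recovering to a point in time, the SQL Admin API's clone method can create a copy of an instance at specific binary log coordinates. This assumes binary logging is enabled on the instance; the project, instance names, and coordinates below are hypothetical:

    from googleapiclient import discovery

    sqladmin = discovery.build('sqladmin', 'v1beta4')

    # Clone the instance to the state captured at specific binary log
    # coordinates, yielding a point-in-time copy to recover from.
    body = {
        'cloneContext': {
            'destinationInstanceName': 'orders-db-recovered',
            'binLogCoordinates': {
                'binLogFileName': 'mysql-bin.000023',
                'binLogPosition': '1234567',
            },
        },
    }

    operation = sqladmin.instances().clone(
        project='my-project', instance='orders-db', body=body).execute()
    print('Clone started: %s' % operation['name'])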

BigQuery

BigQuery is a fast, economical, and fully managed enterprise data warehouse for large-scale data analytics. In the event of a disaster, BigQuery is an excellent place to redirect high-volume logging, mitigating many of the complexities of standard compute-based log aggregation strategies. Sending logs to BigQuery is easy:

  • If your whole application runs on Google Cloud Platform, you can configure your Stackdriver Logging output to stream to BigQuery by simply selecting a checkbox.
  • If you use a tool like Fluentd to send your logs to a custom on-premises log-aggregation solution, the process is similarly straightforward: you can simply set up an alternate configuration in your disaster recovery plan that sends your logs to BigQuery. This allows you to maintain centralized logs without having to recreate your on-premises logging solution on virtual machine instances.

Because BigQuery was designed for analyzing large datasets, it is an ideal solution for performing log analysis, simplifying the process of diagnosing disaster-causing issues. And if you want to export your log data at a later date, BigQuery lets you export to CSV, JSON, and Avro formats.
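For example, a log-analysis query over exported logs might look like the following sketch, using the google-cloud-bigquery library; the project, dataset, and table names are hypothetical and mimic a Stackdriver Logging export table:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table populated by a Stackdriver Logging export.
    query = """
        SELECT severity, COUNT(*) AS entries
        FROM `my-project.dr_logs.syslog_20170601`
        WHERE severity IN ('ERROR', 'CRITICAL')
        GROUP BY severity
        ORDER BY entries DESC
    """

    for row in client.query(query).result():
        print(row['severity'], row['entries'])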

BigQuery can also be used as a standalone archiving solution.

Application backup and recovery

Google Cloud Platform includes features to help you back up and recover your application.

HTTP load balancing

Google Compute Engine provides an HTTP load balancing service that can be used to fail over automatically in the event that a virtual machine instance is down. The HTTP load balancer accepts traffic through a single global external IP address, then distributes it according to forwarding rules you define. In a disaster recovery context, this service can replace on-premises hardware-based load balancing and routing infrastructure, routing traffic to your standby Compute Engine instances.

Instance snapshots

Compute Engine instance snapshots allow you to create incremental, diff-based backups of the persistent disks attached to an instance, including boot volumes. You can create new persistent disks from snapshots, making them useful for backing up data, recreating a persistent disk that might have been lost, or copying a persistent disk. Snapshots can be applied across persistent disk types; for example, you can use a snapshot of a standard persistent disk to create an SSD persistent disk. Snapshots are also globally available, so you can restore to any zone in any region.
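Here is a sketch of scripting both halves, taking a snapshot and restoring it in another zone, with the Compute Engine API client library (google-api-python-client); the project, zone, disk, and snapshot names are hypothetical:

    from googleapiclient import discovery

    compute = discovery.build('compute', 'v1')

    # Take an incremental snapshot of the persistent disk.
    compute.disks().createSnapshot(
        project='my-project',
        zone='us-central1-f',
        disk='orders-data-disk',
        body={'name': 'orders-data-20170601'},
    ).execute()

    # Because snapshots are global, the disk can later be recreated in any
    # zone, including one in a different region.
    compute.disks().insert(
        project='my-project',
        zone='europe-west1-b',
        body={
            'name': 'orders-data-disk-restored',
            'sourceSnapshot': 'global/snapshots/orders-data-20170601',
        },
    ).execute()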

Cloud DNS

Google Cloud DNS provides high-volume, high-performance DNS serving, enabling you to provide reliable, low-latency access to your service from anywhere in the world by way of Google's network of Anycast name servers. If your use case does not permit you to leverage Compute Engine's HTTP load balancing service, you can still use Cloud DNS to manage application failover, either manually via the gcloud tool or programmatically via the Cloud DNS API.
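As a sketch of a programmatic failover using the google-cloud-dns library, the change below repoints a record from a primary to a standby IP address; the project, zone, record, and addresses are hypothetical:

    from google.cloud import dns

    client = dns.Client(project='my-project')
    zone = client.zone('example-zone', 'example.com.')

    # Repoint www from the primary IP to the standby IP.
    old_record = zone.resource_record_set(
        'www.example.com.', 'A', 300, ['203.0.113.10'])   # primary
    new_record = zone.resource_record_set(
        'www.example.com.', 'A', 300, ['198.51.100.20'])  # standby

    change = zone.changes()
    change.delete_record_set(old_record)
    change.add_record_set(new_record)
    change.create()  # submit the change set to Cloud DNS

Note that the record's TTL (300 seconds here) bounds how quickly resolvers pick up the new address, so keeping a low TTL on failover-relevant records directly supports a small RTO.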

Recovery plan testing and deployment

Google Cloud Platform provides several useful tools for testing, debugging, and deploying your disaster recovery plan.

Stackdriver Logging

Stackdriver Logging collects and stores logs from applications and services running on Google Cloud Platform. Logs can be viewed in the Google Cloud Platform Console or streamed to Google Cloud Storage, Google BigQuery, or Google Cloud Pub/Sub.
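For example, with the google-cloud-logging library you can create a sink that streams matching entries to BigQuery, and read entries back during a recovery test; the project, sink name, filter, and dataset below are hypothetical:

    from google.cloud import logging

    client = logging.Client(project='my-project')

    # Hypothetical sink: stream error-level entries to a BigQuery dataset.
    sink = client.sink(
        'dr-error-sink',
        filter_='severity >= ERROR',
        destination=(
            'bigquery.googleapis.com/projects/my-project/datasets/dr_logs'))
    sink.create()

    # Entries can also be read back directly, e.g. during a recovery test.
    for entry in client.list_entries(filter_='severity >= ERROR'):
        print(entry.timestamp, entry.payload)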

Stackdriver Monitoring

Stackdriver Monitoring provides dashboards and alerts for your applications. Stackdriver Monitoring lets you review performance metrics for various Google Cloud Platform services and, via the monitoring agent, provides service-specific monitoring hooks for many popular open source services. You can also use the Stackdriver Monitoring API to access monitoring data and create custom metrics.
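As a sketch of writing a custom metric (for example, recording how long a recovery drill took), using the monitoring_v3 client library. Exact signatures vary between library versions (this assumes the 1.x protobuf-style API), and the project, metric name, and value are hypothetical:

    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()

    # Hypothetical custom metric: how long the last recovery drill took.
    series = monitoring_v3.types.TimeSeries()
    series.metric.type = 'custom.googleapis.com/dr/restore_duration_seconds'
    series.resource.type = 'global'

    point = series.points.add()
    point.value.double_value = 842.0
    point.interval.end_time.seconds = int(time.time())

    client.create_time_series('projects/my-project', [series])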

Cloud Deployment Manager

Google Cloud Deployment Manager is an infrastructure management service that makes it simple to automate the creation and deployment of Google Cloud Platform resources. Create a static or dynamic template that describes the configuration of your Google Cloud Platform environment, then use Deployment Manager to create these resources as a single deployment.
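Deployment Manager templates can be written in Python as well as Jinja. Here is a minimal template sketch that a deployment configuration could import; the instance name, zone, and image are hypothetical:

    def GenerateConfig(context):
        """Deployment Manager template defining one recovery VM instance."""
        project = context.env['project']
        base = 'https://www.googleapis.com/compute/v1/projects/'
        resources = [{
            'name': 'dr-web-1',
            'type': 'compute.v1.instance',
            'properties': {
                'zone': 'us-central1-f',
                'machineType': (base + project +
                                '/zones/us-central1-f/machineTypes/'
                                'n1-standard-1'),
                'disks': [{
                    'boot': True,
                    'autoDelete': True,
                    'initializeParams': {
                        'sourceImage': (base + 'debian-cloud/global/'
                                        'images/family/debian-9'),
                    },
                }],
                'networkInterfaces': [{
                    'network': base + project + '/global/networks/default',
                }],
            },
        }]
        return {'resources': resources}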

Remote connectivity

You can also use Google Cloud Platform as a remote recovery solution for your on-premises production environment or from another cloud service. Google Cloud Platform provides several services that can help ensure a seamless connection between your production environment and Google's cloud.

Carrier Interconnect

Carrier Interconnect enables you to connect your infrastructure to Google by way of enterprise-grade connections to Google's network edge. The connections are offered by Carrier Interconnect service providers. Connecting this way gives your infrastructure a high-availability, low-latency link to Google Cloud Platform.

Direct peering

If you meet Google’s technical peering requirements, you might be able to establish a direct peering connection between your business network and Google’s. With this connection, you can exchange Internet traffic between the two networks at one of Google’s broad-reaching edge network locations.

Compute Engine VPN

Google Compute Engine VPN lets you connect your existing network to your Compute Engine network using an IPsec connection. Alternatively, you can use Compute Engine VPN to connect two different Compute Engine VPN gateways.

Conclusion

With a well-designed, well-tested disaster recovery plan in place, you can ensure that the impact on your business's bottom line will be minimal when catastrophe hits. No matter what your disaster recovery needs look like, Google Cloud Platform has a robust, flexible, and cost-efficient selection of products and features that you can use to build or augment the solution that is right for you.

Next steps

Check out the Disaster Recovery Cookbook
The Disaster Recovery Cookbook examines a number of common disaster recovery scenarios and provides targeted advice for implementing robust solutions using Google Cloud Platform.
Build a scalable and resilient app on Google Cloud Platform
Building Scalable and Resilient Web Applications walks you through a sample application deployment and provides guidance on how to make your application scalable, durable, and cost-efficient.
Try some example implementations
Performing MySQL Hot Backups with Percona XtraBackup and Cloud Storage and Using Cloud Storage for Cassandra Disaster Recovery offer end-to-end recovery strategies for MySQL and Cassandra, respectively.
