Disaster Recovery Cookbook

When it comes to disaster recovery, there's no silver bullet—that is, no single recovery plan can cover all use cases. This article provides guidance for handling a variety of disaster recovery scenarios using Google's cloud infrastructure.

Terminology

This article uses the following terms:

  • Recovery time objective (RTO): the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).
  • Recovery point objective (RPO): the maximum acceptable length of time during which data might be lost due to a major incident. This metric describes a length of time only; it does not address the amount or quality of the data lost.
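
As a rough illustration of how these two terms constrain a plan, the following sketch checks a backup schedule against an RPO and a recovery procedure against an RTO. It assumes periodic backups, so the worst case is a failure just before the next backup runs. The function names and the split of downtime into detection and restore time are illustrative assumptions, not part of any Google Cloud Platform API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses up to one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule can never lose more than the RPO allows."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(detection_time: timedelta, restore_time: timedelta,
              rto: timedelta) -> bool:
    """True if detecting the outage plus restoring service fits in the RTO."""
    return detection_time + restore_time <= rto
```

For example, hourly backups satisfy a four-hour RPO, while six-hourly backups do not.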

For a broader discussion of these concepts, as well as general principles for designing a disaster recovery plan, see Designing a Disaster Recovery Plan with Google Cloud Platform.

Scenarios

This section explores common disaster recovery scenarios and provides recovery strategies and example implementations on Google Cloud Platform for each.

Historical data recovery

Historical data most often needs to be archived for compliance reasons, but it is also commonly archived for use in future historical analysis. In both cases, it's important to archive relevant log and database data in a durable way using an easily accessible and transformable format.

Typically, historical data has a medium or large RTO. However, as it is expected to be complete and accurate, historical data tends to have a small RPO.

Archiving log data

Log data is usually used for historical trend analysis and for potential forensic analysis. Generally, this data does not need to be stored for years. However, as noted earlier, it's important that this data can be easily converted into a format that lends itself to analysis.

Google Cloud Platform provides several options for exporting log data, including:

  • Stream to Google Cloud Storage bucket, which periodically writes your logs to Cloud Storage. The files are timestamped, encrypted, and stored in appropriately-named folders, making it simple to locate logs from a given time period.
  • Stream to BigQuery dataset, which streams your logs to a BigQuery dataset. BigQuery stores data in an immutable, read-only manner.

For details on exporting logs, see Exporting Your Logs.

Archiving database data

Relational database backups often use a multitiered solution, where the live data is stored on a local storage device and backups are stored on progressively "colder" storage solutions. In this solution, a cron job (or similar) backs up the live data to the second tier at regular intervals, and another job is used to back up data from that tier to another tier at slightly wider intervals.

One possible implementation of this strategy on Google Cloud Platform would be to use persistent disk for the live data tier, a standard Cloud Storage bucket for the second tier, and a Cloud Storage Nearline bucket for the final tier. In this implementation, the tiers would be connected as follows:

  1. Configure your application to back up data to the persistent disk attached to the instance.
  2. Set up a task, such as a cron job, to move the data to the standard Cloud Storage bucket after a defined period of time.
  3. Finally, set up another cron job or use Cloud Storage Transfer Service to move your data from the standard bucket to the Nearline bucket.
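
The age-based movement between tiers described in these steps can be sketched as a small planning function that a cron job might call. This is a minimal sketch: the two thresholds and the tier labels are hypothetical values chosen for illustration, and the actual copy and delete operations are left out.

```python
from datetime import timedelta

# Hypothetical thresholds; pick values that fit your own RTO and storage costs.
MOVE_TO_STANDARD_AFTER = timedelta(days=1)
MOVE_TO_NEARLINE_AFTER = timedelta(days=30)


def tier_for_backup(age: timedelta) -> str:
    """Return the tier a backup of the given age should live on."""
    if age < MOVE_TO_STANDARD_AFTER:
        return "persistent-disk"
    if age < MOVE_TO_NEARLINE_AFTER:
        return "standard-bucket"
    return "nearline-bucket"


def plan_moves(backups):
    """Given {name: (current_tier, age)}, list (name, from_tier, to_tier)
    for every backup that is on the wrong tier for its age."""
    moves = []
    for name, (current_tier, age) in sorted(backups.items()):
        target = tier_for_backup(age)
        if target != current_tier:
            moves.append((name, current_tier, target))
    return moves
```

Keeping the decision separate from the transfer makes the schedule easy to test before it touches real data.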

The following diagram illustrates this example implementation:

Figure 1: Multitiered backup

To make this a complete disaster recovery solution, you must also implement some method of restoring your backups to a compatible version of the database. Three viable approaches are as follows:

  • Create a custom image that has the proper version of the database system installed.

    You can then create a new Compute Engine instance with this image to test the import process. Note that this approach requires regular and rigorous testing.

  • Take regular snapshots of your database system.

    If your database system lives on a Compute Engine persistent disk, you can take snapshots of your system each time you upgrade. If your database system goes down or you need to roll back to a previous version, you can simply create a new persistent disk from your desired snapshot and make that disk the boot disk for a new Compute Engine instance. Note that, to avoid data corruption, this approach requires you to freeze the database system's disk while taking a snapshot.

  • Export the data to a highly-portable flat format such as CSV, XML, or JSON, and store it in Cloud Storage Nearline.

    This approach will provide maximum flexibility, allowing you to import the data into any database system you choose to use. In addition, JSON and CSV can be easily imported into BigQuery, which will make future analysis simple and straightforward.
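
A flat-format export can be sketched as follows. The example uses an in-memory SQLite database purely as a stand-in for your own database system, and the `orders` table is hypothetical. It writes one JSON object per line (newline-delimited JSON), which is a JSON layout that BigQuery can load.

```python
import json
import sqlite3


def export_table_as_ndjson(conn: sqlite3.Connection, table: str) -> str:
    """Return every row of `table` as newline-delimited JSON."""
    # The table name is interpolated into the query, so it must come from
    # trusted code rather than from user input.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    lines = [json.dumps(dict(zip(columns, row)), sort_keys=True)
             for row in cursor]
    return "\n".join(lines) + ("\n" if lines else "")


# Stand-in database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "alice", 19.5), (2, "bob", 7.0)])
dump = export_table_as_ndjson(conn, "orders")
```

The resulting string could then be written to a file and uploaded to a Cloud Storage Nearline bucket with whatever tooling you already use.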

Archiving directly to BigQuery

If your use case permits, you can archive real-time event data directly into BigQuery by using streaming inserts. This approach is particularly useful for performing big data analytics. To prevent accidental overwrites, you should use IAM to manage who has update and delete access to the data written to the tables.

Data corruption recovery

When database data has been corrupted, you need to recover the data and make it available again quickly. A good approach here is to use backups in combination with transactional log files from the corrupted database to roll back to a known-good state.
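
The idea of combining a backup with a transaction log can be sketched with a toy key-value store: start from the backup, then replay only the log entries made after the backup and up to the last known-good moment. The `LogEntry` shape and the integer timestamps are assumptions made for illustration; real database logs are considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    timestamp: int   # hypothetical: seconds since some epoch
    key: str
    value: object    # None means the key was deleted


def recover_to_point_in_time(backup, backup_time, log, target_time):
    """Rebuild state as of `target_time` from a backup plus a log."""
    state = dict(backup)
    for entry in sorted(log, key=lambda e: e.timestamp):
        if entry.timestamp <= backup_time:
            continue  # already reflected in the backup
        if entry.timestamp > target_time:
            break     # everything after this point is unwanted
        if entry.value is None:
            state.pop(entry.key, None)
        else:
            state[entry.key] = entry.value
    return state
```

Choosing a `target_time` just before the corrupting write restores every legitimate change made since the backup while leaving the corruption out.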

If you have chosen to use Cloud SQL, Google Cloud Platform's fully-managed MySQL database, you should enable automated backups and binary logging for your Cloud SQL instances. This will allow you to easily perform a point-in-time recovery, which restores your database from a backup and recovers it to a fresh Cloud SQL instance. For more details, see Cloud SQL Backups and Recovery.

If you manage your own relational databases with Compute Engine, the principles remain the same, but you are responsible for managing the database service and implementing an appropriate backup process.

If you are using an append-only data store like BigQuery, there are a number of mitigating strategies you can adopt:

  • Export the data from BigQuery, and create a new table that contains the exported data but excludes the corrupted data.
  • Store your data in different tables for specific time periods. This method ensures that you will need to restore only a subset of data to a new table, rather than a whole dataset.
  • Store the original data on Cloud Storage. This will allow you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.
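
The second strategy, one table per time period, can be sketched as follows. The sketch assumes daily tables named with a `YYYYMMDD` suffix and a hypothetical `events` prefix; given the range of days affected by corruption, it lists only the tables that need to be rebuilt.

```python
from datetime import date, timedelta


def table_for_day(prefix: str, day: date) -> str:
    """Name of the daily table holding data for `day`."""
    return f"{prefix}_{day:%Y%m%d}"


def tables_to_restore(prefix: str, first_bad_day: date, last_bad_day: date):
    """List the daily tables covering the corrupted date range, inclusive."""
    day_count = (last_bad_day - first_bad_day).days + 1
    return [table_for_day(prefix, first_bad_day + timedelta(days=offset))
            for offset in range(day_count)]
```

Every table outside the returned list can stay in service untouched, which is what keeps the restore small.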

Additionally, if your RTO permits, you can prevent access to the table with the corrupted data by leaving your applications offline until the uncorrupted data has been restored to a new table.

Application recovery

It's important to maintain high levels of uptime—if your service is unavailable, you're losing business. This section will examine ways of failing your application over to another location as quickly as possible.

Hot standby server failover

In this solution, you have a continuously online server on standby. This server does not receive traffic while the main application server is functional.

If your service is running entirely on Google Compute Engine, you can streamline application failover by using Compute Engine's HTTP load balancing service. The HTTP load balancer accepts traffic through a single global external IP address, and then distributes it according to forwarding rules you define. Properly configured, this service will automatically fail over to your standby server in the event that a main instance becomes unhealthy.

Warm standby server failover

This solution is identical to hot standby server failover, but omits use of Compute Engine's HTTP load balancing service in favor of manual DNS adjustment. Here, RTO is determined by how quickly you can adjust the DNS record to cut over to the standby server.

Cold standby server failover

In this solution, you have an offline application server on standby that is identical to the main application server. In the event that the main application server goes offline, the standby server is instantiated. Once it is online, traffic fails over to it.

The following diagram illustrates one possible implementation:

Figure 2: Cold standby server example

In this example, you would run the following:

  • A serving instance. This instance is part of an instance group, and said group is used as a backend service for an HTTP load balancer.
  • A minimal instance that performs the following functions:

    • Runs a cron job to snapshot the serving instance at regular intervals
    • Checks the health of the serving instance at regular intervals

This minimal instance is part of a managed instance group, and this group is controlled by a Compute Engine autoscaler. The autoscaler is configured to keep exactly one minimal instance running at all times, utilizing an instance template to create a new instance in the event that the current running instance becomes unavailable.

If the minimal instance detects that the serving instance has been unresponsive for a specified period of time, it instantiates a new instance using the latest snapshot and adds the new instance to the instance group that backs the HTTP load balancer. When the new instance comes online, the HTTP load balancer begins directing traffic to it, as illustrated below:

Figure 3: Cold standby post-recovery state
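
The decision logic that the minimal instance runs can be sketched as a small watchdog. This is a sketch under stated assumptions: the class name and the time values are hypothetical, the health check itself is supplied by the caller, and creating the replacement instance from the latest snapshot is left out.

```python
class ServingInstanceWatchdog:
    """Tracks health checks and reports when failover should begin."""

    def __init__(self, unresponsive_limit_s: float):
        self.unresponsive_limit_s = unresponsive_limit_s
        self._first_failure_at = None

    def record_check(self, now_s: float, healthy: bool) -> bool:
        """Record one health check result taken at time `now_s`.

        Returns True once the serving instance has been continuously
        unresponsive for at least `unresponsive_limit_s` seconds.
        """
        if healthy:
            self._first_failure_at = None  # any success resets the window
            return False
        if self._first_failure_at is None:
            self._first_failure_at = now_s
        return now_s - self._first_failure_at >= self.unresponsive_limit_s
```

Requiring a sustained run of failures, rather than acting on a single failed check, avoids creating a replacement instance in response to a brief network blip.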

Warm static site failover

In the unlikely event that you are unable to serve your application from Compute Engine instances, you can mitigate service interruption by having a Cloud Storage-based static site on standby. This solution is very economical, and can be particularly effective if your website has few or no dynamic elements—in the event of failure, you can simply change your DNS settings, and you will have something serving immediately.

The following diagram illustrates an example implementation:

Figure 4: Warm static site example

In the above example, the primary application runs on Compute Engine instances. These instances are grouped into managed instance groups, and these instance groups serve as backend services for the HTTP load balancer. The HTTP load balancer directs incoming traffic to the instances according to the load balancer configuration, the instance groups' respective configurations, and the health of each instance.

In the normal configuration, Cloud DNS is configured to point at this primary application, and the standby static site sits dormant. In the event that the Compute Engine application is unable to serve, you would simply configure Cloud DNS to point to this static site.

Remote recovery

If your production environment is on-premises or on another cloud provider, Google Cloud Platform can be useful as a target for backups and archives. Using Carrier Interconnect, Direct Peering, and/or Compute Engine VPN, you can easily adapt the previously described disaster recovery strategies to your own situation. This section discusses methods for integrating Google Cloud Platform into your remote disaster recovery strategies.

Replicating storage with Google Cloud Platform

If you are replicating from an on-premises storage appliance, you can use Carrier Interconnect or Direct Peering to establish a connection with Google Cloud Platform, then copy your data to the storage solution of your choice. Data can then be restored to your on-premises storage or to a storage location on Google Cloud Platform.

The diagram below illustrates one possible implementation of this solution:

Figure 5: Replication from on-premises storage to Google Cloud Storage

If you are replicating from other cloud services, you might be able to use the Google Cloud Storage XML API. This API is interoperable with some cloud storage tools and libraries that work with services such as Amazon Simple Storage Service (Amazon S3) and HP Helion Eucalyptus Storage (Walrus).

Replicating application data with Google Cloud Platform

In this scenario, production workloads are on-premises and Google Cloud Platform is the disaster recovery failover target.

One possible solution is to set up a minimal recovery suite—a cold standby application server and a hot/active database—on Google Cloud Platform, configuring the former to quickly scale up in the event that it needs to run a production workload. In this situation, the database must be kept up-to-date; however, the application servers would only be instantiated when there is a need to switch over to production. Depending on your RTO, the appropriate image starting point would be used to start and configure a working instance.

The diagram below illustrates how a multitiered application can run on-premises while using a minimal recovery suite on Google Cloud Platform:

Figure 6: On-premises to Google Cloud Platform recovery plan

Notice that only the database server instance is running on the Google Cloud Platform side. As noted earlier, this instance must run at all times so that it can receive the replicated data.

To reduce costs, you can run the database on the smallest machine type capable of running the database service. When the on-premises application needs to fail over, you can make your database system production-ready as follows:

  1. Destroy the minimal instance, making sure to keep the persistent disk containing your database system intact. If your system is on the boot disk, you will need to set the auto-delete state of the disk to false before destroying this instance.
  2. Create a new instance, using a machine type that has appropriate resources for handling a production load.
  3. Attach the persistent disk containing your database system to the new instance.
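
To show why the auto-delete setting in step 1 matters, the following sketch models the three steps with plain in-memory objects. It does not call Compute Engine; the `Disk` and `Instance` classes, the `-prod` name suffix, and the machine type strings used in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    auto_delete: bool = True   # modeled as the default for a boot disk


@dataclass
class Instance:
    name: str
    machine_type: str
    disks: list = field(default_factory=list)


def delete_instance(instance: Instance) -> list:
    """Delete the instance and return only the disks that survive."""
    survivors = [disk for disk in instance.disks if not disk.auto_delete]
    instance.disks = []
    return survivors


def promote_database(minimal: Instance, production_machine_type: str) -> Instance:
    """Model steps 1 to 3: keep the disks, drop the small instance,
    and attach the disks to a larger one."""
    for disk in minimal.disks:
        disk.auto_delete = False                    # step 1: protect the disk
    surviving_disks = delete_instance(minimal)      # step 1: destroy instance
    return Instance(name=minimal.name + "-prod",    # step 2: bigger machine
                    machine_type=production_machine_type,
                    disks=surviving_disks)          # step 3: reattach disk
```

For instance, promoting an `f1-micro` instance to `n1-standard-8` carries the database disk across, whereas deleting an instance whose disk still has auto-delete enabled leaves no disk behind.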

In the event of a disaster, your monitoring service will be triggered to spin up the web tier and application tier instances in Google Cloud Platform. You can then adjust the Cloud DNS record to point to the web tier or, if you are using the Compute Engine HTTP load balancing service, to the load balancer's external IP. The following diagram illustrates the state of the overall production environment after the disaster recovery plan has been executed:

Figure 7: Post-recovery state

For smaller RTO values, you could adjust the above strategy by keeping all of the Compute Engine instances operational but not receiving traffic (see Warm standby server failover). This strategy is generally not cost-efficient. If your RTO does not allow for the time it would take to bootstrap from a minimal configuration, consider implementing a fully-operational environment that serves traffic both from on-premises and from Google Cloud Platform, as illustrated below:

Figure 8: Active/active hybrid production environment (on-premises and Google Cloud Platform)

If you choose to implement this hybrid approach, be sure to use a DNS service that supports weighted routing when routing traffic to the two production environments so that you are able to deliver the same application from both. In the event that one environment becomes unavailable, you can then disable DNS routing to the unavailable environment.
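
The effect of weighted routing, and of disabling one environment, can be sketched as a calculation of traffic shares. The environment names and weights are hypothetical, and the function only models the arithmetic; it does not configure any DNS service.

```python
def routing_shares(weights, healthy):
    """Return the fraction of traffic each healthy environment receives.

    `weights` maps environment name to its configured weight; `healthy`
    is the set of environments that are currently allowed to serve.
    """
    active = {name: weight for name, weight in weights.items()
              if name in healthy and weight > 0}
    total = sum(active.values())
    if total == 0:
        raise RuntimeError("no healthy environment left to receive traffic")
    return {name: weight / total for name, weight in active.items()}
```

With weights of 70 and 30, the two environments receive 70% and 30% of traffic; once one of them is removed from the healthy set, the other receives all of it.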

Maintaining machine image consistency

If you choose to implement an on-premises/cloud or cloud/cloud hybrid solution, you will most likely need to find a way to maintain consistency across production environments.

For a discussion of how to create an automated pipeline for continuously building images with Packer and other open source utilities, see Automated Image Builds with Jenkins, Packer, and Kubernetes.

If a fully-configured image is required, consider something like Packer, which can create identical machine images for multiple platforms from a single configuration file. In the case of Packer, you can put the configuration file in version control to keep track of what version is deployed in production.

As another option, you could use configuration management tools such as Chef, Puppet, Ansible, or SaltStack to configure instances with finer granularity, creating base images, minimally-configured images, or fully-configured images as needed. For a discussion of how to use these tools effectively, see Compute Engine Management with Puppet, Chef, Salt, and Ansible.

You can also manually convert and import existing images such as Amazon AMIs, VirtualBox images, and RAW disk images to Compute Engine.
