Disaster recovery scenarios for applications

This article is the last part of a series that discusses disaster recovery (DR) in Google Cloud. This part explores common disaster recovery scenarios for applications.

Introduction

This article frames DR scenarios for applications in terms of DR patterns that indicate how readily the application can recover from a disaster event. It uses the concepts discussed in the DR building blocks article to describe how you can implement an end-to-end DR plan appropriate for your recovery goals.

To begin, consider some typical workloads to illustrate how thinking about your recovery goals and architecture has a direct influence on your DR plan.

Batch processing workloads

Batch processing workloads tend not to be mission critical, so you typically don't need to incur the cost of designing a high availability (HA) architecture to maximize uptime; in general, batch processing workloads can tolerate interruptions. This type of workload can take advantage of cost-effective products such as preemptible VM instances, which are instances that you can create and run at a much lower price than normal instances. (However, Compute Engine can terminate, or preempt, these instances if it requires those resources for other tasks, and it always terminates them within 24 hours of launch.)

If the processing task implements regular checkpoints, the job can resume from the point of failure when new VMs are launched. If you're using Dataproc, preemptible worker nodes are launched and managed by a managed instance group. This can be considered a warm pattern: there's a short pause while replacement VMs launch and processing resumes.
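The checkpointing idea can be sketched in a few lines of shell. This is a minimal illustration with hypothetical file names; a real job would write its checkpoint to a persistent disk or a Cloud Storage bucket so that progress survives preemption.

```shell
#!/bin/bash
# Minimal checkpoint/resume sketch: the job records its progress after each
# work item so a replacement VM can pick up where the preempted one stopped.
CHECKPOINT=checkpoint.txt          # hypothetical; use durable storage in practice
START=0
[ -f "$CHECKPOINT" ] && START=$(cat "$CHECKPOINT")
for i in $(seq "$START" 9); do
  echo "processing item $i"
  echo $((i + 1)) > "$CHECKPOINT"  # persist progress after each item
done
```

Launching the worker as a preemptible instance is then a single flag at creation time, for example `gcloud compute instances create batch-worker --preemptible`.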

Ecommerce sites

In ecommerce sites, some parts of the application can have larger RTO values. For example, the actual purchasing pipeline needs to have high availability, but the email process that sends order notifications to customers can tolerate a few hours' delay. Customers already know about their purchase, so although they expect a confirmation email, the notification is not a crucial part of the process. This is a mix of hot (purchasing) and warm or cold (notification) patterns.

The transactional part of the application needs high uptime with a minimal RTO value. Therefore, you use HA, which maximizes the availability of this part of the application. This approach can be considered a hot pattern.

The ecommerce scenario illustrates how you can have varying RTO values within the same application.

Video streaming

A video streaming solution has many components that need to be highly available, from the search experience to the actual process of streaming content to the user. In addition, the system requires low latency to create a satisfactory user experience. If any aspect of the solution fails to provide a great experience, it's bad for the supplier as well as the customer. Moreover, customers today can easily turn to a competing product.

In this scenario, an HA architecture is a must-have, and small RTO values are needed. This scenario requires a hot pattern throughout the application architecture to guarantee minimal impact in case of a disaster.

DR and HA architectures for production on-premises

This section examines how to implement three patterns—cold, warm, and hot—when your application runs on-premises and your DR solution is on Google Cloud.

Cold pattern: Recovery to Google Cloud

In a cold pattern, you have minimal resources in the DR Google Cloud project—just enough to enable a recovery scenario. When there's a problem that prevents the production environment from running production workloads, the failover strategy requires a mirror of the production environment to be started in Google Cloud. Clients then start using the services from the DR environment.

In this section we examine an example of this pattern. In the example, Cloud Interconnect is configured with a self-managed (non-Google Cloud) VPN solution to provide connectivity to Google Cloud. Data is copied to Cloud Storage as part of the production environment.

DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Self-managed VPN solution
  • Cloud Storage
  • Compute Engine
  • Cloud Load Balancing
  • Deployment Manager

The following diagram illustrates this example architecture:

Architecture for cold pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the Google Cloud network.
  3. Create a Cloud Storage bucket as the target for your data backup.
  4. Generate a service account key for a dedicated service account. This file is used to pass credentials to an automated script.
  5. Copy the service account key to the on-premises machine where you will run the script that uploads your database backups. (This could be your database server, but your security policies might not permit you to install additional software on your database server.)

  6. Create an IAM policy to restrict who can access the bucket and its objects. Include the service account created specifically for this purpose, and add the user account or group for your operator or system admin, granting the relevant permissions to all of these identities. For details about permissions for access to Cloud Storage, see IAM permissions for gsutil commands.

  7. Test that you can upload and download files in the target bucket.

  8. Create a data-transfer script.

  9. Create a scheduled task to run the script.

  10. Create custom images that are configured for each server in the production environment. Each image should be of the same configuration as its on-premises equivalent.

    As part of the database server custom image configuration, create a startup script that will automatically copy the latest backup from a Cloud Storage bucket to the instance and then invoke the restore process.

  11. Configure Cloud DNS to point to your internet-facing web services.

  12. Create a Deployment Manager template that will create application servers in your Google Cloud network using the previously configured custom images. This template should also set up the appropriate firewall rules required.

You need to implement processes to ensure that the custom images have the same version of the application as on-premises. Ensure that you incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Deployment Manager template is using the latest custom image.
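Steps 3 through 7 can be sketched with the gcloud and gsutil command-line tools. The project, bucket, and service account names below are hypothetical placeholders:

```shell
# Create the backup bucket (step 3).
gsutil mb -l us-central1 gs://example-dr-backups

# Create a dedicated service account and download its key (step 4).
gcloud iam service-accounts create dr-backup-writer
gcloud iam service-accounts keys create key.json \
    --iam-account=dr-backup-writer@example-project.iam.gserviceaccount.com

# Grant the service account write access to the bucket (step 6).
gsutil iam ch \
    serviceAccount:dr-backup-writer@example-project.iam.gserviceaccount.com:objectCreator \
    gs://example-dr-backups

# On the on-premises machine, activate the key and test an upload (steps 5 and 7).
gcloud auth activate-service-account --key-file=key.json
gsutil cp test-backup.dump gs://example-dr-backups/
```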

Failover process and post-restart tasks

If a disaster occurs, you can recover to the system that's running on Google Cloud. To do this, you launch your recovery process to create the recovery environment using the Deployment Manager template you created. When the instances in the recovery environment are ready to accept production traffic, you adjust DNS to point to the web server in Google Cloud.

A typical recovery sequence is this:

  1. Use the Deployment Manager template to create a deployment in Google Cloud.
  2. Apply the most recent database backup in Cloud Storage to the database server running in Google Cloud by following your database system's instructions for recovering backup files.
  3. Apply the most recent transaction logs in Cloud Storage.
  4. Test that the application works as expected by simulating user scenarios on the recovered environment.
  5. When tests succeed, configure Cloud DNS to point to the web server on Google Cloud. (For example, you can use an anycast IP address behind a Google Cloud load balancer, with multiple web servers behind the load balancer.)
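The first and last steps of this sequence might look like the following; the deployment name, template file, DNS zone, and IP address are hypothetical:

```shell
# Step 1: create the recovery environment from the Deployment Manager template.
gcloud deployment-manager deployments create dr-recovery \
    --template=dr-environment.jinja

# Step 5: after the restore and tests succeed, repoint DNS at the
# load balancer in front of the recovered web servers.
gcloud dns record-sets update www.example.com. \
    --zone=example-dns-zone --type=A \
    --rrdatas=203.0.113.10 --ttl=300
```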

The following diagram shows the recovered environment:

Configuration of cold pattern for recovery when production is on-premises

When the production environment is running on-premises again and the environment can support production workloads, you reverse the steps that you followed to fail over to the Google Cloud recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on Google Cloud.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in Google Cloud. For example, prevent connections to the global load balancer. From this point your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Configure Cloud DNS to point to your on-premises web service.
  7. Ensure that the process you had in place to copy data to Cloud Storage is operating as expected.
  8. Delete your deployment.

Warm standby: Recovery to Google Cloud

A warm pattern is typically implemented to keep RTO and RPO values as small as possible without the effort and expense of a fully HA configuration. The smaller the RTO and RPO value, the higher the costs as you approach having a fully redundant environment that can serve traffic from two environments. Therefore, implementing a warm pattern for your DR scenario is a good trade-off between budget and availability.

An example of this approach is to use Cloud Interconnect configured with a self-managed VPN solution to provide connectivity to Google Cloud. A multitiered application is running on-premises while using a minimal recovery suite on Google Cloud. The recovery suite consists of an operational database server instance on Google Cloud. This instance must run at all times so that it can receive replicated transactions through asynchronous or semisynchronous replication techniques. To reduce costs, you can run the database on the smallest machine type that's capable of running the database service. Because you can use a long-running instance, sustained use discounts will apply.

DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Self-managed VPN solution
  • Compute Engine
  • Deployment Manager

Compute Engine snapshots provide a way to take backups that you can roll back to a previous state. Snapshots are used in this example because updated web pages and application binaries are frequently written to the production web servers and application servers. These updates are regularly replicated to the reference web server and application server instances on Google Cloud. (The reference servers don't accept production traffic; they're used only to create the snapshots.)

The following diagram illustrates an architecture that implements this approach. The replication targets are not shown in the diagram.

Architecture for a warm pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the Google Cloud network.
  3. Replicate your on-premises servers to Google Cloud VM instances. One option is to use a partner solution; the method you employ depends on your circumstances.
  4. Create a custom image of your database server on Google Cloud that has the same configuration as your on-premises database server.
  5. Create snapshots of the web server and application server instances.
  6. Start a database instance in Google Cloud using the custom image you created earlier. Use the smallest machine type that is capable of accepting replicated data from the on-premises production database.
  7. Attach persistent disks to the Google Cloud database instance for the databases and transaction logs.
  8. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.
  9. Set the auto-delete state of the persistent disks attached to the database instance to no-auto-delete, so that the disks are preserved if the instance is deleted.
  10. Configure a scheduled task to create regular snapshots of the persistent disks of the database instance on Google Cloud.
  11. Create reservations to assure capacity for your web server and application servers as needed.
  12. Test the process of creating instances from snapshots and of taking snapshots of the persistent disks.
  13. Create instances of the web server and the application server using the snapshots created earlier.
  14. Create a script that copies updates to the web application and the application server whenever the corresponding on-premises servers are updated. Write the script to create a snapshot of the updated servers.
  15. Configure Cloud DNS to point to your internet-facing web service on premises.
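Steps 9 through 11 map to a handful of gcloud commands. The disk, policy, zone, and machine-type values here are hypothetical:

```shell
# Step 9: keep the data disk if the database instance is deleted.
gcloud compute instances set-disk-auto-delete db-replica \
    --disk=db-data-disk --no-auto-delete --zone=us-central1-a

# Step 10: schedule hourly snapshots of the data disk.
gcloud compute resource-policies create snapshot-schedule db-hourly \
    --region=us-central1 --max-retention-days=7 \
    --hourly-schedule=1 --start-time=04:00
gcloud compute disks add-resource-policies db-data-disk \
    --resource-policies=db-hourly --zone=us-central1-a

# Step 11: reserve capacity for the web and application servers.
gcloud compute reservations create web-capacity \
    --zone=us-central1-a --vm-count=2 --machine-type=e2-standard-4
```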

Failover process and post-restart tasks

To manage a failover, you typically use your monitoring and alerting system to invoke an automated failover process. When the on-premises application needs to fail over, you configure the database system on Google Cloud so it is able to accept production traffic. You also start instances of the web and application server.

The following diagram shows the configuration after failover to Google Cloud enabling production workloads to be served from Google Cloud:

Configuration of warm pattern for recovery when production is on-premises

A typical recovery sequence is this:

  1. Resize the database server instance so that it can handle production loads.
  2. Use the web server and application snapshots on Google Cloud to create new web server and application instances.
  3. Test that the application works as expected by simulating user scenarios on the recovered environment.
  4. When tests succeed, configure Cloud DNS to point to your web service on Google Cloud.
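Step 1, resizing the standby database server, requires a stop/start cycle on Compute Engine. The instance name, zone, and target machine type below are hypothetical:

```shell
gcloud compute instances stop db-replica --zone=us-central1-a
gcloud compute instances set-machine-type db-replica \
    --zone=us-central1-a --machine-type=n2-standard-8
gcloud compute instances start db-replica --zone=us-central1-a
```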

When the production environment is running on-premises again and can support production workloads, you reverse the steps that you followed to fail over to the Google Cloud recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on Google Cloud.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in Google Cloud. One way to do this is to prevent connections to the web server by modifying the firewall rules. From this point your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Test that the application works as expected by simulating user scenarios on the production environment.
  7. Configure Cloud DNS to point to your on-premises web service.
  8. Delete the web server and application server instances that are running in Google Cloud. Leave the reference servers running.
  9. Resize the database server on Google Cloud back to the minimum instance size that can accept replicated data from the on-premises production database.
  10. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.

Hot HA across on-premises and Google Cloud

You can achieve small RTO and RPO values only by running HA across your production environment and Google Cloud concurrently. This approach gives you a hot pattern, because both the on-premises environment and Google Cloud serve production traffic.

The key difference from the warm pattern is that the resources in both environments are running in production mode and serving production traffic.

DR building blocks:

  • Cloud Interconnect
  • Cloud VPN
  • Compute Engine
  • Managed instance groups
  • Cloud Monitoring
  • Cloud Load Balancing

The following diagram illustrates this example architecture. By implementing this architecture, you have a DR plan that requires minimal intervention in the event of a disaster.

Architecture for a hot pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and your Google Cloud network.
  3. Create custom images in Google Cloud that are configured for each server in the on-premises production environment. Each Google Cloud image should have the same configuration as its on-premises equivalent.
  4. Configure replication between your on-premises database server and the database server in Google Cloud by following the instructions for your database software.

    Many database systems permit only a single writeable database instance when you configure replication. Therefore, you might need to ensure that one of the database replicas acts as a read-only server.

  5. Create individual instance templates that use the images for the application servers and the web servers.

  6. Configure regional managed instance groups for the application and web servers.

  7. Configure health checks using Cloud Monitoring.

  8. Configure load balancing using the regional managed instance groups that were configured earlier.

  9. Configure a scheduled task to create regular snapshots of the persistent disks.

  10. Configure a DNS service to distribute traffic between your on-premises environment and the Google Cloud environment.

With this hybrid approach, you need to use a DNS service that supports weighted routing to the two production environments so that you can serve the same application from both.
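If you use Cloud DNS, weighted round-robin (WRR) routing policies provide this kind of split. The following sketch assumes hypothetical zone, record, and IP address values:

```shell
# Send half the traffic to the on-premises front end and half to Google Cloud.
gcloud dns record-sets create www.example.com. \
    --zone=example-dns-zone --type=A --ttl=300 \
    --routing-policy-type=WRR \
    --routing-policy-data="0.5=198.51.100.5;0.5=203.0.113.10"
```

Adjusting the weights (for example, setting one side to 0) lets you drain traffic from an environment without deleting the record.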

You need to design the system for failures that affect only part of an environment (partial failures). In that case, traffic should be rerouted to the equivalent service in the other environment. For example, if the on-premises web servers become unavailable, you can disable DNS routing to that environment. If your DNS service supports health checks, this happens automatically when the health check determines that the web servers in one of the environments can't be reached.

If you're using a database system that allows only a single writeable instance, in many cases the database system will automatically promote the read-only replica to be the writeable primary when the heartbeat between the original writable database and the read replica loses contact. Be sure that you understand this aspect of your database replication in case you need to intervene after a disaster.

You must implement processes to ensure that the custom VM images in Google Cloud have the same version of the application as the versions on-premises. Incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Deployment Manager template is using the latest custom image.

Failover process and post-restart tasks

In the configuration described here for a hot scenario, a disaster simply means that one of the two environments isn't available. There is no failover process in the same way that there is with the warm or cold scenarios, where you need to move data or processing to the second environment. However, you might need to handle the following configuration changes:

  • If your DNS service doesn't automatically reroute traffic based on a health check failure, you need to manually reconfigure DNS routing to send traffic to the system that's still up.
  • If your database system doesn't automatically promote a read-only replica to be the writeable primary on failure, you need to intervene to ensure that the replica is promoted.

When the second environment is running again and can handle production traffic, you need to resynchronize databases. Because both environments support production workloads, you don't have to take any further action to change which database is the primary. After the databases are synchronized, you can allow production traffic to be distributed across both environments again by adjusting the DNS settings.

DR and HA architectures for production on Google Cloud

When you design your application architecture for production workload on Google Cloud, the HA features of the platform have a direct influence on your DR architecture.

Cold: recoverable application server

In a cold failover scenario where you need a single active server instance, only one instance should write to disk. In an on-premises environment, you often use an active/passive cluster. When you run a production environment on Google Cloud, you can instead create a VM in a managed instance group that runs only one instance.

DR building blocks:

  • Compute Engine
  • Managed instance groups

This cold failover scenario is shown in the following example architecture image:

Configuration of cold pattern for recovery when production is on Google Cloud

For a complete overview of how to deploy this example scenario and test the recovery from failure, see Deploy a cold recoverable application server with persistent disk snapshots.

The following steps outline how to configure this cold failover scenario:

  1. Create a VPC network.
  2. Create a custom VM image that's configured with your application web service.
    1. Configure the VM so that the data processed by the application service is written to an attached persistent disk.
  3. Create a snapshot from the attached persistent disk.
  4. Create an instance template that references the custom VM image for the web server.
    1. Configure a startup script to create a persistent disk from the latest snapshot and to mount the disk. This script must be able to get the latest snapshot of the disk.
  5. Create a managed instance group and health checks with a target size of one that references the instance template.
  6. Create a scheduled task to create regular snapshots of the persistent disk.
  7. Configure external HTTP(S) Load Balancing.
  8. Configure alerts using Cloud Monitoring to send an alert when the service fails.
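The startup script from step 4 might look like the following sketch, which finds the most recent snapshot, creates a disk from it, attaches the disk, and mounts it. The disk names, zone, and snapshot filter are hypothetical, and error handling is omitted:

```shell
#!/bin/bash
# Find the most recent snapshot of the application data disk.
LATEST=$(gcloud compute snapshots list \
    --filter="sourceDisk~app-data" \
    --sort-by=~creationTimestamp --limit=1 --format="value(name)")

# Create a disk from that snapshot, attach it to this VM, and mount it.
gcloud compute disks create app-data-restored \
    --zone=us-central1-a --source-snapshot="$LATEST"
gcloud compute instances attach-disk "$(hostname)" \
    --disk=app-data-restored --device-name=app-data-restored \
    --zone=us-central1-a
mkdir -p /mnt/app-data
mount /dev/disk/by-id/google-app-data-restored /mnt/app-data
```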

This cold failover scenario takes advantage of some of the HA features available in Google Cloud. If a VM fails, the managed instance group tries to recreate the VM automatically. You don't have to initiate this failover step. External HTTP(S) Load Balancing makes sure that even when a replacement VM is needed, the same IP address is used in front of the application server. The instance template and custom image make sure that the replacement VM is configured identically to the instance it replaces.

Your RPO is determined by the last snapshot taken. The more often you take snapshots, the smaller the RPO value.

The managed instance group provides HA in depth, with ways to react to failures at the application or VM level. You don't have to manually intervene if either of those scenarios occurs. A target size of one ensures that only one active instance runs in the managed instance group and serves traffic.

Persistent disks are zonal, so you must take snapshots to re-create disks if there's a zonal failure. Snapshots are also available across regions, which lets you restore a disk to a different region as easily as restoring it to the same region.

In the unlikely event of a zonal failure, you must manually intervene to recover, as outlined in the next section.

Failover process

If a VM fails, the managed instance group automatically tries to recreate a VM in the same zone. The startup script in the instance template creates a persistent disk from the latest snapshot and attaches it to the new VM.

However, a managed instance group with a size of one doesn't recover from a zone failure. If a zone fails, you must respond to the alert from Cloud Monitoring (or your other monitoring platform) and manually create an instance group in another zone.

A variation on this configuration is to use regional persistent disks instead of zonal persistent disks. With this approach, you don't need to use snapshots to restore the persistent disk as part of the recovery step. However, this variation consumes twice as much storage and you need to budget for that. To deploy this alternative scenario and test the recovery from failure, see Deploy a cold recoverable application server with regional persistent disks.
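With the regional persistent disk variation, the replacement VM in the second zone reattaches the surviving replica instead of restoring a snapshot. Creating such a disk is a single command; the names, region, and zones below are hypothetical (note that regional persistent disks have a larger minimum size than zonal disks):

```shell
gcloud compute disks create app-data-regional \
    --region=us-central1 \
    --replica-zones=us-central1-a,us-central1-b \
    --size=200GB --type=pd-ssd
```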

The approach you choose is dictated by your budget and RTO and RPO values.

Warm: static site failover

If Compute Engine instances fail, you can mitigate service interruption by having a Cloud Storage-based static site on standby. Using a static site is an option when your web application is mostly static. In this scenario, the primary application runs on Compute Engine instances. These instances are grouped into managed instance groups, and the instance groups serve as backend services for an HTTPS load balancer. The load balancer directs incoming traffic to the instances according to the load balancer configuration, the configuration of each instance group, and the health of each instance.

DR building blocks:

  • Compute Engine
  • Cloud Storage
  • Cloud Load Balancing
  • Cloud DNS

The following diagram illustrates this example architecture:

Architecture for a warm failover to a static site when production is on Google Cloud

For a complete overview of how to deploy this example scenario and test the recovery from failure, see Deploy a warm recoverable web server using Cloud DNS with Compute Engine and Cloud Storage.

The following steps outline how to configure this scenario:

  1. Create a VPC network.
  2. Create a custom image that's configured with the application web service.
  3. Create an instance template that uses the image for the web servers.
  4. Configure a managed instance group for the web servers.
  5. Configure health checks using Monitoring.
  6. Configure load balancing using the managed instance groups that you configured earlier.
  7. Create a Cloud Storage-based static site.

In the production configuration, Cloud DNS is configured to point at this primary application, and the standby static site sits dormant. If the Compute Engine application goes down, you would configure Cloud DNS to point to this static site.
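Setting up the standby bucket and the DNS flip can be sketched as follows. The bucket and zone names are hypothetical, and the CNAME approach assumes the site is served from a subdomain whose record you can repoint at a bucket named after that domain:

```shell
# Create the standby bucket, configure it as a website, and make it readable.
gsutil mb gs://static.example.com
gsutil web set -m index.html -e 404.html gs://static.example.com
gsutil iam ch allUsers:objectViewer gs://static.example.com

# On failover, point the record at the Cloud Storage endpoint.
gcloud dns record-sets update static.example.com. \
    --zone=example-dns-zone --type=CNAME \
    --rrdatas=c.storage.googleapis.com. --ttl=300
```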

Failover process

If the application server or servers go down, your recovery sequence is to configure Cloud DNS to point to your static website. The following diagram shows the architecture in its recovery mode:

Configuration after failover to a static site when production is on GCP

When the application Compute Engine instances are running again and can support production workloads, you reverse the recovery step: you configure Cloud DNS to point to the load balancer that fronts the instances.

For an alternative approach that uses external HTTP(S) Load Balancing instead of Cloud DNS to control the failover, see Deploy a warm recoverable web server with Compute Engine and Cloud Storage. This pattern is useful if you don't have, or don't want to use, Cloud DNS.

Hot: HA web application

A hot pattern when your production environment is running on Google Cloud is to establish a well-architected HA deployment.

DR building blocks:

  • Compute Engine
  • Cloud Load Balancing
  • Cloud SQL

The following diagram illustrates this example architecture:

Architecture of a hot pattern when production is on Google Cloud

This scenario takes advantage of HA features in Google Cloud—you don't have to initiate any failover steps, because they will occur automatically in the event of a disaster.

As shown in the diagram, the architecture uses a regional managed instance group together with global load balancing and Cloud SQL. The example here uses a regional managed instance group, so the instances are distributed across three zones.

With this approach, you get HA in depth. Regional managed instance groups provide mechanisms to react to failures at the application, instance, or zone level, and you don't have to manually intervene if any of those scenarios occurs.

To address application-level recovery, as part of setting up the managed instance group, you configure HTTP health checks that verify that the services are running properly on the instances in that group. If a health check determines that a service has failed on an instance, the group automatically re-creates that instance.
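Creating the regional group with autohealing driven by an HTTP health check looks roughly like the following; the names, region, and health-check parameters are hypothetical:

```shell
# An HTTP health check that probes each instance's /healthz endpoint.
gcloud compute health-checks create http web-hc \
    --port=80 --request-path=/healthz \
    --check-interval=10s --unhealthy-threshold=3

# A regional managed instance group spread across the region's zones,
# recreating any instance that fails the health check.
gcloud compute instance-groups managed create web-mig \
    --region=us-central1 --size=3 \
    --template=web-template \
    --health-check=web-hc --initial-delay=120
```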

For detailed steps on one way to configure an HA web application on Google Cloud, see Scalable and resilient web application on Google Cloud.

What's next