Disaster Recovery Scenarios for Applications

This article is the last part of a multi-part series that discusses disaster recovery (DR) in Google Cloud Platform (GCP). This part explores common disaster recovery scenarios for applications.

Introduction

This article frames DR scenarios for applications in terms of DR patterns that indicate how readily the application can recover from a disaster event. It uses the concepts discussed in the DR building blocks article to describe how you can implement an end-to-end DR plan appropriate for your recovery goals.

To begin, consider some typical workloads to illustrate how thinking about your recovery goals and architecture has a direct influence on your DR plan.

Batch processing workloads

Batch processing workloads tend not to be mission critical, so you typically don't need to incur the cost of designing an HA architecture to maximize uptime; in general, batch processing workloads can tolerate interruptions. This type of workload can take advantage of cost-effective products such as preemptible VM instances, which you can create and run at a much lower price than normal instances. (However, Compute Engine can terminate (preempt) these instances if it requires those resources for other tasks, and it always terminates them within 24 hours of launch.)

By implementing regular checkpoints as part of the processing task, you enable the job to resume from the point of failure when new VMs are launched. If you're using Cloud Dataproc, preemptible worker nodes are managed by a managed instance group. This can be considered a warm pattern: there's a short pause while replacement VMs launch before processing continues.
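The checkpointing approach can be sketched as a small shell wrapper around the per-item work. This is a minimal illustration, not a Cloud Dataproc feature; the work items and checkpoint path are hypothetical, and the `echo` stands in for the real processing step:

```shell
#!/usr/bin/env bash
# Sketch: resume a batch job from the last recorded checkpoint.
set -euo pipefail

CHECKPOINT_FILE="${CHECKPOINT_FILE:-/tmp/batch.checkpoint}"
ITEMS=(chunk-01 chunk-02 chunk-03 chunk-04)   # hypothetical work list

# Read the index of the first unprocessed item; start from 0 on a fresh run.
start=0
if [[ -f "$CHECKPOINT_FILE" ]]; then
  start=$(<"$CHECKPOINT_FILE")
fi

for ((i = start; i < ${#ITEMS[@]}; i++)); do
  echo "processing ${ITEMS[$i]}"           # stand-in for the real work
  echo "$((i + 1))" > "$CHECKPOINT_FILE"   # record progress after each item
done
```

If a VM is preempted mid-run, the replacement VM re-runs the same script and skips the chunks that were already completed.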

Ecommerce sites

In ecommerce sites, some parts of the application can tolerate larger RTO values. For example, the actual purchasing pipeline needs high availability, but the email process that sends order notifications to customers can tolerate a delay of a few hours. Customers already know about their purchase, so although they expect a confirmation email, the notification is not a crucial part of the process. This is a mix of hot (purchasing) and warm/cold (notification) patterns.

The transactional part of the application needs high uptime with a minimal RTO value. Therefore, you use HA, which maximizes the availability of this part of the application. This approach can be considered a hot pattern.

The ecommerce scenario illustrates how you can have varying RTO values within the same application.

Video streaming

A video streaming solution has many components that need to be highly available, from the search experience to the actual process of streaming content to the user. In addition, the system requires low latency to create a satisfactory user experience. If any aspect of the solution fails to provide a great experience, it's bad for the supplier as well as the customer. Moreover, customers today can easily switch to a competing product.

In this scenario, an HA architecture is a must-have, and small RTO values are needed. This scenario requires a hot pattern throughout the application architecture to guarantee minimal impact in case of a disaster.

DR and HA architectures for production on-premises

This section examines how to implement three patterns—cold, warm, and hot—when your application runs on-premises and your DR solution is on GCP.

Cold pattern: Recovery to GCP

In a cold pattern, you have minimal resources in the DR GCP project—just enough to enable a recovery scenario. When there's a problem that prevents the production environment from running production workloads, the failover strategy requires a mirror of the production environment to be started in GCP. Clients then start using the services from the DR environment.

In this section we examine an example of this pattern. In the example, Cloud Interconnect is configured with Cloud VPN to provide connectivity to GCP. Data is copied to Cloud Storage as part of the production environment.

DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Cloud VPN
  • Cloud Storage
  • Compute Engine
  • Cloud Load Balancing
  • Cloud Deployment Manager

The following diagram illustrates this example architecture:

Architecture for cold pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the GCP network.
  3. Create a Cloud Storage bucket as the target for your data backup.
  4. Generate a service account key for a dedicated service account. This file is used to pass credentials to an automated script.
  5. Copy the service account key to the on-premises machine where you will run the script that uploads your database backups. (This could be your database server, but your security policies might not permit you to install additional software on your database server.)

  6. Create an IAM policy to restrict who can access the bucket and its objects. You include the service account created specifically for this purpose. You also add the user account or group to the policy for your operator or system admin, granting to all these identities the relevant permissions. For details about permissions for access to Cloud Storage, see IAM permissions for gsutil commands.

  7. Test that you can upload and download files in the target bucket.

  8. Create a data-transfer script.

  9. Create a scheduled task to run the script.

  10. Create custom images that are configured for each server in the production environment. Each image should be of the same configuration as its on-premises equivalent.

    As part of the database server custom image configuration, create a startup script that will automatically copy the latest backup from a Cloud Storage bucket to the instance and then invoke the restore process.

  11. Configure Cloud DNS to point to your internet-facing web services.

  12. Create a Cloud Deployment Manager template that will create application servers in your GCP network using the previously configured custom images. This template should also set up the required firewall rules.

You need to implement processes to ensure that the custom images have the same version of the application as on-premises. Ensure that you incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Cloud Deployment Manager template is using the latest custom image.
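Steps 3 through 5 and step 8 of the setup can be sketched with `gsutil` and `gcloud`. This is a sketch under assumed, hypothetical names (`my-dr-project`, `dr-backups-bucket`, `backup-writer`, and the backup path); adapt them to your environment:

```shell
# Step 3: create the target bucket for database backups (hypothetical name).
gsutil mb -l us-central1 gs://dr-backups-bucket

# Step 4: create a dedicated service account and generate a key file.
gcloud iam service-accounts create backup-writer \
    --display-name "DR backup uploader" --project my-dr-project
gcloud iam service-accounts keys create backup-writer-key.json \
    --iam-account backup-writer@my-dr-project.iam.gserviceaccount.com

# Steps 5 and 8, on the on-premises machine: authenticate with the key
# and upload the latest backup file (hypothetical path).
gcloud auth activate-service-account --key-file backup-writer-key.json
gsutil cp /var/backups/db/latest.dump gs://dr-backups-bucket/
```

In practice you would wrap the final `gsutil cp` in the scheduled data-transfer script from steps 8 and 9.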

Failover process and post-restart tasks

If a disaster occurs, you can recover to GCP. To do this, you launch your recovery process, which creates the recovery environment from the Cloud Deployment Manager template you created earlier. When the instances in the recovery environment are ready to accept production traffic, you adjust DNS to point to the web server in GCP.

A typical recovery sequence is this:

  1. Use the Cloud Deployment Manager template to create a deployment in GCP.
  2. Apply the most recent database backup in Cloud Storage to the database server running in GCP by following your database system's instructions for recovering backup files.
  3. Apply the most recent transaction logs in Cloud Storage.
  4. Test that the application works as expected by simulating user scenarios on the recovered environment.
  5. When tests succeed, configure Cloud DNS to point to the web server on GCP. (For example, you can use an anycast IP address behind a GCP load balancer, with multiple web servers behind the load balancer.)
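Steps 1 and 5 of this sequence can be sketched with `gcloud`. The deployment name, config file, DNS zone, record name, and IP addresses below are all hypothetical stand-ins:

```shell
# Step 1: create the recovery deployment from the prepared template.
gcloud deployment-manager deployments create dr-recovery \
    --config dr-environment.yaml

# Step 5: after tests pass, swap the A record from the on-premises
# address (203.0.113.10) to the GCP load balancer address (34.120.0.10).
gcloud dns record-sets transaction start --zone my-zone
gcloud dns record-sets transaction remove --zone my-zone \
    --name www.example.com. --type A --ttl 300 "203.0.113.10"
gcloud dns record-sets transaction add --zone my-zone \
    --name www.example.com. --type A --ttl 300 "34.120.0.10"
gcloud dns record-sets transaction execute --zone my-zone
```

Keeping the TTL low on records you expect to change during failover shortens the time clients keep resolving to the failed environment.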

The following diagram shows the recovered environment:

Configuration of cold pattern for recovery when production is on-premises

When the production environment is running on-premises again and the environment can support production workloads, you reverse the steps that you followed to fail over to the GCP recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on GCP.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in GCP. For example, prevent connections to the global load balancer. From this point your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Configure Cloud DNS to point to your on-premises web service.
  7. Ensure that the process you had in place to copy data to Cloud Storage is operating as expected.
  8. Delete your deployment.

Warm standby: Recovery to GCP

A warm pattern is typically implemented to keep RTO and RPO values as small as possible without the effort and expense of a fully HA configuration. The smaller the RTO and RPO values, the higher the costs, as you approach a fully redundant environment that can serve traffic from two environments. Therefore, implementing a warm pattern for your DR scenario is a good trade-off between budget and availability.

An example of this approach is to use Cloud Interconnect configured with Cloud VPN to provide connectivity to GCP. A multitiered application is running on-premises while using a minimal recovery suite on GCP. The recovery suite consists of an operational database server instance on GCP. This instance must run at all times so that it can receive replicated transactions through asynchronous or semisynchronous replication techniques. To reduce costs, you can run the database on the smallest machine type that's capable of running the database service. Because you can use a long-running instance, sustained use discounts will apply.

DR building blocks:

  • Cloud DNS
  • Cloud Interconnect
  • Cloud VPN
  • Compute Engine
  • Cloud Deployment Manager

Compute Engine snapshots provide a way to take backups that you can roll back to a previous state. Snapshots are used in this example because updated web pages and application binaries are written frequently to the production web servers and application servers. These updates are regularly replicated to the reference web server and application server instances on GCP. (The reference servers don't accept production traffic; they are used to create the snapshots.)

The following diagram illustrates an architecture that implements this approach. The replication targets are not shown in the diagram.

Architecture for a warm pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and the GCP network.
  3. Replicate your on-premises servers to GCP VM instances. One option is to use a partner solution; the method you employ depends on your circumstances.
  4. Create a custom image of your database server on GCP that has the same configuration as your on-premises database server.
  5. Create snapshots of the web server and application server instances.
  6. Start a database instance in GCP using the custom image you created earlier. Use the smallest machine type that is capable of accepting replicated data from the on-premises production database.
  7. Attach persistent disks to the GCP database instance for the databases and transaction logs.
  8. Configure replication between your on-premises database server and the database server in GCP by following the instructions for your database software.
  9. Set the auto delete flag on the persistent disks attached to the database instance to no-auto-delete.
  10. Configure a scheduled task to create regular snapshots of the persistent disks of the database instance on GCP.
  11. Test the process of creating instances from snapshots and of taking snapshots of the persistent disks.
  12. Create instances of the web server and the application server using the snapshots created earlier.
  13. Create a script that copies updates to the reference web server and application server whenever the corresponding on-premises servers are updated. Write the script so that it also creates a snapshot of the updated servers.
  14. Configure Cloud DNS to point to your internet-facing web service on premises.
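Steps 7, 9, and 10 can be sketched with `gcloud`. The instance, disk, and snapshot names and the zone are hypothetical:

```shell
# Step 7: create and attach a persistent disk for databases and logs.
gcloud compute disks create db-data --size 200GB --zone us-central1-a
gcloud compute instances attach-disk dr-db --disk db-data --zone us-central1-a

# Step 9: keep the disk if the instance is ever deleted.
gcloud compute instances set-disk-auto-delete dr-db \
    --disk db-data --no-auto-delete --zone us-central1-a

# Step 10: snapshot the data disk on a schedule, for example hourly
# from cron (the % characters must be escaped in a crontab entry):
# 0 * * * * gcloud compute disks snapshot db-data --zone us-central1-a \
#     --snapshot-names db-data-$(date +\%Y\%m\%d\%H\%M)
```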

Failover process and post-restart tasks

To manage a failover, you typically use your monitoring and alerting system to invoke an automated failover process. When the on-premises application needs to fail over, you configure the database system on GCP so it is able to accept production traffic. You also start instances of the web and application server.

The following diagram shows the configuration after failover to GCP enabling production workloads to be served from GCP:

Configuration of warm pattern for recovery when production is on-premises

A typical recovery sequence is this:

  1. Resize the database server instance so that it can handle production loads.
  2. Use the web server and application snapshots on GCP to create new web server and application instances.
  3. Test that the application works as expected by simulating user scenarios on the recovered environment.
  4. When tests succeed, configure Cloud DNS to point to your web service on GCP.
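Steps 1 and 2 of this recovery sequence can be sketched as follows. The instance names, machine type, snapshot name, and zone are hypothetical; a Compute Engine instance must be stopped before its machine type can be changed:

```shell
# Step 1: resize the database VM to a production-capable machine type.
gcloud compute instances stop dr-db --zone us-central1-a
gcloud compute instances set-machine-type dr-db \
    --machine-type n1-standard-8 --zone us-central1-a
gcloud compute instances start dr-db --zone us-central1-a

# Step 2: create a boot disk from the latest web server snapshot,
# then launch a serving instance from it.
gcloud compute disks create web-1-disk \
    --source-snapshot web-snap-latest --zone us-central1-a
gcloud compute instances create web-1 \
    --disk name=web-1-disk,boot=yes --zone us-central1-a
```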

When the production environment is running on-premises again and can support production workloads, you reverse the steps that you followed to fail over to the GCP recovery environment. A typical sequence to return to the production environment is this:

  1. Take a backup of the database running on GCP.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent connections to the application in GCP. One way to do this is to prevent connections to the web server by modifying the firewall rules. From this point your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them to the database server.
  6. Test that the application works as expected by simulating user scenarios on the production environment.
  7. Configure Cloud DNS to point to your on-premises web service.
  8. Delete the web server and application server instances that are running in GCP. Leave the reference servers running.
  9. Resize the database server on GCP back to the minimum instance size that can accept replicated data from the on-premises production database.
  10. Configure replication between your on-premises database server and the database server in GCP by following the instructions for your database software.

Hot HA across on-premises and GCP

If you have small RTO and RPO values, you can achieve these only by running HA across your production environment and GCP concurrently. This approach gives you a hot pattern, because both on-premises and GCP are serving production traffic.

The key difference from the warm pattern is that the resources in both environments are running in production mode and serving production traffic.

DR building blocks:

  • Cloud Interconnect
  • Cloud VPN
  • Compute Engine
  • Managed instance groups
  • Stackdriver
  • Cloud Load Balancing

The following diagram illustrates this example architecture. By implementing this architecture, you have a DR plan that requires minimal intervention in the event of a disaster.

Architecture for a hot pattern when production is on-premises

The following steps outline how you can configure the environment:

  1. Create a VPC network.
  2. Configure connectivity between your on-premises network and your GCP network.
  3. Create custom images in GCP that are configured for each server in the on-premises production environment. Each GCP image should have the same configuration as its on-premises equivalent.
  4. Configure replication between your on-premises database server and the database server in GCP by following the instructions for your database software.

    Many database systems permit only a single writeable database instance when you configure replication. Therefore, you might need to ensure that one of the database replicas acts as a read-only server.

  5. Create individual instance templates that use the images for the application servers and the web servers.

  6. Configure regional managed instance groups for the application and web servers.

  7. Configure health checks using Stackdriver Monitoring.

  8. Configure load balancing using the regional managed instance groups that were configured earlier.

  9. Configure a scheduled task to create regular snapshots of the persistent disks.

  10. Configure a DNS service to distribute traffic between your on-premises environment and the GCP environment.

With this hybrid approach, you need to use a DNS service that supports weighted routing to the two production environments so that you can serve the same application from both.
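As one example of weighted routing, newer Cloud DNS releases support weighted round-robin (WRR) routing policies. The sketch below assumes that support plus hypothetical zone, record, and IP values, sending 80% of traffic on-premises and 20% to GCP:

```shell
# Sketch: weighted A record splitting traffic between the on-premises
# front end (203.0.113.10) and the GCP front end (34.120.0.10).
gcloud dns record-sets create www.example.com. --zone my-zone \
    --type A --ttl 60 \
    --routing-policy-type WRR \
    --routing-policy-data "80.0=203.0.113.10;20.0=34.120.0.10"
```

If your DNS provider is not Cloud DNS, most managed DNS services offer an equivalent weighted-record feature.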

You need to design the system for failures that might occur in only part of an environment (partial failures). In that case, traffic should be rerouted to the equivalent service in the other environment. For example, if the on-premises web servers become unavailable, you can disable DNS routing to that environment. If your DNS service supports health checks, this happens automatically when the health check determines that the web servers in one of the environments can't be reached.

If you're using a database system that allows only a single writeable instance, in many cases the database system will automatically promote the read-only replica to be the writeable primary when the heartbeat between the primary and the read replica is lost. Be sure that you understand this aspect of your database replication in case you need to intervene after a disaster.

You must implement processes to ensure that the custom VM images in GCP have the same version of the application as the versions on-premises. Incorporate upgrades to the custom images as part of your standard upgrade cycle, and ensure that your Cloud Deployment Manager template is using the latest custom image.

Failover process and post-restart tasks

In the configuration described here for a hot scenario, a disaster simply means that one of the two environments isn't available. There is no failover process in the same way that there is with the warm or cold scenarios, where you need to move data or processing to the second environment. However, you might need to handle the following configuration changes:

  • If your DNS service doesn't automatically reroute traffic based on a health check failure, you need to manually reconfigure DNS routing to send traffic to the system that's still up.
  • If your database system doesn't automatically promote a read-only replica to be the writeable primary on failure, you need to intervene to ensure that the replica is promoted.

When the second environment is running again and can handle production traffic, you need to resynchronize databases. Because both environments support production workloads, you don't have to take any further action to change which database is the primary. After the databases are synchronized, you can allow production traffic to be distributed across both environments again by adjusting the DNS settings.

DR and HA architectures for production on GCP

When you design your application architecture for production workload on GCP, the HA features of the platform have a direct influence on your DR architecture.

Cold: recoverable application server

In a cold scenario where you need only a single server instance, you want only one instance writing to disk. On-premises, this is often achieved by having an active/passive cluster. In contrast, when you run a production environment on GCP, the server instance is part of a managed instance group, and that group is used as a backend service for the internal load balancing service.

DR building blocks:

  • Compute Engine
  • GCP Internal Load Balancing

The following diagram illustrates this example architecture. No client connections are illustrated, because you wouldn't normally have an external client connect directly to an application server; instead, there would usually be a proxy or web application in front of the application server.

Architecture of a cold scenario when production is on GCP

The following steps outline how to configure this scenario:

  1. Create a VPC network.
  2. Create a custom image that's configured with the application service. As part of the image, make the following configurations:

    1. Configure the server so that data processed by the application service is written to an attached persistent disk.
    2. Create a snapshot from the attached persistent disk.
    3. Configure a startup script to create a persistent disk from the latest snapshot and to mount the disk. This script needs to be able to get the latest snapshot of the disk.
  3. Create an instance template that uses the image.

  4. Configure a regional managed instance group with a target size of 1 using the instance template that you created earlier.

  5. Configure health checks using Monitoring.

  6. Configure internal load balancing using the regional managed instance group that you configured earlier.

  7. Configure a scheduled task to create regular snapshots of the persistent disk.

    This scenario takes advantage of some of the HA features available in GCP. You don't have to initiate any failover steps, because they will occur automatically in the event of a disaster. The internal load balancer ensures that even when a replacement instance is needed, the same IP address is used to front the application server. The instance template and custom image ensure that the replacement instance is configured identically to the instance it is replacing.
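The startup script from step 2c can be sketched as follows. It assumes a hypothetical snapshot naming convention (`app-data-*`), disk name, and zone, and that the instance name matches the VM's hostname, which is the Compute Engine default:

```shell
#!/usr/bin/env bash
# Sketch: on boot, re-create the data disk from the newest snapshot
# and mount it on the replacement instance.
set -euo pipefail
ZONE=us-central1-a

# Find the most recent snapshot of the application data disk.
LATEST=$(gcloud compute snapshots list \
    --filter="name~^app-data-" \
    --sort-by=~creationTimestamp --limit=1 --format="value(name)")

# Create a fresh disk from it and attach it to this instance.
gcloud compute disks create app-data \
    --source-snapshot "$LATEST" --zone "$ZONE"
gcloud compute instances attach-disk "$(hostname)" \
    --disk app-data --zone "$ZONE"

# The default device name matches the disk name.
mkdir -p /mnt/app-data
mount /dev/disk/by-id/google-app-data /mnt/app-data
```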

Your RPO will be determined by the last snapshot taken. The more often the snapshots are taken, the smaller the RPO value.

The regional managed instance group provides HA in depth. It provides mechanisms to react to failures at the application, instance, or zone level, and you don't have to manually intervene if any of those scenarios occurs. Setting a target size of 1 ensures that only one instance is ever running.

Persistent disks are zonal, so snapshots are required in order to re-create disks in case of a zonal failure. Snapshots are also available across regions, which permits you to restore a disk to a different region as easily as restoring it to the same region.

Failover process

In this scenario, in the event of a zonal failure, the regional instance group launches a replacement instance in a different zone in the same region. A new persistent disk is created from the latest snapshot and attached to the new instance. The following diagram illustrates this state:

Configuration of cold pattern for recovery when production is on GCP

Some variations on this configuration include the following:

  • Using regional persistent disks instead of zonal persistent disks. With this approach, you don't need to use snapshots to restore the persistent disk as part of the recovery step. However, this variation consumes twice as much storage and you will need to budget for that.
  • Using a managed instance group instead of a regional managed instance group, and attaching the persistent disk to the replacement instance that is started in the same zone as the instance that failed. In this scenario, you must configure the persistent disk's auto-delete setting to no-auto-delete.

The approach you choose will be dictated by your budget and RTO and RPO values.

Warm: static site failover

If Compute Engine instances fail, you can mitigate service interruption by having a Cloud Storage-based static site on standby. Using a static site is an option when your web application is mostly static. In this scenario, the primary application runs on Compute Engine instances. These instances are grouped into managed instance groups, and the instance groups serve as backend services for an HTTP(S) load balancer. The load balancer directs incoming traffic to the instances according to the load balancer configuration, the configuration of each instance group, and the health of each instance.

DR building blocks:

  • Compute Engine
  • Cloud Storage
  • Cloud Load Balancing
  • Cloud DNS

The following diagram illustrates this example architecture:

Architecture for a warm failover to a static site when production is on GCP

The following steps outline how to configure this scenario:

  1. Create a VPC network.
  2. Create a custom image that's configured with the application web service.
  3. Create an instance template that uses the image for the web servers.
  4. Configure a managed instance group for the web servers.
  5. Configure health checks using Monitoring.
  6. Configure load balancing using the managed instance groups that you configured earlier.
  7. Create a Cloud Storage-based static site.

In the production configuration, Cloud DNS is configured to point at this primary application, and the standby static site sits dormant. If the Compute Engine application goes down, you would configure Cloud DNS to point to this static site.
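Step 7, creating the static standby site, can be sketched with `gsutil`. The bucket name (which must match the site's domain for direct bucket serving), local directory, and error page are hypothetical:

```shell
# Sketch: create the standby bucket and configure it as a website.
gsutil mb -l us-central1 gs://static.example.com
gsutil web set -m index.html -e 404.html gs://static.example.com

# Upload the static content and make it publicly readable.
gsutil -m rsync -r ./static-site gs://static.example.com
gsutil iam ch allUsers:objectViewer gs://static.example.com
```

Because the bucket sits dormant until failover, keeping its content synchronized with the primary application is part of your regular release process.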

Failover process

If the application server or servers go down, your recovery sequence is to configure Cloud DNS to point to your static website. The following diagram shows the architecture in its recovery mode:

Configuration after failover to a static site when production is on GCP

When the application Compute Engine instances are running again and can support production workloads, you reverse the recovery step: you configure Cloud DNS to point to the load balancer that fronts the instances.

Hot: HA web application

A hot pattern when your production environment is running on GCP is to establish a well-architected HA deployment.

DR building blocks:

  • Compute Engine
  • Cloud Load Balancing
  • Cloud SQL

The following diagram illustrates this example architecture:

Architecture of a hot pattern when production is on GCP

This scenario takes advantage of HA features in GCP—you don't have to initiate any failover steps, because they will occur automatically in the event of a disaster.

As shown in the diagram, the architecture uses a regional managed instance group together with global load balancing and Cloud SQL. The example here uses a regional managed instance group, so the instances are distributed across three zones.

With this approach, you get HA in depth. Regional managed instance groups provide mechanisms to react to failures at the application, instance, or zone level, and you don't have to manually intervene if any of those scenarios occurs.

To address application-level recovery, as part of setting up the managed instance group, you configure HTTP health checks that verify that the services are running properly on the instances in that group. If a health check determines that a service has failed on an instance, the group automatically re-creates that instance.
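The autohealing configuration described above can be sketched with `gcloud`. The health-check name, instance group name, port, request path, and region are hypothetical:

```shell
# Sketch: an HTTP health check that probes the application endpoint.
gcloud compute health-checks create http web-health-check \
    --port 80 --request-path /healthz \
    --check-interval 10s --unhealthy-threshold 3

# Attach it to the regional managed instance group for autohealing;
# the initial delay gives new instances time to boot before probing.
gcloud compute instance-groups managed update web-mig \
    --region us-central1 \
    --health-check web-health-check --initial-delay 300
```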

For detailed steps on one way to configure an HA web application on GCP, see Scalable and resilient web application on GCP.
