Disaster Recovery Scenarios for Data

This article is the third part of a multi-part series that discusses disaster recovery (DR) in Google Cloud Platform (GCP). This part discusses scenarios for backing up and recovering data.

The series consists of these parts:

  • Disaster recovery planning guide
  • Disaster recovery building blocks
  • Disaster recovery scenarios for data (this article)
  • Disaster recovery scenarios for applications

Introduction

Your disaster recovery plans must specify how you can avoid losing data during a disaster. The term data here covers two scenarios. Backing up and then recovering databases, log data, and other data types fits into one of the following scenarios:

  • Data backups. Backing up data alone involves copying a discrete amount of data from one place to another. Backups are made as part of a recovery plan either to recover from a corruption of data so that you can restore to a known good state directly in the production environment, or so that you can restore data in your DR environment if your production environment is down. Typically, data backups have a small to medium RTO and a small RPO.
  • Database backups. Database backups are slightly more complex, because they typically involve recovering to a point in time. Therefore, in addition to considering how to back up and restore the database backups and ensuring that the recovery database system mirrors the production configuration (same version, mirrored disk configuration), you also need to consider how to back up transaction logs. During recovery, after you restore database functionality, you have to apply the latest database backup and then the recovered transaction logs that were backed up after the last backup. Because of the complicating factors inherent to database systems (for example, having to match versions between production and recovery systems), adopting a high-availability-first approach, which minimizes the time needed to recover from a situation that could make the database server unavailable, helps you achieve smaller RTO and RPO values.

The rest of this article discusses examples of how to design some scenarios for data and databases that can help you meet your RTO and RPO goals.

Production environment is on-premises

In this scenario, your production environment is on-premises, and your disaster recovery plan involves using GCP as the recovery site.

Data backup and recovery

You can use a number of strategies to implement a process to regularly back up data from on-premises to GCP. This section looks at two of the most common solutions.

Solution 1: Back up to Cloud Storage using a scheduled task

DR building blocks:

  • Cloud Storage

One option for backing up data is to create a scheduled task that runs a script or application to transfer the data to Cloud Storage. You can automate a backup process to Cloud Storage using the gsutil command-line tool or by using one of the Cloud Storage client libraries. For example, the following gsutil command copies all files from a source directory to a specified bucket.

gsutil -m cp -r [SOURCE_DIRECTORY] gs://[BUCKET_NAME]

The following steps outline how to implement a backup and recovery process using the gsutil tool.

  1. Install gsutil on the on-premises machine from which you upload your data files.
  2. Create a bucket as the target for your data backup.
  3. Generate a JSON service account key for a dedicated service account. This file is used to pass credentials to gsutil as part of an automated script.
  4. Copy the service account key to the on-premises machine where you run the script that you use to upload your backups.

  5. Create an IAM policy to restrict who can access the bucket and its objects. (Include the service account created specifically for this purpose and an on-premises operator account.) For details about permissions for access to Cloud Storage, see IAM permissions for gsutil commands.

  6. Test that you can upload and download files in the target bucket.

  7. Follow the guidance in scripting data transfer tasks to set up a scheduled script; a minimal example is sketched after this list.

  8. Configure a recovery process that uses gsutil to recover your data to your DR environment on GCP.
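
The following is a minimal sketch of such a scheduled backup script. The key file path, source directory, and bucket name are assumptions for illustration; adjust them for your environment.

#!/bin/bash
# backup-to-gcs.sh: minimal sketch of a scheduled backup to Cloud Storage.
# The key file path, source directory, and bucket name below are assumptions.

# Authenticate with the dedicated service account key created in step 3.
gcloud auth activate-service-account --key-file=/etc/backup/backup-sa-key.json

# Mirror the source directory into the backup bucket. The -m flag parallelizes
# the transfer, and rsync copies only files that are new or changed.
gsutil -m rsync -r /var/backups/app gs://[BUCKET_NAME]/daily

# Example crontab entry to run this script every night at 01:00:
# 0 1 * * * /usr/local/bin/backup-to-gcs.sh >> /var/log/backup-to-gcs.log 2>&1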

For more information, see Transfer from colocation or on-premises storage, which includes ways to optimize the transfer process.

Solution 2: Back up to Cloud Storage using a partner gateway solution

DR building blocks:

  • Cloud Interconnect
  • Cloud Storage tiered storage

On-premises applications are often integrated with third-party solutions that can be used as part of your data backup and recovery strategy. The solutions often use a tiered storage pattern where you have the most recent backups on faster storage, and slowly migrate your older backups to cheaper (slower) storage. When you use GCP as the target, you can use Cloud Storage Nearline or Cloud Storage Coldline storage as the equivalent of the slower tier.

One way to implement this pattern is to use a partner gateway between your on-premises storage and GCP to facilitate this transfer of data to Cloud Storage. The following diagram illustrates this arrangement, with a partner solution that manages the transfer from the on-premises NAS appliance or SAN.

Architectural diagram showing an on-premises data center connected to GCP using a dedicated interconnection

In the event of a failure, the data being backed up must be recovered to your DR environment. The DR environment is used to serve production traffic until you are able to revert to your production environment. How you achieve this depends on your application, and on the partner solution and its architecture. (Some end-to-end scenarios are discussed in the DR application document.)

For further guidance on ways to transfer data from on-premises to GCP, see Transferring big data sets to GCP.

For more information about partner solutions, see the Partners page on the GCP website.

Database backup and recovery

You can use a number of strategies to implement a process to recover a database system from on-premises to GCP. This section looks at two of the most common solutions.

A detailed discussion of the various built-in backup and recovery mechanisms included with third-party databases is out of scope for this article. This section provides general guidance, which is implemented in the solutions discussed here.

Solution 1: Backup and recovery using a recovery server on GCP

  1. Create a database backup using the built-in backup mechanisms of your database management system.
  2. Create a Cloud Storage bucket as the target for your data backup.
  3. Copy the backup files to Cloud Storage using gsutil or a partner gateway solution (see the steps discussed earlier in the data backup and recovery section). For details, see Transferring big data sets to GCP.
  4. Copy the transaction logs to your recovery site on GCP. Having a backup of the transaction logs helps keep your RPO values small.

After configuring this backup topology, you must ensure that you can recover to the system that's on GCP. This step typically involves not only restoring the backup file to the target database but also replaying the transaction logs to get to the smallest RTO value. A typical recovery sequence looks like this (a sketch of the GCP-side commands follows the list):

  1. Connect your on-premises network and your GCP network.
  2. Create a custom image of your database server on GCP. The database server should have the same configuration on the image as your on-premises database server.
  3. Implement a process to copy your on-premises backup files and transaction log files to Cloud Storage. See solution 1 for an example implementation.
  4. Start a minimally sized instance from the custom image and attach any persistent disks that are needed.
  5. Set the auto-delete flag to false for the persistent disks.
  6. Apply the latest backup file that was previously copied to Cloud Storage, following the instructions from your database system for recovering backup files.
  7. Apply the latest set of transaction log files that have been copied to Cloud Storage.
  8. Replace the minimal instance with a larger instance that is capable of accepting production traffic.
  9. Switch clients to point at the recovered database in GCP.
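
The following gcloud commands are a sketch of steps 2, 4, and 5 of this sequence. The image, instance, disk, zone, and machine type names are assumptions for illustration, not values defined in this article.

# Step 2: create a custom image from the boot disk of a configured database server.
gcloud compute images create db-server-image --source-disk=db-server-boot-disk --source-disk-zone=us-central1-f

# Step 4: start a minimally sized instance from the custom image and attach the data disk.
gcloud compute instances create db-recovery-instance --zone=us-central1-f --machine-type=n1-standard-1 --image=db-server-image
gcloud compute instances attach-disk db-recovery-instance --disk=db-data-disk --zone=us-central1-f

# Step 5: make sure the data disk is not deleted if the instance is deleted.
gcloud compute instances set-disk-auto-delete db-recovery-instance --disk=db-data-disk --no-auto-delete --zone=us-central1-f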

When you have your production environment running and able to support production workloads, you have to reverse the steps that you followed to fail over to the GCP recovery environment. A typical sequence to return to the production environment looks like this:

  1. Take a backup of the database running on GCP.
  2. Copy the backup file to your production environment.
  3. Apply the backup file to your production database system.
  4. Prevent clients from connecting to the database system in GCP; for example, by stopping the database system service. From this point, your application will be unavailable until you finish restoring the production environment.
  5. Copy any transaction log files over to the production environment and apply them.
  6. Redirect client connections to the production environment.

Solution 2: Replication to a standby server on GCP

One way to achieve very small RTO and RPO values is to replicate (not just back up) data, and in some cases database state, in real time to a hot standby of your database server.

  1. Connect your on-premises network and your GCP network.
  2. Create a custom image of your database server on GCP. The database server should have the same configuration on the image as the configuration of your on-premises database server.
  3. Start an instance from the custom image and attach any persistent disks that are needed.
  4. Set the auto-delete flag to false for the persistent disks.
  5. Configure replication between your on-premises database server and the target database server in GCP following the instructions specific to the database software.
  6. During normal operation, configure clients to point to the on-premises database server.

After you configure this replication topology, if a disaster occurs, you switch clients to point to the standby server running in your GCP network.

When you have your production environment back up and able to support production workloads, you have to resynchronize the production database server with the GCP database server and then switch clients to point back to the production environment.

Production environment is GCP

In this scenario, both your production environment and your disaster recovery environment run on GCP.

Data backup and recovery

A common pattern for data backups is to use a tiered storage pattern. When your production workload is on GCP, the tiered storage system looks like the following diagram. You migrate data to a tier that has lower storage costs, because the need to access the backed-up data becomes less likely over time.

DR building blocks:

  • Cloud Storage (including the Nearline and Coldline storage classes)

Conceptual diagram showing decreasing cost as data is migrated from persistent disks to Nearline to Coldline

Because Cloud Storage Nearline and Cloud Storage Coldline are intended for storing infrequently accessed data, there are additional costs associated with retrieving data or metadata stored in these classes, as well as minimum storage durations that you are charged for.
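
As a sketch of this tiered pattern, the following object lifecycle configuration moves backup objects to Nearline after 30 days and to Coldline after 90 days. The bucket name and the age thresholds are assumptions that you would tune to your own retention policy.

# Create a lifecycle configuration file (the age thresholds are assumptions).
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"}, "condition": {"age": 90}}
  ]
}
EOF

# Apply the configuration to the backup bucket.
gsutil lifecycle set lifecycle.json gs://[BUCKET_NAME]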

Database backup and recovery

When you use a self-managed database (for example, you've installed MySQL, PostgreSQL, or SQL Server on a Compute Engine instance), the same operational concerns apply as when you manage production databases on premises, but you no longer need to manage the underlying infrastructure.

You can set up HA configurations by using the appropriate DR building block features to keep RTO small. You can design your database configuration to make it easy to recover to a state as close as possible to the pre-disaster state; this helps keep your RPO values small. GCP provides a wide variety of options for this scenario.

Two common approaches to designing your database recovery architecture for self-managed databases on GCP are discussed in this section.

Recovering a database server without synchronizing state

A common pattern is to enable recovery of a database server that does not require system state to be synchronized with a hot standby.

DR building blocks:

  • Compute Engine
  • Managed instance groups
  • Cloud Load Balancing (internal load balancing)

The following diagram illustrates an example architecture that addresses the scenario. By implementing this architecture, you have a DR plan that reacts automatically to a failure without requiring manual recovery.

Architectural diagram showing a persistent disk image taken from a persistent disk in one zone

The following steps outline how to configure this scenario (a sketch of the corresponding gcloud commands follows the list):

  1. Create a VPC network.
  2. Create a custom image that is configured with the database server by doing the following:

    1. Configure the server so the database files and log files are written to an attached standard persistent disk.
    2. Create a snapshot from the attached persistent disk.
    3. Configure a startup script to create a persistent disk from the snapshot and to mount the disk.
    4. Create a custom image of the boot disk.
  3. Create an instance template that uses the image.

  4. Using the instance template, configure a regional managed instance group with a target size of 1.

  5. Configure health checking using Stackdriver metrics.

  6. Configure internal load balancing using the regional managed instance group.

  7. Configure a scheduled task to create regular snapshots of the persistent disk.
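
The following gcloud commands sketch steps 3 through 7. All resource names, the region, and the database port are assumptions, and the health check shown here is a plain TCP check rather than one based on Stackdriver metrics.

# Step 3: instance template that uses the custom image and the startup script.
gcloud compute instance-templates create db-template --image=db-custom-image --machine-type=n1-standard-4 --metadata-from-file=startup-script=attach-data-disk.sh

# Step 4: regional managed instance group with exactly one instance.
gcloud compute instance-groups managed create db-mig --region=us-central1 --template=db-template --size=1

# Step 5: health check on the database port (a TCP check is used in this sketch).
gcloud compute health-checks create tcp db-health-check --port=3306

# Step 6: internal load balancing in front of the instance group.
gcloud compute backend-services create db-backend-service --load-balancing-scheme=internal --protocol=TCP --region=us-central1 --health-checks=db-health-check
gcloud compute backend-services add-backend db-backend-service --instance-group=db-mig --instance-group-region=us-central1 --region=us-central1
gcloud compute forwarding-rules create db-forwarding-rule --load-balancing-scheme=internal --backend-service=db-backend-service --ports=3306 --region=us-central1

# Step 7: example crontab entry that snapshots the data disk every hour.
# 0 * * * * gcloud compute disks snapshot db-data-disk --zone=us-central1-b --snapshot-names=db-data-$(date +\%Y\%m\%d\%H)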

In the event a replacement database instance is needed, this configuration automatically does the following:

  • Brings up another database server of the correct version.
  • Attaches a persistent disk that has the latest backup and transaction log files to the newly created database server instance.
  • Minimizes the need to reconfigure clients that communicate with your database server in response to an event.
  • Ensures that the GCP security controls (IAM policies, firewall settings) that apply to the production database server apply to the recovered database server.

Because the replacement instance is created from an instance template, the controls that applied to the original apply to the replacement instance.

This scenario takes advantage of some of the HA features available in GCP; you don't have to initiate any failover steps, because they occur automatically in the event of a disaster. The internal load balancer ensures that even when a replacement instance is needed, the same IP address is used for the database server. The instance template and custom image ensure that the replacement instance is configured identically to the instance it is replacing. By taking regular snapshots of the persistent disks, you ensure that when disks are re-created from the snapshots and attached to the replacement instance, the replacement instance is using data recovered according to an RPO value dictated by the frequency of the snapshots. In this architecture, the latest transaction log files that were written to the persistent disk are also automatically restored.

The regional managed instance group provides HA in depth: it reacts to failures at the application, instance, or zone level, and you don't have to intervene manually if any of those scenarios occur. Setting a target size of 1 ensures that you only ever have one instance running.

Standard persistent disks are zonal, so if there's a zonal failure, snapshots are required to re-create disks. Snapshots are also available across regions, which lets you restore a disk to a different region as easily as you can restore it to the same region.

The following diagram shows the recovery scenario, where the persistent disk snapshot has been used to restore the persistent disk in a different zone.

Architectural diagram showing a persistent disk image used to recover into a second zone

Some variations on this configuration include the following:

  • Using regional persistent disks in place of standard persistent disks (see the sketch after this list). In this case, you don't need to restore the snapshot as part of the recovery step.
  • Using a managed instance group rather than a regional managed instance group and attaching the persistent disk to the replacement instance started in the same zone as the instance that failed. In this case, the persistent disk auto-delete setting must be set to no-auto-delete.
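
For the first variation, a regional persistent disk replicated across two zones could be created with a command like the following sketch; the disk name, region, zones, size, and type are assumptions.

# Create a regional persistent disk that is synchronously replicated across two zones.
gcloud compute disks create db-data-disk --region=us-central1 --replica-zones=us-central1-b,us-central1-c --size=500GB --type=pd-standard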

The variation you choose is dictated by your budget and RTO and RPO values.

For more information about database configurations designed for HA and DR scenarios on GCP, see the following:

Recovering from partial corruption in very large databases

If you're using a database that's capable of storing petabytes of data, you might experience an outage that affects some of the data, but not all of it. In that case, you want to minimize the amount of data that you need to restore; you don't need to (or want to) recover the entire database just to restore some of the data.

There are a number of mitigating strategies you can adopt:

  • Store your data in different tables for specific time periods. This method ensures that you need to restore only a subset of data to a new table, rather than a whole dataset.
  • Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.

Additionally, if your RTO permits, you can prevent access to the table that has the corrupted data by leaving your applications offline until the uncorrupted data has been restored to a new table.

Managed database services on GCP

This section discusses some methods you can use to implement appropriate backup and recovery mechanisms for the managed database services on GCP.

Managed databases are designed for scale, so the backup and restore mechanisms you see with traditional RDBMSs are usually not available. As in the case of self-managed databases, if you are using a database that is capable of storing petabytes of data, you want to minimize the amount of data that you need to restore in a DR scenario. There are a number of strategies for each managed database to help you achieve this goal.

Cloud Bigtable provides Cloud Bigtable regional replication. A replicated Cloud Bigtable database can provide higher availability than a single cluster, additional read throughput, and higher durability and resilience in the face of zonal failures.

You can also export tables from Cloud Bigtable as a series of Hadoop sequence files. You can then store these files in Cloud Storage or use them to import the data back into another instance of Cloud Bigtable. You can replicate your Cloud Bigtable dataset asynchronously across zones within a GCP region.

BigQuery. If you want to archive data, you can take advantage of BigQuery's long-term storage. If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent. There is no degradation of performance, durability, availability, or any other functionality when a table is considered long-term storage. If the table is edited, though, it reverts to the regular storage pricing and the 90-day countdown starts again.

BigQuery is replicated, but this won't help with corruption in your tables. Therefore, you need to have a plan to be able to recover from that scenario. For example, you can do the following:

  • If the corruption is caught within 7 days, use snapshot decorators to query the table at a point in time in the past and recover the table to its state before the corruption (see the sketch after this list).
  • Export the data from BigQuery, and create a new table that contains the exported data but excludes the corrupted data.
  • Store your data in different tables for specific time periods. This method ensures that you will need to restore only a subset of data to a new table, rather than a whole dataset.
  • Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.
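
For example, the first approach could use the bq tool with a legacy SQL snapshot decorator, as in the following sketch. The dataset and table names are assumptions, and the relative offset of 3600000 milliseconds copies the table as it existed one hour ago.

# Copy the table as it existed one hour ago into a new table.
bq cp mydataset.mytable@-3600000 mydataset.mytable_restored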

Cloud Datastore. The managed export and import service allows you to import and export Cloud Datastore entities using a Cloud Storage bucket. This in turn allows you to implement a process that you can use to recover from accidental deletion of data.
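
As a sketch, an export and a later import look like the following. The bucket name and export path are placeholders, and the metadata file path comes from the output of the export operation.

# Export all Cloud Datastore entities to a Cloud Storage bucket.
gcloud datastore export gs://[BUCKET_NAME]/datastore-backups

# Import a previous export to recover the entities.
gcloud datastore import gs://[BUCKET_NAME]/datastore-backups/[EXPORT_FOLDER]/[EXPORT_FOLDER].overall_export_metadata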

Cloud SQL. If you use Cloud SQL, the fully managed GCP MySQL database, you should enable automated backups and binary logging for your Cloud SQL instances. This allows you to perform a point-in-time recovery, which restores your database from a backup and recovers it to a fresh Cloud SQL instance. For more details, see Cloud SQL Backups and Recovery.
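
For example, the following command enables daily automated backups and binary logging on an existing instance; the instance name and backup window are placeholders.

# Enable automated backups (starting at 23:00 UTC) and binary logging for point-in-time recovery.
gcloud sql instances patch [INSTANCE_NAME] --backup-start-time=23:00 --enable-bin-log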

You can also configure Cloud SQL in an HA configuration to maximize up time.

Cloud Spanner. You can use Cloud Dataflow templates to make a full export of your database to a set of Avro files in a Cloud Storage bucket, and use another template to re-import the exported files into a new Cloud Spanner database.
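
Running the export template could look like the following sketch. The template location and parameter names are assumptions based on the publicly hosted Dataflow templates; check the current template documentation before relying on them.

# Run the Cloud Spanner to Avro export template (template path and parameters are assumptions).
gcloud dataflow jobs run spanner-export-job --gcs-location=gs://dataflow-templates/latest/Cloud_Spanner_to_GCS_Avro --parameters=instanceId=[INSTANCE_ID],databaseId=[DATABASE_ID],outputDir=gs://[BUCKET_NAME]/spanner-export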

For more controlled backups, the Cloud Dataflow connector allows you to write code to read and write data to Cloud Spanner in a Cloud Dataflow pipeline. For example, you can use the connector to copy data out of Cloud Spanner and into Cloud Storage as the backup target. The speed at which data can be read from Cloud Spanner (or written back to it) depends on the number of configured nodes. This has a direct impact on your RTO values.

The Cloud Spanner commit timestamp feature can be useful for incremental backups, by allowing you to select only the rows that have been added or modified since the last full backup.

For small RTO values, you could set up a warm standby Cloud Spanner instance configured with the minimum number of nodes required to meet your storage and read and write throughput requirements.

Cloud Composer. You can use Cloud Composer (a managed version of Apache Airflow) to schedule regular backups of multiple GCP databases. You can create a directed acyclic graph (DAG) to run on a schedule (for example, daily) to either copy the data to another project, dataset, or table (depending on the solution used), or to export the data to Cloud Storage.

Exporting or copying data can be done using the various Cloud Platform operators.


Production environment is another cloud

In this scenario, your production environment uses another cloud provider, and your disaster recovery plan involves using GCP as the recovery site.

Data backup and recovery

Transferring data between object stores is a common use case for DR scenarios. Storage Transfer Service is compatible with Amazon S3 and is the recommended way to transfer objects from Amazon S3 to Cloud Storage. You can configure a transfer job to schedule periodic synchronization from data source to data sink, with advanced filters based on file creation dates, filename filters, and the times of day you prefer to import data.

Another option for moving data from AWS to GCP is to use boto, which is a Python tool that is compatible with Amazon S3 and Cloud Storage. It can be installed as a plugin to the gsutil command line tool.
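
For example, after you add your AWS credentials to the boto configuration file, a single gsutil command can incrementally synchronize an S3 bucket into a Cloud Storage bucket; the bucket names are placeholders.

# Incrementally copy new and changed objects from Amazon S3 to Cloud Storage.
gsutil -m rsync -r s3://[SOURCE_BUCKET] gs://[DESTINATION_BUCKET]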

Database backup and recovery

It is out of scope for this article to discuss in detail the various built-in backup and recovery mechanisms included with third-party databases, or the backup and recovery techniques used on other cloud providers. If you are operating non-managed databases on the compute services of another cloud provider, you can take advantage of the HA facilities that provider makes available. You can extend those facilities to incorporate an HA deployment to GCP, or use Cloud Storage as the ultimate destination for the cold storage of your database backup files.

What's next?
