Disaster recovery with environment snapshots


This page describes how to use environment snapshots for disaster recovery.

Definitions

This guide uses the following definitions:

  • Disaster is an event where Cloud Composer or other components essential for your environment's operation are unavailable. Such an event requires a failover to a different Cloud Composer environment, possibly in a different region. The cause of a disaster can be natural or human-made, and includes both downtime of Google Cloud regions and outages of your own infrastructure.
  • Disaster Recovery (DR), in the context of Cloud Composer, is a process of restoring the environment's operation after a disaster. The process involves recreating the environment, possibly in another region. For more information about Disaster Recovery, see Disaster recovery planning guide.
  • Primary environment is a Cloud Composer environment that you want to enable a DR capability for.
  • Failover environment is a Cloud Composer environment designated to take over activities from the primary environment.
  • Warm DR scenario is a variant of Disaster Recovery, where you use a standby failover environment, which you create before a disaster occurs.
  • Cold DR scenario is a variant of Disaster Recovery, where you create a failover environment after a disaster occurs.
  • Cross-region DR is a variant of warm or cold Disaster Recovery where the primary and the failover environment are located in different regions.

About the disaster recovery procedure

The disaster recovery procedure addresses the situation where your primary environment has become inoperative (broken or otherwise inaccessible) because of a disaster.

This procedure assumes that your primary environment is not going to be fixed in-place to address the disaster. Instead, you create a second (failover) environment side-by-side. This environment operates instead of the primary environment. At a later stage, you might decide to go back to the primary environment or to keep using the failover environment.

Because the procedure uses a failover environment, some things change when you switch over from the primary environment. Differences between the primary and the failover environment include (the list is not comprehensive):

  • The web server URL is different. This changes the address of the Airflow UI and the Airflow REST API endpoint.

  • The URL of the environment's bucket is different.

  • The network and access permissions configuration might require adjusting.

If you use the warm DR scenario, you know the web server address, the address of the environment's bucket, and the network configuration in advance.

Before you begin

  • Snapshots are supported in Cloud Composer 2.0.7 and later versions.

  • The mechanism for scheduled snapshots is not VPC SC compliant yet.

Preparation overview

Both DR scenarios include the following preparation steps:

  1. Create a failover environment.

    • In the warm DR scenario, you keep this environment available.
    • In the cold DR scenario, you create this environment only to test your disaster recovery procedure. After you complete the preparation, you delete this environment, and create it again after a disaster happens.
  2. Create a bucket for snapshots.

    • The bucket must be available in the DR region. For cross-region DR, the snapshots bucket must be either multiregional or located in a different region than the primary environment.

    • Check that DAGs can access regional resources.

  3. Set up DB maintenance.

  4. Set up scheduled snapshots.

  5. Test your disaster recovery procedure.

Disaster recovery overview

After a disaster happens:

  1. (Cold DR only) Create a failover environment.
  2. If possible, stop the primary environment from executing DAGs.
  3. Load a snapshot from the snapshots bucket to the failover environment.
  4. If required, adjust the configuration of the failover environment.
  5. Decide what to do with the primary environment.

Preparation steps

Do the steps outlined below to set up disaster recovery for your environment.

Create a failover environment

Create an environment that acts as a failover environment.

Use the following guidelines:

  • Your primary and failover environment must use the same version of Cloud Composer and Airflow.

  • In the warm DR scenario, make sure that you update and upgrade both environments in sync. For example, if you upgrade the primary environment to a later Cloud Composer version, or install PyPI packages, then your failover environment must also have these changes.

  • We recommend creating the failover environment in a region different from the primary environment. As a result, a broader range of possible disaster scenarios can be covered, such as a disaster that affects the availability of the whole region.

  • We recommend using Terraform to create the primary and failover environments, so that both have a consistent configuration. Make sure that the Terraform definitions for both environments stay synchronized.

  • We recommend that the configuration of the failover environment (such as environment size, number of schedulers, and IAM permissions) matches the configuration of the primary environment. IAM permissions for both environments must give appropriate access to users and to snapshots.
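One way to check these guidelines is to compare the settings of both environments before you rely on the failover environment. The following Python sketch assumes that you have already collected each environment's settings into a dictionary. The key names and sample values are hypothetical and do not correspond to a Cloud Composer API schema.

```python
from typing import Any, Dict, List

# Hypothetical setting names chosen for illustration; adapt them to however
# you export your environment configuration (for example, from Terraform).
COMPARED_KEYS = ("image_version", "pypi_packages", "environment_size", "scheduler_count")


def config_drift(primary: Dict[str, Any], failover: Dict[str, Any]) -> List[str]:
    """Return one message per compared setting that differs between environments."""
    messages = []
    for key in COMPARED_KEYS:
        if primary.get(key) != failover.get(key):
            messages.append(
                f"{key}: primary={primary.get(key)!r}, failover={failover.get(key)!r}"
            )
    return messages


# Illustrative values only, not taken from a real environment.
primary_config = {
    "image_version": "composer-2.0.7-airflow-2.2.3",
    "pypi_packages": {"requests": ">=2.0"},
    "environment_size": "small",
    "scheduler_count": 2,
}
# The failover environment is missing the PyPI package in this example.
failover_config = dict(primary_config, pypi_packages={})

for message in config_drift(primary_config, failover_config):
    print(message)
```

An empty result means that the compared settings match. It does not prove that the environments are equivalent in settings that the sketch does not compare, such as IAM permissions.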

Check the resource availability

DAGs can operate on external resources, and access to those resources might be dependent on the environment's configuration (such as permissions granted to the environment's service account, network configuration, or project). Make sure that those resources are available to the failover environment.

An environment might interact with some external resources through connections stored in Airflow. Check if these resources should be adjusted in the failover environment compared to the primary environment.

Create a storage bucket for snapshots

Create a new storage bucket for environment snapshots. Do not use environment buckets for disaster recovery, since configuration for the retention policy and lifecycle is applied on the bucket level.

Make sure that this storage bucket has IAM permissions, a retention policy, and a lifecycle configuration set in a way that prevents accidental deletion or unauthorized access. For more information about configuring a bucket for snapshots, see Configuring scheduled snapshots.

You can:

  • Create a bucket in a different region.
  • Create a multi-regional bucket.
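The location rule for cross-region DR can be expressed as a small check. This Python sketch assumes that you already know the bucket's location and the primary environment's region as strings. The function name is hypothetical, and the set of multi-region names reflects the Cloud Storage multi-regions US, EU, and ASIA.

```python
# Cloud Storage multi-region location names, compared case-insensitively.
MULTI_REGIONS = {"us", "eu", "asia"}


def bucket_suits_cross_region_dr(bucket_location: str, primary_region: str) -> bool:
    """Return True if the bucket is multi-regional or outside the primary region."""
    location = bucket_location.strip().lower()
    if location in MULTI_REGIONS:
        return True
    return location != primary_region.strip().lower()


# Example: the primary environment runs in us-central1.
for candidate in ("US", "europe-west1", "US-CENTRAL1"):
    print(candidate, bucket_suits_cross_region_dr(candidate, "us-central1"))
```

The check covers only the location requirement. The bucket's IAM permissions, retention policy, and lifecycle configuration still need to be reviewed separately.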

Set up DB maintenance

Keep the Airflow metadata database small by running the database maintenance DAG. Doing so makes saving and loading snapshots quicker. The Airflow metadata database must contain less than 20 GB of data to support snapshots.

Set up scheduled snapshots

Set up scheduled snapshots for the primary environment.

Snapshots can be created only in a healthy environment, so snapshots must be saved before the disaster occurs.

For more information about how snapshots work, see Save and load environment snapshots. See the Save an environment snapshot section of the documentation for information about where to find the saved snapshots.
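As a rough illustration, the following Python sketch builds (but does not run) a gcloud command that enables scheduled snapshots. The flag names are written as I understand the gcloud reference and are an assumption to verify with gcloud composer environments update --help. The environment name, bucket path, and time zone value are hypothetical, and the accepted time zone format should also be confirmed.

```python
import shlex
from typing import List


def scheduled_snapshots_command(
    environment: str,
    location: str,
    snapshot_location: str,
    schedule: str,
    timezone: str,
) -> List[str]:
    """Build the argument list for enabling scheduled snapshots.

    Flag names are assumptions; check them against the gcloud reference.
    """
    if not snapshot_location.startswith("gs://"):
        raise ValueError("snapshot_location must be a gs:// URI")
    return [
        "gcloud", "composer", "environments", "update", environment,
        "--location", location,
        "--enable-scheduled-snapshot-creation",
        "--snapshot-location", snapshot_location,
        "--snapshot-creation-schedule", schedule,
        "--snapshot-schedule-timezone", timezone,
    ]


# Hypothetical names; the cron expression runs at minute 0 of every second hour.
command = scheduled_snapshots_command(
    "example-primary",
    "us-central1",
    "gs://example-snapshots-bucket/snapshots",
    "0 */2 * * *",
    "UTC",
)
print(shlex.join(command))
```

Building the command as a list keeps the cron expression as a single argument, so it is not split or expanded by the shell if you later pass the list to subprocess.run.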

Test your disaster recovery procedure

Make sure to test your disaster recovery procedure after you set it up and then periodically afterwards. This lets you address potential problems that might impact the actual disaster recovery process.

In the cold DR scenario, you can delete the failover environment after you finish testing the disaster recovery procedure.

After a disaster happens

Do the steps outlined below after a disaster happens to restore your environment's operation.

(Cold DR only) Create a failover environment

Follow the instructions in the Create a failover environment section.

Stop the primary environment from executing DAGs

If possible, stop the primary environment from executing DAGs:

  • If the primary environment is still accessible, pause all DAGs.
  • If the primary environment's bucket is accessible, move all DAGs out of the environment's bucket, or to a folder outside /dags in the primary environment's bucket.
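If the primary environment is still reachable, pausing can be scripted. The following Python sketch builds one gcloud composer environments run command per DAG, assuming the Airflow 2 CLI syntax (dags pause). The environment name and DAG IDs are hypothetical, and the sketch only prints the commands instead of running them.

```python
import shlex
from typing import Iterable, List


def pause_dag_commands(
    environment: str, location: str, dag_ids: Iterable[str]
) -> List[List[str]]:
    """Build one 'dags pause' command per DAG ID (Airflow 2 CLI syntax assumed)."""
    return [
        [
            "gcloud", "composer", "environments", "run", environment,
            "--location", location,
            "dags", "pause", "--", dag_id,
        ]
        for dag_id in dag_ids
    ]


# Hypothetical environment name and DAG IDs.
commands = pause_dag_commands(
    "example-primary", "us-central1", ["daily_report", "hourly_sync"]
)
for command in commands:
    print(shlex.join(command))
```

You would need to supply the list of DAG IDs yourself, for example from your DAG repository, because the sketch does not query the environment.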

Load a snapshot to the failover environment

Load a snapshot from the primary environment into the failover environment.
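A minimal sketch of this step in Python, assuming the gcloud composer environments snapshots load command with its --snapshot-path flag. The environment name and snapshot path are hypothetical placeholders. The sketch builds and prints the command so that you can review it before running it.

```python
import shlex
from typing import List


def load_snapshot_command(
    environment: str, location: str, snapshot_path: str
) -> List[str]:
    """Build the command that loads a saved snapshot into an environment."""
    if not snapshot_path.startswith("gs://"):
        raise ValueError("snapshot_path must be a gs:// URI of a saved snapshot")
    return [
        "gcloud", "composer", "environments", "snapshots", "load", environment,
        "--location", location,
        "--snapshot-path", snapshot_path,
    ]


# Hypothetical failover environment and snapshot folder.
load_command = load_snapshot_command(
    "example-failover",
    "europe-west1",
    "gs://example-snapshots-bucket/snapshots/example-snapshot-folder",
)
print(shlex.join(load_command))
```

The location passed here is the region of the failover environment, not the region of the primary environment that produced the snapshot.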

Once the snapshot is loaded, the failover environment schedules and executes tasks as if the primary environment had executed nothing after the snapshot was created. However, some of those tasks might already have been executed by the primary environment. The failover environment has no means to recognize which tasks were executed after the snapshot was created and before the disaster. As a result, some tasks might be executed twice (in both the primary and the failover environment). We recommend that all tasks are idempotent and that scheduled snapshots are created every two hours.
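To make the duplicate-execution window concrete, the following Python sketch lists tasks that finished after the last snapshot and before the disaster. It assumes that you can obtain completion times from a source outside the failover environment, such as logs of the systems your DAGs write to. The task names and timestamps are invented for illustration.

```python
from datetime import datetime
from typing import Dict, List


def possibly_repeated_tasks(
    finished_at: Dict[str, datetime],
    snapshot_time: datetime,
    disaster_time: datetime,
) -> List[str]:
    """Return tasks that finished after the snapshot and up to the disaster.

    The snapshot does not record these runs, so the failover environment
    might execute them again.
    """
    return sorted(
        task
        for task, finished in finished_at.items()
        if snapshot_time < finished <= disaster_time
    )


# Invented example: last snapshot at 10:00, disaster at 11:30.
snapshot_time = datetime(2024, 1, 1, 10, 0)
disaster_time = datetime(2024, 1, 1, 11, 30)
finished_at = {
    "extract": datetime(2024, 1, 1, 9, 45),     # captured in the snapshot
    "transform": datetime(2024, 1, 1, 10, 20),  # not in the snapshot
    "load": datetime(2024, 1, 1, 11, 5),        # not in the snapshot
}
repeated = possibly_repeated_tasks(finished_at, snapshot_time, disaster_time)
print(repeated)
```

With snapshots taken every two hours, this window can span up to roughly two hours of work, which is why idempotent tasks matter.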

(If required) Adjust the configuration of the failover environment

In some cases, you might want to change the configuration of the failover environment after you load the snapshot of the primary environment into it.

For example, in a cold DR scenario, you might need to use a different set of Airflow environment variables in the failover environment. As another example, in a warm DR scenario, you might need to grant permissions to users in the Airflow UI, so that they can access the failover environment.

You can either make these changes manually, or prepare a shell script that changes the configuration of the failover environment by running gcloud composer environments update commands.
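As one possible shape for such a script, the following Python sketch builds a gcloud composer environments update command that sets environment variables through the --update-env-variables flag. The environment name and variable names are hypothetical. The sketch prints the command rather than running it.

```python
import shlex
from typing import Dict, List


def update_env_variables_command(
    environment: str, location: str, variables: Dict[str, str]
) -> List[str]:
    """Build an update command that sets the given environment variables."""
    if not variables:
        raise ValueError("at least one variable is required")
    for name, value in variables.items():
        # gcloud parses this flag as a comma-separated list of NAME=VALUE pairs,
        # so this simple sketch rejects values that would be ambiguous.
        if "," in name or "=" in name or "," in value:
            raise ValueError(f"unsupported characters in {name!r}")
    pairs = ",".join(f"{name}={value}" for name, value in sorted(variables.items()))
    return [
        "gcloud", "composer", "environments", "update", environment,
        "--location", location,
        "--update-env-variables", pairs,
    ]


# Hypothetical variables that differ in the failover region.
update_command = update_env_variables_command(
    "example-failover",
    "europe-west1",
    {"REPORT_BUCKET": "example-reports-eu", "DEPLOY_REGION": "europe-west1"},
)
print(shlex.join(update_command))
```

Keeping the variables in a dictionary lets you store the failover-specific values alongside your Terraform definitions and review them before a disaster happens.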

Decide what to do with the primary environment

In some disasters, the primary environment is unreachable but still operative, or it is reachable but does not operate properly. For example, you cannot access the primary environment through the network because of an infrastructure failure. As another example, the environment operates with errors or with reduced capacity, but some DAGs are still executed.

If the primary environment is still running, it might generate costs directly related to Cloud Composer or to other services accessed through the DAGs, even though a failover environment has replaced it. The primary environment can also still execute some DAGs. As a result, some operations might be executed twice: in the primary environment that is still running and in the failover environment after the snapshot is loaded.

If the primary environment exists, but does not operate properly

You can delete the primary environment if all relevant data has been recovered. For example, you might want to recover data that is not included in environment snapshots, such as the networking configuration or the contents of the environment's bucket outside the /dags and /plugins folders.

If the primary environment becomes accessible and healthy again

If the primary environment was inaccessible only temporarily and becomes accessible and healthy again, you can choose one of the following approaches:

  • Keep using the failover environment.
  • Return to the primary environment.

To keep using the failover environment:

  1. If the primary environment still executes DAGs, pause them as soon as possible.
  2. Make sure that all relevant data is recovered, then delete the primary environment.
  3. Repeat the DR preparation steps for the failover environment, such as setting up scheduled snapshots.

To return to the primary environment:

  1. Pause all DAGs in the failover environment.
  2. Wait for all DAG runs to complete in the failover environment, or stop them.
  3. Save a snapshot of the failover environment.
  4. Load this snapshot to the primary environment.
  5. Unpause the DAGs in the primary environment.
  6. If required, delete the failover environment.

What's next