Managed disaster recovery

This document provides an overview of BigQuery managed disaster recovery and how to implement it for your data and workloads.

Overview

BigQuery supports disaster recovery scenarios in the case of a total region outage. BigQuery disaster recovery relies on cross-region dataset replication to manage storage failover. After creating a dataset replica in a secondary region, you can control failover behavior for compute and storage to maintain business continuity during an outage. After a failover, you can access compute capacity (slots) and replicated datasets in the promoted region. Disaster recovery is only supported with the Enterprise Plus edition.

Managed disaster recovery performs a hard failover when a failover is initiated. In a hard failover, the reservation and dataset replicas in the secondary region are immediately promoted to primary, even if the previous primary region is down, without waiting for any unreplicated data to be replicated. Because of this, data loss can occur during a hard failover. Any jobs that committed data in the source region after the replica's value of replication_time might need to be rerun in the destination region after the failover.
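
To see where that replication cutoff currently sits, you can query the replica's replication_time from the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view. The following is a minimal sketch; the dataset name my_dataset and the region-us qualifier are hypothetical placeholders:

-- Show the replication cutoff for each replica of the dataset.
-- Jobs that committed data in the primary region after
-- replication_time might need to be rerun after a hard failover.
SELECT
  schema_name,
  replica_name,
  replication_time
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE
  schema_name = 'my_dataset';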

To enable disaster recovery, you must create an Enterprise Plus edition reservation in the primary region, which is the region the dataset is in before a failover. Standby compute capacity in the paired secondary region is included in the Enterprise Plus reservation. You then attach a dataset to this reservation to enable failover for that dataset. You can only attach a dataset to a reservation if the dataset is backfilled and has the same paired primary and secondary locations as the reservation. After a dataset is attached to a failover reservation, only Enterprise Plus reservations can write to that dataset, and you can't perform a cross-region replication promotion on it. You can read from datasets attached to a failover reservation with any capacity model. For more information about reservations, see Introduction to workload management.

The compute capacity of your primary region is available in the secondary region promptly after a failover. This applies to your entire reservation baseline, whether it is used or not.

You must actively choose to fail over, either as part of testing or in response to a real disaster, and you shouldn't fail over more than once in a 10-minute window. In data replication scenarios, backfill refers to the process of populating a replica with the historical data that existed before the replica was created or became active. A dataset must complete its backfill before you can fail it over.

The following diagram shows the architecture of managed disaster recovery:

Managed disaster recovery architecture.

Limitations

The following limitations apply to BigQuery disaster recovery:

  • BigQuery disaster recovery is subject to the same limitations as cross-region dataset replication.

  • Autoscaling after a failover depends on compute capacity availability in the secondary region. Only the reservation baseline is available in the secondary region.

  • The INFORMATION_SCHEMA.RESERVATIONS view doesn't have failover details.

  • If you have multiple failover reservations in the same administration project whose attached datasets use different secondary locations, don't use one failover reservation to run jobs against datasets attached to a different failover reservation.

  • If you want to convert an existing reservation to a failover reservation, the existing reservation can't have more than 1,000 assignments. One way to check the assignment count is shown in the sketch after this list.
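
The following is a minimal sketch that counts assignments per reservation by using the INFORMATION_SCHEMA.ASSIGNMENTS view; the administration project my-admin-project and the location us-central1 are hypothetical placeholders:

-- Count assignments per reservation in the administration project.
-- A reservation with more than 1,000 assignments can't be converted
-- to a failover reservation.
SELECT
  reservation_name,
  COUNT(*) AS assignment_count
FROM
  `my-admin-project.region-us-central1`.INFORMATION_SCHEMA.ASSIGNMENTS
GROUP BY
  reservation_name;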

Locations

The following regions are available when creating a failover reservation:

Location code   Region name               Region description
ASIA
                asia-east1                Taiwan
                asia-southeast1           Singapore
AU
                australia-southeast1      Sydney
                australia-southeast2      Melbourne
CA
                northamerica-northeast1   Montréal
                northamerica-northeast2   Toronto
EU
                EU                        EU multi-region
                europe-central2           Warsaw
                europe-north1             Finland
                europe-southwest1         Madrid
                europe-west1              Belgium
                europe-west3              Frankfurt
                europe-west4              Netherlands
                europe-west8              Milan
                europe-west9              Paris
IN
                asia-south1               Mumbai
                asia-south2               Delhi
US
                US                        US multi-region
                us-central1               Iowa
                us-east1                  South Carolina
                us-east4                  Northern Virginia
                us-east5                  Columbus
                us-south1                 Dallas
                us-west1                  Oregon
                us-west2                  Los Angeles
                us-west3                  Salt Lake City
                us-west4                  Las Vegas

Region pairs must be selected within the same location code: ASIA, AU, CA, EU, IN, or US. For example, a region within the US can't be paired with a region within the EU.

If your BigQuery dataset is in a multi-region location, you can't use the following region pairs. This limitation ensures that your failover reservation and data are geographically separated after replication. For more information about regions that are contained within multi-regions, see Multi-regions.

  • us-central1 - US multi-region
  • us-west1 - US multi-region
  • europe-west1 - EU multi-region
  • europe-west4 - EU multi-region

Before you begin

  1. Ensure you have the bigquery.reservations.update Identity and Access Management (IAM) permission to update reservations.
  2. Ensure that you have existing datasets that are configured for replication. For more information, see Replicate a dataset.

Turbo replication

Disaster recovery uses turbo replication for faster data replication across regions, which reduces the risk of data loss, minimizes service downtime, and helps support uninterrupted service following a regional outage.

Turbo replication doesn't apply to the initial backfill operation. After the initial backfill is complete, turbo replication aims to replicate data to the secondary replica of a failover region pair within 15 minutes, as long as the bandwidth quota isn't exceeded and there are no user errors.

Recovery time objective

A recovery time objective (RTO) is the target time allowed for recovery in BigQuery in the event of a disaster. For more information about RTO, see Basics of DR planning. Managed disaster recovery has a five-minute RTO after you initiate a failover: compute capacity is available in the secondary region within five minutes of starting the failover process.

Recovery point objective

A recovery point objective (RPO) is the most recent point in time from which data must be recoverable. For more information about RPO, see Basics of DR planning. Managed disaster recovery has an RPO that is defined per dataset. The RPO aims to keep the secondary replica within 15 minutes of the primary. To meet this RPO, the bandwidth quota must not be exceeded and there must be no user errors.
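
You can check a replica's current staleness against this objective by comparing replication_time in the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view to the current time. The following is a minimal sketch; my_dataset and the region-us qualifier are hypothetical placeholders:

-- Report replica staleness in minutes; values consistently above 15
-- suggest that replication is falling behind the RPO target.
SELECT
  schema_name,
  replica_name,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), replication_time, MINUTE)
    AS staleness_minutes
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE
  schema_name = 'my_dataset';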

Quota

You must have quota available for your chosen compute capacity in the secondary region before configuring a failover reservation. If quota isn't available in the secondary region, you can't configure the reservation. For more information, see Quotas and limits.

Turbo replication bandwidth is subject to a quota. For more information, see Quotas and limits.

Pricing

Managed disaster recovery requires the Enterprise Plus edition. You pay only for compute capacity in the primary region; secondary compute capacity (based on the reservation baseline) is available in the secondary region at no additional cost. Idle slots can't use the secondary compute capacity unless the reservation has failed over.

If you need to perform stale reads in the secondary region, you must purchase additional compute capacity.

Create or alter an Enterprise Plus reservation

Before attaching a dataset to a reservation, you must create an Enterprise Plus reservation or alter an existing reservation and configure it for disaster recovery.

Create a reservation

Select one of the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Capacity management, and then click Create reservation.

  3. In the Reservation name field, enter a name for the reservation.

  4. In the Location list, select the location.

  5. In the Edition list, select the Enterprise Plus edition.

  6. In the Max reservation size selector list, select the maximum reservation size.

  7. Optional: In the Baseline slots field, enter the number of baseline slots for the reservation.

    The number of available autoscaling slots is determined by subtracting the Baseline slots value from the Max reservation size value. For example, if you create a reservation with 100 baseline slots and a max reservation size of 400, your reservation has 300 autoscaling slots. For more information about baseline slots, see Using reservations with baseline and autoscaling slots.

  8. In the Secondary location list, select the secondary location.

  9. To disable idle slot sharing and use only the specified slot capacity, click the Ignore idle slots toggle.

  10. To expand the Advanced settings section, click the expander arrow.

  11. Optional: To set the target job concurrency, click the Override automatic target job concurrency toggle to on, and then enter a value for Target Job Concurrency. The breakdown of slots is displayed in the Cost estimate table. A summary of the reservation is displayed in the Capacity summary table.

  12. Click Save.

The new reservation is visible in the Slot reservations tab.

SQL

To create a reservation, use the CREATE RESERVATION data definition language (DDL) statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    OPTIONS (
      slot_capacity = NUMBER_OF_BASELINE_SLOTS,
      edition = 'ENTERPRISE_PLUS',
      secondary_location = 'SECONDARY_LOCATION');

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation. If you select a BigQuery Omni location, your edition option is limited to the Enterprise edition.
    • RESERVATION_NAME: the name of the reservation.

      The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

    • NUMBER_OF_BASELINE_SLOTS: the number of baseline slots to allocate to the reservation.
    • SECONDARY_LOCATION: the secondary location of the reservation. In the case of an outage, any datasets attached to this reservation fail over to this location when you initiate a failover.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
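
For example, the following minimal sketch creates a failover reservation with 100 baseline slots in us-central1 and us-east4 as the secondary location; the administration project my-admin-project and the reservation name dr-reservation are hypothetical placeholders:

-- Hypothetical project, reservation name, and location pair.
CREATE RESERVATION
  `my-admin-project.region-us-central1.dr-reservation`
OPTIONS (
  slot_capacity = 100,
  edition = 'ENTERPRISE_PLUS',
  secondary_location = 'us-east4');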

Alter an existing reservation

Select one of the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation panel, go to the Capacity management section.

  3. Click the Slot reservations tab.

  4. Find the reservation you want to update.

  5. Expand the Actions option.

  6. Click Edit.

  7. In the Secondary location field, enter the secondary location.

  8. Click Save.

SQL

To add or change the secondary location of a reservation, use the ALTER RESERVATION SET OPTIONS DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    SET OPTIONS (
      secondary_location = 'SECONDARY_LOCATION');

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation, for example europe-west9.
    • RESERVATION_NAME: the name of the reservation. The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

    • SECONDARY_LOCATION: the secondary location of the reservation. In the case of an outage, any datasets attached to this reservation fail over to this location when you initiate a failover.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
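
For example, the following minimal sketch sets us-east4 as the secondary location of the hypothetical dr-reservation reservation from the earlier sketch:

-- Hypothetical project, reservation name, and secondary location.
ALTER RESERVATION
  `my-admin-project.region-us-central1.dr-reservation`
SET OPTIONS (
  secondary_location = 'us-east4');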

Attach a dataset to a reservation

To enable disaster recovery for the previously created reservation, complete the following steps. The dataset must already be configured for replication in the same primary and secondary regions as the reservation. For more information, see Cross-region dataset replication.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Capacity management, and then click the Slot reservations tab.

  3. Click the reservation you want to attach a dataset to.

  4. Click the Disaster recovery tab.

  5. Click Add failover dataset.

  6. Enter the name of the dataset you want to associate with the reservation.

  7. Click Add.

SQL

To attach a dataset to a reservation, use the ALTER SCHEMA SET OPTIONS DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER SCHEMA
      `DATASET_NAME`
    SET OPTIONS (
      failover_reservation = 'ADMIN_PROJECT_ID.RESERVATION_NAME');

    Replace the following:

    • DATASET_NAME: the name of the dataset.

    • ADMIN_PROJECT_ID.RESERVATION_NAME: the name of the reservation that you want to associate the dataset with.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
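
For example, the following minimal sketch attaches the hypothetical dataset my_dataset to the hypothetical dr-reservation reservation; the dataset must already replicate between the same primary and secondary regions as the reservation:

-- Hypothetical dataset and reservation names.
ALTER SCHEMA
  `my_dataset`
SET OPTIONS (
  failover_reservation = 'my-admin-project.dr-reservation');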

Detach a dataset from a reservation

To stop managing the failover behavior of a dataset through a reservation, detach the dataset from the reservation. Detaching doesn't change the dataset's current primary replica, nor does it remove any existing dataset replicas. For more information about removing dataset replicas after detaching a dataset, see Remove dataset replica.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Capacity management, and then click the Slot reservations tab.

  3. Click the reservation you want to detach a dataset from.

  4. Click the Disaster recovery tab.

  5. Expand the Actions option for the primary replica of the dataset.

  6. Click Remove.

SQL

To detach a dataset from a reservation, use the ALTER SCHEMA SET OPTIONS DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER SCHEMA
      `DATASET_NAME`
    SET OPTIONS (
      failover_reservation = NULL);

    Replace the following:

    • DATASET_NAME: the name of the dataset.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
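
If you also want to remove the secondary replica after detaching the dataset, you can drop it with the ALTER SCHEMA DROP REPLICA DDL statement from cross-region dataset replication. The following is a minimal sketch; the dataset my_dataset and its us-east4 replica are hypothetical placeholders:

-- Remove the secondary replica after detaching the dataset.
ALTER SCHEMA `my_dataset`
DROP REPLICA `us-east4`;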

Initiate a failover

In the event of a regional outage, you must manually fail over your reservation to the location used by the replica. Failing over the reservation also fails over any attached datasets. To manually fail over a reservation, do the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Disaster recovery.

  3. Click the name of the reservation that you want to fail over.

  4. Click Failover.

SQL

To initiate a failover, use the ALTER RESERVATION SET OPTIONS DDL statement and set is_primary to TRUE.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    SET OPTIONS (
      is_primary = TRUE);

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation, for example europe-west9.
    • RESERVATION_NAME: the name of the reservation. The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

Monitoring

To determine the state of your replicas, query the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view. For example:

SELECT
  schema_name,
  replica_name,
  creation_complete,
  replica_primary_assigned,
  replica_primary_assignment_complete
FROM
  `region-LOCATION`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE
  schema_name="my_dataset"

The following query returns the jobs from the last seven days that would fail if their datasets were failover datasets:

WITH
  non_epe_reservations AS (
    SELECT project_id, reservation_name
    FROM `PROJECT_ID.region-LOCATION`.INFORMATION_SCHEMA.RESERVATIONS
    WHERE edition != 'ENTERPRISE_PLUS'
  )
SELECT *
FROM
  (
    SELECT
      job_id,
      reservation_id,
      referenced_table.project_id AS project_id,
      referenced_table.dataset_id AS dataset_id
    FROM
      (
        SELECT
          job_id,
          reservation_id,
          ARRAY_CONCAT(referenced_tables, [destination_table]) AS all_referenced_tables,
          query
        FROM
          `PROJECT_ID.region-LOCATION`.INFORMATION_SCHEMA.JOBS
        WHERE
          creation_time
          BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
          AND CURRENT_TIMESTAMP()
      ) A,
      UNNEST(all_referenced_tables) AS referenced_table
  ) jobs
LEFT OUTER JOIN non_epe_reservations
  ON (
    jobs.reservation_id = CONCAT(
      non_epe_reservations.project_id, ':', 'LOCATION', '.', non_epe_reservations.reservation_name))
WHERE
  CONCAT(jobs.project_id, ':', jobs.dataset_id)
  IN UNNEST(
    [
      'PROJECT_ID:DATASET_ID',
      'PROJECT_ID:DATASET_ID']);

Replace the following:

  • PROJECT_ID: the project ID.
  • DATASET_ID: the dataset ID.
  • LOCATION: the location.

What's next