Managed disaster recovery

This document provides an overview of BigQuery managed disaster recovery and how to implement it for your data and workloads.

Overview

BigQuery supports disaster recovery in the event of a total region outage. BigQuery disaster recovery relies on cross-region dataset replication to manage storage failover. After you create a dataset replica in a secondary region, you can control failover behavior for compute and storage to maintain business continuity during an outage. After a failover, you can access compute capacity (slots) and replicated datasets in the promoted region. Disaster recovery is supported only with the Enterprise Plus edition.

Managed disaster recovery performs a hard failover when a failover is initiated. In a hard failover, the reservation and dataset replicas in the secondary region are immediately promoted to primary, even if the previous primary region is down, without waiting for any unreplicated data to be replicated. Because of this, data loss can occur during a hard failover. Any jobs that committed data in the source region after the replica's replication_time value might need to be rerun in the destination region after the failover.
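
To estimate possible data loss before initiating a hard failover, you can check how far a replica lags behind its primary by querying the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view. The following query is a sketch; the `region-us` qualifier is illustrative and should match your own region:

```sql
-- Data committed after replication_time has not yet reached the
-- replica and would be lost in a hard failover.
SELECT
  schema_name,
  replica_name,
  replication_time,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), replication_time, SECOND) AS staleness_seconds
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS;
```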

To enable disaster recovery, you must create an Enterprise Plus edition reservation in the primary region, which is the region the dataset is in before failover. Standby compute capacity in the paired region is included in the Enterprise Plus reservation. You then attach a dataset to this reservation to enable failover for that dataset. You can only attach a dataset to a reservation if the dataset is backfilled and has the same paired primary and secondary locations as the reservation. After a dataset is attached to a failover reservation, only Enterprise Plus reservations can read from or write to that dataset, and you can't perform a cross-region replication promotion on the dataset. For more information about reservations, see Introduction to workload management.

The compute capacity of your primary region becomes available in the secondary region shortly after a failover. This applies to your full reservation baseline, whether it is in use or not.

You must actively choose to fail over, either as part of testing or in response to a real disaster, and you shouldn't fail over more than once in a 10-minute window. In data replication, backfill refers to populating a replica with the historical data that existed before the replica was created or became active. A dataset must complete its backfill before you can fail over to it.
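
You can check whether a replica has finished its backfill by inspecting the creation_complete column of the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view. A minimal sketch, assuming replicas in the `region-us` region group:

```sql
-- Replicas that are still backfilling; you can't fail over to a
-- dataset until creation_complete is TRUE for its replica.
SELECT
  schema_name,
  replica_name,
  creation_complete
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE
  NOT creation_complete;
```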

The following diagram shows the architecture of managed disaster recovery:

Managed disaster recovery architecture.

Limitations

The following limitations apply to BigQuery disaster recovery:

  • BigQuery disaster recovery is subject to the same limitations as cross-region dataset replication.

  • Region support is based on dual-region buckets. You can't configure the following region pairs:

    • us-central1 - us multi-region
    • us-west1 - us multi-region
    • europe-west1 - eu multi-region
    • europe-west4 - eu multi-region
  • Autoscaling after a failover depends on compute capacity availability in the secondary region. Only the reservation baseline is available in the secondary region.

  • The INFORMATION_SCHEMA.RESERVATIONS view doesn't have failover details.

  • If you have multiple failover reservations in the same administration project whose attached datasets use different secondary locations, don't use one failover reservation with datasets that are attached to a different failover reservation.

  • If you want to convert existing reservations to a failover reservation, the existing reservations can't have more than 1,000 assignments.

Before you begin

  1. Ensure you have the bigquery.reservations.update Identity and Access Management (IAM) permission to update reservations.
  2. Ensure that you have existing datasets that are configured for replication. For more information, see Replicate a dataset.

Quota

You must have your chosen compute capacity available in the secondary region before you configure a failover reservation. If quota isn't available in the secondary region, you can't configure the reservation. For more information, see Quotas and limits.

Pricing

Configuring managed disaster recovery requires the following pricing plans:

  • Compute capacity: You must purchase the Enterprise Plus edition.

  • Turbo replication: Disaster recovery uses turbo replication. You are charged per physical GB replicated. For more information, see Cloud Storage pricing.

  • Storage: Storage bytes in the secondary region are billed at the same price as storage bytes in the primary region. For more information, see Storage pricing.

You pay for compute capacity only in the primary region. Secondary compute capacity (based on the reservation baseline) is available in the secondary region at no additional cost. Idle slots can't use the secondary compute capacity unless the reservation has failed over.

If you need to perform stale reads in the secondary region, you can purchase additional Enterprise Plus compute capacity in the secondary region.

Create or alter an Enterprise Plus reservation

Before attaching a dataset to a reservation, you must create an Enterprise Plus reservation or alter an existing reservation and configure it for disaster recovery.

Create a reservation

Select one of the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Capacity management, and then click Create reservation.

  3. In the Reservation name field, enter a name for the reservation.

  4. In the Location list, select the location.

  5. In the Edition list, select the Enterprise Plus edition.

  6. In the Max reservation size selector list, select the maximum reservation size.

  7. Optional: In the Baseline slots field, enter the number of baseline slots for the reservation.

    The number of available autoscaling slots is determined by subtracting the Baseline slots value from the Max reservation size value. For example, if you create a reservation with 100 baseline slots and a max reservation size of 400, your reservation has 300 autoscaling slots. For more information about baseline slots, see Using reservations with baseline and autoscaling slots.

  8. In the Secondary location list, select the secondary location.

  9. To disable idle slot sharing and use only the specified slot capacity, click the Ignore idle slots toggle.

  10. To expand the Advanced settings section, click the expander arrow.

  11. Optional: To set the target job concurrency, click the Override automatic target job concurrency toggle to on, and then enter a value for Target Job Concurrency. The breakdown of slots is displayed in the Cost estimate table. A summary of the reservation is displayed in the Capacity summary table.

  12. Click Save.

The new reservation is visible in the Slot reservations tab.

SQL

To create a reservation, use the CREATE RESERVATION data definition language (DDL) statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    OPTIONS (
      slot_capacity = NUMBER_OF_SLOTS,
      edition = 'ENTERPRISE_PLUS',
      secondary_location = 'SECONDARY_LOCATION');
    

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation. If you select a BigQuery Omni location, your edition option is limited to the Enterprise edition.
    • RESERVATION_NAME: the name of the reservation.

      The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

    • NUMBER_OF_SLOTS: the number of slots to allocate to the reservation.
    • SECONDARY_LOCATION: the secondary location of the reservation. In the case of an outage, any datasets attached to this reservation will fail over to this location.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
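
For example, the following statement creates a hypothetical 100-slot Enterprise Plus reservation with a secondary location. The project ID, reservation name, and region pair are all illustrative; the secondary location must form a supported pair with the primary:

```sql
-- Hypothetical example: all names and regions are illustrative.
CREATE RESERVATION
  `my-admin-project.region-us-central1.dr-reservation`
OPTIONS (
  slot_capacity = 100,
  edition = 'ENTERPRISE_PLUS',
  secondary_location = 'us-east1');
```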

Alter an existing reservation

Select one of the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation panel, go to the Capacity management section.

  3. Click the Slot reservations tab.

  4. Find the reservation you want to update.

  5. Expand the Actions option.

  6. Click Edit.

  7. In the Secondary location field, enter the secondary location.

  8. Click Save.

SQL

To add or change a reservation's secondary location, use the ALTER RESERVATION SET OPTIONS DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    SET OPTIONS (
      secondary_location = 'SECONDARY_LOCATION');
    

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation, for example europe-west9.
    • RESERVATION_NAME: the name of the reservation. The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

    • SECONDARY_LOCATION: the secondary location of the reservation. In the case of an outage, any datasets attached to this reservation will fail over to this location.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
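
For example, the following statement adds a secondary location to a hypothetical existing reservation (all names and regions are illustrative):

```sql
-- Hypothetical example: all names and regions are illustrative.
ALTER RESERVATION
  `my-admin-project.region-us-central1.dr-reservation`
SET OPTIONS (
  secondary_location = 'us-east1');
```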

Attach a dataset to a reservation

To enable disaster recovery for the previously created reservation, complete the following steps. The dataset must already be configured for replication in the same primary and secondary regions as the reservation. For more information, see Cross-region dataset replication.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Capacity management, then click the Slot Reservations tab.

  3. Click the reservation you want to attach a dataset to.

  4. Click the Disaster recovery tab.

  5. Click Add associated dataset.

  6. Enter the name of the dataset you want to associate with the reservation.

  7. Click Add.

SQL

To attach a dataset to a reservation, use the ALTER SCHEMA SET OPTIONS DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER SCHEMA
      `DATASET_NAME`
    SET OPTIONS (
      failover_reservation = 'RESERVATION_NAME');
    

    Replace the following:

    • DATASET_NAME: the name of the dataset.

    • RESERVATION_NAME: the name of the reservation you want to associate the dataset to.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.
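
For example, the following statement attaches a hypothetical dataset to a hypothetical failover reservation (both names are illustrative). The dataset must already replicate between the same region pair as the reservation:

```sql
-- Hypothetical example: dataset and reservation names are illustrative.
ALTER SCHEMA
  `my-project.sales_dataset`
SET OPTIONS (
  failover_reservation = 'dr-reservation');
```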

Initiate a failover

In the event of a regional outage, you must manually fail over your reservation to the location used by the replica. Failing over the reservation also fails over any associated datasets. To manually fail over a reservation, do the following:

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the navigation menu, click Disaster recovery.

  3. Click the name of the reservation you want to fail over.

  4. Click Failover.

SQL

To initiate a failover, use the ALTER RESERVATION SET OPTIONS DDL statement and set is_primary to TRUE.

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    ALTER RESERVATION
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME`
    SET OPTIONS (
      is_primary = TRUE);
    

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource.
    • LOCATION: the location of the reservation, for example europe-west9.
    • RESERVATION_NAME: the name of the reservation. The name must start and end with a lowercase letter or a number and contain only lowercase letters, numbers, and dashes.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

Monitoring

To determine the state of your replicas, query the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view. For example:

SELECT
  schema_name,
  replica_name,
  creation_complete,
  replica_primary_assigned,
  replica_primary_assignment_complete
FROM
  `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE
  schema_name = 'project2.my_dataset';

The following query returns the jobs from the last seven days that would fail if their datasets were attached to a failover reservation, because they didn't run in an Enterprise Plus reservation:

WITH
  non_epe_reservations AS (
    SELECT project_id, reservation_name
    FROM `PROJECT_ID.region-LOCATION`.INFORMATION_SCHEMA.RESERVATIONS
    WHERE edition != 'ENTERPRISE_PLUS'
  )
SELECT *
FROM
  (
    SELECT
      job_id,
      reservation_id,
      referenced_table.project_id AS project_id,
      referenced_table.dataset_id AS dataset_id
    FROM
      (
        SELECT
          job_id,
          reservation_id,
          ARRAY_CONCAT(referenced_tables, [destination_table]) AS all_referenced_tables
        FROM
          `PROJECT_ID.region-LOCATION`.INFORMATION_SCHEMA.JOBS
        WHERE
          creation_time
          BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
          AND CURRENT_TIMESTAMP()
      ) A,
      UNNEST(all_referenced_tables) AS referenced_table
  ) jobs
LEFT OUTER JOIN non_epe_reservations
  ON (
    jobs.reservation_id = CONCAT(
      non_epe_reservations.project_id, ':', 'LOCATION', '.', non_epe_reservations.reservation_name))
WHERE
  (jobs.reservation_id IS NULL
    OR non_epe_reservations.reservation_name IS NOT NULL)
  AND CONCAT(jobs.project_id, ':', jobs.dataset_id)
    IN UNNEST(
      [
        'PROJECT_ID:DATASET_ID',
        'PROJECT_ID:DATASET_ID']);

Replace the following:

  • PROJECT_ID: the project ID.
  • DATASET_ID: the dataset ID.
  • LOCATION: the location.

What's next