Cross-region dataset replication
With BigQuery dataset replication, you can set up automatic replication of a dataset between two different regions or multi-regions.
When you create a dataset in BigQuery, you select the region or multi-region where the data is stored. A region is a collection of data centers within a geographical area, and a multi-region is a large geographic area that contains two or more geographic regions. Your data is stored in one of the contained regions.
BigQuery always stores copies of your data in two different Google Cloud zones within the dataset location. A zone is a deployment area for Google Cloud resources within a region. In all regions, replication between zones uses synchronous dual writes. Selecting a multi-region location does not provide cross-region replication or regional redundancy, so there is no increase in dataset availability in the event of a regional outage. Data is stored in a single region within the geographic location.
For additional geo-redundancy, you can replicate any dataset. BigQuery creates a secondary replica of the dataset, located in another region that you specify. This replica is then asynchronously replicated between two zones within the other region, for a total of four zonal copies.
If you replicate a dataset, BigQuery stores the data in the region that you specify.
- Primary region. When you first create a dataset, BigQuery places the dataset in the primary region.
- Secondary region. When you add a dataset replica, BigQuery places the replica in the secondary region.
Initially, the replica in the primary region is the primary replica, and the replica in the secondary region is the secondary replica.
The primary replica is writeable, and the secondary replica is read-only. Writes to the primary replica are asynchronously replicated to the secondary replica. Within each region, the data is stored redundantly in two zones. Network traffic never leaves the Google Cloud network.
The following diagram shows the replication that occurs when a dataset is replicated:
If the primary region is online, you can manually switch to the secondary replica. For more information, see Promote the secondary replica.
You are billed for the following for replicated datasets:
- Storage. Storage bytes in the secondary region are billed as a separate copy in the secondary region. See BigQuery storage pricing.
- Data replication. For more information on how you are billed for data replication, see Data replication pricing.
Compute capacity in the secondary region
To run jobs and queries against the replica in the secondary region, you must purchase slots within the secondary region or run an on-demand query.
You can use the slots to perform read-only queries from the secondary replica. If you promote the secondary replica to be the primary, you can also use those slots to write to the replica.
You can purchase the same number of slots as you have in the primary region, or a different number of slots. If you purchase fewer slots, it might affect query performance.
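As a sketch, dedicated capacity in the secondary region can be provisioned with BigQuery's reservation DDL. The project names, reservation name, and slot count below are hypothetical placeholders:

```sql
-- Hypothetical: create a 100-slot reservation in the secondary region (us-east4)
-- so that jobs against the replica have dedicated compute capacity.
CREATE RESERVATION `admin-project.region-us-east4.secondary_reservation`
OPTIONS (slot_capacity = 100);

-- Assign a project's query jobs in us-east4 to that reservation.
CREATE ASSIGNMENT `admin-project.region-us-east4.secondary_reservation.my_assignment`
OPTIONS (assignee = 'projects/query-project', job_type = 'QUERY');
```

Alternatively, on-demand queries in the secondary region need no reservation and are billed per byte scanned.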
Before adding a dataset replica, you need to create the initial dataset you want to replicate in BigQuery if it doesn't exist already. The location of the added replica is set to the location that you specify when adding the replica. The location of the added replica must be different from the location of the initial dataset. This means that the data in your dataset is continually replicated between the location the dataset was created in and the location of the replica. For replicas that require colocation, such as views, materialized views, non-BigLake external tables, or BigLake tables, adding a replica in a location that is different from, or not compatible with, your source data's location could result in job errors.
When customers replicate a dataset across regions, BigQuery ensures data is located only in the locations where the replicas were created.
Dataset replication is subject to the following colocation requirements.
Querying data on Cloud Storage requires that the Cloud Storage bucket is colocated with the replica. Use the external tables location considerations when deciding where to place your replica.
BigQuery dataset replication is subject to the following limitations:
- Data in uncommitted streams is not replicated to the secondary replica when a table is written to with the BigQuery Storage Write API. Replication of streaming data from the BigQuery Storage Write API or tabledata.insertAll is best-effort and might see high replication delay.
- Tables ingested through Datastream to BigQuery by using change data capture are not supported.
- Replication and switchover are managed through SQL data definition language (DDL) statements.
- You are limited to one replica of each dataset for each region or multi-region. You cannot create two secondary replicas of the same dataset in the same destination region.
- Resources within replicas are subject to the limitations as described in Resource behavior.
- Policy tags and associated data policies are not replicated to the secondary replica. Any queries that reference columns with policy tags in regions other than the original region fail, even if that replica is promoted.
- Time travel is only available in the secondary replica after the creation of the secondary replica is completed.
- You can only replicate a dataset with fewer than 100,000 tables.
- You are limited to a maximum of 4 replica promotions per day per dataset.
- Replication bandwidth for the initial copy operation is limited to 1 GB/second per project per continent-pair. Replication at this speed is not guaranteed.
- Tables with customer-managed encryption keys (CMEK) applied are not queryable in the secondary region if the replica_kms_key value is not configured.
Primary and secondary replicas are subject to the following differences:
| Region 1 primary replica | Region 2 secondary replica | Notes |
|---|---|---|
| BigLake table | BigLake table | Same as external tables. |
| External table | External table | Only the external table definition is replicated. The query fails when the Cloud Storage bucket is not colocated in the same location as the replica. |
| Logical view | Logical view | Logical views that reference a dataset or resource that is not located in the same location as the logical view fail when queried. |
| Managed table | Managed table | No difference. |
| Materialized view | Materialized view | If a referenced table is not in the same region as the materialized view, the query fails. Replicated materialized views might see staleness above the view's maximum staleness. |
| Model | Model | Stored as managed tables. |
| Remote function | Remote function | Connections are regional. Remote functions that reference a dataset or resource (connection) that is not located in the same location as the remote function fail when run. |
| Routine | User-defined function (UDF) or stored procedure | Routines that reference a dataset or resource that is not located in the same location as the routine fail when run. Any routine that references a connection, such as a remote function, does not work outside the source region. |
| Row access policy | Row access policy | No difference. |
| Search index | Search index | Not replicated. |
| Stored procedure | Stored procedure | Stored procedures that reference a dataset or resource that is not located in the same location as the stored procedure fail when run. |
| Table clone | Managed table | Billed as a deep copy in the secondary replica. |
| Table snapshot | Table snapshot | Billed as a deep copy in the secondary replica. |
| Table-valued function (TVF) | TVF | TVFs that reference a dataset or resource that is not located in the same location as the TVF fail when run. |
| UDF | UDF | UDFs that reference a dataset or resource that is not located in the same location as the UDF fail when run. |
Cross-region replication is not intended for use as a disaster recovery plan during a total-region outage. In the case of a total region outage in the primary replica's region, you cannot promote the secondary replica. Because secondary replicas are read-only, you can't run any write jobs on the secondary replica and can't promote the secondary region until the primary replica's region is restored.
The following table explains the impact of total-region outages on your replicated data:
| Region 1 | Region 2 | Outage region | Impact |
|---|---|---|---|
| Primary replica | Secondary replica | Region 2 | Read-only jobs running in region 2 against the secondary replica fail. |
| Primary replica | Secondary replica | Region 1 | All jobs running in region 1 fail. Read-only jobs continue to run in region 2, where the secondary replica is located. The contents of region 2 are stale until region 2 is successfully synced with region 1. |
To create a writable dataset in the region containing the secondary dataset replica, you must create a new dataset in that region and copy the contents of the secondary dataset replica into the new dataset. For information about copying datasets, see Copy a dataset.
Use dataset replication
This section describes how to replicate a dataset, promote the secondary replica, and run BigQuery read jobs in the secondary region.
To get the permissions that you need to manage replicas, ask your administrator to grant you the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on your schema. For more information about granting roles, see Manage access.
Replicate a dataset
To replicate a dataset, use the ALTER SCHEMA ADD REPLICA DDL statement.
You can add a replica to any dataset that's located in a region or multi-region where the dataset isn't already replicated. After you add a replica, it takes time for the initial copy operation to complete. While the data is being replicated, you can still run queries that reference the primary replica, with no reduction in query processing capacity. You can't create a replica in a region that is contained within the dataset's multi-region location.
The following example creates a dataset named my_dataset in the us multi-region and then adds a replica named us-east4:

```sql
-- Create the primary replica in the US multi-region.
CREATE SCHEMA my_dataset OPTIONS(location = 'us');

-- Create a replica in the secondary region.
ALTER SCHEMA my_dataset ADD REPLICA `us-east4` OPTIONS(location = 'us-east4');
```
To confirm when the secondary replica has successfully been created, you can query the creation_complete column in the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view.
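As one way to run that check, assuming the replica region's INFORMATION_SCHEMA.SCHEMATA_REPLICAS view:

```sql
-- creation_complete becomes true once the initial copy has finished.
SELECT schema_name, replica_name, creation_complete
FROM `region-us-east4`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';
```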
Promote the secondary replica
If the primary region is online, you can promote the secondary replica. Promotion switches the secondary replica to be the writeable primary. This operation completes within a few seconds if the secondary replica is caught up with the primary replica. If the secondary replica is not caught up, the promotion can't complete until it is caught up. The secondary replica can't be promoted to the primary if the region containing the primary has an outage.
Note the following:
- All writes to tables return errors while promotion is in process. The old primary replica becomes non-writable immediately when the promotion begins.
- Tables that aren't fully replicated at the time the promotion is initiated return stale reads.
To promote a replica to be the primary replica, use the ALTER SCHEMA SET OPTIONS DDL statement and set the primary_replica option.

Note the following:
- You must explicitly set the job location to the secondary region in query settings. For more information, see Specify locations.
The following example promotes the
us-east4 replica to be the primary:
```sql
ALTER SCHEMA my_dataset SET OPTIONS(primary_replica = 'us-east4');
```
To confirm when the secondary replica has successfully been promoted, you can query the replica_primary_assignment_complete column in the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view.
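A check along these lines, assuming the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view, can confirm the promotion:

```sql
-- replica_primary_assignment_complete becomes true once promotion finishes.
SELECT schema_name, replica_name, replica_primary_assignment_complete
FROM `region-us-east4`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';
```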
Remove a dataset replica
To remove a replica and stop replicating the dataset, use the
ALTER SCHEMA DROP REPLICA DDL statement.
The following example removes the us replica:

```sql
ALTER SCHEMA my_dataset DROP REPLICA IF EXISTS `us`;
```
You must first drop any secondary replicas before you can delete the entire dataset. If you try to delete the entire dataset without first dropping all of its secondary replicas, you receive an error.
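For example, assuming a single secondary replica in us-east4, the required order is a sketch like the following:

```sql
-- Drop the secondary replica first ...
ALTER SCHEMA my_dataset DROP REPLICA IF EXISTS `us-east4`;

-- ... then the dataset itself can be deleted.
DROP SCHEMA my_dataset;
```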
List dataset replicas
To list the dataset replicas in a project, query the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view.
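A minimal sketch of such a query, assuming the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view and its schema_name, replica_name, and creation_complete columns:

```sql
-- List all dataset replicas visible in the US multi-region for the current project.
SELECT schema_name, replica_name, creation_complete
FROM `region-us`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS;
```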
Migrate datasets
You can use cross-region dataset replication to migrate your datasets from one region to another. The following example demonstrates the process of migrating the my_migration dataset from the US multi-region to the EU multi-region using cross-region replication.
Replicate the dataset
To begin the migration process, first replicate the dataset in the region that you want to migrate the data to. In this scenario, you are migrating the my_migration dataset to the EU multi-region:

```sql
-- Create a replica in the secondary region.
ALTER SCHEMA my_migration ADD REPLICA `eu` OPTIONS(location = 'eu');
```
This creates a secondary replica named eu in the EU multi-region. The primary replica is the my_migration dataset in the US multi-region.
Promote the secondary replica
To continue migrating the dataset to the EU multi-region, promote the eu replica:

```sql
ALTER SCHEMA my_migration SET OPTIONS(primary_replica = 'eu');
```
After the promotion is complete, eu is the primary replica, and it is writable.
Complete the migration
To complete the migration from the US multi-region to the EU multi-region, drop the us replica. This step is not required, but it is useful if you don't need a dataset replica beyond your migration needs.

```sql
ALTER SCHEMA my_migration DROP REPLICA IF EXISTS `us`;
```
Your dataset is located in the EU multi-region, and there are no replicas of the my_migration dataset. You have successfully migrated your dataset to the EU multi-region. The complete list of resources that are migrated can be found in Resource behavior.
Customer-managed encryption keys (CMEK)
Customer-managed Cloud Key Management Service keys are not automatically replicated when you create a secondary replica. To maintain the encryption on your replicated dataset, you must set the replica_kms_key for the location of the added replica. You can set the replica_kms_key by using the ALTER SCHEMA ADD REPLICA DDL statement.
Replicating datasets with CMEK behaves as described in the following scenarios:
If the source dataset has a default_kms_key, you must provide a replica_kms_key that was created in the replica dataset's region when using the ALTER SCHEMA ADD REPLICA DDL statement.
If the source dataset doesn't have a value set for default_kms_key, you can't set the replica_kms_key value.
If the source dataset doesn't have a value set for default_kms_key, but there are individual tables in the source dataset with CMEK applied, those tables aren't queryable in the replicated dataset. To query the tables, do the following:
- Add a default_kms_key value for the source dataset.
- When you create a new replica using the ALTER SCHEMA ADD REPLICA DDL statement, set a value for the replica_kms_key option. The CMEK tables are then queryable in the destination region. All the CMEK tables in the destination region use the same replica_kms_key, regardless of the key used in the source region.
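Those two steps might look like the following sketch. The key resource names are hypothetical, and the dataset-level option is assumed to be default_kms_key_name, BigQuery's dataset CMEK option:

```sql
-- Step 1: set a default CMEK on the source dataset (hypothetical key name).
ALTER SCHEMA my_dataset SET OPTIONS (
  default_kms_key_name = 'projects/my-project/locations/us/keyRings/my_kr/cryptoKeys/us_key'
);

-- Step 2: add a replica with a key created in the replica's region so that
-- the CMEK tables become queryable there.
ALTER SCHEMA my_dataset ADD REPLICA `us-east4` OPTIONS (
  location = 'us-east4',
  replica_kms_key = 'projects/my-project/locations/us-east4/keyRings/my_kr/cryptoKeys/east_key'
);
```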
Create a replica with CMEK
The following example creates a replica in the us-west1 region with a replica_kms_key value set. For the CMEK key, grant the BigQuery service account permission to encrypt and decrypt.

```sql
-- Create a replica in the secondary region.
ALTER SCHEMA my_dataset ADD REPLICA `us-west1`
OPTIONS(
  location = 'us-west1',
  replica_kms_key = 'my_us_west1_kms_key_name');
```
Replicating datasets with CMEK applied is subject to the following limitations:
- You can't update the replicated Cloud KMS key after the replica is created.
- You can't update the default_kms_key value on the source dataset after the dataset replicas have been created.
- If the provided replica_kms_key is not valid in the destination region, the dataset won't be replicated.
- CMEK-encrypted tables replicated to the destination region are visible, but they can't be queried or written to, because the source CMEK is not recognized in the destination region.