You can copy a dataset within a region or from one region to another, without extracting, moving, and reloading data into BigQuery. You can copy the dataset one time or on a customized recurring schedule.
Before you begin
Before you create a dataset copy:
- Dataset copying uses features of the BigQuery Data Transfer Service. In the Google Cloud project for your destination dataset, verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
- Create a BigQuery dataset as the destination dataset, in the same region as your source dataset or in a different region. You specify a location for storing your BigQuery data when you create a dataset. Not all regions are supported yet for dataset copying (see Supported regions). The dataset name must be unique per project. A minimal Python sketch for creating the destination dataset follows this list.
- Locate the ID of your source dataset that you want to copy, and the source project's ID.
- If you intend to set up transfer run notifications for Pub/Sub, you must have `pubsub.topics.setIamPolicy` permissions. Pub/Sub permissions are not required if you just set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
- If you intend to overwrite the destination table when copying by using the `overwrite_destination_table` flag, both tables must have the same partitioning schema.
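The destination dataset can be created in the Cloud Console, with the bq command-line tool, or with a client library. As a rough illustration, the following is a minimal Python sketch using the google-cloud-bigquery client; the project ID, dataset ID, and location are placeholder values, not values taken from this page.

```python
from google.cloud import bigquery

# Placeholder project and dataset IDs; replace with your own values.
client = bigquery.Client(project="my-destination-project")

dataset = bigquery.Dataset("my-destination-project.my_destination_dataset")
# The dataset location must be a region or multi-region that supports
# dataset copying (see Supported regions).
dataset.location = "EU"

# Creates the empty destination dataset; raises an error if it already exists.
dataset = client.create_dataset(dataset, timeout=30)
print(f"Created dataset {dataset.project}.{dataset.dataset_id} in {dataset.location}")
```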
Required permissions
Before creating a dataset copy, ensure that the person creating the dataset copy has the following permissions in BigQuery:
- The `bigquery.transfers.update` and `bigquery.jobs.create` permissions on the project to create the copy transfer.
- The `bigquery.datasets.get` permission on the source dataset.
- The `bigquery.datasets.get`, `bigquery.datasets.update`, and `bigquery.tables.create` permissions on the destination dataset.
The `bigquery.admin` predefined, project-level IAM role includes all the required permissions for a dataset copy. For more information on IAM roles in BigQuery, see Predefined roles and permissions or the IAM permissions reference.
Setting up a dataset copy
To create a dataset copy:
Console
Option 1: Use the Copy Dataset icon.
Go to the BigQuery page in the Cloud Console.
Select the dataset name of the source dataset that you want to copy.
Click the Copy Dataset icon.
In the Copy dataset dialog, select the project ID and destination dataset ID. The project and dataset can be in different regions, but not all regions are supported for cross-region dataset copying.
Dataset names within a project must be unique.
Optionally, check the Overwrite destination table box if you want to refresh (overwrite) all the data in the destination dataset. Both tables and schema are overwritten.
Consider deleting the old dataset to avoid additional storage costs.
Option 2: Use the Transfers button.
Go to the BigQuery page in the Cloud Console.
Click Transfers.
Click + CREATE A TRANSFER.
On the Create Transfer page:
In the Source type section, for Source, choose Dataset Copy.
In the Transfer config name section, for Display name, enter a name for the transfer such as `My Transfer`. The transfer name can be any value that lets you easily identify the transfer if you need to modify it later.
In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.
For Repeats, choose an option for how often to run the transfer. Options include:
- Daily (default)
- Weekly
- Monthly
- Custom
- On-demand
If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week. If you choose Custom, a cron-like time specification is expected, for example `every 12 hours`. The shortest allowed period is 12 hours.
For Start date and run time, enter the date and time to start the transfer. If you choose Start now, you can't modify this field.
For Destination dataset, choose the dataset that you created (in the same region or a different region) to store the copied data.
For Source dataset, enter the name of the dataset that you want to copy.
For Source project, enter the ID of the project that your source dataset is in.
(Optional) Check the Overwrite destination table box if you want to refresh (overwrite) all data in the destination dataset. Both tables and schema are overwritten.
(Optional) In the Notification options section:
- Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
- For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Pub/Sub run notifications for your transfer.
Click Save.
Consider deleting the old dataset to avoid additional storage costs.
bq
Enter the `bq mk` command and supply the transfer creation flag `--transfer_config`. The following flags are also required:
- `--project_id`
- `--data_source`
- `--target_dataset`
- `--display_name`
- `--params`
bq mk --transfer_config --project_id=PROJECT_ID --data_source=DATA_SOURCE --target_dataset=DATASET --display_name=NAME --params='PARAMETERS'
Replace the following:
- `PROJECT_ID`: your Google Cloud project ID. If `--project_id` isn't specified, the default project is used.
- `DATA_SOURCE`: the data source: `cross_region_copy`.
- `DATASET`: the BigQuery target dataset for the transfer configuration.
- `NAME`: the display name for the copy job or transfer configuration. The transfer name can be any value that lets you easily identify the transfer if you need to modify it later.
- `PARAMETERS`: the parameters for the created transfer configuration in JSON format, for example: `--params='{"param":"param_value"}'`. For dataset copying, you must supply the `source_dataset_id` and `source_project_id` parameters, and optionally the `overwrite_destination_table` parameter.
Parameters for a dataset copy configuration are:
- `source_dataset_id`: the ID of the source dataset that you want to copy.
- `source_project_id`: the ID of the project that your source dataset is in.
- (Optional) `overwrite_destination_table`: include this parameter if you want to truncate the tables of a previous copy and refresh all the data.
For example, the following command creates a dataset copy configuration named `My Dataset Copy` with a target dataset named `mydataset` and a project with the ID of `myproject`.
bq mk --transfer_config --project_id=myproject --data_source=cross_region_copy --target_dataset=mydataset --display_name='My Dataset Copy' --params='{"source_dataset_id":"123_demo_eu","source_project_id":"mysourceproject","overwrite_destination_table":"true"}'
For additional information on the `bq mk` command, see Command-line tool reference.
API
Use the `projects.locations.transferConfigs.create` method and supply an instance of the `TransferConfig` resource.
Java
Before trying this sample, follow the Java setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Java API reference documentation.
Python
Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.
Install the Python client for the BigQuery Data Transfer API with `pip install google-cloud-bigquery-datatransfer`. Then create a transfer configuration to copy the dataset.
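For illustration, the following minimal Python sketch creates a dataset copy transfer configuration with the google-cloud-bigquery-datatransfer client. The project IDs, dataset IDs, and display name are placeholders; adjust the schedule to fit your needs.

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder project and dataset IDs; replace with your own values.
destination_project_id = "my-destination-project"
destination_dataset_id = "my_destination_dataset"
source_project_id = "my-source-project"
source_dataset_id = "my_source_dataset"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="My Dataset Copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": source_project_id,
        "source_dataset_id": source_dataset_id,
    },
    schedule="every 24 hours",
)

# The transfer configuration is created in the destination project.
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```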
Viewing and refreshing dataset copy jobs
You can see progress and view details of a dataset copy under Transfers.
Console
Dataset copy refresh runs occur on the schedule that you specify; the default is every 24 hours. To request an additional run for immediate execution, click the More menu and then click Refresh Now.
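If you want to request an immediate run programmatically instead of from the console, the BigQuery Data Transfer Service exposes a startManualTransferRuns method. The following is a minimal, illustrative Python sketch; the transfer configuration resource name is a placeholder.

```python
import datetime

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder: the full resource name of an existing dataset copy configuration.
transfer_config_name = "projects/1234/locations/us/transferConfigs/abcd"

# Request a run for the current time.
request = bigquery_datatransfer.StartManualTransferRunsRequest(
    parent=transfer_config_name,
    requested_run_time=datetime.datetime.now(datetime.timezone.utc),
)
response = transfer_client.start_manual_transfer_runs(request)
for run in response.runs:
    print(f"Started manual run: {run.name}")
```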
Pricing
There is no charge for dataset copying during the beta period.
At general availability, data copied between regions will be billed at the same rates as Compute Engine network egress between regions.
BigQuery sends compressed data for copying across regions, so the GB billed might be less than the size of your dataset.
Quotas and limitations
The following limits apply to copying datasets in BigQuery. The limits apply to copy jobs that are created by using the bq command-line tool or the Cloud Console, and to copy jobs that are submitted programmatically by using the copy-type `jobs.insert` API method.
All standard BigQuery usage charges for storage and querying apply on copied data. For more information, see Pricing.
BigQuery dataset copies are subject to the following limitations.
Quotas
In-region
Copying datasets within the same region is subject to the same quota as all BigQuery copy jobs.
The following limits apply to copying tables in BigQuery. The limits apply to jobs created automatically by copying data using the bq command-line tool or the Cloud Console. The limits also apply to copy jobs submitted programmatically by using the copy-type `jobs.insert` API method.
- Copy jobs per destination table per day: 1,000 (including failures)
- Copy jobs per project per day: 100,000 (including failures)
Cross-region
- Copy jobs per destination table per day: 100 (including failures)
- Copy jobs per project per day: 2,000 (including failures)
General limitations
You must create the destination dataset before you can create a transfer configuration for the dataset copy.
For each dataset copy configuration, you can have one active copy at a time. Additional transfer runs are queued.
Data type limitations
- Copying views is not supported.
- Copying external tables is not supported.
- Copying storage in a streaming buffer is not supported.
Support for copying tables encrypted with customer-managed keys (CMEK) varies depending on whether you are copying within the same region or across regions.
In-region
Copying encrypted tables, including tables encrypted with customer-managed keys (CMEK), is supported for dataset copying within the same region.
Cross-region
Copying tables with the default encryption across regions is supported. Copying tables encrypted with customer-managed keys (CMEK) across regions is not currently supported. CMEK-encrypted tables are skipped when copying tables to the destination dataset.
Source dataset limitations
The source dataset can contain a maximum of 20,000 tables. The number of tables that can be copied per run depends on whether you are copying within the same region or across regions.
In-region
The source dataset can contain a maximum of 20,000 tables.
A maximum of 20,000 tables can be copied for each run to the destination dataset.
Cross-region
The source dataset can contain a maximum of 20,000 tables.
A maximum of 1,000 tables can be copied per run to the destination dataset.
Example
If you configure a cross-region copy of a dataset with 8,000 tables in it, the BigQuery Data Transfer Service automatically creates eight sequential runs. The first run copies 1,000 tables; 24 hours later, another run copies the next 1,000 tables, and so on, until all tables in the dataset are copied (up to the maximum of 20,000 tables per dataset).
Table limitations
Copying partitioned tables is currently supported. However, appending data to a partitioned table is not supported.
If a table exists in both the source dataset and the destination dataset, and it has not changed since the last successful copy, it is skipped. This is true even if the Overwrite destination table box is checked.
When truncating tables, the dataset copy does not detect any changes made to the destination dataset before beginning the copy. All the data in the destination dataset is refreshed (overwritten). Both tables and schema are overwritten.
- If you intend to overwrite the destination table when copying by using the `overwrite_destination_table` flag, both tables must have the same partitioning schema.
Supported regions
There are two types of locations:
A region is a specific geographic place, such as London.
A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
Datasets can be copied from region to region, single region to multi-region, multi-region to single region, or multi-region to multi-region.
Not all regions are currently supported for dataset copying. You can create dataset copies in regions where the BigQuery Data Transfer Service is currently supported, as shown below.
Regional locations
Region description | Region name
---|---
**Americas** |
Iowa | us-central1
Las Vegas | us-west4
Los Angeles | us-west2
Montréal | northamerica-northeast1
Northern Virginia | us-east4
Oregon | us-west1
Salt Lake City | us-west3
São Paulo | southamerica-east1
South Carolina | us-east1
**Europe** |
Belgium | europe-west1
Finland | europe-north1
Frankfurt | europe-west3
London | europe-west2
Netherlands | europe-west4
Warsaw | europe-central2
Zürich | europe-west6
**Asia Pacific** |
Hong Kong | asia-east2
Jakarta | asia-southeast2
Mumbai | asia-south1
Osaka | asia-northeast2
Seoul | asia-northeast3
Singapore | asia-southeast1
Sydney | australia-southeast1
Taiwan | asia-east1
Tokyo | asia-northeast1
Multi-regional locations
Multi-region description | Multi-region name
---|---
Data centers within member states of the European Union¹ | EU
Data centers in the United States | US
¹ Data located in the EU multi-region is not stored in the `europe-west2` (London) or `europe-west6` (Zürich) data centers.
What's next
- For more information about using transfers, including getting information about a transfer configuration, listing transfer configurations, and viewing a transfer's run history, see Working with transfers.