The BigQuery Data Transfer Service for Cloud Storage allows you to schedule recurring data loads from Cloud Storage to BigQuery.
Before you begin
Before you create a Cloud Storage transfer:
- Verify that you have completed all actions required in Enabling the BigQuery Data Transfer Service.
- Retrieve your Cloud Storage URI.
- Allow pop-ups in your browser from bigquery.cloud.google.com so that you can view the permissions window. You must grant the BigQuery Data Transfer Service permission to manage your transfer.
- Create a BigQuery dataset to store your data.
- Create the destination table for your transfer and specify the schema definition.
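For example, here is a minimal sketch of creating the dataset and destination table with the bq command-line tool; the dataset name, table name, and two-column schema are placeholders for illustration:

bq mk --dataset mydataset
bq mk --table mydataset.mytable name:STRING,post_abbr:STRING

The inline shorthand defines the schema as column:type pairs; you can also supply a JSON schema file instead.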
Limitations
Recurring transfers from Cloud Storage to BigQuery are subject to the following limitations:
- All files matching the patterns defined by either a wildcard or by runtime parameters for your transfer must share the schema you defined for the destination table, or the transfer will fail. Table schema changes between runs also cause the transfer to fail.
- Because Cloud Storage objects can be versioned, archived Cloud Storage objects are not supported for BigQuery transfers. Objects must be live to be transferred.
- Unlike individual loads of data from Cloud Storage to BigQuery, for ongoing transfers you need to create the destination table and its schema in advance of setting up the transfer. BigQuery cannot create the table as part of the recurring data transfer process.
- Transfers from Cloud Storage are always triggered with the WRITE_APPEND preference, which appends data to the destination table. See configuration.load.writeDisposition in the load job configuration for additional details.
- BigQuery Data Transfer Service does not guarantee that all files will be transferred, or transferred only once, if Cloud Storage files are touched mid-transfer.
- If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
- BigQuery Data Transfer Service does not guarantee data consistency for external data sources. Changes to the underlying data while a query is running can result in unexpected behavior.
Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see the limitations for your source file format.
Minimum intervals
- Source files in Cloud Storage must be at least one hour old to be picked up by the transfer.
- The minimum interval time between recurring transfers is 1 hour. The default interval for a recurring transfer is every 24 hours.
Required permissions
When you load data into BigQuery, you need permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you'll also need access to the bucket that contains your data. Ensure that you have the following required permissions:
- BigQuery: bigquery.transfers.update permissions are required to create the scheduled transfer. The bigquery.admin predefined Cloud IAM role includes bigquery.transfers.update permissions. For more information on Cloud IAM roles in BigQuery Data Transfer Service, see Access control reference.
- Cloud Storage: storage.objects.get permissions are required on the individual bucket or higher. If you are using a URI wildcard, you must also have storage.objects.list permissions. If you would like to delete the source files after each successful transfer, you also need storage.objects.delete permissions. The storage.objectAdmin predefined Cloud IAM role includes all of these permissions.
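As an illustration, one way to grant these permissions is with the gcloud and gsutil tools; the project, user email, and bucket names below are placeholders, and a narrower custom role may be preferable to bigquery.admin:

gcloud projects add-iam-policy-binding myproject \
--member='user:user@example.com' \
--role='roles/bigquery.admin'
gsutil iam ch user:user@example.com:objectAdmin gs://mybucket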
Setting up a Cloud Storage transfer
To create a Cloud Storage transfer in BigQuery Data Transfer Service:
Console
Go to the Cloud Console.
Click Transfers.
Click Create.
On the Create Transfer page:
In the Source type section, for Source, choose Cloud Storage.
In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.
- For Repeats, choose an option for how often to run the transfer. The minimum interval is 1 hour.
  - Daily (default)
  - Weekly
  - Monthly
  - Custom
  - On-demand
For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.
In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.
In the Data source details section:
- For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
- For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported, as shown in the example after this list.
- For Delete source files after transfer, check the box if you want to delete the source files after each successful transfer. Delete jobs are best effort. Delete jobs do not retry if the first effort to delete the source files fails.
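For illustration, a hypothetical configuration might combine a wildcard URI with a parameterized table name:

Cloud Storage URI: gs://mybucket/myfolder/*.csv
Destination table: mytable_{run_date}

Here {run_date} resolves to the date of the transfer run, so each run loads into a date-suffixed table such as mytable_20180915. See Using runtime parameters in Cloud Storage transfers for the supported syntax.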
In the Transfer Options section:
- Under All Formats:
  - For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
- Under JSON, CSV:
  - For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
- Under CSV:
  - For Field delimiter, enter the character that separates fields. The default value is a comma.
  - For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
  - For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
  - For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.
(Optional) In the Notification options section:
- Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
- For Select a Pub/Sub topic, choose your topic name or click Create a topic. This option configures Pub/Sub run notifications for your transfer.
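If you don't already have a topic, here is a minimal sketch of creating one with the gcloud tool; the topic name is a placeholder:

gcloud pubsub topics create mytopic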
Click Save.
Classic UI
Go to the BigQuery web UI.
Click Transfers.
Click Add Transfer.
On the New Transfer page:
- For Source, choose Cloud Storage.
- For Display name, enter a name for the transfer such as My Transfer. The display name can be any value that allows you to easily identify the transfer if you need to modify it later.
- (Optional) For Schedule, you can leave the default value of Daily (every 24 hours, based on creation time), or click Edit to change the time. You can also change the interval to Weekly, Monthly, or Custom. When selecting Custom, a Cron-like time specification is expected, for example every 12 hours. The minimum interval is 1 hour. See the schedule field under TransferConfig for additional valid API values.
For Destination dataset, choose the appropriate dataset.
For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported.
For Delete source files after transfer, check the box if you want to delete the source files after each successful transfer. Delete jobs are best effort. Delete jobs do not retry if the first effort to delete the source files fails.
For File format, select the type of files you wish to transfer.
Under the Transfer Options - All Formats section:
- For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
Under the Transfer Options - JSON, CSV section:
- For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
Under the Transfer Options - CSV section:
- For Field delimiter, enter the character that separates fields. The default value is a comma.
- For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
- For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
- For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.
(Optional) Expand the Advanced section and configure run notifications for your transfer.
For Cloud Pub/Sub topic, enter your Pub/Sub topic name, for example, projects/myproject/topics/mytopic.
Check Send email notifications to allow email notifications of transfer run failures.
Click Add.
CLI
Enter the bq mk command and supply the transfer creation flag, --transfer_config. The following flags are also required:
- --data_source
- --display_name
- --target_dataset
- --params

bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'
Where:
- project_id is your project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
- data_source is the data source: google_cloud_storage.
- name is the display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
- dataset is the target dataset for the transfer configuration.
- parameters contains the parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'.
  - For Cloud Storage, you must supply the data_path_template, destination_table_name_template, and file_format parameters. data_path_template is the Cloud Storage URI that contains your files to be transferred, and can include one wildcard. destination_table_name_template is the name of your destination table. For file_format, indicate the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.
  - For all file_format values, you can include the optional param max_bad_records. The default value is 0.
  - For the JSON or CSV values in file_format, you can include the optional param ignore_unknown_values. This param is ignored if you haven't selected CSV or JSON for the file_format.
  - For CSV file_format, you can include the optional param field_delimiter for the character that separates fields. The default value is a comma. This param is ignored if you haven't selected CSV for the file_format.
  - For CSV file_format, you can include the optional param skip_leading_rows to indicate header rows you don't want to import. The default value is 0. This param is ignored if you haven't selected CSV for the file_format.
  - For CSV file_format, you can include the optional param allow_quoted_newlines if you want to allow newlines within quoted fields. This param is ignored if you haven't selected CSV for the file_format.
  - For CSV file_format, you can include the optional param allow_jagged_rows if you want to accept rows that are missing trailing optional columns. The missing values are filled in with NULLs. This param is ignored if you haven't selected CSV for the file_format.
  - The optional param delete_source_files deletes the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for delete_source_files is false.
For example, the following command creates a Cloud Storage transfer named My Transfer using a data_path_template value of gs://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format. The transfer is created in the default project:
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"gs://mybucket/myfile/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=google_cloud_storage
After running the command, you receive a message like the following:
[URL omitted] Please copy and paste the above URL into your web browser and
follow the instructions to retrieve an authentication code.
Follow the instructions and paste the authentication code on the command line.
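To confirm that the configuration was created, you can list your transfer configurations. This sketch assumes your transfers are in the US location:

bq ls --transfer_config --transfer_location=us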
API
Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
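As a sketch, the same configuration shown in the CLI example could be created by calling the REST method directly with curl; the project, bucket, dataset, and table names are placeholders, and the location is assumed to be US:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "displayName": "My Transfer",
        "dataSourceId": "google_cloud_storage",
        "destinationDatasetId": "mydataset",
        "params": {
          "data_path_template": "gs://mybucket/myfile/*.csv",
          "destination_table_name_template": "MyTable",
          "file_format": "CSV"
        }
      }' \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/myproject/locations/us/transferConfigs"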
Setting up a refresh run of a transfer
In addition to setting up a recurring transfer from Cloud Storage, you can also set up a refresh run to pick up additional data files.
If the transfer configuration is date-related (parameterized), the Cloud Storage URI is parameterized, or both, you can run the refresh for specific dates.
To set up a refresh transfer:
Console
Open the Cloud Console.
Click Transfers.
Click your transfer.
Click the More menu, and then select Refresh Transfer.
In the Schedule a backfill run dialog, choose your Start date and End date. You can use the classic BigQuery web UI to set finer-grained time limits.
Classic UI
Go to the BigQuery web UI.
Click Transfers.
Click your transfer.
Click Refresh Transfer.
In the Start Transfer Runs dialog, choose your Start Time and End Time.
If your Cloud Storage transfer configuration is not parameterized, you won't be given date options when you click Refresh Transfer. Instead, the refresh will happen immediately.
Click OK.
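You can also schedule a refresh (backfill) run from the command line with bq mk --transfer_run; the timestamps and the transfer configuration resource name below are placeholders:

bq mk --transfer_run \
--start_time='2018-09-01T00:00:00Z' \
--end_time='2018-09-02T00:00:00Z' \
projects/myproject/locations/us/transferConfigs/your_config_id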
What's next
- Learn about Using runtime parameters in Cloud Storage transfers
- Learn more about the BigQuery Data Transfer Service