The BigQuery Data Transfer Service for Cloud Storage allows you to schedule recurring data loads from Cloud Storage to BigQuery.
Before you begin
Before you create a Cloud Storage transfer:
- Verify that you have completed all actions required in Enabling the BigQuery Data Transfer Service.
- Retrieve your Cloud Storage URI.
- Allow pop-ups in your browser from bigquery.cloud.google.com so that you can view the permissions window. You must grant the BigQuery Data Transfer Service permission to manage your transfer.
- Create a BigQuery dataset to store your data.
- Create the destination table for your transfer and specify the schema definition.
Limitations
Recurring transfers from Cloud Storage to BigQuery are subject to the following limitations:
- All files matching the patterns defined by either a wildcard or by runtime parameters for your transfer must share the same schema you defined for the destination table, or the transfer will fail. Table schema changes between runs also cause the transfer to fail.
- Source files in Cloud Storage must be at least one hour old to be picked up by the transfer.
- Because Cloud Storage objects can be versioned, archived objects are not supported for BigQuery transfers. Objects must be live to be transferred.
- Unlike individual loads of data from Cloud Storage to BigQuery, for ongoing transfers you need to create the destination table and its schema in advance of setting up the transfer. BigQuery cannot create the table as part of the recurring data transfer process.
- Transfers from Cloud Storage are always triggered with the WRITE_APPEND preference, which appends data to the destination table. See configuration.load.writeDisposition under load configuration for additional details.
- If your dataset's location is set to a value other than US, the regional or multi-regional Cloud Storage bucket must be in the same region as the dataset.
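The wildcard and minimum-age rules above can be sketched as a small filter. This is illustrative only: the object names and timestamps are stand-ins rather than a real Cloud Storage listing, and fnmatch semantics are an approximation of the service's actual wildcard matching.

```python
import fnmatch
from datetime import datetime, timedelta, timezone

def eligible_objects(objects, pattern, now=None, min_age=timedelta(hours=1)):
    """Return object names that match the wildcard pattern and are at
    least min_age old -- mirroring the transfer's pickup rules.

    objects: iterable of (name, last_updated_datetime) pairs.
    """
    now = now or datetime.now(timezone.utc)
    return [
        name
        for name, updated in objects
        if fnmatch.fnmatch(name, pattern) and now - updated >= min_age
    ]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
listing = [
    ("logs/2024/part-000.csv", now - timedelta(hours=2)),    # old enough
    ("logs/2024/part-001.csv", now - timedelta(minutes=10)), # too new
    ("logs/2024/readme.txt",   now - timedelta(hours=5)),    # wrong extension
]
print(eligible_objects(listing, "logs/*/part-*.csv", now=now))
# -> ['logs/2024/part-000.csv']
```

Note that files matching the pattern but newer than one hour are simply skipped; they are picked up by a later run once they are old enough.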
Depending on the format of your Cloud Storage source data, there may be additional limitations. For more information, see:
- CSV limitations
- JSON limitations
- Cloud Datastore export limitations
- Cloud Firestore export limitations
- Limitations on nested and repeated data
Required permissions
When you load data into BigQuery, you need project-level or dataset-level permissions that allow you to load data into new or existing BigQuery tables and partitions. If you are loading data from Cloud Storage, you also need access to the bucket that contains your data. Ensure that you have the following required permissions:
- BigQuery: bigquery.transfers.update permissions are required to create the scheduled transfer. The bigquery.admin predefined, project-level IAM role includes bigquery.transfers.update permissions. For more information on IAM roles in BigQuery, see Access Control.
- Cloud Storage: storage.objects.get permissions are required at the project level or on the individual bucket. If you are using a URI wildcard, you must also have storage.objects.list permissions. If you would like to delete the source files after each successful transfer, you also need storage.objects.delete permissions. The storage.objectAdmin predefined, project-level IAM role includes all of these permissions.
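As a quick sanity check, the permission requirements above reduce to set containment. The permission strings below come from this section; the granted set is a made-up example, and this helper is an illustrative sketch, not part of any Google Cloud client library.

```python
# Permission strings from the requirements above.
REQUIRED_BIGQUERY = {"bigquery.transfers.update"}
REQUIRED_STORAGE = {"storage.objects.get"}
WILDCARD_EXTRA = {"storage.objects.list"}

def missing_permissions(granted, uses_wildcard=False):
    """Return the permissions still needed for a Cloud Storage transfer."""
    required = REQUIRED_BIGQUERY | REQUIRED_STORAGE
    if uses_wildcard:
        required |= WILDCARD_EXTRA
    return sorted(required - set(granted))

granted = {"bigquery.transfers.update", "storage.objects.get"}
print(missing_permissions(granted, uses_wildcard=True))
# -> ['storage.objects.list']
```

If you plan to delete source files after each run, storage.objects.delete would join the required set as well.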
Setting up a Cloud Storage transfer
To set up a Cloud Storage transfer in BigQuery Data Transfer Service:
Go to the BigQuery web UI.
Click Add Transfer.
On the New Transfer page:
- For Source, choose Google Cloud Storage.
- For Display name, enter a name for the transfer such as My Transfer. The display name can be any value that allows you to easily identify the transfer if you need to modify it later.
(Optional) For Schedule, you can leave the default value of Daily (every 24 hours, based on creation time), or click Edit to change the time. You can also change the interval to Weekly, Monthly, or Custom. When selecting Custom, a cron-like time specification is expected, for example every 12 hours. The shortest allowed period is 12 hours. See the schedule field under TransferConfig for additional valid API values.
For Destination Dataset, choose the appropriate dataset.
- For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
- For Cloud Storage URI, enter the Cloud Storage URI. Wildcards and parameters are supported.
- For Delete source files after transfer, check the box if you want to delete the source files after each successful transfer. Delete jobs are best-effort and are not retried if the first attempt to delete the source files fails.
- For File format, select the type of files you wish to transfer.
- Under the Transfer Options - All Formats section:
- For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
- Under the Transfer Options - JSON, CSV section:
- For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
- Under the Transfer Options - CSV section:
- For Field delimiter, enter the character that separates fields. The default value is a comma.
- For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
- For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
- For Allow jagged rows, check the box if you want to allow the transfer of rows with missing trailing optional columns. The missing values are treated as nulls.
- Check Send email notifications to allow email notifications of transfer run failures.
- For Cloud Pub/Sub topic, enter your Cloud Pub/Sub topic name, in the form projects/your-project-id/topics/your-topic-name.
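Taken together, the options above amount to a small configuration record. The sketch below models them as a plain dict and encodes the defaults described in this section; the key names are illustrative stand-ins, not the service's actual API field names.

```python
# Defaults as described in the transfer options above (key names are
# illustrative, not the BigQuery Data Transfer Service API's fields).
DEFAULTS = {
    "file_format": "CSV",
    "max_bad_records": 0,           # Number of errors allowed: default 0
    "ignore_unknown_values": False,
    "field_delimiter": ",",         # default is a comma
    "skip_leading_rows": 0,         # Header rows to skip: default 0
    "allow_quoted_newlines": False,
    "allow_jagged_rows": False,
    "delete_source_files": False,
}

def make_transfer_options(**overrides):
    """Build a transfer-options record, rejecting unknown option names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

opts = make_transfer_options(skip_leading_rows=1, allow_jagged_rows=True)
print(opts["field_delimiter"], opts["skip_leading_rows"])
# -> , 1
```

Rejecting unknown keys up front mirrors how the web UI only exposes the options valid for the selected file format.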
Setting up a refresh transfer
In addition to setting up a recurring transfer from Cloud Storage, you can also set up a refresh run to pick up additional data files.
If the transfer configuration is date-related (parameterized), the Cloud Storage URI is parameterized, or both, you can run the transfer on specific dates.
To set up a refresh transfer:
Go to the BigQuery web UI.
Click your transfer.
Click Refresh Transfer.
In the Start Transfer Runs dialog, choose your Start Time and End Time.
If your Cloud Storage transfer configuration is not parameterized, you won't be given date options when you click Refresh Transfer. Instead, the refresh will happen immediately. Files must be at least one hour old to be picked up in a transfer.
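When the configuration is parameterized by run date, a refresh over a start/end range amounts to one backfill run per day. This sketch enumerates those dates and renders a run-date placeholder in a destination table template; the {run_date} syntax here is a simplified stand-in for the service's runtime parameters.

```python
from datetime import date, timedelta

def refresh_runs(template, start, end):
    """Yield (run_date, rendered_table_name) for each day in [start, end],
    substituting {run_date} as YYYYMMDD -- an illustrative approximation
    of parameterized destination table names."""
    day = start
    while day <= end:
        yield day, template.replace("{run_date}", day.strftime("%Y%m%d"))
        day += timedelta(days=1)

runs = list(refresh_runs("events_{run_date}", date(2024, 1, 1), date(2024, 1, 3)))
for run_day, table in runs:
    print(run_day, table)
# -> 2024-01-01 events_20240101
# -> 2024-01-02 events_20240102
# -> 2024-01-03 events_20240103
```

Each enumerated day corresponds to one transfer run in the Start Time / End Time range you choose in the Start Transfer Runs dialog.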