Overview of Blob Storage transfers
The BigQuery Data Transfer Service for Azure Blob Storage lets you automatically schedule and manage recurring load jobs from Blob Storage into BigQuery.
Supported file formats
The BigQuery Data Transfer Service currently supports loading data from Blob Storage in the following formats:
- Comma-separated values (CSV)
- JSON (newline delimited)
- Avro
- Parquet
- ORC
Supported compression types
The BigQuery Data Transfer Service for Blob Storage supports loading compressed data. The compression types supported by the BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.
Transfer prerequisites
To load data from a Blob Storage data source, first gather the following:
- The Blob Storage account name, container name, and optionally a data path for your source data. The data path is used to match common object prefixes and file extensions; if it is omitted, all files in the container are transferred.
- An Azure shared access signature (SAS) token that grants read access to your data source. For details on creating a SAS token, see Shared access signature (SAS).
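After you gather these values, you can create the transfer in the Google Cloud console, with the bq command-line tool, or programmatically. The following Python sketch uses the google-cloud-bigquery-datatransfer client library. The azure_blob_storage data source ID and the parameter keys shown (storage_account, container, data_path, sas_token, file_format) follow the pattern used by other BigQuery Data Transfer Service sources, so treat them as assumptions and confirm them against the setup guide linked under What's next.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# All identifiers below are placeholders. The parameter keys and the
# "azure_blob_storage" data source ID are assumptions modeled on other
# BigQuery Data Transfer Service sources.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="My Blob Storage transfer",
    data_source_id="azure_blob_storage",
    params={
        "destination_table_name_template": "my_table",
        "storage_account": "mystorageaccount",
        "container": "my-container",
        "data_path": "my-folder/*.csv",  # optional; omit to transfer all files
        "sas_token": "<sas-token-with-read-and-list-access>",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```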
Transfer runtime parameterization
The Blob Storage data path and the destination table can both be parameterized, letting you load data from containers organized by date. The parameters used by Blob Storage transfers are the same as those used by Cloud Storage transfers. For details, see Runtime parameters in transfers.
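For example, because the parameters match those of Cloud Storage transfers, a data path and a destination table template can embed {run_date}, which resolves to the date of the transfer run. The values in the following snippet are illustrative placeholders that reuse the assumed parameter keys from the earlier sketch.

```python
# {run_date} resolves to the run date (YYYYMMDD), as with Cloud Storage
# transfers; both values below are illustrative placeholders.
params = {
    "data_path": "my-folder/{run_date}/*.csv",
    "destination_table_name_template": "my_table_{run_date}",
}
```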
Wildcard support for the Blob Storage data path
You can select source data that is separated into multiple files by specifying one or more asterisk (*) wildcard characters in the data path.
While more than one wildcard can be used in the data path, some optimization is possible when only a single wildcard is used:
- There is a higher limit on the maximum number of files per transfer run.
- The wildcard will span directory boundaries. For example, the data path my-folder/*.csv will match the file my-folder/my-subfolder/my-file.csv.
Blob Storage data path examples
The following are examples of valid data paths for a Blob Storage transfer. Note that data paths do not begin with /.
Example: Single file
To load a single file from Blob Storage into BigQuery, specify the Blob Storage filename:
my-folder/my-file.csv
Example: All files
To load all files from a Blob Storage container into BigQuery, set the data path to a single wildcard:
*
Example: Files with a common prefix
To load all files from Blob Storage that share a common prefix, specify the common prefix with or without a wildcard:
my-folder/
or
my-folder/*
Example: Files with a similar path
To load all files from Blob Storage with a similar path, specify the common prefix and suffix:
my-folder/*.csv
When you use only a single wildcard, it spans directories. In this example, every CSV file in my-folder is selected, as well as every CSV file in every subfolder of my-folder.
Example: Wildcard at end of path
Consider the following data path:
logs/*
All of the following files are selected:
logs/logs.csv
logs/system/logs.csv
logs/some-application/system_logs.log
logs/logs_2019_12_12.csv
Example: Wildcard at beginning of path
Consider the following data path:
*logs.csv
All of the following files are selected:
logs.csv
system/logs.csv
some-application/logs.csv
And none of the following files are selected:
metadata.csv
system/users.csv
some-application/output.csv
Example: Multiple wildcards
By using multiple wildcards, you gain more control over file selection, at the cost of lower limits. When you use multiple wildcards, each individual wildcard only spans a single subdirectory.
Consider the following data path:
*/*.csv
Both of the following files are selected:
my-folder1/my-file1.csv
my-other-folder2/my-file2.csv
And neither of the following files is selected:
my-folder1/my-subfolder/my-file3.csv
my-other-folder2/my-subfolder/my-file4.csv
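The directory-spanning rules above can be summarized as follows: a single wildcard matches across / characters, while each wildcard in a multi-wildcard path matches within a single path segment. The following Python sketch only illustrates those rules as documented here; it is not the service's actual matching code, and the function name is hypothetical.

```python
import re

def matches_data_path(data_path: str, blob_name: str) -> bool:
    """Illustrative only: apply the documented wildcard rules to a blob name."""
    single_wildcard = data_path.count("*") == 1
    # A lone '*' spans directory boundaries; with two or more wildcards,
    # each '*' stays within one path segment.
    star = ".*" if single_wildcard else "[^/]*"
    pattern = "".join(star if ch == "*" else re.escape(ch) for ch in data_path)
    return re.fullmatch(pattern, blob_name) is not None

# A single wildcard spans subfolders:
assert matches_data_path("my-folder/*.csv", "my-folder/my-subfolder/my-file.csv")

# Multiple wildcards each stay within one directory level:
assert matches_data_path("*/*.csv", "my-folder1/my-file1.csv")
assert not matches_data_path("*/*.csv", "my-folder1/my-subfolder/my-file3.csv")
```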
Shared access signature (SAS)
The Azure SAS token is used to access Blob Storage data on your behalf. Use the following steps to create a SAS token for your transfer:
- Create or use an existing Blob Storage user to access the storage account for your Blob Storage container.
- Create a SAS token at the storage account level. To create a SAS token by using the Azure portal, do the following:
  - For Allowed services, select Blob.
  - For Allowed resource types, select both Container and Object.
  - For Allowed permissions, select Read and List.
  - The default expiration time for SAS tokens is 8 hours. Set an expiration time that works for your transfer schedule.
  - Do not specify any IP addresses in the Allowed IP addresses field.
  - For Allowed protocols, select HTTPS only.
- After the SAS token is created, note the SAS token value that is returned. You need this value when you configure transfers.
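If you prefer to create the token programmatically instead of in the Azure portal, the azure-storage-blob Python library provides generate_account_sas. The sketch below mirrors the settings listed above (Blob service, Container and Object resource types, Read and List permissions); the account name and key are placeholders.

```python
from datetime import datetime, timedelta

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

# Placeholders: substitute your storage account name and key.
sas_token = generate_account_sas(
    account_name="mystorageaccount",
    account_key="<storage-account-key>",
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    # Choose an expiry that covers your transfer schedule.
    expiry=datetime.utcnow() + timedelta(days=30),
)
print(sas_token)
```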
Consistency considerations
It can take up to 10 minutes for a file to become available to the BigQuery Data Transfer Service after it is added to the Blob Storage container. To reduce the possibility of missing data, schedule your Blob Storage transfers to occur at least 10 minutes after your files are added to the container.
Best practices for controlling egress costs
Transfers from Blob Storage could fail if the destination table is not configured properly. Possible causes of an improper configuration include the following:
- The destination table does not exist.
- The table schema is not defined.
- The table schema is not compatible with the data being transferred.
To avoid extra Blob Storage egress costs, first test a transfer with a small but representative subset of files. Ensure that this test is small in both data size and file count.
It's also important to note that prefix matching for data paths happens before files are transferred from Blob Storage, but wildcard matching happens within Google Cloud. This distinction could increase Blob Storage egress costs for files that are transferred to Google Cloud but not loaded into BigQuery.
As an example, consider this data path:
folder/*/subfolder/*.csv
Both of the following files are transferred to Google Cloud, because they have the prefix folder/:
folder/any/subfolder/file1.csv
folder/file2.csv
However, only the folder/any/subfolder/file1.csv file is loaded into BigQuery, because it matches the full data path.
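In other words, only the non-wildcard prefix of the data path can narrow which objects leave Blob Storage; everything after the first wildcard is evaluated only after the objects reach Google Cloud. The helper below is a hypothetical illustration of that split, not part of any Google or Azure API.

```python
def blob_listing_prefix(data_path: str) -> str:
    """Return the part of a data path usable as a Blob Storage prefix filter.

    Everything before the first '*' can narrow the objects that are
    transferred; the rest of the pattern is applied only after transfer.
    """
    return data_path.split("*", 1)[0]

# Every object under folder/ is transferred, even though only objects
# matching folder/*/subfolder/*.csv are ultimately loaded into BigQuery.
assert blob_listing_prefix("folder/*/subfolder/*.csv") == "folder/"
```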
Pricing
For more information, see BigQuery Data Transfer Service pricing.
You can also incur costs outside of Google by using this service. For more information, see Blob Storage pricing.
Quotas and limits
The BigQuery Data Transfer Service uses load jobs to load Blob Storage data into BigQuery. All BigQuery quotas and limits on load jobs apply to recurring Blob Storage transfers, with the following additional considerations:
| Limit | Default |
|---|---|
| Maximum size per load job transfer run | 15 TB |
| Maximum number of files per transfer run when the Blob Storage data path includes 0 or 1 wildcards | 10,000,000 files |
| Maximum number of files per transfer run when the Blob Storage data path includes 2 or more wildcards | 10,000 files |
What's next
- Learn more about setting up a Blob Storage transfer.
- Learn more about runtime parameters in transfers.
- Learn more about the BigQuery Data Transfer Service.