
Overview of Blob Storage transfers

The BigQuery Data Transfer Service for Azure Blob Storage lets you automatically schedule and manage recurring load jobs from Blob Storage into BigQuery.

Supported file formats

The BigQuery Data Transfer Service currently supports loading data from Blob Storage in the following formats:

  • Comma-separated values (CSV)
  • JSON (newline delimited)
  • Avro
  • Parquet
  • ORC

Supported compression types

The BigQuery Data Transfer Service for Blob Storage supports loading compressed data. The compression types supported by the BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.

Transfer prerequisites

To load data from a Blob Storage data source, first gather the following; a configuration sketch that uses these values follows this list:

  • The Blob Storage account name, container name, and optionally a data path for your source data. The data path is used to match common object prefixes and file extensions; if it is omitted, all files in the container are transferred.
  • An Azure shared access signature (SAS) token that grants read access to your data source. For details on creating a SAS token, see Shared access signature (SAS).
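
With these values, you can create the transfer in the Google Cloud console or programmatically. The following is a minimal Python sketch that uses the google-cloud-bigquery-datatransfer client library. The project, dataset, account, and container names are placeholders, and the keys in params are assumptions that mirror the values listed above; check the data source's documented parameters before relying on them.

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder values: replace with your own project, dataset, and Blob Storage details.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="Blob Storage transfer",
    data_source_id="azure_blob_storage",
    params={
        # Assumed parameter keys corresponding to the prerequisites above.
        "destination_table_name_template": "my_table",
        "storage_account": "mystorageaccount",
        "container": "my-container",
        "data_path": "my-folder/*.csv",  # Optional; omit to transfer all files.
        "sas_token": "<SAS token>",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created transfer config: " + transfer_config.name)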

Transfer runtime parameterization

The Blob Storage data path and the destination table can both be parameterized, letting you load data from containers organized by date. The parameters used by Blob Storage transfers are the same as those used by Cloud Storage transfers. For details, see Runtime parameters in transfers.
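
For example, assuming your container is organized into daily folders named with the run date (such as my-folder/20180101/), a parameterized configuration might look like the following. The folder and table names are placeholders; the {run_date} parameter comes from the runtime parameters referenced above.

Data path: my-folder/{run_date}/*.csv
Destination table: my_table_{run_date}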

Wildcard support for the Blob Storage data path

You can select source data that is separated into multiple files by specifying one or more asterisk (*) wildcard characters in the data path.

While more than one wildcard can be used in the data path, a data path with only a single wildcard is treated less restrictively in two ways:

  • There is a higher limit on the maximum number of files per transfer run.
  • The wildcard will span directory boundaries. For example, the data path my-folder/*.csv will match the file my-folder/my-subfolder/my-file.csv.

Blob Storage data path examples

The following are examples of valid data paths for a Blob Storage transfer. Note that data paths do not begin with /.

Example: Single file

To load a single file from Blob Storage into BigQuery, specify the Blob Storage filename:

my-folder/my-file.csv

Example: All files

To load all files from a Blob Storage container into BigQuery, set the data path to a single wildcard:

*

Example: Files with a common prefix

To load all files from Blob Storage that share a common prefix, specify the common prefix with or without a wildcard:

my-folder/

or

my-folder/*

Example: Files with a similar path

To load all files from Blob Storage with a similar path, specify the common prefix and suffix:

my-folder/*.csv

When you use only a single wildcard, it spans directories. In this example, every CSV file in my-folder is selected, as well as every CSV file in every subfolder of my-folder.

Example: Wildcard at end of path

Consider the following data path:

logs/*

All of the following files are selected:

logs/logs.csv
logs/system/logs.csv
logs/some-application/system_logs.log
logs/logs_2019_12_12.csv

Example: Wildcard at beginning of path

Consider the following data path:

*logs.csv

All of the following files are selected:

logs.csv
system/logs.csv
some-application/logs.csv

And none of the following files are selected:

metadata.csv
system/users.csv
some-application/output.csv

Example: Multiple wildcards

By using multiple wildcards, you gain more control over file selection, at the cost of lower limits. When you use multiple wildcards, each individual wildcard only spans a single subdirectory.

Consider the following data path:

*/*.csv

Both of the following files are selected:

my-folder1/my-file1.csv
my-other-folder2/my-file2.csv

And neither of the following files are selected:

my-folder1/my-subfolder/my-file3.csv
my-other-folder2/my-subfolder/my-file4.csv

Shared access signature (SAS)

The Azure SAS token is used to access Blob Storage data on your behalf. Use the following steps to create a SAS token for your transfer in the Azure portal; a programmatic alternative is sketched after these steps:

  1. Create or use an existing Blob Storage user to access the storage account for your Blob Storage container.
  2. Create a SAS token at the storage account level. To create a SAS token in the Azure portal, do the following:

    1. For Allowed services, select Blob.
    2. For Allowed resource types, select both Container and Object.
    3. For Allowed permissions, select Read and List.
    4. The default expiration time for SAS tokens is 8 hours. Set an expiration time that works for your transfer schedule.
    5. Do not specify any IP addresses in the Allowed IP addresses field.
    6. For Allowed protocols, select HTTPS only.

  3. After the SAS token is created, note the SAS token value that is returned. You need this value when you configure transfers.
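
If you prefer to create the account-level SAS token programmatically rather than in the Azure portal, the following is a minimal Python sketch that uses the azure-storage-blob library. The account name is a placeholder, you need the storage account key, and the expiry shown is only an example; choose one that works for your transfer schedule, as described above.

from datetime import datetime, timedelta

from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

# Placeholder values: replace with your storage account name and key.
sas_token = generate_account_sas(
    account_name="mystorageaccount",
    account_key="<storage account key>",
    # Container and Object resource types with Read and List permissions,
    # matching the portal settings described in the steps above.
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    # Example expiry; pick a value that covers your transfer schedule.
    expiry=datetime.utcnow() + timedelta(days=30),
)

print(sas_token)  # Use this value when you configure the transfer.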

Consistency considerations

It can take up to 10 minutes for a file to become available to the BigQuery Data Transfer Service after it is added to the Blob Storage container. To reduce the possibility of missing data, schedule your Blob Storage transfers to run at least 10 minutes after your files are added to the container.
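
For example, if your files typically finish landing in the container by 02:00 UTC each day, a daily schedule such as the following hypothetical value (written in the Data Transfer Service schedule format) leaves more than the recommended 10-minute buffer:

every day 02:15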

Best practices for controlling egress costs

Transfers from Blob Storage can fail if the destination table is not configured properly, and a failed transfer still incurs Azure egress charges for any data that was already read. Possible causes of an improper configuration include the following:

  • The destination table does not exist.
  • The table schema is not defined.
  • The table schema is not compatible with the data being transferred.

To avoid extra Blob Storage egress costs, first test a transfer with a small but representative subset of files. Ensure that this test is small in both data size and file count.

Also note that prefix matching for data paths happens in Blob Storage, before files are transferred, while wildcard matching happens within Google Cloud, after the transfer. This distinction can increase Blob Storage egress costs for files that are transferred to Google Cloud but never loaded into BigQuery.

As an example, consider this data path:

folder/*/subfolder/*.csv

Both of the following files are transferred to Google Cloud, because they have the prefix folder/:

folder/any/subfolder/file1.csv
folder/file2.csv

However, only the folder/any/subfolder/file1.csv file is loaded into BigQuery, because it matches the full data path.

Pricing

For more information, see BigQuery Data Transfer Service pricing.

You can also incur costs outside of Google by using this service. For more information, see Blob Storage pricing.

Quotas and limits

The BigQuery Data Transfer Service uses load jobs to load Blob Storage data into BigQuery. All BigQuery quotas and limits on load jobs apply to recurring Blob Storage transfers, with the following additional considerations:

  • Maximum size per load job transfer run: 15 TB
  • Maximum number of files per transfer run when the Blob Storage data path includes 0 or 1 wildcards: 10,000,000 files
  • Maximum number of files per transfer run when the Blob Storage data path includes 2 or more wildcards: 10,000 files
