Introduction to Cloud Storage transfers

The BigQuery Data Transfer Service for Cloud Storage lets you schedule recurring data loads from Cloud Storage buckets to BigQuery. The path to the data stored in Cloud Storage and the destination table can both be parameterized, allowing you to load data from Cloud Storage buckets organized by date.

Supported file formats

The BigQuery Data Transfer Service currently supports loading data from Cloud Storage in one of the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC

Supported compression types

The BigQuery Data Transfer Service for Cloud Storage supports loading compressed data. The compression types supported by BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.

Data ingestion for Cloud Storage transfers

You can specify how data is loaded into BigQuery by selecting a Write Preference in the transfer configuration when you set up a Cloud Storage transfer.

There are two types of write preferences available: incremental transfers and truncated transfers.
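
The following sketch shows one way to set the write preference when creating a Cloud Storage transfer configuration with the Python client library. The project, dataset, bucket path, and schedule are placeholders, and the parameter names used here (for example, data_path_template and write_disposition) are assumptions that you should verify against the transfer configuration reference.

```python
from google.cloud import bigquery_datatransfer

# Minimal sketch of a Cloud Storage transfer configuration. All identifiers
# and parameter values below are placeholders.
transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",      # assumed destination dataset
    display_name="Daily Cloud Storage load",
    data_source_id="google_cloud_storage",    # Cloud Storage data source
    params={
        # The data path and destination table can both be parameterized.
        "data_path_template": "gs://mybucket/sales/*.csv",
        "destination_table_name_template": "sales_{run_date}",
        "file_format": "CSV",
        # Write preference: "APPEND" (incremental) or "MIRROR" (truncated).
        "write_disposition": "APPEND",
    },
    schedule="every day 03:00",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```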

Incremental transfers

A transfer configuration with an APPEND or WRITE_APPEND write preference, also called an incremental transfer, incrementally appends new data since the previous successful transfer to a BigQuery destination table. When a transfer configuration runs with an APPEND write preference, the BigQuery Data Transfer Service filters for files that have been modified since the previous successful transfer run. To determine when a file was modified, it looks at the file metadata for a "last modified time" property; for Cloud Storage files, this is the updated timestamp property. If the BigQuery Data Transfer Service finds any files whose "last modified time" is later than the timestamp of the last successful transfer run, it transfers those files in an incremental transfer.
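
The following sketch illustrates that filtering behavior conceptually with the Cloud Storage Python client. It is not the service's actual implementation; the bucket name, prefix, and timestamp are placeholders.

```python
from datetime import datetime, timezone

from google.cloud import storage

# Conceptual sketch: pick out only the objects whose "updated" timestamp is
# later than the last successful transfer run.
last_successful_run = datetime(2023, 7, 1, 3, 0, tzinfo=timezone.utc)

client = storage.Client()
blobs = client.list_blobs("mybucket", prefix="sales/")  # placeholder bucket and prefix

new_or_modified = [blob.name for blob in blobs if blob.updated > last_successful_run]
print(new_or_modified)  # the files an incremental run would transfer
```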

To demonstrate how incremental transfers work, consider the following Cloud Storage transfer example. A user creates a file in a Cloud Storage bucket at time 2023-07-01T00:00Z named file_1. The updated timestamp for file_1 is the time that the file was created. The user then creates an incremental transfer from the Cloud Storage bucket, scheduled to run once daily at time 03:00Z, starting from 2023-07-01T03:00Z.

  • At 2023-07-01T03:00Z, the first transfer run starts. As this is the first transfer run for this configuration, BigQuery Data Transfer Service attempts to load all files matching the source URI into the destination BigQuery table. The transfer run succeeds and BigQuery Data Transfer Service successfully loads file_1 into the destination BigQuery table.
  • The next transfer run, at 2023-07-02T03:00Z, detects no files with an updated timestamp later than the timestamp of the last successful transfer run (2023-07-01T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows how the BigQuery Data Transfer Service inspects the updated timestamp property of the source files to determine whether any changes were made, and transfers only the files it detects as changed.

Following the same example, suppose that the user then creates another file in the Cloud Storage bucket at time 2023-07-03T00:00Z, named file_2. The updated timestamp for file_2 is the time that the file was created.

  • The next transfer run, at 2023-07-03T03:00Z, detects that file_2 has an updated timestamp later than the timestamp of the last successful transfer run (2023-07-01T03:00Z). Suppose that this transfer run fails due to a transient error. In this scenario, file_2 is not loaded into the destination BigQuery table, and the last successful transfer run timestamp remains 2023-07-01T03:00Z.
  • The next transfer run, at 2023-07-04T03:00Z, detects that file_2 has an updated timestamp later than the timestamp of the last successful transfer run (2023-07-01T03:00Z). This time, the transfer run completes without issue and successfully loads file_2 into the destination BigQuery table.
  • The next transfer run, at 2023-07-05T03:00Z, detects no files with an updated timestamp later than the timestamp of the last successful transfer run (2023-07-04T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows that when a transfer run fails, no files are transferred to the BigQuery destination table; any file changes are transferred by the next successful transfer run. Successful transfers that follow a failed transfer do not cause duplicate data. In the case of a failed transfer, you can also choose to manually trigger a transfer outside its regularly scheduled time.
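
The following sketch shows how such a manual run might be triggered with the Python client library. The transfer configuration name is a placeholder.

```python
import time

from google.cloud import bigquery_datatransfer
from google.protobuf import timestamp_pb2

# Sketch: manually trigger a transfer run outside its schedule, for example
# after a scheduled run fails. The transfer config name is a placeholder.
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
transfer_config_name = "projects/1234/locations/us/transferConfigs/abcd1234"

now = time.time()
requested_run_time = timestamp_pb2.Timestamp(
    seconds=int(now), nanos=int((now - int(now)) * 1e9)
)

response = transfer_client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config_name,
        requested_run_time=requested_run_time,
    )
)
for run in response.runs:
    print(f"Started run: {run.name}")
```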

Truncated transfers

A transfer configuration with a MIRROR or WRITE_TRUNCATE write preference, also called a truncated transfer, overwrites data in the BigQuery destination table during each transfer run with data from all files matching the source URI. MIRROR replaces the data in the destination table with a fresh copy of the source data. If the destination table uses a partition decorator, the transfer run only overwrites data in the specified partition. A destination table with a partition decorator has the format my_table${run_date}, for example my_table$20230809.
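
The following sketch approximates the effect of a truncated transfer on a partitioned destination table by using an equivalent standalone load job with the BigQuery Python client. The project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

# Sketch: a WRITE_TRUNCATE load against a partition decorator overwrites only
# that partition, which mirrors what a MIRROR transfer run does. All names
# below are placeholders.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://mybucket/sales/2023-08-09/*.csv",
    "my-project.my_dataset.my_table$20230809",  # partition decorator
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```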

Running the same incremental or truncated transfer more than once in a day does not cause duplicate data. However, running multiple different transfer configurations that affect the same BigQuery destination table can cause the BigQuery Data Transfer Service to duplicate data.

Cloud Storage resource path

To load data from a Cloud Storage data source, you must provide the path to the data.

The Cloud Storage resource path contains your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the resource path would be gs://mybucket/myfile.csv.

BigQuery does not support Cloud Storage resource paths that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following resource path, though valid in Cloud Storage, does not work in BigQuery: gs://bucket/my//object//name.
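
As an illustration only, the following snippet shows why such a path stops naming the original object once the consecutive slashes are collapsed.

```python
import re

# Illustration: collapsing consecutive slashes (while keeping the gs:// prefix
# intact) yields a different object name, so the original object is not found.
original = "gs://bucket/my//object//name"
normalized = re.sub(r"(?<!:)/{2,}", "/", original)
print(normalized)  # gs://bucket/my/object/name
```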

To retrieve the Cloud Storage resource path:

  1. Open the Cloud Storage console.

  2. Browse to the location of the object (file) that contains the source data.

  3. Click on the name of the object.

    The Object details page opens.

  4. Copy the value provided in the gsutil URI field, which begins with gs://.
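
Alternatively, you can build resource paths programmatically with the Cloud Storage Python client, as in the following sketch. The bucket name and prefix are placeholders.

```python
from google.cloud import storage

# Sketch: construct gs:// resource paths for objects under a given prefix
# instead of copying them from the console.
client = storage.Client()
for blob in client.list_blobs("mybucket", prefix="sales/"):
    print(f"gs://{blob.bucket.name}/{blob.name}")
```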

Wildcard support for Cloud Storage resource paths

If your Cloud Storage data is separated into multiple files that share a common base name, you can use a wildcard in the resource path when you load the data.

To add a wildcard to the Cloud Storage resource path, you append an asterisk (*) to the base name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the resource path would be gs://mybucket/fed-sample*. You can then use this wildcard path in the Google Cloud console or the Google Cloud CLI.

You can use multiple wildcards for objects (filenames) within buckets. The wildcard can appear anywhere inside the object name.

Wildcards do not expand into subdirectories of a gs://bucket/ path. For example, gs://bucket/dir/* finds files in the directory dir but doesn't find files in the subdirectory gs://bucket/dir/subdir/.

You also cannot match on prefixes without wildcards. For example, gs://bucket/dir matches neither gs://bucket/dir/file.csv nor gs://bucket/file.csv.

However, you can combine multiple wildcards to match files in subdirectories. For example, gs://bucket/dir/*/*.csv matches gs://bucket/dir/subdir/file.csv.
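
The following sketch approximates these documented matching rules, treating each asterisk as matching within a single path segment. It is a simplified illustration, not the service's actual matching logic.

```python
import re

# Approximation of the documented wildcard rules: "*" matches within one path
# segment, and a path without a wildcard matches only that exact object.
def matches(resource_path: str, object_uri: str) -> bool:
    pattern = "".join(
        "[^/]*" if ch == "*" else re.escape(ch) for ch in resource_path
    )
    return re.fullmatch(pattern, object_uri) is not None

print(matches("gs://bucket/dir/*", "gs://bucket/dir/file.csv"))               # True
print(matches("gs://bucket/dir/*", "gs://bucket/dir/subdir/file.csv"))        # False
print(matches("gs://bucket/dir", "gs://bucket/dir/file.csv"))                 # False
print(matches("gs://bucket/dir/*/*.csv", "gs://bucket/dir/subdir/file.csv"))  # True
```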

For examples of wildcard support in combination with parameterized table names, see Runtime parameters in transfers.

Location considerations

Your Cloud Storage bucket must be in a region or multi-region that is compatible with the region or multi-region of the destination dataset in BigQuery.

  • If your BigQuery dataset is in a multi-region, the Cloud Storage bucket containing the data you're transferring must be in the same multi-region or in a location that is contained within the multi-region. For example, if your BigQuery dataset is in the `EU` multi-region, the Cloud Storage bucket can be located in the `europe-west1` Belgium region, which is within the EU.
  • If your dataset is in a region, your Cloud Storage bucket must be in the same region. For example, if your dataset is in the `asia-northeast1` Tokyo region, your Cloud Storage bucket cannot be in the `ASIA` multi-region.
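
Before you set up a transfer, you can compare the two locations with the client libraries, as in the following simplified sketch. Multi-region containment rules are more nuanced than this string comparison, and the bucket and dataset names are placeholders.

```python
from google.cloud import bigquery, storage

# Simplified sketch: compare the bucket location with the dataset location.
# A plain string match does not cover multi-region containment (for example,
# an EU dataset with a europe-west1 bucket), so treat a mismatch as a prompt
# to review the location rules rather than as a hard failure.
bucket_location = storage.Client().get_bucket("mybucket").location
dataset_location = bigquery.Client().get_dataset("my-project.my_dataset").location

if bucket_location.upper() == dataset_location.upper():
    print("Bucket and dataset are in the same location.")
else:
    print(f"Check compatibility: bucket is in {bucket_location}, "
          f"dataset is in {dataset_location}.")
```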

For detailed information about transfers and regions, see Dataset locations and transfers.

For more information about Cloud Storage locations, see Bucket locations in the Cloud Storage documentation.

Pricing

  • Standard BigQuery Quotas & limits on load jobs apply.

  • After data is transferred to BigQuery, standard BigQuery storage and query pricing applies.

  • Data is not automatically deleted from your Cloud Storage bucket after it is loaded into BigQuery, unless you specify that the source files should be deleted when you set up the transfer. See Setting up a Cloud Storage transfer.

  • See our transfers Pricing page for details.

Quotas and limits

The BigQuery Data Transfer Service uses load jobs to load Cloud Storage data into BigQuery.

All BigQuery Quotas and limits on load jobs apply to recurring Cloud Storage load jobs, with the following additional considerations:

Value                                      Limit
Maximum size per load job transfer run     15 TB
Maximum number of files per transfer run   10,000 files

What's next