Introduction to Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 lets you automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Supported file formats

The BigQuery Data Transfer Service supports loading data from Amazon S3 in one of the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC

Supported compression types

The BigQuery Data Transfer Service for Amazon S3 supports loading compressed data. The compression types supported by BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.

Amazon S3 prerequisites

To load data from an Amazon S3 data source, you must:

  • Provide the Amazon S3 URI for your source data
  • Have your access key ID
  • Have your secret access key
  • Set, at a minimum, the AWS managed policy AmazonS3ReadOnlyAccess on your Amazon S3 source data

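These prerequisites map directly onto the parameters of a transfer configuration. The following is a minimal sketch of creating an Amazon S3 transfer with the Python client library (google-cloud-bigquery-datatransfer); the project, dataset, and credential values are placeholders, and the parameter names inside params (data_path, access_key_id, secret_access_key, file_format, destination_table_name_template) are assumptions that you should verify against the Amazon S3 data source's parameter reference.

# Minimal sketch: create an Amazon S3 transfer configuration.
# Parameter names in `params` are assumptions; verify them before use.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "my-project"  # hypothetical project

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",  # hypothetical dataset
    display_name="My Amazon S3 transfer",
    data_source_id="amazon_s3",
    params={
        "destination_table_name_template": "my_table",
        "data_path": "s3://my-bucket/my-folder/*.csv",  # Amazon S3 URI
        "access_key_id": "AKIA...",                     # your access key ID
        "secret_access_key": "...",                     # your secret access key
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)
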
Amazon S3 URIs

When you supply the Amazon S3 URI, the path must use the following format: s3://bucket/folder1/folder2/... Only the bucket name is required; folder names are optional. If you specify a URI that includes only the bucket name, all files in the bucket are transferred and loaded into BigQuery.

Amazon S3 transfer runtime parameterization

The Amazon S3 URI and the destination table can both be parameterized, allowing you to load data from Amazon S3 buckets organized by date. Note that the bucket portion of the URI cannot be parameterized. The parameters used by Amazon S3 transfers are the same as those used by Cloud Storage transfers.

For details, see Runtime parameters in transfers.
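
As an illustration of parameterization, the following sketch shows what a parameterized params dict might look like. The templating syntax ({run_time|"%Y%m%d"}) follows the runtime parameters documentation; treat the exact parameter names as assumptions.

# Illustrative only: runtime parameters in the source URI and destination table.
params = {
    # Load only the folder for the run's date; the bucket name itself
    # cannot be parameterized.
    "data_path": 's3://my-bucket/{run_time|"%Y%m%d"}/*.csv',
    # Write each run into a date-suffixed destination table.
    "destination_table_name_template": 'my_table_{run_time|"%Y%m%d"}',
}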

Data ingestion for Amazon S3 transfers

You can specify how data is loaded into BigQuery by selecting a Write Preference in the transfer configuration when you set up an Amazon S3 transfer.

Two write preferences are available: incremental transfers and truncated transfers.

Incremental transfers

A transfer configuration with an APPEND or WRITE_APPEND write preference, also called an incremental transfer, appends data added since the previous successful transfer to the BigQuery destination table. When a transfer configuration runs with an APPEND write preference, the BigQuery Data Transfer Service filters for files that have been modified since the previous successful transfer run. To determine when a file was modified, the BigQuery Data Transfer Service looks at the file metadata for a "last modified time" property; for example, it looks at the updated timestamp property on a Cloud Storage file. If the BigQuery Data Transfer Service finds any files whose "last modified time" is later than the timestamp of the last successful transfer run, it transfers those files in an incremental transfer.
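
To make the selection rule concrete, the following is a conceptual Python illustration of that filter. It is not the service's implementation, and files_to_transfer is a hypothetical helper; it only shows that a file is picked up when its "last modified time" is later than the last successful run.

# Conceptual illustration (not the service's code) of incremental file selection.
from datetime import datetime, timezone

def files_to_transfer(objects, last_successful_run):
    """objects: iterable of (name, last_modified) pairs."""
    return [name for name, last_modified in objects
            if last_modified > last_successful_run]

objects = [
    ("file_1", datetime(2023, 7, 1, tzinfo=timezone.utc)),
    ("file_2", datetime(2023, 7, 3, tzinfo=timezone.utc)),
]
last_successful_run = datetime(2023, 7, 1, 3, tzinfo=timezone.utc)
print(files_to_transfer(objects, last_successful_run))  # ['file_2']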

To demonstrate how incremental transfers work, consider the following Cloud Storage transfer example. A user creates a file in a Cloud Storage bucket at time 2023-07-01T00:00Z named file_1. The updated timestamp for file_1 is the time that the file was created. The user then creates an incremental transfer from the Cloud Storage bucket, scheduled to run once daily at time 03:00Z, starting from 2023-07-01T03:00Z.

  • At 2023-07-01T03:00Z, the first transfer run starts. As this is the first transfer run for this configuration, BigQuery Data Transfer Service attempts to load all files matching the source URI into the destination BigQuery table. The transfer run succeeds and BigQuery Data Transfer Service successfully loads file_1 into the destination BigQuery table.
  • The next transfer run, at 2023-07-02T03:00Z, detects no files with an updated timestamp later than the last successful transfer run (2023-07-01T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows how the BigQuery Data Transfer Service looks at the updated timestamp property of the source file to determine if any changes were made to the source files, and to transfer those changes if any were detected.

Following the same example, suppose that the user then creates another file in the Cloud Storage bucket at time 2023-07-03T00:00Z, named file_2. The updated timestamp for file_2 is the time that the file was created.

  • The next transfer run, at 2023-07-03T03:00Z, detects that file_2 has an updated timestamp later than the last successful transfer run (2023-07-01T03:00Z). Suppose that this transfer run fails due to a transient error. In this scenario, file_2 is not loaded into the destination BigQuery table, and the last successful transfer run timestamp remains 2023-07-01T03:00Z.
  • The next transfer run, at 2023-07-04T03:00Z, detects that file_2 has an updated timestamp later than the last successful transfer run (2023-07-01T03:00Z). This time, the transfer run completes without issue and successfully loads file_2 into the destination BigQuery table.
  • The next transfer run, at 2023-07-05T03:00Z, detects no files with an updated timestamp later than the last successful transfer run (2023-07-04T03:00Z). The transfer run succeeds without loading any additional data into the destination BigQuery table.

The preceding example shows that when a transfer run fails, no files are transferred to the BigQuery destination table. Any file changes are transferred in the next successful transfer run, and subsequent successful transfers following a failed transfer do not cause duplicate data. In the case of a failed transfer, you can also manually trigger a transfer run outside its regularly scheduled time.
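
To trigger a run manually, you can use the Python client library's start_manual_transfer_runs method, as in the following minimal sketch; the transfer configuration name is a hypothetical placeholder.

# Minimal sketch: manually trigger a transfer run outside its schedule.
import datetime
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
transfer_config_name = "projects/my-project/locations/us/transferConfigs/my-config-id"

response = client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config_name,
        # Run now instead of waiting for the next scheduled time.
        requested_run_time=datetime.datetime.now(datetime.timezone.utc),
    )
)
for run in response.runs:
    print("Started run:", run.name)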

Truncated transfers

A transfer configuration with a MIRROR or WRITE_TRUNCATE write preference, also called a truncated transfer, overwrites the data in the BigQuery destination table during each transfer run with the data from all files matching the source URI. MIRROR replaces the data in the destination table with a fresh copy on each run. If the destination table uses a partition decorator, the transfer run only overwrites data in the specified partition. A destination table with a partition decorator has the format my_table${run_date}, for example my_table$20230809.

Repeating the same incremental or truncated transfer in a day does not cause duplicate data. However, running multiple different transfer configurations that write to the same BigQuery destination table can cause the BigQuery Data Transfer Service to duplicate data.
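
For comparison with an incremental configuration, the following sketch shows how a truncated transfer with a partitioned destination table might be expressed as transfer parameters. The write_disposition parameter name and its values are assumptions; verify them against the Amazon S3 data source's parameter reference.

# Illustrative params for a truncated (MIRROR) transfer.
params = {
    "data_path": "s3://my-bucket/my-folder/*.csv",
    # Each run overwrites the destination with all matching files.
    "write_disposition": "MIRROR",
    # With a partition decorator, only the partition for the run date is
    # overwritten (for example, my_table$20230809).
    "destination_table_name_template": "my_table${run_date}",
}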

Wildcard support for Amazon S3 URIs

If your source data is separated into multiple files that share a common base name, you can use a wildcard in the URI when you load the data. A wildcard consists of an asterisk (*) and can be used anywhere in the Amazon S3 URI except the bucket name.

While more than one wildcard can be used in the Amazon S3 URI, some optimization is possible when the Amazon S3 URI specifies only a single wildcard:

  • There is a higher limit on the maximum number of files per transfer run.

  • The wildcard spans directory boundaries. For example, the Amazon S3 URI s3://my-bucket/*.csv matches the file s3://my-bucket/my-folder/my-subfolder/my-file.csv.

Amazon S3 URI examples

Example 1

To load a single file from Amazon S3 into BigQuery, specify the Amazon S3 URI of the file.

s3://my-bucket/my-folder/my-file.csv

Example 2

To load all files from an Amazon S3 bucket into BigQuery, specify only the bucket name, with or without a wildcard.

s3://my-bucket/

or

s3://my-bucket/*

Note that s3://my-bucket* is not a permitted Amazon S3 URI, as a wildcard can't be used in the bucket name.

Example 3

To load all files from Amazon S3 that share a common prefix, specify the common prefix followed by a wildcard.

s3://my-bucket/my-folder/*

Note that, in contrast to loading all files from a top-level Amazon S3 bucket, the wildcard must appear at the end of the Amazon S3 URI for any files to be loaded.

Example 4

To load all files from Amazon S3 that share both a common prefix and a common file extension, specify the common prefix followed by a wildcard and the extension.

s3://my-bucket/my-folder/*.csv

Note that the wildcard spans directories, so any CSV files in my-folder, as well as in subfolders of my-folder, are loaded into BigQuery.

Example 5

If you have these source files under a logs folder:

s3://my-bucket/logs/logs.csv
s3://my-bucket/logs/system/logs.csv
s3://my-bucket/logs/some-application/system_logs.log
s3://my-bucket/logs/logs_2019_12_12.csv

then the following URI identifies them:

s3://my-bucket/logs/*

Example 6

If you have these source files, but want to transfer only those that have logs.csv as the filename:

s3://my-bucket/logs.csv
s3://my-bucket/metadata.csv
s3://my-bucket/system/logs.csv
s3://my-bucket/system/users.csv
s3://my-bucket/some-application/logs.csv
s3://my-bucket/some-application/output.csv

then the following identifies the files with logs.csv in the name:

s3://my-bucket/*logs.csv

Example 7

By using multiple wildcards, you gain more control over which files are transferred, at the cost of lower limits. When you use multiple wildcards, each wildcard only matches files within a single subdirectory; it does not span directory boundaries. For example, consider the following source files in Amazon S3:

s3://my-bucket/my-folder1/my-file1.csv
s3://my-bucket/my-other-folder2/my-file2.csv
s3://my-bucket/my-folder1/my-subfolder/my-file3.csv
s3://my-bucket/my-other-folder2/my-subfolder/my-file4.csv

To transfer only my-file1.csv and my-file2.csv, use the following value for the Amazon S3 URI:

s3://my-bucket/*/*.csv

Because neither wildcard spans directories, this URI limits the transfer to the CSV files that are directly in my-folder1 and my-other-folder2. Subfolders are not included in the transfer.
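
The matching rules in these examples can be summarized with a small conceptual sketch. This is an illustration only, not the service's matching code; uri_to_regex is a hypothetical helper that treats a single wildcard as spanning directories and each wildcard in a multi-wildcard URI as confined to one directory level.

# Conceptual illustration of wildcard matching semantics.
import re

def uri_to_regex(uri):
    path = uri[len("s3://"):]
    # One wildcard: '*' may cross '/'. Multiple wildcards: each '*' stays
    # within a single path segment.
    star = ".*" if path.count("*") == 1 else "[^/]*"
    pattern = "".join(star if ch == "*" else re.escape(ch) for ch in path)
    return re.compile(pattern + "$")

multi = uri_to_regex("s3://my-bucket/*/*.csv")
print(bool(multi.match("my-bucket/my-folder1/my-file1.csv")))               # True
print(bool(multi.match("my-bucket/my-folder1/my-subfolder/my-file3.csv")))  # False

single = uri_to_regex("s3://my-bucket/*.csv")
print(bool(single.match("my-bucket/my-folder/my-subfolder/my-file.csv")))   # True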

AWS access keys

The access key ID and secret access key are used to access the Amazon S3 data on your behalf. As a best practice, create a unique access key ID and secret access key specifically for Amazon S3 transfers to give minimal access to the BigQuery Data Transfer Service. For information on managing your access keys, see the AWS general reference documentation.
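
As an example of setting up dedicated credentials on the AWS side, the following sketch uses boto3 (assumed to be available) to create a dedicated IAM user, attach the AWS managed AmazonS3ReadOnlyAccess policy, and generate an access key pair. The user name is a hypothetical placeholder; for tighter scoping to a single bucket, attach a custom policy instead.

# Sketch: dedicated IAM user with read-only S3 access for the transfer.
import boto3

iam = boto3.client("iam")
user_name = "bq-dts-s3-transfer"  # hypothetical user name

iam.create_user(UserName=user_name)
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

key = iam.create_access_key(UserName=user_name)["AccessKey"]
print("Access key ID:", key["AccessKeyId"])
# Store the secret access key securely; it is returned only once.
print("Secret access key:", key["SecretAccessKey"])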

Consistency considerations

When you transfer data from Amazon S3, it is possible that some of your data won't be transferred to BigQuery, particularly if the files were added to the bucket very recently. It should take approximately 10 minutes for a file to become available to the BigQuery Data Transfer Service after it is added to the bucket. In some cases, however, it may take longer than 10 minutes.

For more information about the Amazon S3 consistency model, see Amazon S3 data consistency model in the Amazon S3 documentation.

Outbound data transfer costs best practice

Transfers from Amazon S3 could fail if the destination table has not been configured properly. Reasons that could result in an improper configuration include:

  • The destination table does not exist.
  • The table schema is not defined.
  • The table schema is not compatible with the data being transferred.

To avoid Amazon S3 outbound data transfer costs, first test a transfer with a small but representative subset of the files. Small means both a small data size and a small file count.
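
Before testing, you can also run a quick pre-flight check on the destination table. The following sketch (not required by the service) uses the google-cloud-bigquery client to confirm that the table exists and has a schema defined; the table ID is a hypothetical placeholder.

# Sketch: verify the destination table before paying for S3 egress.
from google.cloud import bigquery
from google.api_core.exceptions import NotFound

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical destination table

try:
    table = client.get_table(table_id)
except NotFound:
    raise SystemExit(f"Destination table {table_id} does not exist.")

if not table.schema:
    raise SystemExit(f"Destination table {table_id} has no schema defined.")

print(f"{table_id} exists with {len(table.schema)} columns.")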

Pricing

For information on BigQuery Data Transfer Service pricing, see the Pricing page.

Note that costs can be incurred outside of Google by using this service. Review the Amazon S3 pricing page for details.

Quotas and limits

The BigQuery Data Transfer Service uses load jobs to load Amazon S3 data into BigQuery. All BigQuery Quotas and limits on load jobs apply to recurring Amazon S3 transfers, with the following additional considerations:

  • Maximum size per load job transfer run: 15 TB
  • Maximum number of files per transfer run when the Amazon S3 URI includes 0 or 1 wildcards: 10,000,000 files
  • Maximum number of files per transfer run when the Amazon S3 URI includes more than 1 wildcard: 10,000 files

What's next