Overview of Amazon S3 transfers
The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.
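As a minimal sketch of what such a recurring transfer looks like when created programmatically, the following example uses the Python client library for the BigQuery Data Transfer Service (google-cloud-bigquery-datatransfer). The project, dataset, table, bucket, and credential values are placeholders, and the amazon_s3 parameter keys shown (data_path, destination_table_name_template, file_format, access_key_id, secret_access_key) are assumptions that you should verify against the transfer setup documentation.

```python
# Sketch: schedule a recurring Amazon S3 -> BigQuery transfer with the Python
# client for the BigQuery Data Transfer Service. All names and credentials
# below are placeholders.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",       # existing BigQuery dataset
    display_name="S3 daily load",
    data_source_id="amazon_s3",
    schedule="every 24 hours",
    params={
        "data_path": "s3://my-bucket/my-folder/*.csv",  # Amazon S3 URI; wildcards allowed
        "destination_table_name_template": "my_table",
        "file_format": "CSV",                           # CSV, JSON, AVRO, PARQUET, or ORC
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
    },
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-gcp-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```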
Supported file formats
The BigQuery Data Transfer Service currently supports loading data from Amazon S3 in one of the following formats:
- Comma-separated values (CSV)
- JSON (newline-delimited)
- Avro
- Parquet
- ORC
Supported compression types
The BigQuery Data Transfer Service for Amazon S3 supports loading compressed data. The compression types supported by BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.
Amazon S3 prerequisites
To load data from an Amazon S3 data source, you must:
- Provide the Amazon S3 URI for your source data
- Have your access key ID
- Have your secret access key
- Set, at a minimum, the AWS managed policy AmazonS3ReadOnlyAccess on your Amazon S3 source data (a sketch of attaching this policy follows this list)
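The following sketch shows one way to satisfy the policy prerequisite with the AWS SDK for Python (boto3); the IAM user name is a placeholder, and you can equally attach the policy in the AWS console.

```python
# Sketch: attach the AWS managed AmazonS3ReadOnlyAccess policy to the IAM user
# whose access keys will be supplied to the BigQuery Data Transfer Service.
# "bq-transfer-user" is a placeholder user name.
import boto3

iam = boto3.client("iam")
iam.attach_user_policy(
    UserName="bq-transfer-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```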
Amazon S3 URIs
When you supply the Amazon S3 URI, the path must be in the following format:
s3://bucket/folder1/folder2/...
Only the top-level bucket name is required. Folder names are optional. If you specify a URI that includes only the bucket name, all files in the bucket are transferred and loaded into BigQuery.
Amazon S3 transfer runtime parameterization
The Amazon S3 URI and the destination table can both be parameterized, allowing you to load data from Amazon S3 buckets organized by date. Note that the bucket portion of the URI cannot be parameterized. The parameters used by Amazon S3 transfers are the same as those used by Cloud Storage transfers.
For details, see Runtime parameters in transfers.
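For example, the data path and destination table name in the transfer configuration might use the {run_date} and {run_time|...} runtime parameters. The following params sketch continues the Python client example above; the bucket, folder, and table names are placeholders, and the exact template syntax should be checked against the runtime parameters documentation.

```python
# Illustrative sketch: parameterize the data path and destination table by run date.
# The bucket name itself cannot be parameterized; the bucket, folder, and table
# names here are placeholders.
params = {
    "data_path": "s3://my-bucket/logs/{run_date}/*.csv",
    "destination_table_name_template": 'my_table_{run_time|"%Y%m%d"}',
    "file_format": "CSV",
    "access_key_id": "AWS_ACCESS_KEY_ID",
    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
}
```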
Wildcard support for Amazon S3 URIs
If your source data is separated into multiple files that share a common base name, you can use a wildcard in the URI when you load the data. A wildcard consists of an asterisk (*), and can be used anywhere in the Amazon S3 URI except for the bucket name.
While more than one wildcard can be used in the Amazon S3 URI, some optimization is possible when the Amazon S3 URI specifies only a single wildcard:
- There is a higher limit on the maximum number of files per transfer run.
- The wildcard spans directory boundaries. For example, the Amazon S3 URI s3://my-bucket/*.csv matches the file s3://my-bucket/my-folder/my-subfolder/my-file.csv.
Amazon S3 URI examples
Example 1
To load a single file from Amazon S3 into BigQuery, specify the Amazon S3 URI of the file.
s3://my-bucket/my-folder/my-file.csv
Example 2
To load all files from an Amazon S3 bucket into BigQuery, specify only the bucket name, with or without a wildcard.
s3://my-bucket/
or
s3://my-bucket/*
Note that s3://my-bucket* is not a permitted Amazon S3 URI, as a wildcard can't be used in the bucket name.
Example 3
To load all files from Amazon S3 that share a common prefix, specify the common prefix followed by a wildcard.
s3://my-bucket/my-folder/*
Note that, in contrast to loading all files from a top-level Amazon S3 bucket, the wildcard must be specified at the end of the Amazon S3 URI for any files to be loaded.
Example 4
To load all files from Amazon S3 with a similar path, specify the common prefix followed by a wildcard.
s3://my-bucket/my-folder/*.csv
Note that wildcards span directories, so any CSV files in my-folder, as well as in subfolders of my-folder, will be loaded into BigQuery.
Example 5
If you have these source files under a logs folder:
s3://my-bucket/logs/logs.csv
s3://my-bucket/logs/system/logs.csv
s3://my-bucket/logs/some-application/system_logs.log
s3://my-bucket/logs/logs_2019_12_12.csv
then the following identifies them:
s3://my-bucket/logs/*
Example 6
If you have these source files, but want to transfer only those that have logs.csv as the filename:
s3://my-bucket/logs.csv
s3://my-bucket/metadata.csv
s3://my-bucket/system/logs.csv
s3://my-bucket/system/users.csv
s3://my-bucket/some-application/logs.csv
s3://my-bucket/some-application/output.csv
then the following identifies the files with logs.csv in the name:
s3://my-bucket/*logs.csv
Example 7
By using multiple wildcards, you gain more control over which files are transferred, at the cost of lower limits. When you use multiple wildcards, each wildcard only matches up to the end of a path within a subdirectory; that is, no wildcard spans a directory boundary. For example, consider the following source files in Amazon S3:
s3://my-bucket/my-folder1/my-file1.csv
s3://my-bucket/my-other-folder2/my-file2.csv
s3://my-bucket/my-folder1/my-subfolder/my-file3.csv
s3://my-bucket/my-other-folder2/my-subfolder/my-file4.csv
If the intention is to transfer only my-file1.csv and my-file2.csv, use the following as the value for the Amazon S3 URI:
s3://my-bucket/*/*.csv
As neither wildcard spans directories, this URI would limit the transfer to only the CSV files that are in my-folder1 and my-other-folder2. Subfolders would not be included in the transfer.
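To make the wildcard behavior in the examples above concrete, the following Python snippet is an illustrative model of the rules described here (a single wildcard spans directory boundaries; with multiple wildcards, no wildcard crosses a directory separator). It is not the service's actual implementation.

```python
import re

def matches_s3_uri(uri_pattern: str, object_uri: str) -> bool:
    """Illustrative model of the wildcard rules described above: a single *
    spans directory boundaries, but when a pattern contains more than one *,
    no * crosses a '/'. Not the service's implementation."""
    wildcard_count = uri_pattern.count("*")
    star = ".*" if wildcard_count <= 1 else "[^/]*"
    regex = "".join(star if part == "*" else re.escape(part)
                    for part in re.split(r"(\*)", uri_pattern))
    return re.fullmatch(regex, object_uri) is not None

# Example 7: only files directly inside first-level folders match.
print(matches_s3_uri("s3://my-bucket/*/*.csv",
                     "s3://my-bucket/my-folder1/my-file1.csv"))               # True
print(matches_s3_uri("s3://my-bucket/*/*.csv",
                     "s3://my-bucket/my-folder1/my-subfolder/my-file3.csv"))  # False
# A single wildcard spans directories (Example 4):
print(matches_s3_uri("s3://my-bucket/*.csv",
                     "s3://my-bucket/my-folder/my-subfolder/my-file.csv"))    # True
```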
AWS access keys
The access key ID and secret access key are used to access the Amazon S3 data on your behalf. As a best practice, create a unique access key ID and secret access key specifically for Amazon S3 transfers to give minimal access to the BigQuery Data Transfer Service. For information on managing your access keys, see the AWS general reference documentation.
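As a sketch of this best practice, a dedicated access key pair can be created with boto3 for an existing IAM user that has only read access to the source data; the user name is a placeholder.

```python
# Sketch: create an access key pair for a dedicated IAM user that is used only
# by the BigQuery Data Transfer Service. "bq-transfer-user" is a placeholder.
import boto3

iam = boto3.client("iam")
key = iam.create_access_key(UserName="bq-transfer-user")["AccessKey"]
# Supply these two values when you configure the transfer; store them securely.
access_key_id = key["AccessKeyId"]
secret_access_key = key["SecretAccessKey"]
```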
Consistency considerations
When you transfer data from Amazon S3, it is possible that some of your data will not be transferred to BigQuery, particularly if the files were added to the bucket very recently. A file generally becomes available to the BigQuery Data Transfer Service approximately 10 minutes after it is added to the bucket, but in some cases it can take longer. To reduce the possibility of missing data, schedule your Amazon S3 transfers to occur at least 10 minutes after your files are added to the bucket. For more information on the Amazon S3 consistency model, see Amazon S3 data consistency model in the Amazon S3 documentation.
Egress cost best practice
Transfers from Amazon S3 could fail if the destination table has not been configured properly. Reasons that could result in an improper configuration include:
- The destination table does not exist
- The table schema is not defined
- The table schema is not compatible with the data being transferred
To avoid Amazon S3 egress costs, first test a transfer with a small but representative subset of the files, one that is small in both data size and file count.
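One way to size such a test is to check the byte and file count of a candidate prefix before configuring the transfer. The following boto3 sketch uses placeholder bucket and prefix names.

```python
# Sketch: before configuring the full transfer, check the size and file count of
# the prefix you plan to test with, to keep Amazon S3 egress costs small.
import boto3

s3 = boto3.client("s3")
total_bytes = 0
total_files = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="my-folder/test-subset/"):
    for obj in page.get("Contents", []):
        total_files += 1
        total_bytes += obj["Size"]
print(f"{total_files} files, {total_bytes / 1e9:.2f} GB in the test subset")
```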
Pricing
For information on BigQuery Data Transfer Service pricing, see the Pricing page.
Note that costs can be incurred outside of Google by using this service. Please review the Amazon S3 pricing page for details.
Quotas and limits
The BigQuery Data Transfer Service uses load jobs to load Amazon S3 data into BigQuery. All BigQuery Quotas and limits on load jobs apply to recurring Amazon S3 transfers, with the following additional considerations:
| Value | Limit |
| --- | --- |
| Maximum size per load job transfer run | 15 TB |
| Maximum number of files per transfer run when the Amazon S3 URI includes 0 or 1 wildcards | 10,000,000 files |
| Maximum number of files per transfer run when the Amazon S3 URI includes more than 1 wildcard | 10,000 files |
What's next
- Learn about setting up an Amazon S3 transfer.
- Learn about runtime parameters in S3 transfers.
- Learn more about the BigQuery Data Transfer Service.