Overview of Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.
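
For example, a recurring transfer can be configured programmatically through the BigQuery Data Transfer Service API. The sketch below uses the google-cloud-bigquery-datatransfer Python client; the project, dataset, bucket, and the parameter names shown (data_path, destination_table_name_template, file_format, access_key_id, secret_access_key) are illustrative assumptions that should be verified against the transfer configuration reference.

    from google.cloud import bigquery_datatransfer

    # Hypothetical placeholders -- substitute your own project, dataset, and bucket.
    project_id = "my-project"
    dataset_id = "my_dataset"

    client = bigquery_datatransfer.DataTransferServiceClient()

    # The parameter names below are assumptions for the amazon_s3 data source;
    # verify them against the transfer configuration reference before use.
    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name="Example Amazon S3 transfer",
        data_source_id="amazon_s3",
        params={
            "data_path": "s3://mybucket/folder1/*",
            "destination_table_name_template": "mytable",
            "file_format": "CSV",
            "access_key_id": "YOUR_ACCESS_KEY_ID",
            "secret_access_key": "YOUR_SECRET_ACCESS_KEY",
        },
        schedule="every 24 hours",
    )

    transfer_config = client.create_transfer_config(
        parent=f"projects/{project_id}",
        transfer_config=transfer_config,
    )
    print(f"Created transfer config: {transfer_config.name}")

Once created, the transfer runs on the configured schedule, and each run issues a load job into the destination dataset.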

Supported file formats

The BigQuery Data Transfer Service currently supports loading data from Amazon S3 in one of the following formats:

  • Comma-separated values (CSV)
  • JSON (newline-delimited)
  • Avro
  • Parquet
  • ORC

Supported compression types

The BigQuery Data Transfer Service for Amazon S3 supports loading compressed data. The compression types supported by BigQuery Data Transfer Service are the same as the compression types supported by BigQuery load jobs. For more information, see Loading compressed and uncompressed data.

Amazon S3 prerequisites

To load data from an Amazon S3 data source, you must:

  • Provide the Amazon S3 URI for your source data
  • Have your access key ID
  • Have your secret access key
  • Set, at a minimum, the AWS managed policy AmazonS3ReadOnlyAccess on your Amazon S3 source data

Amazon S3 URIs

When you supply the Amazon S3 URI, the path must be in the following format: s3://bucket/folder1/folder2/... Only the top-level bucket name is required. Folder names are optional. If you specify a URI that includes only the bucket name, all files in the bucket are transferred and loaded into BigQuery.

The Amazon S3 URI and the destination table can both be parameterized, allowing you to load data from Amazon S3 buckets organized by date. Note that currently, the bucket portion of the URI cannot be parameterized. The parameters used by Amazon S3 transfers are the same as those used by Cloud Storage transfers.
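
As an illustration, the sketch below shows date-parameterized transfer parameters and previews how a run date would resolve them. The {run_date} placeholder and the parameter names data_path and destination_table_name_template are assumptions to verify against the runtime parameter documentation.

    from datetime import date

    # Illustrative, date-parameterized transfer parameters. The parameter names
    # and the {run_date} placeholder are assumptions; note that the bucket name
    # itself cannot be parameterized.
    params = {
        "data_path": "s3://mybucket/exports/{run_date}/fed-sample*.csv",
        "destination_table_name_template": "mytable_{run_date}",
        "file_format": "CSV",
    }

    # Preview how a particular run date (formatted as YYYYMMDD) would resolve
    # the templates.
    run_date = date(2024, 1, 1).strftime("%Y%m%d")
    for key, template in params.items():
        print(key, "->", template.replace("{run_date}", run_date))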

Wildcard support for Amazon S3 URIs

If your source data is separated into multiple files that share a common base name, you can use a wildcard in the URI when you load the data.

To add a wildcard to the URI, you append an asterisk (*) to the base name. For example, if you have two files named fed-sample000001.csv and fed-sample000002.csv, the bucket URI would be s3://mybucket/fed-sample*.

You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
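
To preview which objects a wildcarded URI would pick up, you can list the bucket yourself and apply the same pattern locally. The sketch below uses boto3 and fnmatch purely as an illustration; it is not part of the transfer service, and the bucket name and pattern are placeholders.

    import fnmatch
    import boto3

    # Hypothetical bucket and wildcard pattern from the example above.
    bucket = "mybucket"
    pattern = "fed-sample*"

    # List objects under the fixed prefix (the part before the wildcard), then
    # filter locally with the wildcard to approximate what the transfer matches.
    prefix = pattern.split("*", 1)[0]
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

    for obj in response.get("Contents", []):
        if fnmatch.fnmatch(obj["Key"], pattern):
            print(obj["Key"])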

AWS access keys

The access key ID and secret access key are used to access the Amazon S3 data on your behalf. As a best practice, create a unique access key ID and secret access key specifically for Amazon S3 transfers so that the BigQuery Data Transfer Service has only the minimal access it needs. For information on managing your access keys, see the AWS general reference documentation.
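
Before saving the keys in a transfer configuration, it can help to confirm that they actually grant read access to the source data. A minimal check with boto3 (the bucket and prefix below are placeholders) might look like this:

    import boto3

    # Hypothetical dedicated key pair created for the transfer; avoid hard-coding
    # real credentials in source files.
    s3 = boto3.client(
        "s3",
        aws_access_key_id="YOUR_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    )

    # A read-only key pair should be able to list the source objects.
    response = s3.list_objects_v2(Bucket="mybucket", Prefix="folder1/", MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])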

Consistency considerations

When you transfer data from Amazon S3, it is possible that some of your data will not be transferred to BigQuery, particularly if the files were added to the bucket very recently. It generally takes approximately 10 minutes for a file to become available to the BigQuery Data Transfer Service after it is added to the bucket.

In some cases, however, it may take longer than 10 minutes. To reduce the possibility of missing data, schedule your Amazon S3 transfers to occur at least 10 minutes after your files are added to the bucket. For more information on the Amazon S3 consistency model, see Amazon S3 data consistency model in the Amazon S3 documentation.

Pricing

For information on BigQuery Data Transfer Service pricing, see the Pricing page.

Note that costs can be incurred outside of Google by using this service. For details, review the Amazon S3 pricing page.

Quotas and limits

The BigQuery Data Transfer Service uses load jobs to load Amazon S3 data into BigQuery. All BigQuery quotas and limits on load jobs apply to recurring Amazon S3 transfers.

What's next

  • Learn more about the BigQuery Data Transfer Service.