Amazon S3 transfers
The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.
Before you begin
Before you create an Amazon S3 transfer:
- Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
- Create a BigQuery dataset to store your data.
- Create the destination table for your transfer and specify the schema definition. The destination table must follow the table naming rules. Destination table names also support parameters (see the example after this list).
- Retrieve your Amazon S3 URI, your access key ID, and your secret access key. For information on managing your access keys, see the AWS documentation.
- If you intend to set up transfer run notifications for Pub/Sub, you must have the `pubsub.topics.setIamPolicy` permission. Pub/Sub permissions are not required if you only set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
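For example, both destination table names and Amazon S3 URIs can include runtime parameters. The following is a minimal sketch that assumes the standard `run_date` runtime parameter; the bucket, folder, and table names are placeholders.

```
# Hypothetical destination table name; {run_date} resolves to the run date
# in YYYYMMDD format, producing tables such as MyTable_20250101:
MyTable_{run_date}

# Hypothetical Amazon S3 URI that matches date-partitioned folders:
s3://mybucket/myfolder/{run_date}/*.csv
```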
Limitations
Amazon S3 transfers are subject to the following limitations:
- Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
- Transfers from Amazon S3 with the Write disposition parameter set to `WRITE_TRUNCATE` transfer all matching files to Google Cloud during each run. This can result in additional Amazon S3 egress costs. For more information on which files are transferred during a run, see Impact of prefix matching versus wildcard matching.
- Transfers from AWS GovCloud (`us-gov`) regions are not supported.
- Depending on the format of your Amazon S3 source data, additional limitations might apply. For more information, see the limitations for your source file format.
- The minimum interval between recurring transfers is 24 hours. The default interval for a recurring transfer is 24 hours.
Required permissions
Before creating an Amazon S3 transfer:
Ensure that the person creating the transfer has the following required permissions in BigQuery:

- `bigquery.transfers.update` permissions to create the transfer
- Both `bigquery.datasets.get` and `bigquery.datasets.update` permissions on the target dataset

The `bigquery.admin` predefined IAM role includes `bigquery.transfers.update`, `bigquery.datasets.update`, and `bigquery.datasets.get` permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control.

Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy `AmazonS3ReadOnlyAccess` applied to it.
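The `AmazonS3ReadOnlyAccess` managed policy grants read and list access to all buckets. If you prefer a narrower grant, a bucket-scoped policy along the following lines is a sketch of the kind of read access the transfer relies on; `my-bucket` is a placeholder, and you should verify the exact requirements against the AWS documentation.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}
```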
Set up an Amazon S3 data transfer
To create an Amazon S3 data transfer:
Console
Go to the BigQuery page in the Google Cloud console.
Click Transfers.
Click Create a Transfer.
On the Create Transfer page:
In the Source type section, for Source, choose Amazon S3.
In the Transfer config name section, for Display name, enter a name for the transfer such as `My Transfer`. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.
For Repeats, choose an option for how often to run the transfer. Options include:
- Daily (default)
- Weekly
- Monthly
- Custom
- On-demand
If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.
For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.
In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.
In the Data source details section:
- For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
- For Amazon S3 URI, enter the URI in the following format: `s3://mybucket/myfolder/...`. URIs also support parameters.
- For Access key ID, enter your access key ID.
- For Secret access key, enter your secret access key.
- For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.
For Write Disposition, choose:
- WRITE_APPEND to append new data to your existing destination table. BigQuery load jobs are triggered with the `WRITE_APPEND` preference. Each run transfers only the files that have been modified since the previous successful run. The default value for Write preference is `WRITE_APPEND`.
- WRITE_TRUNCATE to overwrite the data in your existing destination table. BigQuery load jobs are triggered with the `WRITE_TRUNCATE` preference. Each run transfers all matching files, including files that were transferred in a previous run.
- For more information, see the `writeDisposition` field in `JobConfigurationLoad`.
In the Transfer options - all formats section:
- For Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.
- (Optional) For Decimal target types, enter a comma-separated
list of possible SQL data types that the source decimal values could
be converted to. Which SQL data type is selected for conversion
depends on the following conditions:
- The data type selected for conversion will be the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING.
- If none of the listed data types will support the precision and the scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when reading the source data, an error will be thrown.
- The data type STRING supports all precision and scale values.
- If this field is left empty, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
- This field cannot contain duplicate data types.
- The order of the data types that you list in this field is ignored.
If you chose CSV or JSON as your file format, in the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.
If you chose CSV as your file format, in the CSV section enter any additional CSV options for loading data.
(Optional) In the Notification options section:
- Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
- For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Pub/Sub run notifications for your transfer.
Click Save.
bq
Enter the `bq mk` command and supply the transfer creation flag, `--transfer_config`.
bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'
Where:
- project_id: Optional. Your Google Cloud project ID. If `--project_id` isn't supplied to specify a particular project, the default project is used.
- data_source: Required. The data source: `amazon_s3`.
- display_name: Required. The display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
- dataset: Required. The target dataset for the transfer configuration.
- parameters: Required. The parameters for the created transfer configuration in JSON format. For example: `--params='{"param":"param_value"}'`. The following are the parameters for an Amazon S3 transfer:
  - destination_table_name_template: Required. The name of your destination table.
  - data_path: Required. The Amazon S3 URI, in the following format: `s3://mybucket/myfolder/...`. URIs also support parameters.
  - access_key_id: Required. Your access key ID.
  - secret_access_key: Required. Your secret access key.
  - file_format: Optional. Indicates the type of files you wish to transfer: `CSV`, `JSON`, `AVRO`, `PARQUET`, or `ORC`. The default value is `CSV`.
  - write_disposition: Optional. `WRITE_APPEND` transfers only the files that have been modified since the previous successful run. `WRITE_TRUNCATE` transfers all matching files, including files that were transferred in a previous run. The default is `WRITE_APPEND`.
  - max_bad_records: Optional. The number of allowed bad records. The default is `0`.
  - decimal_target_types: Optional. A comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the data type defaults to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
  - ignore_unknown_values: Optional, and ignored if file_format is not `JSON` or `CSV`. Whether to ignore unknown values in your data.
  - field_delimiter: Optional, and applies only when file_format is `CSV`. The character that separates fields. The default value is a comma.
  - skip_leading_rows: Optional, and applies only when file_format is `CSV`. Indicates the number of header rows you don't want to import. The default value is `0`.
  - allow_quoted_newlines: Optional, and applies only when file_format is `CSV`. Indicates whether to allow newlines within quoted fields.
  - allow_jagged_rows: Optional, and applies only when file_format is `CSV`. Indicates whether to accept rows that are missing trailing optional columns. The missing values are filled in with NULLs.
For example, the following command creates an Amazon S3 transfer named `My Transfer` using a `data_path` value of `s3://mybucket/myfile/*.csv`, a target dataset of `mydataset`, and a `file_format` of `CSV`. This example includes non-default values for the optional params associated with the `CSV` file_format. The transfer is created in the default project:
bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path":"s3://mybucket/myfile/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"write_disposition":"WRITE_APPEND",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false"}' \
--data_source=amazon_s3
After running the command, you receive a message like the following:
[URL omitted] Please copy and paste the above URL into your web browser and
follow the instructions to retrieve an authentication code.
Follow the instructions and paste the authentication code on the command line.
API
Use the `projects.locations.transferConfigs.create` method and supply an instance of the `TransferConfig` resource.
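For example, a minimal request body might look like the following sketch. The field names come from the `TransferConfig` REST resource; the dataset, table, bucket path, and credential values are placeholders, and the parent project is supplied in the request URL rather than in the body.

```json
{
  "displayName": "My Transfer",
  "dataSourceId": "amazon_s3",
  "destinationDatasetId": "mydataset",
  "schedule": "every 24 hours",
  "params": {
    "destination_table_name_template": "MyTable",
    "data_path": "s3://mybucket/myfolder/*.csv",
    "access_key_id": "YOUR_ACCESS_KEY_ID",
    "secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    "file_format": "CSV"
  }
}
```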
Java
Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
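The following is a minimal sketch of creating an Amazon S3 transfer configuration with the Java client library for the BigQuery Data Transfer Service. The project ID, dataset, table, bucket path, and AWS credential values are placeholders; adapt them to your environment.

```java
import com.google.cloud.bigquery.datatransfer.v1.CreateTransferConfigRequest;
import com.google.cloud.bigquery.datatransfer.v1.DataTransferServiceClient;
import com.google.cloud.bigquery.datatransfer.v1.ProjectName;
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateAmazonS3Transfer {
  public static void main(String[] args) throws IOException {
    // Placeholder values: replace with your own project, dataset, and AWS credentials.
    String projectId = "MY_PROJECT_ID";
    String datasetId = "MY_DATASET_ID";

    // Transfer parameters mirror the --params keys described in the bq section above.
    Map<String, Value> params = new HashMap<>();
    params.put("destination_table_name_template", Value.newBuilder().setStringValue("MyTable").build());
    params.put("data_path", Value.newBuilder().setStringValue("s3://mybucket/myfolder/*.csv").build());
    params.put("access_key_id", Value.newBuilder().setStringValue("MY_ACCESS_KEY_ID").build());
    params.put("secret_access_key", Value.newBuilder().setStringValue("MY_SECRET_ACCESS_KEY").build());
    params.put("file_format", Value.newBuilder().setStringValue("CSV").build());

    TransferConfig transferConfig =
        TransferConfig.newBuilder()
            .setDestinationDatasetId(datasetId)
            .setDisplayName("My S3 Transfer")
            .setDataSourceId("amazon_s3")
            .setParams(Struct.newBuilder().putAllFields(params).build())
            .setSchedule("every 24 hours")
            .build();

    // The client picks up Application Default Credentials.
    try (DataTransferServiceClient client = DataTransferServiceClient.create()) {
      CreateTransferConfigRequest request =
          CreateTransferConfigRequest.newBuilder()
              .setParent(ProjectName.of(projectId).toString())
              .setTransferConfig(transferConfig)
              .build();
      TransferConfig created = client.createTransferConfig(request);
      System.out.println("Amazon S3 transfer created: " + created.getName());
    }
  }
}
```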
Impact of prefix matching versus wildcard matching
The Amazon S3 API supports prefix matching, but not wildcard matching. All Amazon S3 files that match a prefix will be transferred into Google Cloud. However, only those that match the Amazon S3 URI in the transfer configuration will actually get loaded into BigQuery. This could result in excess Amazon S3 egress costs for files that are transferred but not loaded into BigQuery.
As an example, consider this data path:
s3://bucket/folder/*/subfolder/*.csv
Along with these files in the source location:
s3://bucket/folder/any/subfolder/file1.csv
s3://bucket/folder/file2.csv
This results in all Amazon S3 files with the prefix `s3://bucket/folder/` being transferred to Google Cloud. In this example, both `file1.csv` and `file2.csv` will be transferred.

However, only files matching `s3://bucket/folder/*/subfolder/*.csv` will actually be loaded into BigQuery. In this example, only `file1.csv` will be loaded into BigQuery.
Troubleshooting
The following provides information about common errors and the recommended resolutions.
Amazon S3 PERMISSION_DENIED errors
Error | Recommended action |
---|---|
The AWS Access Key Id you provided does not exist in our records. | Ensure the access key exists and the ID is correct. |
The request signature we calculated does not match the signature you provided. Check your key and signing method. | Ensure that the transfer configuration has the correct corresponding Secret Access Key. |
Failed to obtain the location of the source S3 bucket. Additional details: Access Denied Failed to obtain the location of the source S3 bucket. Additional details: HTTP/1.1 403 Forbidden S3 error message: Access Denied | Ensure that the AWS IAM user has permission to list the source bucket, get the bucket's location, and get objects in the bucket, for example through the AmazonS3ReadOnlyAccess managed policy. |
Server unable to initialize object upload.; InvalidObjectState: The operation is not valid for the object's storage class Failed to obtain the location of the source S3 bucket. Additional details: All access to this object has been disabled | Restore any objects that are archived to Amazon Glacier. Objects in Amazon S3 that are archived to Amazon Glacier are not accessible until they are restored. |
All access to this object has been disabled | Confirm that the Amazon S3 URI in the transfer configuration is correct. |
Amazon S3 transfer limit errors
Error | Recommended action |
---|---|
Number of files in transfer exceeds limit of 10000. | Evaluate if the number of wildcards in the Amazon S3 URI can be reduced to just one. If this is possible, retry with a new transfer configuration, as the maximum number of files per transfer run will be higher. Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data. |
Size of files in transfer exceeds limit of 16492674416640 bytes. | Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data. |
General issues
Error | Recommended action |
---|---|
Files are transferred from Amazon S3 but not loaded into BigQuery. The transfer logs may look similar to this: Moving data from Amazon S3 to Google Cloud complete: Moved <NNN> object(s). No new files found matching <Amazon S3 URI>. | Confirm that the Amazon S3 URI in the transfer configuration is correct. If the transfer configuration was meant to load all files with a common prefix, ensure that the Amazon S3 URI ends with a wildcard. For example, to load all files in s3://my-bucket/my-folder/, the Amazon S3 URI in the transfer configuration must be s3://my-bucket/my-folder/*, not just s3://my-bucket/my-folder/. |
Other issues | See Troubleshooting transfer configurations. |
What's next
- For an introduction to Amazon S3 transfers, see Overview of Amazon S3 transfers.
- For an overview of BigQuery Data Transfer Service, see Introduction to BigQuery Data Transfer Service.
- For information on using transfers including getting information about a transfer configuration, listing transfer configurations, and viewing a transfer's run history, see Working with transfers.
- Learn how to load data with cross-cloud operations.