Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Before you begin

Before you create an Amazon S3 transfer:

  • Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
  • If you are creating the transfer by using the classic BigQuery web UI, allow pop-ups in your browser from bigquery.cloud.google.com so that you can view the permissions window. You must grant the BigQuery Data Transfer Service permission to manage your transfer.
  • Create a BigQuery dataset to store your data.
  • Create the destination table for your transfer and specify the schema definition. The destination table must follow the table naming rules. Destination table names also support parameters. (See the example after this list.)
  • Retrieve your Amazon S3 URI, your access key ID, and your secret access key. For information on managing your access keys, see the AWS documentation.
  • If you intend to set up transfer run notifications for Pub/Sub, you must have pubsub.topics.setIamPolicy permissions. Pub/Sub permissions are not required if you only set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
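
A minimal sketch of creating the dataset and destination table with the bq command-line tool; the project, dataset, table name, and schema below are placeholders that you replace with your own values:

# Create the dataset that stores the transferred data.
bq mk --dataset myproject:mydataset

# Create the destination table with an explicit schema definition.
bq mk --table myproject:mydataset.MyTable name:STRING,post_abbr:STRING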

Limitations

Amazon S3 transfers are subject to the following limitations:

  • Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
  • Transfers from Amazon S3 are always triggered with the WRITE_APPEND preference, which appends data to the destination table (see the illustration after this list). See configuration.load.writeDisposition in the load job configuration for additional details.
  • Depending on the format of your Amazon S3 source data, there may be additional limitations. For more information, see the limitations that apply to your source data format.
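
To illustrate the WRITE_APPEND limitation above: each transfer run adds rows to the destination table, much like a load job that runs with the default write disposition. The following hypothetical bq load commands (the dataset, table, and source file are placeholders) show the difference; Amazon S3 transfers always behave like the first command and never like the second:

# Appends rows to the table; WRITE_APPEND is the default for bq load,
# and it is the only behavior available to Amazon S3 transfers.
bq load --source_format=CSV mydataset.MyTable ./data.csv

# Overwrites the table (WRITE_TRUNCATE); transfers never do this.
bq load --replace --source_format=CSV mydataset.MyTable ./data.csv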

Required permissions

Before creating an Amazon S3 transfer:

  • Ensure that the person creating the transfer has the following required permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined Cloud IAM role includes bigquery.transfers.update and bigquery.datasets.update permissions. For more information on Cloud IAM roles in BigQuery Data Transfer Service, see Access control reference.

  • Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it. A command-line sketch of both steps follows this list.
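
As a sketch only: the gcloud command below grants the predefined bigquery.admin role to a hypothetical user, and the aws command attaches the AmazonS3ReadOnlyAccess managed policy to a hypothetical AWS IAM user. Substitute your own project ID, user email, and AWS user name:

# Grant the predefined BigQuery Admin role, which includes the
# bigquery.transfers.update and bigquery.datasets.update permissions.
gcloud projects add-iam-policy-binding myproject \
    --member="user:transfer-admin@example.com" \
    --role="roles/bigquery.admin"

# Attach the AWS managed read-only policy to the IAM user whose
# access key ID and secret access key the transfer uses.
aws iam attach-user-policy \
    --user-name bq-transfer-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess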

Setting up an Amazon S3 data transfer

To create an Amazon S3 data transfer:

Console

  1. Go to the BigQuery web UI in the Cloud Console.

    Go to the Cloud Console

  2. Click Transfers.

  3. Click Create a Transfer.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Amazon S3.

      Transfer source

    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

      Transfer name

    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. Options include:

        • Daily (default)
        • Weekly
        • Monthly
        • Custom
        • On-demand

        If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.

      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.

        Transfer schedule

    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.

      Transfer dataset

    • In the Data source details section:

      • For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
      • For Amazon S3 URI, enter the URI in the following format: s3://mybucket/myfolder/.... URIs also support parameters.
      • For Access key ID, enter your access key ID.
      • For Secret access key, enter your secret access key.
      • For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.

        S3 source details

    • In the Transfer options section, for Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.

      Number of errors allowed

    • If you chose CSV or JSON as your file format, in the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.

      Ignore unknown values

    • If you chose CSV as your file format, in the CSV section, enter any additional CSV options for loading data.

      CSV options

    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Pub/Sub run notifications for your transfer.
  5. Click Save.

Classic UI

  1. Go to the BigQuery web UI.

    Go to the BigQuery web UI

  2. Click Transfers.

  3. Click Add Transfer.

  4. On the New Transfer page:

    • For Source, choose Amazon S3.
    • For Display name, enter a name for the transfer such as My Transfer. The display name can be any value that allows you to easily identify the transfer if you need to modify it later.
    • (Optional) For Schedule, you can leave the default value of Daily (every 24 hours, based on creation time), or click Edit to change the time. You can also change the interval to Weekly, Monthly, or Custom. When selecting Custom, a Cron-like time specification is expected, for example every 12 hours. The shortest allowed period is 12 hours. See the schedule field under TransferConfig for additional valid API values.
    • For Destination dataset, choose the appropriate dataset.
    • For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
    • For Amazon S3 URI, enter the Amazon S3 URI. Wildcards and parameters are supported.
    • For Access key ID, enter your access key ID.
    • For Secret access key, enter your secret access key.
    • For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.
    • Under the Transfer Options - All Formats section:
      • For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
    • If you chose CSV or JSON as your data format, under the Transfer Options - JSON, CSV section:
      • For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
    • If you chose CSV as your data format, under the Transfer Options - CSV section:

      • For Field delimiter, enter the character that separates fields. The default value is a comma.
      • For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
      • For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
      • For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.
    • (Optional) Expand the Advanced section and configure run notifications for your transfer.

    • For Cloud Pub/Sub topic, enter your Cloud Pub/Sub topic name, for example, projects/myproject/topics/mytopic.

    • Check Send email notifications to allow email notifications of transfer run failures.

      Cloud Pub/Sub topic

  5. Click Add.

CLI

Enter the bq mk command and supply the transfer creation flag, --transfer_config.

bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'

Where:

  • project_id: Optional. Your Google Cloud project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • data_source: Required. The data source, which is amazon_s3 for Amazon S3 transfers.
  • display_name: Required. The display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  • dataset: Required. The target dataset for the transfer configuration.
  • parameters: Required. The parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'. The following are the parameters for an Amazon S3 transfer:

    • destination_table_name_template: Required. The name of your destination table.
    • data_path_template: Required. The Amazon S3 URI, in the following format:

      s3://mybucket/myfolder/...

      URIs also support parameters.

    • access_key_id: Required. Your access key ID.

    • secret_access_key: Required. Your secret access key.

    • file_format: Optional. Indicates the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.

    • max_bad_records: Optional. The number of allowed bad records. The default is 0.

    • ignore_unknown_values: Optional, and ignored if file_format is not JSON or CSV. Whether to ignore unknown values in your data.

    • field_delimiter: Optional, and applies only when file_format is CSV. The character that separates fields. The default value is a comma.

    • skip_leading_rows: Optional, and applies only when file_format is CSV. Indicates the number of header rows you don't want to import. The default value is 0.

    • allow_quoted_newlines: Optional, and applies only when file_format is CSV. Indicates whether to allow newlines within quoted fields.

    • allow_jagged_rows: Optional, and applies only when file_format is CSV. Indicates whether to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs.

For example, the following command creates an Amazon S3 transfer named My Transfer using a data_path_template value of s3://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format.

The transfer is created in the default project:

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"s3://mybucket/myfile/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false",
"delete_source_files":"true"}' \
--data_source=amazon_s3

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.

API

Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
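
For illustration, the following curl sketch calls the REST method directly. It assumes the us location and uses placeholder values throughout; the parameter names follow the bq command-line example above:

# Create an Amazon S3 transfer configuration by calling the REST API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://bigquerydatatransfer.googleapis.com/v1/projects/myproject/locations/us/transferConfigs" \
  -d '{
    "displayName": "My Transfer",
    "dataSourceId": "amazon_s3",
    "destinationDatasetId": "mydataset",
    "params": {
      "destination_table_name_template": "MyTable",
      "data_path_template": "s3://mybucket/myfolder/*",
      "access_key_id": "ACCESS_KEY_ID",
      "secret_access_key": "SECRET_ACCESS_KEY",
      "file_format": "CSV"
    }
  }'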

Querying your data

When your data is transferred to BigQuery, the data is written to ingestion-time partitioned tables. For more information, see Introduction to partitioned tables.

If you query your tables directly instead of using the auto-generated views, you must use the _PARTITIONTIME pseudo-column in your query. For more information, see Querying partitioned tables.
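
For example, the following sketch restricts a direct query of the destination table to a single ingestion-time partition by filtering on the _PARTITIONTIME pseudo-column; the dataset, table, and date are placeholders:

# Query only the partition that was loaded on a specific day.
bq query --use_legacy_sql=false '
SELECT *
FROM `mydataset.MyTable`
WHERE _PARTITIONTIME = TIMESTAMP("2024-01-01")'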

Impact of prefix matching vs. wildcard matching

The Amazon S3 API supports prefix matching, but not wildcard matching. All Amazon S3 files that match a prefix will be transferred into Google Cloud. However, only those that match the Amazon S3 URI in the transfer configuration will actually get loaded into BigQuery. This could result in excess Amazon S3 egress costs for files that are transferred but not loaded into BigQuery.

As an example, consider this data path:

s3://bucket/folder/*/subfolder/*.csv

Along with these files in the source location:

s3://bucket/folder/any/subfolder/file1.csv
s3://bucket/folder/file2.csv

This will result in all Amazon S3 files with the prefix s3://bucket/folder/ being transferred to Google Cloud. In this example, both file1.csv and file2.csv will be transferred.

However, only files matching s3://bucket/folder/*/subfolder/*.csv will actually load into BigQuery. In this example, only file1.csv will be loaded into BigQuery.
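
Before running a transfer, you can estimate this impact by listing every object that shares the prefix, because all of them are transferred from Amazon S3 even if only a subset is loaded into BigQuery. A sketch using the AWS CLI with the bucket and prefix from the example above:

# Lists every object under the prefix s3://bucket/folder/. All of these
# objects would be transferred, even though only those matching
# */subfolder/*.csv are loaded into BigQuery.
aws s3 ls s3://bucket/folder/ --recursive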

Troubleshooting

The following sections describe common errors and the recommended resolution for each.

Amazon S3 PERMISSION_DENIED errors

Error: The AWS Access Key Id you provided does not exist in our records.
Recommended action: Ensure that the access key exists and that the ID is correct.

Error: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Recommended action: Ensure that the transfer configuration has the correct corresponding secret access key.

Error: Failed to obtain the location of the source S3 bucket. Additional details: Access Denied
Error: Failed to obtain the location of the source S3 bucket. Additional details: HTTP/1.1 403 Forbidden
Error: S3 error message: Access Denied
Recommended action: Ensure the AWS IAM user has permission to perform the following:
  • List the Amazon S3 bucket.
  • Get the location of the bucket.
  • Read the objects in the bucket.

Error: Server unable to initialize object upload.; InvalidObjectState: The operation is not valid for the object's storage class
Error: Failed to obtain the location of the source S3 bucket. Additional details: All access to this object has been disabled
Recommended action: Restore any objects that are archived to Amazon Glacier. Objects in Amazon S3 that are archived to Amazon Glacier are not accessible until they are restored.

Error: All access to this object has been disabled
Recommended action: Confirm that the Amazon S3 URI in the transfer configuration is correct.

Amazon S3 transfer limit errors

Error: Number of files in transfer exceeds limit of 10000.
Recommended action: Evaluate if the number of wildcards in the Amazon S3 URI can be reduced to just one. If this is possible, retry with a new transfer configuration, as the maximum number of files per transfer run will be higher. Alternatively, evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.

Error: Size of files in transfer exceeds limit of 16492674416640 bytes.
Recommended action: Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.

General issues

Error: Files are transferred from Amazon S3 but not loaded into BigQuery, and the transfer logs look similar to this:

  Moving data from Amazon S3 to Google Cloud complete: Moved <NNN> object(s).
  No new files found matching <Amazon S3 URI>.

Recommended action: Confirm that the Amazon S3 URI in the transfer configuration is correct. If the transfer configuration was meant to load all files with a common prefix, ensure that the Amazon S3 URI ends with a wildcard. For example, to load all files in s3://my-bucket/my-folder/, the Amazon S3 URI in the transfer configuration must be s3://my-bucket/my-folder/*, not just s3://my-bucket/my-folder/.

For other issues, see Troubleshooting transfer configurations.

What's next
