Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Before you begin

Before you create an Amazon S3 transfer:

Limitations

Amazon S3 transfers are subject to the following limitations:

  • Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
  • Transfers from Amazon S3 with the Write disposition parameter set to WRITE_TRUNCATE will transfer all matching files to Google Cloud during each run. This may result in additional Amazon S3 egress costs. For more information on which files are transferred during a run, see Impact of prefix matching versus wildcard matching.
  • Transfers from AWS GovCloud (us-gov) regions are not supported.
  • Depending on the format of your Amazon S3 source data, there may be additional limitations. For more information, see:

  • The minimum interval time between recurring transfers is 24 hours. The default interval for a recurring transfer is 24 hours.

Required permissions

Before creating an Amazon S3 transfer:

  • Ensure that the person creating the transfer has the following required permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control.

  • Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it.

Set up an Amazon S3 data transfer

To create an Amazon S3 data transfer:

Console

  1. Go to the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. Click Transfers.

  3. Click Create a Transfer.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Amazon S3.

      Transfer source

    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

      Transfer name

    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. Options include:

        • Daily (default)
        • Weekly
        • Monthly
        • Custom
        • On-demand

        If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.

      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.

        Transfer schedule

    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.

      Transfer dataset

    • In the Data source details section:

      • For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
      • For Amazon S3 URI, enter the URI in the following format s3://mybucket/myfolder/.... URIs also support parameters.
      • For Access key ID, enter your access key ID.
      • For Secret access key, enter your secret access key.
      • For File format choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.
      • For Write Disposition, choose:

        • WRITE_APPEND to append new data to your existing destination table. BigQuery load jobs are triggered with the WRITE_APPEND preference. Each run will transfer only the files which have been modified since the previous successful run. The default value for Write preference is WRITE_APPEND.
        • WRITE_TRUNCATE to overwrite the data in your existing destination table. BigQuery load jobs are triggered with the WRITE_TRUNCATE preference. Each run will transfer all matching files, including files that were transferred in a previous run.
        • For more information, see the writeDisposition field in JobConfigurationLoad.

        S3 source details

    • In the Transfer options - all formats section:

      • For Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.
      • (Optional) For Decimal target types, enter a comma-separated list of possible SQL data types that the source decimal values could be converted to. Which SQL data type is selected for conversion depends on the following conditions:
        • The data type selected for conversion will be the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING.
        • If none of the listed data types will support the precision and the scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when reading the source data, an error will be thrown.
        • The data type STRING supports all precision and scale values.
        • If this field is left empty, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
        • This field cannot contain duplicate data types.
        • The order of the data types that you list in this field is ignored.

      Transfer options all format

    • If you chose CSV or JSON as your file format, in the JSON,CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.

      Ignore unknown values

    • If you chose CSV as your file format, in the CSV section enter any additional CSV options for loading data.

      CSV options

    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Pub/Sub run notifications for your transfer.
  5. Click Save.

bq

Enter the bq mk command and supply the transfer creation flag — --transfer_config.

bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'

Where:

  • project_id: Optional. Your Google Cloud project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • data_source: Required. The data source — amazon_s3.
  • display_name: Required. The display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  • dataset: Required. The target dataset for the transfer configuration.
  • parameters: Required. The parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'. The following are the parameters for an Amazon S3 transfer:

    • destination_table_name_template: Required. The name of your destination table.
    • data_path: Required. The Amazon S3 URI, in the following format:

      s3://mybucket/myfolder/...

      URIs also support parameters.

    • access_key_id: Required. Your access key ID.

    • secret_access_key: Required. Your secret access key.

    • file_format: Optional. Indicates the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.

    • write_disposition: Optional. WRITE_APPEND will transfer only the files which have been modified since the previous successful run. WRITE_TRUNCATE will transfer all matching files, including files that were transferred in a previous run. The default is WRITE_APPEND.

    • max_bad_records: Optional. The number of allowed bad records. The default is 0.

    • decimal_target_types: Optional. A comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the datatype will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.

    • ignore_unknown_values: Optional, and ignored if file_format is not JSON or CSV. Whether to ignore unknown values in your data.

    • field_delimiter: Optional, and applies only when file_format is CSV. The character that separates fields. The default value is a comma.

    • skip_leading_rows: Optional, and applies only when file_format is CSV. Indicates the number of header rows you don't want to import. The default value is 0.

    • allow_quoted_newlines: Optional, and applies only when file_format is CSV. Indicates whether to allow newlines within quoted fields.

    • allow_jagged_rows: Optional, and applies only when file_format is CSV. Indicates whether to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs.

For example, the following command creates an Amazon S3 transfer named My Transfer using a data_path value of s3://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format.

The transfer is created in the default project:

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path":"s3://mybucket/myfile/*.csv",
"destination_table_name_template":"MyTable",
"file_format":"CSV",
"write_disposition":"WRITE_APPEND",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"|",
"skip_leading_rows":"1",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"false"}' \
--data_source=amazon_s3

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.

API

Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.

Java

Before trying this sample, follow the Java setup instructions in the BigQuery quickstart using client libraries. For more information, see the BigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.rpc.ApiException;
import com.google.cloud.bigquery.datatransfer.v1.CreateTransferConfigRequest;
import com.google.cloud.bigquery.datatransfer.v1.DataTransferServiceClient;
import com.google.cloud.bigquery.datatransfer.v1.ProjectName;
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sample to create amazon s3 transfer config.
public class CreateAmazonS3Transfer {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    final String projectId = "MY_PROJECT_ID";
    String datasetId = "MY_DATASET_ID";
    String tableId = "MY_TABLE_ID";
    // Amazon S3 Bucket Uri with read role permission
    String sourceUri = "s3://your-bucket-name/*";
    String awsAccessKeyId = "MY_AWS_ACCESS_KEY_ID";
    String awsSecretAccessId = "AWS_SECRET_ACCESS_ID";
    String sourceFormat = "CSV";
    String fieldDelimiter = ",";
    String skipLeadingRows = "1";
    Map<String, Value> params = new HashMap<>();
    params.put(
        "destination_table_name_template", Value.newBuilder().setStringValue(tableId).build());
    params.put("data_path", Value.newBuilder().setStringValue(sourceUri).build());
    params.put("access_key_id", Value.newBuilder().setStringValue(awsAccessKeyId).build());
    params.put("secret_access_key", Value.newBuilder().setStringValue(awsSecretAccessId).build());
    params.put("source_format", Value.newBuilder().setStringValue(sourceFormat).build());
    params.put("field_delimiter", Value.newBuilder().setStringValue(fieldDelimiter).build());
    params.put("skip_leading_rows", Value.newBuilder().setStringValue(skipLeadingRows).build());
    TransferConfig transferConfig =
        TransferConfig.newBuilder()
            .setDestinationDatasetId(datasetId)
            .setDisplayName("Your Aws S3 Config Name")
            .setDataSourceId("amazon_s3")
            .setParams(Struct.newBuilder().putAllFields(params).build())
            .setSchedule("every 24 hours")
            .build();
    createAmazonS3Transfer(projectId, transferConfig);
  }

  public static void createAmazonS3Transfer(String projectId, TransferConfig transferConfig)
      throws IOException {
    try (DataTransferServiceClient client = DataTransferServiceClient.create()) {
      ProjectName parent = ProjectName.of(projectId);
      CreateTransferConfigRequest request =
          CreateTransferConfigRequest.newBuilder()
              .setParent(parent.toString())
              .setTransferConfig(transferConfig)
              .build();
      TransferConfig config = client.createTransferConfig(request);
      System.out.println("Amazon s3 transfer created successfully :" + config.getName());
    } catch (ApiException ex) {
      System.out.print("Amazon s3 transfer was not created." + ex.toString());
    }
  }
}

Impact of prefix matching versus wildcard matching

The Amazon S3 API supports prefix matching, but not wildcard matching. All Amazon S3 files that match a prefix will be transferred into Google Cloud. However, only those that match the Amazon S3 URI in the transfer configuration will actually get loaded into BigQuery. This could result in excess Amazon S3 egress costs for files that are transferred but not loaded into BigQuery.

As an example, consider this data path:

s3://bucket/folder/*/subfolder/*.csv

Along with these files in the source location:

s3://bucket/folder/any/subfolder/file1.csv
s3://bucket/folder/file2.csv

This will result in all Amazon S3 files with the prefix s3://bucket/folder/ being transferred to Google Cloud. In this example, both file1.csv and file2.csv will be transferred.

However, only files matching s3://bucket/folder/*/subfolder/*.csv will actually load into BigQuery. In this example, only file1.csv will be loaded into BigQuery.

Troubleshooting

The following provides information about common errors and the recommendation resolution.

Amazon S3 PERMISSION_DENIED errors

Error Recommended action
The AWS Access Key Id you provided does not exist in our records. Ensure the access key exists and the ID is correct.
The request signature we calculated does not match the signature you provided. Check your key and signing method. Ensure that the transfer configuration has the correct corresponding Secret Access Key
Failed to obtain the location of the source S3 bucket. Additional details: Access Denied

Failed to obtain the location of the source S3 bucket. Additional details: HTTP/1.1 403 Forbidden

S3 error message: Access Denied
Ensure the AWS IAM user has permission to perform the following:
  • List the Amazon S3 bucket.
  • Get the location of the bucket.
  • Read the objects in the bucket.
Server unable to initialize object upload.; InvalidObjectState: The operation is not valid for the object's storage class

Failed to obtain the location of the source S3 bucket. Additional details: All access to this object has been disabled
Restore any objects that are archived to Amazon Glacier. Objects in Amazon S3 that are archived to Amazon Glacier are not accessible until they are restored
All access to this object has been disabled Confirm that the Amazon S3 URI in the transfer configuration is correct

Amazon S3 transfer limit errors

Error Recommended action
Number of files in transfer exceeds limit of 10000. Evaluate if the number of wildcards in the Amazon S3 URI can be reduced to just one. If this is possible, retry with a new transfer configuration, as the maximum number of files per transfer run will be higher.

Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.
Size of files in transfer exceeds limit of 16492674416640 bytes. Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.

General issues

Error Recommended action
Files are transferred from Amazon S3 but not loaded into BigQuery. The transfer logs may look similar to this:

Moving data from Amazon S3 to Google Cloud complete: Moved <NNN> object(s).
No new files found matching <Amazon S3 URI>.
Confirm that the Amazon S3 URI in the transfer configuration is correct.

If the transfer configuration was meant to load all files with a common prefix, ensure that the Amazon S3 URI ends with a wildcard.
For example, to load all files in s3://my-bucket/my-folder/, the Amazon S3 URI in the transfer configuration must be s3://my-bucket/my-folder/*, not just s3://my-bucket/my-folder/.
Other issues See Troubleshooting transfer configurations.

What's next