Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Before you begin

Before you create an Amazon S3 transfer:

  • Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
  • If you are creating the transfer by using the classic BigQuery web UI, allow pop-ups in your browser from bigquery.cloud.google.com so that you can view the permissions window. You must grant the BigQuery Data Transfer Service permission to manage your transfer.
  • Create a BigQuery dataset to store your data.
  • Create the destination table for your transfer and specify the schema definition. The destination table must follow the table naming rules. Destination table names also support parameters. A minimal sketch of creating the dataset and table programmatically follows this list.
  • Retrieve your Amazon S3 URI, your access key ID, and your secret access key. For information on managing your access keys, see the AWS documentation.
  • If you intend to set up transfer run notifications for Cloud Pub/Sub, you must have pubsub.topics.setIamPolicy permissions. Cloud Pub/Sub permissions are not required if you only set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
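
If you prefer to create the dataset and destination table programmatically, the following is a minimal sketch using the BigQuery Python client library. The project, dataset, table, and schema values are placeholders for illustration.

  # Minimal sketch: create the dataset and the destination table for the transfer.
  # "my-project", "mydataset", "mytable", and the schema fields are placeholders.
  from google.cloud import bigquery

  client = bigquery.Client(project="my-project")

  # Create the dataset that will hold the transferred data.
  dataset = bigquery.Dataset("my-project.mydataset")
  dataset.location = "US"
  client.create_dataset(dataset, exists_ok=True)

  # Create the destination table with an explicit schema definition.
  schema = [
      bigquery.SchemaField("name", "STRING"),
      bigquery.SchemaField("post_abbr", "STRING"),
  ]
  table = bigquery.Table("my-project.mydataset.mytable", schema=schema)
  client.create_table(table, exists_ok=True)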

Limitations

Amazon S3 transfers are subject to the following limitations:

  • Currently, you cannot configure or update Amazon S3 transfers by using the command-line tool.
  • Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
  • Depending on the format of your Amazon S3 source data, there may be additional limitations. For more information, see the limitations that apply to loading data in your chosen file format.

Required permissions

Before creating an Amazon S3 transfer:

  • Ensure that the person creating the transfer has the following required permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined Cloud IAM role includes bigquery.transfers.update and bigquery.datasets.update permissions. For more information on Cloud IAM roles in BigQuery Data Transfer Service, see Access control reference.

  • Consult the documentation for Amazon S3 to ensure that you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it. A minimal sketch of attaching this policy follows.
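
The following is a minimal sketch of attaching that managed policy to the AWS IAM user whose access keys the transfer will use, written with boto3. The user name is a placeholder, and your environment may instead grant read access through a different user, a role, or a bucket policy.

  # Minimal sketch: attach the AWS managed AmazonS3ReadOnlyAccess policy to the
  # IAM user whose access key ID and secret access key the transfer will use.
  # "transfer-user" is a placeholder; your setup may use a different user, role,
  # or bucket policy to grant read access.
  import boto3

  iam = boto3.client("iam")
  iam.attach_user_policy(
      UserName="transfer-user",
      PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
  )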

Setting up an Amazon S3 data transfer

To create an Amazon S3 data transfer:

Console

  1. Go to the BigQuery web UI in the GCP Console.


  2. Click Transfers.

  3. Click Create a Transfer.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Amazon S3.


    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.


    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. Options include:

        • Daily (default)
        • Weekly
        • Monthly
        • Custom
        • On-demand

        If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.

      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.


    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.


    • In the Data source details section:

      • For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
      • For Amazon S3 URI, enter the URI in the following format: s3://mybucket/myfolder/.... URIs also support parameters.
      • For Access key ID, enter your access key ID.
      • For Secret access key, enter your secret access key.
      • For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.


    • In the Transfer options section, for Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.


    • If you chose CSV or JSON as your file format, in the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.


    • If you chose CSV as your file format, in the CSV section, enter any additional CSV options for loading data.


    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Cloud Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Cloud Pub/Sub run notifications for your transfer. A minimal sketch of creating a topic programmatically follows these steps.
  5. Click Save.
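
If you plan to use Cloud Pub/Sub run notifications and want to create the topic programmatically rather than from the transfer page, the following is a minimal sketch using the Cloud Pub/Sub Python client. The project and topic names are placeholders.

  # Minimal sketch: create a Cloud Pub/Sub topic to receive transfer run notifications.
  # "my-project" and "my-transfer-notifications" are placeholders.
  from google.cloud import pubsub_v1

  publisher = pubsub_v1.PublisherClient()
  topic_path = publisher.topic_path("my-project", "my-transfer-notifications")
  topic = publisher.create_topic(request={"name": topic_path})
  print(f"Created topic: {topic.name}")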

Classic UI

  1. Go to the BigQuery web UI.


  2. Click Transfers.

  3. Click Add Transfer.

  4. On the New Transfer page:

    • For Source, choose Amazon S3.
    • For Display name, enter a name for the transfer such as My Transfer. The display name can be any value that allows you to easily identify the transfer if you need to modify it later.
    • (Optional) For Schedule, you can leave the default value of Daily (every 24 hours, based on creation time), or click Edit to change the time. You can also change the interval to Weekly, Monthly, or Custom. If you select Custom, enter a cron-like time specification, for example every 12 hours. The shortest allowed period is 12 hours. See the schedule field under TransferConfig for additional valid API values.
    • For Destination dataset, choose the appropriate dataset.
    • For Destination table, enter the name of your destination table. The destination table must follow the table naming rules. Destination table names also support parameters.
    • For Amazon S3 URI, enter the Amazon S3 URI. Wildcards and parameters are supported.
    • For Access key ID, enter your access key ID.
    • For Secret access key, enter your secret access key.
    • For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.
    • Under the Transfer Options - All Formats section:
      • For Number of errors allowed, enter the maximum number of bad records that BigQuery can ignore when running the job. If the number of bad records exceeds this value, an 'invalid' error is returned in the job result, and the job fails. The default value is 0.
    • If you chose CSV or JSON as your data format, under the Transfer Options - JSON, CSV section:
      • For Ignore unknown values, check the box if you want the transfer to drop data that does not fit the destination table's schema.
    • If you chose CSV as your data format, under the Transfer Options - CSV section:

      • For Field delimiter, enter the character that separates fields. The default value is a comma.
      • For Header rows to skip, enter the number of header rows in the source file(s) if you don't want to import them. The default value is 0.
      • For Allow quoted newlines, check the box if you want to allow newlines within quoted fields.
      • For Allow jagged rows, check the box if you want to allow the transfer of rows with missing NULLABLE columns.
    • (Optional) Expand the Advanced section and configure run notifications for your transfer.

      • For Cloud Pub/Sub topic, enter your Cloud Pub/Sub topic name, for example, projects/myproject/topics/mytopic.
      • Check Send email notifications to allow email notifications of transfer run failures.


  5. Click Add.
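
You can also create the transfer configuration programmatically with the BigQuery Data Transfer Service API. The following is a minimal sketch using the Python client library; the project, dataset, table, bucket, and credential values are placeholders, and the param keys shown for the amazon_s3 data source should be verified against the TransferConfig reference before use.

  # Minimal sketch: create an Amazon S3 transfer configuration with the
  # BigQuery Data Transfer Service Python client. All identifiers and credentials
  # below are placeholders, and the amazon_s3 param keys are assumptions to verify
  # against the TransferConfig reference.
  from google.cloud import bigquery_datatransfer

  client = bigquery_datatransfer.DataTransferServiceClient()

  transfer_config = bigquery_datatransfer.TransferConfig(
      destination_dataset_id="mydataset",
      display_name="My Transfer",
      data_source_id="amazon_s3",
      schedule="every 24 hours",  # cron-like schedule; the shortest custom period is 12 hours
      params={
          "destination_table_name_template": "mytable",
          "data_path": "s3://mybucket/myfolder/*",
          "access_key_id": "YOUR_ACCESS_KEY_ID",
          "secret_access_key": "YOUR_SECRET_ACCESS_KEY",
          "file_format": "CSV",
      },
  )

  transfer_config = client.create_transfer_config(
      parent=client.common_project_path("my-project"),
      transfer_config=transfer_config,
  )
  print(f"Created transfer config: {transfer_config.name}")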

Troubleshooting Amazon S3 transfer setup

If you encounter issues setting up your transfer, see the troubleshooting page.

Querying your data

When your data is transferred to BigQuery, the data is written to ingestion-time partitioned tables. For more information, see Introduction to partitioned tables.

If you query your tables directly instead of using the auto-generated views, you must use the _PARTITIONTIME pseudo-column in your query. For more information, see Querying partitioned tables.
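
For example, the following is a minimal sketch of querying the destination table directly with a filter on the pseudo-column, using the BigQuery Python client. The project, dataset, table, and date are placeholders.

  # Minimal sketch: query the ingestion-time partitioned destination table,
  # filtering on the _PARTITIONTIME pseudo-column. Names and the date are placeholders.
  from google.cloud import bigquery

  client = bigquery.Client()
  query = """
      SELECT *
      FROM `my-project.mydataset.mytable`
      WHERE _PARTITIONTIME = TIMESTAMP("2019-06-01")
  """
  for row in client.query(query).result():
      print(row)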

What's next
