
Load data with cross-cloud operations

As a BigQuery administrator or analyst, you can load data from an Amazon Simple Storage Service (Amazon S3) bucket or from Azure Blob Storage into BigQuery tables. You can then join the transferred data with data that is present in Google Cloud regions, or take advantage of BigQuery features like BigQuery ML.

You can transfer data into BigQuery in the following ways:

  • Transfer data from files in Amazon S3 and Azure Blob Storage into BigQuery tables by using the LOAD DATA statement.

  • Filter data from files in Amazon S3 or Blob Storage before transferring the results into BigQuery tables by using the CREATE TABLE AS SELECT statement. This feature is in preview. The filtering is applied to the external tables that reference data in Amazon S3 or Blob Storage.

Quotas and limits

For information about quotas and limits, see query jobs quotas and limits.

Pricing

You are billed for the bytes that are transferred across clouds by using the LOAD DATA statement. For pricing information, see Data transfer pricing.

You are not billed for any bytes that are transferred by using the CREATE TABLE AS SELECT statement (in preview).

Both the LOAD DATA and CREATE TABLE AS SELECT statements require slots in the BigQuery Omni regions to scan the Amazon S3 and Blob Storage files that they load.

Before you begin

To provide Google Cloud with read access to the files in other clouds, ask your administrator to create a connection and share it with you. For information about how to create connections, see Connect to Amazon S3 or Blob Storage.

Required role

To get the permissions that you need to load data using cross-cloud transfers, ask your administrator to grant you the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on the dataset. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to load data using cross-cloud transfers. The exact permissions that are required are listed in the following section:

Required permissions

The following permissions are required to load data using cross-cloud transfers:

  • bigquery.tables.create
  • bigquery.tables.get
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create
  • bigquery.connections.use

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
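
If access is managed with SQL, a dataset-level grant can be expressed with a GRANT data control language (DCL) statement. The following is a minimal sketch; the project myproject, dataset mydataset, and user email are illustrative placeholders, and your organization might manage access through the console or IAM policies instead:

-- Grants the BigQuery Data Editor role on a dataset.
-- The project, dataset, and user are placeholders for this sketch.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `myproject.mydataset`
TO "user:analyst@example.com";

The bigquery.connections.use permission on the connection is typically granted separately, for example through the BigQuery Connection User (roles/bigquery.connectionUser) role.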

Load data

You can load data into BigQuery with the LOAD DATA [INTO|OVERWRITE] statement.

Limitations

  • Loading data into a destination table that uses ingestion-time partitioning is not supported.
  • LOAD DATA jobs don't run on reservations. Jobs use on-demand slots that are managed by Google Cloud.
  • The connection and the destination dataset must belong to the same project. Loading data across projects is not supported.

Examples

Example 1

The following example loads a Parquet file named sample.parquet from an Amazon S3 bucket into the test_parquet table by using schema auto-detection:

LOAD DATA INTO mydataset.test_parquet
  FROM FILES (
    uris = ['s3://test-bucket/sample.parquet'],
    format = 'PARQUET'
  )
  WITH CONNECTION `aws-us-east-1.test-connection`

Example 2

The following example loads CSV files that match the prefix sampled* from your Blob Storage container into the test_csv table, with an explicit schema and column-based partitioning on the Time column:

LOAD DATA INTO mydataset.test_csv (Number INT64, Name STRING, Time DATE)
  PARTITION BY Time
  FROM FILES (
    format = 'CSV', uris = ['azure://test.blob.core.windows.net/container/sampled*'],
    skip_leading_rows=1
  )
  WITH CONNECTION `azure-eastus2.test-connection`

Example 3

The following example overwrites the existing test_parquet table with data from a file named sample.parquet by using schema auto-detection:

LOAD DATA OVERWRITE mydataset.test_parquet
  FROM FILES (
    uris = ['s3://test-bucket/sample.parquet'],
    format = 'PARQUET'
  )
  WITH CONNECTION `aws-us-east-1.test-connection`
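
The same pattern applies to other supported source formats. The following sketch assumes a newline-delimited JSON file named sample.json in the Blob Storage container from Example 2 and reuses the same connection; the table and file names are illustrative:

-- Sketch: loads a newline-delimited JSON file with an explicit schema.
-- The table, file path, and connection names are illustrative placeholders.
LOAD DATA INTO mydataset.test_json (Number INT64, Name STRING, Time DATE)
  FROM FILES (
    format = 'JSON',
    uris = ['azure://test.blob.core.windows.net/container/sample.json']
  )
  WITH CONNECTION `azure-eastus2.test-connection`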

Filter data

You can filter data before transferring it into BigQuery by using the CREATE TABLE AS SELECT data definition language (DDL) statement.

Limitations

  • You can only transfer data from the aws-us-east-1 and azure-eastus2 BigQuery Omni regions to the us multi-region and us-east4 region.

  • The CREATE TABLE and CREATE TABLE IF NOT EXISTS statements are not supported, but the CREATE OR REPLACE TABLE statement is supported.

  • If the result of the SELECT query exceeds 20 GiB in logical bytes, the query fails. The table is not created and data is not transferred. To learn how to reduce the size of data that is scanned, see Reduce data processed in queries.

  • Temporary tables are not supported.

  • Transferring data in the well-known binary (WKB) geospatial format is not supported.

Example

Suppose you have a BigLake table named mydataset.orders that references data from Amazon S3 or Azure Blob Storage. You want to transfer data from that table to a BigQuery table myotherdataset.shipments in the US multi-region.

First, display information about the mydataset.orders table:

SELECT
  table_name, ddl
FROM
  `myproject`.mydataset.INFORMATION_SCHEMA.TABLES
WHERE
  table_name = 'orders';

The output is similar to the following:

  Last modified             Schema              Type     Total URIs   Expiration
----------------- -------------------------- ---------- ------------ -----------
  31 Oct 17:40:28   |- l_orderkey: integer     EXTERNAL   1
                    |- l_partkey: integer
                    |- l_suppkey: integer
                    |- l_linenumber: integer
                    |- l_returnflag: string
                    |- l_linestatus: string
                    |- l_commitdate: date

Next, display information about the myotherdataset.shipments table:

SELECT
  table_name, ddl
FROM
  `myproject`.myotherdataset.INFORMATION_SCHEMA.TABLES
WHERE
  table_name = 'shipments';

The output is similar to the following. Some columns are omitted to simplify the output.

  Last modified             Schema             Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Total Logical
 ----------------- --------------------------- ------------ ------------- ------------ ------------------- ------------------ ---------------
  31 Oct 17:34:31   |- l_orderkey: integer      3086653      210767042                                                         210767042
                    |- l_partkey: integer
                    |- l_suppkey: integer
                    |- l_commitdate: date
                    |- l_shipdate: date
                    |- l_receiptdate: date
                    |- l_shipinstruct: string
                    |- l_shipmode: string

Now, use the CREATE TABLE AS SELECT statement to selectively load data into the myotherdataset.ordersof92 and myotherdataset.ordersof93 tables in the US multi-region:

CREATE OR REPLACE TABLE
  myotherdataset.ordersof92 AS
SELECT
  *
FROM
  mydataset.orders
WHERE
  l_commitdate >= '1992-01-01'
AND
  l_commitdate <= '1992-12-31';
CREATE OR REPLACE TABLE
  myotherdataset.ordersof93 AS
SELECT
  *
FROM
  mydataset.orders
WHERE
  l_commitdate >= '1993-01-01'
AND
  l_commitdate <= '1993-12-31';

You can then join the newly created tables with the shipments table by using a table wildcard:

SELECT
  orders.l_orderkey,
  orders.l_partkey,
  orders.l_suppkey,
  orders.l_commitdate,
  orders.l_returnflag,
  shipments.l_shipmode,
  shipments.l_shipinstruct
FROM
  myotherdataset.shipments
JOIN
  `myotherdataset.ordersof*` as orders
ON
  orders.l_orderkey = shipments.l_orderkey
AND orders.l_partkey = shipments.l_partkey
AND orders.l_suppkey = shipments.l_suppkey
WHERE orders.l_returnflag = 'R'; -- 'R' means refunded.

Best practices

  • Avoid loading multiple files that are each less than 5 MB. Instead, create an external table over your files and export the query result to Amazon S3 or Blob Storage to create a larger file. This approach helps to improve the transfer time of your data; see the EXPORT DATA sketch after this list.
  • For information about the limit for maximum query result, see BigQuery Omni maximum query result size.
  • If your source data is in a gzip-compressed file, then when you create external tables, set the external_table_options.compression option to GZIP, as shown in the sketch after this list.
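
For the first practice, one way to consolidate small files is to query an external table that references them and export the result back to the bucket as larger files by using the EXPORT DATA statement. The following is a minimal sketch; it assumes an existing table named mydataset.small_files_table that references the small files, reuses the connection from the earlier examples, and uses an illustrative bucket path and format:

-- Sketch: consolidates small source files into larger Parquet files in Amazon S3.
-- The table name, bucket path, and connection are illustrative placeholders.
EXPORT DATA
  WITH CONNECTION `aws-us-east-1.test-connection`
  OPTIONS (
    uri = 's3://test-bucket/consolidated/*',
    format = 'PARQUET'
  )
AS
SELECT * FROM mydataset.small_files_table;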
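
For the compression recommendation, the option is set in the option list when you define the external table. The following is a minimal sketch, assuming a gzip-compressed CSV file in the Amazon S3 bucket from the earlier examples:

-- Sketch: external table over a gzip-compressed CSV file.
-- The table name, file path, and connection are illustrative placeholders.
CREATE EXTERNAL TABLE mydataset.sample_gzip_csv
  WITH CONNECTION `aws-us-east-1.test-connection`
  OPTIONS (
    format = 'CSV',
    uris = ['s3://test-bucket/sample.csv.gz'],
    compression = 'GZIP',
    skip_leading_rows = 1
  );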

What's next