Load data with cross-cloud operations
As a BigQuery administrator or analyst, you can load data from an Amazon Simple Storage Service (Amazon S3) bucket or Azure Blob Storage into BigQuery tables. You can then join the transferred data with the data present in Google Cloud regions or take advantage of BigQuery features such as BigQuery ML.
You can transfer data into BigQuery in the following ways:
- Transfer data from files in Amazon S3 and Azure Blob Storage into BigQuery tables by using the `LOAD DATA` statement.
- Filter data from files in Amazon S3 or Blob Storage before transferring the results into BigQuery tables by using the `CREATE TABLE AS SELECT` statement. To append data to the destination table, use the `INSERT INTO SELECT` statement. Data manipulation is applied to the external tables that reference data from Amazon S3 or Blob Storage.
Quotas and limits
For information about quotas and limits, see query jobs quotas and limits.
Pricing
You are billed for the bytes that are transferred across clouds by using the `LOAD` statement. For pricing information, see the Omni Cross Cloud Data Transfer section in BigQuery Omni pricing.
You are billed for the bytes that are transferred across clouds by using the `CREATE TABLE AS SELECT` statement or the `INSERT INTO SELECT` statement, and for the compute capacity.
Both the `LOAD` and `CREATE TABLE AS SELECT` statements require slots in the BigQuery Omni regions to scan Amazon S3 and Blob Storage files in order to load them. For more information, see BigQuery Omni pricing.
Before you begin
To provide Google Cloud with read access to the files in other clouds, ask your administrator to create a connection and share it with you. For information about how to create connections, see Connect to Amazon S3 or Blob Storage.
Required role
To get the permissions that you need to load data using cross-cloud transfers, ask your administrator to grant you the BigQuery Data Editor (`roles/bigquery.dataEditor`) IAM role on the dataset.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to load data using cross-cloud transfers. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to load data using cross-cloud transfers:
- `bigquery.tables.create`
- `bigquery.tables.get`
- `bigquery.tables.updateData`
- `bigquery.tables.update`
- `bigquery.jobs.create`
- `bigquery.connections.use`
You might also be able to get these permissions with custom roles or other predefined roles.
For more information about IAM roles in BigQuery, see Predefined roles and permissions.
Load data
You can load data into BigQuery with the `LOAD DATA [INTO|OVERWRITE]` statement.
Limitations
- `LOAD DATA` jobs don't run on reservations. Jobs use on-demand slots that are managed by Google Cloud.
- The connection and the destination dataset must belong to the same project. Loading data across projects is not supported.
- `LOAD DATA` is only supported when you transfer data from Amazon Simple Storage Service (Amazon S3) or Azure Blob Storage to a colocated BigQuery region. For more information, see Locations.
  - You can transfer data from any `US` region to a `US` multi-region. You can also transfer from any `EU` region to an `EU` multi-region.
Examples
Example 1
The following example loads a Parquet file named `sample.parquet` from an Amazon S3 bucket into the `test_parquet` table with schema auto-detection:

```sql
LOAD DATA INTO mydataset.test_parquet
  FROM FILES (
    uris = ['s3://test-bucket/sample.parquet'],
    format = 'PARQUET'
  )
  WITH CONNECTION `aws-us-east-1.test-connection`
```
Example 2
The following example loads CSV files that match the prefix `sampled*` from your Blob Storage into the `test_csv` table, with a predefined schema and time partitioning on the `Time` column:

```sql
LOAD DATA INTO mydataset.test_csv (Number INT64, Name STRING, Time DATE)
  PARTITION BY Time
  FROM FILES (
    format = 'CSV',
    uris = ['azure://test.blob.core.windows.net/container/sampled*'],
    skip_leading_rows = 1
  )
  WITH CONNECTION `azure-eastus2.test-connection`
```
Example 3
The following example overwrites the existing table `test_parquet` with data from a file named `sample.parquet`, with schema auto-detection:

```sql
LOAD DATA OVERWRITE mydataset.test_parquet
  FROM FILES (
    uris = ['s3://test-bucket/sample.parquet'],
    format = 'PARQUET'
  )
  WITH CONNECTION `aws-us-east-1.test-connection`
```
Filter data
You can filter data before transferring it into BigQuery by using the `CREATE TABLE AS SELECT` statement and the `INSERT INTO SELECT` statement.
Limitations
- If the result of the `SELECT` query exceeds 60 GiB in logical bytes, the query fails. The table is not created and data is not transferred. To learn how to reduce the size of data that is scanned, see Reduce data processed in queries.
- Temporary tables are not supported.
- Transferring the Well-known binary (WKB) geospatial data format is not supported.
- The `INSERT INTO SELECT` statement does not support transferring data into a clustered table.
- In the `INSERT INTO SELECT` statement, if the destination table is the same as the source table in the `SELECT` query, then the `INSERT INTO SELECT` statement doesn't modify any rows in the destination table. The destination table isn't modified because BigQuery can't read data across regions.
- `CREATE TABLE AS SELECT` and `INSERT INTO SELECT` are only supported when you transfer data from Amazon S3 or Blob Storage to a colocated BigQuery region. For more information, see Locations.
  - You can transfer data from any `US` region to a `US` multi-region. You can also transfer from any `EU` region to an `EU` multi-region.
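For instance, one way to stay under the 60 GiB result limit is to project only the columns you need and filter rows in the `SELECT` query instead of transferring the whole table. The following is a minimal sketch, assuming a hypothetical BigLake table `myawsdataset.events` with `event_date` and `payload` columns:

```sql
-- Hypothetical example: transfer only two columns and one month of rows,
-- which reduces both the bytes scanned in the BigQuery Omni region and
-- the size of the transferred result.
CREATE TABLE mydataset.events_jan AS
SELECT
  event_date,
  payload
FROM
  myawsdataset.events
WHERE
  event_date BETWEEN '2023-01-01' AND '2023-01-31';
```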
Examples
Example 1
Suppose you have a BigLake table named `myawsdataset.orders` that references data from Amazon S3. You want to transfer data from that table into the US multi-region and join it with the BigQuery table `myotherdataset.shipments`.
First, display information about the `myawsdataset.orders` table:

```
bq show myawsdataset.orders
```
The output is similar to the following:
```
  Last modified                     Schema                   Type     Total URIs   Expiration
 ----------------- -------------------------- ---------- ------------ -----------
  31 Oct 17:40:28   |- l_orderkey: integer     EXTERNAL   1
                    |- l_partkey: integer
                    |- l_suppkey: integer
                    |- l_linenumber: integer
                    |- l_returnflag: string
                    |- l_linestatus: string
                    |- l_commitdate: date
```
Next, display information about the `myotherdataset.shipments` table:

```
bq show myotherdataset.shipments
```
The output is similar to the following. Some columns are omitted to simplify the output.
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Total Logical ----------------- --------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------- 31 Oct 17:34:31 |- l_orderkey: integer 3086653 210767042 210767042 |- l_partkey: integer |- l_suppkey: integer |- l_commitdate: date |- l_shipdate: date |- l_receiptdate: date |- l_shipinstruct: string |- l_shipmode: string
Now, using the `CREATE TABLE AS SELECT` statement, you can selectively load data into the `myotherdataset.orders` table in the US multi-region:

```sql
CREATE OR REPLACE TABLE myotherdataset.orders
  PARTITION BY DATE_TRUNC(l_commitdate, YEAR) AS
SELECT
  *
FROM
  myawsdataset.orders
WHERE
  EXTRACT(YEAR FROM l_commitdate) = 1992;
```
You can then perform a join operation with the newly created table:

```sql
SELECT
  orders.l_orderkey,
  orders.l_partkey,
  orders.l_suppkey,
  orders.l_commitdate,
  orders.l_returnflag,
  shipments.l_shipmode,
  shipments.l_shipinstruct
FROM
  myotherdataset.shipments
JOIN
  `myotherdataset.orders` AS orders
ON
  orders.l_orderkey = shipments.l_orderkey
  AND orders.l_partkey = shipments.l_partkey
  AND orders.l_suppkey = shipments.l_suppkey
WHERE orders.l_returnflag = 'R'; -- 'R' means refunded.
```
When new data is available, append the data for the year 1993 to the destination table using the `INSERT INTO SELECT` statement:

```sql
INSERT INTO myotherdataset.orders
SELECT
  *
FROM
  myawsdataset.orders
WHERE
  EXTRACT(YEAR FROM l_commitdate) = 1993;
```
Example 2
The following example inserts data into an ingestion-time partitioned table:

```sql
CREATE TABLE mydataset.orders(id STRING, numeric_id INT64)
PARTITION BY _PARTITIONDATE;
```
After creating a partitioned table, you can insert data into the ingestion-time partitioned table:

```sql
INSERT INTO mydataset.orders(
  _PARTITIONTIME,
  id,
  numeric_id)
SELECT
  TIMESTAMP("2023-01-01"),
  id,
  numeric_id
FROM
  mydataset.ordersof23
WHERE
  numeric_id > 4000000;
```
Best practices
- Avoid loading multiple files that are less than 5 MB. Instead, create an external table for your file and export the query result to Amazon S3 or Blob Storage to create a larger file. This method helps to improve the transfer time of your data. For information about the maximum query result size, see BigQuery Omni maximum query result size.
- If your source data is in a gzip-compressed file, then while creating external tables, set the `external_table_options.compression` option to `GZIP`.
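The compression option is set when you define the external table. The following is a minimal sketch, assuming a hypothetical gzip-compressed set of CSV files in an Amazon S3 bucket and an existing connection named `aws-us-east-1.test-connection`:

```sql
-- Hypothetical example: the table definition declares that the source
-- files are gzip-compressed, so BigQuery decompresses them when reading.
CREATE EXTERNAL TABLE mydataset.ext_logs
WITH CONNECTION `aws-us-east-1.test-connection`
OPTIONS (
  format = 'CSV',
  uris = ['s3://test-bucket/logs/*.csv.gz'],
  compression = 'GZIP'
);
```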
What's next
- Learn about BigQuery ML.
- Learn about BigQuery Omni.
- Learn how to run queries.
- Learn how to set up VPC Service Controls for BigQuery Omni.
- Learn how to schedule and manage recurring load jobs from Amazon S3 into BigQuery and from Blob Storage into BigQuery.