Cross-cloud transfer from Azure

This document explains how to transfer data from Azure Blob Storage into a BigQuery native table in the US multi-region by using the LOAD DATA SQL statement.

BigQuery Omni lets you analyze data stored in Amazon S3 buckets and Azure Blob Storage by using queries and joins. If you want to join data stored in AWS or Azure with data in Google Cloud regions, or if you want to use BigQuery ML capabilities, use the LOAD DATA SQL statement. This statement lets you transfer data from BigQuery Omni-accessible Azure Blob Storage into BigQuery native tables.

Before you begin

Required permissions

To run the LOAD DATA query, you need the following Identity and Access Management (IAM) permissions:

  • bigquery.tables.create
  • bigquery.tables.get
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create
  • bigquery.connections.use

Each of the following predefined IAM roles includes the permissions that you need to run the LOAD DATA query:

  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
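
If you manage access at the dataset level, one way to grant a predefined role is with a GoogleSQL GRANT statement, as in the following sketch. The dataset name mydataset and the user email are placeholder values, and a dataset-level grant does not cover project-level permissions such as bigquery.jobs.create or access to the connection, which you must grant separately.

-- Sketch only: grants the BigQuery Data Editor role on a placeholder dataset.
-- Project-level permissions (for example, bigquery.jobs.create) and access to
-- the connection are not covered by this grant.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `mydataset`
TO 'user:analyst@example.com';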

Supported formats

To load data using the LOAD DATA query, your source data must be in one of the following formats:

  • CSV (both compressed and uncompressed)
  • JSON (both compressed and uncompressed)
  • Parquet
  • ORC
  • Avro
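
For example, to load gzip-compressed newline-delimited JSON files, you can set the format and compression options in the FROM FILES clause. The following statement is a sketch; the test_json table name and sample.json.gz file are placeholders, and the container and connection names reuse the placeholder values from the examples later in this document:

-- Sketch only: the table, file, and connection names are placeholders.
LOAD DATA INTO `mydataset.test_json`
FROM FILES (
  format = 'JSON',
  compression = 'GZIP',
  uris = ['azure://test.blob.core.windows.net/container/sample.json.gz']
)
WITH CONNECTION `azure-eastus2.test-connection`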

Limitations

  • Hive-partitioned source files are not supported.
  • Destination tables with customer-managed encryption keys (CMEK) enabled are not supported.
  • Loading data into a destination table with JSON columns is not supported.
  • Loading data into a destination table with ingestion-time partitioning is not supported.
  • LOAD DATA jobs don't run on reservations. Jobs use on-demand slots that are managed by Google Cloud.
  • The destination dataset must be present in the US multi-region.
  • The connection and the destination dataset must belong to the same project. Loading data across projects is not supported.

Pricing

You are billed for the bytes that are transferred across clouds. For pricing information, see BigQuery Omni Pricing.

Load data

Use the following LOAD DATA [INTO|OVERWRITE] SQL statement to load data into BigQuery from Azure:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement, replacing the variables with your values:

    LOAD DATA INTO [[project_name.]dataset_name.]table_name [(column_list)]
    [PARTITION BY partition_expression]
    [CLUSTER BY clustering_column_list]
    [OPTIONS (table_option_list)]
    FROM FILES (external_table_option_list)
    WITH CONNECTION connection_name;
    
  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Example

The following sample SQL query loads a Parquet file named sample.parquet into the testparquet table with schema auto-detection:

LOAD DATA INTO `mydataset.testparquet`
FROM FILES (uris = ['azure://test.blob.core.windows.net/container/sample.parquet'], format = 'PARQUET')
WITH CONNECTION `azure-eastus2.test-connection`

The following sample SQL query loads CSV files that match the prefix sampled* into the test_csv table, using a predefined schema and partitioning the table by the Time column:

LOAD DATA INTO `mydataset.test_csv` (Number INT64, Name STRING, Time DATE)
PARTITION BY Time
FROM FILES (format = 'CSV', uris = ['azure://test.blob.core.windows.net/container/sampled*'], skip_leading_rows=1)
WITH CONNECTION `azure-eastus2.test-connection`

The following sample SQL query overwrites the existing testparquet table with data from a file named sample.parquet, using schema auto-detection:

LOAD DATA OVERWRITE `mydataset.testparquet`
FROM FILES (uris = ['azure://test.blob.core.windows.net/container/sample.parquet'], format = 'PARQUET')
WITH CONNECTION `azure-eastus2.test-connection`
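
You can also combine the optional PARTITION BY, CLUSTER BY, and OPTIONS clauses from the syntax shown earlier. The following statement is a sketch; the test_clustered table name, the column list, and the table description are placeholder values:

-- Sketch only: table name, columns, and description are placeholders.
LOAD DATA INTO `mydataset.test_clustered` (Number INT64, Name STRING, Time DATE)
PARTITION BY Time
CLUSTER BY Name
OPTIONS (description = 'Data loaded from Azure Blob Storage')
FROM FILES (format = 'CSV', uris = ['azure://test.blob.core.windows.net/container/sampled*'], skip_leading_rows=1)
WITH CONNECTION `azure-eastus2.test-connection`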

Best practices

  • Avoid transferring multiple small files that are less than 5 MB each. Instead, create an external table over your files and export the query result to Azure Blob Storage to create a larger file, as shown in the sketch after this list. This approach improves the transfer time of your data.
  • If your source data is in a gzip-compressed file, then when you create the external table, set the external_table_options.compression option to GZIP.
  • If your source files are large, bandwidth might become a bottleneck while loading data. For better performance, reduce the amount of data that needs to be loaded by using export statements, and then retry the load with the smaller files.
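
The following sketch illustrates the consolidation approach from the first best practice: define an external table over the small files and export the combined result back to Azure Blob Storage as larger files. It assumes a placeholder dataset named azure_dataset that is colocated with the azure-eastus2 connection, and it reuses the placeholder container and connection names from the earlier examples:

-- Sketch only: azure_dataset, the file paths, and the connection are placeholders.
CREATE EXTERNAL TABLE `azure_dataset.small_files`
WITH CONNECTION `azure-eastus2.test-connection`
OPTIONS (
  format = 'CSV',
  uris = ['azure://test.blob.core.windows.net/container/small-files/*.csv']
);

-- Export the combined rows back to Azure Blob Storage as larger Parquet files.
EXPORT DATA WITH CONNECTION `azure-eastus2.test-connection`
OPTIONS (
  uri = 'azure://test.blob.core.windows.net/container/consolidated/*.parquet',
  format = 'PARQUET'
) AS
SELECT * FROM `azure_dataset.small_files`;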

What's next