
Cross-cloud transfer from Azure Blob Storage

This document explains how to transfer data from Azure Blob Storage into a US multi-regional BigQuery standard table by using the LOAD DATA SQL statement.

BigQuery Omni lets you analyze data stored in Amazon S3 buckets and Azure Blob Storage by using queries and joins. If you want to join data in Amazon S3 or Azure Blob Storage with data in Google Cloud regions, or if you want to use BigQuery ML capabilities, use the LOAD DATA SQL statement. This statement transfers data from BigQuery Omni-accessible Azure Blob Storage into BigQuery standard tables.

Quotas and limits

For information about quotas and limits, see load jobs quotas and limits.

Pricing

You are billed for the bytes that are transferred across clouds. Cross-cloud transfers also require slots in the BigQuery Omni regions to scan the Amazon S3 or Azure Blob Storage files before loading them. For pricing information, see BigQuery Omni pricing.

Before you begin

To provide Google Cloud with read access to the files that need to be processed, ensure that you have a connection to access Azure Blob Storage data.

Required role

To get the permissions that you need to run the LOAD DATA query, ask your administrator to grant you the BigQuery Data Editor (roles/bigquery.dataEditor) IAM role on the dataset. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to run the LOAD DATA query. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

  • bigquery.tables.create
  • bigquery.tables.get
  • bigquery.tables.updateData
  • bigquery.tables.update
  • bigquery.jobs.create
  • bigquery.connections.use

You might also be able to get these permissions with custom roles or other predefined roles.

For more information about IAM roles in BigQuery, see Predefined roles and permissions.
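
If you manage access at the dataset level with SQL, the following is a minimal sketch of granting the role with a GRANT DCL statement. The dataset name mydataset and the principal alice@example.com are placeholders, not names from this document:

-- Minimal sketch (placeholder names): grant BigQuery Data Editor on a dataset.
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `mydataset`
TO "user:alice@example.com";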

Limitations

  • Loading data into a destination table that uses ingestion-time partitioning is not supported.
  • LOAD DATA jobs don't run on reservations. Jobs utilize on-demand slots that are managed by Google Cloud.
  • You can only transfer data from the AZURE_EASTUS2 BigQuery Omni region to the US and US-EAST4 BigQuery regions.
  • The connection and the destination dataset must belong to the same project. Loading data across projects is not supported.
  • To load data using the LOAD DATA query, your source data must be in one of the following formats:

    • CSV (both compressed and uncompressed)
    • JSON (both compressed and uncompressed)
    • Parquet
    • ORC
    • AVRO

Load data

Use the following LOAD DATA [INTO|OVERWRITE] SQL statement to load data into BigQuery from Azure Blob Storage:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement, replacing the variables with your values:

    LOAD DATA INTO [[project_name.]dataset_name.]table_name [(column_list)]
    [PARTITION BY partition_expression]
    [CLUSTER BY clustering_column_list]
    [OPTIONS (table_option_list)]
    FROM FILES (external_table_option_list)
    WITH CONNECTION connection_name;
    
  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Example

The following sample SQL query loads a Parquet file named sample.parquet into the testparquet table with an auto-detected schema:

LOAD DATA INTO `mydataset.testparquet`
FROM FILES (uris = ['azure://test.blob.core.windows.net/container/sample.parquet'], format = 'PARQUET')
WITH CONNECTION `azure-eastus2.test-connection`

The following sample SQL query loads CSV files that match the prefix sampled* into the test_csv table with predefined columns, partitioned by the Time column:

LOAD DATA INTO `mydataset.test_csv` (Number INT64, Name STRING, Time DATE)
PARTITION BY Time
FROM FILES (format = 'CSV', uris = ['azure://test.blob.core.windows.net/container/sampled*'], skip_leading_rows=1)
WITH CONNECTION `azure-eastus2.test-connection`

The following sample SQL query overwrites the existing testparquet table with data from a file named sample.parquet, using an auto-detected schema:

LOAD DATA OVERWRITE `mydataset.testparquet`
FROM FILES (uris = ['azure://test.blob.core.windows.net/container/sample.parquet'], format = 'PARQUET')
WITH CONNECTION `azure-eastus2.test-connection`
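
After a load completes, a quick way to confirm that rows landed in the destination table is to query it. The following is a minimal sanity check, assuming the testparquet table from the preceding example:

-- Minimal sanity check: count the rows loaded into the destination table.
SELECT COUNT(*) AS row_count
FROM `mydataset.testparquet`;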

Best practices

  • Avoid transferring multiple small files that are less than 5 MB. Instead, create an external table over your files and export the query results to Azure Blob Storage to create a larger file. This method helps improve the transfer time of your data.
  • If your source data is in a gzip-compressed file, then while creating external tables, set the external_table_options.compression option to GZIP. A sketch that combines both practices follows this list.
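
The following is a minimal sketch of both practices, reusing the azure-eastus2.test-connection connection and container from the earlier examples; the table names and object paths are placeholders:

-- Sketch (placeholder names): define an external table over many small
-- gzip-compressed CSV files; compression is set because the sources are gzipped.
CREATE EXTERNAL TABLE `mydataset.small_files_ext`
WITH CONNECTION `azure-eastus2.test-connection`
OPTIONS (
  format = 'CSV',
  uris = ['azure://test.blob.core.windows.net/container/small-files/*'],
  compression = 'GZIP'
);

-- Sketch: consolidate the small files into larger Parquet files in Blob Storage,
-- then load those files with LOAD DATA as shown earlier.
EXPORT DATA
WITH CONNECTION `azure-eastus2.test-connection`
OPTIONS (
  uri = 'azure://test.blob.core.windows.net/container/consolidated/*.parquet',
  format = 'PARQUET'
) AS
SELECT * FROM `mydataset.small_files_ext`;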

What's next