
Query Azure Blob storage data

This document describes how to query data stored in Azure Blob storage by creating a BigLake table. A BigLake table lets you use access delegation to query data in Azure Blob storage. Access delegation decouples access to the BigLake table from access to the underlying data store.

Before you begin

Ensure that you have a connection to access data in your Azure Blob storage.
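To confirm that the connection exists and is in the expected location, you can inspect it with the bq command-line tool. The following is a minimal sketch; the project name myproject and the connection name abs-read-conn are placeholders for your own values:

bq show --connection myproject.azure-eastus2.abs-read-conn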

Required role

To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

  • bigquery.tables.create
  • bigquery.tables.getData
  • bigquery.jobs.create
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.
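As a sketch only, an administrator could grant this role at the project level by using the Google Cloud CLI; PROJECT_ID and USER_EMAIL are placeholders, and a dataset-level grant (as described above) is the narrower option:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/bigquery.admin"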

Create a dataset

Before you create an external table, you need to create a dataset in the 'azure-eastus2' region. Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select the project where you want to create the dataset.
  3. Expand the View actions option and click Create dataset.
  4. On the Create dataset page, specify the following details:
    1. For Dataset ID, enter a unique dataset name.
    2. For Data location, choose azure-eastus2.
    3. Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Azure is not deleted when the table expires.
    4. Optional: Expand the Advanced options section and select the following options:
      1. If you want to use a customer-managed encryption key, then select the Customer-managed encryption key (CMEK) option. By default, BigQuery encrypts customer content stored at rest by using a Google-managed key.
      2. If you want to use default collation, then select the Enable default collation option.
    5. Click Create dataset.

SQL

Use the CREATE SCHEMA DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA mydataset
    OPTIONS (
      location = 'azure-eastus2');
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

bq

In a command-line environment, create a dataset using the bq mk command:

bq --location=LOCATION mk \
    --dataset \
    PROJECT_ID:DATASET_NAME

The --project_id parameter overrides the default project.

Replace the following:

  • LOCATION: the location of your dataset

    Valid value is azure-eastus2. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.

  • PROJECT_ID: your project ID

  • DATASET_NAME: the name of the dataset that you want to create

    To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.
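For example, the following command creates a dataset named mydataset in the azure-eastus2 location in the project myproject (both names are placeholders):

bq --location=azure-eastus2 mk \
    --dataset \
    myproject:mydataset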

Create an external table

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.

  3. In the Dataset info section, click Create table.

  4. On the Create table page, in the Source section, do the following:

    1. For Create table from, select Azure Blob Storage.
    2. For Select Azure Blob Storage path, enter an Azure Blob storage path using the following format: azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH

      Replace the following:

      • AZURE_STORAGE_ACCOUNT_NAME: The name of the Azure Storage account. The account's region should be the same as the dataset's region.
      • CONTAINER_NAME: The name of the Azure container.
      • FILE_PATH: The data path that points to the Azure Storage data. For example, for a single CSV file, FILE_PATH can be myfile.csv.
    3. For File format, select the data format in Azure. Supported formats are AVRO, PARQUET, ORC, CSV, and JSONL (Newline delimited JSON).

  5. Optional: If the data in Azure Blob storage uses hive partitioning, then do the following:

    1. Select the Source data partitioning checkbox.
    2. In the Select Source URI Prefix field, enter the URI prefix. The URI prefix is the part of the URI that appears before the partition encoding. For example, azure://account_name.blob.core.windows.net/container/.
    3. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. A partition filter can reduce cost and improve performance. For more information, see Set partition filter requirements.
    4. Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
  6. In the Destination section, do the following:

    1. For Dataset, choose the appropriate dataset.
    2. In the Table field, enter the name of the table.
    3. Verify that Table type is set to External table.
    4. For Connection ID, choose the appropriate connection ID from the dropdown. For information about connections, see Connect to Azure Blob storage.
  7. Click Create table.

SQL

To create an external table, use the CREATE EXTERNAL TABLE statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
    WITH CONNECTION `AZURE_LOCATION.CONNECTION_NAME`
      OPTIONS (
        format = 'DATA_FORMAT',
        uris = ['azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH']);
    

    Replace the following:

    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • AZURE_LOCATION: an Azure location in Google Cloud, such as azure-eastus2
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported BigQuery federated formats, such as AVRO or CSV
    • AZURE_STORAGE_ACCOUNT_NAME: the name of the Azure Storage account
    • CONTAINER_NAME: the name of the Azure container
    • FILE_PATH: the data path that points to the Azure Storage data

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Example:

CREATE EXTERNAL TABLE absdataset.abstable
WITH CONNECTION `azure-eastus2.abs-read-conn`
  OPTIONS (
    format = 'CSV', uris = ['azure://account_name.blob.core.windows.net/container/path/file.csv']);

bq

Create a table definition file:

bq mkdef  \
    --source_format=DATA_FORMAT \
    --connection_id=AZURE_LOCATION.CONNECTION_NAME \
    "azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH" > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats, such as AVRO, CSV, or PARQUET, in capital letters
  • AZURE_LOCATION: an Azure location in Google Cloud, such as azure-eastus2
  • CONNECTION_NAME: the name of the connection that you created
  • AZURE_STORAGE_ACCOUNT_NAME: the name of the Azure Storage account
  • CONTAINER_NAME: the name of the Azure container
  • FILE_PATH: the data path that points to the Azure Storage data

Next, create the external table:

bq mk --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

  • DATASET_NAME: the name of the dataset that you created
  • TABLE_NAME: the name that you want to give to this table

For example, the following commands create an external table, my_dataset.my_table, that queries Azure Storage data stored at the path azure://account_name.blob.core.windows.net/container/path by using a read connection in the azure-eastus2 location:

bq mkdef \
    --source_format=AVRO \
    --connection_id=azure-eastus2.read-conn \
    "azure://account_name.blob.core.windows.net/container/path" > table_def

bq mk \
    --external_table_definition=table_def my_dataset.my_table

Hive partitioning

If the data in Azure Blob storage uses hive partitioning, then create the table definition file as follows:

bq mkdef  \
    --source_format=DATA_FORMAT \
    --connection_id=AZURE_LOCATION.CONNECTION_NAME \
    --hive_partitioning_mode=HIVE_PARTITIONING_MODE \
    --hive_partitioning_source_uri_prefix=HIVE_PARTITIONING_URI_PREFIX \
    "azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH" > table_def

Replace the following:

  • HIVE_PARTITIONING_MODE: the partition schema detection mode
  • HIVE_PARTITIONING_URI_PREFIX: the URI prefix, which is the part of the URI that appears before the partition encoding, such as azure://account_name.blob.core.windows.net/container/
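For example, the following commands sketch an external table over hive-partitioned Parquet data, using automatic partition key detection. The account, container, connection, dataset, and table names follow the earlier examples and are placeholders; adjust the wildcard path to match how your files are laid out:

bq mkdef \
    --source_format=PARQUET \
    --connection_id=azure-eastus2.abs-read-conn \
    --hive_partitioning_mode=AUTO \
    --hive_partitioning_source_uri_prefix=azure://account_name.blob.core.windows.net/container/ \
    "azure://account_name.blob.core.windows.net/container/*" > table_def

bq mk --external_table_definition=table_def my_dataset.my_table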

Query the table

BigQuery Omni lets you query the external table like any BigQuery table. The maximum result size for interactive queries is 20 GiB logical bytes (preview). For more information, see BigQuery Omni limitations. If your query result is larger than 20 GiB, we recommend that you export it to Azure Storage. The query result is stored in a BigQuery temporary table.
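For large results, an EXPORT DATA statement that reuses your connection is one way to write the output to Azure Blob storage. The following is a minimal sketch that borrows the connection, dataset, and table names from the examples in this document; the destination URI must include a wildcard, and your format options might differ:

EXPORT DATA WITH CONNECTION `azure-eastus2.abs-read-conn`
OPTIONS (
  uri = 'azure://account_name.blob.core.windows.net/container/export/*',
  format = 'CSV')
AS SELECT * FROM absdataset.abstable;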

Run a query on an external table like you would for any BigQuery table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    SELECT * FROM DATASET_NAME.TABLE_NAME;
    

    Replace the following:

    • DATASET_NAME: the name of the dataset that you created
    • TABLE_NAME: the name of the external table that you created

  3. Click Run.

For more information about how to run queries, see Running interactive queries.
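For example, using the dataset and table names from the earlier SQL example:

SELECT * FROM absdataset.abstable LIMIT 10;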

View resource metadata with INFORMATION_SCHEMA

When you query the JOBS_BY_*, JOBS_TIMELINE_BY_*, and RESERVATION* BigQuery INFORMATION_SCHEMA views, you must specify the query's processing location as us-east4 (Northern Virginia), which is colocated with the table's region. If you don't specify the location, the job fails. For all other system tables, specifying the query job location is optional.

For information about the system tables that BigQuery Omni supports, see Limitations.

To query JOBS_* and RESERVATION* system tables, specify the processing location as follows:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. If the Editor tab isn't visible, then click Compose new query.

  3. Click More > Query settings. The Query settings dialog opens.


  4. In the Query settings dialog, for Additional settings > Data location, select us-east4 (Northern Virginia).

  5. Select the remaining fields and click Save.

bq

Use the --location flag to set the job's processing location to us-east4.

Example

bq query --use_legacy_sql=false --location=us-east4 \
'SELECT * FROM `region-azure-eastus2`.INFORMATION_SCHEMA.JOBS LIMIT 10;'

API

If you are running jobs programmatically, ensure that you set the location argument to us-east4.

What's next