Create Blob Storage BigLake tables

This document describes how to create an Azure Blob Storage BigLake table. A BigLake table lets you use access delegation to query data in Blob Storage. Access delegation decouples access to the BigLake table from access to the underlying data store.

For information about how data flows between BigQuery and Blob Storage, see Data flow when querying data.

Before you begin

Ensure that you have a connection to access data in your Blob Storage.

Required roles

To get the permissions that you need to create an external table, ask your administrator to grant you the BigQuery Admin (roles/bigquery.admin) IAM role on your dataset. For more information about granting roles, see Manage access.

This predefined role contains the permissions required to create an external table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create an external table:

  • bigquery.tables.create
  • bigquery.connections.delegate

You might also be able to get these permissions with custom roles or other predefined roles.

Create a dataset

Before you create an external table, you need to create a dataset in a supported region. Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, select the project where you want to create the dataset.
  3. Expand the View actions option and click Create dataset.
  4. On the Create dataset page, specify the following details:
    1. For Dataset ID enter a unique dataset name.
    2. For Data location choose a supported region.
    3. Optional: To delete tables automatically, select the Enable table expiration checkbox and set the Default maximum table age in days. Data in Azure is not deleted when the table expires.
    4. Optional: Expand the Advanced options section and select the following options:
      1. If you want to use a customer-managed encryption key, then select the Customer-managed encryption key (CMEK) option. By default, BigQuery encrypts customer content stored at rest by using a Google-managed key.
      2. If you want to use default collation, then select the Enable default collation option.
    5. Click Create dataset.

SQL

Use the CREATE SCHEMA DDL statement. The following example creates a dataset in the azure-eastus2 region:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE SCHEMA mydataset
    OPTIONS (
      location = 'azure-eastus2');
    

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

bq

In a command-line environment, create a dataset using the bq mk command:

bq --location=LOCATION mk \
    --dataset \
    PROJECT_ID:DATASET_NAME

The optional --project_id flag overrides the default project.

Replace the following:

  • LOCATION: the location of your dataset

    For information about supported regions, see Locations. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.

  • PROJECT_ID: your project ID

  • DATASET_NAME: the name of the dataset that you want to create

    To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

Create BigLake tables on unpartitioned data

Select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project, and then select a dataset.

  3. In the Dataset info section, click Create table.

  4. On the Create table page, in the Source section, do the following:

    1. For Create table from, select Azure Blob Storage.
    2. For Select Azure Blob Storage path, enter a Blob Storage path using the following format: azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH

      Replace the following:

      • AZURE_STORAGE_ACCOUNT_NAME: The name of the Blob Storage account. The account's region should be the same as the dataset's region.
      • CONTAINER_NAME: The name of the Blob Storage container.
      • FILE_PATH: The data path that points to the Blob Storage data. For example, for a single CSV file, FILE_PATH can be myfile.csv.
    3. For File format, select the data format in Azure. Supported formats are AVRO, PARQUET, ORC, CSV, and JSONL (Newline delimited JSON).

  5. In the Destination section, do the following:

    1. For Dataset, choose the appropriate dataset.
    2. In the Table field, enter the name of the table.
    3. Verify that Table type is set to External table.
    4. For Connection ID, choose the appropriate connection ID from the drop-down. For information about connections, see Connect to Blob Storage.
  6. In the Schema section, if you have a source file, you can either enable schema auto-detection or manually specify a schema. If you don't have a source file, you must manually specify a schema.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.

  7. Click Create table.

SQL

To create a BigLake table, use the CREATE EXTERNAL TABLE statement with the WITH CONNECTION clause:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
    WITH CONNECTION `AZURE_LOCATION.CONNECTION_NAME`
      OPTIONS (
        format = 'DATA_FORMAT',
        uris = ['azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH']);
    

    Replace the following:

    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • AZURE_LOCATION: an Azure location in Google Cloud, such as azure-eastus2
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported BigQuery federated formats, such as AVRO or CSV
    • AZURE_STORAGE_ACCOUNT_NAME: the name of the Blob Storage account
    • CONTAINER_NAME: the name of the Blob Storage container
    • FILE_PATH: the data path that points to the Blob Storage data

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Example:

CREATE EXTERNAL TABLE absdataset.abstable
WITH CONNECTION `azure-eastus2.abs-read-conn`
  OPTIONS (
    format = 'CSV', uris = ['azure://account_name.blob.core.windows.net/container/path/file.csv']);

bq

Create a table definition file:

bq mkdef \
    --source_format=DATA_FORMAT \
    --connection_id=AZURE_LOCATION.CONNECTION_NAME \
    "azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH" > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats, such as AVRO, CSV, or PARQUET, in capital letters
  • AZURE_LOCATION: an Azure location in Google Cloud, such as azure-eastus2
  • CONNECTION_NAME: the name of the connection that you created
  • AZURE_STORAGE_ACCOUNT_NAME: the name of the Blob Storage account
  • CONTAINER_NAME: the name of the Blob Storage container
  • FILE_PATH: the data path that points to the Blob Storage data

Next, create the BigLake table:

bq mk --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

  • DATASET_NAME: the name of the dataset that you created
  • TABLE_NAME: the name that you want to give to this table

For example, the following commands create a new BigLake table, my_dataset.my_table, which can query your Blob Storage data that's stored at the path azure://account_name.blob.core.windows.net/container/path and has a read connection in the location azure-eastus2:

bq mkdef \
    --source_format=AVRO \
    --connection_id=azure-eastus2.read-conn \
    "azure://account_name.blob.core.windows.net/container/path" > table_def

bq mk \
    --external_table_definition=table_def my_dataset.my_table

API

Call the tables.insert API method, and create an ExternalDataConfiguration in the Table resource that you pass in.

Specify the schema property or set the autodetect property to true to enable schema auto-detection for supported data sources.

Specify the connectionId property to identify the connection to use for connecting to Blob Storage.
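
For example, the following Python sketch uses the google-cloud-bigquery client library, which issues the tables.insert call through Client.create_table. The project ID is a placeholder; the dataset, table, connection, and file path values mirror the SQL example above.

from google.cloud import bigquery

# Placeholder project ID; the remaining names mirror the SQL example above.
client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "azure://account_name.blob.core.windows.net/container/path/file.csv"
]
external_config.connection_id = "azure-eastus2.abs-read-conn"  # AZURE_LOCATION.CONNECTION_NAME
external_config.autodetect = True  # or specify a schema on the table instead

table = bigquery.Table("my-project.absdataset.abstable")
table.external_data_configuration = external_config
table = client.create_table(table)  # calls the tables.insert API method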

Create BigLake tables on partitioned data

You can create a BigLake table for Hive partitioned data in Blob Storage. After you create an externally partitioned table, you can't change the partition key; to change it, you must recreate the table.

To create a BigLake table based on Hive partitioned data, select one of the following options:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, expand your project and select a dataset.

  3. Click View actions, and then click Create table. This opens the Create table pane.

  4. In the Source section, specify the following details:

    1. For Create table from, select one of the following options:

      • Amazon S3
      • Azure Blob Storage
    2. Provide the path to the folder, using wildcards. For example:

      • For Amazon S3: s3://mybucket/*
      • For Blob Storage: azure://mystorageaccount.blob.core.windows.net/mycontainer/*

      The folder must be in the same location as the dataset that contains the table you want to create, append, or overwrite.

    3. From the File format list, select the file type.

    4. Select the Source data partitioning checkbox, and then specify the following details:

      1. For Select Source URI Prefix, enter the URI prefix. For example, s3://mybucket/my_files.
      2. Optional: To require a partition filter on all queries for this table, select the Require partition filter checkbox. Requiring a partition filter can reduce cost and improve performance. For more information, see Requiring predicate filters on partition keys in queries.
      3. In the Partition inference mode section, select one of the following options:

        • Automatically infer types: set the partition schema detection mode to AUTO.
        • All columns are strings: set the partition schema detection mode to STRINGS.
        • Provide my own: set the partition schema detection mode to CUSTOM and manually enter the schema information for the partition keys. For more information, see Provide a custom partition key schema.
  5. In the Destination section, specify the following details:

    1. For Project, select the project in which you want to create the table.
    2. For Dataset, select the dataset in which you want to create the table.
    3. For Table, enter the name of the table that you want to create.
    4. For Table type, verify that External table is selected.
    5. For Connection ID, select the connection that you created earlier.
  6. In the Schema section, if you have a source file, you can either enable schema auto-detection or manually specify a schema. If you don't have a source file, you must manually specify a schema.

    • To enable schema auto-detection, select the Auto-detect option.

    • To manually specify a schema, leave the Auto-detect option unchecked. Enable Edit as text and enter the table schema as a JSON array.

  7. To ignore rows with extra column values that don't match the schema, expand the Advanced options section and select Unknown values.

  8. Click Create table.

SQL

Use the CREATE EXTERNAL TABLE DDL statement:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
    WITH PARTITION COLUMNS
    (
      PARTITION_COLUMN PARTITION_COLUMN_TYPE,
    )
    WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
    OPTIONS (
      hive_partition_uri_prefix = "HIVE_PARTITION_URI_PREFIX",
      uris=['FILE_PATH'],
      format ="TABLE_FORMAT"
    );
    

    Replace the following:

    • PROJECT_ID: the name of your project in which you want to create the table—for example, myproject
    • DATASET: the name of the BigQuery dataset that you want to create the table in—for example, mydataset
    • EXTERNAL_TABLE_NAME: the name of the table that you want to create—for example, mytable
    • PARTITION_COLUMN: the name of the partitioning column
    • PARTITION_COLUMN_TYPE: the type of the partitioning column
    • REGION: the region that contains the connection—for example, us
    • CONNECTION_ID: the name of the connection—for example, myconnection
    • HIVE_PARTITION_URI_PREFIX: the Hive partitioning URI prefix, for example:

      • s3://mybucket/
      • azure://mystorageaccount.blob.core.windows.net/mycontainer/
    • FILE_PATH: path to the data source for the external table that you want to create—for example:

      • s3://mybucket/*.parquet
      • azure://mystorageaccount.blob.core.windows.net/mycontainer/*.parquet
    • TABLE_FORMAT: the format of the table that you want to create—for example, PARQUET

  3. Click Run.

For more information about how to run queries, see Running interactive queries.

Examples

The following example creates a BigLake table over partitioned data in Amazon S3. The schema is autodetected.

CREATE EXTERNAL TABLE `my_dataset.my_table`
WITH PARTITION COLUMNS
(
  sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS(
  hive_partition_uri_prefix = "s3://mybucket/products",
  uris = ['s3://mybucket/products/*']
);

The following example creates a BigLake table over partitioned data in Blob Storage. The schema is specified.

CREATE EXTERNAL TABLE `my_dataset.my_table`
(
  ProductId INTEGER,
  ProductName STRING,
  ProductType STRING
)
WITH PARTITION COLUMNS
(
  sku STRING,
)
WITH CONNECTION `us.my-connection`
OPTIONS(
  hive_partition_uri_prefix = "azure://mystorageaccount.blob.core.windows.net/mycontainer/products",
  uris = ['azure://mystorageaccount.blob.core.windows.net/mycontainer/*']
);

bq

First, use the bq mkdef command to create a table definition file:

bq mkdef \
--source_format=SOURCE_FORMAT \
--connection_id=REGION.CONNECTION_ID \
--hive_partitioning_mode=PARTITIONING_MODE \
--hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX \
--require_hive_partition_filter=BOOLEAN \
 URIS > DEFINITION_FILE

Replace the following:

  • SOURCE_FORMAT: the format of the external data source. For example, CSV.
  • REGION: the region that contains the connection—for example, us.
  • CONNECTION_ID: the name of the connection—for example, myconnection.
  • PARTITIONING_MODE: the Hive partitioning mode. Use one of the following values:
    • AUTO: Automatically detect the key names and types.
    • STRINGS: Automatically convert the key names to strings.
    • CUSTOM: Encode the key schema in the source URI prefix.
  • URI_SHARED_PREFIX: the source URI prefix.
  • BOOLEAN: specifies whether to require a predicate filter at query time. This flag is optional. The default value is false.
  • URIS: the path to the Amazon S3 or the Blob Storage folder, using wildcard format.
  • DEFINITION_FILE: the path to the table definition file on your local machine.

If PARTITIONING_MODE is CUSTOM, include the partition key schema in the source URI prefix, using the following format:

--hive_partitioning_source_uri_prefix=URI_SHARED_PREFIX/{KEY1:TYPE1}/{KEY2:TYPE2}/...

After you create the table definition file, use the bq mk command to create the BigLake table:

bq mk --external_table_definition=DEFINITION_FILE \
DATASET_NAME.TABLE_NAME \
SCHEMA

Replace the following:

  • DEFINITION_FILE: the path to the table definition file.
  • DATASET_NAME: the name of the dataset that contains the table.
  • TABLE_NAME: the name of the table you're creating.
  • SCHEMA: specifies a path to a JSON schema file, or specifies the schema in the form field:data_type,field:data_type,.... To use schema auto-detection, omit this argument.

Examples

The following example uses AUTO Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  --metadata_cache_mode=AUTOMATIC \
  s3://mybucket/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses STRINGS Hive partitioning mode for Amazon S3 data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=s3://mybucket/myTable \
  s3://mybucket/myTable/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

The following example uses CUSTOM Hive partitioning mode for Blob Storage data:

bq mkdef --source_format=CSV \
  --connection_id=us.my-connection \
  --hive_partitioning_mode=CUSTOM \
  --hive_partitioning_source_uri_prefix=azure://mystorageaccount.blob.core.windows.net/mycontainer/{dt:DATE}/{val:STRING} \
  azure://mystorageaccount.blob.core.windows.net/mycontainer/* > mytable_def

bq mk --external_table_definition=mytable_def \
  mydataset.mytable \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER

API

To set Hive partitioning using the BigQuery API, include the hivePartitioningOptions object in the ExternalDataConfiguration object when you create the table definition file. To create a BigLake table, you must also specify a value for the connectionId field.

If you set the hivePartitioningOptions.mode field to CUSTOM, you must encode the partition key schema in the hivePartitioningOptions.sourceUriPrefix field as follows: s3://BUCKET/PATH_TO_TABLE/{KEY1:TYPE1}/{KEY2:TYPE2}/...

To enforce the use of a predicate filter at query time, set the hivePartitioningOptions.requirePartitionFilter field to true.
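
As an illustrative sketch, the following Python example sets these fields through the google-cloud-bigquery client library. The project ID and connection name are placeholders; the storage paths mirror the partitioned Blob Storage example above.

from google.cloud import bigquery

# Placeholder project ID and connection name; the paths mirror the
# partitioned Blob Storage example above.
client = bigquery.Client(project="my-project")

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"  # or "STRINGS"; for "CUSTOM", encode the keys in the prefix
hive_opts.source_uri_prefix = (
    "azure://mystorageaccount.blob.core.windows.net/mycontainer/products"
)
hive_opts.require_partition_filter = True  # optional: require predicate filters

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [
    "azure://mystorageaccount.blob.core.windows.net/mycontainer/*"
]
external_config.connection_id = "azure-eastus2.abs-read-conn"
external_config.hive_partitioning = hive_opts
external_config.autodetect = True

table = bigquery.Table("my-project.my_dataset.my_table")
table.external_data_configuration = external_config
client.create_table(table)  # calls the tables.insert API method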

Query BigLake tables

For more information, see Query Blob Storage data.

View resource metadata with INFORMATION_SCHEMA

You can view the resource metadata with INFORMATION_SCHEMA views. When you query the JOBS_BY_*, JOBS_TIMELINE_BY_*, and RESERVATION* views, you must specify the query's processing location that is collocated with the table's region. For information about BigQuery Omni locations, see Locations. For all other system tables, specifying the query job location is optional.

For information about the system tables that BigQuery Omni supports, see Limitations.

To query JOBS_* and RESERVATION* system tables, select one of the following methods to specify the processing location:

Console

  1. Go to the BigQuery page.

    Go to BigQuery

  2. If the Editor tab isn't visible, then click Compose new query.

  3. Click More > Query settings. The Query settings dialog opens.

  4. In the Query settings dialog, for Additional settings > Data location, select the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is azure-eastus2, specify us-east4.

  5. Select the remaining fields and click Save.

bq

Use the --location flag to set the job's processing location to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is azure-eastus2, specify us-east4.

Example

bq query --use_legacy_sql=false --location=us-east4 \
"SELECT * FROM region-azure-eastus2.INFORMATION_SCHEMA.JOBS limit 10;"

API

If you are running jobs programmatically, set the location argument to the BigQuery region that is collocated with the BigQuery Omni region. For example, if your BigQuery Omni region is azure-eastus2, specify us-east4.
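
For example, here is a minimal sketch with the google-cloud-bigquery Python client library, matching the bq example above:

from google.cloud import bigquery

client = bigquery.Client()

# Run the metadata query in the BigQuery region that is collocated with the
# BigQuery Omni region (here, azure-eastus2 is collocated with us-east4).
sql = "SELECT * FROM `region-azure-eastus2`.INFORMATION_SCHEMA.JOBS LIMIT 10"
query_job = client.query(sql, location="us-east4")

for row in query_job.result():
    print(row)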

VPC Service Controls

You can set up an egress policy that restricts access to only specified external cloud resources from within a VPC Service Controls perimeter. For more information, see Set up VPC Service Controls for BigQuery Omni.

Limitations

For a full list of limitations that apply to BigLake tables based on Amazon S3 and Blob Storage, see Limitations.

What's next