Create an external table for Azure Storage data

BigQuery Omni does not manage data stored in Azure Storage. To access Azure Storage data, define an external table.

Create a dataset

To create a dataset, use the Google Cloud Console, the bq command-line tool, or the client library:

Google Cloud Console

  1. In the Google Cloud Console, open the BigQuery page.

    Go to BigQuery

  2. In the navigation panel, in the Resources section, select your project.

  3. In the details panel, click Create dataset.

  4. On the Create dataset page, enter the following information:

    • For Dataset ID, enter a unique dataset name.
    • For Data location, choose azure-eastus2.

    • For Default table expiration, choose one of the following options:

      • Never: (Default) Tables that you create in the dataset are never automatically deleted. You must delete them manually.
      • Number of days after table creation: This value determines when a newly created table in the dataset is deleted. This value is applied if you do not set a table expiration when the table is created. Data in Azure is not deleted when the table expires.
    • Click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. Optional parameters include --default_table_expiration and --description.

To create a dataset in a project other than your default project, add the project ID to the dataset name in the following format: PROJECT_ID:DATASET_NAME.

bq --location=LOCATION mk \
--dataset \
--default_table_expiration INTEGER1 \
--description DESCRIPTION \
PROJECT_ID:DATASET_NAME

Replace the following:

  • LOCATION: The dataset's location. Replace with azure-eastus2. After you create a dataset, you can't change its location. You can set a default value for the location by using the .bigqueryrc file.
  • INTEGER1: The default lifetime (in seconds) for newly created tables. The minimum value is 3600 (1 hour). The expiration time evaluates to the current time plus the integer value. Any table created in the dataset is deleted INTEGER1 seconds after the table's creation time. If you do not set a table expiration when you create the table, then this value is applied.
  • DESCRIPTION: A description of the dataset in quotation marks.
  • PROJECT_ID: Your project ID.
  • DATASET_NAME: The name of the dataset that you're creating.
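
Java

As a minimal sketch of the client-library path, the following example uses the google-cloud-bigquery Java client to create a dataset in the azure-eastus2 location. The class name and dataset name are placeholders; replace them with your own values.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;
import com.google.cloud.bigquery.DatasetInfo;

// Sample to create a dataset in the azure-eastus2 location
public class CreateAzureDataset {

  public static void main(String[] args) {
    // TODO(developer): Replace this variable before running the sample.
    String datasetName = "MY_DATASET_NAME";
    createAzureDataset(datasetName);
  }

  public static void createAzureDataset(String datasetName) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be
      // created once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Set the dataset location to azure-eastus2. The location can't be changed later.
      DatasetInfo datasetInfo =
          DatasetInfo.newBuilder(datasetName).setLocation("azure-eastus2").build();

      Dataset dataset = bigquery.create(datasetInfo);
      System.out.println("Dataset created successfully: " + dataset.getDatasetId().getDataset());
    } catch (BigQueryException e) {
      System.out.println("Dataset was not created. \n" + e.toString());
    }
  }
}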

Create an external table

To create an external table, follow these steps:

Google Cloud Console

  1. In the Google Cloud Console, open the BigQuery page.

    Go to BigQuery

  2. In the navigation panel, in the Resources section, expand your project and select a dataset.

  3. Expand the Actions option and click Open.

  4. In the details panel, click Create table.

  5. On the Create table page, in the Source section, do the following:

    • For Create table from, select Azure Blob Storage.
    • For Select Azure Blob Storage path, enter an Azure Blob Storage path using the following format: azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH

      Replace the following:

      • AZURE_STORAGE_ACCOUNT_NAME: The name of the Azure Storage account. The account's region should be the same as the dataset's region.
      • CONTAINER_NAME: The name of the Azure container.
      • FILE_PATH: The data path that points to the Azure Storage data. For example, for a single CSV file, FILE_PATH can be myfile.csv.
    • For File format, select the data format in Azure. Supported formats are AVRO, PARQUET, ORC, CSV, and NEWLINE_DELIMITED_JSON.

  6. Optional: If the data in Azure Blob Storage uses hive partitioning, then do the following:

    • Select the Source data partitioning checkbox.
    • In the Select Azure Blob Storage path field, enter the Azure Blob Storage path using wildcard format. For example, azure://account_name.blob.core.windows.net/container/*.
    • In the Select Source URI Prefix field, enter the URI prefix. The URI prefix is the part of the URI that appears before the partition encoding. For example, azure://account_name.blob.core.windows.net/container/.
    • Select a Partition inference mode. If you select Provide my own, then enter schema information for the partition keys.
  7. On the Create table page, in the Destination section, do the following:

    • For Dataset name, choose the appropriate dataset.
    • In the Table name field, enter the name of the table that you're creating.
    • Verify that Table type is set to External table.
    • For Connection ID, choose the appropriate connection ID from the dropdown.
  8. Click Create table.

SQL

To create an external table, use the CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE DATASET_NAME.TABLE_NAME
  WITH CONNECTION `AZURE_LOCATION.CONNECTION_NAME`
  OPTIONS(
    format="DATA_FORMAT",
    uris=["abs://AZURE_STORAGE_ACCOUNT_NAME/CONTAINER_NAME/FILE_PATH"]
  )

Replace the following:

  • DATASET_NAME: the name of the dataset you created.
  • TABLE_NAME: the name you want to give to this table.
  • AZURE_LOCATION: an Azure location in Google Cloud (for example, azure-eastus2).
  • CONNECTION_NAME: the name of the connection you created.
  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO or CSV).
  • AZURE_STORAGE_ACCOUNT_NAME: the name of the Azure Storage account.
  • CONTAINER_NAME: the name of the Azure container.
  • FILE_PATH: the data path that points to the Azure Storage data.

Example:

CREATE EXTERNAL TABLE absdataset.abstable
  WITH CONNECTION `azure-eastus2.abs-read-conn`
  OPTIONS(format="CSV", uris=["abs://account/container/path/file.csv"])

bq

Create a table definition file:

bq mkdef \
--source_format=DATA_FORMAT \
--connection_id=AZURE_LOCATION.CONNECTION_NAME \
"azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH" > table_def

Replace the following:

  • DATA_FORMAT: any of the supported BigQuery federated formats (such as AVRO, CSV, or PARQUET). Make sure that the format is written in all caps.
  • AZURE_LOCATION: an Azure location in Google Cloud (for example, azure-eastus2).
  • CONNECTION_NAME: the name of the connection that you created.
  • AZURE_STORAGE_ACCOUNT_NAME: the name of the Azure Storage account.
  • CONTAINER_NAME: the name of the Azure container.
  • FILE_PATH: the data path that points to the Azure Storage data.

Next, create the external table:

bq mk --external_table_definition=table_def DATASET_NAME.TABLE_NAME

Replace the following:

  • DATASET_NAME: the name of the dataset that you created.
  • TABLE_NAME: the name that you want to give to this table.

For example, the following commands create a new external table, my_dataset.my_table, which can query your Azure Storage data that's stored at the path azure://account_name.blob.core.windows.net/container/path and has a read connection in the location azure-eastus2:

bq mkdef \
--source_format=AVRO \
--connection_id=azure-eastus2.read-conn \
"azure://account_name.blob.core.windows.net/container/path" > table_def

bq mk \
--external_table_definition=table_def my_dataset.my_table

Hive partitioning

If the data in Azure Blob Storage uses hive partitioning, then create the table definition file as follows:

bq mkdef \
--source_format=DATA_FORMAT \
--connection_id=AZURE_LOCATION.CONNECTION_NAME \
--hive_partitioning_mode=HIVE_PARTITIONING_MODE \
--hive_partitioning_source_uri_prefix=HIVE_PARTITIONING_URI_PREFIX \
"azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH" > table_def

Replace the following:

  • HIVE_PARTITIONING_MODE: The partition schema detection mode. Supported values are AUTO, STRINGS, and CUSTOM.
  • HIVE_PARTITIONING_URI_PREFIX: The URI prefix, which is the part of the URI that appears before the partition encoding. For example, azure://account_name.blob.core.windows.net/container/.
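
For example, the following command creates a table definition file for hive-partitioned Parquet data stored under azure://account_name.blob.core.windows.net/container/, using AUTO partition detection and the read connection from the earlier example. The account, container, and connection names are placeholders. You can then create the external table with bq mk as shown earlier:

bq mkdef \
--source_format=PARQUET \
--connection_id=azure-eastus2.read-conn \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix=azure://account_name.blob.core.windows.net/container/ \
"azure://account_name.blob.core.windows.net/container/*" > table_def

Java

You can also create the external table with the client library. The following is a minimal sketch that assumes a recent version of the google-cloud-bigquery Java library in which ExternalTableDefinition.Builder exposes setConnectionId; the dataset, table, connection ID, and path values are placeholders that you replace with your own.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Sample to create an external table over CSV data in Azure Blob Storage
// through an existing connection. All names and paths are placeholders.
public class CreateExternalTableAzure {

  public static void main(String[] args) {
    // TODO(developer): Replace these variables before running the sample.
    String datasetName = "MY_DATASET_NAME";
    String tableName = "MY_EXTERNAL_TABLE_NAME";
    // The ID of the connection that you created (check the connection details for the exact ID).
    String connectionId = "MY_CONNECTION_ID";
    String sourceUri =
        "azure://AZURE_STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER_NAME/FILE_PATH";
    createExternalTableAzure(datasetName, tableName, connectionId, sourceUri);
  }

  public static void createExternalTableAzure(
      String datasetName, String tableName, String connectionId, String sourceUri) {
    try {
      // Initialize client that will be used to send requests. This client only needs to be
      // created once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      // Build an external table definition that reads CSV data through the connection
      // and lets BigQuery detect the schema.
      ExternalTableDefinition tableDefinition =
          ExternalTableDefinition.newBuilder(sourceUri, FormatOptions.csv())
              .setConnectionId(connectionId)
              .setAutodetect(true)
              .build();

      bigquery.create(TableInfo.of(TableId.of(datasetName, tableName), tableDefinition));
      System.out.println("External table created successfully.");
    } catch (BigQueryException e) {
      System.out.println("External table was not created. \n" + e.toString());
    }
  }
}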

Query the table

BigQuery Omni lets you query the external table like any other BigQuery table. The maximum result size for interactive queries is 2 MB. If your query result is larger, consider exporting it to Azure Storage. The query result is stored in a BigQuery anonymous table.

To run a query, use the Google Cloud Console or bq command-line tool.

Google Cloud Console

  1. In the Google Cloud Console, open the BigQuery page.

    Go to BigQuery

  2. Enter a standard SQL query in the Query editor box. Standard SQL is the default syntax in the Google Cloud Console.

    SELECT * FROM DATASET_NAME.TABLE_NAME
    

    Replace the following:

    • DATASET_NAME: the name of the dataset that you created.
    • TABLE_NAME: the name of the external table that you created.
  3. Click Run.

bq

Use the query command and specify standard SQL syntax by using the --nouse_legacy_sql or --use_legacy_sql=false flag.

To run the query, enter the following:

bq query --nouse_legacy_sql \
'SELECT * FROM DATASET_NAME.TABLE_NAME'

Replace the following:

  • DATASET_NAME: the name of the dataset that you created.
  • TABLE_NAME: the name of the external table that you created.

Java

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Sample to query an external Azure Storage data source using a permanent table
public class QueryExternalTableAzure {

  public static void main(String[] args) throws InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "MY_PROJECT_ID";
    String datasetName = "MY_DATASET_NAME";
    String externalTableName = "MY_EXTERNAL_TABLE_NAME";
    // Query to find states starting with 'W'
    String query =
        String.format(
            "SELECT * FROM %s.%s.%s WHERE name LIKE 'W%%'",
            projectId, datasetName, externalTableName);
    queryExternalTableAzure(query);
  }

  public static void queryExternalTableAzure(String query) throws InterruptedException {
    try {
      // Initialize client that will be used to send requests. This client only needs to be created
      // once, and can be reused for multiple requests.
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

      TableResult results = bigquery.query(QueryJobConfiguration.of(query));

      results
          .iterateAll()
          .forEach(row -> row.forEach(val -> System.out.printf("%s,", val.toString())));

      System.out.println("Query on Azure external permanent table performed successfully.");
    } catch (BigQueryException e) {
      System.out.println("Query not performed \n" + e.toString());
    }
  }
}

What's next