Create Delta Lake BigLake tables

To enroll in this preview, complete the following form. To request feedback or support for this feature, send email to biglake-help@google.com.

BigLake lets you access Delta Lake tables with more granular access control. Delta Lake is an open source, tabular data storage format developed by Databricks that supports petabyte scale data tables.

BigQuery supports the following features with Delta Lake tables:

  • Access delegation: Query structured data in external data stores with access delegation. Access delegation decouples access to the Delta Lake table from access to the underlying datastore.
  • Fine-grained access control: Enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
  • Schema evolution: Schema changes in the Delta Lake tables are autodetected. Changes to the schema are reflected in the BigQuery table.

Delta Lake tables also support all BigLake features when you configure them as BigLake tables.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery Connection and BigQuery Reservation APIs.

    Enable the APIs

  4. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  5. Ensure that you have a BigQuery dataset.

  6. Ensure that your version of the Google Cloud SDK is 366.0.0 or later:

    gcloud version
    

    If needed, update the Google Cloud SDK.

  7. Create a Cloud resource connection based on your external data source, and grant that connection access to Cloud Storage. If you don't have the appropriate permissions to create a connection, ask your BigQuery administrator to create a connection and share it with you.

Required roles

The following permissions are required to create a Delta Lake table:

  • bigquery.tables.create
  • bigquery.connections.delegate

The BigQuery Admin (roles/bigquery.admin) predefined Identity and Access Management role includes these permissions.

If you are not a principal in this role, ask your administrator to grant you these permissions or to create the Delta Lake table for you.

Additionally, to allow BigQuery users to query the table, the service account associated with the connection must have the following permission and access:

  • BigQuery Viewer (roles/bigquery.viewer) role
  • BigQuery Connection User (roles/bigquery.connectionUser) role
  • Access to the Cloud Storage bucket that contains that data

For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions.

Create tables with Delta Lake

To create Delta Lake tables, follow these steps.

SQL

Use the CREATE EXTERNAL TABLE statement to create the Delta Lake table:

CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  format ="DELTA_LAKE",
  uris=['DELTA_TABLE_GCS_BASE_PATH']);

Replace the following values:

  • PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
  • DATASET: the BigQuery dataset to contain the Delta Lake table
  • DELTALAKE_TABLE_NAME: the name of your Delta Lake table
  • REGION: the region that contains the connection to create the Delta Lake table—for example, us
  • CONNECTION_ID: the connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

  • DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix

bq

In a command-line environment, use the bq mk command to create the Delta Lake table:

bq mk --table --external_table_definition=DEFINITION_FILE PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME

Replace the following values:

  • DEFINITION_FILE: the path to a table definition file
  • PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
  • DATASET: the BigQuery dataset to contain the Delta Lake table
  • DELTALAKE_TABLE_NAME: the name of your Delta Lake table

REST

Use the BigQuery API to create a Delta Lake table by calling the tables.insert API method:

REQUEST='{
  "autodetect": true,
  "externalDataConfiguration": {
  "sourceFormat": "DELTA_LAKE",
  "connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
  "sourceUris": [
    "DELTA_TABLE_GCS_BASE_PATH"
  ],
 },
"tableReference": {
"tableId": "DELTALAKE_TABLE_NAME"
}
}'

echo $REQUEST | curl -X POST -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables?autodetect_schema=true

Replace the following values:

  • PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
  • REGION: the region that contains the connection to create the Delta Lake table—for example, us
  • CONNECTION_ID: the connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

  • DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix

  • DELTALAKE_TABLE_NAME: the name of your Delta Lake table

  • DATASET: the BigQuery dataset to contain the Delta Lake table

When you create Delta Lake tables, the Delta Lake prefix is used as the URI for the table. For example, for a table that has logs in the bucket gs://bucket/warehouse/basictable/_delta_log, the table URI is gs://bucket/warehouse/basictable. When you run queries on the Delta Lake table, BigQuery reads data under the prefix to identify the current version of the table and then computes the metadata and the files for the table.

Update Delta Lake tables

To update (refresh) the schema of Delta Lake tables, follow these steps.

bq

In a command-line environment, use the bq update command to update (refresh) the schema of the Delta Lake table:

bq update --autodetect_schema PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME

Replace the following values:

  • PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
  • DATASET: the BigQuery dataset to contain the Delta Lake table
  • DELTALAKE_TABLE_NAME: the name of your Delta Lake table

REST

Use the BigQuery API to update a Delta Lake table by calling the tables.patch API method:

REQUEST='{
  "externalDataConfiguration": {
    "sourceFormat": "DELTA_LAKE",
    "sourceUris": [
      "DELTA_TABLE_GCS_BASE_PATH"
    ],
    "connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
    "autodetect": true
  },
  "tableReference": {
    "tableId": "DELTALAKE_TABLE_NAME"
  }
}'
echo $REQUEST |curl -X POST -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables

Replace the following values:

  • DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
  • PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
  • REGION: the region that contains the connection to create the Delta Lake table—for example, us
  • CONNECTION_ID: the connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

  • DELTALAKE_TABLE_NAME: the name of your Delta Lake table

  • DATASET: the BigQuery dataset to contain the Delta Lake table

Query Delta Lake tables

After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:

SELECT field1, field2 FROM mydataset.my_cloud_storage_table;

For more information, see Query Cloud Storage data in BigLake tables.

An external connection associated with a service account is used to connect to the datastore. Because the service account retrieves data from the datastore, users only need access to the Delta Lake table.

Data mapping

BigQuery converts Delta Lake data types to BigQuery data types as shown in the following table:

Delta Lake Type BigQuery Type
boolean BOOL
byte INT64
int INT64
long INT64
float FLOAT64
double FLOAT64
Decimal(P/S) NUMERIC or BIG_NUMERIC depending on precision
date DATE
time TIME
timestamp (not partition column) TIMESTAMP
timestamp (partition column) DATETIME
string STRING
binary BYTES
array<Type> ARRAY<Type>
struct STRUCT
map<KeyType, ValueType> ARRAY<Struct<key KeyType, value ValueType>>

Limitations

Delta Lake tables have BigLake table limitations and also the following limitations:

  • Supports Delta Lake reader version 3 with deletion vectors and column mapping.
  • You must list the reader version in the last log entry file. For example, new tables must include 00000..0.json.
  • Change data capture (CDC) operations aren't supported. Any existing CDC operations are ignored.
  • The schema is autodetected. Modifying the schema by using BigQuery isn't supported.
  • Table column names must adhere to BigQuery column name restrictions.
  • Materialized views aren't supported.
  • The Read API isn't supported for Delta Lake.