Create Delta Lake BigLake tables
BigLake lets you access Delta Lake tables with more granular access control. Delta Lake is an open source tabular data storage format developed by Databricks that supports petabyte-scale data tables.
BigQuery supports the following features with Delta Lake tables:
- Access delegation: Query structured data in external data stores with access delegation. Access delegation decouples access to the Delta Lake table from access to the underlying datastore.
- Fine-grained access control: Enforce fine-grained security at the table level, including row-level and column-level security. For Delta Lake tables based on Cloud Storage, you can also use dynamic data masking.
- Schema evolution: Schema changes in the Delta Lake tables are autodetected. Changes to the schema are reflected in the BigQuery table.
Delta Lake tables also support all BigLake features when you configure them as BigLake tables.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the BigQuery Connection and BigQuery Reservation APIs.
- In the Google Cloud console, activate Cloud Shell.
- Ensure that you have a BigQuery dataset.
- Ensure that your version of the Google Cloud SDK is 366.0.0 or later:
  gcloud version
  If needed, update the Google Cloud SDK.
- Create a Cloud resource connection based on your external data source, and grant that connection access to Cloud Storage. If you don't have the appropriate permissions to create a connection, ask your BigQuery administrator to create a connection and share it with you.
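The connection setup in the last step can be sketched with the bq and gcloud CLIs. All of the values below (myproject, us, myconnection, mybucket) are hypothetical placeholders, and the commands are built and printed for review rather than executed:

```shell
#!/usr/bin/env bash
# All values here are hypothetical placeholders; replace them before running.
PROJECT_ID="myproject"
LOCATION="us"
CONNECTION_ID="myconnection"
BUCKET="mybucket"

# Build the commands and print them for review rather than executing them;
# run each printed command once you have confirmed the values.
CREATE_CMD="bq mk --connection --location=$LOCATION --project_id=$PROJECT_ID --connection_type=CLOUD_RESOURCE $CONNECTION_ID"
SHOW_CMD="bq show --connection $PROJECT_ID.$LOCATION.$CONNECTION_ID"
# bq show prints the connection's service account; grant that account
# read access to the bucket that holds the Delta Lake data.
GRANT_CMD="gcloud storage buckets add-iam-policy-binding gs://$BUCKET --member=serviceAccount:CONNECTION_SERVICE_ACCOUNT --role=roles/storage.objectViewer"

printf '%s\n' "$CREATE_CMD" "$SHOW_CMD" "$GRANT_CMD"
```

Substitute the service account printed by bq show for the CONNECTION_SERVICE_ACCOUNT placeholder before running the grant.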
Required roles
The following permissions are required to create a Delta Lake table:
- bigquery.tables.create
- bigquery.connections.delegate

The BigQuery Admin (roles/bigquery.admin) predefined Identity and Access Management role includes these permissions.
If you are not a principal in this role, ask your administrator to grant you these permissions or to create the Delta Lake table for you.
Additionally, to allow BigQuery users to query the table, the service account associated with the connection must have the following permission and access:
- BigQuery Viewer (roles/bigquery.viewer) role
- BigQuery Connection User (roles/bigquery.connectionUser) role
- Access to the Cloud Storage bucket that contains the data

For more information on Identity and Access Management roles and permissions in BigQuery, see Predefined roles and permissions.
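As a sketch of the grants for the connection's service account, assuming a hypothetical project and service account (replace both with your own values), the commands are built and printed for review rather than executed:

```shell
#!/usr/bin/env bash
# Hypothetical values; use the service account shown in your connection's
# details and your own project ID before running.
PROJECT_ID="myproject"
CONNECTION_SA="connection-sa@example.iam.gserviceaccount.com"

# Build one grant per role and print it for review rather than executing it.
for ROLE in roles/bigquery.viewer roles/bigquery.connectionUser; do
  CMD="gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$CONNECTION_SA --role=$ROLE"
  echo "$CMD"
done
```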
Create tables with Delta Lake
To create Delta Lake tables, follow these steps.
SQL
Use the CREATE EXTERNAL TABLE statement to create the Delta Lake table:
CREATE EXTERNAL TABLE `PROJECT_ID.DATASET.DELTALAKE_TABLE_NAME`
WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ['DELTA_TABLE_GCS_BASE_PATH']
);
Replace the following values:
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- DATASET: the BigQuery dataset to contain the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
- REGION: the region that contains the connection to create the Delta Lake table (for example, us)
- CONNECTION_ID: the connection ID (for example, myconnection). When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example projects/myproject/locations/connection_location/connections/myconnection.
- DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
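For instance, with the placeholders filled in with hypothetical values (project myproject, dataset mydataset, connection myconnection, and a table stored under gs://mybucket/warehouse/basictable), the statement would read:

```sql
CREATE EXTERNAL TABLE `myproject.mydataset.basictable`
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  format = "DELTA_LAKE",
  uris = ['gs://mybucket/warehouse/basictable']
);
```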
bq
In a command-line environment, use the bq mk command to create the Delta Lake table:
bq mk --table --external_table_definition=DEFINITION_FILE PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME
Replace the following values:
- DEFINITION_FILE: the path to a table definition file
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- DATASET: the BigQuery dataset to contain the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
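As a minimal sketch of a definition file, using hypothetical names throughout, the field names mirror the externalDataConfiguration options used elsewhere on this page; the bq mk invocation is printed for review rather than executed:

```shell
#!/usr/bin/env bash
# Write a hypothetical table definition file; the fields mirror the
# externalDataConfiguration options used elsewhere on this page.
cat > /tmp/deltalake_def.json <<'EOF'
{
  "sourceFormat": "DELTA_LAKE",
  "connectionId": "myproject.us.myconnection",
  "sourceUris": ["gs://mybucket/warehouse/basictable"],
  "autodetect": true
}
EOF

# The bq mk invocation, printed for review rather than executed.
echo bq mk --table \
  --external_table_definition=/tmp/deltalake_def.json \
  myproject:mydataset.basictable
```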
REST
Use the BigQuery API to create a Delta Lake table by calling the tables.insert API method:
REQUEST='{
  "autodetect": true,
  "externalDataConfiguration": {
    "sourceFormat": "DELTA_LAKE",
    "connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
    "sourceUris": [
      "DELTA_TABLE_GCS_BASE_PATH"
    ]
  },
  "tableReference": {
    "tableId": "DELTALAKE_TABLE_NAME"
  }
}'
echo $REQUEST | curl -X POST -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables?autodetect_schema=true
Replace the following values:
- PROJECT_ID: the ID of the project that you want to create the Delta Lake table in
- REGION: the region that contains the connection to create the Delta Lake table (for example, us)
- CONNECTION_ID: the connection ID (for example, myconnection). When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example projects/myproject/locations/connection_location/connections/myconnection.
- DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
- DATASET: the BigQuery dataset to contain the Delta Lake table
When you create Delta Lake tables, the Delta Lake prefix is used as the URI for the table. For example, for a table that has logs in the bucket gs://bucket/warehouse/basictable/_delta_log, the table URI is gs://bucket/warehouse/basictable. When you run queries on the Delta Lake table, BigQuery reads data under the prefix to identify the current version of the table and then computes the metadata and the files for the table.
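In shell terms, the table URI is simply the _delta_log path with its last component stripped; a quick sketch using the example bucket above:

```shell
#!/usr/bin/env bash
# Derive the table URI (the value for uris / sourceUris) from the
# location of a table's _delta_log directory.
LOG_PATH="gs://bucket/warehouse/basictable/_delta_log"
TABLE_URI="${LOG_PATH%/_delta_log}"   # strip the trailing /_delta_log
echo "$TABLE_URI"                     # gs://bucket/warehouse/basictable
```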
Update Delta Lake tables
To update (refresh) the schema of Delta Lake tables, follow these steps.
bq
In a command-line environment, use the bq update command to update (refresh) the schema of the Delta Lake table:
bq update --autodetect_schema PROJECT_ID:DATASET.DELTALAKE_TABLE_NAME
Replace the following values:
- PROJECT_ID: the ID of the project that contains the Delta Lake table
- DATASET: the BigQuery dataset that contains the Delta Lake table
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
REST
Use the BigQuery API to update a Delta Lake table by calling the tables.patch API method:
REQUEST='{
  "externalDataConfiguration": {
    "sourceFormat": "DELTA_LAKE",
    "sourceUris": [
      "DELTA_TABLE_GCS_BASE_PATH"
    ],
    "connectionId": "PROJECT_ID.REGION.CONNECTION_ID",
    "autodetect": true
  },
  "tableReference": {
    "tableId": "DELTALAKE_TABLE_NAME"
  }
}'
echo $REQUEST | curl -X PATCH -d @- -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables/DELTALAKE_TABLE_NAME?autodetect_schema=true
Replace the following values:
- DELTA_TABLE_GCS_BASE_PATH: the Delta Lake table prefix
- PROJECT_ID: the ID of the project that contains the Delta Lake table
- REGION: the region that contains the connection (for example, us)
- CONNECTION_ID: the connection ID (for example, myconnection). When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example projects/myproject/locations/connection_location/connections/myconnection.
- DELTALAKE_TABLE_NAME: the name of your Delta Lake table
- DATASET: the BigQuery dataset that contains the Delta Lake table
Query Delta Lake tables
After creating a Delta Lake BigLake table, you can query it using GoogleSQL syntax, the same as you would a standard BigQuery table. For example:
SELECT field1, field2 FROM mydataset.my_cloud_storage_table;
For more information, see Query Cloud Storage data in BigLake tables.
An external connection associated with a service account is used to connect to the datastore. Because the service account retrieves data from the datastore, users only need access to the Delta Lake table.
Data mapping
BigQuery converts Delta Lake data types to BigQuery data types as shown in the following table:
| Delta Lake type | BigQuery type |
| --- | --- |
| boolean | BOOL |
| byte | INT64 |
| int | INT64 |
| long | INT64 |
| float | FLOAT64 |
| double | FLOAT64 |
| Decimal(P/S) | NUMERIC or BIGNUMERIC depending on precision |
| date | DATE |
| time | TIME |
| timestamp (not partition column) | TIMESTAMP |
| timestamp (partition column) | DATETIME |
| string | STRING |
| binary | BYTES |
| array<Type> | ARRAY<Type> |
| struct | STRUCT |
| map<KeyType, ValueType> | ARRAY<Struct<key KeyType, value ValueType>> |
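The map row in the table means that a Delta Lake map column surfaces in BigQuery as a repeated key/value struct, which you can flatten with UNNEST. A minimal sketch, assuming a hypothetical table mydataset.delta_table with an id column and a map<string, string> column named attributes:

```sql
-- attributes arrives in BigQuery as ARRAY<STRUCT<key STRING, value STRING>>.
SELECT
  t.id,
  kv.key,
  kv.value
FROM mydataset.delta_table AS t, UNNEST(t.attributes) AS kv;
```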
Limitations
Delta Lake tables have BigLake table limitations and also the following limitations:
- Supports Delta Lake reader version 3 with deletion vectors and column mapping.
- Doesn't support Delta Lake V2 checkpoints.
- You must list the reader version in the last log entry file. For example, new tables must include 00000..0.json.
- Change data capture (CDC) operations aren't supported. Any existing CDC operations are ignored.
- The schema is autodetected. Modifying the schema by using BigQuery isn't supported.
- Table column names must adhere to BigQuery column name restrictions.
- Materialized views aren't supported.
- The Read API isn't supported for Delta Lake.