Create Apache Iceberg BigLake tables

With BigLake, you can access Iceberg tables with finer-grained access control. To do so, you must first create an Iceberg BigLake table.

Iceberg is an open source table format that supports petabyte-scale data tables. The Iceberg open specification lets you run multiple query engines on a single copy of data stored in an object store.

As a BigQuery administrator, you can enforce row- and column-level access control, including data masking, on tables. For information about how to set up access control at the table level, see Set up access control policies. Table access policies are also enforced when you use the BigQuery Storage API as a data source for the table in Dataproc and Serverless Spark. BigLake tables provide additional integrations with other BigQuery services. For a full list of available integrations, see Introduction to BigLake tables.

You can create Iceberg BigLake tables in the following ways:

  • With BigLake Metastore (recommended). BigLake Metastore is a custom Iceberg catalog. Using BigLake Metastore is the recommended method because it enables synchronization of tables between Spark and BigQuery workloads. To do this, you can use a BigQuery stored procedure for Apache Spark to initialize BigLake Metastore and create the Iceberg BigLake table. However, schema updates still require you to run an update query in BigQuery. For a full list of limitations, see Limitations.

  • With Iceberg JSON metadata file. If you use an Iceberg JSON metadata file, then you must manually update the latest metadata file whenever there are any table updates. To avoid this, use BigLake Metastore. You can use a BigQuery stored procedure for Apache Spark to create Iceberg BigLake tables that reference an Iceberg metadata file.
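If you take the metadata-file route, every table commit writes a new metadata JSON file to the table's metadata/ directory, and the table definition must be repointed at the newest one. As a rough sketch of finding the newest file (assuming the common Iceberg naming pattern of a numeric version prefix such as 00003-<uuid>.metadata.json; not an official utility):

```python
# Hypothetical sketch: pick the newest Iceberg metadata JSON from a list of
# object names, assuming the common "<version>-<uuid>.metadata.json" pattern.
def latest_metadata(object_names):
    candidates = [n for n in object_names if n.endswith(".metadata.json")]
    # The leading component of the file name is an increasing version number.
    return max(candidates,
               key=lambda n: int(n.rsplit("/", 1)[-1].split("-", 1)[0]))

names = [
    "mytable/metadata/00001-aaaa.metadata.json",
    "mytable/metadata/00002-bbbb.metadata.json",
]
print(latest_metadata(names))  # mytable/metadata/00002-bbbb.metadata.json
```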

Before you begin

  • Enable the BigQuery Connection, BigQuery Reservation, and BigLake APIs.

    Enable the APIs

  • If you use a stored procedure for Spark in BigQuery to create Iceberg BigLake tables, you must follow these steps:

    1. Create a Spark connection.
    2. Set up access control for that connection.
  • To store the Iceberg BigLake table metadata and data files in Cloud Storage, create a Cloud Storage bucket. You need to connect to your Cloud Storage bucket to access metadata files. To do so, follow these steps:

    1. Create a Cloud resource connection.
    2. Set up access for that connection.
  • If you're using BigLake Metastore, install the appropriate Iceberg Custom Catalog for Apache Spark. Select the Custom Catalog version that best matches the Iceberg version you are using.

    1. Iceberg 1.5.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.5.0-0.1.1-with-dependencies.jar
    2. Iceberg 1.2.0: gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar
    3. Iceberg 0.14.0: gs://spark-lib/biglake/biglake-catalog-iceberg0.14.0-0.1.1-with-dependencies.jar

Required roles

To ensure that the caller of the BigLake API has the necessary permissions to create a BigLake table, ask your administrator to grant the caller the following IAM roles on the project:

For more information about granting roles, see Manage access.

These predefined roles contain the permissions required to create a BigLake table. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create a BigLake table:

  • bigquery.tables.create
  • bigquery.connections.delegate
  • bigquery.jobs.create

Your administrator might also be able to give the caller of the BigLake API these permissions with custom roles or other predefined roles.

Additionally, to allow BigQuery users to query the table, the service account associated with the connection must have the BigLake Viewer (roles/biglake.viewer) role and access to the Cloud Storage bucket that contains that data.

To create Iceberg BigLake tables with BigLake Metastore, the caller of the BigLake API changes. You must grant the Dataproc or the Spark service account access to the Cloud Storage bucket that contains that data.

Create tables with BigLake Metastore

We recommend creating Iceberg BigLake tables with BigLake Metastore. You can use Apache Spark to create these tables. A convenient way to do this is to use a BigQuery stored procedure for Spark, by following these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the Explorer pane, click the connection in the project that you used to create the connection resource.

  3. To create a stored procedure for Spark, click Create stored procedure.

  4. In the query editor, modify the sample code for initializing BigLake Metastore and creating an Iceberg BigLake table by using the CREATE PROCEDURE statement that appears:

     # Creates a stored procedure that initializes BLMS and database.
     # Creates a table in the database and populates a few rows of data.
     CREATE OR REPLACE PROCEDURE iceberg_demo.iceberg_setup_3_3 ()
     WITH CONNECTION `PROCEDURE_CONNECTION_PROJECT_ID.PROCEDURE_CONNECTION_REGION.PROCEDURE_CONNECTION_ID`
     OPTIONS(engine="SPARK",
     jar_uris=["gs://spark-lib/biglake/biglake-catalog-iceberg1.2.0-0.1.1-with-dependencies.jar"],
     properties=[
     ("spark.jars.packages","org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0"),
     ("spark.sql.catalog.CATALOG", "org.apache.iceberg.spark.SparkCatalog"),
     ("spark.sql.catalog.CATALOG.catalog-impl", "org.apache.iceberg.gcp.biglake.BigLakeCatalog"),
     ("spark.sql.catalog.CATALOG.hms_uri", "HMS_URI"),
     ("spark.sql.catalog.CATALOG.gcp_project", "PROJECT_ID"),
     ("spark.sql.catalog.CATALOG.gcp_location", "LOCATION"),
     ("spark.sql.catalog.CATALOG.blms_catalog", "CATALOG"),
     ("spark.sql.catalog.CATALOG.warehouse", "DATA_WAREHOUSE_URI")
     ]
     )
     LANGUAGE PYTHON AS R'''
     from pyspark.sql import SparkSession

     spark = SparkSession \
       .builder \
       .appName("BigLake Iceberg Example") \
       .enableHiveSupport() \
       .getOrCreate()

     spark.sql("CREATE NAMESPACE IF NOT EXISTS CATALOG;")
     spark.sql("CREATE DATABASE IF NOT EXISTS CATALOG.CATALOG_DB;")
     spark.sql("DROP TABLE IF EXISTS CATALOG.CATALOG_DB.CATALOG_TABLE;")

     # Create a BigLake Metastore table and a BigQuery Iceberg table.
     spark.sql("""CREATE TABLE IF NOT EXISTS CATALOG.CATALOG_DB.CATALOG_TABLE (id bigint, demo_name string)
               USING iceberg
               TBLPROPERTIES(bq_table='BQ_DATASET.BQ_TABLE', bq_connection='TABLE_CONNECTION_PROJECT_ID.TABLE_CONNECTION_REGION.TABLE_CONNECTION_ID');
               """)

     # Copy a Hive Metastore table to BigLake Metastore. Can be used together
     # with the `bq_table` TBLPROPERTIES option to create a BigQuery Iceberg table.
     spark.sql("""CREATE TABLE CATALOG.CATALOG_DB.CATALOG_TABLE (id bigint, demo_name string)
                USING iceberg
                TBLPROPERTIES(hms_table='HMS_DB.HMS_TABLE');""")
     ''';
    

    Replace the following:

    • PROCEDURE_CONNECTION_PROJECT_ID: the project that contains the connection to run Spark procedures—for example, myproject.

    • PROCEDURE_CONNECTION_REGION: the region that contains the connection to run Spark procedures—for example, us.

    • PROCEDURE_CONNECTION_ID: the connection ID—for example, myconnection.

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

    • CATALOG: the name of the Iceberg catalog to create for BigLake Metastore.

      The default value is iceberg.

    • HMS_URI: if you want to copy existing Hive Metastore tables to BigLake Metastore, specify a Hive Metastore URI.

      For example, thrift://localhost:9083.

    • PROJECT_ID: the project ID in which you want to create the BigLake Metastore instance.

      Iceberg BigLake tables are also created in the same project.

    • LOCATION: the location where you want to create the BigLake Metastore instance.

      BigQuery can only access BigLake Metastore instances that are stored in the same location.

    • DATA_WAREHOUSE_URI: the Cloud Storage bucket URI that you created to store the Iceberg metadata and data files.

      For example, gs://mybucket/iceberg-warehouse.

    • CATALOG_DB: the name of the database that you want to create in BigLake Metastore.

      This database is equivalent to the BigQuery dataset that will contain the Iceberg BigLake table.

    • CATALOG_TABLE: the name of the table that you want to create in BigLake Metastore.

      This table is equivalent to the Iceberg BigLake table that you want to create.

    • BQ_DATASET: the BigQuery dataset to contain the Iceberg BigLake table.

    • BQ_TABLE: the Iceberg BigLake table that you want to create.

    • TABLE_CONNECTION_PROJECT_ID: the project that contains the connection to create the BigLake table—for example, myproject.

    • TABLE_CONNECTION_REGION: the region that contains the connection to create the BigLake table—for example, us.

    • TABLE_CONNECTION_ID: the connection ID—for example, myconnection.

      When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection.

      The service account associated with the connection must have roles/biglake.viewer to allow BigQuery users to query the table.

    • HMS_DB: if you want to copy existing Hive Metastore tables to BigLake Metastore, specify a Hive Metastore database.

    • HMS_TABLE: if you want to copy existing Hive Metastore tables to BigLake Metastore, specify a Hive Metastore table.

    For information about Iceberg catalog configurations, see Spark Catalogs.

  5. To run the stored procedure, click Run. An Iceberg BigLake table is created in BigQuery. For more information, see Call the Spark stored procedure.

Create tables with a metadata file

You can create Iceberg BigLake tables with a JSON metadata file. However, this is not the recommended method because you have to manually update the URI of the JSON metadata file to keep the BigLake table up to date. If the URI is not kept up to date, queries in BigQuery can either fail or provide different results from other query engines that directly use an Iceberg catalog. To avoid this, reference a BigLake Metastore instance when you create an Iceberg BigLake table.

Iceberg table metadata files are created in the Cloud Storage bucket that you specify when you create an Iceberg table using Spark.

Select one of the following options:

SQL

Use the CREATE EXTERNAL TABLE statement. The following example creates a BigLake table named myexternal-table:

  CREATE EXTERNAL TABLE `mydataset.myexternal-table`
  WITH CONNECTION `myproject.us.myconnection`
  OPTIONS (
         format = 'ICEBERG',
         uris = ["gs://mybucket/mydata/mytable/metadata/iceberg.metadata.json"]
   )

Replace the uris value with the latest JSON metadata file for a specific table snapshot.

You can enable the require partition filter by setting the require_partition_filter flag.

bq

In a command-line environment, use the bq mk --table command with the @connection decorator to specify the connection to use at the end of the --external_table_definition parameter. To enable the require partition filter, use --require_partition_filter.

bq mk \
--table \
--external_table_definition=TABLE_FORMAT=URI@projects/CONNECTION_PROJECT_ID/locations/CONNECTION_REGION/connections/CONNECTION_ID \
PROJECT_ID:DATASET.EXTERNAL_TABLE

Replace the following:

  • TABLE_FORMAT: the format of the table that you want to create

    In this case, ICEBERG.

  • URI: the latest JSON metadata file for a specific table snapshot.

    For example, gs://mybucket/mydata/mytable/metadata/iceberg.metadata.json.

  • CONNECTION_PROJECT_ID: the project that contains the connection to create the BigLake table—for example, myproject

  • CONNECTION_REGION: the region that contains the connection to create the BigLake table—for example, us

  • CONNECTION_ID: the table connection ID—for example, myconnection

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection

  • DATASET: the name of the BigQuery dataset that you want to create a table in

    For example, mydataset.

  • EXTERNAL_TABLE: the name of the table that you want to create

    For example, mytable.

Update table metadata

If you use a JSON metadata file to create Iceberg BigLake tables, then update the table definition to point to the latest table metadata whenever the table changes. To update the schema or the metadata file, select one of the following options:

bq

  1. Create a table definition file:

    bq mkdef --source_format=ICEBERG \
    "URI" > TABLE_DEFINITION_FILE
    
  2. Use the bq update command with the --autodetect_schema flag:

    bq update --autodetect_schema --external_table_definition=TABLE_DEFINITION_FILE \
    PROJECT_ID:DATASET.TABLE
    

    Replace the following:

    • URI: your Cloud Storage URI with the latest JSON metadata file

      For example, gs://mybucket/us/iceberg/mytable/metadata/1234.metadata.json.

    • TABLE_DEFINITION_FILE: the name of the file containing the table schema

    • PROJECT_ID: the project ID containing the table that you want to update

    • DATASET: the dataset containing the table that you want to update

    • TABLE: the table that you want to update

API

Use the tables.patch method with the autodetect_schema property set to true:

PATCH https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET/tables/TABLE?autodetect_schema=true

Replace the following:

  • PROJECT_ID: the project ID that contains the table that you want to update
  • DATASET: the dataset containing the table that you want to update
  • TABLE: the table that you want to update

In the body of the request, specify the updated values for the following fields:

{
     "externalDataConfiguration": {
      "sourceFormat": "ICEBERG",
      "sourceUris": [
        "URI"
      ]
    },
    "schema": null
  }

Replace URI with the latest Iceberg metadata file. For example, gs://mybucket/us/iceberg/mytable/metadata/1234.metadata.json.
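As a sketch, the tables.patch request above can be assembled programmatically before it is sent (authentication and the actual HTTP call are omitted; the helper name is hypothetical):

```python
import json

def build_patch_request(project_id, dataset, table, metadata_uri):
    """Build the tables.patch URL and body that repoint an Iceberg BigLake
    table at the latest metadata file, with schema autodetection enabled."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/datasets/{dataset}/tables/{table}"
        "?autodetect_schema=true"
    )
    body = {
        "externalDataConfiguration": {
            "sourceFormat": "ICEBERG",
            "sourceUris": [metadata_uri],
        },
        "schema": None,  # null schema so BigQuery re-detects it
    }
    return url, json.dumps(body)

url, payload = build_patch_request(
    "myproject", "mydataset", "mytable",
    "gs://mybucket/us/iceberg/mytable/metadata/1234.metadata.json",
)
print(url)
```

The built URL and payload match the PATCH example and request body shown above.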

Set up access control policies

You can use several methods to control access to BigLake tables.

For example, let's say you want to limit row access for the table mytable in the dataset mydataset:

+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US      | phone   |   100 |
| JP      | tablet  |   300 |
| UK      | laptop  |   200 |
+---------+---------+-------+

You can create a row-level filter for Kim (kim@example.com) that restricts their access to rows where country is equal to US.

CREATE ROW ACCESS POLICY only_us_filter
ON mydataset.mytable
GRANT TO ('user:kim@example.com')
FILTER USING (country = 'US');

Then, Kim runs the following query:

SELECT * FROM projectid.mydataset.mytable;

The output shows only the rows where country is equal to US:

+---------+---------+-------+
| country | product | price |
+---------+---------+-------+
| US      | phone   |   100 |
+---------+---------+-------+
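The effect of the row access policy can be pictured as a filter applied before results are returned. A toy simulation of the semantics (not how BigQuery implements policies internally):

```python
# Toy model of a row access policy: a policy grants a set of principals
# access to only the rows that match its predicate.
rows = [
    {"country": "US", "product": "phone", "price": 100},
    {"country": "JP", "product": "tablet", "price": 300},
    {"country": "UK", "product": "laptop", "price": 200},
]

def apply_row_policy(rows, user, grantees, predicate):
    """Return only the rows the user may see under the policy."""
    if user not in grantees:
        # No policy grants this user access, so no rows are visible.
        return []
    return [r for r in rows if predicate(r)]

visible = apply_row_policy(
    rows,
    user="kim@example.com",
    grantees={"kim@example.com"},
    predicate=lambda r: r["country"] == "US",
)
print(visible)  # only the US row
```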

Query BigLake tables

For more information, see Query Iceberg data.

Data mapping

BigQuery converts Iceberg data types to BigQuery data types as shown in the following table:

+-------------------------+----------------------------------------------+
| Iceberg data type       | BigQuery data type                           |
+-------------------------+----------------------------------------------+
| boolean                 | BOOL                                         |
| int                     | INT64                                        |
| long                    | INT64                                        |
| float                   | FLOAT64                                      |
| double                  | FLOAT64                                      |
| decimal(P, S)           | NUMERIC or BIGNUMERIC depending on precision |
| date                    | DATE                                         |
| time                    | TIME                                         |
| timestamp               | DATETIME                                     |
| timestamptz             | TIMESTAMP                                    |
| string                  | STRING                                       |
| uuid                    | BYTES                                        |
| fixed(L)                | BYTES                                        |
| binary                  | BYTES                                        |
| list<Type>              | ARRAY<Type>                                  |
| struct                  | STRUCT                                       |
| map<KeyType, ValueType> | ARRAY<STRUCT<key KeyType, value ValueType>>  |
+-------------------------+----------------------------------------------+
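For the primitive types, this mapping is a direct lookup; a minimal sketch (decimal, fixed, list, struct, and map are parameterized and omitted here):

```python
# Iceberg primitive type -> BigQuery type, following the table above.
ICEBERG_TO_BIGQUERY = {
    "boolean": "BOOL",
    "int": "INT64",
    "long": "INT64",
    "float": "FLOAT64",
    "double": "FLOAT64",
    "date": "DATE",
    "time": "TIME",
    "timestamp": "DATETIME",     # without time zone
    "timestamptz": "TIMESTAMP",  # with time zone
    "string": "STRING",
    "uuid": "BYTES",
    "binary": "BYTES",
}

print(ICEBERG_TO_BIGQUERY["timestamptz"])  # TIMESTAMP
```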

Limitations

Iceberg BigLake tables have BigLake table limitations and also the following limitations:

  • The copy-on-write configuration is supported, but the merge-on-read configuration is not supported. For more information, see Iceberg configuration.

  • BigQuery supports manifest pruning using all Iceberg partition transformation functions except for Bucket. For information about how to prune partitions, see Query partitioned tables. To benefit from pruning, queries that reference Iceberg BigLake tables must compare partitioned columns against literals in their predicates.

  • Creating Iceberg BigLake tables in the BigQuery Omni Azure region (azure-eastus2) is not supported.

  • Only Apache Parquet data files are supported.

  • All Iceberg data files must have the field_id property set in their metadata to associate the columns with the Iceberg schema. Tables with the schema.name-mapping.default property are not supported.

  • Hive tables migrated to Iceberg tables using Iceberg's Spark stored procedures are not supported.

  • If you are using BigLake Metastore, the following limitations apply:

    • BigLake Metastore is not supported in BigQuery Omni regions.
    • When renaming a table, the destination table must be in the same database as the source table. The database of the destination table must be specified explicitly.
    • When inspecting an Iceberg metadata table, you must use a fully qualified table name. For example, prod.db.table.history.

Costs

You will be charged for 1 TB at on-demand (per TB) query pricing for every 6,250,000 requests to BigLake Metastore and for every 625,000 objects stored in BigLake Metastore. On-demand query pricing rates vary by region. For smaller numbers of requests or objects, you will be charged the appropriate fraction of 1 TB.

For example, if you made 6,250,000 requests to BigLake Metastore and also stored 312,500 objects in it, you would be charged for 1.5 TB at the on-demand query pricing rate for the region in which you created the BigLake Metastore instance.
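The billing rule reduces to a simple pro-rated formula. A sketch that reproduces the worked example (this models only the metastore charge described here, not other BigQuery costs):

```python
def metastore_billed_tb(requests: int, objects_stored: int) -> float:
    """TB billed at on-demand query pricing: 1 TB per 6,250,000 requests
    plus 1 TB per 625,000 stored objects, pro-rated for smaller amounts."""
    return requests / 6_250_000 + objects_stored / 625_000

# The worked example: 6,250,000 requests and 312,500 stored objects.
print(metastore_billed_tb(6_250_000, 312_500))  # 1.5
```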

Require partition filter

You can require the use of predicate filters by enabling the require partition filter option for your Iceberg table. If you enable this option, attempts to query the table without specifying a WHERE clause that aligns with each manifest file will produce the following error:

Cannot query over table project_id.dataset.table without a
filter that can be used for partition elimination.

Each manifest file requires at least one predicate suitable for partition elimination.
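Conceptually, a literal in the predicate lets the engine discard any manifest file whose partition-value range cannot match. A toy illustration of this idea (not BigQuery's actual pruning logic; the field names are made up):

```python
# Each manifest tracks the min/max partition value of the data files it lists.
manifests = [
    {"name": "m0.avro", "min_day": "2024-01-01", "max_day": "2024-01-31"},
    {"name": "m1.avro", "min_day": "2024-02-01", "max_day": "2024-02-29"},
]

def prune(manifests, literal):
    """Keep only manifests whose partition range can contain the literal."""
    return [m for m in manifests if m["min_day"] <= literal <= m["max_day"]]

kept = prune(manifests, "2024-02-15")
print([m["name"] for m in kept])  # ['m1.avro']
```

A predicate without a literal (for example, one comparing two columns) gives the engine no value to test against these ranges, which is why the filter requirement exists.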

You can enable the require_partition_filter option in the following ways when you create an Iceberg table:

SQL

Use the CREATE EXTERNAL TABLE statement. The following example creates a BigLake table named TABLE with the require partition filter enabled:

  CREATE EXTERNAL TABLE TABLE
  WITH CONNECTION `PROJECT_ID.REGION.CONNECTION_ID`
  OPTIONS (
         format = 'ICEBERG',
         uris = [URI],
         require_partition_filter = true
   )

Replace the following:

  • TABLE: the table name that you want to create.
  • PROJECT_ID: the project that contains the connection. For example, myproject.
  • REGION: the region that contains the connection. For example, us.
  • CONNECTION_ID: the connection ID. For example, myconnection.

  • URI: the Cloud Storage URI with the latest JSON metadata file.

    For example, gs://mybucket/us/iceberg/mytable/metadata/1234.metadata.json.

bq

Use the bq mk --table command with the @connection decorator to specify the connection to use at the end of the --external_table_definition parameter. Use --require_partition_filter to enable the require partition filter. The following example creates a BigLake table named TABLE with require partition filter enabled:

bq mk \
    --table \
    --external_table_definition=ICEBERG=URI@projects/CONNECTION_PROJECT_ID/locations/CONNECTION_REGION/connections/CONNECTION_ID \
    PROJECT_ID:DATASET.EXTERNAL_TABLE \
    --require_partition_filter

Replace the following:

  • URI: the latest JSON metadata file for a specific table snapshot

    For example, gs://mybucket/mydata/mytable/metadata/iceberg.metadata.json.

  • CONNECTION_PROJECT_ID: the project that contains the connection to create the BigLake table—for example, myproject

  • CONNECTION_REGION: the region that contains the connection to create the BigLake table. For example, us.

  • CONNECTION_ID: the connection ID. For example, myconnection.

    When you view the connection details in the Google Cloud console, the connection ID is the value in the last section of the fully qualified connection ID that is shown in Connection ID—for example projects/myproject/locations/connection_location/connections/myconnection

  • DATASET: the name of the BigQuery dataset that you want to create a table in. For example, mydataset.

  • EXTERNAL_TABLE: the name of the table that you want to create

    For example, mytable.

You can also update an existing Iceberg table to enable the require partition filter. If you didn't enable the option when you created the partitioned table, you can update the table to add it.

bq

Use the bq update command and supply the --require_partition_filter flag.

For example:

To update mypartitionedtable in mydataset in your default project, enter:

bq update --require_partition_filter mydataset.mypartitionedtable

What's next