BigQuery tables for Apache Iceberg

For support during the preview, email bigquery-tables-for-apache-iceberg-help@google.com.

BigQuery tables for Apache Iceberg, (hereafter, Iceberg tables) provide the foundation for building open-format lakehouses on Google Cloud. Iceberg tables offer the same fully managed experience as BigQuery tables, but store data in customer-owned storage buckets using Parquet to be interoperable with Iceberg open table formats.

BigQuery tables for Apache Iceberg are distinct from BigLake external tables for Apache Iceberg because only BigQuery tables for Apache Iceberg are modifiable directly within BigQuery. BigLake external tables for Apache Iceberg are read-only tables generated from another query engine, such as Apache Spark, and can only be queried using BigQuery.

Iceberg tables support the following features:

Architecture

Iceberg tables bring the convenience of BigQuery resource management to tables that reside in your own cloud buckets. Iceberg tables let you use BigQuery on these tables without moving the data out of buckets that you control.

The following diagram shows the managed table architecture at a high level: BigQuery for Iceberg tables architecture diagram.

This table management has the following implications on your bucket:

  • BigQuery creates new data files in the bucket in response to write requests and background storage optimizations, such as DML statements and streaming.
  • When you delete a managed table in BigQuery, BigQuery doesn't delete associated data files. You must confirm the deletion by deleting the files and any exported table metadata from the bucket manually.
  • Iceberg tables incur no BigQuery storage costs. For more information, see Billing.

Creating an Iceberg table is similar to creating BigQuery tables. Because it stores data in open formats on Cloud Storage, it has more options for the following:

  • Specify the Cloud resource connection with WITH CONNECTION to configure the connection credentials for BigLake to access Cloud Storage.
  • Specifying the file format of data storage with file_format. PARQUET is supported in Preview.
  • Specifying the open-source metadata table format with table_format. ICEBERG is supported in Preview.

Best practices

Directly changing or adding files to the bucket outside of BigQuery can lead to data loss or unrecoverable errors. The following table describes possible scenarios:

Operation Consequences Prevention
Add new files to the bucket outside BigQuery. Data loss: New files or objects added outside of BigQuery are not tracked by BigQuery. Untracked files are deleted by background garbage collection processes. Add data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected.
To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables.
Create a new Iceberg table in a non-empty prefix. Data loss: Extant data isn't tracked by BigQuery, so these files are considered untracked, and deleted by background garbage collection processes. Only create new Iceberg tables in empty prefixes.
Modify or replace Iceberg table data files. Data loss: On external modification or replacement, the table fails a consistency check and becomes unreadable. Queries against the table fail.
There is no self-serve way to recover from this point. Contact support for data recovery assistance.
Modify data exclusively through BigQuery. This lets BigQuery track the files and prevent them from being garbage collected.
To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables.
Create two BigQuery tables for Apache Iceberg on the same or overlapping URIs. Data loss: BigQuery doesn't bridge identical URI instances of Iceberg tables. Background garbage collection processes for each table will consider the opposite table's files as untracked, and delete them, causing data loss. Use unique URIs for each Iceberg table.

Location consideration

You can improve performance by using Cloud Storage single-region or dual-region buckets instead of multi-region buckets.

Billing

The following features are charged using the existing published pricing:

  • Cloud Storage pricing for all data stored in Cloud Storage buckets, data processing done by Cloud Storage, and network usage for the amount of data read from your bucket.
  • BigQuery compute pricing for queries, DML, and background storage optimization (including clustering, coalescence, and garbage collection).
    • Charges using reservations (slots) follow existing slot pricing.
    • Charges using the on-demand stock-keeping units (SKUs) follow existing on-demand pricing. For more information, see BigLake costs.
  • Batch load and Extract compute are charged using either on-demand SKUs or reservations (slots).
  • Storage Read API pricing for reading from Spark through the Read API.
  • Storage Write API pricing for streaming.

Iceberg table workflows

The following sections describe how to create, load, manage, and query managed tables.

Before you begin

Before creating and using Iceberg tables, ensure that you have set up a cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section.

Required roles

To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to let BigQuery manage tables in your project:

  • bigquery.connections.delegate on your project
  • bigquery.jobs.create on your project
  • bigquery.readsessions.create on your project
  • bigquery.tables.create on your project
  • bigquery.tables.get on your project
  • bigquery.tables.getData on your project
  • storage.buckets.get on your project
  • storage.objects.create on your project
  • storage.objects.delete on your project
  • storage.objects.get on your project
  • storage.objects.list on your project

You might also be able to get these permissions with custom roles or other predefined roles.

Create Iceberg tables

To create a Iceberg table, select one of the following methods:

SQL

CREATE TABLE [PROJECT_NAME.]DATASET_NAME.TABLE_NAME (
COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION CONNECTION_NAME
OPTIONS (
file_format = 'PARQUET',
table_format = 'ICEBERG',
storage_uri = 'STORAGE_URI');

Replace the following:

  • PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
  • DATASET_NAME: an existing dataset.
  • TABLE_NAME: the name of the table you're creating.
  • DATA_TYPE: the data type of the information that is contained in the column.
  • CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
  • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.

bq

bq --project_id=PROJECT_NAME mk \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    MANAGED_TABLE_NAME

Replace the following:

  • PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
  • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
  • COLUMN_NAME: the column name.
  • DATA_TYPE: the data type of the information contained in the column.
  • CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • MANAGED_TABLE_NAME: the name of the table you're creating.

API

Call the tables.insert' method with a defined table resource, similar to the following:

{
"tableReference": {
  "tableId": "TABLE_NAME"
},
"biglakeConfiguration": {
  "connectionId": "CONNECTION_NAME",
  "fileFormat": "PARQUET",
  "tableFormat": "ICEBERG",
  "storageUri": "STORAGE_URI"
},
"schema": {
  "fields": [
    {
      "name": "COLUMN_NAME",
      "type": "DATA_TYPE"
    }
    [, ...]
  ]
}
}

Replace the following:

  • TABLE_NAME: the name of the table that you're creating.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
  • STORAGE_URI: a fully qualified Cloud Storage URI. Wildcards are also supported. For example, gs://mybucket/table.
  • COLUMN_NAME: the column name.
  • DATA_TYPE: the data type of the information contained in the column.

Import data into Iceberg table

The following sections describe how to import data from various table formats into Iceberg tables.

Fast load from Parquet files

The copy_files_only option lets you load data faster by copying your existing Parquet files, instead of reading the content and rewriting the content as new files. Fast load uses less compute capacity compared to a regular file load. Parquet files must be compatible with the Apache Iceberg specification and have complete column statistics. Fast load doesn't detect invalid values (such as out-of-range timestamps) in files because the files are not read and reprocessed. For more information on loading Parquet files, see Loading Parquet data into a new table.

To fast load flat Parquet files into an existing Iceberg table, use the bq load command:

bq load \
    --copy_files_only \
    --source_format=PARQUET \
    DATASET_NAME.TABLE_NAME \
    PATH_TO_SOURCE

Replace the following:

  • DATASET_NAME: the dataset containing your Iceberg table.
  • TABLE_NAME: the name of the Iceberg table into which you're loading data.
  • PATH_TO_SOURCE: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/mydata*.parquet.

Standard load data from flat files

Iceberg tables use BigQuery load jobs to load external files into Iceberg tables. If you have an existing Iceberg table, follow the bq load CLI guide or the LOAD SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.

If the prior instructions are used without an existing Iceberg table, a BigQuery table is created instead.

See the following for tool-specific examples of batch loads into managed tables:

SQL

LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
uris=['STORAGE_URI'],
format='FILE_FORMAT');

Replace the following:

  • MANAGED_TABLE_NAME: the name of an existing Iceberg table.
  • STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
  • FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.

bq

bq load \
  --source_format=FILE_FORMAT \
  MANAGED_TABLE \
  STORAGE_URI

Replace the following:

  • FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
  • MANAGED_TABLE_NAME: the name of an existing Iceberg table.
  • STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.

Standard load from Hive-partitioned files

You can load Hive-partitioned files into Iceberg tables using standard BigQuery load jobs. For more information, see Loading externally partitioned data.

Load streaming data from Pub/Sub

You can load streaming data into Iceberg tables by using a Pub/Sub BigQuery subscription.

Export data from Iceberg tables

The following sections describe how to export data from Iceberg tables into various table formats.

Export data into flat formats

To export a Iceberg table into a flat format, use the EXPORT DATA statement and select a destination format. For more information, see Exporting data.

Create Iceberg table metadata snapshots

To create an Iceberg table metadata snapshot, follow these steps:

  1. Export the metadata into the Iceberg format with the EXPORT TABLE METADATA SQL statement.

  2. Optional: Schedule Iceberg metadata snapshot refresh. To refresh an Iceberg metadata snapshot based on a set time interval, use a scheduled query.

The following example creates a scheduled query named My Scheduled Snapshot Refresh Query using the DDL statement EXPORT TABLE METADATA FROM mydataset.test. The destination dataset is mydataset. The DDL statement runs every 24 hours.

        bq query \
            --use_legacy_sql=false \
            --destination_dataset=mydataset
            --display_name='My Scheduled Snapshot Refresh Query' \
            --schedule='every 24 hours' \
            'EXPORT TABLE METADATA FROM mydataset.test'

View Iceberg table metadata snapshot

After you refresh the Iceberg table metadata snapshot you can find the snapshot in the Cloud Storage URI that the Iceberg table was originally created in. The /data folder contains the Parquet file data shards, and the /metadata folder contains the Iceberg table metadata snapshot.

  SELECT
    table_name,
    REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri
  FROM
    `mydataset`.INFORMATION_SCHEMA.TABLES;

Note that mydataset and table_name are placeholders for your actual dataset and table.

Read Iceberg tables with Apache Spark

Setup and read table data in the Apache Spark with HadoopCatalog.

The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified Iceberg table.

   spark-sql 
--packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER
--conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.CATALOG_NAME.type=hadoop
--conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH' \

# Queries the table spark-sql> SELECT * FROM CATALOG_NAME.FOLDER_NAME;

Replace the following:

  • ICEBERG_VERSION_NUMBER: the current version of Apache Spark Iceberg runtime. Download the latest version from Spark Releases.
  • CATALOG_NAME: the catalog to reference your Iceberg table.
  • BUCKET_PATH: the path to the bucket containing the table files. For example, gs://mybucket/.
  • FOLDER_NAME: the folder containing the table files. For example, myfolder.

Modify Iceberg tables

To modify a Iceberg table, follow the steps shown in Modifying table schemas.

Pricing

Iceberg table pricing consists of three separate components:

Storage

Iceberg table store all data in Cloud Storage. You are charged for all data stored, including historical table data. Cloud Storage data processing and transfer charges may also apply, as applicable. There are no BigQuery-specific storage fees. For more information, see Cloud Storage Pricing.

Storage optimization

Iceberg tables require storage optimization operations, such as file coalescing and re-clustering. These optimization operations use Enterprise edition pay as you go slots, and don't use existing BACKGROUND reservations.

Data export operations taking place while streaming through the BigQuery Storage Write API are included in Storage Write API pricing and are not charged as background maintenance. For more information, see Data ingestion pricing.

Storage optimization usage is visible in the INFORMATION_SCHEMA.JOBS view.

Queries and jobs

Similar to BigQuery tables, you are charged for queries and bytes read (per TiB) if you are using BigQuery on-demand pricing, or slot consumption (per slot hour) if you are using BigQuery capacity compute pricing.

BigQuery pricing also applies to the BigQuery Storage Read API and the BigQuery Storage Write API.

Load and export operations (such as EXPORT METADATA) use Enterprise edition pay as you go slots. This differs from BigQuery tables, which are not charged for these operations. If PIPELINE reservations with Enterprise or Enterprise Plus slots are available, load and export operations preferentially use these reservation slots instead.

Limitations

Iceberg tables have the following limitations: