BigQuery tables for Apache Iceberg
For support during the preview, email bigquery-tables-for-apache-iceberg-help@google.com.
BigQuery tables for Apache Iceberg, (hereafter, Iceberg tables) provide the foundation for building open-format lakehouses on Google Cloud. Iceberg tables offer the same fully managed experience as BigQuery tables, but store data in customer-owned storage buckets using Parquet to be interoperable with Iceberg open table formats.
BigQuery tables for Apache Iceberg are distinct from BigLake external tables for Apache Iceberg because only BigQuery tables for Apache Iceberg are modifiable directly within BigQuery. BigLake external tables for Apache Iceberg are read-only tables generated from another query engine, such as Apache Spark, and can only be queried using BigQuery.
Iceberg tables support the following features:
- Table mutations using GoogleSQL data manipulation language (DML).
- Unified batch and high throughput streaming using the Storage Write API through BigLake connectors like Spark, Dataflow, and other engines.
- Schema evolution, which lets you add, drop, and rename columns to suit your needs. This feature also lets you change an existing column's data type and column mode. For more information, see type conversion rules.
- Automatic storage optimization, including adaptive file sizing, automatic clustering, garbage collection, and metadata optimization.
- Column-level security and data masking.
Architecture
Iceberg tables bring the convenience of BigQuery resource management to tables that reside in your own cloud buckets. Iceberg tables let you use BigQuery on these tables without moving the data out of buckets that you control.
The following diagram shows the managed table architecture at a high level:
This table management has the following implications on your bucket:
- BigQuery creates new data files in the bucket in response to write requests and background storage optimizations, such as DML statements and streaming.
- When you delete a managed table in BigQuery, BigQuery doesn't delete associated data files. You must confirm the deletion by deleting the files and any exported table metadata from the bucket manually.
- Iceberg tables incur no BigQuery storage costs. For more information, see Billing.
Creating an Iceberg table is similar to creating BigQuery tables. Because it stores data in open formats on Cloud Storage, it has more options for the following:
- Specify the
Cloud resource connection
with
WITH CONNECTION
to configure the connection credentials for BigLake to access Cloud Storage. - Specifying the file format of data storage with
file_format
.PARQUET
is supported in Preview. - Specifying the open-source metadata table format with
table_format
.ICEBERG
is supported in Preview.
Best practices
Directly changing or adding files to the bucket outside of BigQuery can lead to data loss or unrecoverable errors. The following table describes possible scenarios:
Operation | Consequences | Prevention |
---|---|---|
Add new files to the bucket outside BigQuery. | Data loss: New files or objects added outside of BigQuery are not tracked by BigQuery. Untracked files are deleted by background garbage collection processes. | Add data exclusively through BigQuery. This lets
BigQuery track the files and prevent them from being
garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables. |
Create a new Iceberg table in a non-empty prefix. | Data loss: Extant data isn't tracked by BigQuery, so these files are considered untracked, and deleted by background garbage collection processes. | Only create new Iceberg tables in empty prefixes. |
Modify or replace Iceberg table data files. | Data loss: On external modification or replacement,
the table fails a consistency check and becomes unreadable. Queries
against the table fail. There is no self-serve way to recover from this point. Contact support for data recovery assistance. |
Modify data exclusively through BigQuery. This lets
BigQuery track the files and prevent them from being
garbage collected. To prevent accidental additions and data loss, we also recommend restricting external tool write permissions on buckets containing Iceberg tables. |
Create two BigQuery tables for Apache Iceberg on the same or overlapping URIs. | Data loss: BigQuery doesn't bridge identical URI instances of Iceberg tables. Background garbage collection processes for each table will consider the opposite table's files as untracked, and delete them, causing data loss. | Use unique URIs for each Iceberg table. |
Location consideration
You can improve performance by using Cloud Storage single-region or dual-region buckets instead of multi-region buckets.
Billing
The following features are charged using the existing published pricing:
- Cloud Storage pricing for all data stored in Cloud Storage buckets, data processing done by Cloud Storage, and network usage for the amount of data read from your bucket.
- BigQuery compute pricing
for queries, DML, and background storage optimization (including
clustering, coalescence, and garbage collection).
- Charges using reservations (slots) follow existing slot pricing.
- Charges using the on-demand stock-keeping units (SKUs) follow existing on-demand pricing. For more information, see BigLake costs.
- Batch load and Extract compute are charged using either on-demand SKUs or reservations (slots).
- Storage Read API pricing for reading from Spark through the Read API.
- Storage Write API pricing for streaming.
Iceberg table workflows
The following sections describe how to create, load, manage, and query managed tables.
Before you begin
Before creating and using Iceberg tables, ensure that you have set up a cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section.
Required roles
To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:
-
To create Iceberg tables:
-
BigQuery Data Owner (
roles/bigquery.dataOwner
) on your project -
BigQuery Connection Admin (
roles/bigquery.connectionAdmin
) on your project
-
BigQuery Data Owner (
-
To query Iceberg tables:
-
BigQuery Data Viewer (
roles/bigquery.dataViewer
) on your project -
BigQuery User (
roles/bigquery.user
) on your project
-
BigQuery Data Viewer (
-
For the connection service account to read and write data in Cloud Storage:
-
Storage Object Admin (
roles/storage.objectAdmin
) on the bucket -
Storage Legacy Bucket Reader (
roles/storage.legacyBucketReader
) on the bucket
-
Storage Object Admin (
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to let BigQuery manage tables in your project:
-
bigquery.connections.delegate
on your project -
bigquery.jobs.create
on your project -
bigquery.readsessions.create
on your project -
bigquery.tables.create
on your project -
bigquery.tables.get
on your project -
bigquery.tables.getData
on your project -
storage.buckets.get
on your project -
storage.objects.create
on your project -
storage.objects.delete
on your project -
storage.objects.get
on your project -
storage.objects.list
on your project
You might also be able to get these permissions with custom roles or other predefined roles.
Create Iceberg tables
To create a Iceberg table, select one of the following methods:
SQL
CREATE TABLE [PROJECT_NAME.]DATASET_NAME.TABLE_NAME ( COLUMN DATA_TYPE[, ...] ) CLUSTER BY CLUSTER_COLUMN_LIST WITH CONNECTION CONNECTION_NAME OPTIONS ( file_format = 'PARQUET', table_format = 'ICEBERG', storage_uri = 'STORAGE_URI');
Replace the following:
- PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
- DATASET_NAME: an existing dataset.
- TABLE_NAME: the name of the table you're creating.
- DATA_TYPE: the data type of the information that is contained in the column.
- CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- CONNECTION_NAME: the name of the connection. For
example,
myproject.us.myconnection
. - STORAGE_URI: a fully qualified
Cloud Storage URI.
For example,
gs://mybucket/table
.
bq
bq --project_id=PROJECT_NAME mk \ --file_format=PARQUET \ --table_format=ICEBERG \ --connection_id=CONNECTION_NAME \ --storage_uri=STORAGE_URI \ --schema=COLUMN_NAME:DATA_TYPE[, ...] \ --clustering_fields=CLUSTER_COLUMN_LIST \ MANAGED_TABLE_NAME
Replace the following:
- PROJECT_NAME: the project containing the dataset. If undefined, the command assumes the default project.
- CONNECTION_NAME: the name of the connection. For
example,
myproject.us.myconnection
. - STORAGE_URI: a fully qualified
Cloud Storage URI.
For example,
gs://mybucket/table
. - COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
- CLUSTER_COLUMN_LIST: a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- MANAGED_TABLE_NAME: the name of the table you're creating.
API
Call the
tables.insert
'
method with a defined
table resource, similar to the
following:
{ "tableReference": { "tableId": "TABLE_NAME" }, "biglakeConfiguration": { "connectionId": "CONNECTION_NAME", "fileFormat": "PARQUET", "tableFormat": "ICEBERG", "storageUri": "STORAGE_URI" }, "schema": { "fields": [ { "name": "COLUMN_NAME", "type": "DATA_TYPE" } [, ...] ] } }
Replace the following:
- TABLE_NAME: the name of the table that you're creating.
- CONNECTION_NAME: the name of the connection. For
example,
myproject.us.myconnection
. - STORAGE_URI: a fully qualified
Cloud Storage URI.
Wildcards
are also supported. For example,
gs://mybucket/table
. - COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
Import data into Iceberg table
The following sections describe how to import data from various table formats into Iceberg tables.
Fast load from Parquet files
The copy_files_only
option lets you load data faster by copying your existing
Parquet files, instead of reading the content and rewriting the content as new
files. Fast load uses less compute capacity compared to a regular file load.
Parquet files must be compatible with the
Apache Iceberg specification
and have complete column statistics. Fast load doesn't detect invalid values
(such as out-of-range timestamps) in files because the files
are not read and reprocessed. For more information on loading Parquet files,
see
Loading Parquet data into a new table.
To fast load flat Parquet files into an existing Iceberg
table, use the
bq load
command:
bq load \ --copy_files_only \ --source_format=PARQUET \ DATASET_NAME.TABLE_NAME \ PATH_TO_SOURCE
Replace the following:
- DATASET_NAME: the dataset containing your Iceberg table.
- TABLE_NAME: the name of the Iceberg table into which you're loading data.
- PATH_TO_SOURCE: a fully qualified
Cloud Storage URI
or a comma-separated list of URIs.
Wildcards
are also supported. For example,
gs://mybucket/mydata*.parquet
.
Standard load data from flat files
Iceberg tables use BigQuery load jobs to
load external files
into Iceberg tables. If you have an existing
Iceberg table, follow
the bq load
CLI guide
or the
LOAD
SQL guide
to load external data. After loading the data, new Parquet files are written
into the STORAGE_URI/data
folder.
If the prior instructions are used without an existing Iceberg table, a BigQuery table is created instead.
See the following for tool-specific examples of batch loads into managed tables:
SQL
LOAD DATA INTO MANAGED_TABLE_NAME FROM FILES ( uris=['STORAGE_URI'], format='FILE_FORMAT');
Replace the following:
- MANAGED_TABLE_NAME: the name of an existing Iceberg table.
- STORAGE_URI: a fully qualified
Cloud Storage URI
or a comma-separated list of URIs.
Wildcards
are also supported. For example,
gs://mybucket/table
. - FILE_FORMAT: the source table format. For supported formats,
see the
format
row ofload_option_list
.
bq
bq load \ --source_format=FILE_FORMAT \ MANAGED_TABLE \ STORAGE_URI
Replace the following:
- FILE_FORMAT: the source table format. For supported formats,
see the
format
row ofload_option_list
. - MANAGED_TABLE_NAME: the name of an existing Iceberg table.
- STORAGE_URI: a fully qualified
Cloud Storage URI
or a comma-separated list of URIs.
Wildcards
are also supported. For example,
gs://mybucket/table
.
Standard load from Hive-partitioned files
You can load Hive-partitioned files into Iceberg tables using standard BigQuery load jobs. For more information, see Loading externally partitioned data.
Load streaming data from Pub/Sub
You can load streaming data into Iceberg tables by using a Pub/Sub BigQuery subscription.
Export data from Iceberg tables
The following sections describe how to export data from Iceberg tables into various table formats.
Export data into flat formats
To export a Iceberg table into a flat format, use the
EXPORT DATA
statement
and select a destination format. For more information, see
Exporting data.
Create Iceberg table metadata snapshots
To create an Iceberg table metadata snapshot, follow these steps:
Export the metadata into the Iceberg format with the
EXPORT TABLE METADATA
SQL statement.Optional: Schedule Iceberg metadata snapshot refresh. To refresh an Iceberg metadata snapshot based on a set time interval, use a scheduled query.
The following example creates a scheduled query named My Scheduled Snapshot Refresh Query
using the DDL statement EXPORT TABLE METADATA FROM mydataset.test
. The
destination dataset is mydataset
. The DDL statement runs every 24 hours.
bq query \
--use_legacy_sql=false \
--destination_dataset=mydataset
--display_name='My Scheduled Snapshot Refresh Query' \
--schedule='every 24 hours' \
'EXPORT TABLE METADATA FROM mydataset.test'
View Iceberg table metadata snapshot
After you refresh the Iceberg table metadata snapshot you can
find the snapshot in the Cloud Storage URI
that the Iceberg table was originally created in. The /data
folder contains the Parquet file data shards, and the /metadata
folder contains
the Iceberg table metadata snapshot.
SELECT table_name, REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri FROM `mydataset`.INFORMATION_SCHEMA.TABLES;
Note that mydataset
and table_name
are placeholders for your actual dataset and table.
Read Iceberg tables with Apache Spark
Setup and read table data in the Apache Spark with
HadoopCatalog
.
The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified Iceberg table.
spark-sql
--packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER
--conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.CATALOG_NAME.type=hadoop
--conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH' \# Queries the table spark-sql> SELECT * FROM CATALOG_NAME.FOLDER_NAME;
Replace the following:
- ICEBERG_VERSION_NUMBER: the current version of Apache Spark Iceberg runtime. Download the latest version from Spark Releases.
- CATALOG_NAME: the catalog to reference your Iceberg table.
- BUCKET_PATH: the path to the bucket containing the
table files. For example,
gs://mybucket/
. - FOLDER_NAME: the folder containing the table files.
For example,
myfolder
.
Modify Iceberg tables
To modify a Iceberg table, follow the steps shown in Modifying table schemas.
Pricing
Iceberg table pricing consists of three separate components:
Storage
Iceberg table store all data in Cloud Storage. You are charged for all data stored, including historical table data. Cloud Storage data processing and transfer charges may also apply, as applicable. There are no BigQuery-specific storage fees. For more information, see Cloud Storage Pricing.
Storage optimization
Iceberg tables require storage optimization operations, such as file coalescing
and re-clustering.
These optimization operations use
Enterprise edition pay as you go slots,
and don't use existing BACKGROUND
reservations.
Data export operations taking place while streaming through the BigQuery Storage Write API are included in Storage Write API pricing and are not charged as background maintenance. For more information, see Data ingestion pricing.
Storage optimization usage is visible in the INFORMATION_SCHEMA.JOBS
view.
Queries and jobs
Similar to BigQuery tables, you are charged for queries and bytes read (per TiB) if you are using BigQuery on-demand pricing, or slot consumption (per slot hour) if you are using BigQuery capacity compute pricing.
BigQuery pricing also applies to the BigQuery Storage Read API and the BigQuery Storage Write API.
Load and export operations (such as EXPORT METADATA
) use
Enterprise edition pay as you go slots.
This differs from BigQuery tables, which are not charged for
these operations. If PIPELINE
reservations with Enterprise or
Enterprise Plus slots are available, load and export operations
preferentially use these reservation slots instead.
Limitations
Iceberg tables have the following limitations:
- Iceberg tables don't support
renaming operations or
ALTER TABLE RENAME TO
statements. - Iceberg tables don't support
table copies or
CREATE TABLE COPY
statements. - Iceberg tables don't support
table clones or
CREATE TABLE CLONE
statements. - Iceberg tables don't support
table snapshots or
CREATE SNAPSHOT TABLE
statements. - Iceberg tables don't support the following table schema:
- Empty schema
- Schema with
INTERVAL
,JSON
,RANGE
, orGEOGRAPHY
data types. - Schema with field collations.
- Schema with default value expressions.
EXPORT METADATA
doesn't support tables containingBIGNUMERIC
orNUMERIC
data types with precision greater than 38 places.- Iceberg tables don't support the following schema
evolution cases:
NUMERIC
toFLOAT
type coercionsINT
toFLOAT
type coercions- Adding new nested fields to an existing
RECORD
columns using SQL DDL statements
- Iceberg tables display a 0-byte storage size when queried by the console or APIs.
- Iceberg tables don't support materialized views.
- Iceberg tables don't support multi-statement transactions.
- Iceberg tables don't support change data capture (CDC) updates.
- When streaming data to Iceberg tables using the BigQuery Storage Write API, you must first disable the query cache.
- Iceberg tables don't support managed disaster recovery
- Iceberg tables don't support partitioning. Consider clustering as an alternative.
- Iceberg tables don't support row-level security.
- Iceberg tables don't support time travel.
- Iceberg tables don't support fail-safe windows.
- Iceberg tables don't support extract jobs.
- The
INFORMATION_SCHEMA.TABLE_STORAGE
view doesn't include Iceberg tables. - Iceberg tables are not supported as query result destinations.
CREATE OR REPLACE
doesn't support replacing standard tables with Iceberg tables, or Iceberg tables with standard tables.- Batch loading and
LOAD DATA
statements only support appending data to existing Iceberg tables. - Batch loading and
LOAD DATA
statements don't support schema updates. TRUNCATE TABLE
doesn't support Iceberg tables. There are two alternatives:CREATE OR REPLACE TABLE
, using the same table creation options.DELETE FROM
tableWHERE
true
- The
APPENDS
table-valued function (TVF) doesn't support Iceberg tables. - Iceberg exports in Apache Spark don't contain recently streamed data in write-optimized storage.
- Fast load doesn't support files with flexible column names.
- Record-based paginated access using
tabledata.list
doesn't support Iceberg tables.- Iceberg tables don't support linked datasets.