Dataflow managed I/O for Apache Iceberg

Managed I/O supports the following capabilities for Apache Iceberg:

Catalogs
  • Hadoop
  • Hive
  • REST-based catalogs
  • BigQuery metastore (requires Apache Beam SDK 2.62.0 or later if not using Runner v2)
Read capabilities
  • Batch read
Write capabilities
  • Batch write
  • Streaming write
  • Dynamic destinations
  • Dynamic table creation

For BigQuery tables for Apache Iceberg, use the `BigQueryIO` connector with the BigQuery Storage API. The table must already exist; dynamic table creation is not supported.
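A minimal sketch of such a write, assuming the input is a schema'd `PCollection<Row>` with a made-up two-field schema and using placeholder project, dataset, and table names:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class BigQueryIcebergTableWrite {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A made-up schema with one example record, to keep the sketch self-contained.
    Schema schema = Schema.builder().addInt64Field("id").addStringField("status").build();
    PCollection<Row> rows =
        pipeline.apply(
            Create.of(Row.withSchema(schema).addValues(1L, "ACTIVE").build())
                .withRowSchema(schema));

    rows.apply(
        BigQueryIO.<Row>write()
            .to("my-project:my_dataset.my_iceberg_table") // placeholder identifier
            .useBeamSchema() // derive the destination schema from the Row schema
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
            // The table must already exist; CREATE_NEVER makes that explicit.
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run().waitUntilFinish();
  }
}
```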

Requirements

The following SDKs support managed I/O for Apache Iceberg:

  • Apache Beam SDK for Java version 2.58.0 or later
  • Apache Beam SDK for Python version 2.61.0 or later

Configuration

Managed I/O for Apache Iceberg supports the following configuration parameters:

ICEBERG Read

  • table (str): Identifier of the Iceberg table.
  • catalog_name (str): Name of the catalog containing the table.
  • catalog_properties (map[str, str]): Properties used to set up the Iceberg catalog.
  • config_properties (map[str, str]): Properties passed to the Hadoop Configuration.
  • drop (list[str]): A subset of column names to exclude from reading. If null or empty, all columns are read.
  • filter (str): SQL-like predicate to filter data at scan time, for example "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html
  • keep (list[str]): A subset of column names to read exclusively. If null or empty, all columns are read.
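To show how these options fit together, here is a minimal Java sketch of a batch read through the Managed API. The catalog name, warehouse path, and table identifier are placeholder values, and the Hadoop catalog is just one of the supported choices.

```java
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class IcebergManagedRead {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder catalog and table names; adjust for your environment.
    Map<String, Object> config =
        Map.<String, Object>of(
            "table", "db.my_table",
            "catalog_name", "my_catalog",
            "catalog_properties",
                Map.of("type", "hadoop", "warehouse", "gs://my-bucket/warehouse"),
            // Optional: project columns and filter at scan time.
            "keep", List.of("id", "status"),
            "filter", "id > 5 AND status = 'ACTIVE'");

    // The managed transform emits the table's records as schema'd Rows.
    PCollection<Row> rows =
        pipeline
            .apply(Managed.read(Managed.ICEBERG).withConfig(config))
            .getSinglePCollection();

    pipeline.run().waitUntilFinish();
  }
}
```

Column projection (`keep` or `drop`) and the `filter` predicate are applied at scan time, so unneeded data is never read into the pipeline.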

ICEBERG Write

  • table (str): A fully qualified table identifier. You can also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`.
  • catalog_name (str): Name of the catalog containing the table.
  • catalog_properties (map[str, str]): Properties used to set up the Iceberg catalog.
  • config_properties (map[str, str]): Properties passed to the Hadoop Configuration.
  • drop (list[str]): A list of field names to drop from the input record before writing. Mutually exclusive with 'keep' and 'only'.
  • keep (list[str]): A list of field names to keep in the input record. All other fields are dropped before writing. Mutually exclusive with 'drop' and 'only'.
  • only (str): The name of a single record field that should be written. Mutually exclusive with 'keep' and 'drop'.
  • partition_fields (list[str]): Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are:
      • foo
      • truncate(foo, N)
      • bucket(foo, N)
      • hour(foo)
      • day(foo)
      • month(foo)
      • year(foo)
      • void(foo)
    For more information on partition transforms, see https://iceberg.apache.org/spec/#partition-transforms.
  • table_properties (map[str, str]): Iceberg table properties to set on the table when it is created. For more information on table properties, see https://iceberg.apache.org/docs/latest/configuration/#table-properties.
  • triggering_frequency_seconds (int32): For a streaming pipeline, the frequency at which snapshots are produced.
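Under the same assumptions (placeholder catalog, warehouse, and table names, plus a made-up two-field schema), a batch write that lets the connector create a partitioned table might look like the following sketch.

```java
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class IcebergManagedWrite {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A made-up schema with one example record, to keep the sketch self-contained.
    Schema schema = Schema.builder().addInt64Field("id").addStringField("status").build();
    PCollection<Row> rows =
        pipeline.apply(
            Create.of(Row.withSchema(schema).addValues(1L, "ACTIVE").build())
                .withRowSchema(schema));

    // partition_fields and table_properties only take effect when the
    // connector creates the table.
    Map<String, Object> config =
        Map.<String, Object>of(
            "table", "db.my_table",
            "catalog_name", "my_catalog",
            "catalog_properties",
                Map.of("type", "hadoop", "warehouse", "gs://my-bucket/warehouse"),
            "partition_fields", List.of("bucket(id, 16)"),
            "table_properties", Map.of("write.format.default", "parquet"));

    rows.apply(Managed.write(Managed.ICEBERG).withConfig(config));

    pipeline.run().waitUntilFinish();
  }
}
```

For a streaming pipeline, you would also set `triggering_frequency_seconds` in the same config map to control how often snapshots are produced.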

What's next

For more information and code examples, see the related Dataflow and Apache Beam documentation.