To read Change Data Capture (CDC) events from Apache Iceberg using the BigLake REST Catalog, use the Apache Beam managed I/O connector.
Managed I/O supports the following capabilities for Apache Iceberg:
Catalogs |
|
---|---|
Read capabilities | Batch read |
Write capabilities |
|
For BigQuery tables for Apache Iceberg,
use the
BigQueryIO
connector
with BigQuery Storage API. The table must already exist; dynamic table creation is
not supported.
Limitations
- Apache Iceberg CDC is only supported using the Managed API. Managed transforms service features are not yet enabled. Expect changes that affect backward compatibility
- The CDC Managed API only reads append-only snapshots. Full CDC is not yet available.
Prerequisites
- Set up BigLake. Configure your Google Cloud Platform project with the required permissions by following Use the BigLake Metastore with the Iceberg REST catalog. Make sure that you understand the limitations of BigLake Iceberg REST Catalog described on that page.
- Create a source Iceberg table. The example shown here assumes that you have an Apache Iceberg table. To create one, you can use the pipeline shown in Streaming Write to Apache Iceberg with BigLake REST Catalog.
Dependencies
Add the following dependencies to your project:
Java
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-managed</artifactId>
<version>${beam.version}</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-iceberg</artifactId>
<version>${beam.version}</version>
</dependency>
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-gcp</artifactId>
<version>${iceberg.version}</version>
</dependency>
Example
The following example demonstrates a streaming pipeline that reads CDC events from an Apache Iceberg table, aggregates user clicks, and writes the results to another Apache Iceberg table.
Java
To authenticate to Dataflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
What's next
- Learn more about Managed I/O.
- Learn more about BigLake REST Catalog.