Dataflow managed I/O

The managed I/O connector is an Apache Beam transform that provides a common API for creating sources and sinks.

Overview

You create the managed I/O connector using Apache Beam code, just like any other I/O connector. You specify a source or sink to instantiate and pass in a set of configuration parameters. You can also put the configuration parameters into a YAML file and provide a URL to the file.

On the backend, Dataflow treats the managed I/O connector as a service, which allows Dataflow to manage runtime operations for the connector. You can then focus on the business logic in your pipeline, rather than managing these details.
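For example, you instantiate a source by passing the connector name and a map of configuration parameters to the Managed transform. The following is a minimal sketch of that pattern, assuming a Kafka source; the server address and topic name are placeholders, not working values:

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class ManagedReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Configuration parameters for the source. The keys depend on the
    // connector; these are Kafka parameters with placeholder values.
    Map<String, Object> config = Map.of(
        "bootstrap_servers", "localhost:9092",
        "topic", "my-topic",
        "data_format", "RAW");

    // Instantiate the source by name and pass in the configuration.
    PCollection<Row> rows = pipeline
        .apply(Managed.read(Managed.KAFKA).withConfig(config))
        .getSinglePCollection();

    // ... transform and write the rows ...
    pipeline.run();
  }
}
```

The transform produces a schematized `PCollection<Row>`, so downstream transforms can work with the data without connector-specific types.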

For more information about the managed I/O API, see Managed in the Apache Beam Java SDK documentation.

Apache Iceberg

Managed I/O supports the following capabilities for Apache Iceberg:

Catalog Batch read Batch write Streaming write Dynamic table creation Dynamic destinations
Hadoop Supported Supported Supported Supported Supported
Hive Supported Supported Supported Supported Supported
REST-based catalogs Supported Supported Supported Supported Supported

For BigQuery tables for Apache Iceberg, use the BigQueryIO connector with the BigQuery Storage API. The table must already exist; dynamic table creation is not supported.

Managed I/O uses the following configuration parameters for Apache Iceberg:

Name Data type Description
table string The identifier of the Apache Iceberg table. Example: "db.table1".
catalog_name string The name of the catalog. Example: "local".
catalog_properties map A map of configuration properties for the Apache Iceberg catalog. The required properties depend on the catalog. For more information, see CatalogUtil in the Apache Iceberg documentation.
config_properties map An optional set of Hadoop configuration properties. For more information, see CatalogUtil in the Apache Iceberg documentation.
triggering_frequency_seconds integer For streaming write pipelines, the frequency at which the sink attempts to produce snapshots, in seconds.
drop list of strings A list of field names to drop when writing the table.
keep list of strings A list of field names to keep when writing the table.
only string The name of exactly one field to keep when writing the table. All other fields are dropped. The value of the field must be a row type.
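As an illustrative sketch, the parameters above might be assembled as follows for a batch read from a Hadoop catalog. The table identifier, catalog name, and warehouse path are placeholder values, and the required `catalog_properties` entries depend on your catalog type:

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class IcebergReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Catalog properties vary by catalog type; these Hadoop catalog
    // values are placeholders.
    Map<String, Object> catalogProperties = Map.of(
        "type", "hadoop",
        "warehouse", "gs://my-bucket/warehouse");

    // Configuration parameters from the table above.
    Map<String, Object> config = Map.of(
        "table", "db.table1",
        "catalog_name", "local",
        "catalog_properties", catalogProperties);

    PCollection<Row> rows = pipeline
        .apply(Managed.read(Managed.ICEBERG).withConfig(config))
        .getSinglePCollection();

    pipeline.run();
  }
}
```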

For more information, including code examples, see the Apache Iceberg topics in the Dataflow documentation.

Apache Kafka

For Apache Kafka, the managed I/O connector uses the following configuration parameters:

Read and write configuration Data type Description
bootstrap_servers string Required. A comma-separated list of Kafka bootstrap servers. Example: localhost:9092.
topic string Required. The Kafka topic to read from or write to.
file_descriptor_path string The path to a protocol buffer file descriptor set. Applies only if data_format is "PROTO".
data_format string The format of the messages. Supported values: "AVRO", "JSON", "PROTO", "RAW". The default value is "RAW", which reads or writes the raw bytes of the message payload.
message_name string The name of the protocol buffer message. Required if data_format is "PROTO".
schema string The Kafka message schema. The expected schema type depends on the data format. For read pipelines, this parameter is ignored if confluent_schema_registry_url is set.

Read configuration
auto_offset_reset_config string Specifies the behavior when there is no initial offset or the current offset no longer exists on the Kafka server. Supported values: "earliest" (reset the offset to the earliest offset) and "latest" (reset the offset to the latest offset). The default value is "latest".

confluent_schema_registry_subject string The subject of a Confluent schema registry. Required if confluent_schema_registry_url is specified.
confluent_schema_registry_url string The URL of a Confluent schema registry. If specified, the schema parameter is ignored.
consumer_config_updates map Sets configuration parameters for the Kafka consumer. For more information, see Consumer configs in the Kafka documentation. You can use this parameter to customize the Kafka consumer.
max_read_time_seconds integer The maximum read time, in seconds. This option produces a bounded PCollection and is mainly intended for testing or other non-production scenarios.
Write configuration
producer_config_updates map Sets configuration parameters for the Kafka producer. For more information, see Producer configs in the Kafka documentation. You can use this parameter to customize the Kafka producer.

To read Avro or JSON messages, you must specify a message schema. To set a schema directly, use the schema parameter. To provide the schema through a Confluent schema registry, set the confluent_schema_registry_url and confluent_schema_registry_subject parameters.

To read or write Protocol Buffer messages, either specify a message schema or set the file_descriptor_path parameter.
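Putting these parameters together, a read of Avro messages with an inline schema might look like the following sketch. The bootstrap server, topic, and Avro record definition are illustrative placeholders, and the sketch assumes the schema parameter accepts an Avro schema string when data_format is "AVRO":

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class KafkaAvroReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Placeholder Avro schema describing the message payload.
    String avroSchema = "{\"type\":\"record\",\"name\":\"Order\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    Map<String, Object> config = Map.of(
        "bootstrap_servers", "localhost:9092",   // placeholder
        "topic", "orders",                       // placeholder
        "data_format", "AVRO",
        "schema", avroSchema,
        "auto_offset_reset_config", "earliest");

    PCollection<Row> rows = pipeline
        .apply(Managed.read(Managed.KAFKA).withConfig(config))
        .getSinglePCollection();

    pipeline.run();
  }
}
```

To use a Confluent schema registry instead, you would drop the schema entry and set confluent_schema_registry_url and confluent_schema_registry_subject in the same configuration map.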