The managed I/O connector is an Apache Beam transform that provides a common API for creating sources and sinks.
Overview
You create the managed I/O connector using Apache Beam code, just like any other I/O connector. You specify a source or sink to instantiate and pass in a set of configuration parameters. You can also put the configuration parameters into a YAML file and provide a URL to the file.
On the backend, Dataflow treats the managed I/O connector as a service, which allows Dataflow to manage runtime operations for the connector. You can then focus on the business logic in your pipeline, rather than managing these details.
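As a sketch of how the pieces fit together, the snippet below builds a configuration map for a managed source; the commented line shows where the map would be handed to the `Managed` transform in a Beam pipeline. The `Managed` class and `withConfig` method come from the Beam Java SDK, but the pipeline itself is omitted here, and the table and catalog names are placeholders:

```java
import java.util.HashMap;
import java.util.Map;

public class ManagedConfigExample {

    // Builds the configuration map that a managed source expects.
    // The parameter values here are placeholders.
    static Map<String, Object> icebergConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("table", "db.table1");     // table identifier
        config.put("catalog_name", "local");  // catalog name
        return config;
    }

    public static void main(String[] args) {
        Map<String, Object> config = icebergConfig();
        // In a Beam pipeline, the map is passed to the managed transform:
        // pipeline.apply(Managed.read(Managed.ICEBERG).withConfig(config));
        System.out.println(config.get("table"));
    }
}
```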
For more information about the managed I/O API, see Managed in the Apache Beam Java SDK documentation.
Apache Iceberg
Managed I/O supports the following capabilities for Apache Iceberg:
Catalog | Batch read | Batch write | Streaming write | Dynamic table creation | Dynamic destinations |
---|---|---|---|---|---|
Hadoop | Supported | Supported | Supported | Supported | Supported |
Hive | Supported | Supported | Supported | Supported | Supported |
REST-based catalogs | Supported | Supported | Supported | Supported | Supported |
For BigQuery tables for Apache Iceberg, use the BigQueryIO connector with the BigQuery Storage API. The table must already exist; dynamic table creation is not supported.
Managed I/O for Apache Iceberg uses the following configuration parameters.

Name | Data type | Description |
---|---|---|
table | string | The identifier of the Apache Iceberg table. Example: "db.table1". |
catalog_name | string | The name of the catalog. Example: "local". |
catalog_properties | map | A map of configuration properties for the Apache Iceberg catalog. The required properties depend on the catalog. For more information, see CatalogUtil in the Apache Iceberg documentation. |
config_properties | map | An optional set of Hadoop configuration properties. For more information, see CatalogUtil in the Apache Iceberg documentation. |
triggering_frequency_seconds | integer | For streaming write pipelines, the frequency at which the sink attempts to produce snapshots, in seconds. |
drop | list of strings | A list of field names to drop when writing the table. |
keep | list of strings | A list of field names to keep when writing the table. |
only | string | The name of exactly one field to keep when writing the table. All other fields are dropped. The value of the field must be a row type. |
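Because the configuration can also be supplied as a YAML file referenced by URL, the parameters above map directly to YAML keys. A hedged sketch follows, in which the table name, catalog name, and warehouse location are placeholders, and the specific `catalog_properties` keys depend on your catalog implementation:

```yaml
# Example managed I/O configuration for an Apache Iceberg table.
# All values below are placeholders.
table: "db.table1"
catalog_name: "local"
catalog_properties:
  type: "hadoop"                         # catalog implementation
  warehouse: "gs://my-bucket/warehouse"  # placeholder warehouse path
triggering_frequency_seconds: 30         # streaming writes: snapshot every 30 seconds
```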
For more information, including code examples, see the related topics in the Dataflow documentation.
Apache Kafka
For Apache Kafka, Managed I/O uses the following configuration parameters.
Read and write configuration | Data type | Description |
---|---|---|
bootstrap_servers | string | Required. A comma-separated list of Kafka bootstrap servers. Example: localhost:9092. |
topic | string | Required. The Kafka topic to read from or write to. |
file_descriptor_path | string | The path to a protocol buffer file descriptor set. Applies only if data_format is "PROTO". |
data_format | string | The format of the messages. Supported values: "AVRO", "JSON", "PROTO", "RAW". The default value is "RAW", which reads or writes the raw bytes of the message payload. |
message_name | string | The name of the protocol buffer message. Required if data_format is "PROTO". |
schema | string | The Kafka message schema. The expected schema type depends on the data format (for example, an Avro schema when data_format is "AVRO"). For read pipelines, this parameter is ignored if confluent_schema_registry_url is specified. |
Read configuration | | |
auto_offset_reset_config | string | Specifies the behavior when there is no initial offset or the current offset no longer exists on the Kafka server. The following values are supported: "earliest", which resets to the earliest available offset, and "latest", which resets to the latest offset. The default value is "latest". |
confluent_schema_registry_subject | string | The subject of a Confluent schema registry. Required if confluent_schema_registry_url is specified. |
confluent_schema_registry_url | string | The URL of a Confluent schema registry. If specified, the schema parameter is ignored. |
consumer_config_updates | map | Sets configuration parameters for the Kafka consumer. For more information, see Consumer configs in the Kafka documentation. You can use this parameter to customize the Kafka consumer. |
max_read_time_seconds | int | The maximum read time, in seconds. This option produces a bounded PCollection and is mainly intended for testing or other non-production scenarios. |
Write configuration | | |
producer_config_updates | map | Sets configuration parameters for the Kafka producer. For more information, see Producer configs in the Kafka documentation. You can use this parameter to customize the Kafka producer. |
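Putting the read-side parameters together, here is a hedged YAML configuration sketch for reading Avro messages with a Confluent schema registry. The broker addresses, topic, registry URL, subject, and consumer group are all placeholders:

```yaml
# Example managed I/O read configuration for Apache Kafka.
# All endpoint values are placeholders.
bootstrap_servers: "broker1:9092,broker2:9092"
topic: "my-topic"
data_format: "AVRO"
confluent_schema_registry_url: "http://schema-registry:8081"
confluent_schema_registry_subject: "my-topic-value"
auto_offset_reset_config: "earliest"   # start from the earliest available offset
consumer_config_updates:
  group.id: "my-consumer-group"        # standard Kafka consumer property
```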
To read Avro or JSON messages, you must specify a message schema. To set a schema directly, use the schema parameter. To provide the schema through a Confluent schema registry, set the confluent_schema_registry_url and confluent_schema_registry_subject parameters.
To read or write Protocol Buffer messages, either specify a message schema or set the file_descriptor_path parameter.
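For example, a hedged sketch of a Protocol Buffer read configuration using a file descriptor set, where the broker address, topic, descriptor path, and message name are placeholders:

```yaml
# Example managed I/O configuration for Protocol Buffer messages.
# All values below are placeholders.
bootstrap_servers: "broker1:9092"
topic: "proto-topic"
data_format: "PROTO"
file_descriptor_path: "gs://my-bucket/descriptors.pb"  # compiled descriptor set
message_name: "com.example.MyMessage"                  # fully qualified message name
```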