The managed I/O connector is an Apache Beam transform that provides a common API for creating sources and sinks.
Overview
You create the managed I/O connector using Apache Beam code, just like any other I/O connector. You specify a source or sink to instantiate and pass in a set of configuration parameters. You can also put the configuration parameters into a YAML file and provide a URL to the file.
On the backend, Dataflow treats the managed I/O connector as a service, which allows Dataflow to manage runtime operations for the connector. You can then focus on the business logic in your pipeline, rather than managing these details.
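As a sketch of how the pieces fit together, the snippet below builds a configuration map for a managed source; the commented line shows where the map would be handed to the `Managed` transform in a Beam pipeline. The `Managed` class and `withConfig` method come from the Beam Java SDK, but the pipeline itself is omitted here, and the table and catalog names are placeholders:

```java
import java.util.HashMap;
import java.util.Map;

public class ManagedConfigExample {

    // Builds the configuration map that a managed source expects.
    // The parameter values here are placeholders.
    static Map<String, Object> icebergConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("table", "db.table1");     // table identifier
        config.put("catalog_name", "local");  // catalog name
        return config;
    }

    public static void main(String[] args) {
        Map<String, Object> config = icebergConfig();
        // In a Beam pipeline, the map is passed to the managed transform:
        // pipeline.apply(Managed.read(Managed.ICEBERG).withConfig(config));
        System.out.println(config.get("table"));
    }
}
```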
For more information about the managed I/O API, see Managed in the Apache Beam Java SDK documentation.
Apache Iceberg
Managed I/O supports the following capabilities for Apache Iceberg:
Catalog | Batch read | Batch write | Streaming write | Dynamic table creation | Dynamic destinations |
---|---|---|---|---|---|
Hadoop | Supported | Supported | Supported | Supported | Supported |
Hive | Supported | Supported | Supported | Supported | Supported |
REST-based catalogs | Supported | Supported | Supported | Supported | Supported |
For BigQuery tables for Apache Iceberg, use the BigQueryIO connector with the BigQuery Storage API. The table must already exist; dynamic table creation is not supported.
Managed I/O for Apache Iceberg uses the following configuration parameters.

Name | Data type | Description |
---|---|---|
table | string | The identifier of the Apache Iceberg table. Example: "db.table1". |
catalog_name | string | The name of the catalog. Example: "local". |
catalog_properties | map | A map of configuration properties for the Apache Iceberg catalog. The required properties depend on the catalog. For more information, see CatalogUtil in the Apache Iceberg documentation. |
config_properties | map | An optional set of Hadoop configuration properties. For more information, see CatalogUtil in the Apache Iceberg documentation. |
triggering_frequency_seconds | integer | For streaming write pipelines, the frequency at which the sink attempts to produce snapshots, in seconds. |
drop | list of strings | A list of field names to drop when writing the table. |
keep | list of strings | A list of field names to keep when writing the table. |
only | string | The name of exactly one field to keep when writing the table. All other fields are dropped. The value of the field must be a row type. |
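Because the configuration can also be supplied as a YAML file referenced by URL, the parameters above map directly to YAML keys. A hedged sketch follows, in which the table name, catalog name, and warehouse location are placeholders, and the specific `catalog_properties` keys depend on your catalog implementation:

```yaml
# Example managed I/O configuration for an Apache Iceberg table.
# All values below are placeholders.
table: "db.table1"
catalog_name: "local"
catalog_properties:
  type: "hadoop"                         # catalog implementation
  warehouse: "gs://my-bucket/warehouse"  # placeholder warehouse path
triggering_frequency_seconds: 30         # streaming writes: snapshot every 30 seconds
```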
For more information, including code examples, see the related topics in the Dataflow documentation.
Apache Kafka
For Apache Kafka, Managed I/O uses the following configuration parameters.
Read and write configuration | Data type | Description |
---|---|---|
bootstrap_servers | string | Required. A comma-separated list of Kafka bootstrap servers. Example: localhost:9092. |
topic | string | Required. The Kafka topic to read from or write to. |
file_descriptor_path | string | The path to a protocol buffer file descriptor set. Applies only if data_format is "PROTO". |
data_format | string | The format of the messages. Supported values: "AVRO", "JSON", "PROTO", "RAW". The default value is "RAW", which reads or writes the raw bytes of the message payload. |
message_name | string | The name of the protocol buffer message. Required if data_format is "PROTO". |
schema | string | The Kafka message schema. The expected schema type depends on the data format (for example, an Avro schema when data_format is "AVRO"). For read pipelines, this parameter is ignored if confluent_schema_registry_url is specified. |
Read configuration | | |
auto_offset_reset_config | string | Specifies the behavior when there is no initial offset or the current offset no longer exists on the Kafka server. The following values are supported: "earliest", which resets to the earliest available offset, and "latest", which resets to the latest offset. The default value is "latest". |
confluent_schema_registry_subject | string | The subject of a Confluent schema registry. Required if confluent_schema_registry_url is specified. |
confluent_schema_registry_url | string | The URL of a Confluent schema registry. If specified, the schema parameter is ignored. |
consumer_config_updates | map | Sets configuration parameters for the Kafka consumer. For more information, see Consumer configs in the Kafka documentation. You can use this parameter to customize the Kafka consumer. |
max_read_time_seconds | int | The maximum read time, in seconds. This option produces a bounded PCollection and is mainly intended for testing or other non-production scenarios. |
Write configuration | | |
producer_config_updates | map | Sets configuration parameters for the Kafka producer. For more information, see Producer configs in the Kafka documentation. You can use this parameter to customize the Kafka producer. |
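Putting the read-side parameters together, here is a hedged YAML configuration sketch for reading Avro messages with a Confluent schema registry. The broker addresses, topic, registry URL, subject, and consumer group are all placeholders:

```yaml
# Example managed I/O read configuration for Apache Kafka.
# All endpoint values are placeholders.
bootstrap_servers: "broker1:9092,broker2:9092"
topic: "my-topic"
data_format: "AVRO"
confluent_schema_registry_url: "http://schema-registry:8081"
confluent_schema_registry_subject: "my-topic-value"
auto_offset_reset_config: "earliest"   # start from the earliest available offset
consumer_config_updates:
  group.id: "my-consumer-group"        # standard Kafka consumer property
```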
To read Avro or JSON messages, you must specify a message schema. To set a schema directly, use the schema parameter. To provide the schema through a Confluent schema registry, set the confluent_schema_registry_url and confluent_schema_registry_subject parameters.
To read or write Protocol Buffer messages, either specify a message schema or set the file_descriptor_path parameter.
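For example, a hedged sketch of a Protocol Buffer read configuration using a file descriptor set, where the broker address, topic, descriptor path, and message name are placeholders:

```yaml
# Example managed I/O configuration for Protocol Buffer messages.
# All values below are placeholders.
bootstrap_servers: "broker1:9092"
topic: "proto-topic"
data_format: "PROTO"
file_descriptor_path: "gs://my-bucket/descriptors.pb"  # compiled descriptor set
message_name: "com.example.MyMessage"                  # fully qualified message name
```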