Apache Beam I/O connector best practices

Data processing in Dataflow can be highly parallelized. Much of this parallelism is handled automatically by Dataflow. I/O connectors sit at the boundary between your pipeline and other parts of your architecture, such as file storage, databases, and messaging systems. As such, I/O connectors often have specific considerations for achieving parallelism.

General best practices

The following list describes general best practices for using I/O connectors in Dataflow.

Read the Javadoc, Pydoc, or Go documentation for the connectors in your pipeline. For more information, see I/O connectors in the Apache Beam documentation.
Use the latest version of the Apache Beam SDK. I/O connectors are continually improved with new features and fixes for known issues.
When developing a pipeline, it's important to balance the parallelism of the job. If a job has too little parallelism, it can be slow, and data can build up in the source. However, too much parallelism can overwhelm a sink with too many requests.
Don't rely on the ordering of elements. In general, Dataflow does not guarantee the order of elements in a collection.
If an I/O connector isn't available in your SDK of choice, consider using the cross-language framework to use an I/O connector from another SDK. In addition, connectors don't always have feature parity between SDKs. If a connector from another SDK provides a feature that you need, you can use it as a cross-language transform.
In general, writing custom I/O connectors is challenging. Use an existing connector whenever possible. If you need to implement a custom I/O connector, read Developing a new I/O connector.
If a pipeline fails, check for errors logged by I/O connectors. See Troubleshoot Dataflow errors.
When writing from Dataflow to a connector, consider using an ErrorHandler to handle any failed writes or malformed reads. This type of error handling is supported for the following Java I/O connectors in Apache Beam versions 2.55.0 and later: BigQueryIO, BigtableIO, PubsubIO, KafkaIO, FileIO, TextIO, and AvroIO.
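
The last practice in the list, handling failed writes, amounts to sending records that fail to a separate "dead-letter" output rather than failing the whole job. The following is a minimal plain-Python sketch of that idea. It does not use the Apache Beam ErrorHandler API, and the names `write_with_dead_letter` and `strict_sink` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def write_with_dead_letter(
    records: Iterable[dict],
    write_fn: Callable[[dict], None],
) -> Tuple[List[dict], List[Tuple[dict, str]]]:
    """Try to write each record; collect failures instead of raising.

    Returns (written, dead_letter), where dead_letter holds
    (record, error description) pairs for later inspection or replay.
    """
    written: List[dict] = []
    dead_letter: List[Tuple[dict, str]] = []
    for record in records:
        try:
            write_fn(record)
            written.append(record)
        except Exception as exc:  # a real pipeline might catch narrower errors
            dead_letter.append((record, f"{type(exc).__name__}: {exc}"))
    return written, dead_letter


def strict_sink(record: dict) -> None:
    """Hypothetical sink that rejects records without an integer 'id'."""
    if not isinstance(record.get("id"), int):
        raise ValueError("missing or non-integer 'id'")


records = [{"id": 1}, {"id": "oops"}, {"name": "no id"}, {"id": 2}]
written, dead_letter = write_with_dead_letter(records, strict_sink)
print(len(written), "written;", len(dead_letter), "sent to dead letter")
```

In an actual pipeline, the dead-letter records would typically go to their own sink, such as a separate table or topic, so that they can be examined and reprocessed without blocking valid data.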

Best practices for individual I/O connectors

The following topics list best practices for individual I/O connectors:

I/O connector	Read	Write
Apache Iceberg	Read from Apache Iceberg	Write to Apache Iceberg
Apache Kafka	Read from Apache Kafka	Write to Apache Kafka
BigQuery	Read from BigQuery	Write to BigQuery
Bigtable	Read from Bigtable	Write to Bigtable
Cloud Storage	Read from Cloud Storage	Write to Cloud Storage
Pub/Sub	Read from Pub/Sub	Write to Pub/Sub

Google-supported I/O connectors

The following table lists the Apache Beam I/O connectors supported by Dataflow. For a full list of Apache Beam I/O connectors, including those developed by the Apache Beam community and supported by other runners, see I/O connectors in the Apache Beam documentation.

I/O connector name	Java	Python	Go
AvroIO	Supported	Supported	Supported
BigQueryIO	Supported	Supported	Supported
BigtableIO	Supported	Sink supported	Sink supported
DatastoreIO	Supported	Supported	Supported
FhirIO (healthcare)	Supported	Not available	Supported
FileIO	Supported	Supported	Supported
GcsFileSystem	Supported	Supported	Supported
JdbcIO	Supported	Cross-language support	Cross-language support
KafkaIO	Supported	Cross-language support	Cross-language support
LocalFileSystem	Supported	Supported	Supported
ParquetIO	Supported	Supported	Supported
PubSubIO	Supported	Supported	Supported
SpannerIO	Supported	Cross-language support	Supported
TextIO	Supported	Supported	Supported
TFRecordIO	Supported	Supported	Not available

What's next

Read the Apache Beam documentation for I/O connectors.