Data processing in Dataflow can be highly parallelized. Much of this parallelism is handled automatically by Dataflow. I/O connectors sit at the boundary between your pipeline and other parts of your architecture, such as file storage, databases, and messaging systems. As such, I/O connectors often have specific considerations for achieving parallelism.
General best practices
The following list describes general best practices for using I/O connectors in Dataflow.
Read the Javadoc, Pydoc, or Go documentation for the connectors in your pipeline. For more information, see I/O connectors in the Apache Beam documentation.
Use the latest version of the Apache Beam SDK. I/O connectors are continually being improved, adding features and fixing known issues.
When developing a pipeline, it's important to balance the parallelism of the job. If a job has too little parallelism, it can be slow, and data can build up in the source. However, too much parallelism can overwhelm a sink with too many requests.
Don't rely on the ordering of elements. In general, Dataflow does not guarantee the order of elements in a collection.
If an I/O connector isn't available in your SDK of choice, consider using the cross-language framework to use an I/O connector from another SDK. In addition, connectors don't always have feature parity between SDKs. If a connector from another SDK provides a feature that you need, you can use it as a cross-language transform.
In general, writing custom I/O connectors is challenging. Use an existing connector whenever possible. If you need to implement a custom I/O connector, read Developing a new I/O connector.
If a pipeline fails, check for errors logged by I/O connectors. See Troubleshoot Dataflow errors.
When performing writes from Dataflow to a connector, consider using an ErrorHandler to handle any failed writes or malformed reads. This type of error handling is supported for the following Java I/Os in Apache Beam versons 2.55.0 and later: BigQueryIO, BigtableIO, PubSubIO, KafkaIO, FileIO, TextIO, and AvroIO.
Best practices for individual I/O connectors
The following topics list best practices for individual I/O connectors:
I/O connector | Read | Write |
---|---|---|
Apache Iceberg | Read from Apache Iceberg | Write to Apache Iceberg |
Apache Kafka | Read from Apache Kafka | Write to Apache Kafka |
BigQuery | Read from BigQuery | Write to BigQuery |
Bigtable | Read from Bigtable | Write to Bigtable |
Cloud Storage | Read from Cloud Storage | Write to Cloud Storage |
Pub/Sub | Read from Pub/Sub | Write to Pub/Sub |
Google-supported I/O connectors
The following table lists the Apache Beam I/O connectors supported by Dataflow. For a full list of Apache Beam I/O connectors, including those developed by the Apache Beam community and supported by other runners, see I/O connectors in the Apache Beam documentation.
I/O connector name | Java | Python | Go |
---|---|---|---|
AvroIO | Supported | Supported | Supported |
BigQueryIO | Supported | Supported | Supported |
BigTableIO | Supported | Sink supported | Sink supported |
DatastoreIO | Supported | Supported | Supported |
FhirIO (healthcare) | Supported | Not available | Supported |
FileIO | Supported | Supported | Supported |
GcsFileSystem | Supported | Supported | Supported |
JdbcIO | Supported | Cross-language support | Cross-language support |
KafkaIO | Supported | Cross-language support | Cross-language support |
LocalFileSystem | Supported | Supported | Supported |
ParquetIO | Supported | Supported | Supported |
PubSubIO | Supported | Supported | Supported |
SpannerIO | Supported | Cross-language support | Supported |
TextIO | Supported | Supported | Supported |
TFRecordIO | Supported | Supported | Not available |
What's next
- Read the Apache Beam documentation for I/O connectors.