Data processing in Dataflow can be highly parallelized. Much of this parallelism is handled automatically by Dataflow. I/O connectors sit at the boundary between your pipeline and other parts of your architecture, such as file storage, databases, and messaging systems. As such, I/O connectors often have specific considerations for achieving parallelism.
This document describes some general best practices for using I/O connectors in Dataflow.
Read the Javadoc, Pydoc, or Go documentation for the connectors in your pipeline. For more information, see I/O connectors in the Apache Beam documentation.
Use the latest version of the Apache Beam SDK. I/O connectors are continually being improved, adding features and fixing known issues.
When developing a pipeline, it's important to balance the parallelism of the job. If a job has too little parallelism, it can be slow, and data can build up in the source. However, too much parallelism can overwhelm a sink with too many requests.
Don't rely on the ordering of elements. In general, Dataflow does not guarantee the order of elements in a collection.
If an I/O connector isn't available in your SDK of choice, consider using the cross-language framework to use an I/O connector from another SDK. In addition, connectors don't always have feature parity between SDKs. If a connector from another SDK provides a feature that you need, you can use it as a cross-language transform.
In general, writing custom I/O connectors is challenging. Use an existing connector whenever possible. If you need to implement a custom I/O connector, read Developing a new I/O connector.
If a pipeline fails, check for errors logged by I/O connectors. See Troubleshoot Dataflow errors.
The following topics list best practices for individual I/O connectors:
I/O connector | Read | Write |
---|---|---|
BigQuery | Read from BigQuery | Write to BigQuery |
Bigtable | Read from Bigtable | Write to Bigtable |
Cloud Storage | Read from Cloud Storage | Write to Cloud Storage |
What's next
- Read the Apache Beam documentation for I/O connectors.