Apache Beam I/O connector best practices

Data processing in Dataflow can be highly parallelized. Much of this parallelism is handled automatically by Dataflow. I/O connectors sit at the boundary between your pipeline and other parts of your architecture, such as file storage, databases, and messaging systems. As such, I/O connectors often have specific considerations for achieving parallelism.

This document describes some general best practices for using I/O connectors in Dataflow.

  • Read the Javadoc, Pydoc, or Go documentation for the connectors in your pipeline. For more information, see I/O connectors in the Apache Beam documentation.

  • Use the latest version of the Apache Beam SDK. I/O connectors are continually improved, with new features added and known issues fixed.

  • When developing a pipeline, it's important to balance the parallelism of the job. If a job has too little parallelism, it can be slow, and data can build up in the source. However, too much parallelism can overwhelm a sink with too many requests.

  • Don't rely on the ordering of elements. In general, Dataflow does not guarantee the order of elements in a collection.

  • If an I/O connector isn't available in your SDK of choice, consider the cross-language framework, which lets you use an I/O connector from another SDK. In addition, connectors don't always have feature parity between SDKs. If a connector from another SDK provides a feature that you need, you can use it as a cross-language transform.

  • In general, writing custom I/O connectors is challenging. Use an existing connector whenever possible. If you need to implement a custom I/O connector, read Developing a new I/O connector.

  • If a pipeline fails, check for errors logged by I/O connectors. See Troubleshoot Dataflow errors.
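
The guidance above about element ordering can be illustrated with a minimal sketch in plain Python. It does not use the Apache Beam API. The event tuples, the sequence-number field, and the function names are hypothetical and chosen only for illustration. The sketch assumes that elements can arrive in any order, and contrasts an aggregate that is unaffected by arrival order with one that silently depends on it.

```python
from collections import defaultdict

# Hypothetical events: (key, sequence_number, value). The sequence number is
# carried in the data itself, so ordering can be recovered when it is needed.
EVENTS = [
    ("user-a", 1, "login"),
    ("user-a", 2, "view"),
    ("user-a", 3, "logout"),
    ("user-b", 1, "login"),
    ("user-b", 2, "logout"),
]


def count_per_key(events):
    """Order-insensitive aggregate: same result for any arrival order."""
    counts = defaultdict(int)
    for key, _seq, _value in events:
        counts[key] += 1
    return dict(counts)


def ordered_values_per_key(events):
    """Recover a per-key order by sorting on the explicit sequence number."""
    grouped = defaultdict(list)
    for key, seq, value in events:
        grouped[key].append((seq, value))
    return {
        key: [value for _seq, value in sorted(items)]
        for key, items in grouped.items()
    }


def last_seen_per_key(events):
    """Fragile: the result depends on the order in which elements arrive."""
    last = {}
    for key, _seq, value in events:
        last[key] = value
    return last


if __name__ == "__main__":
    # Reversing the list stands in for an arbitrary arrival order.
    arrived = list(reversed(EVENTS))
    print(count_per_key(arrived))
    print(ordered_values_per_key(arrived))
    print(last_seen_per_key(arrived))
```

In this sketch, `count_per_key` and `ordered_values_per_key` return the same result for any arrival order, while `last_seen_per_key` does not. If downstream logic needs an order, one option is to carry an explicit field such as a sequence number or timestamp in each element and sort on it, rather than assuming the order in which a connector delivers elements.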

The following topics list best practices for individual I/O connectors:

I/O connector    Read                       Write
BigQuery         Read from BigQuery         Write to BigQuery
Bigtable         Read from Bigtable         Write to Bigtable
Cloud Storage    Read from Cloud Storage    Write to Cloud Storage

What's next