Apache Beam I/O connector best practices

Data processing in Dataflow can be highly parallelized. Much of this parallelism is handled automatically by Dataflow. I/O connectors sit at the boundary between your pipeline and other parts of your architecture, such as file storage, databases, and messaging systems. As such, I/O connectors often have specific considerations for achieving parallelism.

General best practices

The following list describes general best practices for using I/O connectors in Dataflow.

  • Read the Javadoc, Pydoc, or Go documentation for the connectors in your pipeline. For more information, see I/O connectors in the Apache Beam documentation.

  • Use the latest version of the Apache Beam SDK. I/O connectors are continually being improved, adding features and fixing known issues.

  • When developing a pipeline, it's important to balance the parallelism of the job. If a job has too little parallelism, it can be slow, and data can build up in the source. However, too much parallelism can overwhelm a sink with too many requests.

  • Don't rely on the ordering of elements. In general, Dataflow does not guarantee the order of elements in a collection.

  • If an I/O connector isn't available in your SDK of choice, consider using the cross-language framework to use an I/O connector from another SDK. In addition, connectors don't always have feature parity between SDKs. If a connector from another SDK provides a feature that you need, you can use it as a cross-language transform.

  • In general, writing custom I/O connectors is challenging. Use an existing connector whenever possible. If you need to implement a custom I/O connector, read Developing a new I/O connector.

  • If a pipeline fails, check for errors logged by I/O connectors. See Troubleshoot Dataflow errors.

  • When performing writes from Dataflow to a connector, consider using an ErrorHandler to handle any failed writes or malformed reads. This type of error handling is supported for the following Java I/Os in Apache Beam versions 2.55.0 and later: BigQueryIO, BigtableIO, PubSubIO, KafkaIO, FileIO, TextIO, and AvroIO.

Best practices for individual I/O connectors

The following topics list best practices for individual I/O connectors:

Google-supported I/O connectors

The following table lists the Apache Beam I/O connectors supported by Dataflow. For a full list of Apache Beam I/O connectors, including those developed by the Apache Beam community and supported by other runners, see I/O connectors in the Apache Beam documentation.

What's next