Dataflow reliability guide

Last reviewed 2023-08-07 UTC

Dataflow is a fully managed data processing service that enables fast, simplified development of streaming data pipelines using open source Apache Beam libraries. Dataflow minimizes latency, processing time, and cost through autoscaling and batch processing.

Best practices

Building production-ready data pipelines using Dataflow - a document series that covers planning, developing, deploying, and monitoring Dataflow pipelines.

  • Overview - introduction to Dataflow pipelines.
  • Planning - defining and measuring SLOs, understanding the impact of data sources and sinks on pipeline scalability and performance, and taking high availability, disaster recovery, and network performance into account when specifying regions to run your Dataflow jobs (a region-selection sketch follows this list).
  • Developing and testing - setting up deployment environments, preventing data loss by using dead-letter queues for error handling, and reducing latency and cost by minimizing expensive per-element operations. Also covers using batching to reduce performance overhead without overloading external services, unfusing inappropriately fused steps so that they can scale independently for better performance, and running end-to-end tests in preproduction to ensure that the pipeline continues to meet your SLOs and other production requirements. Code sketches for dead-letter handling, batching, and fusion breaking follow this list.
  • Deploying - continuous integration (CI) and continuous delivery and deployment (CD), with special considerations for deploying new versions of streaming pipelines. Also covers an example CI/CD pipeline and features for optimizing resource usage. Finally, a discussion of high availability, geographic redundancy, and best practices for pipeline reliability, including regional isolation, use of snapshots, handling job submission errors, and recovering from errors and outages that affect running pipelines.
  • Monitoring - observing service level indicators (SLIs), which are important indicators of pipeline performance, and defining and measuring service level objectives (SLOs).
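
The planning item above includes choosing regions deliberately. As a minimal sketch of how that choice is expressed in code, the Apache Beam Python SDK lets you pin a Dataflow job to a region through pipeline options. The project ID, bucket, region, and job name below are placeholders; replace them with values that match your availability, disaster recovery, and latency requirements.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All values below are placeholders; replace them with your own project,
# bucket, and a region chosen for availability, disaster recovery, and
# proximity to your data sources and sinks.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    job_name='reliability-example',
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Print' >> beam.Map(print))
```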
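
For the dead-letter error handling mentioned in the developing and testing item, one common pattern in the Apache Beam Python SDK is to route elements that fail processing to a tagged secondary output instead of failing the whole pipeline. The sketch below assumes JSON input; in a real pipeline the dead-letter output would typically be written to a sink such as BigQuery or Pub/Sub for later inspection and replay.

```python
import json
import apache_beam as beam


class ParseJson(beam.DoFn):
    # Tag used for the dead-letter (failure) output.
    DEAD_LETTER_TAG = 'dead_letter'

    def process(self, element):
        try:
            yield json.loads(element)
        except (ValueError, TypeError):
            # Route unparseable records to the dead-letter output,
            # keeping the raw payload for later inspection or replay.
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER_TAG, element)


def run():
    with beam.Pipeline() as pipeline:
        lines = pipeline | 'Read' >> beam.Create(['{"id": 1}', 'not-json'])

        results = lines | 'Parse' >> beam.ParDo(ParseJson()).with_outputs(
            ParseJson.DEAD_LETTER_TAG, main='parsed')

        # results.parsed holds well-formed records; results.dead_letter holds
        # failures, which a production pipeline would write to a sink such as
        # BigQuery or Pub/Sub instead of printing.
        results.parsed | 'HandleGood' >> beam.Map(print)
        results.dead_letter | 'HandleBad' >> beam.Map(
            lambda raw: print(f'dead letter: {raw!r}'))


if __name__ == '__main__':
    run()
```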
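
For the batching recommendation, Beam's built-in BatchElements transform groups elements into bounded batches so an external service can be called once per batch rather than once per element. The call_external_service function below is a stand-in for a hypothetical bulk API, and the batch size limits are illustrative only.

```python
import apache_beam as beam


def call_external_service(batch):
    # One call per batch instead of one per element reduces per-call
    # overhead without overwhelming the external service. The body is a
    # placeholder for a hypothetical bulk endpoint such as
    # client.lookup_many(batch).
    return [f'enriched:{item}' for item in batch]


with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create(range(100))
     | 'Batch' >> beam.BatchElements(min_batch_size=10, max_batch_size=50)
     | 'CallService' >> beam.FlatMap(call_external_service)
     | 'Print' >> beam.Map(print))
```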
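
For unfusing inappropriately fused steps, a common technique is to insert beam.Reshuffle between a high-fan-out step and expensive downstream work so that the runner does not fuse them and can redistribute the expanded data across workers. The fan-out and workload functions below are illustrative placeholders.

```python
import apache_beam as beam


def expand_to_many(element):
    # A step with large fan-out: each input element produces many outputs.
    return [f'{element}-{i}' for i in range(1000)]


def expensive_work(element):
    # Placeholder for a CPU-heavy or slow per-element operation.
    return element.upper()


with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['a', 'b', 'c'])
     | 'FanOut' >> beam.FlatMap(expand_to_many)
     # Without the Reshuffle, the runner may fuse FanOut and ExpensiveWork,
     # limiting parallelism to the small number of original input elements.
     | 'BreakFusion' >> beam.Reshuffle()
     | 'ExpensiveWork' >> beam.Map(expensive_work)
     | 'Print' >> beam.Map(print))
```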