Building production-ready data pipelines using Dataflow: Overview

This document is part of a series that helps you improve the production readiness of your data pipelines by using Dataflow. The series is intended for a technical audience whose responsibilities include developing, deploying, and monitoring Dataflow pipelines. The documents in the series assume a working understanding of Dataflow and Apache Beam.

The series uses the term production readiness to mean suitable for delivering business-critical functionality. The information in the series covers best practices relating to a range of universal considerations, such as pipeline reliability and maintainability, pipeline performance optimization, and developer productivity.

In addition to this overview, the series includes documents that cover planning, developing and testing, deploying, and monitoring data pipelines.

Introduction

In computing, a data pipeline is a type of application that processes data through a sequence of connected processing steps. Data pipelines can be applied to many use cases, such as data migration between systems, extract, transform, and load (ETL), data enrichment, and real-time data analysis. Typically, a data pipeline is operated either as a batch process, which executes and processes a bounded set of data only while it runs, or as a streaming process, which executes continuously and processes data as it becomes available to the pipeline.
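For example, the following sketch shows a minimal batch word-count pipeline written with the Apache Beam Python SDK. The Cloud Storage paths are placeholders, and an equivalent streaming pipeline would read from an unbounded source such as Pub/Sub instead of a text file.

```python
# A minimal Apache Beam batch pipeline (Python SDK).
# The input and output paths are placeholders for illustration.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")    # placeholder path
        | "Strip" >> beam.Map(str.strip)
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://example-bucket/output/counts")  # placeholder path
    )
```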

Dataflow is a managed service for running batch and streaming data-processing pipelines that are developed using the Apache Beam SDK. Dataflow is a serverless offering that lets you focus on expressing the business logic of your data pipelines (using code and SQL) while removing or simplifying many infrastructure and operational tasks. For example, Dataflow features include automatic infrastructure provisioning and scaling, pipeline updates, and templatized deployment for repeatable runs.
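As a sketch, and assuming placeholder project, region, and bucket values, running a Beam pipeline on Dataflow is typically a matter of setting pipeline options that select the Dataflow runner:

```python
# A sketch of launching a Beam pipeline on the Dataflow runner.
# The project, region, bucket, and job name values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="example-project",                  # placeholder project ID
    region="us-central1",                       # placeholder region
    temp_location="gs://example-bucket/temp",   # placeholder bucket
    job_name="example-copy-job",                # placeholder job name
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")   # placeholder path
        | "Write" >> beam.io.WriteToText("gs://example-bucket/output/copy")   # placeholder path
    )
```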

As with other applications, enabling production readiness for Dataflow pipelines involves more than just writing and running code. You need to gather and analyze requirements, create a plan that informs your design, and measure performance against that plan.

To ensure pipeline reliability and maintainability, you need to understand and follow coding best practices during the development process. For example, we recommend that you use modern DevOps practices such as continuous integration and continuous delivery (CI/CD) to turn Apache Beam code on a developer's workstation into Dataflow pipelines that run in production. For your running pipelines, use monitoring, alerting, and error reporting to troubleshoot pipeline errors, understand performance problems and data issues, and track your pipeline's overall performance against service level objectives (SLOs).
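For example, the following is a minimal sketch of the kind of unit test you might automate in a CI pipeline, using the Apache Beam testing utilities. The transform under test (the word-count logic from the earlier sketch) is illustrative.

```python
# A minimal unit test for a Beam transform, suitable for running in a CI
# pipeline before promoting code to a Dataflow environment.
import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


class CountWordsTest(unittest.TestCase):
    def test_count_words(self):
        with TestPipeline() as p:
            counts = (
                p
                | beam.Create(["a b", "a"])
                | beam.FlatMap(lambda line: line.split())
                | beam.Map(lambda word: (word, 1))
                | beam.CombinePerKey(sum)
            )
            # Verify that the pipeline produces the expected word counts.
            assert_that(counts, equal_to([("a", 2), ("b", 1)]))


if __name__ == "__main__":
    unittest.main()
```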

When you combine Dataflow with other managed Google Cloud services, you can simplify many aspects of productionizing data pipelines compared to self-managed solutions. For example, you can use Google Cloud's operations suite for monitoring and error reporting, and Cloud Composer for pipeline orchestration.
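For example, the following is a sketch of a Cloud Composer (Apache Airflow) DAG that launches a Dataflow job from a Google-provided template. The project, bucket, and region values are placeholders, and the exact operator arguments can vary with the version of the Airflow Google provider installed in your environment.

```python
# A sketch of orchestrating a Dataflow templated job from Cloud Composer.
# Project, bucket, and region values are placeholders.
import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with models.DAG(
    dag_id="dataflow_wordcount_daily",
    start_date=datetime.datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start_wordcount = DataflowTemplatedJobStartOperator(
        task_id="start_wordcount",
        project_id="example-project",     # placeholder project ID
        location="us-central1",           # placeholder region
        template="gs://dataflow-templates/latest/Word_Count",
        parameters={
            "inputFile": "gs://example-bucket/input/*.txt",  # placeholder path
            "output": "gs://example-bucket/output/counts",   # placeholder path
        },
    )
```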

Objectives

The best practices described in this series are categorized into the following phases:

  • Planning: What non-functional requirements should you capture before you begin development? Why should you think about SLOs during the planning stage, and how should you set them? How should SLOs inform the design of your data pipelines? What potential impact do you need to consider for upstream and downstream systems?
  • Development and testing: What best practices should you observe when you write Apache Beam code? What tests should you build and automate? How should pipeline developers work with different deployment environments to write, test, and run code?
  • Deployment: How should you update pipelines that are already running when a new release is available? How can you apply CI/CD practices to data pipelines?
  • Monitoring: What monitoring should you have for production pipelines? How do you detect and resolve performance problems, data-handling issues, and code errors in production?

By the end of this series, you will have a deeper understanding of these considerations and of how they contribute to building production-ready data pipelines using Dataflow.

What's next