The Dataflow programming model is designed to simplify the mechanics of large-scale data processing. When you program with a Dataflow SDK, you are essentially creating a data processing job to be executed by one of the Cloud Dataflow runner services. This model lets you concentrate on the logical composition of your data processing job, rather than the physical orchestration of parallel processing. You can focus on what you need your job to do instead of exactly how that job gets executed.
The Dataflow model provides a number of useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding data sets, and other such tasks. These low-level details are fully managed for you by Cloud Dataflow's runner services.
When you think about data processing with Dataflow, you can think in terms of four major concepts:
- I/O Sources and Sinks
Once you're familiar with these principles, you can learn about pipeline design principles to help determine how best to use the Dataflow programming model to accomplish your data processing tasks.
A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.
Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.
See Pipelines for a complete discussion of how a pipeline is represented in the Dataflow SDKs.
PCollection represents a set of data in your pipeline. The
PCollection classes are specialized container classes that can represent
data sets of virtually unlimited size. A
PCollection can hold a data set of a
fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a
continuously updating data source (such as a subscription from Google Cloud
PCollections are the inputs and outputs for each step in your pipeline.
See PCollections for a complete discussion of how
PCollection works in the Dataflow SDKs.
A transform is a data processing operation, or a step, in your
pipeline. A transform takes one or more
PCollections as input, performs a processing
function that you provide on the elements of that
PCollection, and produces an
Your transforms don't need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.
See Transforms for a complete discussion of how transforms work in the Dataflow SDKs.
I/O Sources and Sinks
The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.
The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.
See Pipeline I/O for more information on how data sources and sinks work in the Dataflow SDKs.