In the Dataflow SDKs, a pipeline represents a data processing job. You build a pipeline by writing a program using a Dataflow SDK. A pipeline consists of a set of operations that can read a source of input data, transform that data, and write out the resulting output. The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.
You can create pipelines with varying degrees of complexity. A pipeline can be relatively simple and linear, where a set of transforms are executed one after another, or a pipeline can branch and merge. In this way, you can think of a pipeline as a directed graph of steps, rather than a simple linear sequence of steps. Your pipeline construction can create this directed graph by using conditionals, loops, and other common programming structures.
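For example, a pipeline branches whenever you apply more than one transform to the same intermediate data. The following is a minimal Java sketch of that idea (the gs:// paths are placeholders, p is assumed to be an already-created Pipeline, and the classes used are described in the sections that follow):

// Read the input once; the resulting PCollection feeds two independent branches.
PCollection<String> lines = p.apply(TextIO.Read.from("gs://some_bucket/input*.txt"));

// Branch 1: count the total number of lines read.
PCollection<Long> lineCount = lines.apply(Count.<String>globally());

// Branch 2: independently write an unmodified copy of the lines.
lines.apply(TextIO.Write.to("gs://some_bucket/copy"));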
Note: When you write a program with a Dataflow SDK, your program creates a pipeline specification. This specification is sent to a pipeline runner, which can be the Cloud Dataflow service or a third-party runner. The pipeline runner executes the actual pipeline asynchronously. A pipeline can also be executed locally for testing and debugging purposes.
When the pipeline runner builds your actual pipeline for distributed execution, the pipeline
may be optimized. For example,
it may be more computationally efficient to run certain transforms together, or in a different
order. The Dataflow service fully manages this aspect of your pipeline's execution.
Parts of a Pipeline
A pipeline consists of two parts: data and transforms applied to that data. The Dataflow SDKs provide classes to represent both data and transforms. The Dataflow SDKs tie together the data classes and transform classes to construct the entire pipeline. See Constructing Your Pipeline for a start-to-finish guide on how to use the Dataflow SDK classes to construct your pipeline.
Pipeline Data
In the Dataflow SDKs, pipelines use a specialized collection class called PCollection to represent their input, intermediate, and output data. PCollections can be used to represent data sets of virtually any size. Note that, compared to typical collection classes such as Java's Collection, PCollections are specifically designed to support parallelized processing.
A pipeline must create a PCollection for any data it needs to work with. You can read data from an external source into a PCollection, or you can create a PCollection from local data in your Dataflow program. From there, each transform in your pipeline accepts one or more PCollections as input and produces one or more PCollections as output.
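For example, here is a brief Java sketch of both ways of getting data into a PCollection (the file path and element values are placeholders):

// Create a PCollection by reading data from an external source.
PCollection<String> linesFromFile = p.apply(TextIO.Read.from("gs://some_bucket/input.txt"));

// Create a PCollection from local, in-memory data in your Dataflow program.
PCollection<String> linesFromMemory = p.apply(Create.of("to", "be", "or", "not", "to", "be"));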
See PCollection for a complete discussion of how a PCollection works and how to use one.
Pipeline Transforms
A transform is a step in your pipeline. Each transform takes one or more PCollections as input, changes or otherwise manipulates the elements of that PCollection, and produces one or more new PCollections as output.
Core Transforms
The Dataflow SDKs contain a number of core transforms. A core transform is a generic operation that represents a basic or common processing step you perform on your pipeline data. Most core transforms provide a processing pattern, and require you to create and supply the actual processing logic that gets applied to the input PCollection.
For example, the ParDo core transform provides a generic processing pattern: for every element in the input PCollection, perform a user-specified processing function on that element. The Dataflow SDKs supply core transforms such as ParDo and GroupByKey, as well as other core transforms for combining, merging, and splitting data sets.
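For example, the following Java sketch supplies the per-element processing logic to ParDo as an anonymous DoFn; the logic shown here (emitting the length of each word) is only a placeholder:

PCollection<String> words = ...;  // An existing PCollection of words.

// ParDo provides the pattern (process every element); the DoFn provides the logic.
PCollection<Integer> wordLengths = words.apply(
    ParDo.of(new DoFn<String, Integer>() {
      @Override
      public void processElement(ProcessContext c) {
        // Output the length of each input word as a new element.
        c.output(c.element().length());
      }
    }));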
See Transforms for a complete discussion of how to use transforms in your pipeline.
Composite Transforms
The Dataflow SDKs support combining multiple transforms into larger composite transforms. In a composite transform, multiple transforms are applied to a data set to perform a more complex data processing operation. Composite transforms are a good way to build modular, reusable combinations of transforms that do useful things.
The Dataflow SDKs contain libraries of pre-written composite transforms that handle common data processing use cases, including (but not limited to):
- Combining data, such as summing or averaging numerical data
- Map/Shuffle/Reduce-style processing, such as counting unique elements in a collection
- Statistical analysis, such as finding the top N elements in a collection
You can also create your own reusable composite transforms. See Creating Composite Transforms for a complete discussion.
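For example, a word-counting composite transform might be written by subclassing PTransform and overriding its apply method, as in the following sketch. The ExtractWordsFn and FormatCountsFn DoFns are hypothetical; Count.perElement is one of the SDK's pre-written transforms.

// A hypothetical composite transform that packages three steps into a single
// reusable unit: split lines into words, count each distinct word, and format
// the results as printable strings.
public static class CountWords
    extends PTransform<PCollection<String>, PCollection<String>> {
  @Override
  public PCollection<String> apply(PCollection<String> lines) {
    PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
    PCollection<KV<String, Long>> wordCounts = words.apply(Count.<String>perElement());
    return wordCounts.apply(ParDo.of(new FormatCountsFn()));
  }
}

A composite transform like this is applied in a single step, as the CountWords transform is in the example pipeline below.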
Root Transforms
The Dataflow SDKs often use root transforms at the start of a pipeline to create an initial PCollection. Root transforms frequently involve reading data from an external data source. See Pipeline I/O for additional information.
A Simple Example Pipeline
The following example demonstrates constructing and running a pipeline with three transforms: a transform to read in some data, a transform to count the data, and a transform to write out the results of the count.
Note: See Constructing Your Pipeline for a detailed discussion of how to construct a pipeline using the classes in the Dataflow SDKs.
A common style for constructing a pipeline is to "chain" transforms together. To chain transforms, you apply each new transform directly to the resulting PCollection of the previous transform, as shown in the following example.
Java
public static void main(String[] args) {
  // Create a pipeline parameterized by commandline flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

  p.apply(TextIO.Read.from("gs://..."))   // Read input.
   .apply(new CountWords())               // Do some processing.
   .apply(TextIO.Write.to("gs://..."));   // Write output.

  // Run the pipeline.
  p.run();
}
In the example, the first call to apply invokes a root transform to create a PCollection (in this case, by reading data from a file). Each subsequent call to apply is invoked on the PCollection produced by the previous transform.
Note: The return value for the entire chain is not saved. This is because the final apply call, to the Write transform, returns a trivial value of type PDone instead of a PCollection. PDone is generally ignored.
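For comparison, here is a sketch of the same pipeline written without chaining, saving each intermediate PCollection in a variable (the variable names are arbitrary); only the final apply, to the Write transform, returns PDone, and its return value is simply not used:

public static void main(String[] args) {
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

  // Save each intermediate PCollection instead of chaining the apply calls.
  PCollection<String> lines = p.apply(TextIO.Read.from("gs://..."));   // Root transform: read input.
  PCollection<String> formattedCounts = lines.apply(new CountWords()); // Do some processing.
  formattedCounts.apply(TextIO.Write.to("gs://..."));                  // Returns PDone, which is ignored.

  // Run the pipeline.
  p.run();
}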