Pipelines

In the Dataflow SDKs, a pipeline represents a data processing job. You build a pipeline by writing a program using a Dataflow SDK. A pipeline consists of a set of operations that can read a source of input data, transform that data, and write out the resulting output. The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.

You can create pipelines with varying degrees of complexity. A pipeline can be relatively simple and linear, where transforms are executed one after another, or a pipeline can branch and merge. In this way, you can think of a pipeline as a directed graph of steps rather than a simple linear sequence of steps. Your program can create this directed graph by using conditionals, loops, and other common programming structures.
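
For illustration, the following minimal sketch shows a branching shape, using classes described later on this page (PCollection and transforms). It assumes an existing Pipeline object named p; the transforms FilterStartsWithA and FilterStartsWithB are hypothetical placeholders, not classes from the SDK.

Java

  // One input collection read from a source...
  PCollection<String> lines = p.apply(TextIO.Read.from("gs://..."));

  // ...feeds two independent branches of processing, so the pipeline's
  // shape is a graph rather than a straight line.
  PCollection<String> aLines = lines.apply(new FilterStartsWithA());  // hypothetical transform
  PCollection<String> bLines = lines.apply(new FilterStartsWithB());  // hypothetical transform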

Note: When you write a program with a Dataflow SDK, your program creates a pipeline specification. This specification is sent to a pipeline runner, which can be the Cloud Dataflow service or a third-party runner. The pipeline runner executes the actual pipeline asynchronously. A pipeline can also be executed locally for testing and debugging purposes.

When the pipeline runner builds your actual pipeline for distributed execution, the pipeline may be optimized. For example, it may be more computationally efficient to run certain transforms together, or in a different order. The Dataflow service fully manages this aspect of your pipeline's execution.

Parts of a Pipeline

A pipeline consists of two parts: data and transforms applied to that data. The Dataflow SDKs provide classes to represent both data and transforms. The Dataflow SDKs tie together the data classes and transform classes to construct the entire pipeline. See Constructing Your Pipeline for a start-to-finish guide on how to use the Dataflow SDK classes to construct your pipeline.

Pipeline Data

In the Dataflow SDKs, pipelines use a specialized collection class called PCollection to represent their input, intermediate, and output data. PCollections can be used to represent data sets of virtually any size. Note that compared to typical collection classes such as Java's Collection, PCollections are specifically designed to support parallelized processing.

A pipeline must create a PCollection for any data it needs to work with. You can read data from an external source into a PCollection, or you can create a PCollection from local data in your Dataflow program. From there, each transform in your pipeline accepts one or more PCollections as input and produces one or more PCollections as output.
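
For example, the following sketch shows both creation styles in the Dataflow Java SDK, assuming a Pipeline object named p has already been created: reading text from an external source with TextIO, and building a small PCollection from in-memory strings with Create. The file path and sample strings are placeholders.

Java

  // Read data from an external source (text files in Cloud Storage) into a
  // PCollection of lines.
  PCollection<String> linesFromFile = p.apply(TextIO.Read.from("gs://..."));

  // Create a PCollection from local, in-memory data in the Dataflow program.
  PCollection<String> linesFromMemory =
      p.apply(Create.of("To be, or not to be:", "that is the question."));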

See PCollection for a complete discussion of how a PCollection works and how to use one.

Pipeline Transforms

A transform is a step in your pipeline. Each transform takes one or more PCollections as input, changes or otherwise manipulates the elements of those PCollections, and produces one or more new PCollections as output.

Core Transforms

The Dataflow SDKs contain a number of core transforms. A core transform represents a basic or common processing operation that you perform on your pipeline data. Most core transforms provide a generic processing pattern and require you to create and supply the actual processing logic that gets applied to the input PCollection.

For example, the ParDo core transform provides a generic processing pattern: for every element in the input PCollection, perform a user-specified processing function on that element. The Dataflow SDKs supply core transforms such as ParDo and GroupByKey, as well as other core transforms for combining, merging, and splitting data sets.
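
As an illustration, the following sketch applies the ParDo pattern to an existing PCollection of strings named words, using an anonymous DoFn that upper-cases each element. The collection name and the upper-casing logic are assumptions chosen for illustration, not part of the example pipeline later on this page.

Java

  // Apply a user-specified processing function to every element of the
  // input PCollection; ParDo produces a new output PCollection.
  PCollection<String> upperCaseWords = words.apply(
      ParDo.of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
          // Called once per input element; emits one output element.
          c.output(c.element().toUpperCase());
        }
      }));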

See Transforms for a complete discussion of how to use transforms in your pipeline.

Composite Transforms

The Dataflow SDKs support combining multiple transforms into larger composite transforms. In a composite transform, multiple transforms are applied to a data set to perform a more complex data processing operation. Composite transforms are a good way to build modular, reusable combinations of transforms.

The Dataflow SDKs contain libraries of pre-written composite transforms that handle common data processing use cases, including (but not limited to) the following; a brief usage sketch appears after the list:

  • Combining data, such as summing or averaging numerical data
  • Map/Shuffle/Reduce-style processing, such as counting unique elements in a collection
  • Statistical analysis, such as finding the top N elements in a collection
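
For example, the following sketch (an illustration, not taken from the SDK documentation) applies two pre-written composite transforms to an assumed PCollection of integers named values: Sum to combine the elements into a total, and Top to find the three largest elements.

Java

  // Combine the elements into a single sum.
  PCollection<Integer> total = values.apply(Sum.integersGlobally());

  // Find the top 3 elements in the collection.
  PCollection<List<Integer>> top3 = values.apply(Top.<Integer>largest(3));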

You can also create your own reusable composite transforms. See Creating Composite Transforms for a complete discussion.
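
As a sketch of what a user-defined composite transform can look like in the Dataflow Java SDK, the following class is similar in spirit to the CountWords transform used in the example pipeline later on this page: it subclasses PTransform and nests a ParDo and the pre-written Count composite transform. The DoFns ExtractWordsFn and FormatCountsFn are hypothetical and shown only to illustrate the structure.

Java

  public static class CountWords
      extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> apply(PCollection<String> lines) {
      // Split each line of text into individual words (hypothetical DoFn).
      PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));

      // Count how many times each word appears, using the SDK's pre-written
      // Count composite transform.
      PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

      // Format each word and its count as a printable string (hypothetical DoFn).
      return counts.apply(ParDo.of(new FormatCountsFn()));
    }
  }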

Root Transforms

A pipeline often uses a root transform at its start to create an initial PCollection. Root transforms frequently read data from an external data source. See Pipeline I/O for additional information.

A Simple Example Pipeline

The following example demonstrates constructing and running a pipeline with three transforms: a transform to read in some data, a transform to count the data, and a transform to write out the results of the count.

Note: See Constructing Your Pipeline for a detailed discussion of how to construct a pipeline using the classes in the Dataflow SDKs.

A common style for constructing a pipeline is to "chain" transforms together. To chain transforms, you apply each new transform directly to the resulting PCollection of the previous transform, as shown in the following example.

Java

  public static void main(String[] args) {
    // Create a pipeline parameterized by commandline flags.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://..."))   // Read input.
     .apply(new CountWords())               // Do some processing.
     .apply(TextIO.Write.to("gs://..."));   // Write output.

    // Run the pipeline.
    p.run();
  }

In the example, the first call to apply invokes a root transform to create a PCollection (in this case, by reading data from a file). Each subsequent call to apply operates on the PCollection produced by the previous transform.

Note: The return value for the entire chain is not saved. This is because the final apply call, to the Write transform, returns a trivial value of type PDone instead of a PCollection. PDone is generally ignored.
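
An equivalent pipeline can be written without chaining by assigning each intermediate PCollection to a variable, which can make branching pipelines easier to read. The following sketch repeats the example above in that style; the element types are assumptions chosen for illustration.

Java

  public static void main(String[] args) {
    // Create a pipeline parameterized by commandline flags.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bind each intermediate PCollection to a variable instead of chaining.
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://..."));    // Read input.
    PCollection<String> formattedCounts = lines.apply(new CountWords());  // Do some processing.
    formattedCounts.apply(TextIO.Write.to("gs://..."));                   // Write output.

    // Run the pipeline.
    p.run();
  }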
