Pipeline (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1


Class Pipeline

  • Direct Known Subclasses:
    DataflowPipeline, DirectPipeline, TestPipeline

    public class Pipeline
    extends Object
    A Pipeline manages a directed acyclic graph of PTransforms, and the PCollections that the PTransforms consume and produce.

    A Pipeline is initialized with a PipelineRunner that will later execute the Pipeline.
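
    The relationship between a pipeline and the runner it is initialized with can be sketched with a toy model. Everything in the block below (ToyPipeline, ToyRunner, IN_PROCESS, demo) is a hypothetical stand-in written for illustration; none of it is part of the Dataflow SDK, and the SDK's own classes may behave differently in detail. The sketch only assumes what the text states: the runner is fixed when the pipeline is created, and it executes the pipeline later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Toy model only: every type here is hypothetical and is NOT a Dataflow SDK class.
final class DeferredExecutionSketch {

    /** Hypothetical stand-in for a runner: decides how the recorded steps get executed. */
    interface ToyRunner {
        List<String> run(List<String> input, List<Function<String, String>> steps);
    }

    /** Hypothetical stand-in for a pipeline: it is bound to a runner when it is created. */
    static final class ToyPipeline {
        private final ToyRunner runner;
        private final List<Function<String, String>> steps = new ArrayList<>();

        ToyPipeline(ToyRunner runner) {
            this.runner = runner;
        }

        /** Records a step; the function itself is not invoked here. */
        ToyPipeline apply(Function<String, String> step) {
            steps.add(step);
            return this;
        }

        /** Hands the recorded steps to the runner chosen at construction time. */
        List<String> run(List<String> input) {
            return runner.run(input, steps);
        }
    }

    /** A runner that executes every step sequentially in the current process. */
    static final ToyRunner IN_PROCESS = (input, steps) -> {
        List<String> out = new ArrayList<>();
        for (String element : input) {
            String value = element;
            for (Function<String, String> step : steps) {
                value = step.apply(value);
            }
            out.add(value);
        }
        return out;
    };

    /** Returns "<step calls before run>:<step calls after run>:<results>". */
    static String demo() {
        int[] calls = {0};
        ToyPipeline p = new ToyPipeline(IN_PROCESS)
            .apply(s -> { calls[0]++; return s.trim(); })
            .apply(s -> { calls[0]++; return s.toUpperCase(Locale.ROOT); });
        int callsBeforeRun = calls[0]; // apply() only recorded the steps
        List<String> result = p.run(Arrays.asList(" yet ", "more"));
        return callsBeforeRun + ":" + calls[0] + ":" + result;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 0:4:[YET, MORE]
    }
}
```

    In this toy model, swapping IN_PROCESS for another ToyRunner implementation would change how the steps execute without touching the code that builds the pipeline.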

    Pipelines are independent, so they can be constructed and executed concurrently.

    Each Pipeline is self-contained and isolated from any other Pipeline. The PValues that are inputs and outputs of each of a Pipeline's PTransforms are also owned by that Pipeline. A PValue owned by one Pipeline can be read only by PTransforms also owned by that Pipeline.
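
    The ownership rule above can be illustrated with a small toy model. The types below (ToyPipeline, ToyValue) and the flatten helper are hypothetical and are not Dataflow SDK classes; the exception thrown on a cross-pipeline merge is an assumption about how such a rule could be enforced, not a description of what the SDK does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every type here is hypothetical and is NOT a Dataflow SDK class.
final class OwnershipSketch {

    /** Hypothetical pipeline that remembers the names of the transforms applied to it. */
    static final class ToyPipeline {
        private final List<String> transforms = new ArrayList<>();

        /** Creates a root value owned by this pipeline. */
        ToyValue root(String name) {
            transforms.add(name);
            return new ToyValue(this);
        }

        List<String> transforms() {
            return transforms;
        }
    }

    /** Hypothetical value that records which pipeline owns it. */
    static final class ToyValue {
        private final ToyPipeline owner;

        ToyValue(ToyPipeline owner) {
            this.owner = owner;
        }

        ToyPipeline owner() {
            return owner;
        }

        /** The output of a transform is owned by the same pipeline as its input. */
        ToyValue apply(String transformName) {
            owner.transforms.add(transformName);
            return new ToyValue(owner);
        }
    }

    /** Merges two values, rejecting inputs that belong to different pipelines. */
    static ToyValue flatten(ToyValue first, ToyValue second) {
        if (first.owner != second.owner) {
            throw new IllegalArgumentException("inputs are owned by different pipelines");
        }
        first.owner.transforms.add("Flatten");
        return new ToyValue(first.owner);
    }

    /** Returns true if merging values from two separate pipelines is rejected. */
    static boolean crossPipelineRejected() {
        ToyValue a = new ToyPipeline().root("ReadA");
        ToyValue b = new ToyPipeline().root("ReadB");
        try {
            flatten(a, b);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    /** Returns the transforms recorded by a single pipeline after a same-owner merge. */
    static List<String> samePipelineTransforms() {
        ToyPipeline p = new ToyPipeline();
        ToyValue merged = flatten(p.root("ReadA"), p.root("ReadB")).apply("Count");
        return merged.owner().transforms();
    }

    public static void main(String[] args) {
        System.out.println(crossPipelineRejected());   // prints true
        System.out.println(samePipelineTransforms());  // prints [ReadA, ReadB, Flatten, Count]
    }
}
```

    Because each ToyValue carries a reference to its owner, every derived value stays inside the pipeline that created its inputs, which mirrors the isolation described above.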

    Here is a typical example of use:

     // Start by defining the options for the pipeline.
     PipelineOptions options = PipelineOptionsFactory.create();
     // Then create the pipeline. The runner is determined by the options.
     Pipeline p = Pipeline.create(options);
     // A root PTransform, like TextIO.Read or Create, gets added
     // to the Pipeline by being applied:
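     PCollection<String> lines =
         p.apply(TextIO.Read.from("gs://bucket/dir/file*.txt"));
     // A Pipeline can have multiple root transforms:
     PCollection<String> moreLines =
         p.apply(TextIO.Read.from("gs://bucket/other/dir/file*.txt"));
     PCollection<String> yetMoreLines =
         p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));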
     // Further PTransforms can be applied, in an arbitrary (acyclic) graph.
     // Subsequent PTransforms (and intermediate PCollections etc.) are
     // implicitly part of the same Pipeline.
     PCollection<String> allLines =
         PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
         .apply(new Flatten<String>());
     PCollection<KV<String, Integer>> wordCounts =
         allLines
         .apply(ParDo.of(new ExtractWords()))
         .apply(new Count<String>());
     PCollection<String> formattedWordCounts =
         wordCounts.apply(ParDo.of(new FormatCounts()));
     // PTransforms aren't executed when they're applied; rather, they're
     // just added to the Pipeline.  Once the whole Pipeline of PTransforms
     // is constructed, the Pipeline's PTransforms can be run using a
     // PipelineRunner.  The default PipelineRunner executes the Pipeline
     // directly, sequentially, in this one process, which is useful for
     // unit tests and simple experiments: