Google Cloud Dataflow SDK for Java, version 1.9.1
Class Pipeline
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.Pipeline
-
- Direct Known Subclasses:
- DataflowPipeline, DirectPipeline, TestPipeline
public class Pipeline extends Object
APipeline
manages a directed acyclic graph ofPTransforms
, and thePCollections
that thePTransform
s consume and produce.A
Pipeline
is initialized with aPipelineRunner
that will later execute thePipeline
.Pipelines
are independent, so they can be constructed and executed concurrently.Each
Pipeline
is self-contained and isolated from any otherPipeline
. ThePValues
that are inputs and outputs of each of aPipeline's
PTransforms
are also owned by thatPipeline
. APValue
owned by onePipeline
can be read only byPTransforms
also owned by thatPipeline
.Here is a typical example of use:
// Start by defining the options for the pipeline. PipelineOptions options = PipelineOptionsFactory.create(); // Then create the pipeline. The runner is determined by the options. Pipeline p = Pipeline.create(options); // A root PTransform, like TextIO.Read or Create, gets added // to the Pipeline by being applied: PCollection<String> lines = p.apply(TextIO.Read.from("gs://bucket/dir/file*.txt")); // A Pipeline can have multiple root transforms: PCollection<String> moreLines = p.apply(TextIO.Read.from("gs://bucket/other/dir/file*.txt")); PCollection<String> yetMoreLines = p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of())); // Further PTransforms can be applied, in an arbitrary (acyclic) graph. // Subsequent PTransforms (and intermediate PCollections etc.) are // implicitly part of the same Pipeline. PCollection<String> allLines = PCollectionList.of(lines).and(moreLines).and(yetMoreLines) .apply(new Flatten<String>()); PCollection<KV<String, Integer>> wordCounts = allLines .apply(ParDo.of(new ExtractWords())) .apply(new Count<String>()); PCollection<String> formattedWordCounts = wordCounts.apply(ParDo.of(new FormatCounts())); formattedWordCounts.apply(TextIO.Write.to("gs://bucket/dir/counts.txt")); // PTransforms aren't executed when they're applied, rather they're // just added to the Pipeline. Once the whole Pipeline of PTransforms // is constructed, the Pipeline's PTransforms can be run using a // PipelineRunner. The default PipelineRunner executes the Pipeline // directly, sequentially, in this one process, which is useful for // unit tests and simple experiments: p.run();
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description static class
Pipeline.PipelineExecutionException
static interface
Pipeline.PipelineVisitor
APipeline.PipelineVisitor
can be passed intotraverseTopologically(com.google.cloud.dataflow.sdk.Pipeline.PipelineVisitor)
to be called for each of the transforms and values in thePipeline
.
-
Constructor Summary
Constructors Modifier Constructor and Description protected
Pipeline(PipelineRunner<?> runner)
Deprecated.replaced byPipeline(PipelineRunner, PipelineOptions)
protected
Pipeline(PipelineRunner<?> runner, PipelineOptions options)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods Modifier and Type Method and Description void
addValueInternal(PValue value)
<OutputT extends POutput>
OutputTapply(PTransform<? super PBegin,OutputT> root)
Likeapply(String, PTransform)
but the transform node in thePipeline
graph will be named according toPTransform.getName()
.<OutputT extends POutput>
OutputTapply(String name, PTransform<? super PBegin,OutputT> root)
static <InputT extends PInput,OutputT extends POutput>
OutputTapplyTransform(InputT input, PTransform<? super InputT,OutputT> transform)
LikeapplyTransform(String, PInput, PTransform)
but defaulting to the name provided by thePTransform
.static <InputT extends PInput,OutputT extends POutput>
OutputTapplyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform)
Applies the givenPTransform
to this inputInputT
and returns itsOutputT
.PBegin
begin()
Returns aPBegin
owned by this Pipeline.static Pipeline
create(PipelineOptions options)
Constructs a pipeline from the provided options.CoderRegistry
getCoderRegistry()
Returns theCoderRegistry
that thisPipeline
uses.String
getFullNameForTesting(PTransform<?,?> transform)
Deprecated.this method is no longer compatible with the design ofPipeline
, asPTransforms
can be applied multiple times, with different names each time.PipelineOptions
getOptions()
Returns the configuredPipelineOptions
.PipelineRunner<?>
getRunner()
Returns the configuredPipelineRunner
.PipelineResult
run()
Runs thePipeline
using itsPipelineRunner
.void
setCoderRegistry(CoderRegistry coderRegistry)
Sets theCoderRegistry
that thisPipeline
uses.String
toString()
void
traverseTopologically(Pipeline.PipelineVisitor visitor)
Invokes thePipelineVisitor's
Pipeline.PipelineVisitor.visitTransform(com.google.cloud.dataflow.sdk.runners.TransformTreeNode)
andPipeline.PipelineVisitor.visitValue(com.google.cloud.dataflow.sdk.values.PValue, com.google.cloud.dataflow.sdk.runners.TransformTreeNode)
operations on each of thisPipeline's
transform and value nodes, in forward topological order.
-
-
-
Constructor Detail
-
Pipeline
@Deprecated protected Pipeline(PipelineRunner<?> runner)
Deprecated. replaced byPipeline(PipelineRunner, PipelineOptions)
-
Pipeline
protected Pipeline(PipelineRunner<?> runner, PipelineOptions options)
-
-
Method Detail
-
create
public static Pipeline create(PipelineOptions options)
Constructs a pipeline from the provided options.- Returns:
- The newly created pipeline.
-
begin
public PBegin begin()
-
apply
public <OutputT extends POutput> OutputT apply(PTransform<? super PBegin,OutputT> root)
Likeapply(String, PTransform)
but the transform node in thePipeline
graph will be named according toPTransform.getName()
.- See Also:
apply(String, PTransform)
-
apply
public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PBegin,OutputT> root)
Adds a rootPTransform
, such asRead
orCreate
, to thisPipeline
.The node in the
Pipeline
graph will use the providedname
. This name is used in various places, including the monitoring UI, logging, and to stably identify this node in thePipeline
graph upon update.Alias for
begin().apply(name, root)
.
-
run
public PipelineResult run()
Runs thePipeline
using itsPipelineRunner
.
-
getCoderRegistry
public CoderRegistry getCoderRegistry()
Returns theCoderRegistry
that thisPipeline
uses.
-
setCoderRegistry
public void setCoderRegistry(CoderRegistry coderRegistry)
Sets theCoderRegistry
that thisPipeline
uses.
-
traverseTopologically
public void traverseTopologically(Pipeline.PipelineVisitor visitor)
Invokes thePipelineVisitor's
Pipeline.PipelineVisitor.visitTransform(com.google.cloud.dataflow.sdk.runners.TransformTreeNode)
andPipeline.PipelineVisitor.visitValue(com.google.cloud.dataflow.sdk.values.PValue, com.google.cloud.dataflow.sdk.runners.TransformTreeNode)
operations on each of thisPipeline's
transform and value nodes, in forward topological order.Traversal of the
Pipeline
causesPTransforms
andPValues
owned by thePipeline
to be marked as finished, at which point they may no longer be modified.Typically invoked by
PipelineRunner
subclasses.
-
applyTransform
public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT,OutputT> transform)
LikeapplyTransform(String, PInput, PTransform)
but defaulting to the name provided by thePTransform
.
-
applyTransform
public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform)
Applies the givenPTransform
to this inputInputT
and returns itsOutputT
. This usesname
to identify this specific application of the transform. This name is used in various places, including the monitoring UI, logging, and to stably identify this application node in thePipeline
graph during update.Each
PInput
subclass that provides anapply
method should delegate to this method to ensure proper registration with thePipelineRunner
.
-
getRunner
public PipelineRunner<?> getRunner()
Returns the configuredPipelineRunner
.
-
getOptions
public PipelineOptions getOptions()
Returns the configuredPipelineOptions
.
-
getFullNameForTesting
@Deprecated public String getFullNameForTesting(PTransform<?,?> transform)
Deprecated. this method is no longer compatible with the design ofPipeline
, asPTransforms
can be applied multiple times, with different names each time.
-
addValueInternal
public void addValueInternal(PValue value)
-
-