Google Cloud Dataflow SDK for Java, version 1.9.1
Class PTransform<InputT extends PInput,OutputT extends POutput>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.transforms.PTransform<InputT,OutputT>
-
- Type Parameters:
InputT
- the type of the input to this PTransformOutputT
- the type of the output of this PTransform
- All Implemented Interfaces:
- HasDisplayData, Serializable
- Direct Known Subclasses:
- AvroIO.Read.Bound, AvroIO.Write.Bound, BigQueryIO.Read.Bound, BigQueryIO.Write.Bound, BigtableIO.Read, BigtableIO.Write, CoGroupByKey, Combine.Globally, Combine.GloballyAsSingletonView, Combine.GroupedValues, Combine.PerKey, Combine.PerKeyWithHotKeyFanout, Count.PerElement, CountingInput.BoundedCountingInput, CountingInput.UnboundedCountingInput, Create.Values, DataflowAssert.GroupThenAssert, DataflowAssert.GroupThenAssertForSingleton, DataflowAssert.OneSideInputAssert, DatastoreV1.DeleteEntity, DatastoreV1.DeleteKey, DatastoreV1.Read, DatastoreV1.Write, Filter, FlatMapElements, Flatten.FlattenIterables, Flatten.FlattenPCollectionList, ForwardingPTransform, GroupByKey, GroupByKey.GroupAlsoByWindow, GroupByKey.GroupByKeyOnly, GroupByKey.ReifyTimestampsAndWindows, GroupByKey.SortValuesByTimestamp, IntraBundleParallelization.Bound, Keys, KvSwap, MapElements, ParDo.Bound, ParDo.BoundMulti, Partition, PubsubIO.Read.Bound, PubsubIO.Write.Bound, PubsubUnboundedSink, PubsubUnboundedSource, Read.Bounded, Read.Unbounded, RemoveDuplicates, RemoveDuplicates.WithRepresentativeValues, Sample.SampleAny, TestStream, TextIO.Read.Bound, TextIO.Write.Bound, Values, View.AsIterable, View.AsList, View.AsMap, View.AsMultimap, View.AsSingleton, View.CreatePCollectionView, Window.Bound, Window.Remerge, WithKeys, WithTimestamps, Write.Bound
public abstract class PTransform<InputT extends PInput,OutputT extends POutput> extends Object implements Serializable, HasDisplayData
APTransform<InputT, OutputT>
is an operation that takes anInputT
(some subtype ofPInput
) and produces anOutputT
(some subtype ofPOutput
).Common PTransforms include root PTransforms like
TextIO.Read
,Create
, processing and conversion operations likeParDo
,GroupByKey
,CoGroupByKey
,Combine
, andCount
, and outputting PTransforms likeTextIO.Write
. Users also define their own application-specific composite PTransforms.Each
PTransform<InputT, OutputT>
has a singleInputT
type and a singleOutputT
type. Many PTransforms conceptually transform one input value to one output value, and in this caseInputT
andOutput
are typically instances ofPCollection
. A root PTransform conceptually has no input; in this case, conventionally aPBegin
object produced by callingPipeline.begin()
is used as the input. An outputting PTransform conceptually has no output; in this case, conventionallyPDone
is used as its output type. Some PTransforms conceptually have multiple inputs and/or outputs; in these cases special "bundling" classes likePCollectionList
,PCollectionTuple
are used to combine multiple values into a single bundle for passing into or returning from the PTransform.A
PTransform<InputT, OutputT>
is invoked by callingapply()
on itsInputT
, returning itsOutputT
. Calls can be chained to concisely create linear pipeline segments. For example:PCollection<T1> pc1 = ...; PCollection<T2> pc2 = pc1.apply(ParDo.of(new MyDoFn<T1,KV<K,V>>())) .apply(GroupByKey.<K, V>create()) .apply(Combine.perKey(new MyKeyedCombineFn<K,V>())) .apply(ParDo.of(new MyDoFn2<KV<K,V>,T2>()));
PTransform operations have unique names, which are used by the system when explaining what's going on during optimization and execution. Each PTransform gets a system-provided default name, but it's a good practice to specify an explicit name, where possible, using the
named()
method offered by some PTransforms such asParDo
. For example:... .apply(ParDo.named("Step1").of(new MyDoFn3())) ...
Each PCollection output produced by a PTransform, either directly or within a "bundling" class, automatically gets its own name derived from the name of its producing PTransform.
Each PCollection output produced by a PTransform also records a
Coder
that specifies how the elements of that PCollection are to be encoded as a byte string, if necessary. The PTransform may provide a default Coder for any of its outputs, for instance by deriving it from the PTransform input's Coder. If the PTransform does not specify the Coder for an output PCollection, the system will attempt to infer a Coder for it, based on what's known at run-time about the Java type of the output's elements. The enclosingPipeline
'sCoderRegistry
(accessible viaPipeline.getCoderRegistry()
) defines the mapping from Java types to the default Coder to use, for a standard set of Java types; users can extend this mapping for additional types, viaCoderRegistry.registerCoder(java.lang.Class<?>, java.lang.Class<?>)
. If this inference process fails, either because the Java type was not known at run-time (e.g., due to Java's "erasure" of generic types) or there was no default Coder registered, then the Coder should be specified manually by callingTypedPValue.setCoder(com.google.cloud.dataflow.sdk.coders.Coder<T>)
on the output PCollection. The Coder of every output PCollection must be determined one way or another before that output is used as an input to another PTransform, or before the enclosing Pipeline is run.A small number of PTransforms are implemented natively by the Google Cloud Dataflow SDK; such PTransforms simply return an output value as their apply implementation. The majority of PTransforms are implemented as composites of other PTransforms. Such a PTransform subclass typically just implements
apply(InputT)
, computing its Output value from itsInputT
value. User programs are encouraged to use this mechanism to modularize their own code. Such composite abstractions get their own name, and navigating through the composition hierarchy of PTransforms is supported by the monitoring interface. Examples of composite PTransforms can be found in this directory and in examples. From the caller's point of view, there is no distinction between a PTransform implemented natively and one implemented in terms of other PTransforms; both kinds of PTransform are invoked in the same way, usingapply()
.Note on Serialization
PTransform
doesn't actually support serialization, despite implementingSerializable
.PTransform
is markedSerializable
solely because it is common for an anonymousDoFn
, instance to be created within anapply()
method of a compositePTransform
.Each of those
*Fn
s isSerializable
, but unfortunately its instance state will contain a reference to the enclosingPTransform
instance, and so attempt to serialize thePTransform
instance, even though the*Fn
instance never references anything about the enclosingPTransform
.To allow such anonymous
*Fn
s to be written conveniently,PTransform
is marked asSerializable
, and includes dummywriteObject()
andreadObject()
operations that do not save or restore any state.- See Also:
- Applying Transformations, Serialized Form
-
-
Field Summary
Fields Modifier and Type Field and Description protected String
name
The base name of thisPTransform
, e.g., fromParDo.named(String)
, or from defaults, ornull
if not yet assigned.
-
Constructor Summary
Constructors Modifier Constructor and Description protected
PTransform()
protected
PTransform(String name)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method and Description OutputT
apply(InputT input)
Applies thisPTransform
on the givenInputT
, and returns itsOutput
.protected Coder<?>
getDefaultOutputCoder()
Returns the defaultCoder
to use for the output of this single-outputPTransform
.protected Coder<?>
getDefaultOutputCoder(InputT input)
Returns the defaultCoder
to use for the output of this single-outputPTransform
when applied to the given input.<T> Coder<T>
getDefaultOutputCoder(InputT input, TypedPValue<T> output)
Returns the defaultCoder
to use for the given output of this single-outputPTransform
when applied to the given input.protected String
getKindString()
Returns the name to use by default for thisPTransform
(not including the names of any enclosingPTransform
s).String
getName()
Returns the transform name.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.String
toString()
void
validate(InputT input)
Called before invoking apply (which may be intercepted by the runner) to verify this transform is fully specified and applicable to the specified input.
-
-
-
Field Detail
-
name
protected final transient String name
The base name of thisPTransform
, e.g., fromParDo.named(String)
, or from defaults, ornull
if not yet assigned.
-
-
Constructor Detail
-
PTransform
protected PTransform()
-
PTransform
protected PTransform(String name)
-
-
Method Detail
-
apply
public OutputT apply(InputT input)
Applies thisPTransform
on the givenInputT
, and returns itsOutput
.Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
The default implementation throws an exception. A derived class must either implement apply, or else each runner must supply a custom implementation via
PipelineRunner.apply(com.google.cloud.dataflow.sdk.transforms.PTransform<InputT, OutputT>, InputT)
.
-
validate
public void validate(InputT input)
Called before invoking apply (which may be intercepted by the runner) to verify this transform is fully specified and applicable to the specified input.By default, does nothing.
-
getName
public String getName()
Returns the transform name.This name is provided by the transform creator and is not required to be unique.
-
getKindString
protected String getKindString()
Returns the name to use by default for thisPTransform
(not including the names of any enclosingPTransform
s).By default, returns the base name of this
PTransform
's class.The caller is responsible for ensuring that names of applied
PTransform
s are unique, e.g., by adding a uniquifying suffix when needed.
-
getDefaultOutputCoder
protected Coder<?> getDefaultOutputCoder() throws CannotProvideCoderException
Returns the defaultCoder
to use for the output of this single-outputPTransform
.By default, always throws
- Throws:
CannotProvideCoderException
- if no coder can be inferred
-
getDefaultOutputCoder
protected Coder<?> getDefaultOutputCoder(InputT input) throws CannotProvideCoderException
Returns the defaultCoder
to use for the output of this single-outputPTransform
when applied to the given input.By default, always throws.
- Throws:
CannotProvideCoderException
- if none can be inferred.
-
getDefaultOutputCoder
public <T> Coder<T> getDefaultOutputCoder(InputT input, TypedPValue<T> output) throws CannotProvideCoderException
Returns the defaultCoder
to use for the given output of this single-outputPTransform
when applied to the given input.By default, always throws.
- Throws:
CannotProvideCoderException
- if none can be inferred.
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
-