Google Cloud Dataflow SDK for Java, version 1.9.1
Class DoFn<InputT,OutputT>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.transforms.DoFn<InputT,OutputT>
-
- Type Parameters:
InputT
- the type of the (main) input elementsOutputT
- the type of the (main) output elements
- All Implemented Interfaces:
- HasDisplayData, Serializable
- Direct Known Subclasses:
- DirectPipelineRunner.TestCombineDoFn, IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn, PubsubIO.Read.Bound.PubsubReader, PubsubIO.Write.Bound.PubsubWriter
public abstract class DoFn<InputT,OutputT> extends Object implements Serializable, HasDisplayData
The argument toParDo
providing the code to use to process elements of the inputPCollection
.See
ParDo
for more explanation, examples of use, and discussion of constraints onDoFn
s, including their serializability, lack of access to global shared mutable state, requirements for failure tolerance, and benefits of optimization.DoFn
s can be tested in the context of a particularPipeline
by running thatPipeline
on sample input and then checking its output. Unit testing of aDoFn
, separately from anyParDo
transform orPipeline
, can be done via theDoFnTester
harness.DoFnWithContext
(currently experimental) offers an alternative mechanism for accessingDoFn.ProcessContext.window()
without the need to implementDoFn.RequiresWindowAccess
.See also
processElement(com.google.cloud.dataflow.sdk.transforms.DoFn<InputT, OutputT>.ProcessContext)
for details on implementing the transformation fromInputT
toOutputT
.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description class
DoFn.Context
Information accessible to all methods in thisDoFn
.class
DoFn.ProcessContext
Information accessible when runningprocessElement(com.google.cloud.dataflow.sdk.transforms.DoFn<InputT, OutputT>.ProcessContext)
.static interface
DoFn.RequiresWindowAccess
Interface for signaling that aDoFn
needs to access the window the element is being processed in, viaDoFn.ProcessContext.window()
.
-
Constructor Summary
Constructors Constructor and Description DoFn()
-
Method Summary
All Methods Instance Methods Abstract Methods Concrete Methods Deprecated Methods Modifier and Type Method and Description protected <AggInputT,AggOutputT>
Aggregator<AggInputT,AggOutputT>createAggregator(String name, Combine.CombineFn<? super AggInputT,?,AggOutputT> combiner)
Returns anAggregator
with aggregation logic specified by theCombine.CombineFn
argument.protected <AggInputT>
Aggregator<AggInputT,AggInputT>createAggregator(String name, SerializableFunction<Iterable<AggInputT>,AggInputT> combiner)
Returns anAggregator
with the aggregation logic specified by theSerializableFunction
argument.void
finishBundle(DoFn.Context c)
Finishes processing this batch of elements.Duration
getAllowedTimestampSkew()
Deprecated.does not interact well with the watermark.protected TypeDescriptor<InputT>
getInputTypeDescriptor()
Returns aTypeDescriptor
capturing what is known statically about the input type of thisDoFn
instance's most-derived class.protected TypeDescriptor<OutputT>
getOutputTypeDescriptor()
Returns aTypeDescriptor
capturing what is known statically about the output type of thisDoFn
instance's most-derived class.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.abstract void
processElement(DoFn.ProcessContext c)
Processes one input element.void
startBundle(DoFn.Context c)
Prepares thisDoFn
instance for processing a batch of elements.
-
-
-
Method Detail
-
getAllowedTimestampSkew
@Deprecated public Duration getAllowedTimestampSkew()
Deprecated. does not interact well with the watermark.Returns the allowed timestamp skew duration, which is the maximum duration that timestamps can be shifted backward inDoFn.Context.outputWithTimestamp(OutputT, org.joda.time.Instant)
.The default value is
Duration.ZERO
, in which case timestamps can only be shifted forward to future. For infinite skew, returnDuration.millis(Long.MAX_VALUE)
.Note that producing an element whose timestamp is less than the current timestamp may result in late data, i.e. returning a non-zero value here does not impact watermark calculations used for firing windows.
-
startBundle
public void startBundle(DoFn.Context c) throws Exception
Prepares thisDoFn
instance for processing a batch of elements.By default, does nothing.
- Throws:
Exception
-
processElement
public abstract void processElement(DoFn.ProcessContext c) throws Exception
Processes one input element.The current element of the input
PCollection
is returned byc.element()
. It should be considered immutable. The Dataflow runtime will not mutate the element, so it is safe to cache, etc. The element should not be mutated by any of theDoFn
methods, because it may be cached elsewhere, retained by the Dataflow runtime, or used in other unspecified ways.A value is added to the main output
PCollection
byDoFn.Context.output(OutputT)
. Once passed tooutput
the element should be considered immutable and not be modified in any way. It may be cached elsewhere, retained by the Dataflow runtime, or used in other unspecified ways.- Throws:
Exception
- See Also:
DoFn.ProcessContext
-
finishBundle
public void finishBundle(DoFn.Context c) throws Exception
Finishes processing this batch of elements.By default, does nothing.
- Throws:
Exception
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
getInputTypeDescriptor
protected TypeDescriptor<InputT> getInputTypeDescriptor()
Returns aTypeDescriptor
capturing what is known statically about the input type of thisDoFn
instance's most-derived class.See
getOutputTypeDescriptor()
for more discussion.
-
getOutputTypeDescriptor
protected TypeDescriptor<OutputT> getOutputTypeDescriptor()
Returns aTypeDescriptor
capturing what is known statically about the output type of thisDoFn
instance's most-derived class.In the normal case of a concrete
DoFn
subclass with no generic type parameters of its own (including anonymous inner classes), this will be a complete non-generic type, which is good for choosing a default outputCoder<OutputT>
for the outputPCollection<OutputT>
.
-
createAggregator
protected final <AggInputT,AggOutputT> Aggregator<AggInputT,AggOutputT> createAggregator(String name, Combine.CombineFn<? super AggInputT,?,AggOutputT> combiner)
Returns anAggregator
with aggregation logic specified by theCombine.CombineFn
argument. The name provided must be unique acrossAggregator
s created within the DoFn. Aggregators can only be created during pipeline construction.- Parameters:
name
- the name of the aggregatorcombiner
- theCombine.CombineFn
to use in the aggregator- Returns:
- an aggregator for the provided name and combiner in the scope of this DoFn
- Throws:
NullPointerException
- if the name or combiner is nullIllegalArgumentException
- if the given name collides with another aggregator in this scopeIllegalStateException
- if called during pipeline processing.
-
createAggregator
protected final <AggInputT> Aggregator<AggInputT,AggInputT> createAggregator(String name, SerializableFunction<Iterable<AggInputT>,AggInputT> combiner)
Returns anAggregator
with the aggregation logic specified by theSerializableFunction
argument. The name provided must be unique acrossAggregator
s created within the DoFn. Aggregators can only be created during pipeline construction.- Parameters:
name
- the name of the aggregatorcombiner
- theSerializableFunction
to use in the aggregator- Returns:
- an aggregator for the provided name and combiner in the scope of this DoFn
- Throws:
NullPointerException
- if the name or combiner is nullIllegalArgumentException
- if the given name collides with another aggregator in this scopeIllegalStateException
- if called during pipeline processing.
-
-