Google Cloud Dataflow SDK for Java, version 1.9.1
Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.io.Source<OutputT>
-
- com.google.cloud.dataflow.sdk.io.UnboundedSource<OutputT,CheckpointMarkT>
-
- Type Parameters:
OutputT
- Type of records output by this source.CheckpointMarkT
- Type of checkpoint marks used by the readers of this source.
- All Implemented Interfaces:
- HasDisplayData, Serializable
public abstract class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark> extends Source<OutputT>
ASource
that reads an unbounded amount of input and, because of that, supports some additional operations such as checkpointing, watermarks, and record ids.- Checkpointing allows sources to not re-read the same data again in the case of failures.
- Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
- Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.
See
Window
andTrigger
for more information on timestamps and watermarks.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description static interface
UnboundedSource.CheckpointMark
A marker representing the progress and state of anUnboundedSource.UnboundedReader
.static class
UnboundedSource.UnboundedReader<OutputT>
AReader
that reads an unbounded amount of input.-
Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.Source
Source.Reader<T>
-
-
Constructor Summary
Constructors Constructor and Description UnboundedSource()
-
Method Summary
All Methods Instance Methods Abstract Methods Concrete Methods Modifier and Type Method and Description abstract UnboundedSource.UnboundedReader<OutputT>
createReader(PipelineOptions options, CheckpointMarkT checkpointMark)
Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>>
generateInitialSplits(int desiredNumSplits, PipelineOptions options)
Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow.abstract Coder<CheckpointMarkT>
getCheckpointMarkCoder()
Returns aCoder
for encoding and decoding the checkpoints for this source, or null if the checkpoints do not need to be durably committed.boolean
requiresDeduping()
Returns whether this source requires explicit deduping.-
Methods inherited from class com.google.cloud.dataflow.sdk.io.Source
getDefaultOutputCoder, populateDisplayData, validate
-
-
-
-
Method Detail
-
generateInitialSplits
public abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>> generateInitialSplits(int desiredNumSplits, PipelineOptions options) throws Exception
Returns a list ofUnboundedSource
objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.
Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.
Some data sources automatically partition their data among readers. For these types of inputs,
n
identical replicas of the top-level source can be returned.The size of the returned list should be as close to
desiredNumSplits
as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.- Throws:
Exception
-
createReader
public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, @Nullable CheckpointMarkT checkpointMark)
Create a newUnboundedSource.UnboundedReader
to read from this source, resuming from the given checkpoint if present.
-
getCheckpointMarkCoder
@Nullable public abstract Coder<CheckpointMarkT> getCheckpointMarkCoder()
Returns aCoder
for encoding and decoding the checkpoints for this source, or null if the checkpoints do not need to be durably committed.
-
requiresDeduping
public boolean requiresDeduping()
Returns whether this source requires explicit deduping.This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.
-
-