Google Cloud Dataflow SDK for Java, version 1.9.1
Class UnboundedSource.UnboundedReader<OutputT>
- java.lang.Object
  - com.google.cloud.dataflow.sdk.io.Source.Reader<OutputT>
    - com.google.cloud.dataflow.sdk.io.UnboundedSource.UnboundedReader<OutputT>
- All Implemented Interfaces: AutoCloseable
- Enclosing class: UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>
@Experimental(value=SOURCE_SINK) public abstract static class UnboundedSource.UnboundedReader<OutputT> extends Source.Reader<OutputT>
A Reader that reads an unbounded amount of input.
A given UnboundedReader object will only be accessed by a single thread at once.
-
Field Summary
- static long BACKLOG_UNKNOWN: Constant representing an unknown amount of backlog.
-
Constructor Summary
- UnboundedReader()
-
Method Summary
- abstract boolean advance(): Advances the reader to the next valid record.
- abstract UnboundedSource.CheckpointMark getCheckpointMark(): Returns an UnboundedSource.CheckpointMark representing the progress of this UnboundedReader.
- byte[] getCurrentRecordId(): Returns a unique identifier for the current record.
- abstract UnboundedSource<OutputT,?> getCurrentSource(): Returns the UnboundedSource that created this reader.
- long getSplitBacklogBytes(): Returns the size of the backlog of unread data in the underlying data source represented by this split of this source.
- long getTotalBacklogBytes(): Returns the size of the backlog of unread data in the underlying data source represented by all splits of this source.
- abstract Instant getWatermark(): Returns a timestamp before or at the timestamps of all future elements read by this reader.
- abstract boolean start(): Initializes the reader and advances the reader to the first record.
Methods inherited from class com.google.cloud.dataflow.sdk.io.Source.Reader
close, getCurrent, getCurrentTimestamp
-
Field Detail
-
BACKLOG_UNKNOWN
public static final long BACKLOG_UNKNOWN
Constant representing an unknown amount of backlog.
- See Also: Constant Field Values
-
-
Method Detail
-
start
public abstract boolean start() throws IOException
Initializes the reader and advances the reader to the first record.
This method should be called exactly once. The invocation should occur prior to calling advance() or Source.Reader.getCurrent(). This method may perform expensive operations that are needed to initialize the reader.
Returns true if a record was read, false if there is no more input currently available. Future calls to advance() may return true once more data is available. Regardless of the return value of start, start will not be called again on the same UnboundedReader object; it will only be called again when a new reader object is constructed for the same source, e.g. on recovery.
- Specified by: start in class Source.Reader<OutputT>
- Returns: true if a record was read, false if there is no more input available.
- Throws: IOException
-
advance
public abstract boolean advance() throws IOException
Advances the reader to the next valid record.
Returns true if a record was read, false if there is no more input available. Future calls to advance() may return true once more data is available.
- Specified by: advance in class Source.Reader<OutputT>
- Returns: true if a record was read, false if there is no more input available.
- Throws: IOException
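The calling pattern implied by the start() and advance() contracts can be sketched as follows. This is only an illustration: QueueBackedReader, its offer method, and ReadLoopSketch.drain are hypothetical stand-ins written for this example, not SDK classes, and the in-memory queue merely simulates a source whose data arrives over time.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

/** Hypothetical stand-in that mimics the start()/advance()/getCurrent() contract. */
class QueueBackedReader {
  private final Queue<String> pending = new ArrayDeque<>();
  private String current;   // null while no record is available
  private boolean started;

  /** Simulates new data arriving in the underlying source (assumption for the sketch). */
  void offer(String record) {
    pending.add(record);
  }

  /** Called exactly once per reader object, before advance() or getCurrent(). */
  boolean start() throws IOException {
    if (started) {
      throw new IllegalStateException("start() must be called exactly once");
    }
    started = true;
    return advance();
  }

  /** Returns false when nothing is available right now; a later call may return true. */
  boolean advance() throws IOException {
    if (!started) {
      throw new IllegalStateException("start() must be called first");
    }
    current = pending.poll();
    return current != null;
  }

  String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException("no current record");
    }
    return current;
  }
}

class ReadLoopSketch {
  /** Reads every record available right now, using start() only on the first pass. */
  static List<String> drain(QueueBackedReader reader, boolean firstPass) throws IOException {
    List<String> out = new ArrayList<>();
    boolean available = firstPass ? reader.start() : reader.advance();
    while (available) {
      out.add(reader.getCurrent());
      available = reader.advance();
    }
    return out;
  }
}
```

A false return ends only the current pass. The same reader object is polled again later with advance(), never with a second start().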
-
getCurrentRecordId
public byte[] getCurrentRecordId() throws NoSuchElementException
Returns a unique identifier for the current record. This should be the same for each instance of the same logical record read from the underlying data source.
It is only necessary to override this if UnboundedSource.requiresDeduping() has been overridden to return true.
For example, this could be a hash of the record contents, or a logical ID present in the record. If this is generated as a hash of the record contents, it should be at least 16 bytes (128 bits) to avoid collisions.
This method has the same restrictions on when it can be called as Source.Reader.getCurrent() and Source.Reader.getCurrentTimestamp().
- Throws: NoSuchElementException - if the reader is at the beginning of the input and start() or advance() wasn't called, or if the last start() or advance() returned false.
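One way to derive a content-based identifier of the recommended size is to hash the record bytes and keep 16 bytes of the digest. The sketch below assumes SHA-256 from the standard library as the hash; the class and method names (RecordIdSketch, recordIdFor) are hypothetical, and a logical ID already present in the record would be an equally valid choice.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class RecordIdSketch {
  /** Returns a 16-byte (128-bit) ID that is identical for identical record contents. */
  static byte[] recordIdFor(byte[] recordContents) {
    try {
      MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha256.digest(recordContents);
      // Keep the first 16 bytes of the 32-byte digest.
      return Arrays.copyOf(digest, 16);
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform implementation.
      throw new IllegalStateException("SHA-256 unavailable", e);
    }
  }
}
```

A getCurrentRecordId() override could return this value for the serialized form of the current record, provided the serialization is deterministic.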
-
getWatermark
public abstract Instant getWatermark()
Returns a timestamp before or at the timestamps of all future elements read by this reader.
This can be approximate. If records are read that violate this guarantee, they will be considered late, which will affect how they will be processed. See Window for more information on late data and how to handle it.
However, this value should be as late as possible. Downstream windows may not be able to close until this watermark passes their end.
For example, a source may know that the records it reads will be in timestamp order. In this case, the watermark can be the timestamp of the last record read. For a source that does not have natural timestamps, timestamps can be set to the time of reading, in which case the watermark is the current clock time.
See Window and Trigger for more information on timestamps and watermarks.
May be called after advance() or start() has returned false, but not before start() has been called.
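The timestamp-ordered case described above can be sketched as a small tracker that reports the timestamp of the last record read. Several things here are assumptions made for the example: java.time.Instant stands in for the SDK's Instant type so the code is self-contained, OrderedWatermarkTracker and its methods are hypothetical names, and the value reported before any record has been read is supplied by the caller.

```java
import java.time.Instant;

/** Hypothetical helper for a source whose records arrive in timestamp order. */
class OrderedWatermarkTracker {
  private final Instant initialWatermark; // reported until the first record is read
  private Instant lastTimestamp;          // null until a record has been read

  OrderedWatermarkTracker(Instant initialWatermark) {
    this.initialWatermark = initialWatermark;
  }

  /** Call with the timestamp of each record as it is read. */
  void recordRead(Instant timestamp) {
    if (lastTimestamp != null && timestamp.isBefore(lastTimestamp)) {
      // The ordering assumption is what makes the last timestamp a valid watermark.
      throw new IllegalArgumentException("records are not in timestamp order");
    }
    lastTimestamp = timestamp;
  }

  /** Timestamp at or before every future record, given the ordering assumption. */
  Instant getWatermark() {
    return lastTimestamp != null ? lastTimestamp : initialWatermark;
  }
}
```

For a source that stamps records with the time of reading, the equivalent method would simply return the current clock time.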
-
getCheckpointMark
public abstract UnboundedSource.CheckpointMark getCheckpointMark()
Returns an UnboundedSource.CheckpointMark representing the progress of this UnboundedReader.
The elements read up until this is called will be processed together as a bundle. Once the result of this processing has been durably committed, UnboundedSource.CheckpointMark.finalizeCheckpoint() will be called on the UnboundedSource.CheckpointMark object.
The returned object should not be modified.
May be called after advance() or start() has returned false, but not before start() has been called.
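A minimal sketch of the checkpoint lifecycle, assuming an offset-based source: the reader hands out an immutable snapshot of its position, and finalizing that snapshot acknowledges everything up to the captured offset. OffsetCheckpoint and OffsetReaderSketch are hypothetical stand-ins rather than SDK types, and the shared AtomicLong simulates the acknowledgement state that a real source would hold.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a checkpoint mark; immutable after construction. */
final class OffsetCheckpoint {
  private final long lastReadOffset;
  private final AtomicLong committedOffset; // simulated acknowledgement state

  OffsetCheckpoint(long lastReadOffset, AtomicLong committedOffset) {
    this.lastReadOffset = lastReadOffset;
    this.committedOffset = committedOffset;
  }

  long getLastReadOffset() {
    return lastReadOffset;
  }

  /** Invoked once the bundle read up to this mark has been durably committed. */
  void finalizeCheckpoint() {
    // Never move the committed position backwards if marks finalize out of order.
    committedOffset.accumulateAndGet(lastReadOffset, Math::max);
  }
}

/** Hypothetical reader that tracks a numeric offset into its input. */
class OffsetReaderSketch {
  private final AtomicLong committedOffset = new AtomicLong(-1);
  private long offset = -1; // offset of the most recently read record

  boolean advance() {
    offset++;
    return true;
  }

  /** Snapshots the current position; later reads do not alter the returned mark. */
  OffsetCheckpoint getCheckpointMark() {
    return new OffsetCheckpoint(offset, committedOffset);
  }

  long committedOffset() {
    return committedOffset.get();
  }
}
```

Because the mark copies the offset at creation time, records read after getCheckpointMark() belong to the next bundle and are not acknowledged when this mark is finalized.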
-
getSplitBacklogBytes
public long getSplitBacklogBytes()
Returns the size of the backlog of unread data in the underlying data source represented by this split of this source.
One of this or getTotalBacklogBytes() should be overridden in order to allow the runner to scale the amount of resources allocated to the pipeline.
-
getTotalBacklogBytes
public long getTotalBacklogBytes()
Returns the size of the backlog of unread data in the underlying data source represented by all splits of this source.
One of this or getSplitBacklogBytes() should be overridden in order to allow the runner to scale the amount of resources allocated to the pipeline.
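The override pattern for the two backlog methods might look like the sketch below, in which a reader reports its per-split backlog and leaves the total unknown. Everything in it is a stand-in: BaseReader mimics the two methods under the assumption that they report BACKLOG_UNKNOWN when not overridden, the constant's value of -1 is a placeholder chosen for this example (see Constant Field Values for the SDK's actual value), and LogFileReader with its byte counters is hypothetical.

```java
class BacklogSketch {
  /** Placeholder value for this sketch only; not taken from the SDK. */
  static final long BACKLOG_UNKNOWN = -1L;

  /** Stand-in for the reader base class: both backlog sizes unknown by default (assumed). */
  static class BaseReader {
    long getSplitBacklogBytes() {
      return BACKLOG_UNKNOWN;
    }

    long getTotalBacklogBytes() {
      return BACKLOG_UNKNOWN;
    }
  }

  /** Hypothetical reader over one split whose written and consumed byte counts are known. */
  static class LogFileReader extends BaseReader {
    private final long bytesWritten; // bytes present in this split of the source
    private long bytesRead;          // bytes consumed so far by this reader

    LogFileReader(long bytesWritten) {
      this.bytesWritten = bytesWritten;
    }

    void consume(long bytes) {
      bytesRead += bytes;
    }

    /** Overrides only the per-split method; the total stays unknown. */
    @Override
    long getSplitBacklogBytes() {
      return Math.max(0L, bytesWritten - bytesRead);
    }
  }
}
```

Overriding just one of the two methods matches the guidance above; which one to choose depends on whether the underlying source can report backlog per split or only in aggregate.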
-
getCurrentSource
public abstract UnboundedSource<OutputT,?> getCurrentSource()
Returns the UnboundedSource that created this reader. This will not change over the life of the reader.
- Specified by: getCurrentSource in class Source.Reader<OutputT>