UnboundedSource (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1


Class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>

  • Type Parameters:
    OutputT - Type of records output by this source.
    CheckpointMarkT - Type of checkpoint marks used by the readers of this source.
    All Implemented Interfaces:
    HasDisplayData, Serializable

    public abstract class UnboundedSource<OutputT,CheckpointMarkT extends UnboundedSource.CheckpointMark>
    extends Source<OutputT>
    A Source that reads an unbounded amount of input and, because of that, supports some additional operations such as checkpointing, watermarks, and record ids.
    • Checkpointing allows sources to not re-read the same data again in the case of failures.
    • Watermarks allow for downstream parts of the pipeline to know up to what point in time the data is complete.
    • Record ids allow for efficient deduplication of input records; many streaming sources do not guarantee that a given record will only be read a single time.

    See Window and Trigger for more information on timestamps and watermarks.

    See Also:
    Serialized Form
    • Constructor Detail

      • UnboundedSource

        public UnboundedSource()
    • Method Detail

      • generateInitialSplits

        public abstract List<? extends UnboundedSource<OutputT,CheckpointMarkT>> generateInitialSplits(int desiredNumSplits,
                                                                                                       PipelineOptions options)
                                                                                                throws Exception
        Returns a list of UnboundedSource objects representing the instances of this source that should be used when executing the workflow. Each split should return a separate partition of the input data.

        For example, for a source reading from a growing directory of files, each split could correspond to a prefix of file names.

        Some sources are not splittable, such as reading from a single TCP stream. In that case, only a single split should be returned.

        Some data sources automatically partition their data among readers. For these types of inputs, n identical replicas of the top-level source can be returned.

        The size of the returned list should be as close to desiredNumSplits as possible, but does not have to match exactly. A low number of splits will limit the amount of parallelism in the source.

      • getCheckpointMarkCoder

        public abstract Coder<CheckpointMarkT> getCheckpointMarkCoder()
        Returns a Coder for encoding and decoding the checkpoints for this source, or null if the checkpoints do not need to be durably committed.
      • requiresDeduping

        public boolean requiresDeduping()
        Returns whether this source requires explicit deduping.

        This is needed if the underlying data source can return the same record multiple times, such a queuing system with a pull-ack model. Sources where the records read are uniquely identified by the persisted state in the CheckpointMark do not need this.