BlockBasedSource (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1

com.google.cloud.dataflow.sdk.io

Class BlockBasedSource<T>

  • Type Parameters:
        T - The type of records to be read from the source.
    All Implemented Interfaces:
        HasDisplayData, Serializable
    Direct Known Subclasses:
        AvroSource


    @Experimental(value=SOURCE_SINK)
    public abstract class BlockBasedSource<T>
    extends FileBasedSource<T>

    A BlockBasedSource is a FileBasedSource where a file consists of blocks of records.

    Extend BlockBasedSource when a file format does not support efficient seeking to an individual record in the file but does support efficient seeking to a block. It also applies when records in the file cannot be offset-addressed but blocks can: it is not possible to say that record i starts at offset m, but it is possible to say that block j starts at offset n.

    The records that will be read from a BlockBasedSource that corresponds to a subrange of a file [startOffset, endOffset) are those records such that the record is contained in a block that starts at offset i, where i >= startOffset and i < endOffset. In other words, a record will be read from the source if its first byte is contained in a block that begins within the range described by the source.
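
    The range rule above can be sketched in plain Java. This is only an illustration, not SDK code: the class BlockRangeSketch, the method blocksInRange, and the sample offsets are hypothetical names and values chosen for the example, and the block start offsets are assumed to be known up front.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Dataflow SDK: it illustrates the range rule only.
final class BlockRangeSketch {

  /** Returns the start offsets of the blocks owned by the subrange [startOffset, endOffset). */
  static List<Long> blocksInRange(long[] blockStarts, long startOffset, long endOffset) {
    List<Long> owned = new ArrayList<>();
    for (long blockStart : blockStarts) {
      // A block belongs to the source if and only if it starts inside the range,
      // even when its bytes extend past endOffset.
      if (blockStart >= startOffset && blockStart < endOffset) {
        owned.add(blockStart);
      }
    }
    return owned;
  }

  public static void main(String[] args) {
    long[] blockStarts = {0L, 100L, 250L, 400L}; // assumed sample layout
    System.out.println(blocksInRange(blockStarts, 0L, 200L));   // prints [0, 100]
    System.out.println(blocksInRange(blockStarts, 200L, 500L)); // prints [250, 400]
  }
}
```

    In this sample layout, the block starting at offset 100 is read entirely by the source for [0, 200), even if the block's bytes run past offset 200; every record in that block goes with it.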

    This definition requires that the start offsets of all blocks in a file can be determined.

    Progress reporting for reading from a BlockBasedSource is inaccurate. A BlockBasedSource.BlockBasedReader reports its current offset as (offset of current block) + (current block size) * (fraction of block consumed). However, only the offset of the current block is required to be accurately reported by subclass implementations. As such, in the worst case, the current offset is only updated at block boundaries.
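
    As a rough sketch of the arithmetic described above, the following standalone Java snippet computes the reported offset from the three quantities named in the text. The class BlockProgressSketch and the method approximateOffset are hypothetical and do not reproduce the SDK's implementation; truncating the product to a long is an assumption made for the example.

```java
// Hypothetical helper, not part of the Dataflow SDK: it illustrates the offset formula only.
final class BlockProgressSketch {

  /**
   * Approximates the current offset as
   * (offset of current block) + (current block size) * (fraction of block consumed).
   */
  static long approximateOffset(long blockOffset, long blockSize, double fractionConsumed) {
    // Assumption for this sketch: the fractional part of the product is truncated.
    return blockOffset + (long) (blockSize * fractionConsumed);
  }

  public static void main(String[] args) {
    // A block of 400 bytes starting at offset 1000, one quarter consumed.
    System.out.println(approximateOffset(1000L, 400L, 0.25)); // prints 1100
    // Worst case: the fraction is never updated, so only the block offset is reported.
    System.out.println(approximateOffset(1000L, 400L, 0.0));  // prints 1000
  }
}
```

    The second call shows the worst case mentioned above: when a subclass reports only the block offset accurately, the value changes only when the reader moves to the next block.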

    BlockBasedSource supports dynamic splitting. However, because records in a BlockBasedSource are not required to have offsets and progress reporting is inaccurate, BlockBasedReader only supports splitting at block boundaries. In other words, BlockBasedSource.BlockBasedReader.atSplitPoint returns true iff the current record is the first record in a block. See FileBasedSource.FileBasedReader for discussion about split points.
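
    The split-point rule can be illustrated with a small simulation. This is a sketch under stated assumptions rather than SDK code: SplitPointSketch and splitPoints are hypothetical names, and blocks are modeled as in-memory lists of records instead of being read from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation, not part of the Dataflow SDK: it illustrates the split-point rule only.
final class SplitPointSketch {

  /** Returns, for each record in read order, whether a reader would be at a split point. */
  static List<Boolean> splitPoints(List<List<String>> blocks) {
    List<Boolean> flags = new ArrayList<>();
    for (List<String> block : blocks) {
      for (int i = 0; i < block.size(); i++) {
        // Only the first record of a block is a split point.
        flags.add(i == 0);
      }
    }
    return flags;
  }

  public static void main(String[] args) {
    List<List<String>> blocks = Arrays.asList(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("d"),
        Arrays.asList("e", "f"));
    // prints [true, false, false, true, true, false]
    System.out.println(splitPoints(blocks));
  }
}
```

    Because a split can only land on a record flagged true, a file with a few large blocks offers correspondingly few places where the work can be divided.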

    See Also:
    Serialized Form

