PCollection (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1


Class PCollection<T>

  • Type Parameters:
    T - the type of the elements of this PCollection
    All Implemented Interfaces:
    PInput, POutput, PValue

    public class PCollection<T>
    extends TypedPValue<T>
    A PCollection<T> is an immutable collection of values of type T. A PCollection can contain either a bounded or unbounded number of elements. Bounded and unbounded PCollections are produced as the output of PTransforms (including root PTransforms like Read and Create), and can be passed as the inputs of other PTransforms.

    Some root transforms produce bounded PCollections and others produce unbounded ones. For example, TextIO.Read reads a static set of files, so it produces a bounded PCollection. PubsubIO.Read, on the other hand, receives a potentially infinite stream of Pubsub messages, so it produces an unbounded PCollection.

    Each element in a PCollection may have an associated implicit timestamp. Readers assign timestamps to elements when they create PCollections, and other PTransforms propagate these timestamps from their input to their output. For example, PubsubIO.Read assigns pubsub message timestamps to elements, and TextIO.Read assigns the default value BoundedWindow.TIMESTAMP_MIN_VALUE to elements. User code can explicitly assign timestamps to elements with DoFn.Context.outputWithTimestamp(OutputT, org.joda.time.Instant).

    Additionally, a PCollection has an associated WindowFn and each element is assigned to a set of windows. By default, the windowing function is GlobalWindows and all elements are assigned into a single default window. This default can be overridden with the Window PTransform.

    See the individual PTransform subclasses for specific information on how they propagate timestamps and windowing.