A PCollection (or collection) represents a potentially distributed, multi- element dataset that acts as a pipeline's data. Pipeline transforms use collections as inputs and outputs for each step in your pipeline. A collection can hold a bounded dataset of a fixed size, or hold an unbounded dataset from a continuously updating data source such as Pub/Sub.
For a bounded collection, grouping operations group all of the elements that have the same key within the entire collection. However, with an unbounded collection, it is impossible to collect all of the elements; the continuously updating data source constantly adds new elements, and there might be infinitely many elements (this is often referred to as streaming data).
Windows and windowing functions
Windowing enables grouping over unbounded collections by dividing the collection into windows according to the timestamps of the individual elements. Each window contains a finite number of elements. Grouping operations work implicitly on a per-window basis; grouping operations process each collection as a succession of multiple, finite windows, though the entire collection might be of unbounded size.
A windowing function specifies how to assign elements to an initial window, and how to merge windows of grouped elements. There are three supported windowing functions:
- Tumbling windows (called fixed windows in Apache Beam)
- Hopping windows (called sliding windows in Apache Beam)
- Session windows
The simplest form of windowing is tumbling windowing. A tumbling window represents a consistent duration, non overlapping time interval in the data stream. For example, if your windows are set to a five-minute duration: all elements in your unbounded collection with timestamp values from 0:00:00 up to (but not including) 0:05:00 belong to the first window, elements with timestamp values from 0:05:00 up to (but not including) 0:10:00 belong to the second window, and so on.
Figure 1: Tumbling windows, 30 seconds in duration.
Hopping windowing also represents time intervals in the data stream; however, hopping windows can overlap. For example, each window might capture five minutes worth of data, but a new window starts every ten seconds. The frequency with which hopping windows begin is called the period. Therefore, our example would have a window duration of five minutes and a period of ten seconds.
Because multiple windows overlap, most elements in a dataset belong to more than one window. Hopping windowing is useful for taking running averages of data; in our example, you can compute a running average of the past minutes' worth of data, updated every thirty seconds.
Figure 2: Hopping windows with 1 minute window duration and 30 second window period.
Session windows are windows that contain elements that are within a certain gap duration of another element. Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. For example, a data stream representing user mouse activity might have long periods of idle time interspersed with high concentrations of clicks. If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Figure 3: Session windows with a minimum gap duration. Each data key has different windows, according to its data distribution.
Watermarks are the notion of when the system expects that all data in a certain window has arrived in the pipeline. Dataflow tracks watermarks because data is not guaranteed to arrive in time order or at predictable intervals. In addition, there are no guarantees that data events appear in the pipeline in the same order that they were generated. After the watermark progresses past the end of a window, any further elements that arrive with a timestamp in that window are considered late data.
Triggers determine when to emit aggregated results as data arrives. For bounded data, results are emitted after all of the input has been processed. For unbounded data, results are emitted when the watermark passes the end of the window, which indicates that the system believes all input data for that window has been processed.