Dataflow streaming job constantly increasing disk usage

Problem

A Dataflow job is running in streaming mode, and disk utilization on the worker VMs increases steadily. After several hours, the job may stall completely because the workers run out of free disk space.

Environment

Solution

Note: while the examples below use Java syntax, the same concepts apply to Python and other SDKs.

Using .discardingFiredPanes().withAllowedLateness()

Dataflow discards the buffered data after the trigger has fired (usually when the watermark passes the end of the window). It still keeps a record of the window itself in case late data arrives for that window during the allowed lateness period. This metadata takes up little space, so it should not cause problems unless many thousands of windows are retained for an excessively long time and end up using more memory than the actual in-flight data.
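A minimal windowing sketch of this mode is shown below. The window size, lateness value, and the input PCollection named events are illustrative assumptions, not values from the original job.

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assumption: events is an unbounded PCollection<String> read from the streaming source.
PCollection<String> windowed = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // Re-fire whenever late elements arrive within the lateness period.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Accept late data for up to one hour after the window closes.
        .withAllowedLateness(Duration.standardHours(1))
        // Drop pane contents once they have been fired; only lightweight
        // window metadata is retained during the lateness period.
        .discardingFiredPanes());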

Using .accumulatingFiredPanes().withAllowedLateness()

Dataflow keeps all the data in a window even after the trigger has fired, so a significant amount of data can accumulate on the worker VMs during the allowed lateness period.

For example, if the allowed lateness is set to 24 hours, you could end up with 24 hours' worth of data (or even more) accumulating on the disks of your worker VMs, even after that data has been processed and written to its final destination.
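A sketch of this mode follows, again with illustrative window size, lateness value, and an assumed input PCollection named events. The only difference from the previous example is the accumulation mode and the longer lateness.

import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assumption: events is an unbounded PCollection<String> read from the streaming source.
PCollection<String> windowed = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            // Re-fire whenever late elements arrive within the lateness period.
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Keep each window (and its buffered elements) around for 24 hours.
        .withAllowedLateness(Duration.standardHours(24))
        // Every pane re-emits all elements seen so far, so the full contents
        // of each window stay on worker storage until the window expires.
        .accumulatingFiredPanes());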

The best solution is either to reduce the allowed lateness or, if possible, to discard fired panes instead.

Cause

Dataflow jobs usually work out of the box without special attention to resources such as disk space or memory consumption.

However, when using accumulating fired panes with allowed lateness, careful planning is needed to ensure that the accumulated data does not exceed the disk space available on the workers.