Use Streaming Engine for streaming jobs

Dataflow's Streaming Engine moves pipeline execution out of the worker virtual machines (VMs) and into the Dataflow service backend. When you don't use Streaming Engine for streaming jobs, the Dataflow runner executes the steps of your streaming pipeline entirely on worker VMs, consuming worker CPU, memory, and Persistent Disk storage.

Streaming Engine is enabled by default for the following pipelines:

  • Streaming pipelines that use the Apache Beam Python SDK version 2.21.0 or later and Python 3.
  • Streaming pipelines that use the Apache Beam Go SDK version 2.33.0 or later.

To learn more about the implementation of Streaming Engine, see Streaming Engine: Execution Model for Highly-Scalable, Low-Latency Data Processing.

Benefits

The Streaming Engine model has the following benefits:

  • Reduced CPU, memory, and Persistent Disk storage resource usage on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4). It doesn't require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.
  • More responsive Horizontal Autoscaling in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.
  • Improved supportability, because you don't need to redeploy your pipelines to apply service updates.

Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge associated with the use of Streaming Engine.

Support and limitations

  • For the Java SDK, Streaming Engine requires the Apache Beam SDK version 2.10.0 or later.
  • For the Python SDK, Streaming Engine requires the Apache Beam SDK version 2.16.0 or later.
  • For the Go SDK, Streaming Engine requires the Apache Beam SDK version 2.33.0 or later.
  • You can't update pipelines that are already running to use Streaming Engine. If your pipeline is running in production without Streaming Engine and you want to use Streaming Engine, stop your pipeline by using the Dataflow Drain option. Then, specify the Streaming Engine parameter, and rerun your pipeline.
  • For jobs that use Streaming Engine, the aggregated input data for the open windows has a limit of 60 GB per key. Aggregated input data includes both buffered elements and custom state. When a pipeline exceeds this limit, the pipeline becomes stuck with high system lag, but no error appears in the external log files. As a best practice, avoid pipeline designs that result in large keys. For more information, see Writing Dataflow pipelines with scalability in mind.

Use Streaming Engine

This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.

Java

Streaming Engine requires the Apache Beam Java SDK version 2.10.0 or later.

To use Streaming Engine for your streaming pipelines, specify the following parameter:

  • --enableStreamingEngine if you're using the Apache Beam SDK for Java version 2.11.0 or later.
  • --experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.

If you use Dataflow Streaming Engine for your pipeline, don't specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types. Use the job type to determine whether to use a high memory worker machine type. Example machine types that we recommend include --workerMachineType=n1-standard-2 and --workerMachineType=n1-highmem-2. You can also set --diskSizeGb=30, because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.
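Putting these flags together, the following is a sketch of the service options for a Java streaming job with Streaming Engine enabled. The project ID and region are placeholders; only the flag names come from this page, and how you pass the arguments depends on how you launch your pipeline (for example, through `mvn exec:java -Dexec.args=…`).

```shell
# Sketch: Dataflow service options for a Java streaming job with
# Streaming Engine enabled. my-project and us-central1 are placeholders.
JAVA_PIPELINE_ARGS="--runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --enableStreamingEngine \
  --workerMachineType=n1-standard-2 \
  --diskSizeGb=30"

# The string above would typically be passed to the pipeline's main class;
# it is only printed here, not executed.
echo "$JAVA_PIPELINE_ARGS"
```

Note that the options specify `--region` but not `--zone`, so Dataflow selects the zone within the region.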

Python

Streaming Engine requires the Apache Beam Python SDK version 2.16.0 or later.

Streaming Engine is enabled by default for new Dataflow streaming pipelines that use the Apache Beam Python SDK version 2.21.0 or later with Python 3.

If you want to disable Streaming Engine in your Python streaming pipeline, specify the following parameter. You must specify this parameter every time that you want to disable Streaming Engine.

--experiments=disable_streaming_engine

If you use Python 2, to enable Streaming Engine, specify the following parameter:

--enable_streaming_engine

If you use Dataflow Streaming Engine in your pipeline, don't specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types. Use the job type to determine whether to use a high memory worker machine type. Example machine types that we recommend include --machine_type=n1-standard-2 and --machine_type=n1-highmem-2. You can also set --disk_size_gb=30, because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.
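As an illustration, the following sketch shows launch arguments for a Python streaming job. Because Streaming Engine is on by default for supported SDK versions, the base arguments need no Streaming Engine flag; the second variable opts out with the experiment flag, which must be repeated on every launch. The project ID and region are placeholders.

```shell
# Sketch: launch arguments for a Python streaming job. my-project and
# us-central1 are placeholders; Streaming Engine is enabled by default.
PY_DEFAULT_ARGS="--runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --streaming \
  --machine_type=n1-standard-2 \
  --disk_size_gb=30"

# Opting out of Streaming Engine must be repeated on every launch.
PY_OPT_OUT_ARGS="$PY_DEFAULT_ARGS --experiments=disable_streaming_engine"

# These strings would typically be passed to the pipeline script, for
# example: python my_streaming_pipeline.py $PY_OPT_OUT_ARGS (not run here).
echo "$PY_OPT_OUT_ARGS"
```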

Go

Streaming Engine requires the Apache Beam Go SDK version 2.33.0 or later.

Streaming Engine is enabled by default for new Dataflow streaming pipelines that use the Apache Beam Go SDK.

If you want to disable Streaming Engine in your Go streaming pipeline, specify the following parameter. You must specify this parameter every time that you want to disable Streaming Engine.

--experiments=disable_streaming_engine

If you use Dataflow Streaming Engine in your pipeline, don't specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller worker machine types. Use the job type to determine whether to use a high memory worker machine type. Example machine types that we recommend include --worker_machine_type=n1-standard-2 and --worker_machine_type=n1-highmem-2. You can also set --disk_size_gb=30, because Streaming Engine only needs space for the worker boot image and local logs. These values are the default values.