Use Streaming Engine for streaming jobs


Currently, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.

To learn more about the implementation of Streaming Engine, see Streaming Engine: Execution Model for Highly-Scalable, Low-Latency Data Processing.

Benefits

The Streaming Engine model has the following benefits:

  • A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.
  • More responsive Horizontal Autoscaling in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.
  • Improved supportability, since you don't need to redeploy your pipelines to apply service updates.

Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge associated with the use of Streaming Engine.

Using Streaming Engine

This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.

Note: Updating a pipeline that's already running to use Streaming Engine is currently not supported.

If your pipeline is already running in production and you would like to use Streaming Engine, you need to stop your pipeline using the Dataflow Drain option. Then, specify the Streaming Engine parameter and rerun your pipeline.
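For example, you can drain the running job with the Google Cloud CLI before resubmitting the pipeline with Streaming Engine enabled. In the following sketch, the job ID and region are placeholders:

    gcloud dataflow jobs drain JOB_ID --region=REGION

After the drain completes, resubmit the pipeline with the Streaming Engine parameter described in the following sections.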

Java

To use Streaming Engine for your streaming pipelines, specify the following parameter:

  • --enableStreamingEngine if you're using Apache Beam SDK for Java versions 2.11.0 or later.
  • --experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.

If you use Dataflow Streaming Engine for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller core worker machine types; use the job type to determine whether you need a high-memory machine type. Example machine types that we recommend include --workerMachineType=n1-standard-2 and --workerMachineType=n1-highmem-2. You can also set --diskSizeGb=30 because Streaming Engine only needs space for the worker boot image and local logs. These values are the defaults.
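As an illustration, the following minimal sketch reads these parameters from the command line by using the standard Dataflow pipeline options; the class name is a placeholder and the pipeline body is omitted.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class StreamingEngineExample {
      public static void main(String[] args) {
        // Parses flags such as --enableStreamingEngine, --region,
        // --workerMachineType, and --diskSizeGb from the command line.
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        // Apply your streaming transforms here, then run the pipeline.
        pipeline.run();
      }
    }

When you launch the job, pass the parameters on the command line, for example --runner=DataflowRunner --enableStreamingEngine --region=us-central1 --workerMachineType=n1-standard-2 --diskSizeGb=30 (the region shown here is only an example).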

Python

Streaming Engine is enabled by default for new Dataflow streaming pipelines when the following conditions are met:

  • Your pipeline uses Apache Beam SDK for Python version 2.21.0 or later.
  • Your pipeline uses Python 3.

If you would like to disable Streaming Engine in your Python streaming pipeline, specify the following parameter:

--experiments=disable_streaming_engine

If you use Python 2, you must still enable Streaming Engine by specifying the following parameter:

--enable_streaming_engine

Caution: On October 7, 2020, Dataflow will stop supporting pipelines that use Python 2. For more information, see the Python 2 support on Google Cloud page.

If you use Dataflow Streaming Engine in your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.

Streaming Engine works best with smaller core worker machine types; use the job type to determine whether you need a high-memory machine type. Example machine types that we recommend include --machine_type=n1-standard-2 and --machine_type=n1-highmem-2. You can also set --disk_size_gb=30 because Streaming Engine only needs space for the worker boot image and local logs. These values are the defaults.
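As an illustration, the following minimal sketch sets these parameters programmatically instead of on the command line; the project ID and region are placeholders, and the pipeline body is omitted.

    from apache_beam import Pipeline
    from apache_beam.options.pipeline_options import PipelineOptions

    # These flags mirror the command-line parameters described above.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=PROJECT_ID',
        '--region=us-central1',  # example region
        '--streaming',
        '--machine_type=n1-standard-2',
        '--disk_size_gb=30',
        # To opt out of Streaming Engine, add:
        # '--experiments=disable_streaming_engine',
    ])

    pipeline = Pipeline(options=options)
    # Apply your streaming transforms here, then run the pipeline.
    pipeline.run()

Because Streaming Engine is enabled by default when the conditions listed earlier are met, you don't need to pass an explicit enable flag for Python 3 pipelines.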

Go

Streaming Engine is currently not supported for the Go SDK.