Currently, the Dataflow pipeline runner executes the steps of your streaming pipeline entirely on worker virtual machines, consuming worker CPU, memory, and Persistent Disk storage. Dataflow's Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend.
To learn more about the implementation of Streaming Engine, see Streaming Engine: Execution Model for Highly-Scalable, Low-Latency Data Processing.
Benefits
The Streaming Engine model has the following benefits:
- A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs. Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk, leading to less resource and quota consumption.
- More responsive Horizontal Autoscaling in response to variations in incoming data volume. Streaming Engine offers smoother, more granular scaling of workers.
- Improved supportability, since you don't need to redeploy your pipelines to apply service updates.
Most of the reduction in worker resources comes from offloading the work to the Dataflow service. For that reason, there is a charge associated with the use of Streaming Engine.
Using Streaming Engine
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations.
Note: Updating a pipeline that's already running to use Streaming Engine is currently not supported.
If your pipeline is already running in production and you would like to use Streaming Engine, you need to stop your pipeline using the Dataflow Drain option. Then, specify the Streaming Engine parameter and rerun your pipeline.
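The drain-then-rerun sequence can be sketched as follows. This is a minimal, stdlib-only sketch that only assembles the commands; the job ID, region, and launch arguments are placeholders, not real resources:

```python
# Sketch: stop a running pipeline with Drain, then relaunch it with
# Streaming Engine enabled. JOB_ID and REGION are hypothetical values.
JOB_ID = "2020-01-01_00_00_00-1234567890"  # placeholder job ID
REGION = "us-central1"                     # placeholder region

# gcloud command that requests a graceful Drain of the running job.
drain_cmd = ["gcloud", "dataflow", "jobs", "drain", JOB_ID,
             f"--region={REGION}"]

# After the drain completes, relaunch the pipeline with the Streaming
# Engine flag added to the original launch arguments (Python SDK flag
# shown; the Java SDK uses --enableStreamingEngine instead).
relaunch_args = [
    "--runner=DataflowRunner",
    f"--region={REGION}",
    "--enable_streaming_engine",
]

print(" ".join(drain_cmd))
```

Actually running the drain command requires the gcloud CLI and permission to modify the job; the sketch only shows the shape of the two steps.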
Java
To use Streaming Engine for your streaming pipelines, specify the following parameter:
- --enableStreamingEngine if you're using Apache Beam SDK for Java versions 2.11.0 or later.
- --experiments=enable_streaming_engine if you're using Apache Beam SDK for Java version 2.10.0.
If you use Dataflow Streaming Engine for your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.
Streaming Engine works best with smaller core worker machine types. Use the job type to determine whether to use a high-memory worker machine type. Example machine types that we recommend include --workerMachineType=n1-standard-2 and --workerMachineType=n1-highmem-2. You can also set --diskSizeGb=30, because Streaming Engine only needs space for the worker boot image and local logs. These values are the defaults when Streaming Engine is enabled.
Python
Streaming Engine is enabled by default for new Dataflow streaming pipelines when the following conditions are met:
- Pipelines are developed against Apache Beam Python SDK version 2.21.0 or later using Python 3.
- Customer-managed encryption keys are not used.
- Dataflow workers and the regional endpoint for your Dataflow job are located in the same region.
If you would like to disable Streaming Engine in your Python streaming pipeline, specify the following parameter:
--experiments=disable_streaming_engine
If you use Python 2, Streaming Engine is not enabled by default; to enable it, specify the following parameter:
--enable_streaming_engine
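For a Python pipeline, these are ordinary pipeline options passed at launch. The sketch below assembles a hypothetical argument list without importing Apache Beam; the project, region, and bucket names are placeholders:

```python
# Sketch: assembling launch arguments for a Python streaming pipeline.
# All resource names below are placeholders, not real resources.
pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-project",                 # hypothetical project
    "--region=us-central1",                 # hypothetical region
    "--temp_location=gs://my-bucket/temp",  # hypothetical bucket
    "--streaming",
]

# On Python 3 with SDK 2.21.0+, Streaming Engine is on by default when
# the conditions above are met; adding the flag explicitly is harmless,
# and it is required on Python 2.
pipeline_args.append("--enable_streaming_engine")

# To opt out instead, you would append:
#   "--experiments=disable_streaming_engine"
```

The assembled list is what you would pass to your pipeline's entry point (for example, as sys.argv arguments to a Beam pipeline).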
Caution: On October 7, 2020, Dataflow will stop supporting pipelines using Python 2. Read more information on the Python 2 support on Google Cloud page.
If you use Dataflow Streaming Engine in your pipeline, do not specify the --zone parameter. Instead, specify the --region parameter and set the value to one of the regions where Streaming Engine is currently available. Dataflow auto-selects the zone in the region you specified. If you do specify the --zone parameter and set it to a zone outside of the available regions, Dataflow reports an error.
Streaming Engine works best with smaller core worker machine types. Use the job type to determine whether to use a high-memory worker machine type. Example machine types that we recommend include --worker_machine_type=n1-standard-2 and --worker_machine_type=n1-highmem-2. You can also set --disk_size_gb=30, because Streaming Engine only needs space for the worker boot image and local logs. These values are the defaults when Streaming Engine is enabled.
Go
Streaming Engine is currently not supported for the Go SDK.