Dataflow Prime is a serverless data processing platform for Apache Beam pipelines. Based on Dataflow, Dataflow Prime uses a compute and state-separated architecture and includes many new features. Pipelines using Dataflow Prime benefit from automated and optimized resource management, reduced operational costs, and diagnostics capabilities.
Dataflow Prime uses the Dataflow Runner V2 in its pipelines.
SDK version support
Dataflow Prime supports the following Apache Beam SDKs:
Apache Beam Python SDK version 2.21.0 or later
Apache Beam Java SDK version 2.30.0 or later
To download the SDK package or to read the Release Notes, see Apache Beam Downloads.
Dataflow Prime features
The following is the list of supported Dataflow Prime features for different kinds of pipelines:
- Vertical Autoscaling (memory). Applies to streaming pipelines in Python.
- Right Fitting (Dataflow Prime resource hints). Applies to batch pipelines in Python and Java.
- Job Visualizer. Applies to batch pipelines in Python and Java.
- Smart Recommendations. Applies to both streaming and batch pipelines in Python and Java.
- Data Pipelines. Applies to both streaming and batch pipelines in Python and Java.
The features Job Visualizer, Smart Recommendations, and Data Pipelines are also supported for non-Dataflow Prime jobs.
Vertical Autoscaling automatically adjusts the memory available to the Dataflow worker VMs to fit the needs of the pipeline and help prevent out-of-memory errors. In Dataflow Prime, Vertical Autoscaling works alongside Horizontal Autoscaling to scale resources dynamically.
For more information, see Vertical Autoscaling.
Right Fitting uses resource hints, a new feature of Apache Beam that lets you specify resource requirements either for the entire pipeline or for specific steps of the pipeline. With resource hints, you can create customized workers for different steps of a pipeline. Right Fitting lets you specify pipeline resources to maximize efficiency, lower operational cost, and avoid out-of-memory and other resource errors.
For Preview, consider the following about Right Fitting:
It supports memory and GPU resource hints.
It requires Apache Beam 2.30.0 or later.
For more information, see Configuring Dataflow Prime Right Fitting.
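As a rough illustration of pipeline-level resource hints, the sketch below builds the command-line arguments you might pass when launching a Python pipeline. It assumes the Apache Beam Python SDK accepts a repeatable `--resource_hints` option with `min_ram` (memory) and `accelerator` (GPU) hint names; those names are not stated on this page, and `resource_hint_flags` is a hypothetical helper, not part of any SDK.

```python
def resource_hint_flags(min_ram=None, accelerator=None):
    """Build pipeline-level resource hint arguments.

    Assumption: the flag name (--resource_hints) and the hint names
    (min_ram, accelerator) follow the Apache Beam Python SDK; verify them
    against the Right Fitting documentation before relying on them.
    """
    flags = []
    if min_ram is not None:
        # Memory hint, for example "16GB".
        flags.append(f"--resource_hints=min_ram={min_ram}")
    if accelerator is not None:
        # GPU hint; the value is passed through unchanged.
        flags.append(f"--resource_hints=accelerator={accelerator}")
    return flags


if __name__ == "__main__":
    # Example: request at least 16 GB of worker memory for the whole pipeline.
    print(" ".join(resource_hint_flags(min_ram="16GB")))
```

Hints for individual steps are set in the pipeline code instead of on the command line. The helper only covers the pipeline-wide case.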
Job Visualizer lets you see the performance of a Dataflow job and optimize it by finding inefficient code, including parallelization bottlenecks. In the Cloud Console, you can click any Dataflow job on the Jobs page to view details about the job. You can also see the list of steps associated with each stage of the pipeline.
For more information, see Execution details.
Smart Recommendations lets you optimize and troubleshoot the pipeline based on the recommendations provided in the Diagnostics tab of a job's details page. In the Cloud Console, you can click any Dataflow job on the Jobs page to view details about the job.
For more information, see Recommendations and diagnostics.
Data Pipelines lets you schedule jobs, observe resource utilization, track data freshness objectives for streaming data, and optimize pipelines.
For more information, see Working with Data Pipelines.
Quota and limit requirements
Quotas and limits are the same for Dataflow and Dataflow Prime. For more information, see Quotas and limits.
If you opt for Data Pipelines, there are additional implications for quotas and regional endpoints.
Before using Dataflow Prime
To use Dataflow Prime, you can reuse your existing pipeline code and enable the Dataflow Prime experimental option, either through Cloud Shell or programmatically.
Dataflow Prime is backward compatible with batch jobs that use Dataflow Shuffle and streaming jobs that use Streaming Engine. However, we recommend testing your pipelines with Dataflow Prime before you use them in a production environment.
If your streaming pipeline is running in production, to use Dataflow Prime, perform the following steps:
Enabling Dataflow Prime
To enable Dataflow Prime for a pipeline:
Enable the Cloud Autoscaling API.
Dataflow Prime uses the Cloud Autoscaling API to dynamically adjust memory.
Enable Prime in your pipeline options.
Apache Beam Python SDK version 2.29.0 or later:
Apache Beam Python SDK version 2.21.0 to 2.28.0:
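The version split above can be sketched as a small helper that picks the enabling flag for a given Apache Beam Python SDK version. The version boundaries (2.21.0 and 2.29.0) come from the text. The flag values `--dataflow_service_options=enable_prime` and `--experiments=enable_prime` are assumptions based on the public Dataflow documentation, not on this page, and `prime_flag` is a hypothetical helper name.

```python
def parse_version(version):
    """Turn a plain numeric version such as '2.29.0' into a tuple (2, 29, 0).

    Assumes versions without suffixes such as 'rc1'; missing parts count as 0.
    """
    parts = [int(part) for part in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def prime_flag(sdk_version):
    """Return the flag assumed to enable Dataflow Prime for a Python SDK version."""
    version = parse_version(sdk_version)
    if version >= (2, 29, 0):
        # Assumed flag for Python SDK 2.29.0 or later.
        return "--dataflow_service_options=enable_prime"
    if version >= (2, 21, 0):
        # Assumed flag for Python SDK 2.21.0 to 2.28.0.
        return "--experiments=enable_prime"
    raise ValueError(
        f"Apache Beam Python SDK {sdk_version} is older than 2.21.0, "
        "the minimum version that supports Dataflow Prime"
    )


if __name__ == "__main__":
    for v in ("2.21.0", "2.28.0", "2.29.0"):
        print(v, prime_flag(v))
```

Append the returned flag to the arguments you already pass when launching the pipeline. The rest of the pipeline code stays unchanged.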
Using Dataflow Prime with templates
If you are using Dataflow templates, you can choose to enable Dataflow Prime in one of the following ways:
Go to the Create job from template page.
In the Additional experiment field, enter
- Run the pipeline code with the --experiments flag set to
- In the pipeline code, set the
Dataflow Prime notes
Dataflow Prime does not support the following:
Resource hints for cross-language transforms. For more information about this limitation, see the Apache Beam documentation.
Designating specific VM types by using the flag --machine_type for Python pipelines and --workerMachineType for Java pipelines.
Viewing or using SSH to log into worker VMs.
OrderedListState for Java pipelines.
Custom window types.
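Because the machine-type flags are among the unsupported options listed above, one way to catch them early is a pre-launch check of the pipeline arguments. The following is a minimal sketch. The two flag names come from the list above, and `find_unsupported_flags` is a hypothetical helper, not part of the Dataflow or Apache Beam tooling.

```python
# Flags that the list above names as unsupported with Dataflow Prime.
UNSUPPORTED_FLAGS = ("--machine_type", "--workerMachineType")


def find_unsupported_flags(argv):
    """Return the unsupported flags present in argv, in first-seen order.

    Handles both '--flag=value' and '--flag value' forms by comparing only
    the part of each argument before the first '='.
    """
    found = []
    for arg in argv:
        name = arg.split("=", 1)[0]
        if name in UNSUPPORTED_FLAGS and name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    args = ["--runner=DataflowRunner", "--machine_type=n1-standard-4"]
    print(find_unsupported_flags(args))
```

A launcher script could call this check before submitting the job and stop with a clear message if the result is not empty.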
Feature comparison between Dataflow and Dataflow Prime
The following table compares the available features for both variants of Dataflow.
|Feature|Dataflow Prime|Dataflow|
|---|---|---|
|Runner V2|Default feature with no option to turn off|Optional feature|
|Dataflow Shuffle for batch jobs|Default feature with no option to turn off|Default feature with an option to turn off|
|Streaming Engine|Default feature with no option to turn off|Optional feature for Java pipelines and option to turn off for Python pipelines|
|Horizontal Autoscaling|Optional feature|Optional feature|
|Vertical Autoscaling|Default feature with no option to turn off|Not applicable|
|Right Fitting|Optional feature|Not applicable|
|Job Visualizer|Default feature with no option to turn off|Default feature with no option to turn off|
|Job recommendations|Default feature with no option to turn off|Default feature with no option to turn off|
|Data Pipelines|Optional feature|Optional feature|
|Billing|Serverless billing|Legacy billing|
- Read about Dataflow quotas.
- Learn how to set pipeline options.
- See available pipeline options for Java and Python pipelines.
- Learn more about autotuning features for Dataflow Prime.
- Learn more about Dataflow GPUs.