Using Dataflow Prime

Dataflow Prime is a serverless data processing platform for Apache Beam pipelines. Built on Dataflow, Dataflow Prime uses an architecture that separates compute from state storage and includes many new features. Pipelines that use Dataflow Prime benefit from automated and optimized resource management, reduced operational costs, and improved diagnostics capabilities.

Dataflow Prime supports both batch and streaming pipelines. By default, Dataflow Prime uses Dataflow Shuffle for batch jobs and Streaming Engine for streaming jobs.

All Dataflow Prime pipelines use Dataflow Runner V2.

SDK version support

Dataflow Prime supports the following Apache Beam SDKs:

  • Apache Beam Python SDK version 2.21.0 or later

  • Apache Beam Java SDK version 2.30.0 or later

To download the SDK package or to read the Release Notes, see Apache Beam Downloads.

Dataflow Prime features

Dataflow Prime supports the following features; each entry notes the kinds of pipelines it applies to:

  • Vertical Autoscaling (memory). Applies to streaming pipelines in Python.
  • Right Fitting (Dataflow Prime resource hints). Applies to batch pipelines in Python and Java.
  • Job Visualizer. Applies to batch pipelines in Python and Java.
  • Smart Recommendations. Applies to both streaming and batch pipelines in Python and Java.
  • Data Pipelines. Applies to both streaming and batch pipelines in Python and Java.

The features Job Visualizer, Smart Recommendations, and Data Pipelines are also supported for non-Dataflow Prime jobs.

Vertical Autoscaling

This feature automatically adjusts the memory available to the Dataflow worker VMs to fit the needs of the pipeline and help prevent out-of-memory errors. In Dataflow Prime, Vertical Autoscaling works alongside Horizontal Autoscaling to scale resources dynamically.

For more information, see Vertical Autoscaling.

Right Fitting

This feature uses resource hints, a new feature of Apache Beam that lets you specify resource requirements either for the entire pipeline or for specific steps of the pipeline. This feature lets you create customized workers for different steps of a pipeline. Right fitting lets you specify pipeline resources to maximize efficiency, lower operational cost, and avoid out-of-memory and other resource errors.

Right Fitting is in Preview. For more information, see Configuring Dataflow Prime Right Fitting.

Job Visualizer

This feature lets you see the performance of a Dataflow job and optimize the job's performance by finding inefficient code, including parallelization bottlenecks. In the Cloud Console, you can click on any Dataflow job in the Jobs page to view details about the job. You can also see the list of steps associated with each stage of the pipeline.

For more information, see Execution details.

Smart Recommendations

This feature lets you optimize and troubleshoot the pipeline based on the recommendations provided in the Diagnostics tab of a job's details page. In the Cloud Console, you can click on any Dataflow job in the Jobs page to view details about the job.

For more information, see Recommendations and diagnostics.

Data Pipelines

This feature lets you schedule jobs, observe resource utilizations, track data freshness objectives for streaming data, and optimize pipelines.

For more information, see Working with Data Pipelines.

Quota and limit requirements

Quotas and limits are the same for Dataflow and Dataflow Prime. For more information, see Quotas and limits.

If you use Data Pipelines, additional considerations apply for quotas and regional endpoints.

Before using Dataflow Prime

To use Dataflow Prime, you can reuse your existing pipeline code; you only need to enable the Dataflow Prime option, either through Cloud Shell or programmatically.

Dataflow Prime is backwards compatible with batch jobs that use Dataflow Shuffle and streaming jobs that use Streaming Engine. However, we recommend testing your pipelines with Dataflow Prime before using them in a production environment.

If your streaming pipeline is running in production, to use Dataflow Prime, perform the following steps:

  1. Stop the pipeline.

  2. Enable Dataflow Prime.

  3. Rerun the pipeline.
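For example, assuming the gcloud CLI is installed and a Beam Python 2.29.0-or-later pipeline (the job ID, region, and script name below are placeholders), the steps might look like:

```shell
# 1. Stop the running streaming job. Draining finishes processing
#    in-flight data before the job stops.
gcloud dataflow jobs drain JOB_ID --region=us-central1

# 2-3. Relaunch the same pipeline with Dataflow Prime enabled.
python my_pipeline.py \
  --runner=DataflowRunner \
  --region=us-central1 \
  --dataflow_service_options=enable_prime
```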

Enabling Dataflow Prime

To enable Dataflow Prime for a pipeline:

  1. Enable the Cloud Autoscaling API.

    Enable the API

    Dataflow Prime uses the Cloud Autoscaling API to dynamically adjust memory.

  2. Enable Prime in your pipeline options.

    You can set the pipeline options either programmatically or by using the command line. For supported Apache Beam SDK versions, enable the following flag:

Java

--dataflowServiceOptions=enable_prime

Python

Apache Beam Python SDK version 2.29.0 or later:

--dataflow_service_options=enable_prime

Apache Beam Python SDK version 2.21.0 to 2.28.0:

--experiments=enable_prime
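The version-dependent flag choice above can be sketched as a small helper; this function is hypothetical, not part of any SDK:

```python
def prime_flag(sdk_version: str) -> str:
    """Return the flag that enables Dataflow Prime for a given Apache Beam
    Python SDK version string such as "2.30.0".

    SDK 2.29.0 and later use --dataflow_service_options; SDK versions
    2.21.0 through 2.28.0 use the older --experiments flag.
    """
    major, minor = (int(part) for part in sdk_version.split(".")[:2])
    if (major, minor) >= (2, 29):
        return "--dataflow_service_options=enable_prime"
    return "--experiments=enable_prime"
```

For example, `prime_flag("2.33.0")` returns `--dataflow_service_options=enable_prime`, while `prime_flag("2.25.0")` falls back to `--experiments=enable_prime`.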

Using Dataflow Prime with templates

If you are using Dataflow templates, you can enable Dataflow Prime in one of the following ways:

Console

  1. Go to the Create job from template page.

    Go to Create job from template

  2. In the Additional experiment field, enter enable_prime.

Shell

  • Run the pipeline code with the --experiments flag set to enable_prime.

Pipeline code

  • In the pipeline code, set the additional_experiment argument to enable_prime.
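As one concrete sketch of the last option: when launching a classic template through the Dataflow REST API, the experiment travels in the runtime environment's `additionalExperiments` field. The bucket, template parameters, and job name below are placeholders:

```python
# Request body for a projects.locations.templates.launch call; all names
# and paths here are placeholders.
launch_body = {
    "jobName": "my-prime-job",
    "parameters": {"inputFile": "gs://my-bucket/input.txt"},
    "environment": {
        "tempLocation": "gs://my-bucket/temp",
        # Enables Dataflow Prime for the launched template job.
        "additionalExperiments": ["enable_prime"],
    },
}
```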

Dataflow Prime notes

Dataflow Prime does not support the following:

  • Resource hints for cross-language transforms. For more information about this limitation, see the Apache Beam documentation.

  • Designating specific VM types by using the flag --worker_machine_type or --machine_type for Python pipelines and --workerMachineType for Java pipelines.

  • Viewing worker VMs or using SSH to log in to them.

  • The classes MapState and OrderedListState for Java pipelines.

  • Custom window types.

  • Flexible Resource Scheduling (FlexRS).

Feature comparison between Dataflow and Dataflow Prime

The following table compares the available features for both variants of Dataflow.

| Feature | Dataflow Prime | Dataflow |
| --- | --- | --- |
| Runner V2 | Default feature with no option to turn off | Optional feature |
| Dataflow Shuffle for batch jobs | Default feature with no option to turn off | Default feature with an option to turn off |
| Streaming Engine | Default feature with no option to turn off | Optional feature for Java pipelines; option to turn off for Python pipelines |
| Horizontal Autoscaling | Optional feature | Optional feature |
| Vertical Autoscaling | Default feature with no option to turn off | Not applicable |
| Right Fitting | Optional feature | Not applicable |
| Job Visualizer | Default feature with no option to turn off | Default feature with no option to turn off |
| Job recommendations | Default feature with no option to turn off | Default feature with no option to turn off |
| Data Pipelines | Optional feature | Optional feature |
| Billing | Serverless billing | Legacy billing |

What's next