Use Dataflow Runner v2

The current production Dataflow runner uses language-specific workers to run Apache Beam pipelines. To improve scalability, generality, extensibility, and efficiency, the Dataflow runner is moving to a more services-based architecture. These changes include a more efficient and portable worker architecture, packaged together with the Shuffle Service and Streaming Engine.

Enable Dataflow Runner v2

To enable Dataflow Runner v2, follow the configuration instructions for your Apache Beam SDK:

Java

Dataflow Runner v2 requires the Apache Beam SDK for Java version 2.30.0 or later; version 2.36.0 or later is recommended.

To enable Runner v2, run your job with the following flag: --experiments=use_runner_v2.

Python

Dataflow Runner v2 requires the Apache Beam SDK for Python version 2.21.0 or later.

Under certain circumstances, your pipeline might not use Runner v2 even though it runs on a supported SDK version. Examples include Dataflow template jobs and jobs that use Group by Key as a side input. In such cases, you can explicitly opt in by running the job with the --experiments=use_runner_v2 flag.

If you want to disable Runner v2 and your job is identified as using the auto_runner_v2 experiment, use the --experiments=disable_runner_v2 flag instead.
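Both experiments can also be set in code through pipeline options rather than on the command line. The following is a minimal sketch; the project, region, and bucket values are hypothetical placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical placeholders: replace project, region, and bucket.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        # Opt in to Runner v2; use "disable_runner_v2" here instead to
        # opt out of an automatic upgrade.
        experiments=["use_runner_v2"],
    )

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | beam.Create(["hello", "runner v2"])
            | beam.Map(print)
        )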

Go

Dataflow Runner v2 is the only Dataflow runner available for the Apache Beam SDK for Go. Runner v2 is enabled by default.

Benefits of using Dataflow Runner v2

New features are available only on Dataflow Runner v2. In addition, the improved efficiency of the Dataflow Runner v2 architecture can lead to performance improvements in your Dataflow jobs.

When using Dataflow Runner v2, you might also notice a reduction in your bill.

Dataflow Runner v2 also allows you to pre-build your Python container, which can improve VM startup times and Horizontal Autoscaling performance. Learn more about the pre-build options at Using custom containers.
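As a minimal sketch, the pre-build options can be passed as pipeline options. The prebuild_sdk_container_engine and docker_registry_push_url options come from the Beam Python SDK; the project and registry path below are hypothetical placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical placeholders: project, region, bucket, and registry.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        experiments=["use_runner_v2"],
        # Build the worker container once, up front, with Cloud Build.
        prebuild_sdk_container_engine="cloud_build",
        # Registry that the pre-built image is pushed to.
        docker_registry_push_url="us-central1-docker.pkg.dev/my-project/my-repo",
    )

Pre-building with Cloud Build requires the Cloud Build API to be enabled on your project.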

Dataflow Runner v2 supports multi-language pipelines, a feature that enables your Apache Beam pipeline to use transforms defined in other Apache Beam SDKs. Currently, Dataflow Runner v2 supports using Java transforms from a Python SDK pipeline and using Python transforms from a Java SDK pipeline.
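For example, the Python SDK's Kafka connector is a cross-language transform implemented in the Java SDK, so pipelines that use it run on Runner v2. A minimal sketch, assuming hypothetical project, bucket, broker, and topic values:

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical placeholders throughout: project, region, bucket,
    # broker address, and topic name.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        # The Kafka read is unbounded, so this runs as a streaming job,
        # which on Runner v2 requires Streaming Engine.
        streaming=True,
        enable_streaming_engine=True,
    )

    # ReadFromKafka is defined in the Java SDK and invoked from Python
    # through the Apache Beam multi-language framework.
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | ReadFromKafka(
                consumer_config={"bootstrap.servers": "broker:9092"},
                topics=["my-topic"],
            )
            | beam.Map(print)
        )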

Use Dataflow Runner v2

Dataflow Runner v2 is available in regions that have Dataflow regional endpoints.

Dataflow Runner v2 requires Streaming Engine for streaming jobs. Consequently, any Apache Beam transform that requires Dataflow Runner v2 also requires Streaming Engine for streaming jobs. For example, the Pub/Sub Lite I/O connector for the Apache Beam SDK for Python is a cross-language transform that requires Dataflow Runner v2. If you try to disable Streaming Engine for a job or template that uses this transform, the job fails.
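A minimal sketch of the required options for such a job; the project and bucket values are hypothetical placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical placeholders: project, region, and bucket.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        streaming=True,
        # Required: cross-language transforms such as the Pub/Sub Lite
        # connector need Runner v2, and Runner v2 streaming jobs need
        # Streaming Engine. Do not disable it.
        enable_streaming_engine=True,
    )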

Debugging

To debug jobs that use Dataflow Runner v2, follow standard debugging steps. However, be aware of the following when using Dataflow Runner v2:

  • Dataflow Runner v2 jobs run two types of processes on the worker VM: SDK processes and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.
  • SDK processes run user code and other language-specific functions, while the runner harness process manages everything else.
  • The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.
  • Jobs might be delayed if the worker VM downloads and installs dependencies during SDK process startup. If an SDK process has issues, such as failing to start or to install libraries, the worker reports its status as unhealthy. If startup times increase, enable the Cloud Build API on your project and submit your pipeline with the following parameter: --prebuild_sdk_container_engine=cloud_build.
  • Worker VM logs, available through the Logs Explorer or the Dataflow monitoring interface, include logs from the runner harness process as well as logs from the SDK processes.
  • To diagnose problems in your user code, examine the worker logs from the SDK processes; see the sketch after this list for one way to query them. If you find any errors in the runner harness logs, contact Support to file a bug.
  • To debug common errors related to Dataflow multi-language pipelines, see the Multi-language Pipelines Tips guide.
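Worker logs can also be queried programmatically through Cloud Logging. A minimal sketch using the google-cloud-logging client; the project ID and job ID are hypothetical placeholders:

    from google.cloud import logging

    # Hypothetical placeholders: replace the project ID and job ID.
    client = logging.Client(project="my-project")
    log_filter = (
        'resource.type="dataflow_step" '
        'resource.labels.job_id="2024-01-01_00_00_00-1234567890" '
        'severity>=ERROR'
    )

    # Print the most recent matching worker log entries. These cover
    # both runner harness and SDK process logs; narrow the filter by
    # logName if you need only one of them.
    for entry in client.list_entries(
        filter_=log_filter, order_by=logging.DESCENDING, max_results=20
    ):
        print(entry.timestamp, entry.payload)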