Use Dataflow Runner V2


The current production Dataflow runner uses language-specific workers to run Apache Beam pipelines. To improve scalability, generality, extensibility, and efficiency, the Dataflow runner is moving to a more services-based architecture. These changes include a more efficient and portable worker architecture, packaged together with the Shuffle Service and Streaming Engine.

Enable Dataflow Runner v2

To enable Dataflow Runner v2, follow the configuration instructions for your Apache Beam SDK:


Java

Dataflow Runner v2 requires the Apache Beam SDK for Java version 2.30.0 or later; version 2.36.0 or later is recommended.

To enable Runner v2, run your job with the following flag: --experiments=use_runner_v2.


Python

Dataflow Runner v2 requires the Apache Beam SDK for Python version 2.21.0 or later.

Under certain circumstances, your pipeline might not use Runner v2 even though it runs on a supported SDK version. Examples include Dataflow template jobs and jobs that use Group by Key as a side input. In such cases, you can run the job with the --experiments=use_runner_v2 flag.

If you want to disable Runner v2 and your job is identified with the auto_runner_v2 experiment, use the --experiments=disable_runner_v2 flag.
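Both of these switches are values of the same --experiments flag. As an illustration only, a launcher script might collect the flag like this before handing the values to the Beam pipeline options; the parser below is a stand-in for the sketch, not the Beam SDK's own option parsing:

```python
import argparse

# Stand-in parser for illustration: --experiments can be passed more
# than once, so it is collected into a list, mirroring how Beam treats
# the flag as a list of experiment names.
parser = argparse.ArgumentParser()
parser.add_argument("--experiments", action="append", default=[])

args = parser.parse_args(["--experiments=use_runner_v2"])
print("use_runner_v2" in args.experiments)  # True
```

Passing --experiments=disable_runner_v2 instead would land "disable_runner_v2" in the same list.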


Go

Dataflow Runner v2 is the only Dataflow runner available for the Apache Beam SDK for Go. Runner v2 is enabled by default.

Benefits of using Dataflow Runner v2

New features will be available on Dataflow Runner v2 only. In addition, the improved efficiency of the Dataflow Runner v2 architecture could lead to performance improvements in your Dataflow jobs.

While using Dataflow Runner v2, you might notice a reduction in your bill.

Dataflow Runner v2 also allows you to pre-build your Python container, which can improve VM startup times and Horizontal Autoscaling performance. Learn more about the pre-build options at Using custom containers.
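As a concrete sketch, a job that pre-builds its Python container combines the prebuild parameter with the usual Dataflow flags. The project, region, and bucket names below are placeholders, and the exact flag set depends on your pipeline:

```python
# Sketch only: command-line flags for a Python job that pre-builds its
# SDK container with Cloud Build. Project, region, and bucket values
# are placeholders.
launch_args = [
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=us-central1",                # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--experiments=use_runner_v2",
    "--prebuild_sdk_container_engine=cloud_build",
]

# The prebuild parameter rides alongside the normal Dataflow flags:
print("--prebuild_sdk_container_engine=cloud_build" in launch_args)  # True
```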

Dataflow Runner v2 supports multi-language pipelines, a feature that enables your Apache Beam pipeline to use transforms defined in other Apache Beam SDKs. Currently, Dataflow Runner v2 supports using Java transforms from a Python SDK pipeline and using Python transforms from a Java SDK pipeline.

Use Dataflow Runner v2

Dataflow Runner v2 is available in regions that have Dataflow regional endpoints.

Because Dataflow Runner v2 requires Streaming Engine for streaming jobs, any Apache Beam transform that requires Dataflow Runner v2 also requires the use of Streaming Engine for streaming jobs. For example, the Pub/Sub Lite I/O connector for the Apache Beam SDK for Python is a cross-language transform that requires Dataflow Runner v2. If you try to disable Streaming Engine for a job or template that uses this transform, the job will fail.
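Because the two requirements travel together, a streaming job that uses such a transform needs both options set. A minimal sketch of the flag set (placeholder project and region; --enable_streaming_engine is assumed here to be the pipeline option that turns on Streaming Engine):

```python
# Sketch: flags for a streaming job that uses a transform requiring
# Dataflow Runner v2. Such transforms also require Streaming Engine,
# so both options are present; per the text above, dropping the
# Streaming Engine option would make the job fail.
streaming_args = [
    "--runner=DataflowRunner",
    "--project=my-project",      # placeholder
    "--region=us-central1",      # placeholder
    "--streaming",
    "--enable_streaming_engine",
    "--experiments=use_runner_v2",
]
```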


To debug jobs using Dataflow Runner v2, you should follow standard debugging steps; however, be aware of the following when using Dataflow Runner v2:

  • Dataflow Runner v2 jobs run two types of processes on the worker VM: SDK processes and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.
  • SDK processes run user code and other language-specific functions, while the runner harness process manages everything else.
  • The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.
  • Jobs might be delayed if the worker VM downloads and installs dependencies during SDK process startup. If an SDK process has issues, such as failing to start or to install libraries, the worker reports its status as unhealthy. To reduce startup times, enable the Cloud Build API on your project and submit your pipeline with the following parameter: --prebuild_sdk_container_engine=cloud_build.
  • Worker VM logs, available through the Logs Explorer or the Dataflow monitoring interface, include logs from the runner harness process as well as logs from the SDK processes.
  • To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, please contact Support to file a bug.
  • To debug common errors related to Dataflow multi-language pipelines, see the Multi-language Pipelines Tips guide.