Dataflow components

The Dataflow components let you submit Apache Beam jobs to Dataflow for execution. In Dataflow, a Job resource represents a Dataflow job.

The Google Cloud Pipeline Components SDK includes the DataflowFlexTemplateJobOp and DataflowPythonJobOp operators for creating Job resources and monitoring their execution.

Additionally, the Google Cloud Pipeline Components SDK includes the WaitGcpResourcesOp component, which you can use to mitigate costs while running Dataflow jobs.

DataflowFlexTemplateJobOp

The DataflowFlexTemplateJobOp operator lets you create a Vertex AI Pipelines component to launch a Dataflow Flex Template.

In Dataflow, a LaunchFlexTemplateParameter resource represents a Flex Template to launch. This component creates a LaunchFlexTemplateParameter resource and then requests Dataflow to create a job by launching the template. If the template is launched successfully, Dataflow returns a Job resource.

The Dataflow Flex Template component terminates upon receiving a Job resource from Dataflow. The component outputs a job_id as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component, to wait for the Dataflow job to complete.
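For example, a launch followed by a wait step might look like the following sketch. The container_spec_gcs_path value and the template parameters are placeholders for illustration; see the API reference for the component's full parameter list.

    from google_cloud_pipeline_components.v1.dataflow import DataflowFlexTemplateJobOp
    from google_cloud_pipeline_components.v1.wait_gcp_resources import WaitGcpResourcesOp

    # Launch the Flex Template; the component returns as soon as Dataflow
    # creates the Job resource.
    dataflow_flex_op = DataflowFlexTemplateJobOp(
        project=project_id,
        location=location,
        container_spec_gcs_path='gs://my-bucket/templates/my-template.json',  # placeholder
        parameters={'output': 'gs://my-bucket/results/output'},  # placeholder
    )

    # Wait for the launched Dataflow job to reach a terminal state.
    dataflow_flex_wait_op = WaitGcpResourcesOp(
        gcp_resources=dataflow_flex_op.outputs['gcp_resources']
    )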

DataflowPythonJobOp

The DataflowPythonJobOp operator lets you create a Vertex AI Pipelines component that prepares data by submitting a Python-based Apache Beam job to Dataflow for execution.

The Python code of the Apache Beam job runs with the Dataflow Runner. When you run your pipeline with the Dataflow service, the runner uploads your executable code to the location specified by the python_module_path parameter, uploads your dependencies to a Cloud Storage bucket (specified by temp_location), and then creates a Dataflow job that executes your Apache Beam pipeline on managed resources in Google Cloud.

To learn more about the Dataflow Runner, see Using the Dataflow Runner.
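To make this concrete, the following is a minimal sketch of the kind of Apache Beam module that python_module_path can point to. The --output flag matches the args shown in the examples below; everything else in the module is illustrative.

    import argparse

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def run(argv=None):
        # Parse the arguments passed through the component's args parameter;
        # any remaining flags become Apache Beam pipeline options.
        parser = argparse.ArgumentParser()
        parser.add_argument('--output', required=True, help='Output path prefix.')
        known_args, pipeline_args = parser.parse_known_args(argv)

        with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
            (
                p
                | 'Create' >> beam.Create(['hello', 'world'])
                | 'Write' >> beam.io.WriteToText(known_args.output)
            )


    if __name__ == '__main__':
        run()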

The Dataflow Python component accepts a list of arguments, specified by the args parameter, that are passed through the Beam Runner to your Apache Beam code. For example, you can use these arguments to set apache_beam.options.pipeline_options and specify a network, a subnetwork, a customer-managed encryption key (CMEK), and other options when you run Dataflow jobs.
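An args list that sets the network, subnetwork, and CMEK pipeline options might look like the following sketch; the resource names are placeholders.

    # Placeholder values shown for illustration; pass this list as the args
    # parameter of DataflowPythonJobOp.
    args = [
        '--output', OUTPUT_FILE,
        # Run Dataflow workers on a specific VPC network and subnetwork.
        '--network', 'my-network',
        '--subnetwork', 'regions/us-central1/subnetworks/my-subnetwork',
        # Protect the job with a customer-managed encryption key.
        '--dataflow_kms_key',
        'projects/my-project/locations/us-central1/keyRings/my-key-ring/cryptoKeys/my-key',
    ]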

WaitGcpResourcesOp

Dataflow jobs can often take a long time to complete, and the cost of a busy-wait container (the container that launches the Dataflow job and waits for the result) can become expensive.

After submitting the Dataflow job using the Beam runner, the DataflowPythonJobOp component terminates immediately and returns a job_id output parameter as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component, to wait for the Dataflow job to complete.

    from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp
    from google_cloud_pipeline_components.v1.wait_gcp_resources import WaitGcpResourcesOp

    # Submit the Apache Beam job to Dataflow; this component returns as soon
    # as the job is created.
    dataflow_python_op = DataflowPythonJobOp(
        project=project_id,
        location=location,
        python_module_path=python_file_path,
        temp_location=staging_dir,
        requirements_file_path=requirements_file_path,
        args=['--output', OUTPUT_FILE],
    )

    # Wait for the Dataflow job to reach a terminal state before continuing.
    dataflow_wait_op = WaitGcpResourcesOp(
        gcp_resources=dataflow_python_op.outputs["gcp_resources"]
    )

Vertex AI Pipelines optimizes WaitGcpResourcesOp to execute in a serverless fashion, so it incurs zero cost.

If DataflowPythonJobOp and DataflowFlexTemplateJobOp don't meet your requirements, you can also create your own component that outputs the gcp_resources parameter and pass it to the WaitGcpResourcesOp component.

For more information about how to create the gcp_resources output parameter, see Write a component to show a Google Cloud console link.
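As a rough sketch, and assuming the JSON layout that the gcp_resources proto serializes to (resourceType and resourceUri fields), a custom component could emit the parameter as follows; the component name and the way you obtain the job ID are placeholders.

    from kfp import dsl


    @dsl.component(base_image='python:3.10')
    def launch_my_dataflow_job(
        project: str,
        location: str,
        job_id: str,
        gcp_resources: dsl.OutputPath(str),
    ):
        # Write a serialized gcp_resources proto (in JSON form) that points at
        # the Dataflow job, so a downstream WaitGcpResourcesOp can poll it.
        # The field names below assume the layout documented for gcp_resources.
        import json

        resources = {
            'resources': [{
                'resourceType': 'DataflowJob',
                'resourceUri': (
                    f'https://dataflow.googleapis.com/v1b3/projects/{project}'
                    f'/locations/{location}/jobs/{job_id}'
                ),
            }]
        }
        with open(gcp_resources, 'w') as f:
            f.write(json.dumps(resources))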

API reference

Tutorials

Version history and release notes

To learn more about the version history and changes to the Google Cloud Pipeline Components SDK, see the Google Cloud Pipeline Components SDK Release Notes.

Technical support contacts

If you have any questions, reach out to kubeflow-pipelines-components@google.com.