The Dataflow components let you submit Apache Beam jobs to Dataflow for execution. In Dataflow, a Job resource represents a Dataflow job.
The Google Cloud Pipeline Components SDK includes the following operators for creating Job resources and monitoring their execution:
- DataflowFlexTemplateJobOp
- DataflowPythonJobOp
Additionally, the Google Cloud Pipeline Components SDK includes the WaitGcpResourcesOp component, which you can use to mitigate costs while running Dataflow jobs.
DataflowFlexTemplateJobOp
The DataflowFlexTemplateJobOp operator lets you create a Vertex AI Pipelines component to launch a Dataflow Flex Template.
In Dataflow, a LaunchFlexTemplateParameter resource represents a Flex Template to launch. This component creates a LaunchFlexTemplateParameter resource and then requests Dataflow to create a job by launching the template. If the template is launched successfully, Dataflow returns a Job resource.
The Dataflow Flex Template component terminates upon receiving a Job resource from Dataflow. The component outputs a job_id as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component to wait for the Dataflow job to complete.
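For example, a pipeline can launch a Flex Template and hand the wait off to WaitGcpResourcesOp. The following is a minimal sketch; the import paths, template spec path, and template parameters are assumptions for illustration, so check the SDK reference for the exact parameter names:

from kfp import dsl
from google_cloud_pipeline_components.v1.dataflow import DataflowFlexTemplateJobOp
from google_cloud_pipeline_components.v1.wait_gcp_resources import WaitGcpResourcesOp

@dsl.pipeline(name="dataflow-flex-template-example")
def pipeline(project_id: str, location: str, template_path: str, output_file: str):
    # Launch the Flex Template. The component returns as soon as Dataflow
    # creates the Job resource; it does not wait for the job to finish.
    dataflow_flex_op = DataflowFlexTemplateJobOp(
        project=project_id,
        location=location,
        container_spec_gcs_path=template_path,  # gs:// path to the template spec file
        parameters={"output": output_file},     # template-specific parameters
    )

    # Wait for the launched job in a lightweight component instead of
    # busy-waiting in the launcher container.
    dataflow_wait_op = WaitGcpResourcesOp(
        gcp_resources=dataflow_flex_op.outputs["gcp_resources"]
    )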
DataflowPythonJobOp
The DataflowPythonJobOp operator lets you create a Vertex AI Pipelines component that prepares data by submitting a Python-based Apache Beam job to Dataflow for execution.
The Python code of the Apache Beam job runs with Dataflow Runner. When you run your pipeline with the Dataflow service, the runner uploads your executable code to the location specified by the python_module_path parameter and dependencies to a Cloud Storage bucket (specified by temp_location), and then creates a Dataflow job that executes your Apache Beam pipeline on managed resources in Google Cloud.
To learn more about the Dataflow Runner, see Using the Dataflow Runner.
The Dataflow Python component accepts a list of arguments that are passed through the Beam Runner to your Apache Beam code. These arguments are specified by args. For example, you can use these arguments to set apache_beam.options.pipeline_options to specify a network, a subnetwork, a customer-managed encryption key (CMEK), and other options when you run Dataflow jobs.
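For instance, the following sketch passes such options through args. The network, subnetwork, and key values are placeholders, and the option names come from Apache Beam's standard pipeline options:

dataflow_python_op = DataflowPythonJobOp(
    project=project_id,
    location=location,
    python_module_path=python_file_path,
    temp_location=staging_dir,
    requirements_file_path=requirements_file_path,
    args=[
        "--output", OUTPUT_FILE,
        # Standard Apache Beam / Dataflow pipeline options (placeholder values).
        "--network", "my-network",
        "--subnetwork", "regions/us-central1/subnetworks/my-subnetwork",
        "--dataflow_kms_key",
        "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key",
    ],
)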
WaitGcpResourcesOp
Dataflow jobs can often take a long time to complete. The cost of a busy-wait container (the container that launches the Dataflow job and waits for the result) can become significant.
After submitting the Dataflow job using the Beam runner, the DataflowPythonJobOp component terminates immediately and returns a job_id output parameter as a serialized gcp_resources proto. You can pass this parameter to a WaitGcpResourcesOp component to wait for the Dataflow job to complete.
dataflow_python_op = DataflowPythonJobOp(
    project=project_id,
    location=location,
    python_module_path=python_file_path,
    temp_location=staging_dir,
    requirements_file_path=requirements_file_path,
    args=["--output", OUTPUT_FILE],
)

dataflow_wait_op = WaitGcpResourcesOp(
    gcp_resources=dataflow_python_op.outputs["gcp_resources"]
)
Vertex AI Pipelines optimizes WaitGcpResourcesOp to execute in a serverless fashion, so it incurs zero cost.
If DataflowPythonJobOp and DataflowFlexTemplateJobOp don't meet your requirements, you can also create your own component that outputs the gcp_resources parameter and pass it to the WaitGcpResourcesOp component.
For more information about how to create the gcp_resources output parameter, see Write a component to show a Google Cloud console link.
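As a rough sketch of what such a component might look like, a lightweight Python component could write the serialized gcp_resources proto as JSON. The component below is hypothetical, and the resource type and URI format should be confirmed against the linked guide:

from kfp import dsl

@dsl.component
def launch_my_dataflow_job(
    project: str,
    location: str,
    job_id: str,
    gcp_resources: dsl.OutputPath(str),
):
    # Write the serialized gcp_resources proto: a JSON object whose "resources"
    # list identifies the resource type and a fully qualified resource URI.
    import json

    resources = {
        "resources": [
            {
                "resourceType": "DataflowJob",
                "resourceUri": (
                    f"https://dataflow.googleapis.com/v1b3/projects/{project}"
                    f"/locations/{location}/jobs/{job_id}"
                ),
            }
        ]
    }
    with open(gcp_resources, "w") as f:
        f.write(json.dumps(resources))

The gcp_resources output of such a component can then be passed to WaitGcpResourcesOp in the same way as the outputs of the built-in operators.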
API reference
For component reference, see the Google Cloud Pipeline Components SDK reference for Dataflow components.
For Dataflow resource reference, see the following API reference pages:
- LaunchFlexTemplateParameter resource
- Job resource
Tutorials
- Get started with the Dataflow Flex Template component
- Get started with the Dataflow Python Job component
- Specify a network and subnetwork
- Using customer-managed encryption keys (CMEK)
Version history and release notes
To learn more about the version history and changes to the Google Cloud Pipeline Components SDK, see the Google Cloud Pipeline Components SDK Release Notes.
Technical support contacts
If you have any questions, reach out to kubeflow-pipelines-components@google.com.