You can define a workflow template in a YAML file, then instantiate the template to run the workflow. You can also import and export a workflow template YAML file to create and update a Dataproc workflow template resource.
Run a workflow using a YAML file
To run a workflow without first creating a workflow template resource, use the gcloud dataproc workflow-templates instantiate-from-file command.
- Define your workflow template in a YAML file. The YAML file must include all required WorkflowTemplate fields except the id field, and it must also exclude the version field and all output-only fields. In the following workflow example, the prerequisiteStepIds list in the terasort step ensures that the terasort step begins only after the teragen step completes successfully.

      jobs:
      - hadoopJob:
          args:
          - teragen
          - '1000'
          - hdfs:///gen/
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
        stepId: teragen
      - hadoopJob:
          args:
          - terasort
          - hdfs:///gen/
          - hdfs:///sort/
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
        stepId: terasort
        prerequisiteStepIds:
          - teragen
      placement:
        managedCluster:
          clusterName: my-managed-cluster
          config:
            gceClusterConfig:
              zoneUri: us-central1-a

- Run the workflow:

      gcloud dataproc workflow-templates instantiate-from-file \
          --file=TEMPLATE_YAML \
          --region=REGION

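The field rules above can be checked locally before the template is submitted. The following is a minimal sketch, not part of gcloud: check_template_fields is a hypothetical helper name, and it assumes that top-level keys start in column 0, as in the example YAML. It looks only for top-level id and version keys and does not detect output-only fields.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (an illustration, not a gcloud feature).
# Succeeds only if the YAML file has no top-level `id:` or `version:` key.
# Assumption: top-level keys start in column 0, so indented keys and list
# entries such as `  stepId: teragen` are ignored.
check_template_fields() {
  if grep -Eq '^(id|version)[[:space:]]*:' "$1"; then
    echo "error: $1 contains a top-level id or version field" >&2
    return 1
  fi
}
```

A possible use is to gate the run on the check, for example: check_template_fields TEMPLATE_YAML && gcloud dataproc workflow-templates instantiate-from-file --file=TEMPLATE_YAML --region=REGION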
Instantiate a workflow using a YAML file with Dataproc Auto Zone Placement
- Define your workflow template in a YAML file. This YAML file is the same as the previous YAML file, except the zoneUri field is set to the empty string ('') to allow Dataproc Auto Zone Placement to select the zone for the cluster.

      jobs:
      - hadoopJob:
          args:
          - teragen
          - '1000'
          - hdfs:///gen/
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
        stepId: teragen
      - hadoopJob:
          args:
          - terasort
          - hdfs:///gen/
          - hdfs:///sort/
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
        stepId: terasort
        prerequisiteStepIds:
          - teragen
      placement:
        managedCluster:
          clusterName: my-managed-cluster
          config:
            gceClusterConfig:
              zoneUri: ''

- Run the workflow. When using Auto Zone Placement, you must pass a region to the gcloud command.

      gcloud dataproc workflow-templates instantiate-from-file \
          --file=TEMPLATE_YAML \
          --region=REGION

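If you already have a template file with a fixed zone, the only change needed for Auto Zone Placement is the zoneUri value. The sketch below shows one way to produce that variant with sed. The function name clear_zone_uri is hypothetical, and the sketch assumes each zoneUri key sits on its own line with its value, as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not a gcloud feature).
# Prints a copy of the template in which every `zoneUri:` value is replaced
# by the empty string (''), leaving indentation and all other lines as-is.
# Assumption: the key and its value are on a single line.
clear_zone_uri() {
  sed -E "s/^([[:space:]]*zoneUri:).*/\1 ''/" "$1"
}
```

For example, clear_zone_uri TEMPLATE_YAML > AUTO_ZONE_TEMPLATE_YAML writes the modified copy to a second file (AUTO_ZONE_TEMPLATE_YAML is a placeholder name), which you can then pass to --file.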
Import and export a workflow template YAML file
You can import and export workflow template YAML files. Typically, a workflow template is first exported as a YAML file, then the YAML is edited, and then the edited YAML file is imported to update the template.
- Export the workflow template to a YAML file. During the export operation, the id and version fields and all output-only fields are filtered from the output and do not appear in the exported YAML file.

      gcloud dataproc workflow-templates export TEMPLATE_ID or TEMPLATE_NAME \
          --destination=TEMPLATE_YAML \
          --region=REGION

  You can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/PROJECT_ID/regions/REGION/workflowTemplates/TEMPLATE_ID") to the command.
- Edit the YAML file locally. Note that the id, version, and output-only fields, which were filtered from the YAML file when the template was exported, are disallowed in the imported YAML file.
- Import the updated workflow template YAML file:

      gcloud dataproc workflow-templates import TEMPLATE_ID or TEMPLATE_NAME \
          --source=TEMPLATE_YAML \
          --region=REGION

  You can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/PROJECT_ID/regions/REGION/workflowTemplates/TEMPLATE_ID") to the command. The template resource with the same template name is overwritten (updated) and its version number is incremented. If a template with the same template name does not exist, it is created.
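The fully qualified resource name accepted by the export and import commands follows a fixed pattern, so it can be assembled from its three parts. The following is a minimal sketch. The function name template_resource_name and the sample values in the comments are hypothetical. Only the name format comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not a gcloud feature).
# Builds a fully qualified workflow template resource name.
#   $1 = project ID, $2 = region, $3 = template ID
template_resource_name() {
  printf 'projects/%s/regions/%s/workflowTemplates/%s\n' "$1" "$2" "$3"
}

# Example with made-up values:
#   template_resource_name my-project us-central1 my-template
# prints:
#   projects/my-project/regions/us-central1/workflowTemplates/my-template
```

The printed name can be used wherever the commands above take TEMPLATE_NAME, which keeps the export and import steps of a round trip pointed at the same template.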