Workflow templates overview

The Cloud Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows. Create a workflow template, add one or more jobs to the template, then instantiate the template. The instantiated template (workflow) will create the cluster, run the jobs, then delete the cluster when the workflow is finished. Workflow metadata includes a graph of workflow operations that can help you monitor and analyze workflow progress and results.

Creating a workflow template does not create a Cloud Dataproc cluster or create and submit jobs. Clusters and the jobs associated with them are created only when a workflow template is instantiated. You can also create a template that, instead of creating a new cluster, runs the workflow on an existing cluster (see Types of workflow templates).
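
For orientation, the full lifecycle looks like the following sequence of gcloud commands (a minimal sketch: the template name, step-id, jar, and bucket paths are placeholder values, and each step is covered in detail in the sections below):

# Create an empty template; no cluster or jobs exist yet.
gcloud beta dataproc workflow-templates create my-workflow
# Describe the ephemeral cluster the workflow will create.
gcloud beta dataproc workflow-templates set-managed-cluster my-workflow \
  --cluster-name my-cluster
# Add a job step to the template.
gcloud beta dataproc workflow-templates add-job hadoop \
  --step-id foo \
  --workflow-template my-workflow \
  --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  -- wordcount gs://my-bucket/input gs://my-bucket/output
# Instantiate the template: only now is the cluster created, the job run,
# and the cluster deleted.
gcloud beta dataproc workflow-templates instantiate my-workflow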

Types of workflow templates

A workflow template can specify a managed cluster. The workflow creates this "ephemeral" cluster to run workflow jobs, and deletes the cluster when the workflow is finished.

Alternatively, a workflow template can specify one or more existing clusters via user labels that were previously applied to those clusters. Cloud Dataproc selects a matching cluster to run workflow jobs (see Adding a cluster selector to a template for how the selection is made). The clusters are not deleted at the end of the workflow.

Creating a template

gcloud

gcloud beta dataproc workflow-templates create template-id
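
For example, to create a template named "my-workflow" (the --region flag shown here is the standard Dataproc region flag, included only for illustration):
gcloud beta dataproc workflow-templates create my-workflow --region us-central1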

REST API

See workflowTemplates.create.
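
For illustration, a template can be created by POSTing a complete WorkflowTemplate body to the regional workflowTemplates collection. The following curl sketch uses the v1 endpoint with placeholder project, region, and field values (the placement and jobs structures are covered in the sections below):

# Hypothetical example: create a template that runs one Hadoop job on an
# existing cluster selected by label.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "id": "my-workflow",
        "placement": {
          "clusterSelector": {
            "zone": "us-central1-a",
            "clusterLabels": {"env": "staging"}
          }
        },
        "jobs": [
          {
            "stepId": "foo",
            "hadoopJob": {"mainJarFileUri": "gs://my-bucket/my-job.jar"}
          }
        ]
      }' \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates"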

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Adding a managed cluster to a template

A Cloud Dataproc managed cluster is created at the start of the workflow and is used to run workflow jobs. At the end of the workflow, the managed cluster is deleted.

gcloud

Use flags inherited from gcloud dataproc clusters create to configure the managed cluster (number of workers, master and worker machine types, and so on). Cloud Dataproc adds a suffix to the cluster name to ensure uniqueness.
gcloud beta dataproc workflow-templates set-managed-cluster template-id \
  --master-machine-type machine-type \
  --worker-machine-type machine-type \
  --num-workers number \
  --cluster-name cluster-name
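
For example, with concrete (placeholder) values:
gcloud beta dataproc workflow-templates set-managed-cluster my-workflow \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2 \
  --cluster-name my-cluster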

REST API

See WorkflowTemplatePlacement.ManagedCluster. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
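
For illustration, the placement.managedCluster portion of a WorkflowTemplate body might look like the following sketch (the cluster name and machine types are placeholder values):

"placement": {
  "managedCluster": {
    "clusterName": "my-cluster",
    "config": {
      "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-4"},
      "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-4"}
    }
  }
}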

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Adding a cluster selector to a template

Instead of using a managed cluster whose lifetime depends on the lifetime of the workflow, you can specify a cluster selector that selects an existing cluster to run workflow jobs. The selected cluster will not be deleted at the end of the workflow.

You must specify at least one label when setting a cluster selector for your workflow. Cloud Dataproc selects a cluster whose labels match the specified selector labels. For example, you can run your workflow on a specific cluster by specifying a cluster selector that uses one of the cluster's automatically applied labels: either goog-dataproc-cluster-name or goog-dataproc-cluster-uuid.

If more than one label is passed to the selector, a cluster must match all of them to be selected. If more than one cluster matches the specified labels, Cloud Dataproc chooses the cluster with the most free YARN memory. Because this choice is made per job, one workflow job can run on one matching cluster while another job runs on a different matching cluster.

You must also specify a zone. Cloud Dataproc runs the workflow process in this zone (see Available regions & zones to choose one). Note that this parameter affects only where the workflow process runs; it does not affect which cluster is selected to run workflow jobs.

gcloud

gcloud beta dataproc workflow-templates set-cluster-selector template-id \
  --cluster-labels name=value[[,name=value]...] \
  --zone zone
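
For example, to pin the workflow to a specific existing cluster by using one of the automatically applied labels mentioned above (the cluster name and zone are placeholder values):
gcloud beta dataproc workflow-templates set-cluster-selector my-workflow \
  --cluster-labels goog-dataproc-cluster-name=my-cluster \
  --zone us-central1-a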

REST API

See WorkflowTemplatePlacement.ClusterSelector. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
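
For illustration, the placement.clusterSelector portion of a WorkflowTemplate body might look like the following sketch (the zone and cluster name are placeholder values):

"placement": {
  "clusterSelector": {
    "zone": "us-central1-a",
    "clusterLabels": {"goog-dataproc-cluster-name": "my-cluster"}
  }
}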

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Adding jobs to a template

All jobs run concurrently unless you specify one or more jobs that must finish successfully before another job can start. You must provide a step-id for each job. The ID must be unique within the workflow, but does not need to be globally unique.

gcloud

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the --start-after flag, passing the step-id of one or more other jobs in the workflow, to have the job start only after those jobs complete.
Examples:
Add Hadoop job "foo" to the "my-workflow" template.
gcloud beta dataproc workflow-templates add-job hadoop \
  --step-id foo \
  --workflow-template my-workflow \
  ...job args...
Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.
gcloud beta dataproc workflow-templates add-job job-type \
  --step-id bar \
  --start-after foo \
  --workflow-template my-workflow \
  ...job args...
Add another job "baz" to the "my-workflow" template, to be run after both the "foo" and "bar" jobs have completed successfully.
gcloud beta dataproc workflow-templates add-job job-type \
  --step-id baz \
  --start-after foo,bar \
  --workflow-template my-workflow \
  ...job args...

REST API

See WorkflowTemplate.OrderedJob. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
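
For illustration, the "bar" job from the gcloud examples above corresponds to an entry in the template's jobs list like the following sketch; the --start-after flag maps to the prerequisiteStepIds field (the jar URI is a placeholder):

"jobs": [
  {
    "stepId": "bar",
    "prerequisiteStepIds": ["foo"],
    "hadoopJob": {"mainJarFileUri": "gs://my-bucket/my-job.jar"}
  }
]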

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Running, monitoring, stopping, and updating a workflow template

Instantiating a workflow template runs the workflow defined by the template. Multiple simultaneous instantiations of a template are supported. When a workflow starts, it verifies that all resource names are unique within the workflow (for example, no step-id collisions), then resolves the cluster to use: it creates the managed cluster or selects an existing cluster, according to the template's placement.

gcloud

gcloud beta dataproc workflow-templates instantiate template-id

REST API

See workflowTemplates.instantiate.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Workflow job failures

A failure of any job (node) in a workflow causes the workflow to fail. Cloud Dataproc mitigates the effect of a failure by failing all concurrently executing jobs and preventing subsequent jobs from starting.

Monitoring and listing templates

To monitor a workflow, use the Cloud Dataproc operations.get API.

To list running workflows, use the Cloud Dataproc operations.list API with a label filter. Here's a gcloud command-line tool example:

gcloud dataproc operations list \
  --filter "labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

Stopping a workflow

A running workflow can be terminated by using the Cloud Dataproc operations.cancel API. Here's an example that uses the gcloud command-line tool:

gcloud beta dataproc operations cancel operation-id

Deleting a template

gcloud

gcloud beta dataproc workflow-templates delete template-id

REST API

See workflowTemplates.delete.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Updating a template

You can update a workflow template, but updates do not affect running instantiations of the template: the updated template applies only to new instantiations.

gcloud

Update a workflow template by issuing new gcloud beta dataproc workflow-templates commands that reference the existing template-id, for example to add jobs, set a managed cluster, or set a cluster selector on the template.

REST API

To make an update to a template with the REST API:
  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
  2. Make updates to the fetched template.
  3. Call workflowTemplates.update with the updated template.
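
A hypothetical curl sketch of this read-modify-write cycle (v1 endpoint; project, region, and template names are placeholders):

# 1. Fetch the current template; the response includes the server "version" field.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-workflow" \
  > template.json
# 2. Edit template.json, keeping the "version" field, then submit the update.
curl -X PUT \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @template.json \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-workflow"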

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.