Workflow templates—Overview

The Cloud Dataproc WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing workflows. A Workflow Template is a reusable workflow configuration. It defines a graph of jobs, with information on where to run those jobs.

You run a workflow by:

  1. Creating a workflow template
  2. Adding one or more jobs to the template
  3. Adding placement information to the template
  4. Instantiating the template

Instantiating a workflow template launches a Workflow. The Workflow is an operation that creates a cluster, runs jobs on the cluster, and then deletes the cluster.

Creating a workflow template does not create a Cloud Dataproc cluster or create and submit jobs. Clusters and jobs associated with clusters are only created when a workflow template is instantiated. Further, you can create a template that does not create a new cluster but, instead, runs the workflow on an existing cluster (see Workflow placement).
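
As a concrete sketch of the four steps above, here is one possible gcloud command sequence. The template name, job, jar, and cluster settings below are illustrative placeholders, not values required by Cloud Dataproc:

gcloud dataproc workflow-templates create my-workflow
gcloud dataproc workflow-templates add-job hadoop \
  --step-id teragen \
  --workflow-template my-workflow \
  --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  -- teragen 1000 hdfs:///gen/
gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --cluster-name my-managed-cluster \
  --num-workers 2
gcloud dataproc workflow-templates instantiate my-workflow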

Workflow placement

A workflow template can specify a managed cluster. The workflow will create this "ephemeral" cluster to run workflow jobs, and will delete the cluster when the workflow is finished.

Alternatively, a workflow template can specify one or more existing clusters, via one or more user labels previously applied to the cluster(s). Cloud Dataproc will randomly select among the matching clusters to use in the workflow. At the end of the workflow, the clusters are not deleted.

Creating a workflow template

gcloud command

Run the following command to create a Cloud Dataproc workflow template resource.

gcloud dataproc workflow-templates create template-id (such as "my-workflow")

REST API

See workflowTemplates.create.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Adding a managed cluster to a template

A Cloud Dataproc managed cluster is created at the start of the workflow and is used to run workflow jobs. At the end of the workflow, the managed cluster is deleted.

gcloud command

Use flags inherited from gcloud dataproc clusters create to configure the managed cluster (number of workers, master/worker machine type, etc.). Cloud Dataproc will add a suffix to the cluster name to ensure uniqueness.
gcloud dataproc workflow-templates set-managed-cluster template-id \
  --master-machine-type machine-type \
  --worker-machine-type machine-type \
  --num-workers number \
  --cluster-name cluster-name
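
For example, with illustrative values for the template name, machine type, and cluster name:

gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2 \
  --cluster-name my-workflow-cluster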

REST API

See WorkflowTemplatePlacement.ManagedCluster. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
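
As a rough sketch, the placement portion of a WorkflowTemplate request body with a managed cluster might look like the following (the cluster name, zone, and machine settings are illustrative):

"placement": {
  "managedCluster": {
    "clusterName": "my-workflow-cluster",
    "config": {
      "gceClusterConfig": {"zoneUri": "us-central1-a"},
      "masterConfig": {"machineTypeUri": "n1-standard-4"},
      "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-4"}
    }
  }
}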

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Adding a cluster selector to a template

Instead of using a managed cluster whose lifetime depends on the lifetime of the workflow, you can specify a cluster selector that selects an existing cluster to run workflow jobs. The selected cluster will not be deleted at the end of the workflow.

You must specify at least one label when setting a cluster selector for your workflow. Cloud Dataproc will select a cluster whose label matches the specified selector label. If more than one label is passed to the selector, a cluster must match all labels to be selected.

gcloud command

gcloud dataproc workflow-templates set-cluster-selector template-id \
  --cluster-labels name=value[[,name=value]...]
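
For example, the following selector (with illustrative label names and values) matches only clusters that carry both labels:

gcloud dataproc workflow-templates set-cluster-selector my-workflow \
  --cluster-labels environment=prod,owner=data-team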

REST API

See WorkflowTemplatePlacement.ClusterSelector. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Using automatically applied labels

You can specify a cluster selector that uses one of the following automatically applied cluster labels:

  • goog-dataproc-cluster-name
  • goog-dataproc-cluster-uuid

Example:

 gcloud dataproc workflow-templates set-cluster-selector template-id \
  --cluster-labels goog-dataproc-cluster-name=my-cluster

Selecting from a cluster pool

You can let Cloud Dataproc choose a cluster from a pool of clusters by defining cluster pools with labels.

Example:

gcloud dataproc clusters create cluster-1 --labels cluster-pool=pool-1
gcloud dataproc clusters create cluster-2 --labels cluster-pool=pool-1
gcloud dataproc clusters create cluster-3 --labels cluster-pool=pool-2

... After cluster creation ...

gcloud dataproc workflow-templates create my-template
gcloud dataproc workflow-templates set-cluster-selector my-template \
  --cluster-labels cluster-pool=pool-1

The workflow will be run on either cluster-1 or cluster-2, but not on cluster-3.

Adding jobs to a template

All jobs run concurrently unless you specify one or more jobs that must finish successfully before a job can start. You must provide a step-id for each job. The id must be unique within the workflow, but does not need to be unique globally.

gcloud command

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the ‑‑start-after flag, passing the step-id(s) of one or more other jobs in the workflow, to have the job start after those jobs have completed successfully.

Examples:

Add Hadoop job "foo" to the "my-workflow" template.

gcloud dataproc workflow-templates add-job hadoop \
  --step-id foo \
  --workflow-template my-workflow \
  -- space separated job args

Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.

gcloud dataproc workflow-templates add-job job-type \
  --step-id bar \
  --start-after foo \
  --workflow-template my-workflow \
  -- space separated job args

Add another job "baz" to "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.

gcloud dataproc workflow-templates add-job job-type \
  --step-id baz \
  --start-after foo,bar \
  --workflow-template my-workflow \
  -- space separated job args

REST API

See WorkflowTemplate.OrderedJob. Note that this field/structure is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
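
As a rough sketch, the jobs portion of a WorkflowTemplate request body corresponding to the gcloud examples above might look like the following; the prerequisiteStepIds field plays the role of the --start-after flag (the jar, arguments, and class are illustrative):

"jobs": [
  {
    "stepId": "foo",
    "hadoopJob": {
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
      "args": ["teragen", "1000", "hdfs:///gen/"]
    }
  },
  {
    "stepId": "bar",
    "prerequisiteStepIds": ["foo"],
    "sparkJob": {"mainClass": "org.example.MyJob"}
  }
]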

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Running, monitoring, stopping, and updating a workflow template

The instantiation of a workflow template runs the workflow defined by the template. Multiple (simultaneous) instantiations of a template are supported. The workflow checks that all resource names are unique within the workflow (no job id collisions). It also resolves cluster resources.

gcloud command

gcloud dataproc workflow-templates instantiate template-id

The command returns an operation ID, which you can use to track workflow status.

Example:
gcloud dataproc workflow-templates instantiate my-template-id
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
WorkflowTemplate [my-template-id] DONE
Deleted cluster: my-template-id-rg544az7mpbfa.

REST API

See workflowTemplates.instantiate.
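
A rough sketch of an instantiate request with curl (the project, region, and template ID are placeholders to substitute):

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{}' \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-template-id:instantiate"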

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Workflow job failures

A failure in any node in a workflow will cause the workflow to fail. Cloud Dataproc will seek to mitigate the effect of failures by causing all concurrently executing jobs to fail and preventing subsequent jobs from starting.

Monitoring and listing templates

gcloud command

To monitor a workflow:

gcloud dataproc operations describe operation-id

Note: operation-id is the ID returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running, monitoring, stopping, and updating a workflow template).

To list workflow status:

gcloud dataproc operations list \
  --filter "labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

REST API

To monitor a workflow, use the Cloud Dataproc operations.get API.

To list running workflows, use the Cloud Dataproc operations.list API with a label filter.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Stopping a workflow

A running workflow can be terminated by using the Cloud Dataproc operations.cancel API. Here's an example that uses the gcloud command-line tool:

gcloud dataproc operations cancel operation-id

Note: operation-id is the ID returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running, monitoring, stopping, and updating a workflow template).

Deleting a template

gcloud command

gcloud dataproc workflow-templates delete template-id

REST API

See workflowTemplates.delete.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Updating a template

You can update workflow templates, but updates do not affect running instantiations: the updated template applies only to new instantiations.

gcloud command

Workflow templates can be updated by issuing new gcloud dataproc workflow-templates commands that reference an existing template-id to add jobs, set a managed cluster, or set a cluster selector on the existing template.
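
For example, to add another job to an existing template (the template name, step id, and job class are illustrative):

gcloud dataproc workflow-templates add-job spark \
  --step-id extra-step \
  --workflow-template my-workflow \
  --class org.example.ExtraJob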

REST API

To make an update to a template with the REST API (a rough request sketch follows these steps):
  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version
  2. Make updates to the fetched template
  3. Call workflowTemplates.update with the updated template
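
A rough sketch of those steps with curl; the project, region, and template ID are placeholders, and template.json is the fetched template after local edits (it must keep the version value returned by the get call):

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-template-id" \
  > template.json
# ... edit template.json locally ...
curl -X PUT \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @template.json \
  "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-template-id"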

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Using workflow template YAML files

You can define a workflow template in a YAML file, then instantiate the template to run the workflow. You can also import and export a workflow template YAML file to create and update a Cloud Dataproc workflow template resource.

Instantiate a workflow using a YAML file

To run a workflow without first creating a workflow template resource, use the gcloud dataproc workflow-templates instantiate-from-file command.

  1. Define your workflow template in a YAML file. The YAML file must include all required WorkflowTemplate fields except the id field, and it must also exclude the version field and all output-only fields. Here's an example of a single-job workflow:
    jobs:
    - hadoopJob:
        args:
        - teragen
        - '1000'
        - hdfs:///gen/
        mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      stepId: teragen
    placement:
      managedCluster:
        clusterName: my-managed-cluster
        config:
          gceClusterConfig:
            zoneUri: us-central1-a
    
  2. Run the workflow:
    gcloud dataproc workflow-templates instantiate-from-file --file your-template.yaml
    

Import and export a workflow template YAML file

You can import and export workflow template YAML files. Typically, a workflow template is first exported as a YAML file, then the YAML is edited, and then the edited YAML file is imported to update the template.

  1. Export the workflow template to a YAML file. During the export operation, the id and version fields, and all output-only fields are filtered from the output and do not appear in the exported YAML file.
    gcloud dataproc workflow-templates export template-id or template-name \
      --destination template.yaml
    
    You can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/projectId/regions/region/workflowTemplates/template_id") to the command.
  2. Edit the YAML file locally. Note that the id, version, and output-only fields, which were filtered from the YAML file when the template was exported, are disallowed in the imported YAML file.
  3. Import the updated workflow template YAML file:
    gcloud dataproc workflow-templates import template-id or template-name \
      --source template.yaml
    
    You can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/projectId/regions/region/workflowTemplates/template_id") to the command. The template resource with the same template name will be overwritten (updated) and its version number will be incremented. If a template with the same template name does not exist, it will be created.