Using workflows

You set up and run a workflow by:

  1. Creating a workflow template
  2. Configuring a managed (ephemeral) cluster or selecting an existing cluster
  3. Adding jobs
  4. Instantiating the template to run the workflow
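Assuming hypothetical template, cluster, and job names, the four steps above can be sketched end to end with the gcloud command-line tool (the individual flags are covered in the sections below):

```shell
# 1. Create the workflow template (all names here are hypothetical examples).
gcloud dataproc workflow-templates create my-workflow

# 2. Configure a managed (ephemeral) cluster for the workflow.
gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --cluster-name my-workflow-cluster \
  --num-workers 2

# 3. Add a job to the template.
gcloud dataproc workflow-templates add-job hadoop \
  --step-id my-step \
  --workflow-template my-workflow \
  -- arg1 arg2

# 4. Instantiate the template to run the workflow.
gcloud dataproc workflow-templates instantiate my-workflow
```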

Creating a template

gcloud command

Run the following command to create a Cloud Dataproc workflow template resource.

gcloud dataproc workflow-templates create template-id (such as "my-workflow")

REST API

See workflowTemplates.create.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Configuring or selecting a cluster

A workflow can run on a new, "managed" cluster that Cloud Dataproc creates for the workflow, or on an existing cluster that you select.

  • Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.

  • Managed cluster: You must configure a managed cluster for your workflow. Cloud Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.
    You can configure a managed cluster for your workflow using the gcloud command-line tool or the Cloud Dataproc API.

    gcloud command

    Use flags inherited from gcloud dataproc clusters create to configure the managed cluster (number of workers, master and worker machine types, and so on). Cloud Dataproc adds a suffix to the cluster name to ensure uniqueness.

    gcloud dataproc workflow-templates set-managed-cluster template-id \
        --master-machine-type machine-type \
        --worker-machine-type machine-type \
        --num-workers number \
        --cluster-name cluster-name

    REST API

    See WorkflowTemplatePlacement.ManagedCluster.

    Console

    Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.
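For reference, a fully specified set-managed-cluster invocation for a managed cluster might look like this (hypothetical template name, machine types, and sizes):

```shell
# Hypothetical example values; substitute your own template-id and sizing.
gcloud dataproc workflow-templates set-managed-cluster my-workflow \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2 \
  --cluster-name my-workflow-cluster
```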

Adding jobs to a template

All jobs run concurrently unless you specify one or more job dependencies. A job's dependencies are expressed as a list of other jobs that must finish successfully before the dependent job can start. You must provide a step-id for each job. The ID must be unique within the workflow, but does not need to be unique globally.

gcloud command

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the --start-after flag, passing the step-id of one or more other jobs in the workflow, to have the job start only after those jobs complete.

Examples:

Add Hadoop job "foo" to the "my-workflow" template.

gcloud dataproc workflow-templates add-job hadoop \
  --step-id foo \
  --workflow-template my-workflow \
  -- space separated job args

Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.

gcloud dataproc workflow-templates add-job job-type \
  --step-id bar \
  --start-after foo \
  --workflow-template my-workflow \
  -- space separated job args

Add another job "baz" to "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.

gcloud dataproc workflow-templates add-job job-type \
  --step-id baz \
  --start-after foo,bar \
  --workflow-template my-workflow \
  -- space separated job args

REST API

See WorkflowTemplate.OrderedJob.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Running a workflow

Instantiating a workflow template runs the workflow defined by the template. A template can be instantiated multiple times, so the same workflow can be run repeatedly.

gcloud command

gcloud dataproc workflow-templates instantiate template-id

The command returns an operation ID, which you can use to track workflow status.

Example:
gcloud dataproc workflow-templates instantiate my-template-id
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
Deleted cluster: my-template-id-rg544az7mpbfa.
WorkflowTemplate [my-template-id] DONE
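The instantiate command shown above blocks and streams progress until the workflow finishes. As a sketch (hypothetical IDs; --async is a standard gcloud flag for long-running operations and is assumed to apply here), you can instead return immediately and poll the operation:

```shell
# Start the workflow without waiting for it to finish (hypothetical template-id).
gcloud dataproc workflow-templates instantiate my-template-id --async

# Poll the operation using the operation ID printed by the command above.
gcloud dataproc operations describe operation-id
```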

REST API

See workflowTemplates.instantiate.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Workflow job failures

A failure in any job in a workflow causes the workflow to fail. Cloud Dataproc mitigates the effect of a failure by cancelling all concurrently executing jobs and preventing subsequent jobs from starting.

Monitoring and listing a workflow

gcloud command

To monitor a workflow:

gcloud dataproc operations describe operation-id

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

To list workflow status:

gcloud dataproc operations list \
  --filter "labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

REST API

To monitor a workflow, use the Cloud Dataproc operations.get API.

To list running workflows, use the Cloud Dataproc operations.list API with a label filter.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Terminating a workflow

You can terminate a workflow using the gcloud command-line tool or by calling the Cloud Dataproc API.

gcloud command

gcloud dataproc operations cancel operation-id

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

REST API

See operations.cancel API.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Updating a workflow template

Updates do not affect running workflows. The new template version will only apply to new workflows.

gcloud command

Workflow templates can be updated by issuing gcloud dataproc workflow-templates commands, such as add-job, remove-job, or set-managed-cluster, that reference an existing workflow template-id.

REST API

To make an update to a template with the REST API:

  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version
  2. Make updates to the fetched template
  3. Call workflowTemplates.update with the updated template
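The get/modify/update cycle above can be sketched with plain HTTP calls against the v1 REST endpoints. This is a minimal sketch, not a definitive implementation: it assumes you already hold an OAuth bearer token, and the helper names (`template_url`, `update_template`, `mutate`) are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://dataproc.googleapis.com/v1"

def template_url(project, region, template_id):
    """Build the workflowTemplates resource URL for the v1 REST API."""
    return (f"{API_ROOT}/projects/{project}/regions/{region}"
            f"/workflowTemplates/{template_id}")

def _call(method, url, token, body=None):
    """Issue an authenticated JSON request and decode the response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def update_template(token, project, region, template_id, mutate):
    url = template_url(project, region, template_id)
    template = _call("GET", url, token)        # 1. fetch, includes current version
    mutate(template)                           # 2. apply changes in place
    return _call("PUT", url, token, template)  # 3. write back; server checks version
```

Sending the fetched `version` field back unchanged lets the server detect and reject a concurrent update to the same template.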

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.

Deleting a workflow template

gcloud command

gcloud dataproc workflow-templates delete template-id

REST API

See workflowTemplates.delete.

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in a future Cloud Dataproc release.