Use workflows

You set up and run a workflow by:

  1. Creating a workflow template
  2. Configuring a managed (ephemeral) cluster or selecting an existing cluster
  3. Adding jobs
  4. Instantiating the template to run the workflow
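These steps can also be captured in a single template file. The YAML below is an illustrative sketch only (field names follow the WorkflowTemplate resource; all names and values are hypothetical) of a template with a managed cluster and two dependent jobs. A file like this can be imported with gcloud dataproc workflow-templates import.

```yaml
# Illustrative sketch; field names follow the WorkflowTemplate resource,
# values are hypothetical.
id: workflow-template-1
placement:
  managedCluster:
    clusterName: managed-cluster
    config:
      masterConfig:
        numInstances: 1
        machineTypeUri: n1-standard-4
      workerConfig:
        numInstances: 2
        machineTypeUri: n1-standard-4
jobs:
  - stepId: foo
    hadoopJob:
      mainJarFileUri: gs://my-bucket/foo.jar
  - stepId: bar
    prerequisiteStepIds:
      - foo
    hadoopJob:
      mainJarFileUri: gs://my-bucket/bar.jar
```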

Creating a template

gcloud command

Run the following command to create a Dataproc workflow template resource.

gcloud dataproc workflow-templates create TEMPLATE_ID \
    --region=REGION

Notes:

  • REGION: Specify the region where your template will run.
  • TEMPLATE_ID: Provide an ID for your template, such as "workflow-template-1".

  • CMEK encryption. You can add the --kms-key flag to use CMEK encryption on workflow template job arguments.

REST API

Submit a WorkflowTemplate as part of a workflowTemplates.create request. You can add the WorkflowTemplate.EncryptionConfig.kmsKey field to use CMEK encryption on workflow template job arguments.
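As an illustrative sketch (hypothetical names; the uppercase segments in the key name are placeholders), a minimal workflowTemplates.create request body with CMEK enabled might look like this:

```json
{
  "id": "workflow-template-1",
  "placement": {
    "managedCluster": {
      "clusterName": "managed-cluster",
      "config": {}
    }
  },
  "jobs": [
    {
      "stepId": "foo",
      "hadoopJob": { "mainJarFileUri": "gs://my-bucket/foo.jar" }
    }
  ],
  "encryptionConfig": {
    "kmsKey": "projects/PROJECT/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY"
  }
}
```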

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Configuring or selecting a cluster

Dataproc can create and use a new, "managed" cluster for your workflow, or it can use an existing cluster.

  • Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.

  • Managed cluster: You must configure a managed cluster for your workflow. Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.
    You can configure a managed cluster for your workflow using the gcloud command-line tool or the Dataproc API.

    gcloud command

    Use flags inherited from gcloud dataproc clusters create to configure the managed cluster, such as the number of workers and the master and worker machine types. Dataproc adds a suffix to the cluster name to ensure uniqueness.

    gcloud dataproc workflow-templates set-managed-cluster template-id \
        --region=region \
        --master-machine-type=machine-type \
        --worker-machine-type=machine-type \
        --num-workers=number \
        --cluster-name=cluster-name
    

    REST API

    See WorkflowTemplatePlacement.ManagedCluster. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
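    For illustration, a managed-cluster placement fragment of a template might look like the following (hypothetical values; field names follow WorkflowTemplatePlacement.ManagedCluster):

```json
"placement": {
  "managedCluster": {
    "clusterName": "managed-cluster",
    "config": {
      "masterConfig": { "numInstances": 1, "machineTypeUri": "n1-standard-4" },
      "workerConfig": { "numInstances": 2, "machineTypeUri": "n1-standard-4" }
    }
  }
}
```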

    Console

    You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Adding jobs to a template

All jobs run concurrently unless you specify one or more job dependencies. A job's dependencies are expressed as a list of other jobs that must finish successfully before the dependent job can start. You must provide a step-id for each job. The ID must be unique within the workflow, but does not need to be unique globally.
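The step-id and dependency rules above can be sketched locally. The following Python snippet is illustrative only (it does not call the Dataproc API): it checks that step IDs are unique within a workflow and that every prerequisite refers to another step in the same workflow.

```python
# Illustrative sketch: validate workflow step dependencies locally.
# Mirrors the rules described above (step IDs unique within the workflow;
# a dependency list names other steps that must finish first).

def validate_steps(jobs):
    """jobs: list of dicts with 'stepId' and optional 'prerequisiteStepIds'."""
    seen = set()
    for job in jobs:
        step = job["stepId"]
        if step in seen:
            raise ValueError(f"duplicate step-id: {step}")
        seen.add(step)
    for job in jobs:
        for dep in job.get("prerequisiteStepIds", []):
            if dep not in seen:
                raise ValueError(f"unknown prerequisite: {dep}")
            if dep == job["stepId"]:
                raise ValueError(f"step depends on itself: {dep}")
    return True

jobs = [
    {"stepId": "foo"},
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "prerequisiteStepIds": ["foo", "bar"]},
]
print(validate_steps(jobs))  # True
```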

gcloud command

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the --start-after=job-id flag to have the job start after the completion of one or more other jobs in the workflow (to start after multiple jobs, pass a comma-separated list of step IDs).

Examples:

Add Hadoop job "foo" to the "my-workflow" template.

gcloud dataproc workflow-templates add-job hadoop \
    --region=region \
    --step-id=foo \
    --workflow-template=my-workflow \
    -- space separated job args

Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.

gcloud dataproc workflow-templates add-job job-type \
    --region=region \
    --step-id=bar \
    --start-after=foo \
    --workflow-template=my-workflow \
    -- space separated job args

Add another job "baz" to "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.

gcloud dataproc workflow-templates add-job job-type \
    --region=region \
    --step-id=baz \
    --start-after=foo,bar \
    --workflow-template=my-workflow \
    -- space separated job args

REST API

See WorkflowTemplate.OrderedJob. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
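As an illustrative fragment (hypothetical jar URIs), the three gcloud examples above correspond to entries like these in the template's jobs list, with dependencies expressed via prerequisiteStepIds:

```json
"jobs": [
  {
    "stepId": "foo",
    "hadoopJob": { "mainJarFileUri": "gs://my-bucket/foo.jar" }
  },
  {
    "stepId": "bar",
    "prerequisiteStepIds": ["foo"],
    "hadoopJob": { "mainJarFileUri": "gs://my-bucket/bar.jar" }
  },
  {
    "stepId": "baz",
    "prerequisiteStepIds": ["foo", "bar"],
    "hadoopJob": { "mainJarFileUri": "gs://my-bucket/baz.jar" }
  }
]
```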

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Running a workflow

The instantiation of a workflow template runs the workflow defined by the template. Multiple instantiations of a template are supported—you can run a workflow multiple times.

gcloud command

gcloud dataproc workflow-templates instantiate template-id \
    --region=region

The command returns an operation ID, which you can use to track workflow status.

Example command and output:
gcloud dataproc workflow-templates instantiate my-template-id \
    --region=us-central1
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
Deleted cluster: my-template-id-rg544az7mpbfa.
WorkflowTemplate [my-template-id] DONE

REST API

See workflowTemplates.instantiate.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Workflow job failures

A failure in any job in a workflow causes the workflow to fail. Dataproc attempts to mitigate the effect of failures by causing all concurrently executing jobs to fail and preventing subsequent jobs from starting.
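That fail-fast behavior can be sketched with a small local simulation (illustrative only; this is not how Dataproc is implemented): once any step fails, the workflow is failed and steps without successfully completed prerequisites never start.

```python
# Illustrative simulation of workflow fail-fast semantics:
# a failed step fails the workflow, and any step whose prerequisites
# did not all complete successfully never starts.

def run_workflow(jobs, fails):
    """jobs: list of {'stepId', 'prerequisiteStepIds'} in dependency order.
    fails: set of step IDs that fail when run.
    Returns {stepId: 'COMPLETED' | 'FAILED' | 'CANCELLED'}."""
    status = {}
    workflow_failed = False
    for job in jobs:
        step = job["stepId"]
        deps_ok = all(status.get(d) == "COMPLETED"
                      for d in job.get("prerequisiteStepIds", []))
        if workflow_failed or not deps_ok:
            status[step] = "CANCELLED"   # never started
        elif step in fails:
            status[step] = "FAILED"
            workflow_failed = True       # fail the whole workflow
        else:
            status[step] = "COMPLETED"
    return status

jobs = [
    {"stepId": "foo"},
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "prerequisiteStepIds": ["bar"]},
]
print(run_workflow(jobs, fails={"bar"}))
# {'foo': 'COMPLETED', 'bar': 'FAILED', 'baz': 'CANCELLED'}
```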

Monitoring and listing a workflow

gcloud command

To monitor a workflow:

gcloud dataproc operations describe operation-id \
    --region=region

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

To list workflow status:

gcloud dataproc operations list \
    --region=region \
    --filter="labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

REST API

To monitor a workflow, use the Dataproc operations.get API.

To list running workflows, use the Dataproc operations.list API with a label filter.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Terminating a workflow

You can end a workflow using the Google Cloud CLI or by calling the Dataproc API.

gcloud command

gcloud dataproc operations cancel operation-id \
    --region=region
Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

REST API

See operations.cancel API.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Updating a workflow template

Updates do not affect running workflows. The new template version will only apply to new workflows.

gcloud command

Update a workflow template by issuing gcloud dataproc workflow-templates commands that reference an existing workflow template-id, for example:

  • gcloud dataproc workflow-templates add-job
  • gcloud dataproc workflow-templates remove-job
  • gcloud dataproc workflow-templates set-managed-cluster
  • gcloud dataproc workflow-templates set-cluster-selector

REST API

To make an update to a template with the REST API:

  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
  2. Make updates to the fetched template.
  3. Call workflowTemplates.update with the updated template.
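The read-modify-write cycle above relies on the template's version field: assuming optimistic concurrency (an update carrying a stale version is rejected, so you must fetch the template again), the mechanism can be sketched locally in Python. This is an illustrative simulation, not the Dataproc client library or API.

```python
# Illustrative sketch of the get -> modify -> update cycle, assuming the
# version field provides optimistic concurrency (stale updates rejected).
# This simulates the server locally; it is not the Dataproc API.

class TemplateServer:
    def __init__(self, template):
        self.template = dict(template)

    def get(self):
        # Like workflowTemplates.get: returns the template,
        # including the current server version.
        return dict(self.template)

    def update(self, updated):
        # Like workflowTemplates.update: rejects stale versions.
        if updated["version"] != self.template["version"]:
            raise RuntimeError("version mismatch: fetch the template again")
        self.template = dict(updated)
        self.template["version"] += 1  # server bumps the version
        return dict(self.template)

server = TemplateServer({"id": "my-template", "version": 3, "jobs": []})

tmpl = server.get()                   # 1. fetch template with current version
tmpl["jobs"] = [{"stepId": "foo"}]    # 2. make updates to the fetched template
result = server.update(tmpl)          # 3. submit the updated template
print(result["version"])  # 4
```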

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.

Deleting a workflow template

gcloud command

gcloud dataproc workflow-templates delete template-id \
    --region=region

Note: template-id is the ID you assigned when you created the template (see Creating a template).

REST API

See workflowTemplates.delete.

Console

You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.