Using workflows

You set up and run a workflow by:

  1. Creating a workflow template
  2. Configuring a managed (ephemeral) cluster or selecting an existing cluster
  3. Adding jobs
  4. Instantiating the template to run the workflow

Creating a template

gcloud command

Run the following command to create a Cloud Dataproc workflow template resource, replacing template-id with a name of your choice (such as "my-workflow"):

gcloud dataproc workflow-templates create template-id

REST API

See workflowTemplates.create. A completed WorkflowTemplate is submitted with a workflowTemplates.create request.
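
For orientation, the skeleton of a WorkflowTemplate request body looks like the following sketch; the field values are illustrative, and the placement and jobs fields are covered in the sections below:

{
  "id": "my-workflow",
  "placement": { ... },
  "jobs": [ ... ]
}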

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Configuring or selecting a cluster

Your workflow can run on a new "managed" cluster that Cloud Dataproc creates for the workflow, or on an existing cluster that you select.

  • Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.

  • Managed cluster: You must configure a managed cluster for your workflow. Cloud Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.
    You can configure a managed cluster for your workflow using the gcloud command-line tool or the Cloud Dataproc API.

    gcloud command

    Use flags inherited from gcloud dataproc clusters create to configure the managed cluster (number of workers, master/worker machine type, etc.). Cloud Dataproc will add a suffix to the cluster name to ensure uniqueness.

    gcloud dataproc workflow-templates set-managed-cluster template-id \
        --master-machine-type machine-type \
        --worker-machine-type machine-type \
        --num-workers number \
        --cluster-name cluster-name
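
    For example, the following illustrative invocation configures a two-worker managed cluster for the "my-workflow" template; the machine type and cluster name are placeholder values you would choose yourself:

    gcloud dataproc workflow-templates set-managed-cluster my-workflow \
        --master-machine-type n1-standard-4 \
        --worker-machine-type n1-standard-4 \
        --num-workers 2 \
        --cluster-name my-managed-cluster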
    

    REST API

    See WorkflowTemplatePlacement.ManagedCluster. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
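
    For reference, a minimal managedCluster placement might look like the following sketch (the cluster name, zone, and machine types are illustrative values):

    "placement": {
      "managedCluster": {
        "clusterName": "my-managed-cluster",
        "config": {
          "gceClusterConfig": { "zoneUri": "us-central1-a" },
          "masterConfig": { "machineTypeUri": "n1-standard-4" },
          "workerConfig": { "numInstances": 2, "machineTypeUri": "n1-standard-4" }
        }
      }
    }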

    Console

    You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Adding jobs to a template

All jobs run concurrently unless you specify one or more job dependencies. A job's dependencies are expressed as a list of other jobs that must finish successfully before the dependent job can start. You must provide a step-id for each job. The ID must be unique within the workflow, but does not need to be unique globally.

gcloud command

Use the job type and flags inherited from gcloud dataproc jobs submit to define the job to add to the template. You can optionally use the --start-after flag, passing the step-id of one or more other workflow jobs, to have the job start only after those jobs complete.

Examples:

Add Hadoop job "foo" to the "my-workflow" template.

gcloud dataproc workflow-templates add-job hadoop \
  --step-id foo \
  --workflow-template my-workflow \
  -- space separated job args

Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.

gcloud dataproc workflow-templates add-job job-type \
    --step-id bar \
    --start-after foo \
    --workflow-template my-workflow \
    -- space separated job args

Add another job "baz" to "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.

gcloud dataproc workflow-templates add-job job-type \
    --step-id baz \
    --start-after foo,bar \
    --workflow-template my-workflow \
    -- space separated job args
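
As a concrete illustration, the following adds a Hadoop TeraGen step to the "my-workflow" template using the MapReduce examples jar that ships on Dataproc cluster nodes; the jar path and job arguments are assumptions for illustration:

gcloud dataproc workflow-templates add-job hadoop \
    --step-id teragen \
    --workflow-template my-workflow \
    --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    -- teragen 1000 hdfs:///gen/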

REST API

See WorkflowTemplate.OrderedJob. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
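
In the REST representation, the --start-after flag corresponds to the prerequisiteStepIds field of each OrderedJob. A sketch of the jobs list for the examples above, with the job payloads elided:

"jobs": [
  { "stepId": "foo", "hadoopJob": { ... } },
  { "stepId": "bar", "hadoopJob": { ... }, "prerequisiteStepIds": ["foo"] },
  { "stepId": "baz", "hadoopJob": { ... }, "prerequisiteStepIds": ["foo", "bar"] }
]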

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Running a workflow

The instantiation of a workflow template runs the workflow defined by the template. Multiple instantiations of a template are supported—you can run a workflow multiple times.

gcloud command

gcloud dataproc workflow-templates instantiate template-id

The command returns an operation ID, which you can use to track workflow status.

Example:
gcloud dataproc workflow-templates instantiate my-template-id
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
Deleted cluster: my-template-id-rg544az7mpbfa.
WorkflowTemplate [my-template-id] DONE
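
By default, the command streams workflow progress to completion, as shown above. If you would rather have the command return immediately so you can poll the operation yourself, recent gcloud releases support an --async flag:

gcloud dataproc workflow-templates instantiate my-template-id --async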

REST API

See workflowTemplates.instantiate.
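
As a sketch, instantiation is a POST against the template resource; the project, region, and template IDs below are placeholders:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/workflowTemplates/my-workflow:instantiate"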

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Workflow job failures

A failure in any job in a workflow will cause the workflow to fail. Cloud Dataproc will seek to mitigate the effect of failures by failing all concurrently executing jobs and preventing subsequent jobs from starting.

Monitoring and listing a workflow

gcloud command

To monitor a workflow:

gcloud dataproc operations describe operation-id

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

To list workflow status:

gcloud dataproc operations list \
    --filter "labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"

REST API

To monitor a workflow, use the Cloud Dataproc operations.get API.

To list running workflows, use the Cloud Dataproc operations.list API with a label filter.
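
For example, a sketch of the list call with curl, where the project and region are placeholders (-G with --data-urlencode handles the spaces and equals signs in the filter):

curl -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode 'filter=labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING' \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/operations"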

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Terminating a workflow

You can terminate a workflow using the gcloud command-line tool or by calling the Cloud Dataproc API.

gcloud command

gcloud dataproc operations cancel operation-id

Note: The operation-id is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).

REST API

See operations.cancel API.

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Updating a workflow template

Updates do not affect running workflows. The new template version will only apply to new workflows.

gcloud command

Workflow templates can be updated by issuing gcloud dataproc workflow-templates commands (such as add-job, remove-job, or set-managed-cluster) that reference an existing workflow template-id.
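
For example, the following removes the step with ID "foo" from the "my-workflow" template; the template and step IDs are illustrative:

gcloud dataproc workflow-templates remove-job my-workflow --step-id foo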

REST API

To make an update to a template with the REST API:

  1. Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
  2. Make updates to the fetched template.
  3. Call workflowTemplates.update with the updated template.
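
The version field guards against concurrent modification: the update succeeds only if the version you send back matches the current server version. A sketch of the cycle with curl, where the project, region, and template IDs are placeholders:

# 1. Fetch the current template; the response includes its version field.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/workflowTemplates/my-workflow" \
    > template.json

# 2. Edit template.json, leaving the version field intact.

# 3. Submit the updated template.
curl -X PUT \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @template.json \
    "https://dataproc.googleapis.com/v1/projects/project-id/regions/region/workflowTemplates/my-workflow"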

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.

Deleting a workflow template

gcloud command

gcloud dataproc workflow-templates delete template-id

REST API

See workflowTemplates.delete.

Console

You can view existing workflow templates and instantiated workflows from the Cloud Dataproc Workflows page in GCP Console.
