You set up and run a workflow by:
- Creating a workflow template
- Configuring a managed (ephemeral) cluster or selecting an existing cluster
- Adding jobs
- Instantiating the template to run the workflow
Creating a template
Run the following command to create a Dataproc workflow template resource.
    gcloud dataproc workflow-templates create TEMPLATE_ID \
        --region=REGION
Notes:
- REGION: Specify the region where your template will run.
- TEMPLATE_ID: Provide an ID for your template, such as "workflow-template-1".
- CMEK encryption. You can add the --kms-key flag to use CMEK encryption on workflow template job arguments.
Submit a WorkflowTemplate as part of a workflowTemplates.create request. You can add the WorkflowTemplate.EncryptionConfig.kmsKey field to use CMEK encryption on workflow template job arguments.
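As a rough illustration of the REST path, the following Python sketch builds a JSON body for a workflowTemplates.create request with the encryptionConfig.kmsKey field set. The helper name build_template_body and the key resource name are placeholders invented for this example, and the body omits the placement and jobs fields that a real request would also carry, so treat it as an outline rather than a complete request.

```python
import json


def build_template_body(template_id, kms_key=None):
    """Return a partial dict shaped like a WorkflowTemplate resource."""
    body = {"id": template_id}
    if kms_key:
        # EncryptionConfig.kmsKey turns on CMEK encryption of job arguments.
        body["encryptionConfig"] = {"kmsKey": kms_key}
    return body


# Placeholder key name (assumption); substitute your own Cloud KMS key.
KMS_KEY = "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY"

body = build_template_body("workflow-template-1", KMS_KEY)
print(json.dumps(body, indent=2))
```

Leaving kms_key unset produces a body without the encryptionConfig field.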
You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.
Configuring or selecting a cluster
Dataproc can run your workflow on a new "managed" cluster that it creates for the workflow, or on an existing cluster.
Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.
Managed cluster: You must configure a managed cluster for your workflow. Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.
You can configure a managed cluster for your workflow using the gcloud command-line tool or the Dataproc API.
Use flags inherited from gcloud dataproc clusters create to configure the managed cluster, such as the number of workers and the master and worker machine types. Dataproc will add a suffix to the cluster name to ensure uniqueness. You can use the --service-account flag to specify a VM service account for the managed cluster.
    gcloud dataproc workflow-templates set-managed-cluster TEMPLATE_ID \
        --region=REGION \
        --master-machine-type=MACHINE_TYPE \
        --worker-machine-type=MACHINE_TYPE \
        --num-workers=NUMBER \
        --cluster-name=CLUSTER_NAME \
        --service-account=SERVICE_ACCOUNT
See WorkflowTemplatePlacement.ManagedCluster, which you can provide as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
You can use the GceClusterConfig.serviceAccount field to specify a VM service account for the managed cluster.
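To make the API shape more concrete, here is a minimal Python sketch of a WorkflowTemplatePlacement value that uses a managed cluster. The function name managed_cluster_placement, the machine types, and the service account email are illustrative assumptions; only the field names follow the ManagedCluster and ClusterConfig resources.

```python
import json


def managed_cluster_placement(cluster_name, master_type, worker_type,
                              num_workers, service_account=None):
    """Build a dict shaped like WorkflowTemplatePlacement.managedCluster."""
    config = {
        "masterConfig": {"machineTypeUri": master_type},
        "workerConfig": {
            "machineTypeUri": worker_type,
            "numInstances": num_workers,
        },
    }
    if service_account:
        # GceClusterConfig.serviceAccount sets the VM service account.
        config["gceClusterConfig"] = {"serviceAccount": service_account}
    return {"managedCluster": {"clusterName": cluster_name, "config": config}}


# All values below are placeholders chosen for illustration.
placement = managed_cluster_placement(
    cluster_name="my-managed-cluster",
    master_type="n1-standard-4",
    worker_type="n1-standard-4",
    num_workers=2,
    service_account="my-sa@PROJECT_ID.iam.gserviceaccount.com",
)
print(json.dumps(placement, indent=2))
```

The resulting dict would sit under the placement field of a WorkflowTemplate body.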
Adding jobs to a template
All jobs run concurrently unless you specify one or more job dependencies. A
job's dependencies are expressed as a list of other jobs that must finish
successfully before the dependent job can start. You must provide a step-id
for each job. The ID must be unique within the workflow, but does not need to be
unique globally.
Use the job type and flags inherited from gcloud dataproc jobs submit
to define the job to add to the template. You can optionally use the
--start-after flag, set to the step ID of one or more other workflow jobs,
to have the job start after those jobs complete.
Examples:
Add Hadoop job "foo" to the "my-workflow" template.
    gcloud dataproc workflow-templates add-job hadoop \
        --region=REGION \
        --step-id=foo \
        --workflow-template=my-workflow \
        -- space separated job args
Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.
    gcloud dataproc workflow-templates add-job JOB_TYPE \
        --region=REGION \
        --step-id=bar \
        --start-after=foo \
        --workflow-template=my-workflow \
        -- space separated job args
Add another job "baz" to the "my-workflow" template, to be run after the successful completion of both the "foo" and "bar" jobs.
    gcloud dataproc workflow-templates add-job JOB_TYPE \
        --region=REGION \
        --step-id=baz \
        --start-after=foo,bar \
        --workflow-template=my-workflow \
        -- space separated job args
See WorkflowTemplate.OrderedJob. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
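The dependency rules for the "foo", "bar", and "baz" jobs can be modeled in a few lines of Python. This sketch writes the jobs as OrderedJob-style dicts, with stepId and prerequisiteStepIds fields and the job payloads omitted, then uses the standard library graphlib module to group them into waves that could run concurrently. It illustrates the ordering only; it is not Dataproc's actual scheduler, and execution_waves is a name made up for this example.

```python
from graphlib import TopologicalSorter

# OrderedJob-style entries; payloads such as hadoopJob are left out.
jobs = [
    {"stepId": "foo"},
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "prerequisiteStepIds": ["foo", "bar"]},
]


def execution_waves(ordered_jobs):
    """Group step IDs into waves; jobs in one wave have no unmet dependencies."""
    graph = {
        job["stepId"]: set(job.get("prerequisiteStepIds", []))
        for job in ordered_jobs
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(jobs))  # [['foo'], ['bar'], ['baz']]
```

Jobs without prerequisites land in the same wave, which matches the rule that all jobs run concurrently unless dependencies are specified.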
Running a workflow
Instantiating a workflow template runs the workflow defined by the template. A template can be instantiated multiple times, so you can run the same workflow repeatedly.
    gcloud dataproc workflow-templates instantiate TEMPLATE_ID \
        --region=REGION
The command returns an operation ID, which you can use to track workflow status.
Example command and output:
    gcloud beta dataproc workflow-templates instantiate my-template-id \
        --region=us-central1
    ...
    WorkflowTemplate [my-template-id] RUNNING
    ...
    Created cluster: my-template-id-rg544az7mpbfa.
    Job ID teragen-rg544az7mpbfa RUNNING
    Job ID teragen-rg544az7mpbfa COMPLETED
    Job ID terasort-rg544az7mpbfa RUNNING
    Job ID terasort-rg544az7mpbfa COMPLETED
    Job ID teravalidate-rg544az7mpbfa RUNNING
    Job ID teravalidate-rg544az7mpbfa COMPLETED
    ...
    Deleted cluster: my-template-id-rg544az7mpbfa.
    WorkflowTemplate [my-template-id] DONE
Workflow job failures
A failure in any job in a workflow will cause the workflow to fail. Dataproc will seek to mitigate the effect of failures by causing all concurrently executing jobs to fail and preventing subsequent jobs from starting.
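The failure behavior can be pictured with a small simulation. This Python sketch is a simplified model, not Dataproc code: it assumes jobs are grouped into waves that run concurrently and finish together, and the state labels COMPLETED, FAILED, and NOT_STARTED are chosen for illustration and may not match the states the service reports.

```python
def simulate(waves, failing_step=None):
    """Return (workflow_state, per-job states) for a simplified workflow run."""
    states = {}
    failed = False
    for wave in waves:
        if failed:
            # Subsequent jobs are prevented from starting.
            for step in wave:
                states[step] = "NOT_STARTED"
        elif failing_step in wave:
            # The failing job takes its concurrently running peers down with it.
            failed = True
            for step in wave:
                states[step] = "FAILED"
        else:
            for step in wave:
                states[step] = "COMPLETED"
    return ("FAILED" if failed else "DONE"), states


# "foo" runs first, then "bar", then "baz"; here "bar" is assumed to fail.
state, jobs = simulate([["foo"], ["bar"], ["baz"]], failing_step="bar")
print(state, jobs)
```

In this run "foo" completes, "bar" fails, and "baz" never starts, so the workflow as a whole fails.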
Monitoring and listing a workflow
To monitor a workflow:
    gcloud dataproc operations describe OPERATION_ID \
        --region=REGION
Note: The operation-id is returned when you instantiate the workflow
with gcloud dataproc workflow-templates instantiate (see Running a workflow).
To list workflow status:
    gcloud dataproc operations list \
        --region=REGION \
        --filter="labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"
To monitor a workflow, use the Dataproc operations.get API.
To list running workflows, use the Dataproc operations.list API with a label filter.
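If you script these checks, the label filter can be assembled programmatically. The Python sketch below only builds and prints the argument list for the gcloud dataproc operations list command shown above; it does not call gcloud. The function name list_workflows_cmd and the idea of parameterizing the state are assumptions made for this example.

```python
import shlex


def list_workflows_cmd(region, state="RUNNING"):
    """Build the argv for listing workflow operations in a given state."""
    label_filter = (
        "labels.goog-dataproc-operation-type=WORKFLOW "
        f"AND status.state={state}"
    )
    return [
        "gcloud", "dataproc", "operations", "list",
        f"--region={region}",
        f"--filter={label_filter}",
    ]


cmd = list_workflows_cmd("us-central1")
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

On a machine with the Google Cloud CLI installed and authenticated, the list could be passed to subprocess.run(cmd, check=True) to execute it.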
Terminating a workflow
You can end a workflow using the Google Cloud CLI or by calling the Dataproc API.
    gcloud dataproc operations cancel OPERATION_ID \
        --region=REGION
Note: The operation-id is returned when you instantiate the workflow
with gcloud dataproc workflow-templates instantiate (see Running a workflow).
See the operations.cancel API.
Updating a workflow template
Updates do not affect running workflows. The new template version will only apply to new workflows.
Workflow templates can be updated by issuing new gcloud dataproc workflow-templates
commands, such as add-job or set-managed-cluster, that reference an existing workflow template ID.
To make an update to a template with the REST API:
- Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
- Make updates to the fetched template.
- Call workflowTemplates.update with the updated template.
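The get, modify, update sequence above is a read-modify-write cycle keyed on the version field. The Python sketch below imitates it against a hypothetical in-memory stand-in, FakeTemplateService, which is not part of any Dataproc library. It assumes the server rejects an update whose version does not match the stored one and increments the version on success; the real service's error type and version numbering may differ.

```python
import copy


class FakeTemplateService:
    """Hypothetical in-memory stand-in for the workflowTemplates API."""

    def __init__(self, template):
        self._stored = dict(copy.deepcopy(template), version=1)

    def get(self):
        # Returns the template with the current server version filled in.
        return copy.deepcopy(self._stored)

    def update(self, template):
        # Assumed check: the submitted version must match the stored one.
        if template.get("version") != self._stored["version"]:
            raise ValueError("version mismatch: fetch the template again")
        new_version = self._stored["version"] + 1
        self._stored = dict(copy.deepcopy(template), version=new_version)
        return self.get()


service = FakeTemplateService({"id": "my-workflow", "jobs": [{"stepId": "foo"}]})

template = service.get()                      # 1. fetch, version included
template["jobs"].append(                      # 2. modify the fetched copy
    {"stepId": "bar", "prerequisiteStepIds": ["foo"]}
)
updated = service.update(template)            # 3. submit the update
print(updated["version"], [job["stepId"] for job in updated["jobs"]])
```

Submitting the same fetched copy a second time would fail in this model, because its version is now stale; fetching again first avoids overwriting someone else's change.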
Deleting a workflow template
    gcloud dataproc workflow-templates delete TEMPLATE_ID \
        --region=REGION