This page provides information to help you monitor and debug Dataproc workflows.
Listing workflows
An instantiated WorkflowTemplate is called a "workflow" and is modeled as an "operation."
Run the following gcloud command to list your project's workflows:
gcloud dataproc operations list \
    --region=region \
    --filter="operationType = WORKFLOW"
...
OPERATION_NAME                                                  DONE
projects/.../operations/07282b66-2c60-4919-9154-13bd4f03a1f2    True
projects/.../operations/1c0b0fd5-839a-4ad4-9a57-bbb011956690    True
Here's a sample request to list all workflows started from a "terasort" template:
gcloud dataproc operations list \
    --region=region \
    --filter="labels.goog-dataproc-workflow-template-id=terasort"
...
OPERATION_NAME                                       DONE
projects/.../07282b66-2c60-4919-9154-13bd4f03a1f2    True
projects/.../1c0b0fd5-839a-4ad4-9a57-bbb011956690    True
Note that only the UUID portion of OPERATION_NAME is used in subsequent queries.
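Since the UUID is the last path segment of the operation resource name, extracting it is a one-line split. The helper below is illustrative only, not part of the gcloud CLI:

```python
# Extract the trailing UUID from a fully qualified Dataproc operation
# name, which has the form projects/PROJECT/regions/REGION/operations/UUID.

def operation_uuid(operation_name: str) -> str:
    """Return the UUID segment of an operation resource name."""
    return operation_name.rsplit("/", 1)[-1]

print(operation_uuid(
    "projects/my-project/regions/us-central1/operations/"
    "07282b66-2c60-4919-9154-13bd4f03a1f2"))
# → 07282b66-2c60-4919-9154-13bd4f03a1f2
```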
Using WorkflowMetadata
The operation.metadata field provides information to help you diagnose workflow failures. Here's a sample WorkflowMetadata, including a graph of nodes (jobs), embedded in an operation:
{
  "name": "projects/my-project/regions/us-central1/operations/671c1d5d-9d24-4cc7-8c93-846e0f886d6e",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.WorkflowMetadata",
    "template": "terasort",
    "version": 1,
    "createCluster": {
      "operationId": "projects/my-project/regions/us-central1/operations/8d472870-4a8b-4609-9f7d-48daccb028fc",
      "Done": true
    },
    "graph": {
      "nodes": [
        {
          "stepId": "teragen",
          "jobId": "teragen-vtrprwcgepyny",
          "state": "COMPLETED"
        },
        {
          "stepId": "terasort",
          "prerequisiteStepIds": [
            "teragen"
          ],
          "jobId": "terasort-vtrprwcgepyny",
          "state": "FAILED",
          "error": "Job failed"
        },
        {
          "stepId": "teravalidate",
          "prerequisiteStepIds": [
            "terasort"
          ],
          "state": "FAILED",
          "error": "Skipped, node terasort failed"
        }
      ]
    },
    "deleteCluster": {
      "operationId": "projects/my-project/regions/us-central1/operations/9654c67b-2642-4142-a145-ca908e7c81c9",
      "Done": true
    },
    "state": "DONE",
    "clusterName": "terasort-cluster-vtrprwcgepyny"
  },
  "done": true,
  "error": {
    "message": "Workflow failed"
  }
}
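If you capture this metadata as JSON (for example, from a describe request with a JSON output format), a few lines of scripting can surface the failed steps. The sketch below uses only the field names shown in the sample above and is illustrative rather than an exhaustive parser:

```python
import json

# Sketch: given WorkflowMetadata JSON like the sample above, list the
# steps that failed and why. Field names follow the sample output.

sample = """
{
  "template": "terasort",
  "graph": {
    "nodes": [
      {"stepId": "teragen", "state": "COMPLETED"},
      {"stepId": "terasort", "state": "FAILED", "error": "Job failed"},
      {"stepId": "teravalidate", "state": "FAILED",
       "error": "Skipped, node terasort failed"}
    ]
  }
}
"""

def failed_steps(metadata: dict) -> list:
    """Return (stepId, error) pairs for nodes whose state is FAILED."""
    return [(n["stepId"], n.get("error", ""))
            for n in metadata.get("graph", {}).get("nodes", [])
            if n.get("state") == "FAILED"]

for step, error in failed_steps(json.loads(sample)):
    print(f"{step}: {error}")
# → terasort: Job failed
# → teravalidate: Skipped, node terasort failed
```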
Retrieving a template
As shown in the previous example, the metadata contains the template id and version.
"template": "terasort",
"version": 1,
If a template has not been deleted, you can retrieve an instantiated template version with a describe request that specifies the version:
gcloud dataproc workflow-templates describe terasort \
    --region=region \
    --version=1
List cluster operations started by a workflow:
gcloud dataproc operations list \
    --region=region \
    --filter="labels.goog-dataproc-workflow-instance-id = 07282b66-2c60-4919-9154-13bd4f03a1f2"
...
OPERATION_NAME                                       DONE
projects/.../cf9ce692-d6c9-4671-a909-09fd62041024    True
projects/.../1bbaefd9-7fd9-460f-9adf-ee9bc448b8b7    True
Here's a sample request to list jobs submitted from a template:
gcloud dataproc jobs list \
    --region=region \
    --filter="labels.goog-dataproc-workflow-template-id = terasort"
...
JOB_ID                     TYPE     STATUS
terasort3-ci2ejdq2ta7l6    pyspark  DONE
terasort2-ci2ejdq2ta7l6    pyspark  DONE
terasort1-ci2ejdq2ta7l6    pyspark  DONE
terasort3-3xwsy6ubbs4ak    pyspark  DONE
terasort2-3xwsy6ubbs4ak    pyspark  DONE
terasort1-3xwsy6ubbs4ak    pyspark  DONE
terasort3-ajov4nptsllti    pyspark  DONE
terasort2-ajov4nptsllti    pyspark  DONE
terasort1-ajov4nptsllti    pyspark  DONE
terasort1-b262xachbv6c4    pyspark  DONE
terasort1-cryvid3kreea2    pyspark  DONE
terasort1-ndprn46nesbv4    pyspark  DONE
terasort1-yznruxam4ppxi    pyspark  DONE
terasort1-ttjbhpqmw55t6    pyspark  DONE
terasort1-d7svwzloplbni    pyspark  DONE
List jobs submitted from a workflow instance:
gcloud dataproc jobs list \
    --region=region \
    --filter="labels.goog-dataproc-workflow-instance-id = 07282b66-2c60-4919-9154-13bd4f03a1f2"
...
JOB_ID                     TYPE     STATUS
terasort3-ci2ejdq2ta7l6    pyspark  DONE
terasort2-ci2ejdq2ta7l6    pyspark  DONE
terasort1-ci2ejdq2ta7l6    pyspark  DONE
Workflow timeouts
You can set a workflow timeout that cancels the workflow if its jobs do not finish within the timeout period. The timeout applies to the DAG (Directed Acyclic Graph) of jobs in the workflow, not to the entire workflow operation: the timeout period starts when the first workflow job starts, and does not include the time taken to create a managed cluster. If any job is still running at the end of the timeout period, all running jobs are stopped, the workflow ends, and, if the workflow was running on a managed cluster, the cluster is deleted.
Benefit: Use this feature to avoid having to manually end a workflow that does not complete due to stuck jobs.
Setting a workflow template timeout
You can set a workflow template timeout period when you create a workflow template. You can also add a workflow timeout to an existing workflow template by updating the workflow template.
gcloud
To set a workflow timeout on a new template, use the --dag-timeout flag with the gcloud dataproc workflow-templates create command. You can use "s", "m", "h", and "d" suffixes to set second, minute, hour, and day duration values, respectively. The timeout duration must be from 10 minutes ("10m") to 24 hours ("24h" or "1d").
gcloud dataproc workflow-templates create template-id (such as "my-workflow") \
    --region=region \
    --dag-timeout=duration (from "10m" to "24h" or "1d") \
    ... other args ...
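The duration rules above (suffix syntax plus the 10-minute-to-24-hour window) can be checked before calling gcloud. This sketch applies only the rules stated on this page; the helper names are invented for illustration:

```python
# Sketch: validate a --dag-timeout value against the documented rules:
# "s"/"m"/"h"/"d" suffixes, and an allowed range of 10 minutes to 24 hours.

_SUFFIX_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(value: str) -> int:
    """Convert a suffixed duration like "10m" or "1d" to seconds."""
    number, suffix = value[:-1], value[-1]
    if suffix not in _SUFFIX_SECONDS:
        raise ValueError(f"unknown duration suffix: {value!r}")
    return int(number) * _SUFFIX_SECONDS[suffix]

def valid_dag_timeout(value: str) -> bool:
    """True if value falls within the allowed 10m-24h window."""
    seconds = parse_duration(value)
    return 600 <= seconds <= 86400

print(valid_dag_timeout("10m"))   # True
print(valid_dag_timeout("1d"))    # True: same as "24h"
print(valid_dag_timeout("5m"))    # False: below the 10-minute minimum
```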
API
To set a workflow timeout, complete the WorkflowTemplate dagTimeout field as part of a workflowTemplates.create request.
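As a rough sketch, the request body would carry dagTimeout alongside the template's other fields. The seconds-string encoding ("1800s") follows the standard JSON mapping for Duration values in REST APIs; treat the surrounding fields as placeholders:

```json
{
  "id": "my-workflow",
  "placement": { ... },
  "jobs": [ ... ],
  "dagTimeout": "1800s"
}
```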
Console
Currently, the Google Cloud console does not support creating a workflow template.
Updating a workflow template timeout
You can update an existing workflow template to change, add, or remove a workflow timeout.
gcloud
Adding or changing a workflow timeout
To add or change a workflow timeout on an existing template, use the --dag-timeout flag with the gcloud dataproc workflow-templates set-dag-timeout command. You can use "s", "m", "h", and "d" suffixes to set second, minute, hour, and day duration values, respectively. The timeout duration must be from 10 minutes ("10m") to 24 hours ("24h" or "1d").
gcloud dataproc workflow-templates set-dag-timeout template-id (such as "my-workflow") \
    --region=region \
    --dag-timeout=duration (from "10m" to "24h" or "1d")
Removing a workflow timeout
To remove a workflow timeout from an existing template, use the gcloud dataproc workflow-templates remove-dag-timeout command.
gcloud dataproc workflow-templates remove-dag-timeout template-id (such as "my-workflow") \
    --region=region
API
Adding or changing a workflow timeout
To add or change a workflow timeout on an existing template, update the workflow template by filling in the template's dagTimeout field with the new or changed timeout value.
Removing a workflow timeout
To remove a workflow timeout from an existing template, update the workflow template by removing the template's dagTimeout field.
Console
Currently, the Google Cloud console does not support updating a workflow template.