As an alternative to running a workflow on a managed cluster, you can use a cluster selector to choose an existing cluster for your workflow. At the conclusion of the workflow, the selected cluster is not deleted.
Selectors specify one or more Dataproc user labels. Clusters in the same region as the workflow whose labels match all of the selector labels are eligible to run workflow jobs. If multiple clusters match the selector, Dataproc chooses the cluster with the most free YARN memory.
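The matching rule above can be sketched in a few lines of Python. This is only an illustrative model, not Dataproc's implementation: the `Cluster` class, the `free_yarn_memory_mb` field, the `select_cluster` function, and the cluster and region names are all hypothetical, and the real service evaluates YARN memory internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    # Hypothetical stand-in for a Dataproc cluster; not a real API type.
    name: str
    region: str
    labels: Dict[str, str] = field(default_factory=dict)
    free_yarn_memory_mb: int = 0  # assumed metric, for illustration only


def select_cluster(
    clusters: List[Cluster],
    workflow_region: str,
    selector_labels: Dict[str, str],
) -> Optional[Cluster]:
    """Return the eligible cluster with the most free YARN memory, or None."""
    if not selector_labels:
        raise ValueError("a selector needs at least one label")
    # A cluster is eligible only if it is in the workflow's region
    # and carries every selector label with the same value.
    eligible = [
        c
        for c in clusters
        if c.region == workflow_region
        and all(c.labels.get(k) == v for k, v in selector_labels.items())
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.free_yarn_memory_mb)


clusters = [
    Cluster("cluster-1", "us-central1", {"cluster-pool": "pool-1"}, 4096),
    Cluster("cluster-2", "us-central1", {"cluster-pool": "pool-1"}, 8192),
    Cluster("cluster-3", "us-central1", {"cluster-pool": "pool-2"}, 16384),
    Cluster("cluster-4", "europe-west1", {"cluster-pool": "pool-1"}, 32768),
]

chosen = select_cluster(clusters, "us-central1", {"cluster-pool": "pool-1"})
print(chosen.name)  # prints "cluster-2"
```

In this sketch, cluster-3 is skipped because its label value differs, and cluster-4 is skipped because it is in another region, even though both report more free memory than the chosen cluster.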
Adding a cluster selector to a template
You can add a cluster selector to a workflow template using the Google Cloud CLI or the Dataproc API.
gcloud command
gcloud dataproc workflow-templates set-cluster-selector template-id \
    --region=region \
    --cluster-labels=name=value[[,name=value]...]
REST API
See WorkflowTemplatePlacement.ClusterSelector. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
Console
You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.
Using Automatically Applied Labels
You can point a cluster selector to an existing cluster by using one of the following automatically applied cluster labels:
goog-dataproc-cluster-name
goog-dataproc-cluster-uuid
Example:
gcloud dataproc workflow-templates set-cluster-selector template-id \
    --region=region \
    --cluster-labels=goog-dataproc-cluster-name=my-cluster
Selecting from a Cluster Pool
You can define pools of clusters with labels and let Dataproc choose a cluster from a pool.
Example:
gcloud dataproc clusters create cluster-1 --labels cluster-pool=pool-1 \
    --region=region
gcloud dataproc clusters create cluster-2 --labels cluster-pool=pool-1 \
    --region=region
gcloud dataproc clusters create cluster-3 --labels cluster-pool=pool-2 \
    --region=region
After cluster creation ...
gcloud dataproc workflow-templates create my-template \
    --region=region
gcloud dataproc workflow-templates set-cluster-selector my-template \
    --region=region \
    --cluster-labels=cluster-pool=pool-1
The workflow runs on either cluster-1 or cluster-2, but not on cluster-3, because only cluster-1 and cluster-2 carry the cluster-pool=pool-1 label.
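If you use the REST API instead of gcloud, the same pool selector would sit in the template's placement field. The Python sketch below only builds and prints that JSON fragment. It assumes the `placement.clusterSelector.clusterLabels` field layout of the WorkflowTemplatePlacement resource; the helper name `placement_for_labels` is hypothetical, and a complete WorkflowTemplate also needs other fields (such as its jobs) that are omitted here.

```python
import json
from typing import Dict


def placement_for_labels(cluster_labels: Dict[str, str]) -> dict:
    """Build the placement fragment of a workflow template body (sketch)."""
    if not cluster_labels:
        raise ValueError("a cluster selector needs at least one label")
    # Assumed REST field names: placement -> clusterSelector -> clusterLabels.
    return {
        "placement": {
            "clusterSelector": {
                "clusterLabels": dict(cluster_labels),
            }
        }
    }


fragment = placement_for_labels({"cluster-pool": "pool-1"})
print(json.dumps(fragment, indent=2))
```

The fragment would be merged into the full template body before it is sent with a workflowTemplates.create or workflowTemplates.update request.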