If your workflow template will be run multiple times with different values, you can avoid having to edit the workflow each time by defining parameters in the template (parameterizing the template). Then, you can pass different values for the parameters each time you run the template.
Parameterizable Fields
The following Dataproc workflow template fields can be parameterized:
- Labels
- File URIs
- Managed cluster name. Dataproc will use the user-supplied name as the name prefix, and append random characters to create a unique cluster name. The cluster is deleted at the end of the workflow.
- Job properties
- Job arguments
- Script variables (in HiveJob, SparkSqlJob, and PigJob)
- Main class (in HadoopJob and SparkJob)
- Zone (in ClusterSelector)
- Number of instances (numInstances) in a master or worker instance group.
Parameter attributes
Workflow template parameters are defined with the following required and optional attributes:
- name (required)
- A unix-style variable name. This name will be used as a key when providing a value for the parameter later.
- fields (required)
- A list of fields that this parameter will replace (see Parameterizable Fields for a list of fields that can be parameterized). Each field is specified as a "field path" (see Field Path Syntax for the syntax to use to specify a field path). Note that a field is allowed to appear in at most one parameter's list of field paths.
- description (optional)
- Brief description of the parameter.
- validation (optional)
- Rules used to validate a parameter value, which can be one of:
- a list of allowed values
- a list of regular expressions that a value must match.
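Taken together, a single parameter definition in a template YAML file combines these attributes roughly as follows; the parameter name, field path, and allowed values are illustrative.

    parameters:
    - name: NUM_ROWS                          # unix-style variable name
      description: The number of rows to generate
      fields:                                 # field paths this parameter replaces
      - jobs['teragen'].hadoopJob.args[1]
      validation:
        values:                               # allowed-values validation
          values:
          - '1000'
          - '10000'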
Field path syntax
A field path is similar in syntax to a FieldMask. For example, a field path that references the zone field of a workflow template's cluster selector is specified as placement.clusterSelector.zone.
Field paths can reference fields using the following syntax:
Managed cluster name:
- placement.managedCluster.clusterName
Values in maps can be referenced by key, for example:
- labels['key']
- placement.clusterSelector.clusterLabels['key']
- placement.managedCluster.labels['key']
- jobs['step-id'].labels['key']
Jobs in the jobs list can be referenced by step-id, for example:
- jobs['step-id'].hadoopJob.mainJarFileUri
- jobs['step-id'].hiveJob.queryFileUri
- jobs['step-id'].pySparkJob.mainPythonFileUri
- jobs['step-id'].hadoopJob.jarFileUris[0]
- jobs['step-id'].hadoopJob.archiveUris[0]
- jobs['step-id'].hadoopJob.fileUris[0]
- jobs['step-id'].pySparkJob.pythonFileUris[0]
Items in repeated fields can be referenced by a zero-based index, for example:
- jobs['step-id'].sparkJob.args[0]
Other examples:
- jobs['step-id'].hadoopJob.args[0]
- jobs['step-id'].hadoopJob.mainJarFileUri
- jobs['step-id'].hadoopJob.properties['key']
- jobs['step-id'].hiveJob.scriptVariables['key']
- placement.clusterSelector.zone
Maps and repeated fields cannot be parameterized in their entirety: currently, only individual map values and individual items in repeated fields can be referenced. For example, the following field paths are invalid:
- placement.clusterSelector.clusterLabels
- jobs['step-id'].sparkJob.args
Parameterizing a workflow template
You parameterize a workflow template by defining template parameters with the Dataproc API or the Google Cloud CLI.
gcloud Command
You can define workflow template parameters in a workflow template YAML file, either by creating the file or by exporting an existing template with the Google Cloud CLI and editing it. You then import the file with the Google Cloud CLI to create or update the parameterized template. See Using YAML files for more information.
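As a sketch, the export-edit-import round trip for an existing template might look like the following; the template name, region, and file name are placeholders.

    # Export the existing template to a local YAML file.
    gcloud dataproc workflow-templates export my-template \
        --region=us-central1 \
        --destination=template.yaml

    # Edit template.yaml to add a parameters section, then re-import it
    # to update the template.
    gcloud dataproc workflow-templates import my-template \
        --region=us-central1 \
        --source=template.yaml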
Example 1: Parameterized managed-cluster template example
The following is a sample teragen-terasort managed-cluster workflow template YAML file with four defined parameters: CLUSTER, NUM_ROWS, GEN_OUT, and SORT_OUT. Two versions are listed: one BEFORE and the other AFTER parameterization.
Before
    placement:
      managedCluster:
        clusterName: my-managed-cluster
        config:
          gceClusterConfig:
            zoneUri: us-central1-a
    jobs:
    - hadoopJob:
        args:
        - teragen
        - '10000'
        - hdfs:///gen/
        mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      stepId: teragen
    - hadoopJob:
        args:
        - terasort
        - hdfs:///gen/
        - hdfs:///sort/
        mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      prerequisiteStepIds:
      - teragen
      stepId: terasort
After
    placement:
      managedCluster:
        clusterName: 'to-be-determined'
        config:
          gceClusterConfig:
            zoneUri: us-central1-a
    jobs:
    - hadoopJob:
        args:
        - teragen
        - '10000'
        - hdfs:///gen/
        mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      stepId: teragen
    - hadoopJob:
        args:
        - terasort
        - hdfs:///gen/
        - hdfs:///sort/
        mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
      prerequisiteStepIds:
      - teragen
      stepId: terasort
    parameters:
    - description: The managed cluster name prefix
      fields:
      - placement.managedCluster.clusterName
      name: CLUSTER
    - description: The number of rows to generate
      fields:
      - jobs['teragen'].hadoopJob.args[1]
      name: NUM_ROWS
      validation:
        values:
          values:
          - '1000'
          - '10000'
          - '100000'
    - description: Output directory for teragen
      fields:
      - jobs['teragen'].hadoopJob.args[2]
      - jobs['terasort'].hadoopJob.args[1]
      name: GEN_OUT
      validation:
        regex:
          regexes:
          - hdfs:///.*
    - description: Output directory for terasort
      fields:
      - jobs['terasort'].hadoopJob.args[2]
      name: SORT_OUT
      validation:
        regex:
          regexes:
          - hdfs:///.*
Example 2: Cluster selector workflow template example
The following is a parameterized sample teragen-terasort cluster-selector workflow template YAML file with three defined parameters: CLUSTER, NUM_ROWS, and OUTPUT_DIR.
    placement:
      clusterSelector:
        clusterLabels:
          goog-dataproc-cluster-name: 'to-be-determined'
    jobs:
    - stepId: teragen
      hadoopJob:
        args:
        - 'teragen'
        - 'tbd number of rows'
        - 'tbd output directory'
    parameters:
    - name: CLUSTER
      fields:
      - placement.clusterSelector.clusterLabels['goog-dataproc-cluster-name']
    - name: NUM_ROWS
      fields:
      - jobs['teragen'].hadoopJob.args[1]
    - name: OUTPUT_DIR
      fields:
      - jobs['teragen'].hadoopJob.args[2]
After creating or editing a YAML file that defines a workflow template with template parameters, use the following gcloud command to import the YAML file to create or update the parameterized template.
    gcloud dataproc workflow-templates import template-ID or template-name \
        --region=region \
        --source=template.yaml
You can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/projectId/regions/region/workflowTemplates/template_id") to the command. If a template resource with the same template name exists, it will be overwritten (updated) and its version number will be incremented. If a template with the same template name does not exist, it will be created.
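To confirm the result of an import and see the template's current version number, you can describe the template; the template name and region below are placeholders.

    gcloud dataproc workflow-templates describe my-template \
        --region=us-central1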
REST API
You can define one or more WorkflowTemplate.parameters in a workflowTemplates.create or a workflowTemplates.update API request.
The following is a sample workflowTemplates.create request to create a teragen-terasort workflow template with four defined parameters: CLUSTER, NUM_ROWS, GEN_OUT, and SORT_OUT.
    POST https://dataproc.googleapis.com/v1/projects/my-project/locations/us-central1/workflowTemplates

    {
      "id": "my-template",
      "jobs": [
        {
          "stepId": "teragen",
          "hadoopJob": {
            "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
            "args": ["teragen", "10000", "hdfs:///gen/"]
          }
        },
        {
          "stepId": "terasort",
          "prerequisiteStepIds": ["teragen"],
          "hadoopJob": {
            "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
            "args": ["terasort", "hdfs:///gen/", "hdfs:///sort/"]
          }
        }
      ],
      "parameters": [
        {
          "name": "CLUSTER",
          "fields": ["placement.managedCluster.clusterName"],
          "description": "The managed cluster name prefix"
        },
        {
          "name": "NUM_ROWS",
          "fields": ["jobs['teragen'].hadoopJob.args[1]"],
          "description": "The number of rows to generate",
          "validation": {
            "values": {"values": ["1000", "10000", "100000"]}
          }
        },
        {
          "name": "GEN_OUT",
          "fields": ["jobs['teragen'].hadoopJob.args[2]", "jobs['terasort'].hadoopJob.args[1]"],
          "description": "Output directory for teragen",
          "validation": {
            "regex": {"regexes": ["hdfs:///.*"]}
          }
        },
        {
          "name": "SORT_OUT",
          "fields": ["jobs['terasort'].hadoopJob.args[2]"],
          "description": "Output directory for terasort",
          "validation": {
            "regex": {"regexes": ["hdfs:///.*"]}
          }
        }
      ],
      "placement": {
        "managedCluster": {
          "clusterName": "to-be-determined",
          "config": {
            "gceClusterConfig": {"zoneUri": "us-central1-a"}
          }
        }
      }
    }
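One way to issue this request is with curl and a gcloud access token, as in the following sketch; it assumes the JSON request body above has been saved to a local file named request.json.

    # Sketch: create the parameterized template via the REST API.
    # Assumes the JSON request body is saved in request.json.
    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d @request.json \
        "https://dataproc.googleapis.com/v1/projects/my-project/locations/us-central1/workflowTemplates"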
Passing parameters to a parameterized template
You can pass a different set of parameter values each time you run a parameterized workflow template. You must provide a value for each parameter defined in the template.
gcloud Command
You can pass a map of parameter names to values to the gcloud dataproc workflow-templates instantiate command with the --parameters flag. A value must be supplied for every parameter defined in the template. The supplied values will override values specified in the template.
Parameterized managed-cluster template example
    gcloud dataproc workflow-templates instantiate my-template \
        --region=region \
        --parameters=CLUSTER=cluster,NUM_ROWS=1000,GEN_OUT=hdfs:///gen_20180601/,SORT_OUT=hdfs:///sort_20180601
Parameterized cluster-selector template example
    gcloud dataproc workflow-templates instantiate template-id \
        --region=region \
        --parameters CLUSTER=my-cluster,NUM_ROWS=10000,OUTPUT_DIR=hdfs://some/dir
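If a parameter value itself contains commas, the default comma delimiter of --parameters will split it. gcloud's escaping syntax (see gcloud topic escaping) lets you choose a different delimiter, as in the following sketch; the template, region, and parameter names are illustrative.

    # Sketch: '^;^' makes ';' the delimiter so a value may contain commas.
    gcloud dataproc workflow-templates instantiate my-template \
        --region=us-central1 \
        --parameters=^;^CLUSTER=my-cluster;ARGS=a,b,c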
REST API
You can pass a parameters map of parameter names to values to the Dataproc workflowTemplates.instantiate API. A value must be supplied for every parameter defined in the template. The supplied values will override values specified in the template.
Example:
    POST https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-template:instantiate

    {
      "parameters": {
        "CLUSTER": "clusterA",
        "NUM_ROWS": "1000",
        "GEN_OUT": "hdfs:///gen_20180601/",
        "SORT_OUT": "hdfs:///sort_20180601/"
      }
    }
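As with the create request, this call can be made with curl and a gcloud access token; the sketch below assumes the parameters body above is saved to a local file named params.json.

    # Sketch: instantiate the parameterized template via the REST API.
    # Assumes the JSON request body is saved in params.json.
    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d @params.json \
        "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/my-template:instantiate"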