Workflow Template Parameterization

If your workflow template will be run multiple times with different values, you can avoid having to edit the workflow each time by defining parameters in the template (parameterizing the template). Then, you can pass different values for the parameters each time you run the template.

Parameterizable Fields

The following Cloud Dataproc workflow template fields can be parameterized:

  • Labels
  • File URIs
  • Job properties
  • Job arguments
  • Script variables (in HiveJob, SparkSqlJob, and PigJob)
  • Main class (in HadoopJob and SparkJob)
  • Zone (in ClusterSelector)

Parameter attributes

Workflow template parameters are defined with the following required and optional attributes:

name (required)
A unix-style variable name. This name will be used as a key when providing a value for the parameter later.
fields (required)
A list of fields that this parameter will replace (see Parameterizable Fields for a list of fields that can be parameterized). Each field is specified as a "field path" (see Field Path Syntax for the syntax to use to specify a field path). Note that a field is allowed to appear in at most one parameter's list of field paths.
description (optional)
Brief description of the parameter.
validation (optional)
Rules used to validate a parameter value, which can consist of a list of allowed values or a list of regular expressions that a value must match.
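The validation attribute's semantics can be sketched locally. The following hypothetical helper (not part of any Dataproc client library) checks a candidate value against a validation rule shaped like the ones used in workflow templates, assuming a value-list rule must contain the value exactly and a regex rule must match the value in full:

```python
import re

def value_is_valid(value, validation):
    """Check a parameter value against a template-style validation rule.

    `validation` mirrors the shape used in workflow templates: either
    {"values": {"values": [...]}} for an allowed-value list or
    {"regex": {"regexes": [...]}} for regular expressions. This is a local
    sketch of the check, not the Dataproc implementation.
    """
    if "values" in validation:
        return value in validation["values"]["values"]
    if "regex" in validation:
        # A value passes if any one of the regexes matches it in full.
        return any(re.fullmatch(r, value) for r in validation["regex"]["regexes"])
    return True  # no validation rule: any value is accepted

print(value_is_valid("10000", {"values": {"values": ["10000", "10000000"]}}))  # True
print(value_is_valid("hdfs:///gen/", {"regex": {"regexes": ["hdfs:///.*"]}}))  # True
print(value_is_valid("gs:///gen/", {"regex": {"regexes": ["hdfs:///.*"]}}))    # False
```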

Field path syntax

A field path is similar in syntax to a FieldMask. For example, a field path that references the zone field of a workflow template's cluster selector would be specified as placement.clusterSelector.zone.

Also, field paths can reference fields using the following syntax:

  • Values in maps can be referenced by key, for example:

    • labels['key']
    • placement.clusterSelector.clusterLabels['key']
    • placement.managedCluster.labels['key']
    • jobs['step-id'].labels['key']
  • Jobs in the jobs list can be referenced by step-id.

    • jobs['step-id'].hadoopJob.mainJarFileUri
    • jobs['step-id'].hiveJob.queryFileUri
    • jobs['step-id'].pySparkJob.mainPythonFileUri
    • jobs['step-id'].hadoopJob.jarFileUris[0]
    • jobs['step-id'].hadoopJob.archiveUris[0]
    • jobs['step-id'].hadoopJob.fileUris[0]
    • jobs['step-id'].pySparkJob.pythonFileUris[0]
  • Items in repeated fields can be referenced by a zero-based index, for example:

    • jobs['step-id'].sparkJob.args[0]
  • Other examples:

    • jobs['step-id'].hadoopJob.args[0]
    • jobs['step-id'].hadoopJob.mainJarFileUri
    • jobs['step-id'].hadoopJob.properties['key']
    • jobs['step-id'].hiveJob.scriptVariables['key']
    • placement.clusterSelector.zone

Maps and repeated fields cannot be parameterized in their entirety: currently, only individual map values and individual items in repeated fields can be referenced. For example, the following field paths are invalid:

placement.clusterSelector.clusterLabels
jobs['step-id'].sparkJob.args
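The addressing rules above can be sketched with a small resolver. The following hypothetical helper (an illustration, not part of any Dataproc client library) follows a field path into a template represented as a plain Python dict, looking up jobs by stepId, map values by key, and repeated-field items by zero-based index:

```python
import re

def resolve_field_path(template, path):
    """Return the value that a field path like
    "jobs['teragen'].hadoopJob.args[1]" references in a template dict."""
    # Split the path into tokens: field names, ['key'] lookups, [0] indexes.
    tokens = re.findall(r"[A-Za-z_]\w*|\['[^']*'\]|\[\d+\]", path)
    node = template
    for tok in tokens:
        if tok.startswith("['"):           # map lookup (or job step-id)
            key = tok[2:-2]
            if isinstance(node, list):     # the jobs list: find by stepId
                node = next(j for j in node if j.get("stepId") == key)
            else:
                node = node[key]
        elif tok.startswith("["):          # repeated-field index
            node = node[int(tok[1:-1])]
        else:                              # ordinary field name
            node = node[tok]
    return node

template = {
    "jobs": [
        {"stepId": "teragen",
         "hadoopJob": {"args": ["teragen", "10000", "hdfs:///gen/"]}},
    ]
}
print(resolve_field_path(template, "jobs['teragen'].hadoopJob.args[1]"))  # 10000
```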

Parameterizing a workflow template

You parameterize a workflow template by defining template parameters with the Cloud Dataproc API or the gcloud command-line tool.

REST API

You can define one or more WorkflowTemplate.parameters in a workflowTemplates.create or a workflowTemplates.update API request.

The following is a sample workflowTemplates.create request to create a teragen-terasort workflow template with three defined parameters: NUM_ROWS, GEN_OUT, and SORT_OUT.

POST https://dataproc.googleapis.com/v1beta2/projects/my-project/locations/global/workflowTemplates
{
  "id": "my-template",
  "jobs": [
    {
      "stepId": "teragen",
      "hadoopJob": {
        "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
        "args": [
          "teragen",
          "10000",
          "hdfs:///gen/"
        ]
      }
    },
    {
      "stepId": "terasort",
      "prerequisiteStepIds": [
        "teragen"
      ],
      "hadoopJob": {
        "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
        "args": [
          "terasort",
          "hdfs:///gen/",
          "hdfs:///sort/"
        ]
      }
    }
  ],
  "parameters": [
    {
      "name": "NUM_ROWS",
      "fields": [
        "jobs['teragen'].hadoopJob.args[1]"
      ],
      "description": "The number of rows to generate",
      "validation": {
        "values": {
          "values": [
            "10000",
            "10000000",
            "10000000000"
          ]
        }
      }
    },
    {
      "name": "GEN_OUT",
      "fields": [
        "jobs['teragen'].hadoopJob.args[2]",
        "jobs['terasort'].hadoopJob.args[1]"
      ],
      "description": "Output directory for teragen",
      "validation": {
        "regex": {
          "regexes": [
            "hdfs:///.*"
          ]
        }
      }
    },
    {
      "name": "SORT_OUT",
      "fields": [
        "jobs['terasort'].hadoopJob.args[2]"
      ],
      "description": "Output directory for terasort",
      "validation": {
        "regex": {
          "regexes": [
            "hdfs:///.*"
          ]
        }
      }
    }
  ],
  "placement": {
    "managedCluster": {
      "clusterName": "my-managed-cluster",
      "config": {
        "gceClusterConfig": {
          "zoneUri": "us-central1-a"
        }
      }
    }
  }
}

gcloud Command

You can define workflow template parameters in a workflow template YAML file, either by creating the file directly or by exporting an existing template with the gcloud command-line tool and editing the exported file. You then import the file with the gcloud command-line tool to create or update the template.

Example:

The following is a sample teragen-terasort workflow template YAML file with three defined parameters: NUM_ROWS, GEN_OUT, and SORT_OUT.

jobs:
- hadoopJob:
    args:
    - teragen
    - '10000'
    - hdfs:///gen/
    mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  stepId: teragen
- hadoopJob:
    args:
    - terasort
    - hdfs:///gen/
    - hdfs:///sort/
    mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  prerequisiteStepIds:
  - teragen
  stepId: terasort
parameters:
- description: The number of rows to generate
  fields:
  - jobs['teragen'].hadoopJob.args[1]
  name: NUM_ROWS
  validation:
    values:
      values:
      - '10000'
      - '10000000'
      - '10000000000'
- description: Output directory for teragen
  fields:
  - jobs['teragen'].hadoopJob.args[2]
  - jobs['terasort'].hadoopJob.args[1]
  name: GEN_OUT
  validation:
    regex:
      regexes:
      - hdfs:///.*
- description: Output directory for terasort
  fields:
  - jobs['terasort'].hadoopJob.args[2]
  name: SORT_OUT
  validation:
    regex:
      regexes:
      - hdfs:///.*
placement:
  managedCluster:
    clusterName: my-managed-cluster
    config:
      gceClusterConfig:
        zoneUri: us-central1-a

After creating or editing a YAML file that defines a workflow template (with template parameters as shown above), use the following gcloud command to import the YAML file to create or update the parameterized template.

gcloud beta dataproc workflow-templates import TEMPLATE \
  --source template.yaml

For TEMPLATE, you can pass either the WorkflowTemplate id or the fully qualified template resource name ("projects/projectId/regions/region/workflowTemplates/template_id"). If a template resource with the same name exists, it will be overwritten (updated) and its version number will be incremented; if no such template exists, it will be created.

Passing parameters to a parameterized template

You can pass a different set of parameter values each time you run a parameterized workflow template. You must provide a value for EACH parameter defined in the template.
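This completeness rule can be sketched as a local check. The following hypothetical helper (an illustration, not part of any Dataproc client library) verifies that the name-to-value map supplied at instantiation time covers every parameter the template defines:

```python
def check_parameter_values(template_parameters, supplied):
    """Raise if any parameter defined in the template lacks a supplied value.

    `template_parameters` is the template's "parameters" list and `supplied`
    is the name-to-value map passed at instantiation time.
    """
    defined = {p["name"] for p in template_parameters}
    missing = defined - supplied.keys()
    if missing:
        raise ValueError("missing values for parameters: %s" % sorted(missing))

params = [{"name": "NUM_ROWS"}, {"name": "GEN_OUT"}, {"name": "SORT_OUT"}]
check_parameter_values(params, {"NUM_ROWS": "10000000000",
                                "GEN_OUT": "hdfs:///gen_20180601/",
                                "SORT_OUT": "hdfs:///sort_20180601/"})  # passes
```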

REST API

You can pass a parameters map of parameter names to values in a Cloud Dataproc workflowTemplates.instantiate API request. A value must be supplied for every parameter defined in the template, and each supplied value overrides the corresponding value specified in the template (see Parameterizing a workflow template).

Example:

POST https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/global/workflowTemplates/my-template:instantiate
{
  "parameters": {
    "NUM_ROWS": "10000000000",
    "GEN_OUT": "hdfs:///gen_20180601/",
    "SORT_OUT": "hdfs:///sort_20180601/"
  }
}

gcloud Command

You can pass a map of parameter names to values to the gcloud beta dataproc workflow-templates instantiate command with the --parameters flag. A value must be supplied for every parameter defined in the template, and each supplied value overrides the corresponding value specified in the template (see Parameterizing a workflow template).

Example:

gcloud beta dataproc workflow-templates instantiate my-template \
  --parameters NUM_ROWS=10000000000,GEN_OUT=hdfs:///gen_20180601/,SORT_OUT=hdfs:///sort_20180601/