Configure Flex Templates

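This page documents various Dataflow Flex Template configuration options, including permissions, Dockerfile environment variables, Python package dependencies, base images, pipeline options, metadata, and staging and temp locations.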

To configure a sample Flex Template, see the Flex Template tutorial.

Understand Flex Template permissions

When you're working with Flex Templates, you need three sets of permissions:

  • Permissions to create resources
  • Permissions to build a Flex Template
  • Permissions to run a Flex Template

Permissions to create resources

To develop and run a Flex Template pipeline, you need to create various resources (for example, a staging bucket). For one-time resource creation tasks, you can use the basic Owner role.

Permissions to build a Flex Template

As the developer of a Flex Template, you need to build the template to make it available to users. Building involves uploading a template spec to a Cloud Storage bucket and provisioning a Docker image with the code and dependencies needed to run the pipeline. To build a Flex Template, you need read and write access to Cloud Storage and Artifact Registry Writer access to your Artifact Registry repository. You can grant these permissions by assigning the following roles:

  • Storage Admin (roles/storage.admin)
  • Cloud Build Editor (roles/cloudbuild.builds.editor)
  • Artifact Registry Writer (roles/artifactregistry.writer)

Permissions to run a Flex Template

When you run a Flex Template, Dataflow creates a job for you. To create the job, the Dataflow service account needs the following role:

  • Dataflow Service Agent (roles/dataflow.serviceAgent)

When you first use Dataflow, the service assigns this role for you, so you don't need to grant it yourself.

By default, the Compute Engine service account is used for launcher VMs and worker VMs. The service account needs the following roles and abilities:

  • Storage Object Admin (roles/storage.objectAdmin)
  • Viewer (roles/viewer)
  • Dataflow Worker (roles/dataflow.worker)
  • Read and write access to the staging bucket
  • Read access to the Flex Template image

To grant read and write access to the staging bucket, you can use the role Storage Object Admin (roles/storage.objectAdmin). For more information, see IAM roles for Cloud Storage.

To grant read access to the Flex Template image, you can use the role Storage Object Viewer (roles/storage.objectViewer). For more information, see Configuring access control.
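As a rough illustration of the worker roles listed above, the following Python sketch assembles `gcloud projects add-iam-policy-binding` commands for the default Compute Engine service account. The project ID, the project number, and the helper name `worker_role_commands` are placeholders chosen for this example. Review the roles against your own security policy before you run any of the generated commands.

```python
"""Sketch: build gcloud commands that grant the worker roles listed above."""

# Roles taken from the list in this section.
WORKER_ROLES = [
    "roles/storage.objectAdmin",
    "roles/viewer",
    "roles/dataflow.worker",
]


def worker_role_commands(project_id: str, project_number: str) -> list[list[str]]:
    """Return one gcloud argv list per role for the default Compute Engine service account."""
    member = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ]
        for role in WORKER_ROLES
    ]


if __name__ == "__main__":
    # Hypothetical project values, for illustration only.
    for argv in worker_role_commands("example-project", "123456789012"):
        print(" ".join(argv))
```

The sketch only prints the commands. Running them is left to you, so that each binding can be reviewed first.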

Set required Dockerfile environment variables

If you want to create your own Dockerfile for a Flex Template job, specify the following environment variables:

Java

Specify FLEX_TEMPLATE_JAVA_MAIN_CLASS and FLEX_TEMPLATE_JAVA_CLASSPATH in your Dockerfile.

ENV | Description | Required
FLEX_TEMPLATE_JAVA_MAIN_CLASS | Specifies which Java class to run in order to launch the Flex Template. | YES
FLEX_TEMPLATE_JAVA_CLASSPATH | Specifies the location of class files. | YES
FLEX_TEMPLATE_JAVA_OPTIONS | Specifies the Java options to be passed while launching the Flex Template. | NO

Python

Specify FLEX_TEMPLATE_PYTHON_PY_FILE in your Dockerfile.

To manage pipeline dependencies, set variables in your Dockerfile, such as the following:

  • FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
  • FLEX_TEMPLATE_PYTHON_PY_OPTIONS
  • FLEX_TEMPLATE_PYTHON_SETUP_FILE
  • FLEX_TEMPLATE_PYTHON_EXTRA_PACKAGES

For example, the following environment variables are set in the Streaming in Python Flex Template tutorial in GitHub:

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/streaming_beam.py"

ENV | Description | Required
FLEX_TEMPLATE_PYTHON_PY_FILE | Specifies which Python file to run to launch the Flex Template. | YES
FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE | Specifies the requirements file with pipeline dependencies. For more information, see PyPI dependencies in the Apache Beam documentation. | NO
FLEX_TEMPLATE_PYTHON_SETUP_FILE | Specifies the path to the pipeline package `setup.py` file. For more information, see Multiple File Dependencies in the Apache Beam documentation. | NO
FLEX_TEMPLATE_PYTHON_EXTRA_PACKAGES | Specifies the packages that are not available publicly. For information about how to use extra packages, see Local or non-PyPI Dependencies. | NO
FLEX_TEMPLATE_PYTHON_PY_OPTIONS | Specifies the Python options to be passed while launching the Flex Template. | NO
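To catch a missing required variable before you build the image, a small check like the following Python sketch might help. It scans Dockerfile text for `ENV` instructions and reports which of the variables marked as required in the tables above are never set. The function name `missing_required_env` is hypothetical, and the parser is deliberately simple: it only looks at the first variable on each `ENV` line and doesn't follow `ARG` substitutions or multi-stage builds.

```python
"""Sketch: check that a Dockerfile sets the required Flex Template variables."""
import re

# Required variables per language, taken from the tables above.
REQUIRED_ENV = {
    "java": {"FLEX_TEMPLATE_JAVA_MAIN_CLASS", "FLEX_TEMPLATE_JAVA_CLASSPATH"},
    "python": {"FLEX_TEMPLATE_PYTHON_PY_FILE"},
}

# Matches NAME in "ENV NAME=value" and in the older "ENV NAME value" form.
_ENV_LINE = re.compile(r"^\s*ENV\s+([A-Za-z_][A-Za-z0-9_]*)[=\s]", re.IGNORECASE)


def missing_required_env(dockerfile_text: str, language: str) -> set[str]:
    """Return the required variables that the Dockerfile text never sets."""
    declared = set()
    for line in dockerfile_text.splitlines():
        match = _ENV_LINE.match(line)
        if match:
            declared.add(match.group(1))
    return REQUIRED_ENV[language] - declared
```

An empty result means that every required variable for the chosen language appears in at least one `ENV` instruction.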

Package dependencies for Python

When a Dataflow Python pipeline uses additional dependencies, you might need to configure the Flex Template to install additional dependencies on Dataflow worker VMs.

When you run a Python Dataflow job that uses Flex Templates in an environment that restricts access to the internet, you must prepackage the dependencies when you create the template.

Use one of the following options to prepackage Python dependencies.

For instructions on managing pipeline dependencies in Java and Go pipelines, see Manage pipeline dependencies in Dataflow.

Use a requirements file and prepackage the dependencies with the template

If you are using your own Dockerfile to define the Flex Template image, follow these steps:

  1. Create a requirements.txt file that lists your pipeline dependencies.

    COPY requirements.txt /template/
    ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
    
  2. Install the dependencies in the Flex Template image.

    RUN pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
    
  3. Download the dependencies into the local requirements cache, which is staged to the Dataflow workers when the template launches.

    RUN pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
    

When you use this approach, dependencies from the requirements.txt file are installed onto Dataflow workers at runtime. An insight in the Google Cloud console recommendations tab might note this behavior. To avoid installing dependencies at runtime, use a custom container image.

The following is a code sample that uses a requirements file in the Flex Template.

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Configure the Template to launch the pipeline with a --requirements_file option.
# See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#pypi-dependencies
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"

COPY . /template

RUN apt-get update \
    # Install any apt packages if required by your template pipeline.
    && apt-get install -y libffi-dev git \
    && rm -rf /var/lib/apt/lists/* \
    # Upgrade pip and install the requirements.
    && pip install --no-cache-dir --upgrade pip \
    # Install dependencies from requirements file in the launch environment.
    && pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE \
    # When FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE  option is used,
    # then during Template launch Beam downloads dependencies
    # into a local requirements cache folder and stages the cache to workers.
    # To speed up Flex Template launch, pre-download the requirements cache
    # when creating the Template.
    && pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Set this if using Beam 2.37.0 or earlier SDK to speed up job submission.
ENV PIP_NO_DEPS=True

ENTRYPOINT ["/opt/google/dataflow/python_template_launcher"]

Structure the pipeline as a package and use local packages

When you use multiple Python local files or modules, structure your pipeline as a package. The file structure might look like the following example:

main.py
pyproject.toml
setup.py
src/
  my_package/
    my_custom_dofns_and_transforms.py
    my_pipeline_launcher.py
    other_utils_and_helpers.py

  1. Place the top-level entry point, for example, the main.py file, in the root directory. Place the rest of the files in a separate folder in the src directory, for example, my_package.

  2. Add the package configuration files to the root directory with the package details and requirements.

    pyproject.toml

    [project]
    name = "my_package"
    version = "package_version"
    dependencies = [
      # Add list of packages (and versions) that my_package depends on.
      # Example:
      "apache-beam[gcp]==2.54.0",
    ]
    

    setup.py

      """An optional setuptools configuration stub for the pipeline package.
    
      Use pyproject.toml to define the package. Add this file only if you must
      use the --setup_file pipeline option or the
      FLEX_TEMPLATE_PYTHON_SETUP_FILE configuration option.
      """
    
      import setuptools
      setuptools.setup()
    

    For more information about how to configure your local package, see Packaging Python Projects.

  3. When you import local modules or files for your pipeline, use the my_package package name as the import path.

    from my_package import word_count_transform
    
  4. Install your pipeline package in the Flex Template image. Your Flex Template Dockerfile might include content similar to the following example:

    Dockerfile

    ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
    ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
    
    # Copy pipeline, packages and requirements.
    WORKDIR ${WORKDIR}
    COPY main.py .
    COPY pyproject.toml .
    COPY setup.py .
    COPY src src
    
    # Install local package.
    RUN pip install -e .
    

When you use this approach, the dependencies that your package declares are installed onto Dataflow workers at runtime. An insight in the Google Cloud console recommendations tab might note this behavior. To avoid installing dependencies at runtime, use a custom container image.

For an example that follows the recommended approach, see the Flex Template for a pipeline with dependencies and a custom container image tutorial in GitHub.

Use a custom container that preinstalls all dependencies

To avoid dependency installation at runtime, use custom containers. This option is preferred for pipelines that run in environments without internet access.

Follow these steps to use a custom container:

  1. Build a custom container image that preinstalls necessary dependencies.

  2. Preinstall the same dependencies in the Flex Template Dockerfile.

    To prevent dependency installation at runtime, don't use the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE or FLEX_TEMPLATE_PYTHON_SETUP_FILE options in your Flex Template configuration.

    A modified Flex Template Dockerfile might look like the following example:

    FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
    ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/main.py"
    COPY . /template
    # If you use a requirements file, pre-install the requirements.txt.
    RUN pip install --no-cache-dir -r /template/requirements.txt
    # If you supply the pipeline in a package, pre-install the local package and its dependencies.
    RUN pip install -e /template
    

    When you use this approach, you do the following:

    • build the Flex Template image
    • build the custom SDK container image
    • install the same dependencies in both images

    Alternatively, to reduce the number of images to maintain, use your custom container image as a base image for the Flex Template.

  3. If you use the Apache Beam SDK version 2.49.0 or earlier, add the --sdk_location=container pipeline option in your pipeline launcher. This option tells your pipeline to use the SDK from your custom container instead of downloading the SDK.

    options = PipelineOptions(beam_args, save_main_session=True, streaming=True, sdk_location="container")
    
  4. Set the sdk_container_image parameter in the flex-template run command. For example:

    gcloud dataflow flex-template run $JOB_NAME \
       --region=$REGION \
       --template-file-gcs-location=$TEMPLATE_PATH \
       --parameters=sdk_container_image=$CUSTOM_CONTAINER_IMAGE \
       --additional-experiments=use_runner_v2
    

    For more information, see Use custom containers in Dataflow.
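Step 3 above applies only to Apache Beam SDK version 2.49.0 or earlier. The following Python sketch shows one way a launcher could make that decision from a version string. The helper names are hypothetical, and the parser assumes plain numeric release strings such as "2.49.0". Pre-release suffixes would need extra handling.

```python
"""Sketch: decide whether the --sdk_location=container option is needed."""


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a release string such as "2.49.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def needs_sdk_location_container(beam_version: str) -> bool:
    """Return True for Apache Beam SDK 2.49.0 or earlier, per step 3 above."""
    return parse_version(beam_version) <= (2, 49, 0)


def launcher_options(beam_version: str) -> dict[str, object]:
    """Build keyword arguments that a launcher could pass to PipelineOptions."""
    options: dict[str, object] = {"save_main_session": True, "streaming": True}
    if needs_sdk_location_container(beam_version):
        # Use the SDK from the custom container instead of downloading it.
        options["sdk_location"] = "container"
    return options
```

In a real launcher, the resulting dictionary could be unpacked into the PipelineOptions call shown in step 3, for example `PipelineOptions(beam_args, **launcher_options(version))`.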

Choose a base image

You can use a Google-provided base image to package your template container images by using Docker. Choose the most recent tag from the Flex Templates base images. We recommend that you use a specific image tag instead of latest.

Specify the base image in the following format:

gcr.io/dataflow-templates-base/IMAGE_NAME:TAG

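Replace the following:

  • IMAGE_NAME: a Google-provided base image. For example: python311-template-launcher-base.
  • TAG: a version tag for the base image found in the Flex Templates base images reference.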

Use custom container images

If your pipeline uses a custom container image, we recommend using the custom image as a base image for your Flex Template Docker image. To do so, copy the Flex Template launcher binary from the Google-provided template base image onto your custom image.

An example Dockerfile for an image that can be used both as a custom SDK container image and as a Flex Template might look like the following:

FROM gcr.io/dataflow-templates-base/IMAGE_NAME:TAG as template_launcher
FROM apache/beam_python3.10_sdk:2.60.0

# RUN <...Make image customizations here...>
# See: https://cloud.google.com/dataflow/docs/guides/build-container-image

# Configure the Flex Template here.
COPY --from=template_launcher /opt/google/dataflow/python_template_launcher /opt/google/dataflow/python_template_launcher
COPY my_pipeline.py /template/
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/my_pipeline.py"

Replace the following:

  • IMAGE_NAME: a Google-provided base image. For example: python311-template-launcher-base.
  • TAG: a version tag for the base image found in the Flex Templates base images reference. For better stability and troubleshooting, avoid using latest. Instead, pin to a specific version tag.
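If you generate Dockerfiles or build arguments from a script, the pinning advice above can be enforced in code. The following Python sketch builds the image reference in the documented format and refuses the latest tag. The function name `base_image_reference` is an assumption made for this example, and the sketch doesn't check that the tag actually exists in the registry.

```python
"""Sketch: build a pinned base image reference for a Flex Template Dockerfile."""

BASE_IMAGE_REPOSITORY = "gcr.io/dataflow-templates-base"


def base_image_reference(image_name: str, tag: str) -> str:
    """Return gcr.io/dataflow-templates-base/IMAGE_NAME:TAG, rejecting unpinned tags."""
    if not image_name or not tag:
        raise ValueError("image_name and tag must both be non-empty")
    if tag == "latest":
        raise ValueError("pin a specific version tag instead of 'latest'")
    return f"{BASE_IMAGE_REPOSITORY}/{image_name}:{tag}"
```

The returned string can be used in the first FROM line of the Dockerfile shown above.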

For an example that follows this approach, see the Flex Template for a pipeline with dependencies and a custom container image tutorial.

Use an image from a private registry

You can build a Flex Template image stored in a private Docker registry, if the private registry uses HTTPS and has a valid certificate.

To use an image from a private registry, specify the path to the image and a username and password for the registry. The username and password must be stored in Secret Manager. You can provide the secret in one of the following formats:

  • projects/{project}/secrets/{secret}/versions/{secret_version}
  • projects/{project}/secrets/{secret}

The second format doesn't specify a version, so if you use it, Dataflow uses the latest version of the secret.
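The two secret formats can be checked before you pass them to the build command. The following Python sketch accepts either format and reports "latest" when no version is given, mirroring the behavior described above. The function name `parse_secret_id` is hypothetical, and the pattern only checks the overall shape of the resource name, not whether the secret exists.

```python
"""Sketch: validate the two Secret Manager formats described above."""
import re

# projects/{project}/secrets/{secret} with an optional /versions/{secret_version}.
_SECRET_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)/secrets/(?P<secret>[^/]+)"
    r"(?:/versions/(?P<version>[^/]+))?$"
)


def parse_secret_id(secret_id: str) -> dict[str, str]:
    """Split a secret ID into its parts, defaulting the version to "latest"."""
    match = _SECRET_NAME.match(secret_id)
    if match is None:
        raise ValueError(f"unrecognized secret ID format: {secret_id!r}")
    parts = match.groupdict()
    # When no version is given, Dataflow uses the latest version.
    parts["version"] = parts["version"] or "latest"
    return parts
```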

If the registry uses a self-signed certificate, you also need to specify the path to the self-signed certificate in Cloud Storage.

The following table describes the gcloud CLI options that you can use to configure a private registry.

Parameter | Description
image | The address of the registry. For example: gcp.repository.example.com:9082/registry/example/image:latest.
image-repository-username-secret-id | The Secret Manager secret ID for the username to authenticate to the private registry. For example: projects/example-project/secrets/username-secret.
image-repository-password-secret-id | The Secret Manager secret ID for the password to authenticate to the private registry. For example: projects/example-project/secrets/password-secret/versions/latest.
image-repository-cert-path | The full Cloud Storage URL for a self-signed certificate for the private registry. This value is only required if the registry uses a self-signed certificate. For example: gs://example-bucket/self-signed.crt.

Here's an example Google Cloud CLI command that builds a Flex Template using an image in a private registry with a self-signed certificate.

gcloud dataflow flex-template build gs://example-bucket/custom-pipeline-private-repo.json \
    --sdk-language=JAVA \
    --image="gcp.repository.example.com:9082/registry/example/image:latest" \
    --image-repository-username-secret-id="projects/example-project/secrets/username-secret" \
    --image-repository-password-secret-id="projects/example-project/secrets/password-secret/versions/latest" \
    --image-repository-cert-path="gs://example-bucket/self-signed.crt" \
    --metadata-file=metadata.json

To build your own Flex Template, replace the example values. You might also need to specify different or additional options.
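If you launch the build from a script, assembling the command as an argument list avoids shell quoting and line-continuation problems. The following Python sketch uses only the options described in the table above. The function name `build_command` is an assumption for this example, and the sketch prints the command instead of running it.

```python
"""Sketch: assemble the flex-template build command as an argument list."""
import shlex


def build_command(template_path, image, username_secret, password_secret,
                  metadata_file, sdk_language="JAVA", cert_path=None):
    """Return the gcloud argv for building a template from a private-registry image."""
    argv = [
        "gcloud", "dataflow", "flex-template", "build", template_path,
        f"--sdk-language={sdk_language}",
        f"--image={image}",
        f"--image-repository-username-secret-id={username_secret}",
        f"--image-repository-password-secret-id={password_secret}",
        f"--metadata-file={metadata_file}",
    ]
    # The certificate path is only needed for self-signed certificates.
    if cert_path:
        argv.append(f"--image-repository-cert-path={cert_path}")
    return argv


if __name__ == "__main__":
    # Example values copied from this section.
    command = build_command(
        "gs://example-bucket/custom-pipeline-private-repo.json",
        "gcp.repository.example.com:9082/registry/example/image:latest",
        "projects/example-project/secrets/username-secret",
        "projects/example-project/secrets/password-secret/versions/latest",
        "metadata.json",
        cert_path="gs://example-bucket/self-signed.crt",
    )
    print(shlex.join(command))
```

To run the command, the list could be passed to `subprocess.run(command, check=True)`.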

Specify pipeline options

For information about pipeline options that are directly supported by Flex Templates, see Pipeline options.

You can also use any Apache Beam pipeline options indirectly. If you're using a metadata.json file for your Flex Template job, include these pipeline options in the file. This metadata file must follow the format in TemplateMetadata.

Otherwise, when you launch the Flex Template job, pass these pipeline options by using the parameters field.

API

Include pipeline options by using the parameters field.

gcloud

Include pipeline options by using the parameters flag.

When passing parameters of List or Map type, you might need to define the parameters in a YAML file and pass the file by using the --flags-file flag. For an example of this approach, view the "Create a file with parameters..." step in this solution.
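List and Map values often contain commas, which clash with the comma-separated syntax of the parameters flag. The following Python sketch renders a flags file that keeps such values intact. It rests on several assumptions: that the flags file is a mapping from flag names to values, that gcloud accepts the file as JSON as well as YAML, and that your pipeline parses list options from comma-separated text and map options from a JSON string. Check these against your pipeline and the gcloud flags-file reference. The function name `render_flags_file` is hypothetical.

```python
"""Sketch: render a gcloud flags file that carries List or Map parameters."""
import json


def render_flags_file(parameters: dict[str, object]) -> str:
    """Return JSON text of the form {"--parameters": {...}} for --flags-file."""
    rendered = {}
    for name, value in parameters.items():
        if isinstance(value, (list, tuple)):
            # Assumption: the pipeline reads list options as comma-separated text.
            rendered[name] = ",".join(str(item) for item in value)
        elif isinstance(value, dict):
            # Assumption: the pipeline reads map options as a JSON string.
            rendered[name] = json.dumps(value, sort_keys=True)
        else:
            rendered[name] = str(value)
    return json.dumps({"--parameters": rendered}, indent=2)
```

Write the returned text to a file, and then pass that file's path to the run command with `--flags-file`.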

When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options can't be changed. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch. For more information, see Failed to read the job file.

During pipeline initialization, don't change the following pipeline options:

Java

  • runner
  • project
  • jobName
  • templateLocation
  • region

Python

  • runner
  • project
  • job_name
  • template_location
  • region

Go

  • runner
  • project
  • job_name
  • template_location
  • region
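A launcher that accepts option overrides from several sources can guard against this class of mistake with a small check. The following Python sketch rejects any override that names one of the options listed above. The function name `check_overrides` and the idea of collecting overrides in a dictionary are assumptions made for this example.

```python
"""Sketch: reject overrides of options that the template launcher controls."""

# Option names from the lists above (Java uses camelCase names).
PROTECTED_OPTIONS = {
    "java": {"runner", "project", "jobName", "templateLocation", "region"},
    "python": {"runner", "project", "job_name", "template_location", "region"},
    "go": {"runner", "project", "job_name", "template_location", "region"},
}


def check_overrides(language: str, overrides: dict[str, object]) -> dict[str, object]:
    """Return overrides unchanged, or raise if any protected option is present."""
    blocked = sorted(PROTECTED_OPTIONS[language].intersection(overrides))
    if blocked:
        raise ValueError(
            f"do not change these options during initialization: {blocked}"
        )
    return overrides
```

Calling the check before the pipeline options are constructed turns a confusing launch failure into an immediate, readable error.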

Block project SSH keys from VMs that use metadata-based SSH keys

You can prevent VMs from accepting SSH keys that are stored in project metadata by blocking project SSH keys from VMs. Use the additional-experiments flag with the block_project_ssh_keys service option:

--additional-experiments=block_project_ssh_keys

For more information, see Dataflow service options.

Metadata

You can extend your template with additional metadata so that custom parameters are validated when the template is run. If you want to create metadata for your template, follow these steps:

  1. Create a metadata.json file using the parameters in Metadata parameters.

    To view an example, see Example metadata file.

  2. Store the metadata file in Cloud Storage in the same folder as the template.

Metadata parameters

Parameter key | Required | Description of the value
name | Yes | The name of your template.
description | No | A short paragraph of text describing the template.
streaming | No | If true, this template supports streaming. The default value is false.
supportsAtLeastOnce | No | If true, this template supports at-least-once processing. The default value is false. Set this parameter to true if the template is designed to work with at-least-once streaming mode.
supportsExactlyOnce | No | If true, this template supports exactly-once processing. The default value is true.
defaultStreamingMode | No | The default streaming mode, for templates that support both at-least-once mode and exactly-once mode. Use one of the following values: "AT_LEAST_ONCE", "EXACTLY_ONCE". If unspecified, the default streaming mode is exactly-once.
parameters | No | An array of additional parameters that the template uses. An empty array is used by default.

Each entry in the parameters array uses the following keys:

Parameter key | Required | Description of the value
name | Yes | The name of the parameter that is used in your template.
label | Yes | A human readable string that is used in the Google Cloud console to label the parameter.
helpText | Yes | A short paragraph of text that describes the parameter.
isOptional | No | false if the parameter is required and true if the parameter is optional. The default value is false, so a parameter that omits this key is required.
regexes | No | An array of POSIX-egrep regular expressions in string form that is used to validate the value of the parameter. For example, ["^[a-zA-Z][a-zA-Z0-9]+"] is a single regular expression that validates that the value starts with a letter followed by one or more letters or digits. An empty array is used by default.

Example metadata file

Java

{
  "name": "Streaming Beam SQL",
  "description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam SQL to transform the message data, and writes the results to a BigQuery table",
  "parameters": [
    {
      "name": "inputSubscription",
      "label": "Pub/Sub input subscription.",
      "helpText": "Pub/Sub subscription to read from.",
      "regexes": [
        "[a-zA-Z][-_.~+%a-zA-Z0-9]{2,}"
      ]
    },
    {
      "name": "outputTable",
      "label": "BigQuery output table",
      "helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.",
      "isOptional": true,
      "regexes": [
        "[^:]+:[^.]+[.].+"
      ]
    }
  ]
}

Python

{
  "name": "Streaming beam Python flex template",
  "description": "Streaming beam example for python flex template.",
  "parameters": [
    {
      "name": "input_subscription",
      "label": "Input PubSub subscription.",
      "helpText": "Name of the input PubSub subscription to consume from.",
      "regexes": [
        "projects/[^/]+/subscriptions/[a-zA-Z][-_.~+%a-zA-Z0-9]{2,}"
      ]
    },
    {
      "name": "output_table",
      "label": "BigQuery output table name.",
      "helpText": "Name of the BigQuery output table.",
      "isOptional": true,
      "regexes": [
        "([^:]+:)?[^.]+[.].+"
      ]
    }
  ]
}

You can download metadata files for the Google-provided templates from the Dataflow template directory.
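Before you upload a metadata file, you can check it locally. The following Python sketch verifies the required keys from the Metadata parameters section and tests sample parameter values against each parameter's regexes. It makes two assumptions that might not match the service exactly: Python's `re` module stands in for POSIX-egrep, and a pattern must match the whole value. Both function names are hypothetical.

```python
"""Sketch: check a metadata.json document and validate parameter values."""
import re

# Keys that every entry in the parameters array must provide.
REQUIRED_PARAMETER_KEYS = ("name", "label", "helpText")


def metadata_problems(metadata: dict) -> list[str]:
    """Return a list of missing required keys in the metadata document."""
    problems = []
    if not metadata.get("name"):
        problems.append("missing required key: name")
    for index, parameter in enumerate(metadata.get("parameters", [])):
        for key in REQUIRED_PARAMETER_KEYS:
            if not parameter.get(key):
                problems.append(f"parameters[{index}] is missing required key: {key}")
    return problems


def value_problems(metadata: dict, values: dict[str, str]) -> list[str]:
    """Check values against isOptional and regexes; run metadata_problems first."""
    problems = []
    for parameter in metadata.get("parameters", []):
        name = parameter["name"]
        if name not in values:
            # isOptional defaults to false, so a missing value is a problem.
            if not parameter.get("isOptional", False):
                problems.append(f"missing required parameter: {name}")
            continue
        for pattern in parameter.get("regexes", []):
            # Assumption: whole-value matching with Python's regex dialect.
            if re.fullmatch(pattern, values[name]) is None:
                problems.append(f"{name} does not match {pattern}")
    return problems
```

To check a file on disk, load it with `json.load` and pass the resulting dictionary to both functions.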

Understand staging location and temp location

The Google Cloud CLI provides --staging-location and --temp-location options when you run a Flex Template. Similarly, the Dataflow REST API provides stagingLocation and tempLocation fields for FlexTemplateRuntimeEnvironment.

For Flex Templates, the staging location is the Cloud Storage URL that files are written to during the staging step of launching a template. Dataflow reads these staged files to create the template graph. The temp location is the Cloud Storage URL that temporary files are written to during the execution step.

Update a Flex Template job

The following example request shows you how to update a template streaming job by using the projects.locations.flexTemplates.launch method. If you want to use the gcloud CLI, see Update an existing pipeline.

If you want to update a classic template, use projects.locations.templates.launch instead.

  1. Follow the steps to create a streaming job from a Flex Template. Send the following HTTP POST request with the modified values:

    POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION/flexTemplates:launch
    {
        "launchParameter": {
          "update": true,
          "jobName": "JOB_NAME",
          "parameters": {
            "input_subscription": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME",
            "output_table": "PROJECT_ID:DATASET.TABLE_NAME"
          },
          "containerSpecGcsPath": "STORAGE_PATH"
        }
    }
    
    • Replace PROJECT_ID with your project ID.
    • Replace REGION with the Dataflow region of the job that you're updating.
    • Replace JOB_NAME with the exact name of the job that you want to update.
    • Set parameters to your list of key-value pairs. The parameters listed are specific to this template example. If you're using a custom template, modify the parameters as needed. If you're using the example template, replace the following variables.
      • Replace SUBSCRIPTION_NAME with your Pub/Sub subscription name.
      • Replace DATASET with your BigQuery dataset name.
      • Replace TABLE_NAME with your BigQuery table name.
    • Replace STORAGE_PATH with the Cloud Storage location of the template file. The location should start with gs://.
  2. Use the environment parameter to change environment settings. For more information, see FlexTemplateRuntimeEnvironment.

  3. Optional: To send your request using curl (Linux, macOS, or Cloud Shell), save the request to a JSON file, and then run the following command:

    curl -X POST -d "@FILE_PATH" -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)"  https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION/flexTemplates:launch
    

    Replace FILE_PATH with the path to the JSON file that contains the request body.

  4. Use the Dataflow monitoring interface to verify that a new job with the same name was created. This job has the status Updated.
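As an alternative to curl, the update request from the steps above can be built with the Python standard library. The following sketch constructs the request but doesn't send it. The function name `update_request` is an assumption for this example, and the access token is expected to come from a command such as `gcloud auth print-access-token`.

```python
"""Sketch: build the flexTemplates:launch request for updating a job."""
import json
import urllib.request


def update_request(project_id, region, job_name, container_spec_gcs_path,
                   parameters, access_token):
    """Return a POST request whose body sets "update": true for job_name."""
    url = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project_id}/locations/{region}/flexTemplates:launch"
    )
    body = {
        "launchParameter": {
            "update": True,
            "jobName": job_name,
            "parameters": parameters,
            "containerSpecGcsPath": container_spec_gcs_path,
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen`. Building the body with `json.dumps` also rules out the missing or trailing commas that are easy to introduce when editing JSON by hand.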

Limitations

The following limitations apply to Flex Templates jobs:

  • You must use a Google-provided base image to package your containers using Docker. For a list of applicable images, see Flex Template base images.
  • The program that constructs the pipeline must exit after run is called in order for the pipeline to start.
  • waitUntilFinish (Java) and wait_until_finish (Python) are not supported.
