This page documents various Dataflow Flex Template configuration options, including:
- Permissions
- Dockerfile environment variables
- Package dependencies for Python
- Docker images
- Pipeline options
- Staging and temp locations
To configure a sample Flex Template, see the Flex Template tutorial.
Understand Flex Template permissions
When you're working with Flex Templates, you need three sets of permissions:
- Permissions to create resources
- Permissions to build a Flex Template
- Permissions to run a Flex Template
Permissions to create resources
To develop and run a Flex Template pipeline, you need to create various resources (for example, a staging bucket). For one-time resource creation tasks, you can use the basic Owner role.
Permissions to build a Flex Template
As the developer of a Flex Template, you need to build the template to make it available to users. Building involves uploading a template spec to a Cloud Storage bucket and provisioning a Docker image with the code and dependencies needed to run the pipeline. To build a Flex Template, you need read and write access to Cloud Storage and Artifact Registry Writer access to your Artifact Registry repository. You can grant these permissions by assigning the following roles:
- Storage Admin (roles/storage.admin)
- Cloud Build Editor (roles/cloudbuild.builds.editor)
- Artifact Registry Writer (roles/artifactregistry.writer)
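For example, one way to grant these roles to a template developer is with the gcloud CLI. The following commands are a sketch; PROJECT_ID, REPOSITORY, LOCATION, and developer@example.com are placeholders for your own project, Artifact Registry repository, repository location, and developer account:

# Sketch: grant the roles needed to build Flex Templates to a developer account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:developer@example.com" \
    --role="roles/storage.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:developer@example.com" \
    --role="roles/cloudbuild.builds.editor"

# Grant write access on the Artifact Registry repository that stores the template image.
gcloud artifacts repositories add-iam-policy-binding REPOSITORY \
    --location=LOCATION \
    --member="user:developer@example.com" \
    --role="roles/artifactregistry.writer"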
Permissions to run a Flex Template
When you run a Flex Template, Dataflow creates a job for you. To create the job, the Dataflow service account needs the following role:
dataflow.serviceAgent
When you first use Dataflow, the service assigns this role for you, so you don't need to grant it yourself.
By default, the Compute Engine service account is used for launcher VMs and worker VMs. The service account needs the following roles and abilities:
- Storage Object Admin (roles/storage.objectAdmin)
- Viewer (roles/viewer)
- Dataflow Worker (roles/dataflow.worker)
- Read and write access to the staging bucket
- Read access to the Flex Template image
To grant read and write access to the staging bucket, you can use the Storage Object Admin role (roles/storage.objectAdmin). For more information, see IAM roles for Cloud Storage.
To grant read access to the Flex Template image, you can use the Storage Object Viewer role (roles/storage.objectViewer). For more information, see Configuring access control.
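For example, one way to grant the worker roles to the default Compute Engine service account is with the gcloud CLI. The following is a sketch; PROJECT_ID and PROJECT_NUMBER are placeholders for your own project, and you would repeat the command for each role listed above:

# Sketch: grant one of the required roles to the default Compute Engine service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataflow.worker"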
Set required Dockerfile environment variables
If you want to create your own Dockerfile for a Flex Template job, specify the following environment variables:
Java
Specify FLEX_TEMPLATE_JAVA_MAIN_CLASS and FLEX_TEMPLATE_JAVA_CLASSPATH in your Dockerfile.
ENV | Description | Required |
---|---|---|
FLEX_TEMPLATE_JAVA_MAIN_CLASS | Specifies which Java class to run in order to launch the Flex Template. | YES |
FLEX_TEMPLATE_JAVA_CLASSPATH | Specifies the location of class files. | YES |
FLEX_TEMPLATE_JAVA_OPTIONS | Specifies the Java options to be passed while launching the Flex Template. | NO |
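For example, a Java Flex Template Dockerfile might set these variables as follows. This is a sketch; the base image tag, main class, JAR path, and Java options are placeholders for your own pipeline:

FROM gcr.io/dataflow-templates-base/java11-template-launcher-base:TAG

# Placeholders: replace with your pipeline's main class and the path to your bundled JAR.
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.MyPipeline"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/template/pipeline-bundled.jar"

# Optional Java options passed when the template launches.
ENV FLEX_TEMPLATE_JAVA_OPTIONS="-Xmx1g"

COPY target/pipeline-bundled.jar /template/pipeline-bundled.jar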
Python
Specify FLEX_TEMPLATE_PYTHON_PY_FILE in your Dockerfile.
To manage pipeline dependencies, set variables in your Dockerfile, such as the following:
- FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
- FLEX_TEMPLATE_PYTHON_PY_OPTIONS
- FLEX_TEMPLATE_PYTHON_SETUP_FILE
- FLEX_TEMPLATE_PYTHON_EXTRA_PACKAGES
For example, the following environment variables are set in the Streaming in Python Flex Template tutorial in GitHub:
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/streaming_beam.py"
ENV | Description | Required |
---|---|---|
FLEX_TEMPLATE_PYTHON_PY_FILE | Specifies which Python file to run to launch the Flex Template. | YES |
FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE | Specifies the requirements file with pipeline dependencies. For more information, see PyPI dependencies in the Apache Beam documentation. | NO |
FLEX_TEMPLATE_PYTHON_SETUP_FILE | Specifies the path to the pipeline package `setup.py` file. For more information, see Multiple File Dependencies in the Apache Beam documentation. | NO |
FLEX_TEMPLATE_PYTHON_EXTRA_PACKAGES | Specifies packages that are not available publicly. For information about how to use extra packages, see Local or non-PyPI Dependencies in the Apache Beam documentation. | NO |
FLEX_TEMPLATE_PYTHON_PY_OPTIONS | Specifies the Python options to be passed while launching the Flex Template. | NO |
Package dependencies for Python
When a Dataflow Python pipeline uses additional dependencies, you might need to configure the Flex Template to install additional dependencies on Dataflow worker VMs.
When you run a Python Dataflow job that uses Flex Templates in an environment that restricts access to the internet, you must prepackage the dependencies when you create the template.
Use one of the following options to prepackage Python dependencies.
For instructions for managing pipeline dependencies in Java and Go pipelines, see Manage pipeline dependencies in Dataflow.
Use a requirements file and prepackage the dependencies with the template
If you are using your own Dockerfile to define the Flex Template image, follow these steps:
Create a requirements.txt file that lists your pipeline dependencies, and reference it in your Dockerfile:
COPY requirements.txt /template/
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
Install the dependencies in the Flex Template image.
RUN pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
Download the dependencies into the local requirements cache, which is staged to the Dataflow workers when the template launches.
RUN pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE
When you use this approach, dependencies from the requirements.txt
file are
installed onto Dataflow workers at runtime. An insight in the Google Cloud console
recommendations tab might note this behavior. To avoid installing
dependencies at runtime, use a
custom container image.
The following is a code sample that uses a requirements file in the Flex Template.
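The sketch below assembles the preceding steps into a single Dockerfile. The base image, working directory, and file names are placeholders taken from the examples on this page; adjust them for your own pipeline:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Placeholder working directory and pipeline file name.
WORKDIR /template
COPY requirements.txt .
COPY streaming_beam.py .

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="/template/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/streaming_beam.py"

# Install the dependencies in the Flex Template image.
RUN pip install --no-cache-dir -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE

# Download the dependencies into the local requirements cache, which is staged
# to the Dataflow workers when the template launches.
RUN pip download --no-cache-dir --dest /tmp/dataflow-requirements-cache -r $FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE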
Structure the pipeline as a package and use local packages
When you use multiple Python local files or modules, structure your pipeline as a package. The file structure might look like the following example:
main.py
pyproject.toml
setup.py
src/
my_package/
my_custom_dofns_and_transforms.py
my_pipeline_launcher.py
other_utils_and_helpers.py
Place the top-level entry point, for example the main.py file, in the root directory. Place the rest of the files in a separate folder in the src directory, for example my_package.
Add the package configuration files to the root directory with the package details and requirements.
pyproject.toml
[project]
name = "my_package"
version = "package_version"
dependencies = [
    # Add list of packages (and versions) that my_package depends on.
    # Example: "apache-beam[gcp]==2.54.0",
]
setup.py
"""An optional setuptools configuration stub for the pipeline package.

Use pyproject.toml to define the package. Add this file only if you must use
the --setup_file pipeline option or the FLEX_TEMPLATE_PYTHON_SETUP_FILE
configuration option.
"""
import setuptools

setuptools.setup()
For more information about how to configure your local package, see Packaging Python Projects.
When you import local modules or files for your pipeline, use the my_package package name as the import path.
from my_package import word_count_transform
Install your pipeline package in the Flex Template image. Your Flex Template Dockerfile might include content similar to the following example:
Dockerfile
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Copy pipeline, packages and requirements.
WORKDIR ${WORKDIR}
COPY main.py .
COPY pyproject.toml .
COPY setup.py .
COPY src src

# Install local package.
RUN pip install -e .
When you use this approach, the dependencies of your pipeline package are installed onto Dataflow workers at runtime. An insight in the Google Cloud console recommendations tab might note this behavior. To avoid installing dependencies at runtime, use a custom container image.
For an example that follows the recommended approach, see the Flex Template for a pipeline with dependencies and a custom container image tutorial in GitHub.
Use a custom container that preinstalls all dependencies
To avoid dependency installation at runtime, use custom containers. This option is preferred for pipelines that run in environments without internet access.
Follow these steps to use a custom container:
Build a custom container image that preinstalls necessary dependencies.
Preinstall the same dependencies in the Flex Template Dockerfile.
To prevent dependency installation at runtime, don't use the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE or FLEX_TEMPLATE_PYTHON_SETUP_FILE options in your Flex Template configuration.
A modified Flex Template Dockerfile might look like the following example:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/main.py"

COPY . /template

# If you use a requirements file, pre-install the requirements.txt.
RUN pip install --no-cache-dir -r /template/requirements.txt

# If you supply the pipeline in a package, pre-install the local package and its dependencies.
RUN pip install -e /template
When you use this approach, you do the following:
- build the Flex Template image
- build the custom SDK container image
- install the same dependencies in both images
Alternatively, to reduce the number of images to maintain, use your custom container image as a base image for the Flex Template.
If you use the Apache Beam SDK version 2.49.0 or earlier, add the --sdk_location=container pipeline option in your pipeline launcher. This option tells your pipeline to use the SDK from your custom container instead of downloading the SDK.
options = PipelineOptions(beam_args, save_main_session=True, streaming=True, sdk_location="container")
Set the sdk_container_image parameter in the flex-template run command. For example:
gcloud dataflow flex-template run $JOB_NAME \
    --region=$REGION \
    --template-file-gcs-location=$TEMPLATE_PATH \
    --parameters=sdk_container_image=$CUSTOM_CONTAINER_IMAGE \
    --additional-experiments=use_runner_v2
For more information, see Use custom containers in Dataflow.
Choose a base image
You can use a Google-provided base image to package your template container images by using Docker. Choose the most recent tag from the Flex Templates base images. It's recommended to use a specific image tag instead of latest.
Specify the base image in the following format:
gcr.io/dataflow-templates-base/IMAGE_NAME:TAG
Replace the following:
- IMAGE_NAME: a Google-provided base image
- TAG: a version name for the base image, found in the Flex Templates base images reference
Use custom container images
If your pipeline uses a custom container image, we recommend using the custom image as a base image for your Flex Template Docker image. To do so, copy the Flex Template launcher binary from the Google-provided template base image onto your custom image.
An example Dockerfile for an image that can be used both as a custom SDK container image and as a Flex Template might look like the following:
FROM gcr.io/dataflow-templates-base/IMAGE_NAME:TAG as template_launcher
FROM apache/beam_python3.10_sdk:2.60.0
# RUN <...Make image customizations here...>
# See: https://cloud.google.com/dataflow/docs/guides/build-container-image
# Configure the Flex Template here.
COPY --from=template_launcher /opt/google/dataflow/python_template_launcher /opt/google/dataflow/python_template_launcher
COPY my_pipeline.py /template/
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="/template/my_pipeline.py"
Replace the following:
- IMAGE_NAME: a Google-provided base image. For example: python311-template-launcher-base.
- TAG: a version tag for the base image, found in the Flex Templates base images reference. For better stability and troubleshooting, avoid using latest. Instead, pin to a specific version tag.
For an example that follows this approach, see the Flex Template for a pipeline with dependencies and a custom container image tutorial.
Use an image from a private registry
You can build a Flex Template from an image stored in a private Docker registry, if the private registry uses HTTPS and has a valid certificate.
To use an image from a private registry, specify the path to the image and a username and password for the registry. The username and password must be stored in Secret Manager. You can provide the secret in one of the following formats:
- projects/{project}/secrets/{secret}/versions/{secret_version}
- projects/{project}/secrets/{secret}
If you use the second format, which doesn't specify a version, Dataflow uses the latest version of the secret.
If the registry uses a self-signed certificate, you also need to specify the path to the self-signed certificate in Cloud Storage.
The following table describes the gcloud CLI options that you can use to configure a private registry.
Parameter | Description |
---|---|
image | The address of the registry. For example: gcp.repository.example.com:9082/registry/example/image:latest. |
image-repository-username-secret-id | The Secret Manager secret ID for the username to authenticate to the private registry. For example: projects/example-project/secrets/username-secret. |
image-repository-password-secret-id | The Secret Manager secret ID for the password to authenticate to the private registry. For example: projects/example-project/secrets/password-secret/versions/latest. |
image-repository-cert-path | The full Cloud Storage URL for a self-signed certificate for the private registry. This value is only required if the registry uses a self-signed certificate. For example: gs://example-bucket/self-signed.crt. |
Here's an example Google Cloud CLI command that builds a Flex Template using an image in a private registry with a self-signed certificate.
gcloud dataflow flex-template build gs://example-bucket/custom-pipeline-private-repo.json \
    --sdk-language=JAVA \
    --image="gcp.repository.example.com:9082/registry/example/image:latest" \
    --image-repository-username-secret-id="projects/example-project/secrets/username-secret" \
    --image-repository-password-secret-id="projects/example-project/secrets/password-secret/versions/latest" \
    --image-repository-cert-path="gs://example-bucket/self-signed.crt" \
    --metadata-file=metadata.json
To build your own Flex Template, replace the example values, and you might need to specify different or additional options. For more information, see the gcloud dataflow flex-template build command reference.
Specify pipeline options
For information about pipeline options that are directly supported by Flex Templates, see Pipeline options.
You can also use any Apache Beam pipeline options indirectly. If you're using a metadata.json file for your Flex Template job, include these pipeline options in the file. This metadata file must follow the format in TemplateMetadata.
Otherwise, when you launch the Flex Template job, pass these pipeline options by using the parameters field.
API
Include pipeline options by using the parameters field.
gcloud
Include pipeline options by using the parameters flag.
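For example, the following sketch passes two pipeline options with the parameters flag; the job name, template path, region, and parameter names are placeholders:

gcloud dataflow flex-template run example-job \
    --template-file-gcs-location=gs://example-bucket/example-template.json \
    --region=us-central1 \
    --parameters=input_subscription=projects/example-project/subscriptions/example-sub,output_table=example-project:example_dataset.example_table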
When passing parameters of List or Map type, you might need to define the parameters in a YAML file and pass that file with the --flags-file flag.
For an example of this approach, view the "Create a file with parameters..." step in this solution.
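As a sketch, a parameters.yaml file for the --flags-file flag maps flag names to values; the parameter names and values here are placeholders:

# parameters.yaml (placeholder parameter names and values)
--parameters:
  input_subscriptions: projects/example-project/subscriptions/sub-a,projects/example-project/subscriptions/sub-b
  output_table: example-project:example_dataset.example_table

You then reference the file when you run the template:

gcloud dataflow flex-template run example-job \
    --template-file-gcs-location=gs://example-bucket/example-template.json \
    --region=us-central1 \
    --flags-file=parameters.yaml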
When using Flex Templates, you can configure some pipeline options during pipeline initialization, but other pipeline options can't be changed. If the command line arguments required by the Flex Template are overwritten, the job might ignore, override, or discard the pipeline options passed by the template launcher. The job might fail to launch, or a job that doesn't use the Flex Template might launch. For more information, see Failed to read the job file.
During pipeline initialization, don't change the following pipeline options:
Java
runner
project
jobName
templateLocation
region
Python
runner
project
job_name
template_location
region
Go
runner
project
job_name
template_location
region
Block project SSH keys from VMs that use metadata-based SSH keys
You can prevent VMs from accepting SSH keys that are stored in project metadata by blocking project SSH keys from VMs. Use the additional-experiments flag with the block_project_ssh_keys service option:
--additional-experiments=block_project_ssh_keys
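For example, a sketch of launching a job with this service option; the job name, template path, and region are placeholders:

gcloud dataflow flex-template run example-job \
    --template-file-gcs-location=gs://example-bucket/example-template.json \
    --region=us-central1 \
    --additional-experiments=block_project_ssh_keys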
For more information, see Dataflow service options.
Metadata
You can extend your template with additional metadata so that custom parameters are validated when the template is run. If you want to create metadata for your template, follow these steps:
- Create a metadata.json file using the parameters in Metadata parameters. To view an example, see Example metadata file.
- Store the metadata file in Cloud Storage in the same folder as the template.
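For example, you might copy the metadata file next to the template spec with a command like the following; the bucket and folder are placeholders:

gcloud storage cp metadata.json gs://example-bucket/templates/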
Metadata parameters
Parameter key | Required | Description of the value |
---|---|---|
name | Yes | The name of your template. |
description | No | A short paragraph of text describing the template. |
streaming | No | If true, this template supports streaming. The default value is false. |
supportsAtLeastOnce | No | If true, this template supports at-least-once processing. The default value is false. Set this parameter to true if the template is designed to work with at-least-once streaming mode. |
supportsExactlyOnce | No | If true, this template supports exactly-once processing. The default value is true. |
defaultStreamingMode | No | The default streaming mode, for templates that support both at-least-once mode and exactly-once mode. Use one of the following values: "AT_LEAST_ONCE", "EXACTLY_ONCE". If unspecified, the default streaming mode is exactly-once. |
parameters | No | An array of additional parameters that the template uses. An empty array is used by default. |

Each object in the parameters array uses the following keys:

Parameter key | Required | Description of the value |
---|---|---|
name | Yes | The name of the parameter that is used in your template. |
label | Yes | A human readable string that is used in the Google Cloud console to label the parameter. |
helpText | Yes | A short paragraph of text that describes the parameter. |
isOptional | No | false if the parameter is required and true if the parameter is optional. Unless set with a value, isOptional defaults to false. If you do not include this parameter key for your metadata, the metadata becomes a required parameter. |
regexes | No | An array of POSIX-egrep regular expressions in string form that is used to validate the value of the parameter. For example, ["^[a-zA-Z][a-zA-Z0-9]+"] is a single regular expression that validates that the value starts with a letter and then has one or more characters. An empty array is used by default. |
Example metadata file
Java
{ "name": "Streaming Beam SQL", "description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam SQL to transform the message data, and writes the results to a BigQuery", "parameters": [ { "name": "inputSubscription", "label": "Pub/Sub input subscription.", "helpText": "Pub/Sub subscription to read from.", "regexes": [ "[a-zA-Z][-_.~+%a-zA-Z0-9]{2,}" ] }, { "name": "outputTable", "label": "BigQuery output table", "helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.", "isOptional": true, "regexes": [ "[^:]+:[^.]+[.].+" ] } ] }
Python
{ "name": "Streaming beam Python flex template", "description": "Streaming beam example for python flex template.", "parameters": [ { "name": "input_subscription", "label": "Input PubSub subscription.", "helpText": "Name of the input PubSub subscription to consume from.", "regexes": [ "projects/[^/]+/subscriptions/[a-zA-Z][-_.~+%a-zA-Z0-9]{2,}" ] }, { "name": "output_table", "label": "BigQuery output table name.", "helpText": "Name of the BigQuery output table name.", "isOptional": true, "regexes": [ "([^:]+:)?[^.]+[.].+" ] } ] }
You can download metadata files for the Google-provided templates from the Dataflow template directory.
Understand staging location and temp location
The Google Cloud CLI provides --staging-location and --temp-location options when you run a Flex Template.
Similarly, the Dataflow REST API provides stagingLocation and tempLocation fields for FlexTemplateRuntimeEnvironment.
For Flex Templates, the staging location is the Cloud Storage URL that files are written to during the staging step of launching a template. Dataflow reads these staged files to create the template graph. The temp location is the Cloud Storage URL that temporary files are written to during the execution step.
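For example, a sketch of setting both locations when you run a Flex Template; the job name, template path, region, and bucket paths are placeholders:

gcloud dataflow flex-template run example-job \
    --template-file-gcs-location=gs://example-bucket/example-template.json \
    --region=us-central1 \
    --staging-location=gs://example-bucket/staging \
    --temp-location=gs://example-bucket/temp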
Update a Flex Template job
The following example request shows you how to update a template streaming job by using the projects.locations.flexTemplates.launch method. If you want to use the gcloud CLI, see Update an existing pipeline.
If you want to update a classic template, use projects.locations.templates.launch instead.
Follow the steps to create a streaming job from a Flex Template. Send the following HTTP POST request with the modified values:
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION/flexTemplates:launch
{
  "launchParameter": {
    "update": true,
    "jobName": "JOB_NAME",
    "parameters": {
      "input_subscription": "projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME",
      "output_table": "PROJECT_ID:DATASET.TABLE_NAME"
    },
    "containerSpecGcsPath": "STORAGE_PATH"
  }
}
- Replace PROJECT_ID with your project ID.
- Replace REGION with the Dataflow region of the job that you're updating.
- Replace JOB_NAME with the exact name of the job that you want to update.
- Set parameters to your list of key-value pairs. The parameters listed are specific to this template example. If you're using a custom template, modify the parameters as needed. If you're using the example template, replace the following variables:
  - Replace SUBSCRIPTION_NAME with your Pub/Sub subscription name.
  - Replace DATASET with your BigQuery dataset name.
  - Replace TABLE_NAME with your BigQuery table name.
- Replace STORAGE_PATH with the Cloud Storage location of the template file. The location should start with gs://.
Use the environment parameter to change environment settings. For more information, see FlexTemplateRuntimeEnvironment.
Optional: To send your request using curl (Linux, macOS, or Cloud Shell), save the request to a JSON file, and then run the following command:
curl -X POST -d "@FILE_PATH" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/REGION/flexTemplates:launch
Replace FILE_PATH with the path to the JSON file that contains the request body.
Use the Dataflow monitoring interface to verify that a new job with the same name was created. This job has the status Updated.
Limitations
The following limitations apply to Flex Templates jobs:
- You must use a Google-provided base image to package your containers using Docker. For a list of applicable images, see Flex Template base images.
- The program that constructs the pipeline must exit after run is called in order for the pipeline to start. waitUntilFinish (Java) and wait_until_finish (Python) are not supported; see the sketch after this list.
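The following is a minimal Python sketch of a launcher program that starts the pipeline and then exits; the transform shown is a placeholder for your own pipeline logic:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def main(argv=None):
    options = PipelineOptions(argv, save_main_session=True)
    pipeline = beam.Pipeline(options=options)
    (
        pipeline
        | "Create" >> beam.Create(["placeholder"])  # placeholder source
        | "Log" >> beam.Map(print)
    )
    # Start the job and return. Don't call wait_until_finish(); the launcher
    # program must exit so that the Flex Template job can start.
    pipeline.run()


if __name__ == "__main__":
    main()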
What's next
- To learn more about classic templates, Flex Templates, and their use cases, see Dataflow templates.
- For Flex Templates troubleshooting information, see Troubleshoot Flex Template timeouts.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.