Configuring Flex Templates

This page provides information about required Dockerfile environment variables and supported pipeline parameters for Dataflow Flex Templates.

Setting required Dockerfile environment variables

If you create your own Dockerfile for a Flex Template job, you must specify the following environment variables:

Java

You must specify FLEX_TEMPLATE_JAVA_MAIN_CLASS and FLEX_TEMPLATE_JAVA_CLASSPATH in your Dockerfile.
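
For example, a Dockerfile for a Java template might set these variables as follows. The main class and JAR path shown here are placeholders for your own pipeline artifacts:

ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="com.example.MyPipeline"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/template/pipeline-bundled.jar"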

Python

You must specify the following in your Dockerfile: FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE, FLEX_TEMPLATE_PYTHON_PY_FILE, FLEX_TEMPLATE_PYTHON_PY_OPTIONS, and FLEX_TEMPLATE_PYTHON_SETUP_FILE.

For example, the Streaming in Python Flex Template tutorial sets the following environment variables:

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/streaming_beam.py"
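
If your pipeline is packaged with a setup file, you can point FLEX_TEMPLATE_PYTHON_SETUP_FILE at it in the same way. The path below is a placeholder for your own project layout:

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"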

Changing the base image

You must use a Google-provided base image to package your containers using Docker. Choose the most recent version name from the Flex Templates base images reference. Do not select latest.

Specify the base image in the following format:

gcr.io/dataflow-templates-base/IMAGE_NAME:VERSION_NAME

Replace the following:

IMAGE_NAME: the name of a Google-provided base image
VERSION_NAME: the version name that you chose from the Flex Templates base images reference
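
For example, a Python template could build on the Python 3 launcher base image. The following line is a sketch; substitute a current version name from the reference:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:VERSION_NAME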

Specifying pipeline parameters

Pipeline options are execution parameters that configure how and where to run Dataflow jobs. You can set the following Dataflow pipeline options for Flex Template jobs using the gcloud command-line tool; see the example launch commands after each table below.

Java

gcpTempLocation (String)
Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.

numWorkers (int)
The initial number of Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts when your job begins. If unspecified, the Dataflow service determines an appropriate number of workers.

maxNumWorkers (int)
The maximum number of Compute Engine instances to be made available to your pipeline during execution. This can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise. If unspecified, the Dataflow service determines an appropriate number of workers.

numberOfWorkerHarnessThreads (int)
The number of threads per worker harness. If unspecified, the Dataflow service determines an appropriate number of threads per worker.

workerRegion (String)
Specifies a Compute Engine region for launching worker instances to run your pipeline. Use this option to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned. Note: This option cannot be combined with workerZone or zone. If not set, defaults to the value set for region.

workerZone (String)
Specifies a Compute Engine zone for launching worker instances to run your pipeline. Use this option to run workers in a different location than the region used to deploy, manage, and monitor jobs. Note: This option cannot be combined with workerRegion or zone. If you specify either region or workerRegion, workerZone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

zone (String)
(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

dataflowKmsKey (String)
Specifies the customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify gcpTempLocation to use this feature. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

network (String)
The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named default.

subnetwork (String)
The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value.

enableStreamingEngine (boolean)
Specifies whether Dataflow Streaming Engine is enabled; true if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value is false, which means that the steps of your streaming pipeline run entirely on worker VMs.

serviceAccount (String)
Specifies a user-managed controller service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Controller service account section of the Cloud Dataflow security and permissions page. If not set, workers use your project's Compute Engine service account as the controller service account.

workerMachineType (String)
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. If you do not set this option, the Dataflow service chooses a machine type based on your job.
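
As a sketch, a launch command that sets several of the Java options above might look like the following. The job name, template path, and option values are placeholders, and this example assumes you pass pipeline options as --parameters entries, in the same style as the Streaming in Python tutorial:

gcloud dataflow flex-template run "streaming-beam-java" \
    --template-file-gcs-location "gs://BUCKET_NAME/templates/streaming-beam.json" \
    --region "us-central1" \
    --parameters gcpTempLocation="gs://BUCKET_NAME/temp" \
    --parameters maxNumWorkers=10 \
    --parameters workerMachineType="n1-standard-2"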

Python

temp_location (str)
Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://. If not set, defaults to the value for staging_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud.

num_workers (int)
The number of Compute Engine instances to use when executing your pipeline. If unspecified, the Dataflow service determines an appropriate number of workers.

max_num_workers (int)
The maximum number of Compute Engine instances to be made available to your pipeline during execution. This can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise. If unspecified, the Dataflow service determines an appropriate number of workers.

number_of_worker_harness_threads (int)
The number of threads per worker harness. If unspecified, the Dataflow service determines an appropriate number of threads per worker. To use this parameter, you also need to set the flag --experiments=use_runner_v2.

worker_region (str)
Specifies a Compute Engine region for launching worker instances to run your pipeline. Use this option to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned. Note: This option cannot be combined with worker_zone or zone. If not set, defaults to the value set for region.

worker_zone (str)
Specifies a Compute Engine zone for launching worker instances to run your pipeline. Use this option to run workers in a different location than the region used to deploy, manage, and monitor jobs. Note: This option cannot be combined with worker_region or zone. If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

zone (str)
(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

dataflow_kms_key (str)
Specifies the customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

network (str)
The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named default.

subnetwork (str)
The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value.

enable_streaming_engine (bool)
Specifies whether Dataflow Streaming Engine is enabled; true if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources. The default value is false, which means that the steps of your streaming pipeline run entirely on worker VMs.

service_account_email (str)
Specifies a user-managed controller service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Controller service account section of the Cloud Dataflow security and permissions page. If not set, workers use your project's Compute Engine service account as the controller service account.

machine_type (str)
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use n1 machine types. Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Note that Dataflow bills by the number of vCPUs and GB of memory in workers; billing is independent of the machine type family. If you do not set this option, the Dataflow service chooses a machine type based on your job.
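
Similarly, a sketch for a Python job uses the snake_case option names from the table above. The job name, template path, and option values are placeholders:

gcloud dataflow flex-template run "streaming-beam-python" \
    --template-file-gcs-location "gs://BUCKET_NAME/templates/streaming_beam.json" \
    --region "us-central1" \
    --parameters temp_location="gs://BUCKET_NAME/temp" \
    --parameters max_num_workers=10 \
    --parameters machine_type="n1-standard-2"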

What's next