Pipeline options

This page documents Dataflow pipeline options. For information about how to use these options, see Setting pipeline options.

Basic options

This table describes basic pipeline options that are used by many jobs.

Java

Field Type Description
dataflowServiceOptions String

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.

enableStreamingEngine boolean

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. When set to the default value, the steps of your streaming pipeline are run entirely on worker VMs.

Supported in Flex Templates.

experiments String

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.

jobName String

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Also used when updating an existing pipeline.

If not set, Dataflow generates a unique name automatically.

labels String

User-defined labels, also known as additional-user-labels. User-specified labels are available in billing exports, which you can use for cost attribution. Specify a JSON string of "key": "value" pairs. Example: --labels='{ "name": "wrench", "mass": "1_3kg", "count": "3" }'.

Supported in Flex Templates.

project String

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, defaults to the project that is configured in the gcloud CLI.

region String

Specifies a region for deploying your Dataflow jobs.

If not set, defaults to us-central1.

runner Class (NameOfRunner)

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner.

The default value is DirectRunner (local mode).

stagingLocation String

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, defaults to what you specified for tempLocation.

tempLocation String

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the tempLocation filename, the at sign (@) can't be followed by a number or by an asterisk (*).

Supported in Flex Templates.

Python

Field Type Description
dataflow_service_options str

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.

experiments str

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.

enable_streaming_engine bool

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value depends on your pipeline configuration. For more information, see Use Streaming Engine. When set to false, the steps of your streaming pipeline are run entirely on worker VMs.

Supported in Flex Templates.

job_name str

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details.

If not set, Dataflow generates a unique name automatically.

labels str

User-defined labels, also known as additional-user-labels. User-specified labels are available in billing exports, which you can use for cost attribution.

For each label, specify a "key=value" pair.

Keys must conform to the regular expression: [\p{Ll}\p{Lo}][\p{Ll}\p{Lo}\p{N}_-]{0,62}.

Values must conform to the regular expression: [\p{Ll}\p{Lo}\p{N}_-]{0,63}.

For example, to define two user labels: --labels "name=wrench" --labels "mass=1_3kg".

Supported in Flex Templates.

pickle_library str

The pickle library to use for data serialization. Supported values are dill, cloudpickle, and default. To use the cloudpickle option, set the option both at the start of the code and as a pipeline option. You must set the option in both places because pickling starts when PTransforms are constructed, which happens before pipeline construction. To include at the start of the code, add lines similar to the following:

from apache_beam.internal import pickler
pickler.set_library(pickler.USE_CLOUDPICKLE)

If not set, defaults to dill.

project str

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, throws an error.

region str

Specifies a region for deploying your Dataflow jobs.

If not set, defaults to us-central1.

runner str

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner.

The default value is DirectRunner (local mode).

sdk_location str

Path to the Apache Beam SDK. Must be a valid URL, Cloud Storage path, or local path to an Apache Beam SDK tarball or tar archive file. To install the Apache Beam SDK from within a container, use the value container.

If not set, defaults to the current version of the Apache Beam SDK.

Supported in Flex Templates.

staging_location str

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, defaults to a staging directory within temp_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud.

temp_location str

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the temp_location filename, the at sign (@) can't be followed by a number or by an asterisk (*).

You must specify either temp_location or staging_location (or both). If temp_location is not set, temp_location defaults to the value for staging_location.

Supported in Flex Templates.
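
The basic options in the preceding table can also be set programmatically. The following is a minimal sketch for the Python SDK; the project ID, bucket, and job name are placeholders, and the commented-out pickle_library line assumes you also call pickler.set_library() at the start of your code, as described earlier.

# Minimal sketch: setting several basic options programmatically.
# All resource names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',                   # use 'DirectRunner' to run locally
    project='my-project-id',                   # placeholder project ID
    region='us-central1',
    job_name='example-job',                    # placeholder job name
    temp_location='gs://my-bucket/temp',       # placeholder bucket
    staging_location='gs://my-bucket/staging',
    labels=['name=wrench', 'mass=1_3kg'],      # key=value pairs for cost attribution
    # pickle_library='cloudpickle',            # also requires pickler.set_library() at import time
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'world'])
     | 'Print' >> beam.Map(print))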

Go

Field Type Description
dataflow_service_options str

Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.40.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options.

experiments str

Enables experimental or pre-GA Dataflow features, using the following syntax: --experiments=experiment. When setting multiple experiments programmatically, pass a comma-separated list.

job_name str

The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details.

If not set, Dataflow generates a unique name automatically.

project str

The project ID for your Google Cloud project. The project is required if you want to run your pipeline using the Dataflow managed service.

If not set, returns an error.

region str

Specifies a region for deploying your Dataflow jobs.

If not set, returns an error.

runner str

The PipelineRunner to use. This option lets you determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use dataflow. To run your pipeline locally, use direct.

The default value is direct (local mode).

staging_location str

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/.

If not set, returns an error.

temp_location str

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. In the temp_location filename, the at sign (@) can't be followed by a number or by an asterisk (*).

If temp_location is not set, temp_location defaults to the value for staging_location.

Resource utilization

This table describes pipeline options that you can set to manage resource utilization.

Java

Field Type Description
autoscalingAlgorithm String

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that don't use Streaming Engine.

flexRSGoal String

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the numWorkers, autoscalingAlgorithm, zone, region, and workerMachineType parameters. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

maxNumWorkers int

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

numberOfWorkerHarnessThreads int

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

For more information about using this option to reduce memory usage, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

Supported in Flex Templates.

numWorkers int

The initial number of Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

Python

Field Type Description
autoscaling_algorithm str

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that don't use Streaming Engine.

flexrs_goal str

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the num_workers, autoscaling_algorithm, zone, region, and machine_type parameters. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

max_num_workers int

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.

number_of_worker_harness_threads int

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

When using this option to reduce memory usage, using the --experiments=no_use_multiple_sdk_containers option might also be necessary, particularly for batch pipelines. For more information, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

Supported in Flex Templates.

experiments=no_use_multiple_sdk_containers

Configures Dataflow worker VMs to start only one containerized Apache Beam Python SDK process. This option doesn't decrease the total number of threads; as a result, all threads run in a single Apache Beam SDK process. Due to Python's global interpreter lock (GIL), CPU utilization might be limited and performance reduced. When using this option with a worker machine type that has many vCPU cores, to prevent stuck workers, consider reducing the number of worker harness threads.

If not specified, Dataflow starts one Apache Beam SDK process per VM core. This experiment only affects Python pipelines that use Dataflow Runner V2.

Supported in Flex Templates. Can be set by the template or by using the --additional_experiments option.

num_workers int

The number of Compute Engine instances to use when executing your pipeline.

If unspecified, the Dataflow service determines an appropriate number of workers.

Supported in Flex Templates.
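
As an illustrative sketch only, the following combines the Python resource options above for a batch job. The worker counts and thread value are arbitrary examples, not recommendations, and the project and bucket are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',                    # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp',        # placeholder
    autoscaling_algorithm='THROUGHPUT_BASED',
    num_workers=5,                              # initial worker count
    max_num_workers=20,                         # upper bound for scaling
    number_of_worker_harness_threads=4,         # cap per-worker parallelism
    # experiments=['no_use_multiple_sdk_containers'],  # optional; see the entry above
)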

Go

Field Type Description
autoscaling_algorithm str

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Defaults to THROUGHPUT_BASED for all batch Dataflow jobs.

flexrs_goal str

Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the num_workers, autoscaling_algorithm, zone, region, and worker_machine_type parameters. Requires Apache Beam SDK 2.40.0 or later. For more information, see the FlexRS pipeline options section.

If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.

max_num_workers int

The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise.

If unspecified, the Dataflow service determines an appropriate number of workers.

number_of_worker_harness_threads int

This option influences the number of concurrent units of work that can be assigned to one worker VM at a time. Lower values might reduce memory usage by decreasing parallelism. This value influences the upper bound of parallelism, but the actual number of threads on the worker might not match this value depending on other constraints. The implementation depends on the SDK language and other runtime parameters.

To reduce the parallelism for batch pipelines, set the value of the flag to a number that is less than the number of vCPUs on the worker. For streaming pipelines, set the value of the flag to a number that is less than the number of threads per Apache Beam SDK process. To estimate threads per process, see the table in the DoFn memory usage section in "Troubleshoot Dataflow out of memory errors."

For more information about using this option to reduce memory usage, see Troubleshoot Dataflow out of memory errors.

If unspecified, the Dataflow service determines an appropriate value.

num_workers int

The number of Compute Engine instances to use when executing your pipeline.

If unspecified, the Dataflow service determines an appropriate number of workers.

Debugging

This table describes pipeline options that you can use to debug your job.

Java

Field Type Description
hotKeyLoggingEnabled boolean

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

If not set, only the presence of a hot key is logged.

Python

Field Type Description
enable_hot_key_logging bool

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

Requires Dataflow Runner V2 and Apache Beam SDK 2.29.0 or later. Must be set as a service option, using the format dataflow_service_options=enable_hot_key_logging.

If not set, only the presence of a hot key is logged.
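
For example, a minimal sketch of enabling hot key logging from the Python SDK, passing the value as a Dataflow service option as described above. The project and bucket names are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',                              # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp',                  # placeholder
    dataflow_service_options=['enable_hot_key_logging'],  # service option, not a regular flag
)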

Go

No debugging pipeline options are available.

Security and networking

This table describes pipeline options for controlling your account and networking.

Java

Field Type Description
dataflowKmsKey String

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify tempLocation to use this feature.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

Supported in Flex Templates.

gcpOauthScopes List

Specifies the OAuth scopes that will be requested when creating the default Google Cloud credentials. Might have no effect if you manually specify the Google Cloud credential or credential factory.

If not set, the following scopes are used:

"https://www.googleapis.com/auth/bigquery",
"https://www.googleapis.com/auth/bigquery.insertdata",
"https://www.googleapis.com/auth/cloud-platform",
"https://www.googleapis.com/auth/datastore",
"https://www.googleapis.com/auth/devstorage.full_control",
"https://www.googleapis.com/auth/pubsub",
"https://www.googleapis.com/auth/userinfo.email"

impersonateServiceAccount String

If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs.

If not set, Application Default Credentials are used to submit Dataflow jobs.

serviceAccount String

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

Supported in Flex Templates.

network String

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

Supported in Flex Templates.

subnetwork String

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

Supported in Flex Templates.

usePublicIps boolean

Specifies whether Dataflow workers use external IP addresses. If the value is set to false, Dataflow workers use internal IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. External IP addresses have an associated cost.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If not set, the default value is true and Dataflow workers use external IP addresses.

Python

Field Type Description
dataflow_kms_key str

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

Supported in Flex Templates.

gcp_oauth_scopes list[str]

Specifies the OAuth scopes that will be requested when creating Google Cloud credentials. If set programmatically, must be set as a list of strings.

If not set, the following scopes are used:

"https://www.googleapis.com/auth/bigquery",
"https://www.googleapis.com/auth/cloud-platform",
"https://www.googleapis.com/auth/datastore",
"https://www.googleapis.com/auth/devstorage.full_control",
'https://www.googleapis.com/auth/spanner.admin",
"https://www.googleapis.com/auth/spanner.data",
"https://www.googleapis.com/auth/userinfo.email"

impersonate_service_account str

If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. Specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. This option is only used to submit Dataflow jobs.

If not set, Application Default Credentials are used to submit Dataflow jobs.

service_account_email str

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

Supported in Flex Templates.

network str

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

Supported in Flex Templates.

subnetwork str

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

Supported in Flex Templates.

use_public_ips Optional[bool]

Specifies whether Dataflow workers must use external IP addresses. External IP addresses have an associated cost.

To enable external IP addresses for Dataflow workers, specify the command-line flag: --use_public_ips or set the option using the programmatic API—for example, options = PipelineOptions(use_public_ips=True).

To make Dataflow workers use internal IP addresses for all communication, specify the command-line flag: --no_use_public_ips or set the option using the programmatic API—for example, options = PipelineOptions(use_public_ips=False). In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If the option is not explicitly enabled or disabled, the Dataflow workers use external IP addresses.

Supported in Flex Templates.

no_use_public_ips

Command-line flag that sets use_public_ips to False. See use_public_ips.

Supported in Flex Templates.
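
The following sketch shows one way these Python security and networking options might be combined: workers on a private subnetwork with internal IP addresses only, a user-managed worker service account, and a CMEK. Every resource name (project, bucket, subnetwork, service account, key) is a placeholder.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',                                  # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp',                      # placeholder
    subnetwork='regions/us-central1/subnetworks/my-subnet',   # needs Private Google Access
    use_public_ips=False,                                     # workers use internal IPs only
    service_account_email='my-sa@my-project-id.iam.gserviceaccount.com',
    dataflow_kms_key='projects/my-project-id/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key',
)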

Go

Field Type Description
dataflow_kms_key str

Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature. Requires Apache Beam SDK 2.40.0 or later.

If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK.

network str

The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network.

If not set, Google Cloud assumes that you intend to use a network named default.

service_account_email str

Specifies a user-managed worker service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Worker service account section of the Dataflow security and permissions page.

If not set, workers use the Compute Engine service account of your project as the worker service account.

subnetwork str

The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork.

The Dataflow service determines the default value.

no_use_public_ips bool

Specifies that Dataflow workers must not use external IP addresses. If the value is set to true, Dataflow workers use internal IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. External IP addresses have an associated cost.

You can also use the WorkerIPAddressConfiguration API field to specify how IP addresses are allocated to worker machines.

If not set, Dataflow workers use external IP addresses.

Streaming pipeline management

This table describes pipeline options that let you manage the state of your Dataflow pipelines across job instances.

Java

Field Type Description
createFromSnapshot String

Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots.

If not set, no snapshot is used to create a job.

enableStreamingEngine boolean

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs.

Supported in Flex Templates.

update boolean

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline.

The default value is false.

Python

Field Type Description
create_from_snapshot String

Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots.

If not set, no snapshot is used to create a job.

enable_streaming_engine bool

Specifies whether Dataflow Streaming Engine is enabled or disabled. Streaming Engine lets you run the steps of your streaming pipeline in the Dataflow service backend, which conserves CPU, memory, and Persistent Disk storage resources.

The default value is false. This default means that the steps of your streaming pipeline are executed entirely on worker VMs.

Supported in Flex Templates.

update bool

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline.

The default value is false.
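
As a hedged sketch, the following updates an existing Python streaming job that uses Streaming Engine. The job name must match the running job you want to replace; the project and bucket are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',             # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp', # placeholder
    job_name='my-streaming-job',         # same name as the job being updated
    enable_streaming_engine=True,
    update=True,
)
options.view_as(StandardOptions).streaming = True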

Go

Field Type Description
update bool

Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. Requires Apache Beam SDK 2.40.0 or later.

The default value is false.

Worker-level options

This table describes pipeline options that apply to the Dataflow worker level.

Java

Field Type Description
diskSizeGb int

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud project.

filesToStage List<String>

A non-empty list of local files, directories of files, or archives (such as JAR or zip files) to make available to each worker. If you set this option, then only the files you specify are uploaded (the Java classpath is ignored). You must specify all of your resources in the correct classpath order. Resources are not limited to code; they can also include configuration files and other resources to make available to all workers. Your code can access the listed resources by using the standard Java resource lookup methods.

Cautions: Specifying a directory path is suboptimal, because Dataflow zips the files before uploading, which involves a higher startup time cost. Also, don't use this option to transfer data to workers that is meant to be processed by the pipeline, because doing so is significantly slower than using built-in Cloud Storage or BigQuery APIs combined with the appropriate Dataflow data source.

If filesToStage is omitted, Dataflow infers the files to stage based on the Java classpath. The considerations and cautions described in the preceding paragraphs also apply (types of files to list and how to access them from your code).

workerDiskType String

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

workerMachineType String

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

Supported in Flex Templates.

workerRegion String

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned.

Note: This option cannot be combined with workerZone or zone.

If not set, defaults to the value set for region.

Supported in Flex Templates.

workerZone String

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with workerRegion or zone.

If you specify either region or workerRegion, workerZone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

zone String

(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline.

If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

workerCacheMb int

Specifies the size of the cache for side inputs and user state. By default, Dataflow allocates 100 MB of memory for caching side inputs and user state. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

Defaults to 100 MB.

maxCacheMemoryUsageMb int

For jobs that use Dataflow Runner v2, specifies the cache size for side inputs and user state in the format maxCacheMemoryUsageMb=N, where N is the cache size in MB. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory. Alternatively, to set the cache size as a percentage of total VM space, specify maxCacheMemoryUsagePercent.

Defaults to 100 MB.

maxCacheMemoryUsagePercent int

For jobs that use Dataflow Runner v2, specifies the cache size as a percentage of total VM space in the format maxCacheMemoryUsagePercent=N, where N is the cache size as a percentage of total VM space. A larger cache might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

Defaults to 20%.

Python

Field Type Description
disk_size_gb int

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud project.

worker_disk_type str

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

machine_type str

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

Supported in Flex Templates.

worker_region str

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned.

Note: This option cannot be combined with worker_zone or zone.

If not set, defaults to the value set for region.

Supported in Flex Templates.

worker_zone str

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with worker_region or zone.

If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

zone str

(Deprecated) For Apache Beam SDK 2.17.0 or earlier, this option specifies the Compute Engine zone for launching worker instances to run your pipeline.

If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Supported in Flex Templates.

max_cache_memory_usage_mb int

Starting in Apache Beam Python SDK version 2.52.0, you can use this option to control the cache size for side inputs and for user state. Applies for each SDK process. Increasing the amount of memory allocated to workers might improve the performance of jobs that use large iterable side inputs but also consumes more worker memory.

To increase the side input cache value, use one of the following pipeline options.

  • For SDK versions 2.52.0 and later, use --max_cache_memory_usage_mb=N.
  • For SDK versions 2.42.0 to 2.51.0, use --experiments=state_cache_size=N.

Replace N with the cache size, in MB.

The default value depends on the SDK version:

  • For SDK versions 2.52.0 to 2.54.0, defaults to 100 MB.
  • For other SDK versions, defaults to 0 MB.
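
The following sketch combines several of the Python worker-level options above. The machine type, disk size, worker region, and cache size are illustrative values, not recommendations, and the project and bucket are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project-id',              # placeholder
    region='us-central1',
    temp_location='gs://my-bucket/temp',  # placeholder
    machine_type='n2-standard-4',         # worker VM machine type
    disk_size_gb=50,                      # per-worker disk size, in GB
    worker_region='us-east1',             # run workers outside the job's region
    max_cache_memory_usage_mb=256,        # side-input/state cache; SDK 2.52.0 or later
)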

Go

Field Type Description
disk_size_gb int

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. For more information, see Disk size.

Set to 0 to use the default size defined in your Google Cloud project.

disk_type str

The type of Persistent Disk to use. For more information, see Disk type.

The Dataflow service determines the default value.

worker_machine_type str

The Compute Engine machine type that Dataflow uses when starting worker VMs. For more information, see Machine type.

If you don't set this option, Dataflow chooses the machine type based on your job.

worker_region str

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned.

Note: This option cannot be combined with worker_zone or zone.

If not set, defaults to the value set for region.

worker_zone str

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. Requires Apache Beam SDK 2.40.0 or later.

Note: This option cannot be combined with worker_region or zone.

If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone.

Setting other local pipeline options

When executing your pipeline locally, the default values for the properties in PipelineOptions are usually sufficient.

Java

You can find the default values for PipelineOptions in the Apache Beam SDK for Java API reference; see the PipelineOptions class listing for complete details.

If your pipeline uses Google Cloud products such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use GcpOptions.setProject to set your Google Cloud Project ID. You may also need to set credentials explicitly. See the GcpOptions class for complete details.

Python

You can find the default values for PipelineOptions in the Apache Beam SDK for Python API reference; see the PipelineOptions module listing for complete details.

If your pipeline uses Google Cloud services such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use options.view_as(GoogleCloudOptions).project to set your Google Cloud Project ID. You may also need to set credentials explicitly. See the GoogleCloudOptions class for complete details.
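
The following is a minimal sketch of that pattern: a locally run pipeline (DirectRunner) that still sets a Google Cloud project for BigQuery or Cloud Storage I/O. The project ID and bucket are placeholders.

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DirectRunner'        # run locally
options.view_as(GoogleCloudOptions).project = 'my-project-id'   # placeholder project ID
options.view_as(GoogleCloudOptions).temp_location = 'gs://my-bucket/temp'  # placeholder bucket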

Go

You can find the default values for PipelineOptions in the Apache Beam SDK for Go API reference; see jobopts for more details.