Pipeline options

This page documents Dataflow pipeline options. For information on how to use these options, read Setting pipeline options.

Basic options

This table describes basic pipeline options that are used by most jobs.

Java

Field Type Description Default value Template support
dataflowServiceOptions String Enables retroactively using GA service features on previously released SDKs. Requires Apache Beam SDK 2.29.0 or later. For example, to use hot key logging as a service option, set dataflowServiceOptions=hotKeyLoggingEnabled.
gcpTempLocation String Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. Supported in Flex Templates.
jobName String The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Also used when updating an existing pipeline. Dataflow generates a unique name automatically.
project String The project ID for your Google Cloud project. This is required if you want to run your pipeline using the Dataflow managed service. If not set, defaults to the currently configured project in the Cloud SDK.
region String Specifies a regional endpoint for deploying your Dataflow jobs. If not set, defaults to us-central1.
runner Class (NameOfRunner) The PipelineRunner to use. This option allows you to determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner. DirectRunner (local mode)
stagingLocation String Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. If not set, defaults to what you specified for tempLocation.
streaming boolean Specifies whether streaming mode is enabled or disabled; true if enabled. If your pipeline reads from an unbounded source, the default value is true; otherwise, false.

Python

Field Type Description Default value Template support
dataflow_service_options str Enables retroactively using GA service features on previously released SDKs. Requires Apache Beam SDK 2.29.0 or later. For example, to use hot key logging as a service option, set dataflow_service_options=enable_hot_key_logging.
job_name str The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Dataflow generates a unique name automatically.
project str The project ID for your Google Cloud project. This is required if you want to run your pipeline using the Dataflow managed service. If not set, throws an error.
staging_location str Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. If not set, defaults to a staging directory within temp_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud.
region str Specifies a regional endpoint for deploying your Dataflow jobs. If not set, defaults to us-central1.
runner str The PipelineRunner to use. This option allows you to determine the PipelineRunner at runtime. To run your pipeline on Dataflow, use DataflowRunner. To run your pipeline locally, use DirectRunner. DirectRunner (local mode)
streaming bool Specifies whether streaming mode is enabled or disabled; true if enabled. false
temp_location str Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://BUCKET-NAME/. You must specify either temp_location or staging_location (or both). If temp_location is not set, temp_location defaults to the value for staging_location. Supported in Flex Templates.
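
For example, the following Python snippet is a minimal sketch of setting these basic options programmatically. The project ID, bucket, and job name are placeholders, and the short Create/Map pipeline is purely illustrative.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and job name; replace with your own values.
options = PipelineOptions([
    '--runner=DataflowRunner',                    # use DirectRunner to run locally
    '--project=my-project-id',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/temp',        # at least one of temp_location
    '--staging_location=gs://my-bucket/staging',  # or staging_location is required
    '--job_name=example-job',                     # optional; Dataflow generates one if omitted
])

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(['hello', 'world']) | beam.Map(print)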

Resource utilization

This table describes pipeline options that you can set to manage resource utilization.

Java

Field Type Description Default value Template support
autoscalingAlgorithm String The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that do not use Streaming Engine.
flexRSGoal String Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the numWorkers, autoscalingAlgorithm, zone, region, and workerMachineType parameters. For more information, see the FlexRS pipeline options section. If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
maxNumWorkers int The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by numWorkers) to allow your job to scale up, automatically or otherwise. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates.
numberOfWorkerHarnessThreads int The number of threads per worker harness process. If unspecified, the Dataflow service determines an appropriate number of threads per worker. Supported in Flex Templates.
numWorkers int The initial number of Google Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates.

Python

Field Type Description Default value Template support
autoscaling_algorithm str The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. Defaults to THROUGHPUT_BASED for all batch Dataflow jobs, and for streaming jobs that use Streaming Engine. Defaults to NONE for streaming jobs that do not use Streaming Engine.
flexrs_goal str Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the num_workers, autoscaling_algorithm, zone, region, and machine_type parameters. For more information, see the FlexRS pipeline options section. If unspecified, defaults to SPEED_OPTIMIZED, which is the same as omitting this flag. To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
max_num_workers int The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by num_workers) to allow your job to scale up, automatically or otherwise. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates.
number_of_worker_harness_threads int The number of threads per worker harness process. If unspecified, the Dataflow service determines an appropriate number of threads per worker. To use this parameter, you also need to set the option --experiments=use_runner_v2. Supported in Flex Templates.
experiments=no_use_multiple_container_images Configures Dataflow worker VMs to start only one containerized Python process. If not specified, Dataflow starts one Apache Beam SDK process per VM core. This experiment only affects Python pipelines that use Dataflow Runner v2. Supported. Can be set by the template or by using the --additional_experiments option.
num_workers int The number of Compute Engine instances to use when executing your pipeline. If unspecified, the Dataflow service determines an appropriate number of workers. Supported in Flex Templates.
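
As an illustration, the worker-count and autoscaling options above can be passed the same way; the numbers below are arbitrary example values, not recommendations.

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

options = PipelineOptions([
    '--autoscaling_algorithm=THROUGHPUT_BASED',  # or NONE to disable autoscaling
    '--num_workers=5',                           # initial number of workers
    '--max_num_workers=20',                      # upper bound for scaling up
])

# The same values are visible through the WorkerOptions view.
worker_options = options.view_as(WorkerOptions)
print(worker_options.num_workers, worker_options.max_num_workers)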

Debugging

This table describes pipeline options you can use to debug your job.

Java

Field Type Description Default value Template support
hotKeyLoggingEnabled boolean

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

If not set, only the presence of a hot key is logged.

Python

Field Type Description Default value Template support
enable_hot_key_logging bool

Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project.

Requires Dataflow Runner V2 and Apache Beam SDK 2.29.0 or later. Must be set as a service option, using the format dataflow_service_options=enable_hot_key_logging.

If not set, only the presence of a hot key is logged.
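
For example, a sketch of enabling hot key logging as a service option in Python, assuming Apache Beam SDK 2.29.0 or later and Runner v2 as noted above:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--dataflow_service_options=enable_hot_key_logging',  # log the literal hot keys
])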

Security and networking

This table describes pipeline options for controlling your account and networking.

Java

Field Type Description Default value Template support
dataflowKmsKey String Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify gcpTempLocation to use this feature. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. Supported in Flex Templates.
network String The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named default. Supported in Flex Templates.
serviceAccount String Specifies a user-managed controller service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Controller service account section of the Dataflow security and permissions page. If not set, workers use your project's Compute Engine service account as the controller service account. Supported in Flex Templates.
subnetwork String The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value. Supported in Flex Templates.
usePublicIps boolean Specifies whether Dataflow workers use public IP addresses. If the value is set to false, Dataflow workers use private IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. Public IP addresses have an associated cost. If not set, the default value is true and Dataflow workers use public IP addresses.

Python

Field Type Description Default value Template support
dataflow_kms_key str Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify temp_location to use this feature. If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. Supported in Flex Templates.
network str The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. If not set, Google Cloud assumes that you intend to use a network named default. Supported in Flex Templates.
service_account_email str Specifies a user-managed controller service account, using the format my-service-account-name@<project-id>.iam.gserviceaccount.com. For more information, see the Controller service account section of the Dataflow security and permissions page. If not set, workers use your project's Compute Engine service account as the controller service account. Supported in Flex Templates.
subnetwork str The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. The Dataflow service determines the default value. Supported in Flex Templates.
use_public_ips bool Specifies whether Dataflow workers use public IP addresses. If the value is set to false, Dataflow workers use private IP addresses for all communication. In this case, if the subnetwork option is specified, the network option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. Public IP addresses have an associated cost. This option requires the Beam SDK for Python. The deprecated Dataflow SDK for Python does not support it. If not set, Dataflow workers use public IP addresses.
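
The following sketch shows how these options fit together in Python; the service account, subnetwork, KMS key, and bucket names are placeholders.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--service_account_email=my-service-account-name@my-project-id.iam.gserviceaccount.com',
    '--subnetwork=regions/us-central1/subnetworks/my-subnetwork',
    '--no_use_public_ips',  # workers use private IPs; Private Google Access must be enabled
    '--dataflow_kms_key=projects/my-project-id/locations/us-central1/keyRings/my-key-ring/cryptoKeys/my-key',
    '--temp_location=gs://my-bucket/temp',  # required when using a CMEK
])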

Streaming pipeline management

This table describes pipeline options that let you manage the state of your Dataflow pipelines across job instances.

Java

Field Type Description Default value Template support
createFromSnapshot String Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots. If not set, no snapshot is used to create a job.
enableStreamingEngine boolean Specifies whether Dataflow Streaming Engine is enabled or disabled; true if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, thus conserving CPU, memory, and Persistent Disk storage resources. The default value is false. This means that the steps of your streaming pipeline are executed entirely on worker VMs. Supported in Flex Templates.
update boolean Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. false

Python

Field Type Description Default value Template support
enable_streaming_engine bool Specifies whether Dataflow Streaming Engine is enabled or disabled; true if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, thus conserving CPU, memory, and Persistent Disk storage resources. The default value is false. This means that the steps of your streaming pipeline are executed entirely on worker VMs. Supported in Flex Templates.
update bool Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. false
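
As a sketch, launching a streaming job with Streaming Engine and later updating it might look like the following in Python; the job name is a placeholder and, for the update, must match the running job you want to replace.

from apache_beam.options.pipeline_options import PipelineOptions

# Initial launch with Streaming Engine enabled.
options = PipelineOptions([
    '--streaming',
    '--enable_streaming_engine',
    '--job_name=my-streaming-job',
])

# Later launch of the updated pipeline code that replaces the running job.
update_options = PipelineOptions([
    '--streaming',
    '--enable_streaming_engine',
    '--update',
    '--job_name=my-streaming-job',  # must match the job being updated
])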

Worker-level options

This table describes pipeline options that apply to the Dataflow worker level.

Java

Field Type Description Default value Template support
diskSizeGb int

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs.

For batch jobs using Dataflow Shuffle, this option sets the size of a worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected.

For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected. If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag streaming_boot_disk_size_gb. For example, specify --experiments=streaming_boot_disk_size_gb=80 to create boot disks of 80 GB.

Set to 0 to use the default size defined in your Google Cloud project.

If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB.

If a streaming job uses Streaming Engine, then the default is 30 GB; otherwise, the default is 400 GB.

Warning: Lowering the disk size reduces available shuffle I/O. Shuffle-bound jobs that don't use Dataflow Shuffle or Streaming Engine might see increased runtime and job cost.

filesToStage List<String> A non-empty list of local files, directories of files, or archives (such as JAR or zip files) to make available to each worker. If you set this option, then only those files you specify are uploaded (the Java classpath is ignored). You must specify all of your resources in the correct classpath order. Resources are not limited to code, but can also include configuration files and other resources to make available to all workers. Your code can access the listed resources using Java's standard resource lookup methods. Cautions: Specifying a directory path is suboptimal since Dataflow zips the files before uploading, which involves a higher startup time cost. Also, don't use this option to transfer data to workers that is meant to be processed by the pipeline, since doing so is significantly slower than using native Cloud Storage/BigQuery APIs combined with the appropriate Dataflow data source. If filesToStage is omitted, Dataflow infers the files to stage based on the Java classpath. The considerations and cautions mentioned in the description also apply here (types of files to list and how to access them from your code).
workerDiskType String The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd to specify an SSD Persistent Disk. For more information, see the Compute Engine API reference page for diskTypes. The Dataflow service determines the default value.
workerMachineType String

The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.

For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.

Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family.

The Dataflow service chooses the machine type based on your job if you do not set this option. Supported in Flex Templates.
workerRegion String

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned.

Note: This option cannot be combined with workerZone or zone.

If not set, defaults to the value set for region. Supported in Flex Templates.
workerZone String

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with workerRegion or zone.

If you specify either region or workerRegion, workerZone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates.
zone String (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates.

Python

Field Type Description Default value Template support
disk_size_gb int

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs.

For batch jobs using Dataflow Shuffle, this option sets the size of a worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected.

For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected. If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag streaming_boot_disk_size_gb. For example, specify --experiments=streaming_boot_disk_size_gb=80 to create boot disks of 80 GB.

Set to 0 to use the default size defined in your Google Cloud project.

If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB.

If a streaming job uses Streaming Engine, then the default is 30 GB; otherwise, the default is 400 GB.

Warning: Lowering the disk size reduces available shuffle I/O. Shuffle-bound jobs that don't use Dataflow Shuffle or Streaming Engine might see increased runtime and job cost.

worker_disk_type str The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd to specify an SSD Persistent Disk. For more information, see the Compute Engine API reference page for diskTypes. The Dataflow service determines the default value.
machine_type str

The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.

For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.

Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family.

The Dataflow service chooses the machine type based on your job if you do not set this option. Supported in Flex Templates.
worker_region str

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for worker_region is automatically assigned.

Note: This option cannot be combined with worker_zone or zone.

If not set, defaults to the value set for region. Supported in Flex Templates.
worker_zone str

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with worker_region or zone.

If you specify either region or worker_region, worker_zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates.
zone str (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. If you specify region, zone defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. Supported in Flex Templates.
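
For illustration, a Python sketch of the worker-level options; the machine type, disk settings, and region are example values only.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--machine_type=n1-standard-4',   # Compute Engine machine type for worker VMs
    '--disk_size_gb=50',              # specify at least 30 GB when set explicitly
    '--worker_disk_type=compute.googleapis.com/projects/my-project-id/zones/us-central1-f/diskTypes/pd-ssd',
    '--worker_region=us-central1',    # don't combine with worker_zone or zone
])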

Setting other local pipeline options

When executing your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient.

Java

You can find the default values for Java PipelineOptions in the Java API reference; see the PipelineOptions class listing for complete details.

If your pipeline uses Google Cloud services such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use GcpOptions.setProject to set your Google Cloud Project ID. You may also need to set credentials explicitly. See the GcpOptions class for complete details.

Python

You can find the default values for Python PipelineOptions in the Python API reference; see the PipelineOptions module listing for complete details.

If your pipeline uses Google Cloud services such as BigQuery or Cloud Storage for I/O, you might need to set certain Google Cloud project and credential options. In such cases, you should use options.view_as(GoogleCloudOptions).project to set your Google Cloud Project ID. You may also need to set credentials explicitly. See the GoogleCloudOptions class for complete details.
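
For example, a sketch of a locally executed pipeline that still sets a Google Cloud project for I/O; the project ID is a placeholder, and credentials are assumed to come from the environment.

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DirectRunner'       # run locally
options.view_as(GoogleCloudOptions).project = 'my-project-id'  # placeholder project ID
# Credentials are typically resolved from the environment, for example after
# running `gcloud auth application-default login`.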