This page documents Dataflow pipeline options. For information on how to use these options, read Setting pipeline options.
Basic options
This table describes basic pipeline options that are used by many jobs.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`dataflowServiceOptions` | `String` | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. | | |
`enableStreamingEngine` | `boolean` | Specifies whether streaming mode is enabled or disabled; `true` if enabled. | If your pipeline reads from an unbounded source, the default value is `true`. Otherwise, `false`. | |
`experiments` | `String` | Enables experimental or pre-GA Dataflow features, using the following syntax: `--experiments=experiment`. | | |
`jobName` | `String` | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. Also used when updating an existing pipeline. | Dataflow generates a unique name automatically. | |
`labels` | `String` | User-defined labels. User-specified labels are available in billing exports, which can be used for cost attribution. Specify a string containing a list of "key": value pairs. Example: `--labels='{ "name": "wrench", "mass": "1_3kg", "count": "3" }'`. | | Supported in Flex Templates. |
`project` | `String` | The project ID for your Google Cloud project. This is required if you want to run your pipeline using the Dataflow managed service. | If not set, defaults to the currently configured project in the gcloud CLI. | |
`region` | `String` | Specifies a regional endpoint for deploying your Dataflow jobs. | If not set, defaults to `us-central1`. | |
`runner` | `Class` (NameOfRunner) | The `PipelineRunner` to use. This option allows you to determine the `PipelineRunner` at runtime. To run your pipeline on Dataflow, use `DataflowRunner`. To run your pipeline locally, use `DirectRunner`. | `DirectRunner` (local mode) | |
`stagingLocation` | `String` | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | If not set, defaults to what you specified for `tempLocation`. | |
`tempLocation` | `String` | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | | Supported in Flex Templates. |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`dataflow_service_options` | `str` | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.29.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. | | |
`experiments` | `str` | Enables experimental or pre-GA Dataflow features, using the following syntax: `--experiments=experiment`. | | |
`enable_streaming_engine` | `bool` | Specifies whether streaming mode is enabled or disabled; `true` if enabled. | `false` | |
`job_name` | `str` | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. | Dataflow generates a unique name automatically. | |
`labels` | `str` | User-defined labels. User-specified labels are available in billing exports, which can be used for cost attribution. Specify a string containing a list of "key": value pairs. Example: `--labels='{ "name": "wrench", "mass": "1_3kg", "count": "3" }'`. | | Supported in Flex Templates. |
`pickle_library` | `str` | The pickle library to use for data serialization. Supported values are `dill`, `cloudpickle`, and `default`. To use the `cloudpickle` option, set the option both at the start of the code and as a pipeline option. Setting the option in both places is necessary because pickling starts as soon as PTransforms are constructed, which happens before pipeline construction. To set it at the start of the code, add lines similar to the following: `from apache_beam.internal import pickler` and `pickler.set_library(pickler.USE_CLOUDPICKLE)`. | If not set, defaults to `dill`. | |
`project` | `str` | The project ID for your Google Cloud project. This is required if you want to run your pipeline using the Dataflow managed service. | If not set, throws an error. | |
`region` | `str` | Specifies a regional endpoint for deploying your Dataflow jobs. | If not set, defaults to `us-central1`. | |
`runner` | `str` | The `PipelineRunner` to use. This option allows you to determine the `PipelineRunner` at runtime. To run your pipeline on Dataflow, use `DataflowRunner`. To run your pipeline locally, use `DirectRunner`. | `DirectRunner` (local mode) | |
`sdk_location` | `str` | Path to the Apache Beam SDK. Must be a valid URL, Cloud Storage path, or local file path to an Apache Beam SDK tar archive file. To install the Apache Beam SDK from within a container, use the value `container`. | If not set, defaults to the current version of the Apache Beam SDK. | Supported in Flex Templates. |
`staging_location` | `str` | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | If not set, defaults to a staging directory within `temp_location`. You must specify at least one of `temp_location` or `staging_location` to run your pipeline on Google Cloud. | |
`temp_location` | `str` | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | You must specify either `temp_location` or `staging_location` (or both). If `temp_location` is not set, it defaults to the value for `staging_location`. | Supported in Flex Templates. |
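These options can also be set programmatically. As a minimal sketch (the project ID, region, bucket, and job name below are placeholders, not values from this page):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, bucket, and job name; replace with your own.
options = PipelineOptions(
    runner="DataflowRunner",              # run on the Dataflow managed service
    project="my-project-id",              # Google Cloud project ID
    region="us-central1",                 # regional endpoint for the job
    job_name="example-job",               # shown in the Dataflow jobs list
    temp_location="gs://my-bucket/temp",  # Cloud Storage path for temporary files
)
```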
Go
Field | Type | Description | Default value |
---|---|---|---|
`dataflow_service_options` | `str` | Specifies additional job modes and configurations. Also provides forward compatibility for SDK versions that don't have explicit pipeline options for later Dataflow features. Requires Apache Beam SDK 2.40.0 or later. To set multiple service options, specify a comma-separated list of options. For a list of supported options, see Service options. | |
`experiments` | `str` | Enables experimental or pre-GA Dataflow features. For example, to enable the Monitoring agent, set `--experiments=enable_stackdriver_agent_metrics`. | |
`job_name` | `str` | The name of the Dataflow job being executed as it appears in the Dataflow jobs list and job details. | Dataflow generates a unique name automatically. |
`project` | `str` | The project ID for your Google Cloud project. This is required if you want to run your pipeline using the Dataflow managed service. | If not set, returns an error. |
`region` | `str` | Specifies a regional endpoint for deploying your Dataflow jobs. | If not set, returns an error. |
`runner` | `str` | The `PipelineRunner` to use. This option allows you to determine the `PipelineRunner` at runtime. To run your pipeline on Dataflow, use `dataflow`. To run your pipeline locally, use `direct`. | `direct` (local mode) |
`staging_location` | `str` | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | If not set, returns an error. |
`temp_location` | `str` | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with `gs://BUCKET-NAME/`. | If `temp_location` is not set, it defaults to the value for `staging_location`. |
Resource utilization
This table describes pipeline options that you can set to manage resource utilization.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`autoscalingAlgorithm` | `String` | The autoscaling mode for your Dataflow job. Possible values are `THROUGHPUT_BASED` to enable autoscaling, or `NONE` to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. | Defaults to `THROUGHPUT_BASED` for all batch Dataflow jobs and for streaming jobs that use Streaming Engine. Defaults to `NONE` for streaming jobs that do not use Streaming Engine. | |
`flexRSGoal` | `String` | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the `numWorkers`, `autoscalingAlgorithm`, `zone`, `region`, and `workerMachineType` parameters. For more information, see the FlexRS pipeline options section. | If unspecified, defaults to `SPEED_OPTIMIZED`, which is the same as omitting this flag. To turn on FlexRS, you must specify the value `COST_OPTIMIZED` to allow the Dataflow service to choose any available discounted resources. | |
`maxNumWorkers` | `int` | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by `numWorkers`) to allow your job to scale up, automatically or otherwise. | If unspecified, the Dataflow service determines an appropriate number of workers. | Supported in Flex Templates. |
`numberOfWorkerHarnessThreads` | `int` | The number of threads per worker harness process. | If unspecified, the Dataflow service determines an appropriate number of threads per worker. | Supported in Flex Templates. |
`numWorkers` | `int` | The initial number of Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins. | If unspecified, the Dataflow service determines an appropriate number of workers. | Supported in Flex Templates. |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`autoscaling_algorithm` | `str` | The autoscaling mode for your Dataflow job. Possible values are `THROUGHPUT_BASED` to enable autoscaling, or `NONE` to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. | Defaults to `THROUGHPUT_BASED` for all batch Dataflow jobs and for streaming jobs that use Streaming Engine. Defaults to `NONE` for streaming jobs that do not use Streaming Engine. | |
`flexrs_goal` | `str` | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the `num_workers`, `autoscaling_algorithm`, `zone`, `region`, and `machine_type` parameters. For more information, see the FlexRS pipeline options section. | If unspecified, defaults to `SPEED_OPTIMIZED`, which is the same as omitting this flag. To turn on FlexRS, you must specify the value `COST_OPTIMIZED` to allow the Dataflow service to choose any available discounted resources. | |
`max_num_workers` | `int` | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by `num_workers`) to allow your job to scale up, automatically or otherwise. | If unspecified, the Dataflow service determines an appropriate number of workers. | Supported in Flex Templates. |
`number_of_worker_harness_threads` | `int` | The number of threads per worker harness process. | If unspecified, the Dataflow service determines an appropriate number of threads per worker. To use this parameter, you also need to set the option `--experiments=use_runner_v2`. | Supported in Flex Templates. |
`experiments=no_use_multiple_sdk_containers` | | Configures Dataflow worker VMs to start only one containerized Apache Beam Python SDK process. Does not decrease the total number of threads; therefore, all threads run in a single Apache Beam SDK process. Due to Python's [global interpreter lock (GIL)](https://wiki.python.org/moin/GlobalInterpreterLock), CPU utilization might be limited and performance reduced. When using this option with a worker machine type that has a large number of vCPU cores, consider reducing the number of worker harness threads to prevent workers from getting stuck. | If not specified, Dataflow starts one Apache Beam SDK process per VM core. This experiment only affects Python pipelines that use Dataflow Runner v2. | Supported. Can be set by the template or by using the `--additional_experiments` option. |
`experiments=use_sibling_sdk_workers` | | Configures Dataflow worker VMs to start all Python processes in the same container. | If not specified, Dataflow might start one Apache Beam SDK process per VM core in separate containers. This pipeline option only affects Python pipelines that use Dataflow Runner v2 and Apache Beam SDK versions 2.35.0 or later. This experiment has no effect if the pipeline option `experiments=no_use_multiple_sdk_containers` is specified. | Supported. Can be set by the template or by using the `--additional_experiments` option. |
`num_workers` | `int` | The number of Compute Engine instances to use when executing your pipeline. | If unspecified, the Dataflow service determines an appropriate number of workers. | Supported in Flex Templates. |
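As a sketch of how the Python worker-scaling options above might be combined (the counts here are illustrative placeholders, not recommendations):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values only; tune them for your own batch workload.
options = PipelineOptions(
    autoscaling_algorithm="THROUGHPUT_BASED",  # enable autoscaling
    num_workers=5,                             # initial worker count
    max_num_workers=50,                        # upper bound for scaling up
    flexrs_goal="COST_OPTIMIZED",              # opt in to FlexRS for batch jobs
)
```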
Go
Field | Type | Description | Default value |
---|---|---|---|
`autoscaling_algorithm` | `str` | The autoscaling mode for your Dataflow job. Possible values are `THROUGHPUT_BASED` to enable autoscaling, or `NONE` to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service. | Defaults to `THROUGHPUT_BASED` for all batch Dataflow jobs. |
`flexrs_goal` | `str` | Specifies Flexible Resource Scheduling (FlexRS) for autoscaled batch jobs. Affects the `num_workers`, `autoscaling_algorithm`, `zone`, `region`, and `worker_machine_type` parameters. Requires Apache Beam SDK 2.40.0 or later. For more information, see the FlexRS pipeline options section. | If unspecified, defaults to `SPEED_OPTIMIZED`, which is the same as omitting this flag. To turn on FlexRS, you must specify the value `COST_OPTIMIZED` to allow the Dataflow service to choose any available discounted resources. |
`max_num_workers` | `int` | The maximum number of Compute Engine instances to be made available to your pipeline during execution. This value can be higher than the initial number of workers (specified by `num_workers`) to allow your job to scale up, automatically or otherwise. | If unspecified, the Dataflow service determines an appropriate number of workers. |
`number_of_worker_harness_threads` | `int` | The number of threads per worker harness process. | If unspecified, the Dataflow service determines an appropriate number of threads per worker. |
`num_workers` | `int` | The number of Compute Engine instances to use when executing your pipeline. | If unspecified, the Dataflow service determines an appropriate number of workers. |
Debugging
This table describes pipeline options you can use to debug your job.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`hotKeyLoggingEnabled` | `boolean` | Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project. | If not set, only the presence of a hot key is logged. | |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`enable_hot_key_logging` | `bool` | Specifies that when a hot key is detected in the pipeline, the literal, human-readable key is printed in the user's Cloud Logging project. Requires Dataflow Runner v2 and Apache Beam SDK 2.29.0 or later. Must be set as a service option, using the format `--dataflow_service_options=enable_hot_key_logging`. | If not set, only the presence of a hot key is logged. | |
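Because hot key logging is passed as a service option rather than as a standalone flag, a Python launch might include it as follows (a minimal sketch, assuming Runner v2 and Apache Beam SDK 2.29.0 or later):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hot key logging is passed through dataflow_service_options.
options = PipelineOptions(
    flags=["--dataflow_service_options=enable_hot_key_logging"]
)
```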
Go
No debugging pipeline options are available.
Security and networking
This table describes pipeline options for controlling your account and networking.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`dataflowKmsKey` | `String` | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify `tempLocation` to use this feature. | If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. | Supported in Flex Templates. |
`gcpOauthScopes` | `List` | Specifies the OAuth scopes that are requested when creating the default Google Cloud credentials. Might have no effect if you manually specify the Google Cloud credential or credential factory. | If not set, the following scopes are used: "https://www.googleapis.com/auth/bigquery", "https://www.googleapis.com/auth/bigquery.insertdata", "https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/datastore", "https://www.googleapis.com/auth/devstorage.full_control", "https://www.googleapis.com/auth/pubsub", "https://www.googleapis.com/auth/userinfo.email" | |
`impersonateServiceAccount` | `String` | If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. You can specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. | If not set, workers use your project's Compute Engine service account as the controller service account. | |
`serviceAccount` | `String` | Specifies a user-managed controller service account, using the format `my-service-account-name@<project-id>.iam.gserviceaccount.com`. For more information, see the Controller service account section of the Dataflow security and permissions page. | If not set, workers use your project's Compute Engine service account as the controller service account. | Supported in Flex Templates. |
`network` | `String` | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. | If not set, Google Cloud assumes that you intend to use a network named `default`. | Supported in Flex Templates. |
`subnetwork` | `String` | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. | The Dataflow service determines the default value. | Supported in Flex Templates. |
`usePublicIps` | `boolean` | Specifies whether Dataflow workers use public IP addresses. If the value is set to `false`, Dataflow workers use private IP addresses for all communication. In this case, if the `subnetwork` option is specified, the `network` option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. Public IP addresses have an associated cost. | If not set, the default value is `true` and Dataflow workers use public IP addresses. | |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`dataflow_kms_key` | `str` | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify `temp_location` to use this feature. | If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. | Supported in Flex Templates. |
`gcp_oauth_scopes` | `list[str]` | Specifies the OAuth scopes that are requested when creating Google Cloud credentials. If set programmatically, must be set as a list of strings. | If not set, the following scopes are used: "https://www.googleapis.com/auth/bigquery", "https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/datastore", "https://www.googleapis.com/auth/devstorage.full_control", "https://www.googleapis.com/auth/spanner.admin", "https://www.googleapis.com/auth/spanner.data", "https://www.googleapis.com/auth/userinfo.email" | |
`impersonate_service_account` | `str` | If set, all API requests are made as the designated service account or as the target service account in an impersonation delegation chain. You can specify either a single service account as the impersonator, or a comma-separated list of service accounts to create an impersonation delegation chain. | If not set, workers use your project's Compute Engine service account as the controller service account. | |
`service_account_email` | `str` | Specifies a user-managed controller service account, using the format `my-service-account-name@<project-id>.iam.gserviceaccount.com`. For more information, see the Controller service account section of the Dataflow security and permissions page. | If not set, workers use your project's Compute Engine service account as the controller service account. | Supported in Flex Templates. |
`network` | `str` | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. | If not set, Google Cloud assumes that you intend to use a network named `default`. | Supported in Flex Templates. |
`subnetwork` | `str` | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. | The Dataflow service determines the default value. | Supported in Flex Templates. |
`use_public_ips` | `Optional[bool]` | Specifies whether Dataflow workers must use public IP addresses. Public IP addresses have an associated cost. To enable public IP addresses for Dataflow workers, specify the command-line flag `--use_public_ips` or set the option using the programmatic API, for example `options = PipelineOptions(use_public_ips=True)`. To make Dataflow workers use private IP addresses for all communication, specify the command-line flag `--no_use_public_ips` or set the option using the programmatic API, for example `options = PipelineOptions(use_public_ips=False)`. In this case, if the `subnetwork` option is specified, the `network` option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. | If the option is not explicitly enabled or disabled, the Dataflow workers use public IP addresses. | Supported in Flex Templates. |
`no_use_public_ips` | | Command-line flag that sets `use_public_ips` to `False`. See `use_public_ips`. | | Supported in Flex Templates. |
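A sketch of combining these security and networking options in Python; the service account, KMS key, and subnetwork below are placeholder resource names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resource names; substitute your own project, key ring, and subnet.
options = PipelineOptions(
    service_account_email="my-sa@my-project-id.iam.gserviceaccount.com",
    dataflow_kms_key=(
        "projects/my-project-id/locations/us-central1/"
        "keyRings/my-keyring/cryptoKeys/my-key"
    ),
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/my-project-id/"
        "regions/us-central1/subnetworks/my-subnet"
    ),
    use_public_ips=False,                 # workers use private IP addresses only
    temp_location="gs://my-bucket/temp",  # required when using a CMEK
)
```

The specified subnetwork must have Private Google Access enabled when `use_public_ips=False`.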
Go
Field | Type | Description | Default value |
---|---|---|---|
`dataflow_kms_key` | `str` | Specifies the usage and the name of a customer-managed encryption key (CMEK) used to encrypt data at rest. You can control the encryption key through Cloud KMS. You must also specify `temp_location` to use this feature. Requires Apache Beam SDK 2.40.0 or later. | If unspecified, Dataflow uses the default Google Cloud encryption instead of a CMEK. |
`network` | `str` | The Compute Engine network for launching Compute Engine instances to run your pipeline. See how to specify your network. | If not set, Google Cloud assumes that you intend to use a network named `default`. |
`service_account_email` | `str` | Specifies a user-managed controller service account, using the format `my-service-account-name@<project-id>.iam.gserviceaccount.com`. For more information, see the Controller service account section of the Dataflow security and permissions page. | If not set, workers use your project's Compute Engine service account as the controller service account. |
`subnetwork` | `str` | The Compute Engine subnetwork for launching Compute Engine instances to run your pipeline. See how to specify your subnetwork. | The Dataflow service determines the default value. |
`no_use_public_ips` | `bool` | Specifies that Dataflow workers must not use public IP addresses. If the value is set to `true`, Dataflow workers use private IP addresses for all communication. In this case, if the `subnetwork` option is specified, the `network` option is ignored. Make sure that the specified network or subnetwork has Private Google Access enabled. Public IP addresses have an associated cost. | If not set, Dataflow workers use public IP addresses. |
Streaming pipeline management
This table describes pipeline options that let you manage the state of your Dataflow pipelines across job instances.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`createFromSnapshot` | `String` | Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots. | If not set, no snapshot is used to create a job. | |
`enableStreamingEngine` | `boolean` | Specifies whether Dataflow Streaming Engine is enabled or disabled; `true` if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, thus conserving CPU, memory, and Persistent Disk storage resources. | The default value is `false`. This means that the steps of your streaming pipeline are executed entirely on worker VMs. | Supported in Flex Templates. |
`update` | `boolean` | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. | `false` | |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`create_from_snapshot` | `str` | Specifies the snapshot ID to use when creating a streaming job. Snapshots save the state of a streaming pipeline and allow you to start a new version of your job from that state. For more information on snapshots, see Using snapshots. | If not set, no snapshot is used to create a job. | |
`enable_streaming_engine` | `bool` | Specifies whether Dataflow Streaming Engine is enabled or disabled; `true` if enabled. Enabling Streaming Engine allows you to run the steps of your streaming pipeline in the Dataflow service backend, thus conserving CPU, memory, and Persistent Disk storage resources. | The default value is `false`. This means that the steps of your streaming pipeline are executed entirely on worker VMs. | Supported in Flex Templates. |
`update` | `bool` | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. | `false` | |
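For example, updating a running streaming job from Python might look like the following sketch (the job name is a placeholder and must match the job being replaced):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder job name; it must match the streaming job you are updating.
options = PipelineOptions(
    streaming=True,                # run in streaming mode
    update=True,                   # replace the existing job with this pipeline
    job_name="my-streaming-job",   # name of the job to update
    enable_streaming_engine=True,  # run steps in the Dataflow service backend
)
```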
Go
Field | Type | Description | Default value |
---|---|---|---|
`update` | `bool` | Replaces the existing job with a new job that runs your updated pipeline code. For more information, read Updating an existing pipeline. Requires Apache Beam SDK 2.40.0 or later. | `false` |
Worker-level options
This table describes pipeline options that apply to the Dataflow worker level.
Java
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`diskSizeGb` | `int` | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs. For batch jobs using Dataflow Shuffle, this option sets the size of a worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected. For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected. If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag `streaming_boot_disk_size_gb`. | Set to `0` to use the default size defined in your Google Cloud project. If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB. If a streaming job uses Streaming Engine, then the default is 30 GB; otherwise, the default is 400 GB. Warning: Lowering the disk size reduces available shuffle I/O. Shuffle-bound jobs not using Dataflow Shuffle or Streaming Engine might result in increased runtime and job cost. | |
`filesToStage` | `List<String>` | A non-empty list of local files, directories of files, or archives (such as JAR or zip files) to make available to each worker. If you set this option, then only those files you specify are uploaded (the Java classpath is ignored). You must specify all of your resources in the correct classpath order. Resources are not limited to code; they can also include configuration files and other resources to make available to all workers. Your code can access the listed resources using Java's standard resource lookup methods. Cautions: Specifying a directory path is suboptimal because Dataflow zips the files before uploading, which involves a higher startup time cost. Also, don't use this option to transfer data to workers that is meant to be processed by the pipeline, because doing so is significantly slower than using native Cloud Storage/BigQuery APIs combined with the appropriate Dataflow data source. | If `filesToStage` is omitted, Dataflow infers the files to stage based on the Java classpath. The considerations and cautions mentioned in the left column also apply here (types of files to list and how to access them from your code). | |
`workerDiskType` | `String` | The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use `compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd` to specify an SSD Persistent Disk. When using Streaming Engine, do not specify a Persistent Disk. For more information, see the Compute Engine API reference page for diskTypes. | The Dataflow service determines the default value. | |
`workerMachineType` | `String` | The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use `n1` machine types. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. | The Dataflow service chooses the machine type based on your job if you do not set this option. | Supported in Flex Templates. |
`workerRegion` | `String` | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `workerZone` or `zone`. | If not set, defaults to the value set for `region`. | Supported in Flex Templates. |
`workerZone` | `String` | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `workerRegion` or `zone`. | If you specify either `region` or `workerRegion`, `workerZone` defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. | Supported in Flex Templates. |
`zone` | `String` | (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. | If you specify `region`, `zone` defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. | Supported in Flex Templates. |
Python
Field | Type | Description | Default value | Template support |
---|---|---|---|---|
`disk_size_gb` | `int` | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs. For batch jobs using Dataflow Shuffle, this option sets the size of a worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected. For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected. If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag `streaming_boot_disk_size_gb`. | Set to `0` to use the default size defined in your Google Cloud project. If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB. If a streaming job uses Streaming Engine, then the default is 30 GB; otherwise, the default is 400 GB. Warning: Lowering the disk size reduces available shuffle I/O. Shuffle-bound jobs not using Dataflow Shuffle or Streaming Engine might result in increased runtime and job cost. | |
`worker_disk_type` | `str` | The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use `compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd` to specify an SSD Persistent Disk. When using Streaming Engine, do not specify a Persistent Disk. For more information, see the Compute Engine API reference page for diskTypes. | The Dataflow service determines the default value. | |
`machine_type` | `str` | The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use `n1` machine types. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. | The Dataflow service chooses the machine type based on your job if you do not set this option. | Supported in Flex Templates. |
`worker_region` | `str` | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `worker_zone` or `zone`. | If not set, defaults to the value set for `region`. | Supported in Flex Templates. |
`worker_zone` | `str` | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `worker_region` or `zone`. | If you specify either `region` or `worker_region`, `worker_zone` defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. | Supported in Flex Templates. |
`zone` | `str` | (Deprecated) For Apache Beam SDK 2.17.0 or earlier, this specifies the Compute Engine zone for launching worker instances to run your pipeline. | If you specify `region`, `zone` defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. | Supported in Flex Templates. |
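A sketch of setting worker-level options in Python; the machine type, disk size, and region are illustrative placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative values; adapt them to your own workload and location.
options = PipelineOptions(
    machine_type="n1-standard-4",  # Compute Engine machine type for worker VMs
    disk_size_gb=50,               # at least 30 GB for boot image and local logs
    worker_region="us-east1",      # run workers in a different region than the job
)
```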
Go
Field | Type | Description | Default value |
---|---|---|---|
`disk_size_gb` | `int` | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs. For batch jobs using Dataflow Shuffle, this option sets the size of a worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected. | Set to `0` to use the default size defined in your Google Cloud project. If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB. Warning: Lowering the disk size reduces available shuffle I/O. Shuffle-bound jobs not using Dataflow Shuffle might result in increased runtime and job cost. |
`disk_type` | `str` | The type of Persistent Disk to use, specified by a full URL of the disk type resource. For example, use `compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd` to specify an SSD Persistent Disk. For more information, see the Compute Engine API reference page for diskTypes. | The Dataflow service determines the default value. |
`worker_machine_type` | `str` | The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types. For best results, use `n1` machine types. Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. | The Dataflow service chooses the machine type based on your job if you do not set this option. |
`worker_region` | `str` | Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `worker_zone` or `zone`. | If not set, defaults to the value set for `region`. |
`worker_zone` | `str` | Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the `region` used to deploy, manage, and monitor jobs. Note: This option cannot be combined with `worker_region` or `zone`. | If you specify either `region` or `worker_region`, `worker_zone` defaults to a zone from the corresponding region. You can override this behavior by specifying a different zone. |
Setting other local pipeline options
When executing your pipeline locally, the default values for the properties in
PipelineOptions
are generally sufficient.
Java
You can find the default values for PipelineOptions
in the Beam SDK for Java
API reference; see the
PipelineOptions
class listing for complete details.
If your pipeline uses Google Cloud services such as BigQuery or
Cloud Storage for I/O, you might need to set certain
Google Cloud project and credential options. In such cases, you should
use GcpOptions.setProject
to set your Google Cloud Project ID. You may also
need to set credentials explicitly. See the
GcpOptions
class for complete details.
Python
You can find the default values for PipelineOptions
in the Beam SDK for
Python API reference; see the
PipelineOptions
module listing for complete details.
If your pipeline uses Google Cloud services such as
BigQuery or Cloud Storage for I/O, you might need to
set certain Google Cloud project and credential options. In such cases,
you should use options.view_as(GoogleCloudOptions).project
to set your
Google Cloud Project ID. You may also need to set credentials
explicitly. See the
GoogleCloudOptions
class for complete details.
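For example, a minimal sketch of setting the project and temporary storage through `GoogleCloudOptions` (the project ID and bucket are placeholders):

```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = "my-project-id"              # placeholder project ID
gcp_options.temp_location = "gs://my-bucket/temp"  # placeholder Cloud Storage path
```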
Go
You can find the default values for PipelineOptions
in the Beam SDK for
Go API reference; see
jobopts
for more details.