Dataflow service options

Service options are a type of pipeline option that lets you specify additional job modes and configurations for a Dataflow job. Set them by using the Dataflow service options pipeline option for your SDK language.

Java

--dataflowServiceOptions=SERVICE_OPTION

Replace SERVICE_OPTION with the service option that you want to use.
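As an illustration, a complete launch command for a Java pipeline might look like the following. The pipeline class, project, and region shown here are placeholders, not values from this page.

```shell
# Illustrative only: com.example.MyPipeline, PROJECT_ID, and REGION
# are placeholders. enable_prime is one of the service options
# listed later on this page.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=REGION \
    --dataflowServiceOptions=enable_prime"
```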

Python

--dataflow_service_options=SERVICE_OPTION

Replace SERVICE_OPTION with the service option that you want to use.
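As an illustration, a complete launch command for a Python pipeline might look like the following. The script name, project, and region are placeholders, not values from this page.

```shell
# Illustrative only: my_pipeline.py, PROJECT_ID, and REGION are
# placeholders. enable_google_cloud_profiler is one of the service
# options listed later on this page.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=PROJECT_ID \
  --region=REGION \
  --dataflow_service_options=enable_google_cloud_profiler
```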

Go

--dataflow_service_options=SERVICE_OPTION

Replace SERVICE_OPTION with the service option that you want to use.
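As an illustration, a complete launch command for a Go pipeline might look like the following. The file name, project, and region are placeholders, not values from this page.

```shell
# Illustrative only: main.go, PROJECT_ID, and REGION are placeholders.
# block_project_ssh_keys is one of the service options listed later
# on this page.
go run main.go \
  --runner=dataflow \
  --project=PROJECT_ID \
  --region=REGION \
  --dataflow_service_options=block_project_ssh_keys
```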

gcloud

Use the gcloud dataflow jobs run command with the additional-experiments option. If you're using Flex Templates, use the gcloud dataflow flex-template run command.

--additional-experiments=SERVICE_OPTION

For example:

gcloud dataflow jobs run JOB_NAME --additional-experiments=SERVICE_OPTION

Replace the following values:

  • JOB_NAME: the name of your Dataflow job
  • SERVICE_OPTION: the service option that you want to use
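For Flex Templates, the command takes the same flag. The template file location and job name below are placeholders, not values from this page.

```shell
# Illustrative only: JOB_NAME and the gs:// template location are
# placeholders.
gcloud dataflow flex-template run JOB_NAME \
  --template-file-gcs-location=gs://BUCKET/templates/TEMPLATE_FILE \
  --additional-experiments=SERVICE_OPTION
```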

REST

Use the additionalExperiments field in the RuntimeEnvironment object. If you're using Flex Templates, use the additionalExperiments field in the FlexTemplateRuntimeEnvironment object.

{
  "additionalExperiments": ["SERVICE_OPTION"]
  ...
}

Replace SERVICE_OPTION with the service option that you want to use.
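As a sketch of how the field fits into a classic-template launch request, the following writes a request body that carries additionalExperiments and checks that it is valid JSON. Treat the surrounding field names, such as jobName, as assumptions and confirm the exact request shape against the REST reference.

```shell
# Illustrative sketch: all field values are placeholders. The request
# shape (jobName and environment at the top level) is an assumption
# based on the RuntimeEnvironment object described above.
cat > request.json <<'EOF'
{
  "jobName": "JOB_NAME",
  "environment": {
    "additionalExperiments": ["SERVICE_OPTION"]
  }
}
EOF

# Sanity-check that the body parses as JSON.
python3 -m json.tool request.json
```

You would then send request.json as the body of the launch request, for example with curl against the templates launch endpoint.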

For more information, see Set Dataflow pipeline options.

Dataflow supports the following service options.

automatically_use_created_reservation

Uses Compute Engine reservations for the Dataflow workers. For more information, see Use Compute Engine reservations with Dataflow.

block_project_ssh_keys

Prevents VMs from accepting SSH keys that are stored in project metadata. For more information, see Restrict SSH keys from VMs.

enable_confidential_compute

Enables Confidential VM with AMD Secure Encryption Virtualization (SEV) on Dataflow worker VMs. For more information, see Confidential Computing concepts. This service option is not compatible with Dataflow Prime or worker accelerators. You must specify a supported machine type. When this option is enabled, the job incurs additional flat per-vCPU and per-GB costs. For more information, see Dataflow pricing.

enable_dynamic_thread_scaling

Enables dynamic thread scaling on Dataflow worker VMs. For more information, see Dynamic thread scaling.

enable_google_cloud_heap_sampling

Enables heap profiling. For more information, see Monitoring pipeline performance using Cloud Profiler.

enable_google_cloud_profiler

Enables performance profiling. For more information, see Monitoring pipeline performance using Cloud Profiler.

enable_preflight_validation

Before the job launches, Dataflow performs validation checks on the pipeline. This option is enabled by default. To disable pipeline validation, set this option to false. For more information, see Pipeline validation.

enable_prime

Enables Dataflow Prime for this job. For more information, see Use Dataflow Prime.

enable_streaming_engine_resource_based_billing

Enables resource-based billing for this job. For more information, see Pricing in "Use Streaming Engine for streaming jobs."

graph_validate_only

Runs a job graph validation check to verify whether a replacement job is valid. For more information, see Validate a replacement job.
max_workflow_runtime_walltime_seconds

The maximum number of seconds the job can run. If the job exceeds this limit, Dataflow cancels the job. This service option is supported for batch jobs only.

Specify the number of seconds as a parameter to the flag. For example:

--dataflowServiceOptions=max_workflow_runtime_walltime_seconds=300

sdf_checkpoint_after_duration

The maximum duration each worker buffers splittable DoFn (SDF) outputs before checkpointing for further processing. Set this duration when you want low-latency processing on pipelines that have low throughput per worker, such as when reading change streams from Spanner. The worker checkpoints when either the duration limit or the bytes limit is triggered, so you can use this service option with sdf_checkpoint_after_output_bytes or by itself.

This service option is supported for Streaming Engine jobs that use Runner v2.

Specify the duration as a parameter. For example, to change the default from 5 seconds to 500 milliseconds, use the following syntax:

--dataflowServiceOptions=sdf_checkpoint_after_duration=500ms

sdf_checkpoint_after_output_bytes

The maximum splittable DoFn (SDF) output bytes each worker produces and buffers before checkpointing for further processing. Set this value when you want low-latency processing on pipelines that have low throughput per worker, such as when reading change streams from Spanner. The worker checkpoints when either the duration limit or the bytes limit is triggered, so you can use this service option with sdf_checkpoint_after_duration or by itself.

This service option is supported for Streaming Engine jobs that use Runner v2.

Specify the number of bytes as a parameter. For example, to change the default from 5 MiB to 512 KiB, use the following syntax:

--dataflowServiceOptions=sdf_checkpoint_after_output_bytes=524288

streaming_mode_at_least_once Enables at-least-once streaming mode. For more information, see Set the pipeline streaming mode.
worker_accelerator

Enables GPUs for this job.

Specify the type and number of GPUs to attach to Dataflow workers as parameters to the flag. For a list of GPU types that are supported with Dataflow, see Dataflow support for GPUs. For example:

--dataflow_service_options "worker_accelerator=type:GPU_TYPE;count:GPU_COUNT;install-nvidia-driver"

If you're using NVIDIA Multi-Process Service (MPS), append the use_nvidia_mps parameter to the end of the list of parameters. For example:

"worker_accelerator=type:GPU_TYPE;count:GPU_COUNT;install-nvidia-driver;use_nvidia_mps"

For more information about using GPUs, see GPUs with Dataflow.