This document describes how to configure the worker VMs for a Dataflow job.
By default, Dataflow selects the machine type for the worker VMs that run your job, along with the size and type of Persistent Disk. To configure the worker VMs, set the following pipeline options when you create the job.
Machine type
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use x86 or Arm machine types, including custom machine types.
Java
Set the workerMachineType pipeline option.
Python
Set the machine_type pipeline option.
Go
Set the worker_machine_type pipeline option.
For Arm, the Tau T2A machine series is supported. For more information about using Arm VMs, see Use Arm VMs in Dataflow.
Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Billing is independent of the machine type family. For more information, see Dataflow pricing.
Custom machine types
To specify a custom machine type, use the following format: FAMILY-vCPU-MEMORY. Replace the following:
- FAMILY. Use one of the following values, based on the machine series:
  - N1: custom
  - N2: n2-custom
  - N2D: n2d-custom
  - N4: n4-custom
  - E2: e2-custom
- vCPU. The number of vCPUs.
- MEMORY. The memory, in MB.
To enable extended memory, append -ext to the machine type. Examples: n2-custom-6-3072, n2-custom-2-32768-ext.
For more information about valid custom machine types, see Custom machine types in the Compute Engine documentation.
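For example, with the Python SDK you might pass a custom machine type with extended memory as command-line style flags. This is a sketch only; the 2 vCPU / 32768 MB values and the project details are illustrative placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: n2-custom-2-32768-ext = N2 custom machine type, 2 vCPUs,
# 32768 MB of memory, with extended memory enabled by the -ext suffix.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://BUCKET/temp",
    "--machine_type=n2-custom-2-32768-ext",
])
```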
Disk type
The type of Persistent Disk to use.
Don't specify a Persistent Disk when using Streaming Engine.
Java
Set the workerDiskType pipeline option.
Python
Set the worker_disk_type pipeline option.
Go
Set the disk_type pipeline option.
To specify the disk type, use the following format: compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/DISK_TYPE. Replace the following:
- PROJECT_ID: your project ID
- ZONE: the zone for the Persistent Disk, for example us-central1-b
- DISK_TYPE: the disk type, either pd-ssd or pd-standard
For more information, see the Compute Engine API reference page for diskTypes.
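For example, with the Python SDK the full disk type path might be set as follows. This is a sketch; PROJECT_ID, the zone, and the other option values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: request pd-ssd Persistent Disks for the workers. The disk type
# must be given as the full Compute Engine resource path.
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://BUCKET/temp",
    worker_disk_type=(
        "compute.googleapis.com/projects/PROJECT_ID/"
        "zones/us-central1-b/diskTypes/pd-ssd"
    ),
)
```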
Disk size
The Persistent Disk size.
Java
Set the diskSizeGb pipeline option.
Python
Set the disk_size_gb pipeline option.
Go
Set the disk_size_gb pipeline option.
If you set this option, specify at least 30 GB to account for the worker boot image and local logs.
Lowering the disk size reduces available shuffle I/O. For shuffle-bound jobs that don't use Dataflow Shuffle or Streaming Engine, a smaller disk can increase runtime and job cost.
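For example, with the Python SDK you might set the disk size as follows. A minimal sketch; the 50 GB value and the project details are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: request 50 GB Persistent Disks (at least 30 GB is needed for
# the worker boot image and local logs).
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://BUCKET/temp",
    disk_size_gb=50,
)
```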
Batch jobs
For batch jobs using Dataflow Shuffle, this option sets the size of each worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected.
If a batch job uses Dataflow Shuffle, then the default disk size is 25 GB. Otherwise, the default is 250 GB.
Streaming jobs
For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected.
If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag streaming_boot_disk_size_gb. For example, specify --experiments=streaming_boot_disk_size_gb=80 to create boot disks of 80 GB.
If a streaming job uses Streaming Engine, then the default disk size is 30 GB. Otherwise, the default is 400 GB.
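For example, with the Python SDK a streaming job that does not use Streaming Engine might set the boot disk size through the experiment flag as follows. A sketch only; the project, region, and bucket values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: streaming job without Streaming Engine, with 80 GB worker
# boot disks requested through the experiment flag.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://BUCKET/temp",
    "--streaming",
    "--experiments=streaming_boot_disk_size_gb=80",
])
```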