This document describes how to configure the worker VMs for a Dataflow job.
By default, Dataflow selects the machine type for the worker VMs that run your job, along with the size and type of Persistent Disk. To configure the worker VMs, set the following pipeline options when you create the job.
Machine type
The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use x86 or Arm machine types, including custom machine types.
Java
Set the workerMachineType pipeline option.
Python
Set the machine_type pipeline option.
Go
Set the worker_machine_type pipeline option.
For Arm, the Tau T2A machine series is supported. For more information about using Arm VMs, see Use Arm VMs in Dataflow.
Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement. Billing is independent of the machine type family. For more information, see Dataflow pricing.
Custom machine types
To specify a custom machine type, use the following format: FAMILY-vCPU-MEMORY. Replace the following:
- FAMILY. Use one of the following values, based on the machine series:
  - N1: custom
  - N2: n2-custom
  - N2D: n2d-custom
  - N4: n4-custom
  - E2: e2-custom
- vCPU. The number of vCPUs.
- MEMORY. The memory, in MB.
To enable extended memory, append -ext to the machine type. Examples: n2-custom-6-3072, n2-custom-2-32768-ext.
For more information about valid custom machine types, see Custom machine types in the Compute Engine documentation.
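For example, with the Python SDK you might pass a custom machine type with extended memory as command-line style flags. This is a sketch only; the 2 vCPU / 32768 MB values and the project details are illustrative placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: n2-custom-2-32768-ext = N2 custom machine type, 2 vCPUs,
# 32768 MB of memory, with extended memory enabled by the -ext suffix.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://BUCKET/temp",
    "--machine_type=n2-custom-2-32768-ext",
])
```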
Disk type
The type of Persistent Disk to use.
Don't specify a Persistent Disk when using Streaming Engine.
Java
Set the workerDiskType pipeline option.
Python
Set the worker_disk_type pipeline option.
Go
Set the disk_type pipeline option.
To specify the disk type, use the following format: compute.googleapis.com/projects/PROJECT_ID/zones/ZONE/diskTypes/DISK_TYPE. Replace the following:
- PROJECT_ID: your project ID
- ZONE: the zone for the Persistent Disk, for example us-central1-b
- DISK_TYPE: the disk type, either pd-ssd or pd-standard
For more information, see the Compute Engine API reference page for diskTypes.
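For example, with the Python SDK the full disk type path might be set as follows. This is a sketch; PROJECT_ID, the zone, and the other option values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: request pd-ssd Persistent Disks for the workers. The disk type
# must be given as the full Compute Engine resource path.
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://BUCKET/temp",
    worker_disk_type=(
        "compute.googleapis.com/projects/PROJECT_ID/"
        "zones/us-central1-b/diskTypes/pd-ssd"
    ),
)
```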
Disk size
The Persistent Disk size.
Java
Set the diskSizeGb pipeline option.
Python
Set the disk_size_gb pipeline option.
Go
Set the disk_size_gb pipeline option.
If you set this option, specify at least 30 GB to account for the worker boot image and local logs.
Lowering the disk size reduces available shuffle I/O. For shuffle-bound jobs that don't use Dataflow Shuffle or Streaming Engine, a smaller disk can increase runtime and job cost.
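For example, with the Python SDK you might set the disk size as follows. A minimal sketch; the 50 GB value and the project details are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: request 50 GB Persistent Disks (at least 30 GB is needed for
# the worker boot image and local logs).
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://BUCKET/temp",
    disk_size_gb=50,
)
```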
Batch jobs
For batch jobs using Dataflow Shuffle, this option sets the size of each worker VM's boot disk. For batch jobs not using Dataflow Shuffle, this option sets the size of the disks used to store shuffled data; the boot disk size is not affected.
If a batch job uses Dataflow Shuffle, then the default disk size is 25 GB. Otherwise, the default is 250 GB.
Streaming jobs
For streaming jobs using Streaming Engine, this option sets the size of the boot disks. For streaming jobs not using Streaming Engine, this option sets the size of each additional Persistent Disk created by the Dataflow service; the boot disk is not affected.
If a streaming job does not use Streaming Engine, you can set the boot disk size with the experiment flag streaming_boot_disk_size_gb. For example, specify --experiments=streaming_boot_disk_size_gb=80 to create boot disks of 80 GB.
If a streaming job uses Streaming Engine, then the default disk size is 30 GB. Otherwise, the default is 400 GB.
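For example, with the Python SDK a streaming job that does not use Streaming Engine might set the boot disk size through the experiment flag as follows. A sketch only; the project, region, and bucket values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: streaming job without Streaming Engine, with 80 GB worker
# boot disks requested through the experiment flag.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=us-central1",
    "--temp_location=gs://BUCKET/temp",
    "--streaming",
    "--experiments=streaming_boot_disk_size_gb=80",
])
```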