Use Dataflow Shuffle for batch jobs

Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. The Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.

Batch jobs use Dataflow Shuffle by default.

Benefits of Dataflow Shuffle

The service-based Dataflow Shuffle has the following benefits:

  • Faster execution time of batch pipelines for the majority of pipeline job types.
  • A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
  • Better Horizontal Autoscaling because VMs do not hold any shuffle data and can therefore be scaled down earlier.
  • Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data will not cause the entire job to fail, as would happen if not using the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge associated with the use of Dataflow Shuffle. The execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline.

Use Dataflow Shuffle

This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations. If you use the Dataflow Shuffle, the workers must also be deployed in the same region as the regional endpoint.

Java

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline options. Instead, specify the region and set the value to one of the available regions. Dataflow autoselects the zone in the region you specified. If you do specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.

Python

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline options. Instead, specify the region and set the value to one of the available regions. Dataflow autoselects the zone in the region you specified. If you do specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.

Go

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline options. Instead, specify the region and set the value to one of the available regions. Dataflow autoselects the zone in the region you specified. If you do specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.

The default boot disk size for each batch job is 25 GB. For some batch jobs, you might be required to modify the size of the disk. Consider the following:

  • A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow Shuffle.
  • Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see Compute Engine Persistent Disk Performance.

To specify a larger disk size for a Dataflow Shuffle job, you can use the --disk_size_gb parameter.