Use Dataflow Shuffle for batch jobs

Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. The Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.
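Conceptually, a shuffle repartitions records and groups together all values that share a key, which is what transforms like GroupByKey rely on. The following is a minimal in-memory sketch of that grouping step, not the Dataflow implementation itself:

```python
from collections import defaultdict

def group_by_key(records):
    """Group (key, value) pairs by key, as a GroupByKey-style shuffle does."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)

records = [("a", 1), ("b", 2), ("a", 3)]
print(group_by_key(records))  # {'a': [1, 3], 'b': [2]}
```

In a real pipeline this grouping runs distributed across many machines; with Dataflow Shuffle, that work happens in the service backend rather than on the worker VMs.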

Batch jobs use Dataflow Shuffle by default.

Benefits of Dataflow Shuffle

The service-based Dataflow Shuffle has the following benefits:

  • Faster execution time of batch pipelines for the majority of pipeline job types.
  • A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
  • Better Horizontal Autoscaling because VMs no longer hold any shuffle data and can therefore be scaled down earlier.
  • Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data doesn't cause the entire job to fail, as it would without the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge associated with the use of Dataflow Shuffle. However, the total bill for Dataflow pipelines using the service-based Dataflow Shuffle implementation is expected to be less than or equal to the cost of Dataflow pipelines that do not use this option.

For the majority of pipeline job types, Dataflow Shuffle is expected to execute faster than the shuffle implementation running on worker VMs. However, the execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline. In addition, consider requesting a bigger quota for Shuffle.

Exceptions for using Dataflow Shuffle

Batch jobs use Dataflow Shuffle by default. When Dataflow Shuffle is used, the default boot disk size for each batch job is reduced from 250 GB to 25 GB. For some batch jobs, you might need to turn off Dataflow Shuffle or modify the disk size. Consider the following:

  • A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity may fail when you use Dataflow Shuffle.
  • Jobs that use a lot of disk I/O may be slow due to the performance of the small disk. For more information about performance differences between disk sizes, see the [Compute Engine Persistent Disk Performance](/compute/docs/disks/performance) page.

To specify a larger disk size for a Dataflow Shuffle job, you can use the --disk_size_gb parameter.
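As a hedged sketch of how that flag is passed, the snippet below parses the relevant command-line options with the standard library's argparse; a real pipeline would hand the same flags to the Beam SDK's pipeline options instead, and the runner and region values here are placeholders:

```python
import argparse

# Sketch only: mirrors the shape of Dataflow pipeline flags.
# Real pipelines pass these flags to apache_beam's PipelineOptions.
parser = argparse.ArgumentParser()
parser.add_argument("--runner")
parser.add_argument("--region")
parser.add_argument("--disk_size_gb", type=int)

args = parser.parse_args([
    "--runner=DataflowRunner",
    "--region=us-central1",
    "--disk_size_gb=50",  # larger than the 25 GB Dataflow Shuffle default
])
print(args.disk_size_gb)  # 50
```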

To turn off Dataflow Shuffle, see the following section.

Use Dataflow Shuffle

This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations. If you use Dataflow Shuffle, the workers must be deployed in the same region as the regional endpoint.

Java

Batch jobs use Dataflow Shuffle by default. To opt out of using Dataflow Shuffle, specify the following pipeline option: --experiments=shuffle_mode=appliance.

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline option. Instead, specify the region and set the value to one of the regions where Shuffle is currently available. Dataflow autoselects the zone in the region that you specify. If you specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.

Python

Batch jobs use Dataflow Shuffle by default. To opt out of using Dataflow Shuffle, specify the following pipeline option: --experiments=shuffle_mode=appliance.

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline option. Instead, specify the region and set the value to one of the regions where Shuffle is currently available. Dataflow autoselects the zone in the region that you specify. If you specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.
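The opt-out flag and the region-only advice above can be sketched with a stdlib parse of the same flags; this is an illustration of the flag shapes, not a real pipeline launch, and the region value is a placeholder:

```python
import argparse

# Sketch only: the flags a batch pipeline passes to opt out of
# service-based Dataflow Shuffle. Names mirror Beam's pipeline options.
parser = argparse.ArgumentParser()
parser.add_argument("--experiments", action="append", default=[])
parser.add_argument("--region")
parser.add_argument("--zone")

args = parser.parse_args([
    "--experiments=shuffle_mode=appliance",  # opt out of Dataflow Shuffle
    "--region=us-central1",                  # set the region only...
])
# ...and leave --zone unset so Dataflow autoselects a zone in the region.
print(args.experiments)  # ['shuffle_mode=appliance']
```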

Go

Batch jobs use Dataflow Shuffle by default. To opt out of using Dataflow Shuffle, specify the following pipeline option: --experiments=shuffle_mode=appliance.

If you use Dataflow Shuffle for your pipeline, do not specify the zone pipeline option. Instead, specify the region and set the value to one of the regions where Shuffle is currently available. Dataflow autoselects the zone in the region that you specify. If you specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.