IntraBundleParallelization (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1

Class IntraBundleParallelization

  • java.lang.Object

  • public class IntraBundleParallelization
    extends Object
    Provides multi-threading of DoFns, using threaded execution to process multiple elements concurrently within a bundle.

    Note, that each Dataflow worker will already process multiple bundles concurrently and usage of this class is meant only for cases where processing elements from within a bundle is limited by blocking calls.

    CPU intensive or IO intensive tasks are in general a poor fit for parallelization. This is because a limited resource that is already maximally utilized does not benefit from sub-division of work. The parallelization will increase the amount of time to process each element yet the throughput for processing will remain relatively the same. For example, if the local disk (an IO resource) has a maximum write rate of 10 MiB/s, and processing each element requires to write 20 MiBs to disk, then processing one element to disk will take 2 seconds. Yet processing 3 elements concurrently (each getting an equal share of the maximum write rate) will take at least 6 seconds to complete (there is additional overhead in the extra parallelization).

    To parallelize a DoFn to 10 threads:

     PCollection<T> data = ...;
       IntraBundleParallelization.of(new MyDoFn())

    An uncaught exception from the wrapped DoFn will result in the exception being rethrown in later calls to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.processElement(<InputT, OutputT>.ProcessContext) or a call to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.finishBundle(<InputT, OutputT>.Context).