IntraBundleParallelization (Google Cloud Dataflow SDK 1.9.1 API)

Google Cloud Dataflow SDK for Java, version 1.9.1

Class IntraBundleParallelization

  • java.lang.Object

  • public class IntraBundleParallelization
    extends Object
    Provides multi-threading of DoFns, using threaded execution to process multiple elements concurrently within a bundle.

    Note that each Dataflow worker already processes multiple bundles concurrently. This class is meant only for cases where processing the elements within a bundle is limited by blocking calls.
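
    To illustrate why blocking calls are the intended use case, here is a minimal sketch in plain Java, independent of the Dataflow SDK. The class BlockingCallSketch, the method blockingLookup, and the 100 ms sleep are hypothetical stand-ins for a blocking call such as a remote request; this is not how the SDK itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingCallSketch {
    // Hypothetical blocking call: the thread waits and uses almost no CPU.
    public static String blockingLookup(String key) throws InterruptedException {
        Thread.sleep(100);
        return key.toUpperCase();
    }

    // Processes the elements of one "bundle" on up to maxParallelism threads
    // and returns the results in input order.
    public static List<String> processConcurrently(List<String> bundle, int maxParallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallelism);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String element : bundle) {
                pending.add(pool.submit(() -> blockingLookup(element)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> bundle = Arrays.asList("a", "b", "c", "d");
        long start = System.nanoTime();
        List<String> results = processConcurrently(bundle, 4);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results + " in about " + elapsedMillis + " ms");
    }
}
```

    Because the threads spend their time waiting rather than computing, the waits overlap, so the bundle should finish in much less time than the sum of the individual waits.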

    CPU-intensive or IO-intensive tasks are in general a poor fit for parallelization, because a limited resource that is already maximally utilized does not benefit from subdividing the work. Parallelization increases the time needed to process each element, while overall throughput stays roughly the same. For example, if the local disk (an IO resource) has a maximum write rate of 10 MiB/s and processing each element requires writing 20 MiB to disk, then processing one element takes 2 seconds. Processing 3 elements concurrently, each getting an equal share of the maximum write rate, takes at least 6 seconds to complete, and the extra parallelization adds further overhead.
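
    The arithmetic in the disk example can be checked with a short sketch. The class and method names are hypothetical, and the model assumes an ideal equal split of the write rate with no parallelization overhead.

```java
public class DiskThroughputExample {
    // Seconds to write `elements` elements one after another at the full write rate.
    public static double sequentialSeconds(int elements, double mibPerElement, double maxMibPerSec) {
        return elements * mibPerElement / maxMibPerSec;
    }

    // Seconds until each element completes when `concurrent` elements share
    // the write rate equally (each gets maxMibPerSec / concurrent).
    public static double sharedSeconds(int concurrent, double mibPerElement, double maxMibPerSec) {
        return mibPerElement * concurrent / maxMibPerSec;
    }

    public static void main(String[] args) {
        double rate = 10.0;  // MiB/s, from the example above
        double size = 20.0;  // MiB per element, from the example above
        System.out.println(sequentialSeconds(1, size, rate)); // 2.0
        System.out.println(sequentialSeconds(3, size, rate)); // 6.0
        System.out.println(sharedSeconds(3, size, rate));     // 6.0
    }
}
```

    Under this model, three elements take 6 seconds either way, but with concurrent writes no element completes before the 6-second mark, whereas sequential writes complete the first element after 2 seconds.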

    To parallelize a DoFn to 10 threads:

     PCollection<T> data = ...;
     data.apply(
       IntraBundleParallelization.of(new MyDoFn())
                                 .withMaxParallelism(10));

    An uncaught exception from the wrapped DoFn is rethrown by a later call to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.processElement(DoFn<InputT, OutputT>.ProcessContext) or by the call to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.finishBundle(DoFn<InputT, OutputT>.Context).
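
    The deferred rethrow described above can be pictured with the following plain-Java sketch. It is an illustration under assumptions, not the SDK's implementation: the class DeferredFailureSketch and its fields are hypothetical, and a Consumer<String> stands in for the wrapped DoFn.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class DeferredFailureSketch {
    private final ExecutorService pool;
    private final Consumer<String> wrapped;  // hypothetical stand-in for the wrapped DoFn
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    public DeferredFailureSketch(int maxParallelism, Consumer<String> wrapped) {
        this.pool = Executors.newFixedThreadPool(maxParallelism);
        this.wrapped = wrapped;
    }

    // Hands the element to a worker thread. A failure recorded by an earlier
    // element surfaces here, on the calling thread.
    public void processElement(String element) {
        rethrowIfFailed();
        pool.execute(() -> {
            try {
                wrapped.accept(element);
            } catch (Throwable t) {
                failure.compareAndSet(null, t);  // keep only the first failure
            }
        });
    }

    // Waits for outstanding work, then surfaces any recorded failure.
    public void finishBundle() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        rethrowIfFailed();
    }

    private void rethrowIfFailed() {
        Throwable t = failure.get();
        if (t != null) {
            throw new RuntimeException("wrapped function failed", t);
        }
    }
}
```

    In this sketch the call that submits a failing element returns normally, because the failure happens later on a worker thread. The exception reaches the caller only on a subsequent processElement call or in finishBundle, which is why failures can appear to come from an element other than the one that caused them.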
