Google Cloud Dataflow SDK for Java, version 1.9.1
Class IntraBundleParallelization
java.lang.Object
    com.google.cloud.dataflow.sdk.transforms.IntraBundleParallelization
public class IntraBundleParallelization extends Object
Provides multi-threading of DoFns, using threaded execution to process multiple elements concurrently within a bundle.
Note that each Dataflow worker already processes multiple bundles concurrently. This class is meant only for cases where processing the elements within a bundle is limited by blocking calls.
CPU-intensive or IO-intensive tasks are in general a poor fit for parallelization, because a limited resource that is already maximally utilized does not benefit from subdividing the work. Parallelization increases the time needed to process each element while throughput stays roughly the same. For example, if the local disk (an IO resource) has a maximum write rate of 10 MiB/s and processing each element requires writing 20 MiB to disk, then processing one element takes 2 seconds. Processing 3 elements concurrently (each getting an equal share of the maximum write rate) takes at least 6 seconds for all three to complete, and in practice longer because the extra parallelization adds overhead.
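The arithmetic in the disk example can be checked with a few lines of plain Java. This is only a sketch: the class and method names are hypothetical, the figures are the ones quoted above, and threading overhead is ignored, so the results are lower bounds.

```java
/** Worked arithmetic for the shared-disk example; the numbers come from the text. */
class SharedDiskEstimate {
    /**
     * Seconds until `concurrent` elements, started together and sharing the
     * disk's write rate equally, have all been written. Ignores overhead.
     */
    static double secondsToFinish(int concurrent, double mibPerElement, double diskMibPerSecond) {
        // All elements finish once the disk has written their combined size.
        return concurrent * mibPerElement / diskMibPerSecond;
    }

    /** Elements completed per second when `concurrent` elements share the disk. */
    static double elementsPerSecond(int concurrent, double mibPerElement, double diskMibPerSecond) {
        return concurrent / secondsToFinish(concurrent, mibPerElement, diskMibPerSecond);
    }

    public static void main(String[] args) {
        System.out.println(secondsToFinish(1, 20.0, 10.0));   // one element alone
        System.out.println(secondsToFinish(3, 20.0, 10.0));   // three elements sharing the disk
        System.out.println(elementsPerSecond(1, 20.0, 10.0)); // throughput, sequential
        System.out.println(elementsPerSecond(3, 20.0, 10.0)); // throughput, concurrent
    }
}
```

Per-element latency triples (2 s to 6 s) while throughput stays at 0.5 elements per second in both cases, which is why a saturated resource gains nothing from this transform.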
To parallelize a DoFn to 10 threads:

    PCollection<T> data = ...;
    data.apply(
        IntraBundleParallelization.of(new MyDoFn())
            .withMaxParallelism(10));
An uncaught exception from the wrapped DoFn will result in the exception being rethrown in later calls to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.processElement(com.google.cloud.dataflow.sdk.transforms.DoFn<InputT, OutputT>.ProcessContext) or a call to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.finishBundle(com.google.cloud.dataflow.sdk.transforms.DoFn<InputT, OutputT>.Context).
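The deferred rethrow described above can be illustrated with plain java.util.concurrent. The sketch below is not the SDK's implementation: DeferredFailureWrapper, its methods, and the use of a Consumer in place of a DoFn are all assumptions made for illustration. It only shows the general pattern of recording a failure on a worker thread and surfacing it on a later call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical stand-in for a multi-threaded wrapper; not the SDK class. */
class DeferredFailureWrapper<T> {
    private final Consumer<T> work; // plays the role of the wrapped DoFn
    private final ExecutorService executor;
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    DeferredFailureWrapper(Consumer<T> work, int threads) {
        this.work = work;
        this.executor = Executors.newFixedThreadPool(threads);
    }

    /** Queues one element; first surfaces any failure recorded by an earlier element. */
    void processElement(T element) {
        rethrowIfFailed();
        executor.execute(() -> {
            try {
                work.accept(element);
            } catch (Throwable t) {
                failure.compareAndSet(null, t); // keep only the first failure
            }
        });
    }

    /** Waits for queued work, then surfaces any failure that was recorded. */
    void finishBundle() {
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        rethrowIfFailed();
    }

    private void rethrowIfFailed() {
        Throwable t = failure.get();
        if (t != null) {
            throw new RuntimeException("wrapped work failed", t);
        }
    }
}
```

The practical consequence for callers is that the stack trace may appear on a different element than the one that failed, or only when the bundle is finished.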

Nested Class Summary

Nested Classes
Modifier and Type    Class and Description
static class         IntraBundleParallelization.Bound<InputT,OutputT>
                     A PTransform that, when applied to a PCollection<InputT>, invokes a user-specified DoFn<InputT, OutputT> on all its elements, with all its outputs collected into an output PCollection<OutputT>.
static class         IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn<InputT,OutputT>
                     A multi-threaded DoFn wrapper.
static class         IntraBundleParallelization.Unbound
                     An incomplete IntraBundleParallelization transform, with unbound input/output types.

Constructor Summary

Constructors
Constructor and Description
IntraBundleParallelization()

Method Summary

Modifier and Type    Method and Description
static <InputT,OutputT> IntraBundleParallelization.Bound<InputT,OutputT>
                     of(DoFn<InputT,OutputT> doFn)
                     Creates an IntraBundleParallelization PTransform for the given DoFn that processes elements using multiple threads.
static IntraBundleParallelization.Unbound
                     withMaxParallelism(int maxParallelism)
                     Creates an IntraBundleParallelization PTransform with the specified maximum concurrency level.

Method Detail

of

public static <InputT,OutputT> IntraBundleParallelization.Bound<InputT,OutputT> of(DoFn<InputT,OutputT> doFn)

Creates an IntraBundleParallelization PTransform for the given DoFn that processes elements using multiple threads.
Note that the specified doFn needs to be thread safe.
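As a rough illustration of what thread safety means for per-element code, the sketch below calls one shared function object from several threads. CountingFn is a hypothetical stand-in written with the standard library only; it is not a DoFn and does not use the SDK. The point is that any state shared across calls needs a concurrency-safe type or synchronization.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-element function with shared state; stands in for a DoFn. */
class CountingFn {
    private final AtomicLong seen = new AtomicLong(); // safe for concurrent updates

    void process(String element) {
        // With a plain `long` field, `seen++` from several threads could lose updates.
        seen.incrementAndGet();
    }

    long seen() {
        return seen.get();
    }

    /** Calls process() `calls` times across `threads` threads and returns the final count. */
    static long runConcurrently(int threads, int calls) {
        CountingFn fn = new CountingFn();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < calls; i++) {
            final String element = "element-" + i;
            pool.execute(() -> fn.process(element));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return fn.seen();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(10, 100_000));
    }
}
```

Stateless functions, or ones whose fields are immutable after construction, avoid the problem entirely and are usually the simpler choice.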

withMaxParallelism

public static IntraBundleParallelization.Unbound withMaxParallelism(int maxParallelism)

Creates an IntraBundleParallelization PTransform with the specified maximum concurrency level.
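One common way to enforce a maximum concurrency level is a counting semaphore in front of a thread pool. The sketch below shows that general technique with the standard library. Whether the SDK uses the same mechanism is an assumption not confirmed by this page, and BoundedParallelismSketch and peakConcurrency are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrates capping in-flight work with a semaphore; not the SDK's code. */
class BoundedParallelismSketch {
    /** Runs `tasks` short blocking tasks with at most `maxParallelism` in flight; returns the peak observed. */
    static int peakConcurrency(int maxParallelism, int tasks) {
        Semaphore permits = new Semaphore(maxParallelism);
        ExecutorService pool = Executors.newCachedThreadPool();
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                permits.acquire(); // submitter blocks while maxParallelism tasks are running
                pool.execute(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stands in for a blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // free the slot only after the task is done
                    }
                });
            }
            permits.acquire(maxParallelism); // all permits back means all tasks finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println(peakConcurrency(3, 20)); // never exceeds 3
    }
}
```

Because the submitting thread blocks on the semaphore, a slow downstream call applies back-pressure instead of letting queued work grow without bound.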