Google Cloud Dataflow SDK for Java, version 1.9.1
Class GroupByKey<K,V>
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.transforms.PTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
-
- com.google.cloud.dataflow.sdk.transforms.GroupByKey<K,V>
-
- Type Parameters:
K
- the type of the keys of the input and outputPCollection
sV
- the type of the values of the inputPCollection
and the elements of theIterable
s in the outputPCollection
- All Implemented Interfaces:
- HasDisplayData, Serializable
public class GroupByKey<K,V> extends PTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
GroupByKey<K, V>
takes aPCollection<KV<K, V>>
, groups the values by key and windows, and returns aPCollection<KV<K, Iterable<V>>>
representing a map from each distinct key and window of the inputPCollection
to anIterable
over all the values associated with that key in the input per window. Absent repeatedly-firingtriggering
, each key in the outputPCollection
is unique within each window.GroupByKey
is analogous to converting a multi-map into a uni-map, and related toGROUP BY
in SQL. It corresponds to the "shuffle" step between the Mapper and the Reducer in the MapReduce framework.Two keys of type
K
are compared for equality not by regular JavaObject.equals(java.lang.Object)
, but instead by first encoding each of the keys using theCoder
of the keys of the inputPCollection
, and then comparing the encoded bytes. This admits efficient parallel evaluation. Note that this requires that theCoder
of the keys be deterministic (seeCoder.verifyDeterministic()
). If the keyCoder
is not deterministic, an exception is thrown at pipeline construction time.By default, the
Coder
of the keys of the outputPCollection
is the same as that of the keys of the input, and theCoder
of the elements of theIterable
values of the outputPCollection
is the same as theCoder
of the values of the input.Example of use:
PCollection<KV<String, Doc>> urlDocPairs = ...; PCollection<KV<String, Iterable<Doc>>> urlToDocs = urlDocPairs.apply(GroupByKey.<String, Doc>create()); PCollection<R> results = urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, R>() { public void processElement(ProcessContext c) { String url = c.element().getKey(); Iterable<Doc> docsWithThatUrl = c.element().getValue(); ... process all docs having that url ... }}));
GroupByKey
is a key primitive in data-parallel processing, since it is the main way to efficiently bring associated data together into one location. It is also a key determiner of the performance of a data-parallel pipeline.See
CoGroupByKey
for a way to group multiple input PCollections by a common key at once.See
Combine.PerKey
for a common pattern ofGroupByKey
followed byCombine.GroupedValues
.When grouping, windows that can be merged according to the
WindowFn
of the inputPCollection
will be merged together, and a window pane corresponding to the new, merged window will be created. The items in this pane will be emitted when a trigger fires. By default this will be when the input sources estimate there will be no more data for the window. SeeAfterWatermark
for details on the estimation.The timestamp for each emitted pane is determined by the
windowing operation
. The outputPCollection
will have the sameWindowFn
as the input.If the input
PCollection
contains late data (seePubsubIO.Read.Bound.timestampLabel
for an example of how this can occur) or therequested TriggerFn
can fire before the watermark, then there may be multiple elements output by aGroupByKey
that correspond to the same key and window.If the
WindowFn
of the input requires merging, it is not valid to apply anotherGroupByKey
without first applying a newWindowFn
or applyingWindow.remerge()
.- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class and Description static class
GroupByKey.GroupAlsoByWindow<K,V>
Helper transform that takes a collection of timestamp-ordered values associated with each key, groups the values by window, combines windows as needed, and for each window in each key, outputs a collection of key/value-list pairs implicitly assigned to the window and with the timestamp derived from that window.static class
GroupByKey.GroupByKeyOnly<K,V>
Primitive helper transform that groups by key only, ignoring any window assignments.static class
GroupByKey.ReifyTimestampsAndWindows<K,V>
Helper transform that makes timestamps and window assignments explicit in the value part of each key/value pair.static class
GroupByKey.SortValuesByTimestamp<K,V>
Helper transform that sorts the values associated with each key by timestamp.
-
Field Summary
-
Fields inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
name
-
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method and Description static void
applicableTo(PCollection<?> input)
PCollection<KV<K,Iterable<V>>>
apply(PCollection<KV<K,V>> input)
Applies thisPTransform
on the givenInputT
, and returns itsOutput
.static <K,V> GroupByKey<K,V>
create()
Returns aGroupByKey<K, V>
PTransform
.boolean
fewKeys()
Returns whether it groups just few keys.protected Coder<KV<K,Iterable<V>>>
getDefaultOutputCoder(PCollection<KV<K,V>> input)
Returns the defaultCoder
to use for the output of this single-outputPTransform
when applied to the given input.static <K,V> Coder<V>
getInputValueCoder(Coder<KV<K,V>> inputCoder)
Returns theCoder
of the values of the input to this transform.void
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.com.google.cloud.dataflow.sdk.util.WindowingStrategy<?,?>
updateWindowingStrategy(com.google.cloud.dataflow.sdk.util.WindowingStrategy<?,?> inputStrategy)
void
validate(PCollection<KV<K,V>> input)
Called before invoking apply (which may be intercepted by the runner) to verify this transform is fully specified and applicable to the specified input.-
Methods inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, toString
-
-
-
-
Method Detail
-
create
public static <K,V> GroupByKey<K,V> create()
Returns aGroupByKey<K, V>
PTransform
.- Type Parameters:
K
- the type of the keys of the input and outputPCollection
sV
- the type of the values of the inputPCollection
and the elements of theIterable
s in the outputPCollection
-
fewKeys
public boolean fewKeys()
Returns whether it groups just few keys.
-
applicableTo
public static void applicableTo(PCollection<?> input)
-
validate
public void validate(PCollection<KV<K,V>> input)
Description copied from class:PTransform
Called before invoking apply (which may be intercepted by the runner) to verify this transform is fully specified and applicable to the specified input.By default, does nothing.
- Overrides:
validate
in classPTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
-
updateWindowingStrategy
public com.google.cloud.dataflow.sdk.util.WindowingStrategy<?,?> updateWindowingStrategy(com.google.cloud.dataflow.sdk.util.WindowingStrategy<?,?> inputStrategy)
-
apply
public PCollection<KV<K,Iterable<V>>> apply(PCollection<KV<K,V>> input)
Description copied from class:PTransform
Applies thisPTransform
on the givenInputT
, and returns itsOutput
.Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
The default implementation throws an exception. A derived class must either implement apply, or else each runner must supply a custom implementation via
PipelineRunner.apply(com.google.cloud.dataflow.sdk.transforms.PTransform<InputT, OutputT>, InputT)
.- Overrides:
apply
in classPTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
-
getDefaultOutputCoder
protected Coder<KV<K,Iterable<V>>> getDefaultOutputCoder(PCollection<KV<K,V>> input)
Description copied from class:PTransform
Returns the defaultCoder
to use for the output of this single-outputPTransform
when applied to the given input.By default, always throws.
- Overrides:
getDefaultOutputCoder
in classPTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
-
getInputValueCoder
public static <K,V> Coder<V> getInputValueCoder(Coder<KV<K,V>> inputCoder)
Returns theCoder
of the values of the input to this transform.
-
populateDisplayData
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class:PTransform
Register display data for the given transform or component.populateDisplayData(DisplayData.Builder)
is invoked by Pipeline runners to collect display data viaDisplayData.from(HasDisplayData)
. Implementations may callsuper.populateDisplayData(builder)
in order to register display data in the current namespace, but should otherwise usesubcomponent.populateDisplayData(builder)
to use the namespace of the subcomponent.By default, does not register any display data. Implementors may override this method to provide their own display data.
- Specified by:
populateDisplayData
in interfaceHasDisplayData
- Overrides:
populateDisplayData
in classPTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>>
- Parameters:
builder
- The builder to populate with display data.- See Also:
HasDisplayData
-
-