Google Cloud Dataflow SDK for Java, version 1.9.1
Class RemoveDuplicates<T>
- java.lang.Object
  - com.google.cloud.dataflow.sdk.transforms.PTransform<PCollection<T>,PCollection<T>>
    - com.google.cloud.dataflow.sdk.transforms.RemoveDuplicates<T>
- Type Parameters:
T - the type of the elements of the input and output PCollections
- All Implemented Interfaces:
- HasDisplayData, Serializable
public class RemoveDuplicates<T> extends PTransform<PCollection<T>,PCollection<T>>
RemoveDuplicates<T> takes a PCollection<T> and returns a PCollection<T> that has all the elements of the input but with duplicate elements removed such that each element is unique within each window.
Two values of type T are compared for equality not by regular Java Object.equals(java.lang.Object), but instead by first encoding each of the elements using the PCollection's Coder, and then comparing the encoded bytes. This admits efficient parallel evaluation.
Optionally, a function may be provided that maps each element to a representative value. In this case, two elements will be considered duplicates if they have equal representative values, with equality being determined as above.
By default, the Coder of the output PCollection is the same as the Coder of the input PCollection.
Each output element is in the same window as its corresponding input element, and has the timestamp of the end of that window. The output PCollection has the same WindowFn as the input.
Does not preserve any order the input PCollection might have had.
Example of use:
PCollection<String> words = ...;
PCollection<String> uniqueWords = words.apply(RemoveDuplicates.<String>create());
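The encoded-bytes comparison described above can be modeled in plain Java without the SDK. The following is only a sketch of the semantics, not the SDK's implementation: the class `EncodedBytesDedupSketch`, the method `dedupByEncodedBytes`, and the `encodeUtf8` helper standing in for a Coder are all hypothetical names introduced for illustration. Unlike the real transform, which makes no ordering guarantee, this sketch happens to keep first-seen order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Plain-Java model of "compare by encoded bytes"; hypothetical, not SDK code. */
class EncodedBytesDedupSketch {

    /** Hypothetical stand-in for a Coder: encodes a String as UTF-8 bytes. */
    static byte[] encodeUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    /** Keeps the first element seen for each distinct encoded form. */
    static <T> List<T> dedupByEncodedBytes(List<T> input, Function<T, byte[]> encoder) {
        // ByteBuffer.equals/hashCode compare byte content, so wrapped arrays
        // can serve as set members (raw byte[] uses identity equality).
        Set<ByteBuffer> seen = new HashSet<>();
        List<T> output = new ArrayList<>();
        for (T element : input) {
            if (seen.add(ByteBuffer.wrap(encoder.apply(element)))) {
                output.add(element);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b");
        // prints [a, b, c]
        System.out.println(dedupByEncodedBytes(words, EncodedBytesDedupSketch::encodeUtf8));

        // StringBuilder does not override equals, yet equal encodings still collapse.
        List<StringBuilder> builders = Arrays.asList(
                new StringBuilder("x"), new StringBuilder("x"), new StringBuilder("y"));
        // prints 2
        System.out.println(dedupByEncodedBytes(
                builders, sb -> sb.toString().getBytes(StandardCharsets.UTF_8)).size());
    }
}
```

The second case in `main` shows why the distinction matters: two objects that are not equal under `Object.equals` are still treated as duplicates when their encodings match.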
- See Also:
- Serialized Form
Nested Class Summary
Nested Classes
Modifier and Type: static class
Class: RemoveDuplicates.WithRepresentativeValues<T,IdT>
Description: A RemoveDuplicates PTransform that uses a SerializableFunction to obtain a representative value for each input element.
-
Field Summary
-
Fields inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
name
-
-
Constructor Summary
Constructors
Constructor: RemoveDuplicates()
-
Method Summary
Modifier and Type, Method, and Description
PCollection<T> apply(PCollection<T> in)
Applies this PTransform on the given InputT, and returns its Output.
static <T> RemoveDuplicates<T> create()
Returns a RemoveDuplicates<T> PTransform.
static <T,IdT> RemoveDuplicates.WithRepresentativeValues<T,IdT> withRepresentativeValueFn(SerializableFunction<T,IdT> fn)
Returns a RemoveDuplicates.WithRepresentativeValues<T, IdT> PTransform.
Methods inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
getDefaultOutputCoder, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, populateDisplayData, toString, validate
Method Detail
-
create
public static <T> RemoveDuplicates<T> create()
Returns a RemoveDuplicates<T> PTransform.
- Type Parameters:
T - the type of the elements of the input and output PCollections
-
withRepresentativeValueFn
public static <T,IdT> RemoveDuplicates.WithRepresentativeValues<T,IdT> withRepresentativeValueFn(SerializableFunction<T,IdT> fn)
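Returns a RemoveDuplicates.WithRepresentativeValues<T, IdT> PTransform, which treats two elements as duplicates when fn maps them to equal representative values.
- Type Parameters:
T - the type of the elements of the input and output PCollections
IdT - the type of the representative value used to deduplicate elements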
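As a rough illustration of representative-value deduplication, the plain-Java sketch below collapses elements whose representative values match. It is not SDK code: `RepresentativeValueDedupSketch` and `dedupByRepresentative` are hypothetical names, the sketch compares representatives with `equals` rather than by their encoded bytes as the transform is documented to do, and keeping the first element seen for each representative is a choice of this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Plain-Java model of deduplication by representative value; hypothetical, not SDK code. */
class RepresentativeValueDedupSketch {

    /** Keeps one element (the first seen) per distinct representative value. */
    static <T, IdT> List<T> dedupByRepresentative(List<T> input, Function<T, IdT> representativeFn) {
        // Simplification: representatives are compared with equals/hashCode here,
        // whereas the documented transform compares their encoded bytes.
        Set<IdT> seen = new HashSet<>();
        List<T> output = new ArrayList<>();
        for (T element : input) {
            if (seen.add(representativeFn.apply(element))) {
                output.add(element);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Apple", "apple", "Pear", "APPLE", "pear");
        // Case-insensitive dedup: the lower-cased word is the representative value.
        // prints [Apple, Pear]
        System.out.println(dedupByRepresentative(words, w -> w.toLowerCase()));
    }
}
```

Note that the output elements keep their original type T; the representative value of type IdT is used only to decide which elements count as duplicates.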
-
apply
public PCollection<T> apply(PCollection<T> in)
Description copied from class: PTransform
Applies this PTransform on the given InputT, and returns its Output.
Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
The default implementation throws an exception. A derived class must either implement apply, or else each runner must supply a custom implementation via PipelineRunner.apply(com.google.cloud.dataflow.sdk.transforms.PTransform<InputT, OutputT>, InputT).
- Overrides:
apply in class PTransform<PCollection<T>,PCollection<T>>