Google Cloud Dataflow SDK for Java, version 1.9.1
com.google.cloud.dataflow.sdk.transforms
Class Sample
- java.lang.Object
-
- com.google.cloud.dataflow.sdk.transforms.Sample
-
public class Sample extends Object
PTransforms for taking samples of the elements in a PCollection, or samples of the values associated with each key in a PCollection of KVs.
-
-
Nested Class Summary
Nested Classes
static class Sample.FixedSizedSampleFn<T>
    CombineFn that computes a fixed-size sample of a collection of values.
static class Sample.SampleAny<T>
    A PTransform that takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.
-
Constructor Summary
Constructors
Sample()
-
Method Summary
Methods
static <T> PTransform<PCollection<T>,PCollection<T>> any(long limit)
    Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.
static <T> PTransform<PCollection<T>,PCollection<Iterable<T>>> fixedSizeGlobally(int sampleSize)
    Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements.
static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>> fixedSizePerKey(int sampleSize)
    Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random.
-
-
-
Method Detail
-
any
public static <T> PTransform<PCollection<T>,PCollection<T>> any(long limit)
Sample#any(long) takes a PCollection<T> and a limit, and produces a new PCollection<T> containing up to limit elements of the input PCollection.
If limit is greater than or equal to the size of the input PCollection, then all the input's elements will be selected.
All of the elements of the output PCollection should fit into main memory of a single worker machine. This operation does not run in parallel.
Example of use:
PCollection<String> input = ...;
PCollection<String> output = input.apply(Sample.<String>any(100));
- Type Parameters:
T - the type of the elements of the input and output PCollections
- Parameters:
limit - the number of elements to take from the input
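The size contract of any(limit) can be illustrated without the SDK. The following is a minimal sketch in plain Java, not the SDK's implementation: the class SampleAnySketch and the helper takeAny are hypothetical names, and an Iterable stands in for the input PCollection. It only shows that the output holds at most limit elements, and holds every element when limit meets or exceeds the input size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SampleAnySketch {
    // Hypothetical helper (not an SDK method): keeps at most `limit` elements,
    // taking them in whatever order the input happens to yield them.
    static <T> List<T> takeAny(Iterable<T> input, long limit) {
        List<T> out = new ArrayList<>();
        for (T element : input) {
            if (out.size() >= limit) {
                break; // the limit is reached, ignore the rest of the input
            }
            out.add(element);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(takeAny(input, 2).size());   // prints 2
        System.out.println(takeAny(input, 100).size()); // prints 3
    }
}
```

Because the whole result is collected into one in-memory list, the sketch also mirrors the note above that the output should fit into the memory of a single worker.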
-
fixedSizeGlobally
public static <T> PTransform<PCollection<T>,PCollection<Iterable<T>>> fixedSizeGlobally(int sampleSize)
Returns a PTransform that takes a PCollection<T>, selects sampleSize elements, uniformly at random, and returns a PCollection<Iterable<T>> containing the selected elements. If the input PCollection has fewer than sampleSize elements, then the output Iterable<T> will be all the input's elements.
Example of use:
PCollection<String> pc = ...;
PCollection<Iterable<String>> sampleOfSize10 = pc.apply(Sample.fixedSizeGlobally(10));
- Type Parameters:
T - the type of the elements
- Parameters:
sampleSize - the number of elements to select; must be >= 0
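One standard way to draw a uniform fixed-size sample in a single pass is reservoir sampling (Algorithm R). The sketch below uses it to illustrate the behaviour described for fixedSizeGlobally; this page does not say how the SDK computes its sample, so treat the technique as an assumption. FixedSizeSampleSketch and sample are hypothetical names, and an Iterable stands in for the input PCollection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class FixedSizeSampleSketch {
    // Hypothetical helper (not an SDK method): reservoir sampling, Algorithm R.
    static <T> List<T> sample(Iterable<T> input, int sampleSize, Random rng) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        List<T> reservoir = new ArrayList<>();
        long seen = 0; // number of input elements visited so far
        for (T element : input) {
            seen++;
            if (reservoir.size() < sampleSize) {
                // The first sampleSize elements fill the reservoir directly.
                reservoir.add(element);
            } else {
                // Pick a slot uniformly in [0, seen); the new element replaces
                // an old one only if that slot falls inside the reservoir.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, element);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(sample(input, 3, new Random()));         // 3 random elements
        System.out.println(sample(input, 20, new Random()).size()); // prints 10
    }
}
```

When the input has fewer than sampleSize elements the replacement branch never runs, so the result is simply the whole input, matching the description above.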
-
fixedSizePerKey
public static <K,V> PTransform<PCollection<KV<K,V>>,PCollection<KV<K,Iterable<V>>>> fixedSizePerKey(int sampleSize)
Returns a PTransform that takes an input PCollection<KV<K, V>> and returns a PCollection<KV<K, Iterable<V>>> that contains an output element mapping each distinct key in the input PCollection to a sample of sampleSize values associated with that key in the input PCollection, taken uniformly at random. If a key in the input PCollection has fewer than sampleSize values associated with it, then the output Iterable<V> associated with that key will be all the values associated with that key in the input PCollection.
Example of use:
PCollection<KV<String, Integer>> pc = ...;
PCollection<KV<String, Iterable<Integer>>> sampleOfSize10PerKey = pc.apply(Sample.<String, Integer>fixedSizePerKey(10));
- Type Parameters:
K - the type of the keys
V - the type of the values
- Parameters:
sampleSize - the number of values to select for each distinct key; must be >= 0
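The per-key behaviour can be pictured as one independent fixed-size sample per distinct key. The sketch below is plain Java under stated assumptions, not the SDK's implementation: PerKeySampleSketch and samplePerKey are hypothetical names, java.util.Map.Entry stands in for KV, a Map stands in for the output PCollection, and reservoir sampling is an assumed technique for the uniform draw.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PerKeySampleSketch {
    // Hypothetical helper (not an SDK method): keeps one reservoir per key.
    static <K, V> Map<K, List<V>> samplePerKey(
            Iterable<? extends Map.Entry<K, V>> input, int sampleSize, Random rng) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be >= 0");
        }
        Map<K, List<V>> reservoirs = new HashMap<>();
        Map<K, Long> seenPerKey = new HashMap<>(); // values visited so far, per key
        for (Map.Entry<K, V> kv : input) {
            K key = kv.getKey();
            List<V> reservoir = reservoirs.get(key);
            if (reservoir == null) {
                reservoir = new ArrayList<>();
                reservoirs.put(key, reservoir);
            }
            Long previous = seenPerKey.get(key);
            long seen = (previous == null ? 0L : previous) + 1;
            seenPerKey.put(key, seen);
            if (reservoir.size() < sampleSize) {
                reservoir.add(kv.getValue());
            } else {
                // Same replacement rule as a global reservoir, applied per key.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, kv.getValue());
                }
            }
        }
        return reservoirs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        input.add(new SimpleEntry<String, Integer>("a", 1));
        input.add(new SimpleEntry<String, Integer>("a", 2));
        input.add(new SimpleEntry<String, Integer>("a", 3));
        input.add(new SimpleEntry<String, Integer>("b", 10));
        // "a" maps to 2 of its 3 values; "b" has a single value and keeps it.
        System.out.println(samplePerKey(input, 2, new Random()));
    }
}
```

A key with fewer than sampleSize values never reaches the replacement branch, so all of its values are returned, as the description above states.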
-
-