The Dataflow SDKs provide a library of pre-written transforms that represent common and useful data processing operations. These transforms are either core transforms with generic processing functions already written for you, or composite transforms that combine simple pre-written transforms to perform useful processing functions.
Java
In the Dataflow SDK for Java, you can find these transforms in the package com.google.cloud.dataflow.sdk.transforms.
You can use any of the pre-written transforms in the Dataflow SDKs as-is in your own pipelines. These transforms are generic and convenient operations that can perform common data processing steps, such as counting elements in a collection, dividing a collection into quantiles, finding the top (or bottom) N elements in a collection, and performing basic mathematical combinations on numerical data.
Many of the pre-written transforms in the Dataflow SDKs are genericized composite transforms that can take different data types. They are made up of nested core transforms like ParDo, GroupByKey, and Combine.
Java
The Dataflow SDK for Java can represent most common data processing operations using core transforms. The pre-written transforms provided in the SDKs are essentially pre-built wrappers for generic ParDo, Combine, and other core transforms, organized in such a way that they count elements or perform basic mathematical combinations. Sum.integersGlobally, for example, wraps the Combine.Globally core transform for Integer types, and provides a pre-written CombineFn that computes the sum of all input elements. Rather than writing your own version of Combine.Globally with a sum CombineFn, you can use the pre-built transform provided in the SDK.
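To make the difference concrete, here is a sketch comparing the two approaches. It assumes an existing PCollection&lt;Integer&gt; named numbers; the hand-written version uses Combine.globally with a custom summing function:

```java
// Using the pre-written transform: one line.
PCollection<Integer> sum = numbers.apply(Sum.integersGlobally());

// A roughly equivalent hand-written version, using Combine.globally
// with a custom function that sums each bundle of values.
PCollection<Integer> sumByHand = numbers.apply(
    Combine.globally(new SerializableFunction<Iterable<Integer>, Integer>() {
      @Override
      public Integer apply(Iterable<Integer> values) {
        int total = 0;
        for (Integer value : values) {
          total += value;
        }
        return total;
      }
    }));
```

The pre-written transform saves you from re-implementing (and re-testing) the summing logic in every pipeline that needs it.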
If the transforms included in the Dataflow SDKs don't fit your pipeline's use case exactly, you can create your own generic, reusable composite transforms. The source code for the included transforms can serve as a model for creating your own composite transforms using ParDo, Combine, and other core transforms. See Creating Composite Transforms for more information.
Common Processing Patterns
The transforms included in the Dataflow SDKs provide convenient mechanisms for performing common data processing operations in your pipelines. The source code for these transforms illustrates how core transforms like ParDo can be used (or re-used) for various operations.
Simple ParDo Wrappers
Some of the simplest transforms provided in the Dataflow SDKs are utility transforms for dealing with key/value pairs. Given a PCollection of key/value pairs, the Keys transform returns a PCollection containing only the keys; the Values transform returns a PCollection containing only the values. The KvSwap transform swaps the key element and the value element for each key/value pair, and returns a PCollection of the reversed pairs.
Java
Keys, Values, KvSwap, MapElements, FlatMapElements, Filter, and Partition are simple transforms composed of a single ParDo. In each case, the ParDo invokes a relatively simple DoFn to produce the elements of the output PCollection.
Here's the apply method for the Keys transform, which accepts a generic PCollection of KV<K, V> elements, and returns a PCollection<K> of only the keys from the key/value pairs:

@Override
public PCollection<K> apply(PCollection<? extends KV<K, ?>> in) {
  return in.apply(ParDo.named("Keys")
                       .of(new DoFn<KV<K, ?>, K>() {
                         @Override
                         public void processElement(ProcessContext c) {
                           c.output(c.element().getKey());
                         }
                       }));
}
In the example, the apply method applies a ParDo transform to the input collection (in). That ParDo invokes a simple DoFn to output the key part of the key/value pair. The DoFn is trivial, and can be defined as an anonymous inner class instance.
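In a pipeline, these key/value utility transforms can be applied directly, without writing any DoFn yourself. A sketch, assuming an existing PCollection&lt;KV&lt;String, Long&gt;&gt; named wordCounts:

```java
// Just the words (the keys of each pair).
PCollection<String> words = wordCounts.apply(Keys.<String>create());

// Just the counts (the values of each pair).
PCollection<Long> counts = wordCounts.apply(Values.<Long>create());

// Count-to-word pairs, e.g. to group words by their count downstream.
PCollection<KV<Long, String>> countsToWords =
    wordCounts.apply(KvSwap.<String, Long>create());
```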
Patterns That Combine Elements
The Dataflow SDKs contain a number of convenient transforms that perform common statistical and mathematical combinations on elements. For example, there are transforms that accept a PCollection of numerical data (such as integers) and perform a mathematical combination: finding the sum of all elements, the mean of all elements, or the largest/smallest of all elements in the collection. Examples of transforms of this kind are Sum and Mean.

Other transforms perform basic statistical analysis on a collection: finding the top N elements, for example, or returning a random sample of N elements from a given PCollection. Examples of transforms of this kind include Top and Sample.
Java
These transforms are based on the Combine core transform. They include variants that work on PCollections of individual values (using Combine.globally) and PCollections of key/value pairs (using Combine.perKey).
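Mean, for instance, provides both variants. A sketch of the two, assuming a PCollection&lt;Integer&gt; named values and a PCollection&lt;KV&lt;String, Integer&gt;&gt; named keyedValues:

```java
// Global variant: a single mean across the whole collection.
PCollection<Double> globalMean = values.apply(Mean.<Integer>globally());

// Per-key variant: one mean per distinct key.
PCollection<KV<String, Double>> meansPerKey =
    keyedValues.apply(Mean.<String, Integer>perKey());
```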
See the source code and the API for Java reference documentation for the Top transform for an example of a combining transform with both global and per-key variants.
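A sketch of how Top and Sample might be applied, assuming an input PCollection&lt;Integer&gt; named scores:

```java
// The 10 largest values, delivered as a single List<Integer>.
PCollection<List<Integer>> top10 = scores.apply(Top.<Integer>largest(10));

// A random sample of 100 elements from the collection.
PCollection<Iterable<Integer>> sample =
    scores.apply(Sample.<Integer>fixedSizeGlobally(100));
```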
Map/Shuffle/Reduce-Style Processing
Some of the transforms included in the Dataflow SDKs perform processing similar to a Map/Shuffle/Reduce-style algorithm. These transforms include Count, which accepts a potentially non-unique collection of elements and returns a reduced collection of just the unique elements paired with an occurrence count for each. The RemoveDuplicates transform likewise reduces a non-unique collection to just the unique elements, but does not provide an occurrence count.
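A sketch of both transforms, assuming an input PCollection&lt;String&gt; named words:

```java
// Each unique word paired with its number of occurrences.
PCollection<KV<String, Long>> wordCounts =
    words.apply(Count.<String>perElement());

// Just the unique words, with no counts.
PCollection<String> uniqueWords =
    words.apply(RemoveDuplicates.<String>create());
```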
Java
These transforms make use of the core ParDo and Combine.perKey transforms. Combine.perKey is itself a composite operation that performs a GroupByKey and combines the resulting value stream for each key into a single value. The ParDo represents the Map phase of Map/Shuffle/Reduce; Combine.perKey represents the Shuffle and Reduce phases.
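Conceptually, Combine.perKey behaves like a GroupByKey followed by a per-key combine step. A rough sketch of that expansion, assuming a PCollection&lt;KV&lt;String, Long&gt;&gt; named pairs (the real implementation is more sophisticated, for example it can combine values before the shuffle):

```java
PCollection<KV<String, Long>> sums = pairs
    // Shuffle: gather all values for each key.
    .apply(GroupByKey.<String, Long>create())
    // Reduce: fold each key's value stream into a single value.
    .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
      @Override
      public void processElement(ProcessContext c) {
        long sum = 0;
        for (Long value : c.element().getValue()) {
          sum += value;
        }
        c.output(KV.of(c.element().getKey(), sum));
      }
    }));
```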
Here's the apply method for the Count transform, showing the processing logic in the nested ParDo and Combine.perKey transforms:

@Override
public PCollection<KV<T, Long>> apply(PCollection<T> in) {
  return in
      .apply(ParDo.named("Init")
                  .of(new DoFn<T, KV<T, Long>>() {
                    @Override
                    public void processElement(ProcessContext c) {
                      c.output(KV.of(c.element(), 1L));
                    }
                  }))
      .apply(Combine.<T, Long>perKey(
          new SerializableFunction<Iterable<Long>, Long>() {
            @Override
            public Long apply(Iterable<Long> values) {
              long sum = 0;
              for (Long value : values) {
                sum += value;
              }
              return sum;
            }
          }));
}
In the example, the apply method uses a ParDo transform to attach an occurrence count to each element in the input PCollection, creating a key/value pair for each element. This is the Map phase of Map/Shuffle/Reduce. Count then applies a Combine.perKey transform to perform the Shuffle and Reduce logic, and produce a PCollection of unique elements with a combined count of occurrences.