In a Dataflow pipeline, a transform represents a step: a processing operation that transforms data. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.
Transforms in the Dataflow model can be nested—that is, transforms can contain and invoke other transforms, thus forming composite transforms.
How Transforms Work
Transforms represent your pipeline's processing logic. Each transform accepts one or more PCollections as input, performs an operation on the elements in the input PCollection(s), and produces one or more new PCollections as output.
Java
To use a transform, you apply it to the input PCollection that you want to process by calling the apply method on that PCollection. When you call PCollection.apply, you pass the transform you want to use as an argument. The output PCollection is the return value of PCollection.apply.
For example, the following code sample shows how to apply a user-defined transform called ComputeWordLengths to a PCollection<String>. ComputeWordLengths returns a new PCollection<Integer> containing the length of each String in the input collection:
// The input PCollection of word strings.
PCollection<String> words = ...;

// The ComputeWordLengths transform, which takes a PCollection of Strings as input and
// returns a PCollection of Integers as output.
static class ComputeWordLengths
    extends PTransform<PCollection<String>, PCollection<Integer>> { ... }

// Apply ComputeWordLengths, capturing the results as the PCollection wordLengths.
PCollection<Integer> wordLengths = words.apply(new ComputeWordLengths());
When you build a pipeline with a Dataflow program, the transforms you include might not be executed in precisely the order you specify. The Cloud Dataflow managed service, for example, performs optimized execution: the service orders transforms in dependency order, inferring the exact sequence from the inputs and outputs defined in your pipeline. Certain transforms may be merged or executed in a different order to provide the most efficient execution.
Types of Transforms in the Dataflow SDKs
Core Transforms
The Dataflow SDK contains a small group of core transforms that are the foundation of the Cloud Dataflow parallel processing model. Core transforms form the basic building blocks of pipeline processing. Each core transform provides a generic processing framework for applying business logic that you provide to the elements of a PCollection.
When you use a core transform, you provide the processing logic as a function object. The function you provide is applied to the elements of the input PCollection(s). Instances of the function may be executed in parallel across multiple Google Compute Engine instances, given a large enough data set and depending on the optimizations performed by the pipeline runner service. The function, running as worker code, produces the output elements, if any, that are added to the output PCollection(s).
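As an illustration, the following sketch passes a function object to the ParDo core transform. The ExtractWordsFn class and the lines collection are hypothetical names used only for this example:
// ExtractWordsFn is a function object: the processing logic that ParDo applies to
// each element of the input PCollection, potentially in parallel across many workers.
static class ExtractWordsFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) {
    // Split the input line into words and emit each non-empty word.
    for (String word : c.element().split("[^a-zA-Z']+")) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

// Apply the function object to every element of the input PCollection.
PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));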
Requirements for User-Provided Function Objects
The function objects you provide for a transform might have many copies executing in parallel across multiple Compute Engine instances in your Cloud Platform project. As such, you should consider a few factors when creating such a function:
- Your function object must be serializable.
- Your function object must be thread-compatible, and be aware that the Dataflow SDKs are not thread-safe.
- We recommend making your function object idempotent.
These requirements apply to subclasses of DoFn (used with the ParDo core transform), CombineFn (used with the Combine core transform), and WindowFn (used with the Window transform).
Serializability
The function object you provide to a core transform must be fully serializable. The base
classes for user code, such as DoFn
, CombineFn
, and
WindowFn
, already implement Serializable
. However, your subclass must
not add any non-serializable members.
Some other serializability factors for which you must account:
- Transient fields in your function object are not carried down to worker instances in your Cloud Platform project, because they are not automatically serialized.
- Avoid loading large amounts of data into a field before serialization.
- Individual instances of function objects cannot share data.
- Mutating a function object after it gets applied has no effect.
- Take care when you declare your function object inline by using an anonymous inner class instance. In a non-static context, your inner class instance will implicitly contain a pointer to the enclosing class and its state. That enclosing class will also be serialized, and thus the same considerations that apply to the function object itself also apply to this outer class.
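The following sketch illustrates the anonymous inner class consideration. The DoFn shown is a hypothetical example; declared inline in a non-static context, it implicitly captures the enclosing class, whereas an equivalent static nested class does not:
// Anonymous inner class: in a non-static context, this DoFn holds an implicit reference
// to the enclosing class, so the enclosing class is serialized along with it.
PCollection<Integer> lengths = words.apply(
    ParDo.of(new DoFn<String, Integer>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().length());
      }
    }));

// Static nested class: carries no reference to the enclosing class, so only the
// function object itself needs to be serializable.
static class WordLengthFn extends DoFn<String, Integer> {
  @Override
  public void processElement(ProcessContext c) {
    c.output(c.element().length());
  }
}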
Thread-Compatibility
Your function object should be thread-compatible. Each instance of your function object is accessed by a single thread on a worker instance, unless you explicitly create your own threads. Note, however, that the Dataflow SDKs are not thread-safe. If you create your own threads in your function object, you must provide your own synchronization. Note that static members are not passed to worker instances and that multiple instances of your function may be accessed from different threads.
Idempotency
We recommend making your function object idempotent; that is, for any given input, your function always provides the same output. Idempotency is not required, but making your functions idempotent makes your output deterministic and can make debugging and troubleshooting your transforms easier.
Types of Core Transforms
You will often use the core transforms directly in your pipeline. In addition, many of the other transforms provided in the Dataflow SDKs are implemented in terms of the core transforms.
The Dataflow SDKs define the following core transforms (a brief usage sketch follows the list):
- ParDo for generic parallel processing
- GroupByKey for grouping Key/Value pairs by key
- Combine for combining collections or grouped values
- Flatten for merging collections
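The sketch below shows these core transforms used together. The words and moreWords collections are assumed inputs, and the factory methods shown (ParDo.of, GroupByKey.create, Sum.integersPerKey, Flatten.pCollections) follow the Dataflow SDK for Java; check the SDK reference for the exact variants available in your version:
// ParDo emits a key/value pair for each word.
PCollection<KV<String, Integer>> pairs = words.apply(
    ParDo.of(new DoFn<String, KV<String, Integer>>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(KV.of(c.element(), 1));
      }
    }));

// GroupByKey collects all values that share a key.
PCollection<KV<String, Iterable<Integer>>> grouped =
    pairs.apply(GroupByKey.<String, Integer>create());

// Combine (here via the pre-written Sum transform) reduces the values for each key.
PCollection<KV<String, Integer>> counts =
    pairs.apply(Sum.<String>integersPerKey());

// Flatten merges multiple PCollections of the same element type into one.
PCollection<String> merged =
    PCollectionList.of(words).and(moreWords).apply(Flatten.<String>pCollections());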
Composite Transforms
The Dataflow SDKs support composite transforms, which are transforms built from multiple sub-transforms. The model of transforms in the Dataflow SDKs is modular, in that you can build a transform that is implemented in terms of other transforms. You can think of a composite transform as a complex step in your pipeline that contains several nested steps.
Composite transforms are useful when you want to create a repeatable operation that involves multiple steps. Many of the built-in transforms included in the Dataflow SDKs, such as Count and Top, are this sort of composite transform. They are used in exactly the same manner as any other transform.
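For example, a built-in composite transform such as Count is applied just like any other transform (the words collection below is a hypothetical input):
// Count.perElement is a composite transform built from core transforms; applying it
// looks no different from applying a simple transform.
PCollection<KV<String, Long>> wordCounts =
    words.apply(Count.<String>perElement());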
See Creating Composite Transforms for more information.
Pre-Written Transforms in the Dataflow SDKs
The Dataflow SDKs provide a number of pre-written transforms, which are both core and composite transforms where the processing logic is already written for you. These are more complex transforms for combining, splitting, manipulating, and performing statistical analysis on data.
Java
You can find these transforms in the com.google.cloud.dataflow.sdk.transforms package and its subpackages.
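For instance, a pipeline that uses a few of these pre-written transforms might import them as follows (the specific classes you import depend on your pipeline):
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.Filter;
import com.google.cloud.dataflow.sdk.transforms.Top;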
For a discussion on using the transforms provided in the Dataflow SDKs, see Transforms Included in the SDKs.
Root Transforms for Reading and Writing Data
The Dataflow SDKs provide specialized transforms, called root transforms, for getting data into and out of your pipeline. These transforms can be used at any time in your pipeline, but most often serve as your pipeline's root and endpoints. They include read transforms, write transforms, and create transforms.
Read transforms, which can serve as the root of your pipeline to create an initial PCollection, are used to create pipeline data from various sources. These sources can include text files in Google Cloud Storage, data stored in BigQuery or Pub/Sub, and other cloud storage sources. The Dataflow SDKs also provide an extensible API for working with your own custom data sources.
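As a sketch, a read transform applied at the root of a pipeline might look like the following, where p is an existing Pipeline object and the bucket path is a placeholder:
// Read text files from Google Cloud Storage into an initial PCollection.
PCollection<String> lines =
    p.apply(TextIO.Read.from("gs://your-bucket/input/*.txt"));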
Write transforms can serve as pipeline endpoints to write PCollections containing processed output data to external storage. External data storage sinks can include text files in Google Cloud Storage, BigQuery tables, Pub/Sub, or other cloud storage mechanisms.
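A corresponding write transform at the end of a pipeline might look like this sketch, where results is a PCollection<String> of processed output and the output path is a placeholder:
// Write the processed PCollection to text files in Google Cloud Storage.
results.apply(TextIO.Write.to("gs://your-bucket/output/results"));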
Create transforms are useful for creating a PCollection from in-memory data. See Creating a PCollection for more information.
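For example, a Create transform can build a small PCollection directly from in-memory values (p is an existing Pipeline object):
// Create a PCollection from an in-memory list of Strings.
PCollection<String> smallSet =
    p.apply(Create.of("to", "be", "or", "not", "to", "be"));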
For more information on read and write transforms, see Pipeline I/O.
Transforms with Multiple Inputs and Outputs
Some transforms accept multiple PCollection inputs, or specialized side inputs. A transform can also produce multiple PCollection outputs and side outputs. The Dataflow SDKs provide a tagging API to help you track and pass multiple inputs and outputs of different types.
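The following sketch shows the tagging API producing one main output and one side output from a single ParDo. The tags, the length threshold, and the words collection are hypothetical; the method names follow the Dataflow SDK for Java:
// Tags identify each output PCollection produced by the ParDo.
final TupleTag<String> shortWordsTag = new TupleTag<String>(){};
final TupleTag<String> longWordsTag = new TupleTag<String>(){};

PCollectionTuple results = words.apply(
    ParDo.withOutputTags(shortWordsTag, TupleTagList.of(longWordsTag))
         .of(new DoFn<String, String>() {
           @Override
           public void processElement(ProcessContext c) {
             if (c.element().length() <= 5) {
               c.output(c.element());                    // main output
             } else {
               c.sideOutput(longWordsTag, c.element());  // tagged side output
             }
           }
         }));

// Retrieve each output PCollection from the tuple by its tag.
PCollection<String> shortWords = results.get(shortWordsTag);
PCollection<String> longWords = results.get(longWordsTag);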
To learn about transforms with multiple inputs and outputs and the details of the tagging system, see Handling Multiple PCollections.