Transforms

In a Dataflow pipeline, a transform represents a step: a processing operation that transforms data. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.

Transforms in the Dataflow model can be nested—that is, transforms can contain and invoke other transforms, thus forming composite transforms.

How Transforms Work

Transforms represent your pipeline's processing logic. Each transform accepts one (or multiple) PCollections as input, performs an operation on the elements in the input PCollection(s), and produces one (or multiple) new PCollections as output.

Java

To use a transform, you apply it to the input PCollection that you want to process by calling the apply method on that PCollection and passing the transform you want to use as an argument. The output PCollection is the return value of PCollection.apply.

For example, the following code sample shows how to apply a user-defined transform called ComputeWordLengths to a PCollection<String>. ComputeWordLengths returns a new PCollection<Integer> containing the length of each String in the input collection:

  // The input PCollection of word strings.
  PCollection<String> words = ...;

  // The ComputeWordLengths transform, which takes a PCollection of Strings as input and
  // returns a PCollection of Integers as output.
  static class ComputeWordLengths
      extends PTransform<PCollection<String>, PCollection<Integer>> { ... }

  // Apply ComputeWordLengths, capturing the results as the PCollection wordLengths.
  PCollection<Integer> wordLengths = words.apply(new ComputeWordLengths());
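
For illustration only, here is one way ComputeWordLengths might be implemented: as a composite transform that wraps a ParDo over a DoFn, following the SDK 1.x style of overriding PTransform's apply method. This is a sketch, not the SDK's own code:

  // Illustrative sketch of ComputeWordLengths: a composite transform that
  // wraps a ParDo over a DoFn which emits the length of each word.
  static class ComputeWordLengths
      extends PTransform<PCollection<String>, PCollection<Integer>> {
    @Override
    public PCollection<Integer> apply(PCollection<String> words) {
      return words.apply(ParDo.of(new DoFn<String, Integer>() {
        @Override
        public void processElement(ProcessContext c) {
          // Output the length of each input String.
          c.output(c.element().length());
        }
      }));
    }
  }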

When you build a pipeline with a Dataflow program, the transforms you include might not be executed in precisely the order you specify them. The Cloud Dataflow managed service, for example, performs optimized execution: it runs transforms in dependency order, inferring the exact sequence from the inputs and outputs defined in your pipeline, and may merge certain transforms or execute them in a different order to provide the most efficient execution.

Types of Transforms in the Dataflow SDKs

Core Transforms

The Dataflow SDK contains a small group of core transforms that are the foundation of the Cloud Dataflow parallel processing model. Core transforms form the basic building blocks of pipeline processing. Each core transform provides a generic processing framework for applying business logic that you provide to the elements of a PCollection.

When you use a core transform, you provide the processing logic as a function object. The function you provide is applied to the elements of the input PCollection(s). Instances of the function may be executed in parallel across multiple Google Compute Engine instances, depending on the size of your data set and on the optimizations performed by the pipeline runner service. The worker code running your function produces the output elements, if any, which are added to the output PCollection(s).
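
For example, the function object you provide to the ParDo core transform is a DoFn. The following sketch shows a hypothetical DoFn, FilterEmptyStringsFn, that drops empty strings from a hypothetical input PCollection named lines:

  // A hypothetical input PCollection of lines of text.
  PCollection<String> lines = ...;

  // A hypothetical DoFn (the function object used with ParDo) that drops
  // empty strings.
  static class FilterEmptyStringsFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) {
      if (!c.element().isEmpty()) {
        c.output(c.element());  // Emit only the non-empty elements.
      }
    }
  }

  // Apply ParDo with the function object; the service may run many copies
  // of FilterEmptyStringsFn in parallel across worker instances.
  PCollection<String> nonEmpty = lines.apply(ParDo.of(new FilterEmptyStringsFn()));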

Requirements for User-Provided Function Objects

The function objects you provide for a transform might have many copies executing in parallel across multiple Compute Engine instances in your Cloud Platform project. As such, you should consider a few factors when creating such a function:

  • Your function object must be serializable.
  • Your function object must be thread-compatible, and be aware that the Dataflow SDKs are not thread-safe.
  • We recommend making your function object idempotent.

These requirements apply to subclasses of DoFn (used with the ParDo core transform), CombineFn (used with the Combine core transform), and WindowFn (used with the Window transform).

Serializability

The function object you provide to a core transform must be fully serializable. The base classes for user code, such as DoFn, CombineFn, and WindowFn, already implement Serializable. However, your subclass must not add any non-serializable members.

Some other serializability factors for which you must account:

  • Transient fields in your function object are not carried down to worker instances in your Cloud Platform project, because they are not automatically serialized.
  • Avoid loading large amounts of data into a field before serialization.
  • Individual instances of function objects cannot share data.
  • Mutating a function object after it gets applied has no effect.
  • Take care when you declare your function object inline by using an anonymous inner class instance. In a non-static context, your inner class instance will implicitly contain a pointer to the enclosing class and its state. That enclosing class will also be serialized, and thus the same considerations that apply to the function object itself also apply to this outer class.
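
For example, the following sketch (hypothetical class and field names, assuming imports of java.io.Serializable and java.util classes) shows how an anonymous inner DoFn pulls its enclosing class into serialization:

  // Hypothetical enclosing class. Because the DoFn below is an anonymous
  // inner class declared in a non-static context, it holds an implicit
  // reference to the enclosing WordFilterBuilder instance. That instance
  // must therefore be serializable as well, and its state (such as the
  // allowedWords list) is serialized and shipped with every copy of the DoFn.
  class WordFilterBuilder implements Serializable {
    private final List<String> allowedWords = new ArrayList<>();

    PCollection<String> addFilterStep(PCollection<String> input) {
      return input.apply(ParDo.of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
          if (allowedWords.contains(c.element())) {
            c.output(c.element());
          }
        }
      }));
    }
  }
  // Declaring the DoFn as its own static nested class instead avoids
  // capturing the enclosing instance.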

Thread-Compatibility

Your function object should be thread-compatible. Each instance of your function object is accessed by a single thread on a worker instance, unless you explicitly create your own threads. Note, however, that the Dataflow SDKs are not thread-safe. If you create your own threads in your function object, you must provide your own synchronization. Note that static members are not passed to worker instances and that multiple instances of your function may be accessed from different threads.

Idempotency

We recommend making your function object idempotent; that is, for any given input, your function always provides the same output. Idempotency is not required, but making your functions idempotent makes your output deterministic and can make debugging and troubleshooting your transforms easier.
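
For example, the following hypothetical DoFns illustrate the difference:

  // Not idempotent: the output for a given input element changes from run
  // to run because it depends on a random value.
  static class AddRandomSuffixFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(c.element() + "-" + Math.random());
    }
  }

  // Idempotent: the output is derived only from the input element, so the
  // same input always produces the same output.
  static class ToUpperCaseFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(c.element().toUpperCase());
    }
  }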

Types of Core Transforms

You will often use the core transforms directly in your pipeline. In addition, many of the other transforms provided in the Dataflow SDKs are implemented in terms of the core transforms.

The Dataflow SDKs define the following core transforms:

  • ParDo for generic parallel processing
  • GroupByKey for grouping key/value pairs by key
  • Combine for combining collections or grouped values
  • Flatten for merging collections
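
For example, the following sketch (hypothetical PCollections; ParDo is shown in the earlier examples) groups word/length pairs by key, sums the lengths per key using the SDK's Sum.SumIntegerFn combine function, and merges two collections with Flatten:

  // Hypothetical input: key/value pairs of (word, length).
  PCollection<KV<String, Integer>> wordLengths = ...;
  PCollection<KV<String, Integer>> moreWordLengths = ...;

  // GroupByKey collects all values that share a key.
  PCollection<KV<String, Iterable<Integer>>> grouped =
      wordLengths.apply(GroupByKey.<String, Integer>create());

  // Combine sums the values for each key.
  PCollection<KV<String, Integer>> totals =
      wordLengths.apply(Combine.<String, Integer, Integer>perKey(new Sum.SumIntegerFn()));

  // Flatten merges multiple PCollections of the same type into one.
  PCollection<KV<String, Integer>> merged =
      PCollectionList.of(wordLengths).and(moreWordLengths)
          .apply(Flatten.<KV<String, Integer>>pCollections());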

Composite Transforms

The Dataflow SDKs support composite transforms, which are transforms built from multiple sub-transforms. The model of transforms in the Dataflow SDKs is modular, in that you can build a transform that is implemented in terms of other transforms. You can think of a composite transform as a complex step in your pipeline that contains several nested steps.

Composite transforms are useful when you want to create a repeatable operation that involves multiple steps. Many of the built-in transforms included in the Dataflow SDKs, such as Count and Top, are this sort of composite transform. They are used in exactly the same manner as any other transform.
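
For example, the built-in Count transform is a composite transform, but you apply it just like any other transform. A minimal sketch, applied to the words PCollection from the earlier example:

  // Count.perElement() is a composite transform built from core transforms,
  // yet it is applied in the same way as any other transform.
  PCollection<KV<String, Long>> wordCounts =
      words.apply(Count.<String>perElement());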

See Creating Composite Transforms for more information.

Pre-Written Transforms in the Dataflow SDKs

The Dataflow SDKs provide a number of pre-written transforms, both core and composite, in which the processing logic is already written for you. These are more complex transforms for combining, splitting, manipulating, and performing statistical analysis on data.

Java

You can find these transforms in the com.google.cloud.dataflow.sdk.transforms package and its subpackages.

For a discussion on using the transforms provided in the Dataflow SDKs, see Transforms Included in the SDKs.

Root Transforms for Reading and Writing Data

The Dataflow SDKs provide specialized transforms, called root transforms, for getting data into and out of your pipeline. These transforms can be used at any time in your pipeline, but most often serve as your pipeline's root and endpoints. They include read transforms, write transforms, and create transforms.

Read transforms, which can serve as the root of your pipeline, create an initial PCollection from data in various sources. These sources can include text files in Google Cloud Storage, data stored in BigQuery or Pub/Sub, and other cloud storage sources. The Dataflow SDKs also provide an extensible API for working with your own custom data sources.

Write transforms can serve as pipeline endpoints to write PCollections containing processed output data to external storage. External data storage sinks can include text files in Google Cloud Storage, BigQuery tables, Pub/Sub, or other cloud storage mechanisms.

Create transforms are useful for creating a PCollection from in-memory data. See Creating a PCollection for more information.
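
For example, the following sketch shows a small pipeline that uses a read transform as its root, a create transform for in-memory data, and a write transform as its endpoint. The Cloud Storage paths are hypothetical placeholders:

  Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

  // Read transform as the pipeline's root: creates an initial PCollection
  // from text files in Google Cloud Storage (hypothetical path).
  PCollection<String> lines = p.apply(TextIO.Read.from("gs://my_bucket/input/*.txt"));

  // Create transform: builds a PCollection from in-memory data.
  PCollection<String> greetings = p.apply(Create.of("hello", "world"));

  // Write transform as a pipeline endpoint: writes a PCollection to
  // text files in Google Cloud Storage (hypothetical path).
  lines.apply(TextIO.Write.to("gs://my_bucket/output/results"));

  p.run();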

For more information on read and write transforms, see Pipeline I/O.

Transforms with Multiple Inputs and Outputs

Some transforms accept multiple PCollection inputs, or specialized side inputs. A transform can also produce multiple PCollection outputs and side outputs. The Dataflow SDKs provide a tagging API to help you track and pass multiple inputs and outputs of different types.
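
For example, the tagging API uses TupleTag objects to label each output. The following sketch (hypothetical tags and logic, applied to a PCollection<String> of words like the one in the earlier examples) sends short words to the main output and longer words to an additional tagged output:

  // Tags that identify the main output and an additional (side) output.
  final TupleTag<String> shortWordsTag = new TupleTag<String>(){};
  final TupleTag<String> longWordsTag = new TupleTag<String>(){};

  PCollectionTuple results = words.apply(
      ParDo.withOutputTags(shortWordsTag, TupleTagList.of(longWordsTag))
           .of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
               if (c.element().length() <= 5) {
                 c.output(c.element());                    // main output
               } else {
                 c.sideOutput(longWordsTag, c.element());  // tagged side output
               }
             }
           }));

  // Retrieve each output PCollection by its tag.
  PCollection<String> shortWords = results.get(shortWordsTag);
  PCollection<String> longWords = results.get(longWordsTag);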

To learn about transforms with multiple inputs and outputs and the details of the tagging system, see Handling Multiple PCollections.
