Pre-Written Transforms in the Cloud Dataflow SDK

The Dataflow SDKs provide a library of pre-written transforms that represent common and useful data processing operations. These transforms are either core transforms with generic processing functions already written for you, or composite transforms that combine simple pre-written transforms to perform useful processing functions.

Java

In the Dataflow SDK for Java, you can find these transforms in the package com.google.cloud.dataflow.sdk.transforms.

You can use any of the pre-written transforms in the Dataflow SDKs as-is in your own pipelines. These transforms are generic and convenient operations that can perform common data processing steps, such as counting elements in a collection, dividing a collection into quantiles, finding the top (or bottom) N elements in a collection, and performing basic mathematical combinations on numerical data.

Many of the pre-written transforms in the Dataflow SDKs are genericized composite transforms that can take different data types. They are made up of nested core transforms like ParDo, GroupByKey, and Combine.

Java

The Dataflow SDK for Java can represent most common data processing operations using core transforms. The pre-written transforms provided in the SDKs are essentially pre-built wrappers for generic ParDo, Combine, etc. transforms organized in such a way that they count elements or do basic mathematical combinations. Sum.integersGlobally, for example, wraps the Combine.Globally core transform for Integer types, and provides a pre-written CombineFn that computes the sum of all input elements. Rather than writing your own version of Combine.Globally with a sum CombineFn, you can use the pre-built transform provided in the SDK.
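For instance, the two snippets below compute the same global sum over a hypothetical PCollection&lt;Integer&gt; named input (a sketch against the Dataflow SDK for Java API; the variable names are illustrative):

```
// Using the pre-written transform:
PCollection<Integer> sum = input.apply(Sum.integersGlobally());

// Hand-rolling the equivalent with the Combine.globally core transform:
PCollection<Integer> sameSum = input.apply(Combine.globally(
    new SerializableFunction<Iterable<Integer>, Integer>() {
      @Override
      public Integer apply(Iterable<Integer> values) {
        int total = 0;
        for (Integer value : values) {
          total += value;
        }
        return total;
      }
    }));
```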

If the transforms included in the Dataflow SDKs don't fit your pipeline's use case exactly, you can create your own generic, reusable composite transforms. The source code for the included transforms can serve as a model for creating your own composite transforms using ParDo, Combine, and other core transforms. See Creating Composite Transforms for more information.

Common Processing Patterns

The transforms included in the Dataflow SDKs provide convenient mechanisms to perform common data processing operations in your pipelines. The source code for these transforms illustrates how the core transforms like ParDo can be used (or re-used) for various operations.

Simple ParDo Wrappers

Some of the simplest transforms provided in the Dataflow SDKs are utility transforms for dealing with key/value pairs. Given a PCollection of key/value pairs, the Keys transform returns a PCollection containing only the keys; the Values transform returns a PCollection containing only the values. The KvSwap transform swaps the key element and the value element for each key/value pair, and returns a PCollection of the reversed pairs.
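Applied to a hypothetical PCollection&lt;KV&lt;String, Long&gt;&gt; named pairs, these utilities look like the following (a fragment against the Dataflow SDK for Java API; the variable names are illustrative):

```
PCollection<String> keys = pairs.apply(Keys.<String>create());
PCollection<Long> values = pairs.apply(Values.<Long>create());
PCollection<KV<Long, String>> swapped = pairs.apply(KvSwap.<String, Long>create());
```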

Java

Keys, Values, KvSwap, MapElements, FlatMapElements, Filter, and Partition are simple transforms composed of a single ParDo. In each case, the ParDo invokes a relatively simple DoFn to produce the elements of the output PCollection.

Here's the apply method for the Keys transform, which accepts a generic PCollection of KV<K, V> elements, and returns a PCollection<K> of only the keys from the key/value pairs:

  @Override
  public PCollection<K> apply(PCollection<? extends KV<K, ?>> in) {
    return
        in.apply(ParDo.named("Keys")
                 .of(new DoFn<KV<K, ?>, K>() {
                     @Override
                     public void processElement(ProcessContext c) {
                       c.output(c.element().getKey());
                     }
                    }));
  }

In the example, the apply method applies a ParDo transform to the input collection (in). That ParDo invokes a simple DoFn to output the key part of the key/value pair. The DoFn is trivial, and can be defined as an anonymous inner class instance.

Patterns That Combine Elements

The Dataflow SDKs contain a number of convenient transforms that perform common statistical and mathematical combinations on elements. For example, there are transforms that accept a PCollection of numerical data (such as integers) and perform a mathematical combination: finding the sum of all elements, the mean of all elements, or the largest/smallest of all elements in the collection. Examples of transforms of this kind are Sum and Mean.

Other transforms perform basic statistical analysis on a collection: finding the top N elements, for example, or returning a random sample of every N elements in a given PCollection. Examples of transforms of this kind include Top and Sample.
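For example, given a hypothetical PCollection&lt;Integer&gt; named scores, Top and Sample can be applied as follows (a sketch against the Dataflow SDK for Java API):

```
// The largest 3 elements, emitted as a single List<Integer>:
PCollection<List<Integer>> top3 = scores.apply(Top.<Integer>largest(3));

// A uniform random sample of 10 elements:
PCollection<Iterable<Integer>> sample =
    scores.apply(Sample.<Integer>fixedSizeGlobally(10));
```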

Java

These transforms are based on the Combine core transform. They include variants that work on PCollections of individual values (using Combine.globally) and PCollections of key/value pairs (using Combine.perKey).

See the source code and the API for Java reference documentation for the Top transform for an example of a combining transform with both global and per-key variants.

Map/Shuffle/Reduce-Style Processing

Some of the transforms included in the Dataflow SDKs perform processing similar to a Map/Shuffle/Reduce-style algorithm. These transforms include Count, which accepts a potentially non-unique collection of elements and returns a reduced collection of just the unique elements paired with an occurrence count for each. The RemoveDuplicates transform likewise reduces a non-unique collection to just the unique elements, but does not provide an occurrence count.
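For a hypothetical PCollection&lt;String&gt; named words, these transforms can be applied as follows (a fragment against the Dataflow SDK for Java API):

```
// Unique elements paired with occurrence counts:
PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());

// Just the unique elements, with no counts:
PCollection<String> unique = words.apply(RemoveDuplicates.<String>create());
```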

Java

These transforms make use of the core ParDo and Combine.perKey transforms. Combine.perKey is itself a composite operation that performs a GroupByKey and combines the resulting value stream for each key into a single value. The ParDo represents the Map phase of Map/Shuffle/Reduce; Combine.perKey represents the Shuffle and Reduce phases.

Here's the apply method for the Count transform, showing the processing logic in the nested ParDo and Combine.perKey transforms:

  @Override
  public PCollection<KV<T, Long>> apply(PCollection<T> in) {
    return
        in
        .apply(ParDo.named("Init")
               .of(new DoFn<T, KV<T, Long>>() {
                   @Override
                   public void processElement(ProcessContext c) {
                     c.output(KV.of(c.element(), 1L));
                   }
                 }))

        .apply(Combine.<T, Long>perKey(
                 new SerializableFunction<Iterable<Long>, Long>() {
                   @Override
                   public Long apply(Iterable<Long> values) {
                     long sum = 0;
                     for (Long value : values) {
                       sum += value;
                     }
                     return sum;
                   }
                 }));
  }

In the example, the apply method uses a ParDo transform to attach an occurrence count to each element in the input PCollection, creating a key/value pair for each element—this is the Map phase of Map/Shuffle/Reduce. Count then applies a Combine.perKey transform to perform the Shuffle and Reduce logic, and produce a PCollection of unique elements with a combined count of occurrences.
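The same Map/Shuffle/Reduce structure can be sketched in plain Java, without the SDK: the loop below plays the role of the "Init" ParDo (pairing each element with a count of 1) while the map merge plays the role of Combine.perKey (grouping by key and summing the counts). The class and method names are illustrative, not part of the SDK:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountExample {
    // Map phase: pair each element with a count of 1.
    // Shuffle/Reduce phase: group identical elements and sum their counts.
    static <T> Map<T, Long> count(List<T> input) {
        Map<T, Long> counts = new HashMap<>();
        for (T element : input) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a", "b", "a", "c", "a")));
    }
}
```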
