There are a number of example pipelines available in the
master-1.x branch on GitHub. These pipelines illustrate how to use the SDK classes to implement common data processing design
patterns. This document briefly describes each example and provides a link to the source code.
See End-to-End Examples for a list of complete examples that illustrate common end-to-end pipeline patterns using sample scenarios.
The following terms will appear throughout this document:
- Pipeline - A Pipeline is the code you write to represent your data processing job. Dataflow takes the pipeline code and uses it to build a job.
- PCollection - PCollection is a special class provided by the Dataflow SDK that represents a typed data set of virtually any size.
- Transform - In a Dataflow pipeline, a transform represents a step: a processing operation that transforms your data.
The BigQueryTornadoes pipeline reads public samples of weather data from BigQuery and counts the number of tornadoes that occur each month.
The pipeline applies a user-defined transform
that takes rows from a table and generates a table of counts. The user-defined transform in this
example consists of several transforms, one of them being the SDK-defined
Count transform, which is used here to count the total number of tornadoes found in
each month. The pipeline then writes the results to BigQuery.
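The core logic (grouping by month and counting) can be sketched in plain Python. This is not Dataflow SDK code, and the field names `month` and `tornado` are assumptions made for illustration:

```python
from collections import Counter

def count_tornadoes_by_month(rows):
    """Count rows that report a tornado, keyed by month.

    Each row stands in for a BigQuery table row; 'month' and 'tornado'
    are hypothetical field names used only for this sketch.
    """
    counts = Counter(row["month"] for row in rows if row["tornado"])
    # Emit one output row per month, mirroring the pipeline's output table.
    return [{"month": m, "tornado_count": c} for m, c in sorted(counts.items())]

rows = [
    {"month": 1, "tornado": True},
    {"month": 1, "tornado": False},
    {"month": 2, "tornado": True},
    {"month": 1, "tornado": True},
]
print(count_tornadoes_by_month(rows))
```

In the real pipeline, the grouping and counting are expressed declaratively with the SDK's Count transform rather than with an explicit loop.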
You'll often find that you need to combine values in your pipeline's data. The Dataflow SDK
provides a number of Combine
transforms that you can use to merge the values in your pipeline's data.
Combine.perKey, for example, groups a collection of tuples by key and then
performs some combination of all the values that share that key. You can supply
Combine.perKey with one of the predefined
combine functions or your own custom combine function.
The CombinePerKeyExamples pipeline reads a set of Shakespeare's plays from Google BigQuery,
and for each word in the dataset that exceeds a given length, it generates a string containing the
list of play names in which that word appears. Specifically, it demonstrates how to
combine values in a key-grouped collection. It uses
Combine.perKey to generate a string containing the list of play names in which each
word that meets the length requirement appears. It does this using a (user-defined) combine
function that builds a comma-separated string of all input items. That is, for a collection of
key-value pairs where both key and value are strings, the function concatenates all string values,
grouped under the same key, into one string.
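The combine step can be sketched in plain Python; this is a stand-in for Combine.perKey with a custom combine function, not SDK code:

```python
from collections import defaultdict

def combine_per_key_concat(pairs):
    """Concatenate all string values sharing a key into one comma-separated string.

    'pairs' is a list of (word, play_name) tuples, standing in for the
    key-grouped collection that Combine.perKey would operate on.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # One combined string per key, like the pipeline's per-word play list.
    return {key: ",".join(values) for key, values in grouped.items()}
```

Note that a Dataflow combine function must also handle merging of partial results computed on different workers; this sequential sketch skips that detail.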
In addition to combining values in your pipeline's data, the CombinePerKeyExamples pipeline
demonstrates the use of Aggregators.
Aggregators are used to track certain information to display in the
Dataflow Monitoring UI.
The CombinePerKeyExamples pipeline uses an
Aggregator to track the
number of words that are shorter than the given length (and are, therefore, not included in the
main output).
The MaxPerKeyExamples example reads public samples of weather data from BigQuery and
uses the Max
function to find the maximum temperature for each month.
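The effect of applying the Max combine function per key can be sketched in plain Python (again, a logic sketch rather than SDK code):

```python
def max_per_key(readings):
    """Find the maximum temperature for each month.

    'readings' is a list of (month, temperature) tuples, standing in
    for the keyed collection the Max combine function would reduce.
    """
    maxima = {}
    for month, temp in readings:
        if month not in maxima or temp > maxima[month]:
            maxima[month] = temp
    return maxima
```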
DatastoreWordCount consists of two examples. The first example,
writeDataToDatastore, creates a pipeline that reads input from a text file and
writes Datastore entities from it using DatastoreIO. The second example
creates a pipeline that uses
DatastoreIO.Read to read Shakespeare's plays,
counts the number of times each word occurs in the plays, and writes the results to a text file.
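The word-counting step can be sketched in plain Python; the tokenization rule below is an assumption for illustration, not the one the example uses:

```python
import re
from collections import Counter

def count_words(lines):
    """Count occurrences of each word across the input lines.

    Words are lowercased and matched with a simple letters-and-apostrophes
    pattern; the real example defines its own tokenization.
    """
    counter = Counter()
    for line in lines:
        counter.update(re.findall(r"[a-z']+", line.lower()))
    return dict(counter)
```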
DeDupExample demonstrates how to remove duplicate records from a data set. In the example, the pipeline's input is a set of Shakespeare's plays as plain text files.
The example shows how to use the SDK-provided RemoveDuplicates transform to remove the duplicate records (in this case, text lines).
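The effect of RemoveDuplicates can be sketched in plain Python; note that this sequential version preserves first-seen order, whereas the SDK transform makes no ordering guarantee:

```python
def remove_duplicates(lines):
    """Drop duplicate lines, keeping one copy of each distinct line."""
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out
```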
FilterExamples demonstrates different approaches to filtering. One approach is projection in the relational database sense: deriving a desired subset of the columns found in a table. Another approach outputs only those elements for which some condition is true; the condition can be based on a statically-defined value, a parameter defined at pipeline creation time, or a value derived by the pipeline itself.
The FilterExamples pipeline reads public samples of weather data from
BigQuery. It performs a projection on the data, finds the global mean of the temperature readings,
filters on readings for a single given month, and then outputs only data (for that month) that
has a mean temperature smaller than the derived global mean.
Often, it's useful for a filter condition to be based on a value that's derived by the
pipeline itself. The derived value can be used as a side input.
FilterExamples shows how to filter on a side input; it contains
a composite transform that reads weather data, extracts only the temperature value from each
element, and then uses that collection as the input to a global mean derivation. The result of
that derivation is then used to filter on a collection, outputting only elements with a
temperature below that global mean.
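The filtering logic can be sketched in plain Python, with the derived global mean playing the role of the side input (a logic sketch, not SDK code):

```python
def filter_below_global_mean(readings, month):
    """Keep only the given month's readings whose temperature is below
    the global mean of all readings.

    'readings' is a list of (month, temperature) tuples; the computed
    global mean stands in for the pipeline-derived side input.
    """
    temps = [t for _, t in readings]
    global_mean = sum(temps) / len(temps)
    return [(m, t) for m, t in readings if m == month and t < global_mean]
```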
JoinExamples uses sample data from two BigQuery tables, one with current events
indexed by country codes, and the other with country codes mapped to country names. The example
demonstrates joining the data from the two tables using the
country code as the key.
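The join can be sketched in plain Python; the fallback value for an unmatched code is an assumption made for this sketch:

```python
def join_by_country_code(events, names):
    """Join event rows with country names on the shared country-code key.

    'events' is a list of (country_code, event) tuples and 'names' a list
    of (country_code, country_name) tuples, standing in for the two
    BigQuery tables the example joins.
    """
    name_by_code = dict(names)
    return [
        # "none" marks an event whose code has no matching country name.
        (code, name_by_code.get(code, "none"), event)
        for code, event in events
    ]
```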