Cookbook Examples

There are a number of example pipelines available in the Cookbook Examples directory (master-1.x branch) on GitHub. These pipelines illustrate how to use the SDK classes to implement common data processing design patterns. This document briefly describes each example and provides a link to the source code.

See End-to-End Examples for a list of complete examples that illustrate common end-to-end pipeline patterns using sample scenarios.

Important Keywords

The following terms will appear throughout this document:

  • Pipeline - A Pipeline is the code you write to represent your data processing job. Dataflow takes the pipeline code and uses it to build a job.
  • PCollection - PCollection is a special class provided by the Dataflow SDK that represents a typed data set of virtually any size.
  • Transform - In a Dataflow pipeline, a transform represents a step: a processing operation that transforms the pipeline's data. The sketch after this list shows all three concepts together.
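
To make these terms concrete, here is a minimal sketch of a pipeline written with the Dataflow SDK for Java. The Cloud Storage paths are placeholders; substitute your own.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class MinimalPipeline {
      public static void main(String[] args) {
        // A Pipeline holds the graph of transforms that Dataflow builds into a job.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Reading produces a PCollection: a typed data set of virtually any size.
        PCollection<String> lines = p.apply(TextIO.Read.from("gs://your-bucket/input.txt"));

        // Each apply() adds a transform (a processing step) to the pipeline.
        lines.apply(TextIO.Write.to("gs://your-bucket/output"));

        p.run();
      }
    }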

BigQueryTornadoes

The BigQueryTornadoes pipeline reads public samples of weather data from BigQuery and counts the number of tornadoes that occur each month.

The pipeline applies a user-defined transform that takes rows from a table and generates a table of counts. The user-defined transform in this example is composed of several transforms, one of them the SDK-defined Count transform, which is used here to count the total number of tornadoes found in each month. The pipeline then writes the results to BigQuery.
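
The following sketch shows the shape of that logic. The table name and the "tornado" and "month" field names are illustrative assumptions, not necessarily the example's exact code, and a Pipeline p as in the earlier sketch is assumed.

    // Also assumes the SDK's BigQueryIO, ParDo, DoFn, Count, and KV classes,
    // plus TableRow from the BigQuery API client library.
    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("clouddataflow-readonly:samples.weather_stations"));

    // Keep the month of every row that recorded a tornado.
    PCollection<Integer> tornadoMonths = rows.apply(ParDo.of(new DoFn<TableRow, Integer>() {
      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        if (Boolean.TRUE.equals(row.get("tornado"))) {
          c.output(Integer.parseInt(row.get("month").toString()));
        }
      }
    }));

    // The SDK-defined Count transform tallies tornadoes per month.
    PCollection<KV<Integer, Long>> counts = tornadoMonths.apply(Count.<Integer>perElement());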

CombinePerKeyExamples

You'll often find that you need to combine values in your pipeline's data. The Dataflow SDK provides a number of Combine transforms that you can use to merge the values in your pipeline's PCollection objects. Combine.perKey, for example, groups a collection of tuples by key and then performs some combination of all the values that share that key. You can combine the values by supplying Combine.perKey with one of the predefined combine functions, or your own custom combine function.

CombinePerKeyExamples reads a set of Shakespeare's plays from Google BigQuery and, for each word in the dataset that exceeds a given length, generates a string listing the plays in which that word appears. It demonstrates how to combine values in a key-grouped collection: it uses Combine.perKey with a user-defined combine function that builds a comma-separated string of all input items. That is, for a collection of key-value pairs where both key and value are strings, the function concatenates all string values grouped under the same key into one string.
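
A minimal sketch of that combine step, assuming a key-grouped input of (word, play name) pairs named wordToPlay (an illustrative name, not the example's actual code):

    // wordToPlay is a PCollection<KV<String, String>> with one pair per
    // (word, play) occurrence. Combine.perKey groups the values by key and
    // folds each group with the given function.
    PCollection<KV<String, String>> wordToPlays = wordToPlay.apply(
        Combine.<String, String>perKey(
            new SerializableFunction<Iterable<String>, String>() {
              @Override
              public String apply(Iterable<String> plays) {
                // Build a comma-separated list of all play names for this word.
                StringBuilder all = new StringBuilder();
                for (String play : plays) {
                  if (all.length() > 0) { all.append(","); }
                  all.append(play);
                }
                return all.toString();
              }
            }));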

In addition to combining values in your pipeline's data, CombinePerKeyExamples demonstrates the use of Aggregators. Aggregators are used to track certain information to display in the Dataflow Monitoring UI. The CombinePerKeyExamples pipeline uses an Aggregator to track the number of words that are shorter than the given length (and are, therefore, not included in the final result).
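
As a sketch, an Aggregator can be declared as a field of a DoFn and incremented as elements are processed. The names shortWords and MIN_WORD_LENGTH below are hypothetical, and a PCollection<String> of words is assumed.

    PCollection<String> longWords = words.apply(ParDo.of(new DoFn<String, String>() {
      // The Aggregator's running total appears in the Dataflow Monitoring UI.
      private final Aggregator<Long, Long> shortWords =
          createAggregator("shortWords", new Sum.SumLongFn());

      @Override
      public void processElement(ProcessContext c) {
        if (c.element().length() < MIN_WORD_LENGTH) {  // hypothetical constant
          shortWords.addValue(1L);  // counted for monitoring, not emitted
        } else {
          c.output(c.element());
        }
      }
    }));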

MaxPerKeyExamples

MaxPerKeyExamples also demonstrates how to use combine transforms and CombineFns — specifically, how to use the statistical combination functions included in the Dataflow SDK for Java.

The MaxPerKeyExamples example reads public samples of weather data from BigQuery and uses the Max function to find the maximum temperature for each month.
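
For instance, assuming a collection of (month, temperature reading) pairs named monthTemps (a placeholder name), the SDK's Max function reduces each month's readings to the largest one:

    // monthTemps is a PCollection<KV<Integer, Double>>.
    // Max.doublesPerKey() keeps only the largest value seen for each key.
    PCollection<KV<Integer, Double>> maxTemps =
        monthTemps.apply(Max.<Integer>doublesPerKey());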

DatastoreWordCount

The DatastoreWordCount example shows how to use DatastoreIO to read from and write to Cloud Datastore, a schemaless NoSQL database.

DatastoreWordCount consists of two examples. The first, writeDataToDatastore, creates a pipeline that reads input from a text file and uses DatastoreIO to write the data to Cloud Datastore. The second, readDataFromDatastore, creates a pipeline that performs a DatastoreIO.Read to read Shakespeare's plays, counts the number of times each word occurs in the plays, and writes the results to a text file.
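
The read and write sides look roughly like the following sketch. The exact DatastoreIO method names changed across SDK releases, so readFrom and writeTo here are assumptions rather than a definitive API reference; the dataset ID and entity kind are placeholders.

    // Build a Datastore query for entities of a given kind ("Line" is illustrative).
    Query.Builder q = Query.newBuilder();
    q.addKindBuilder().setName("Line");
    Query query = q.build();

    // Read: the query result arrives as a PCollection of Datastore Entity objects.
    // (readFrom/writeTo are assumed 1.x-era method names; verify against your SDK.)
    PCollection<Entity> entities =
        p.apply(Read.from(DatastoreIO.readFrom("my-dataset-id", query)));

    // Write: a PCollection<Entity> can be persisted back to the same dataset.
    entities.apply(DatastoreIO.writeTo("my-dataset-id"));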

DeDupExample

DeDupExample demonstrates how to remove duplicate records from a data set. In the example, the pipeline's input is a set of Shakespeare's plays as plain text files.

The example shows how to use the SDK-provided RemoveDuplicates transform to remove the duplicate records (in this case, text lines).
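
A minimal sketch of that pattern, assuming the plays live under a placeholder Cloud Storage path and a Pipeline p as in the earlier sketch:

    // Read the plays as text lines, drop exact duplicate lines, write what remains.
    PCollection<String> lines = p.apply(TextIO.Read.from("gs://your-bucket/shakespeare/*"));
    PCollection<String> unique = lines.apply(RemoveDuplicates.<String>create());
    unique.apply(TextIO.Write.to("gs://your-bucket/deduped"));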

FilterExamples

FilterExamples demonstrates different approaches to filtering. One approach is projection in the relational-database sense, which derives a desired subset of the columns found in a table. Another class of filtering outputs only those elements for which some condition holds; the condition can compare against a statically defined value, a parameter supplied at pipeline construction time, or a value derived by the pipeline itself.

The FilterExamples pipeline reads public samples of weather data from BigQuery. It performs a projection on the data, finds the global mean of the temperature readings, filters on readings for a single given month, and then outputs only data (for that month) that has a mean temperature smaller than the derived global mean.

Often, it's useful for a filter condition to be based on a value that the pipeline itself derives; such a derived value can be passed to a filter as a side input. FilterExamples shows how to filter on a side input: it contains a composite transform that reads weather data, extracts only the temperature value from each element, and uses that collection as the input to a global mean derivation. The result of that derivation is then used to filter the collection, outputting only elements with a temperature below the global mean.
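
A sketch of the side-input pattern, assuming a PCollection<Double> of temperature readings named temps (an illustrative name):

    // Derive the global mean, turn it into a singleton side input,
    // and filter the main collection against it.
    final PCollectionView<Double> globalMean =
        temps.apply(Mean.<Double>globally()).apply(View.<Double>asSingleton());

    PCollection<Double> belowMean = temps.apply(
        ParDo.withSideInputs(globalMean).of(new DoFn<Double, Double>() {
          @Override
          public void processElement(ProcessContext c) {
            // Read the derived value from the side input at processing time.
            if (c.element() < c.sideInput(globalMean)) {
              c.output(c.element());
            }
          }
        }));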

JoinExamples

JoinExamples shows how to perform relational joins between datasets by using Dataflow's CoGroupByKey transform. CoGroupByKey can join values from multiple PCollections that share a common key.

JoinExamples uses sample data from two BigQuery tables, one with current events indexed by country codes, and the other with country codes mapped to country names. The example demonstrates joining the data from the two tables with a country code key.
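
A sketch of that join, assuming two key-value collections keyed by country code, eventsByCountry and namesByCountry (illustrative names, not the example's exact code):

    // Tag each input so the join result can tell the two sources apart.
    final TupleTag<String> eventTag = new TupleTag<String>();
    final TupleTag<String> nameTag = new TupleTag<String>();

    // Both inputs are PCollection<KV<String, String>> keyed by country code;
    // CoGroupByKey groups the values from both under each shared key.
    PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
        .of(eventTag, eventsByCountry)
        .and(nameTag, namesByCountry)
        .apply(CoGroupByKey.<String>create());

    // Each result holds all values from both inputs for one country code.
    joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        String countryName = c.element().getValue().getOnly(nameTag, "none");
        for (String event : c.element().getValue().getAll(eventTag)) {
          c.output(c.element().getKey() + "," + countryName + "," + event);
        }
      }
    }));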
