Complete Examples

There are a number of example pipelines available in the Complete Examples directory (master-1.x branch) on GitHub. These pipelines illustrate common end-to-end pipeline patterns using sample scenarios drawn from realistic data processing domains, such as analyzing traffic sensor data, Twitter hashtags, and Wikipedia edit data. This document briefly describes each example and provides a link to the source code.

A good starting point for new users is the WordCount Example Walkthrough, which runs over provided input text files and computes how many times each word occurs in the input. WordCount teaches key dataflow concepts through a deliberately simple example; the pipelines described below illustrate more realistic scenarios.

Important Keywords

The following terms appear throughout this document; a minimal pipeline sketch showing how they fit together follows the list:

  • Pipeline - A Pipeline is the code you write to represent your data processing job. Dataflow takes the pipeline code and uses it to build a job.
  • PCollection - PCollection is a special class provided by the Dataflow SDK that represents a typed data set.
  • Transform - In a Dataflow pipeline, a transform is a single step: a processing operation that transforms data.
  • Bounded and Unbounded PCollections - A PCollection's size can be either bounded or unbounded. Your PCollection is bounded if it represents a fixed data set, or unbounded if it represents a continuously updating data set.
  • Batch mode - If a pipeline "runs in batch mode", then its input is bounded.
  • Streaming mode - If a pipeline "runs in streaming mode", then its input is unbounded.
  • Windowing - Windowing is a concept used in the Dataflow SDK to subdivide a PCollection according to the timestamps of its individual elements.
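
To make these terms concrete, here is a minimal WordCount-style sketch in the Dataflow SDK for Java 1.x. It is not taken from the examples repository, and the input path is a placeholder; windowing appears in the later examples.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class KeywordsSketch {
      public static void main(String[] args) {
        // The Pipeline represents the whole data processing job.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Reading a fixed text file produces a bounded PCollection,
        // so this pipeline runs in batch mode.
        PCollection<String> lines =
            p.apply(TextIO.Read.from("gs://my-bucket/input.txt"));  // placeholder path

        // Each apply() adds a transform: a processing step over a PCollection.
        PCollection<KV<String, Long>> wordCounts = lines
            .apply(ParDo.of(new DoFn<String, String>() {
              @Override
              public void processElement(ProcessContext c) {
                // Split each line into words and emit them individually.
                for (String word : c.element().split("[^a-zA-Z']+")) {
                  if (!word.isEmpty()) {
                    c.output(word);
                  }
                }
              }
            }))
            .apply(Count.<String>perElement());

        p.run();
      }
    }

Because the input file is fixed, the PCollection is bounded and the job runs in batch mode; reading from a continuously updating source such as a Pub/Sub topic instead would make the PCollection unbounded and the job streaming.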

Auto Complete

Java

The AutoComplete pipeline receives words as input and computes the most popular hashtags for every prefix of each word. You might use its results to power an auto-completion feature.

AutoComplete can be run over bounded data (in batch mode) or unbounded data (in streaming mode). When using bounded data, the input is a list of words in a text file. In streaming mode, the words constantly flow in from a Google Cloud Pub/Sub topic.

The AutoComplete pipeline applies a series of transforms to the input strings to extract the hashtagged words, determine those words' prefixes, and compute the top suggestions for each prefix; those top suggestions are the pipeline's output.
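
A hedged sketch of that core idea (not the actual AutoComplete source): assuming a PCollection<String> named hashtags that holds the extracted hashtags, the pipeline can count them, expand each tag into its prefixes, and keep the most frequent tags per prefix.

    import java.io.Serializable;
    import java.util.Comparator;
    import java.util.List;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.Top;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    /** Orders (tag, count) pairs by count, so Top keeps the most frequent. */
    class ByCount implements Comparator<KV<String, Long>>, Serializable {
      @Override
      public int compare(KV<String, Long> a, KV<String, Long> b) {
        return Long.compare(a.getValue(), b.getValue());
      }
    }

    // `hashtags` is an assumed PCollection<String> of extracted hashtags.
    PCollection<KV<String, List<KV<String, Long>>>> topPerPrefix = hashtags
        // Count how often each complete hashtag occurs.
        .apply(Count.<String>perElement())
        // Emit one (prefix, (tag, count)) pair for every prefix of each tag.
        .apply(ParDo.of(new DoFn<KV<String, Long>, KV<String, KV<String, Long>>>() {
          @Override
          public void processElement(ProcessContext c) {
            String tag = c.element().getKey();
            for (int i = 1; i <= tag.length(); i++) {
              c.output(KV.of(tag.substring(0, i), c.element()));
            }
          }
        }))
        // Keep the ten most frequent tags for each prefix.
        .apply(Top.<String, KV<String, Long>, ByCount>perKey(10, new ByCount()));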

Streaming Word Extract

Java

The StreamingWordExtract pipeline is a small example pipeline that demonstrates how to work with streaming data. The pipeline reads lines of text streaming from Cloud Pub/Sub, tokenizes each line into individual words, and then capitalizes the words. The pipeline then formats the capitalized words as BigQuery table rows and performs a streaming write to a BigQuery table.
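
A rough sketch of this flow, assuming lines is a PCollection<String> already read from Cloud Pub/Sub; the table name and one-column schema are placeholders, not the example's actual code.

    import java.util.Arrays;
    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    // Placeholder one-column schema for the output table.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("string_field").setType("STRING")));

    lines
        // Tokenize each line and capitalize every word.
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word.toUpperCase());
              }
            }
          }
        }))
        // Format each capitalized word as a BigQuery table row.
        .apply(ParDo.of(new DoFn<String, TableRow>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(new TableRow().set("string_field", c.element()));
          }
        }))
        // Streaming write to the (placeholder) output table.
        .apply(BigQueryIO.Write.to("my-project:my_dataset.words").withSchema(schema));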

TfIdf

Java

TfIdf, which stands for term frequency - inverse document frequency, is a calculation of how important a word is to a document within a set of documents.

The TfIdf pipeline reads a set of documents from a local directory or from Google Cloud Storage, and applies a series of transforms that calculate the components of each word's tf-idf rank. One component of the calculation is the idf part (inverse document frequency), which is the total number of documents divided by the number of documents in which the word appears. The total number of documents is injected into the pipeline as a side input, since this value doesn't change. A side input is an additional input that the pipeline can access each time it processes an element of the main input PCollection.
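
A minimal sketch of this side-input pattern, assuming docUris holds one element per document and wordToDocCount maps each word to the number of documents containing it; this is not the TfIdf example's actual code.

    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.View;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import com.google.cloud.dataflow.sdk.values.PCollectionView;

    // Compute the total number of documents once, as a singleton side input.
    final PCollectionView<Long> totalDocuments =
        docUris.apply(Count.<String>globally()).apply(View.<Long>asSingleton());

    // Every main-input element can read the total through the side input.
    PCollection<KV<String, Double>> idf = wordToDocCount.apply(
        ParDo.withSideInputs(totalDocuments).of(
            new DoFn<KV<String, Long>, KV<String, Double>>() {
              @Override
              public void processElement(ProcessContext c) {
                long total = c.sideInput(totalDocuments);
                // Inverse document frequency for this word.
                c.output(KV.of(c.element().getKey(),
                    Math.log((double) total / c.element().getValue())));
              }
            }));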

TfIdf outputs a mapping from each word, and each document in which it appears, to the word's tf-idf rank in that document.

Top Wikipedia Sessions

Java

TopWikipediaSessions is a batch pipeline that processes Wikipedia edit data read from Google Cloud Storage. It finds the user with the longest series of Wikipedia edits in a single session, where a session is defined as a run of edits each separated from the next by less than an hour.

For example, suppose user A edits Wikipedia 5 times, with each edit 30 minutes apart, and then edits once more a day later. User A's longest session consists of 5 edits (not 6). User B edits Wikipedia 20 times, but with 2 hours between each edit, so user B's longest session consists of 1 edit. User A therefore has the longest series of edits in a single session.

This example uses Windowing to perform time-based aggregations of the data. It defines sessions using session windows with a one-hour gap duration, so that two edits by the same user separated by an hour or more fall into different sessions, and computes the number of edits in each user session.
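
A sketch of that session grouping, assuming editors is a PCollection<String> of usernames whose elements already carry event timestamps; this is not the example's actual code.

    import org.joda.time.Duration;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    PCollection<KV<String, Long>> editsPerSession = editors
        // Start a new per-user session whenever a user is idle for an
        // hour or more.
        .apply(Window.<String>into(Sessions.withGapDuration(Duration.standardHours(1))))
        // Count each user's edits within each session window.
        .apply(Count.<String>perElement());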

The pipeline writes its output as formatted Strings to a text file in Google Cloud Storage.

Traffic Max Lane Flow

Java

The TrafficMaxLaneFlow pipeline analyzes data from traffic sensors. This pipeline can run in both batch and streaming modes. In batch mode, the pipeline reads traffic sensor data from an input file. In streaming mode, the data constantly flows in from a Cloud Pub/Sub topic.

TrafficMaxLaneFlow analyzes the incoming data stream using Windowing, specifically sliding time windows. Sliding time windows overlap: each window covers a fixed time interval of the data stream, and a new window begins at a regular period shorter than that interval.

TrafficMaxLaneFlow uses a custom Combine transform to extract lane information and calculate the maximum lane flow found for a given station in each window. A custom combine transform is necessary because the combination is not a simple Max; the pipeline needs to retain some additional information along with the maximum flow value.
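
A hedged sketch of such a custom combine. The LaneInfo record and its fields are hypothetical stand-ins for the example's real data type; the point is that the accumulator carries the whole record, not just the maximum number.

    import java.io.Serializable;
    import com.google.cloud.dataflow.sdk.coders.DefaultCoder;
    import com.google.cloud.dataflow.sdk.coders.SerializableCoder;
    import com.google.cloud.dataflow.sdk.transforms.Combine;

    /** Hypothetical record for one reading: the flow plus auxiliary details. */
    @DefaultCoder(SerializableCoder.class)
    class LaneInfo implements Serializable {
      int flow;       // vehicle count for this reading
      String laneId;  // auxiliary information to keep alongside the max
    }

    /** Keeps the whole LaneInfo with the highest flow, not just the number. */
    class MaxFlowFn extends Combine.CombineFn<LaneInfo, LaneInfo, LaneInfo> {
      @Override
      public LaneInfo createAccumulator() {
        LaneInfo none = new LaneInfo();
        none.flow = Integer.MIN_VALUE;  // sentinel: no reading seen yet
        return none;
      }
      @Override
      public LaneInfo addInput(LaneInfo accum, LaneInfo input) {
        return input.flow > accum.flow ? input : accum;
      }
      @Override
      public LaneInfo mergeAccumulators(Iterable<LaneInfo> accums) {
        LaneInfo best = createAccumulator();
        for (LaneInfo a : accums) {
          best = addInput(best, a);
        }
        return best;
      }
      @Override
      public LaneInfo extractOutput(LaneInfo accum) {
        return accum;
      }
    }

    // Applied per station key, within each sliding window:
    //   readings.apply(Combine.<String, LaneInfo, LaneInfo>perKey(new MaxFlowFn()));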

The pipeline formats and writes the max values along with the auxiliary information to a BigQuery table.

Traffic Routes

Java

TrafficRoutes analyzes data from traffic sensors, calculates the average speed for a small set of predefined 'routes', and looks for 'slowdowns' in those routes.

The TrafficRoutes pipeline runs over bounded data (in batch mode) or unbounded data (in streaming mode). In batch mode, the pipeline reads the traffic sensor data from a text file; in streaming mode, it reads the data from a Cloud Pub/Sub topic.

The pipeline analyzes the data stream using Windowing, specifically sliding time windows. The default window duration in TrafficRoutes is 3 minutes and the default window period is 1 minute, so each window contains a 3-minute data sample that starts 1 minute after the start of the preceding window. For each window, the pipeline calculates the average speed for the set of predefined 'routes' and looks for 'slowdowns'. A 'slowdown' occurs if a supermajority of speeds in a sliding window are less than the reading of the previous window.
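
The window configuration described above might be expressed as follows; readings and its StationSpeed element type are assumptions for illustration, not the example's actual code.

    import org.joda.time.Duration;
    import com.google.cloud.dataflow.sdk.transforms.windowing.SlidingWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;

    readings.apply(
        Window.<StationSpeed>into(
            SlidingWindows.of(Duration.standardMinutes(3))       // window duration
                          .every(Duration.standardMinutes(1)))); // new window each minute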

The pipeline then formats the results and writes them to BigQuery.

Estimate PI

Java

This example pipeline is not available in the Dataflow SDK for Java.
