There are a number of example pipelines available in the
master-1.x branch on GitHub. These pipelines illustrate common end-to-end pipeline patterns using sample scenarios. Each sample
scenario is inspired by a realistic data processing domain, such as analyzing traffic pattern
data, Twitter hashtags, and Wikipedia edit data. This document briefly describes each example and
provides a link to the source code.
A good starting point for new users is the
WordCount Example Walkthrough, which runs over
provided input text files and computes how many times each word occurs in the input.
WordCount teaches key dataflow concepts through a very simple example. However, the
pipelines below are more realistic.
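The core computation WordCount performs can be sketched in plain Python (this illustrates the logic only; it is not Dataflow SDK code):

```python
from collections import Counter

def word_count(lines):
    """Count how many times each word occurs across the input lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)
```

In the actual pipeline, the same steps are expressed as transforms over a PCollection so the work can be distributed across many workers.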
The following terms will appear throughout this document:
- Pipeline - A Pipeline is the code you write to represent your data processing job. Dataflow takes the pipeline code and uses it to build a job.
- PCollection - PCollection is a special class provided by the Dataflow SDK that represents a typed data set.
- Transform - In a Dataflow pipeline, a transform represents a step, or a processing operation that transforms data.
- Bounded and Unbounded PCollections - A
PCollection's size can be either bounded or unbounded. Your
PCollection is bounded if it represents a fixed data set, or unbounded if it represents a continuously updating data set.
- Batch mode - If a pipeline "runs in batch mode", then its input is bounded.
- Streaming mode - If a pipeline "runs in streaming mode", then its input is unbounded.
- Windowing - Windowing is a concept used in the
Dataflow SDK to subdivide a
PCollection according to the timestamps of its individual elements.
AutoComplete
The AutoComplete pipeline computes the most popular hashtags for every prefix of a word in the input. You might
use the results of the AutoComplete pipeline for auto-completion.
AutoComplete receives words as input, and computes the most popular suggestions for
each prefix of each word in the input.
AutoComplete can be run over bounded data (in batch mode) or unbounded data (in
streaming mode). When using bounded data, the input is a list of words in a
text file. In streaming mode, the words constantly flow in
from a Google Cloud Pub/Sub topic.
The AutoComplete pipeline applies a series of transforms to the input strings to
extract the words with hashtags in front of them, determines those words' prefixes, and computes
the top suggestions for each prefix. The pipeline's output is the top suggestions data.
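The prefix/top-suggestions step can be sketched in plain Python (an illustration of the idea only; the real pipeline expresses this as composed transforms, and the hashtag counts below are hypothetical):

```python
from collections import defaultdict

def top_completions(tag_counts, k=2):
    """Map every prefix of every hashtag to its k most frequent completions.

    tag_counts: dict of hashtag -> occurrence count (assumed precomputed).
    """
    by_prefix = defaultdict(list)
    for tag, count in tag_counts.items():
        for i in range(1, len(tag) + 1):
            by_prefix[tag[:i]].append((count, tag))
    # Keep only the k highest-count tags for each prefix.
    return {prefix: [tag for _, tag in sorted(cands, reverse=True)[:k]]
            for prefix, cands in by_prefix.items()}
```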
Streaming Word Extract
The StreamingWordExtract pipeline is a small example pipeline that demonstrates how to work with streaming data. The pipeline reads lines of text streaming from Cloud Pub/Sub, tokenizes each line into individual words, and then capitalizes the words. The pipeline then formats the capitalized words as BigQuery table rows and performs a streaming write to a BigQuery table.
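The tokenize-and-capitalize steps amount to the following plain-Python sketch (the regex used for tokenization here is an assumption, not taken from the pipeline's source):

```python
import re

def extract_and_capitalize(line):
    """Split a line of text into words and upper-case each word."""
    return [word.upper() for word in re.findall(r"[A-Za-z']+", line)]
```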
TfIdf
TfIdf, which stands for term frequency-inverse document frequency, is a calculation of how important a word is to a document or a set of documents.
The TfIdf pipeline reads a set of documents from a directory or from
Google Cloud Storage, and applies a series of transforms that
calculate the components of each word's tf-idf rank. One of the components of the tf-idf
calculation is the idf part (inverse document frequency), which is the total number of
documents divided by the number of documents the word appears in. The total number of
documents is injected into the pipeline as a
side input, since this is a value that doesn't
change. A side input is an additional input that the pipeline can access each time it processes
an element in the input.
TfIdf outputs a mapping from every word to its rank in each document it appears in.
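In plain Python, one common form of the calculation looks like this (a sketch; the exact normalization and log base the example pipeline uses may differ):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict of document name -> list of words.

    Returns {(word, doc): score} where
      tf  = occurrences of the word in the doc / total words in the doc
      idf = log(total number of docs / number of docs containing the word)
    """
    n_docs = len(docs)  # the value the pipeline injects as a side input
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        for word, c in counts.items():
            tf = c / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[(word, name)] = tf * idf
    return scores
```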
Top Wikipedia Sessions
TopWikipediaSessions is a batch pipeline that processes Wikipedia edit data that it reads from Google Cloud Storage. It finds the user with the longest sequence of Wikipedia edits in a single session. A session is defined as a sequence of edits in which each edit is separated from the next by less than an hour.
For example, let's say user A edits Wikipedia 5 times, with each edit 30 minutes apart. A day later, user A edits Wikipedia once again. User A's longest session consists of 5 edits (not 6). User B edits Wikipedia 20 times, but with 2 hours between each edit; user B's longest session consists of 1 edit. Therefore, user A is the one with the longest sequence of edits in a session.
This example uses Windowing to perform time-based aggregations of data. It uses session windows with a one-hour gap duration to define sessions, and computes the number of edits in each user session.
The pipeline writes its output as formatted Strings to a text file in Google Cloud Storage.
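The session logic can be sketched in plain Python (an illustration only; the real pipeline uses the SDK's session windowing rather than an explicit loop):

```python
from datetime import datetime, timedelta

def longest_session(edit_times, gap=timedelta(hours=1)):
    """Return the number of edits in the user's longest session.

    A session ends when two consecutive edits are an hour or more apart.
    """
    if not edit_times:
        return 0
    edit_times = sorted(edit_times)
    longest = current = 1
    for prev, cur in zip(edit_times, edit_times[1:]):
        current = current + 1 if cur - prev < gap else 1
        longest = max(longest, current)
    return longest
```

For user A above (5 edits 30 minutes apart, then one more a day later) this returns 5; for user B (every edit 2 hours apart) it returns 1.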
Traffic Max Lane Flow
The TrafficMaxLaneFlow pipeline analyzes data from traffic sensors. This pipeline can run in both batch and streaming modes. In batch mode, the pipeline reads traffic sensor data from an input file. In streaming mode, the data constantly flows in from a Cloud Pub/Sub topic.
TrafficMaxLaneFlow analyzes the incoming data stream using Windowing, specifically
Sliding Time Windows. Sliding time windows use
time intervals in the data stream to define bundles of data with windows that overlap.
TrafficMaxLaneFlow uses a custom Combine
transform to extract lane information and calculate the max lane flow found for a given station
for each window. A custom combine transform is necessary because the combination is not a simple
combination: it needs to retain some additional information along with the flow value.
The pipeline formats and writes the max values along with the auxiliary information to a BigQuery table.
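Why a custom combine is needed can be shown with a plain-Python sketch (the record fields below are hypothetical, not the pipeline's actual schema):

```python
def max_flow_combine(records):
    """Keep the whole record with the maximum flow value, retaining its
    auxiliary fields (lane, timestamp) rather than reducing to a bare number."""
    best = None
    for record in records:
        if best is None or record["flow"] > best["flow"]:
            best = record
    return best
```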
Traffic Routes
The TrafficRoutes pipeline analyzes data from traffic sensors, calculates the average speed for a small set of predefined 'routes', and looks for 'slowdowns' in those routes.
The TrafficRoutes pipeline runs over bounded data (in batch mode) or unbounded data
(in streaming mode). In batch mode, the pipeline reads the traffic sensor data from a
text file. In streaming mode, the pipeline reads the data
from a Cloud Pub/Sub topic.
The pipeline analyzes the data stream using Windowing,
specifically Sliding Time Windows. Sliding Time
Windows use time intervals in the data stream to define bundles of data with windows that overlap.
The default window duration in
TrafficRoutes is 3 minutes, and the default window
interval is 1 minute. Hence, each window contains a 3-minute data sample that starts 1 minute
after the start of the preceding window. For each window, the pipeline calculates the average
speed for the set of predefined 'routes' and looks for 'slowdowns'. A 'slowdown' occurs if a
supermajority of speeds in a sliding window are less than the reading of the previous window.
The pipeline then formats the results and writes them to BigQuery.
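Sliding-window assignment can be sketched in plain Python, using whole minutes for simplicity (an illustration of the windowing idea, not SDK code):

```python
def sliding_windows(events, duration=3, period=1):
    """Assign minute-stamped events to overlapping windows.

    Each window covers `duration` minutes and a new window starts every
    `period` minutes, so a single event can land in several windows.
    """
    windows = {}
    for minute, value in events:
        # The event belongs to every window whose [start, start + duration)
        # range covers its minute.
        for start in range(max(0, minute - duration + 1), minute + 1):
            if start % period == 0:
                windows.setdefault(start, []).append(value)
    return windows
```

With the defaults above, each 3-minute window starts 1 minute after the previous one, matching the TrafficRoutes defaults described earlier.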
This example pipeline is not available in the Dataflow SDK for Java.