Data Analytics

Dataflow Under the Hood: the origin story


Editor’s note: This is the first blog in a three-part series examining the internal Google history that led to Dataflow, how Dataflow works as a Google Cloud service, and how it compares and contrasts with other products in the marketplace. Here's where you can find parts two and three. 

Google Cloud’s Dataflow, part of our smart analytics platform, is a streaming analytics service that unifies stream and batch data processing. To get a better understanding of Dataflow, it helps to also understand its history, which starts with MillWheel

A history of Dataflow

Like many projects at Google, MillWheel started in 2008 with a tiny team and a bold idea. When this project started, our team (led by Paul Nordstrom), wanted to create a system that did for streaming data processing what MapReduce had done for batch data processing—provide robust abstractions and scale to massive size. In those early days, we had a handful of key internal Google customers (from Search and Ads), who were driving requirements for the system and pressure-testing the latest versions. 

What MillWheel did was build pipelines operating on click logs to attempt to compute real-time session information in order to better understand how to improve systems like Search for our customers. Up until this point, session information was computed on a daily basis, spinning up a colossal number of machines in the wee hours of the morning to produce results in time for when engineers logged on that morning. MillWheel aimed to change that by spreading that load over the entire day, resulting in more predictable resource usage, as well as vastly improved data freshness. Since a session can be an arbitrary length of time, this Search use case helped provide early motivation for key MillWheel concepts like watermarks and timers.

Alongside this session's use case, we started working with the Google Zeitgeist team—now Google Trends—to look at an early version of trending queries from search traffic. In order to do this, we needed to compare current traffic for a given keyword to historical traffic so that we could determine fluctuations compared to the baseline. This drove a lot of the early work that we did around state aggregation and management, as well as efficiency improvements to the system, to handle cases like first-time queries or one-and-done queries that we'd never see again.

In building MillWheel, we encountered a number of challenges that will sound familiar to any developer working on streaming data processing. For one thing, it's much harder to test and verify correctness for a streaming system, since you can't just rerun a batch pipeline to see if it produces the same "golden" outputs for a given input. For our streaming tests, one of the early frameworks that we developed was called the "numbers" pipeline, which staggered inputs from 1 to 1e6 over different time delivery intervals, aggregated them, and verified the outputs at the end. Though it was a bit arduous to build, it more than paid for itself in the number of bugs it caught. 

Dataflow represents the latest innovation in a long line of precursors at Google. The engineers who built Dataflow (co-led with Frances Perry) first experimented with streaming systems by building MillWheel, which defined some of the core semantics around timers, state management, and watermarks, but proved to be challenging to use in a number of ways. A lot of these challenges were similar to the issues that led us to build Flume for users who wanted to run multiple logical MapReduce (actually map-shuffle-combine-reduce) options together. So, to meet those challenges, we experimented with a higher-level model for programming pipelines called Streaming Flume (no relation to Apache Flume). This model allowed users to reason in terms of datasets and transformations, rather than physical details like computation nodes and the streams between them.

When it came time to build something for Google Cloud, we knew that we wanted to build a system that combined the best of what we'd learned with ambitious goals for the future. Our big bet with Dataflow was to take the semantics of (batch) Flume and Streaming Flume and combine them into a single system, which unified streaming and batch semantics. Under the hood, we had a number of technologies that we could build the system on top of, which we've successfully decoupled from the semantic model of Dataflow. That has let us continue to improve this implementation over time without requiring major rewrites to user pipelines. Along the way, we've created a number of publications about our work in data processing, particularly around streaming systems. Check those out here:

How Dataflow works

Let's take a moment to quickly review some key concepts in Dataflow. When we say that Dataflow is a streaming system, we mean that it processes (and can emit) records as they arrive, rather than according to some fixed threshold (e.g., record count or time window). While users can impose these fixed semantics in defining what outputs they want to see, the underlying system supports streaming inputs and outputs. Within Dataflow, a key concept is the idea of event time, which is a timestamp that corresponds to the time when an event occurred (rather than the time at which it is processed). In order to support a number of interesting applications, it's critical for a system to support event time, so that users can ask questions like "How many people logged on between 1am and 2am?"

One of the architectures that Dataflow is often compared to is the Lambda Architecture, where users run parallel copies of a pipeline (one streaming, one batch) in order to have a "fast" copy of (often partial) results as well as a correct one. There are a number of drawbacks to this approach, including the obvious costs (computational and operational, as well as development costs) of running two systems instead of one. It's also important to note that Lambda Architectures often use systems with very different software ecosystems, making it challenging to replicate complex application logic across both. Finally, it's non-trivial to reconcile the outputs of the two pipelines at the end. This is a key problem that we've solved with Dataflow—users write their application logic once, and can choose whether they would like fast (but potentially incomplete) results, slow (but correct) results, or both.

To help demonstrate Dataflow's advantage over Lambda Architectures, let’s consider the use case of a large retailer with online and in-store sales. These retailers would benefit from in-store BI dashboards, used by in-store employees, that could show regional and global inventory to help shoppers find what they’re looking for, and to let the retailers know what’s been popular with their customers. The dashboards could also be used to drive inventory distribution decisions from a central or regional team. In a Lambda Architecture, these systems would likely have delays in updates that are corrected later by batch processes, but before those corrections are made, they could misrepresent availability for low-inventory items, particularly during high-volume times like the holidays. Poor results in retail can lead to bad customer experiences, but in other fields like cybersecurity, they can lead to complacency and ignored intrusion alerts. With Dataflow, this data would always be up-to-date, ensuring a better experience for customers by avoiding promises of inventory that’s not available—or in cybersecurity, an alerting system that can be trusted.

That covers much of Dataflow’s origin story, but there are more interesting concepts to discuss. Be sure to check out the other blogs in our Dataflow “Under the Hood” series to learn more.