Google Cloud

Comparing the Dataflow/Beam and Spark Programming Models

February 3, 2016

Frances Perry

Software Engineer, Apache Beam committer

Tyler Akidau

Staff Software Engineer, Apache Beam committer

Having collectively spent more than a decade working on massive-scale batch and streaming systems at Google, we watched with much pride and enthusiasm as the Google Cloud Dataflow SDK (and corresponding runners for Apache Flink and Apache Spark) was inducted into the new Apache Beam incubator project earlier this week.

We sincerely hope this move heralds the beginning of a new era in data processing, one in which support for robust out-of-order processing is the norm, pipelines are portable across a variety of execution engines and environments (both cloud and on-premises), and today’s icons of operational angst (*cough* Lambda Architecture *cough*) are but mere vestiges of their current selves.

With that goal in mind, we wanted to spend some time calling out a few of the intentional design aspects of the Dataflow/Beam model that make it so useful in addressing the practical needs of modern businesses. Though we’ve talked at length elsewhere about the power and flexibility it affords (see The World Beyond Batch: Streaming 101 and Streaming 102, and The Dataflow Model VLDB 2015 paper), it’s not until you actually sit down and try to build and evolve something practical with it that the more subtle benefits of the design (those informed by years of experience creating and maintaining real-time, massive-scale data processing systems at Google) truly stand out.

To that end, we’ve written an article that compares the programming models of Dataflow and Spark as they exist today. It is based on a real-world mobile gaming scenario and follows the evolution of a pipeline from a simple batch use case to ever more sophisticated streaming use cases, with side-by-side code snippets contrasting the two.

The innovations in Spark that have driven its success as a platform have also revitalized the world of Big Data, and as a result, Spark represents the archetype of the prevailing data processing methodology. We hope this comparison will highlight the practical advantages that Dataflow and Beam offer the industry as a whole from a programming model perspective, and perhaps hint at what the future of data processing might look like.

Plus, there are lots of pretty colors.

See the full article here: Dataflow/Beam & Spark: A Programming Model Comparison
