Separation of compute and state in Google BigQuery and Cloud Dataflow (and why it matters)
Posted by Tino Tereshko, Big Data Lead, Google Cloud Platform Office of CTO. (Thanks to Rodd Zurcher, Engineering Director, Motorola Mobility, and Matthew Baird, Cofounder and CTO, AtScale, Alexey Maloletkin, Advisory Software Engineer, Motorola Mobility for their contributions to this post.)
As previously discussed on this blog, one benefit of cloud-native data analytics technologies is the separation of storage and compute. Using cloud-native storage services instead of instance-attached storage can be easier to manage and scale, and potentially more affordable as well. Furthermore, by avoiding long-term state in your instances, you can rapidly scale your processing power and leverage cheap ephemeral compute capacity.
Two Google Cloud Platform (GCP) services — Google BigQuery (a fully managed, petabyte-scale, low-cost enterprise data warehouse) and Google Cloud Dataflow (for running batch and streaming data processing pipelines with equal expressiveness and reliability) — push this envelope even further by separating compute and state. This approach can improve performance and efficiency, especially for workloads with high contention and concurrency (which are typical of real-world analytics use cases). Separation of compute and state is the “secret sauce” that makes BigQuery so good at concurrency.
Defining separation of compute and state
Separation of compute and state refers to the ability to maintain intermediate state between processing stages in a high-performance component that is separate from either the compute cluster or storage. You may recognize its utility in the shuffle step; the key difference is that shuffle state is held by a separate subservice. Whereas the shuffle task is often the source of processing bottlenecks, separating compute from state turns this weakness into a strength.
There are several benefits to separating compute from state (watch this video for more details):
- Less state in compute means compute becomes more ephemeral and scalable. It’s easier to re-parallelize processing intra-stage and inter-stage, and easier to recover from a lost node.
- Processing is more streamlined; processing stages no longer collide on the same compute nodes, avoiding the resource contention and bottlenecks that would otherwise result.
- It’s easier for the processing engine to re-partition workloads between stages.
- Your processing engine can take advantage of pipelined execution. In other words, it doesn’t have to wait for Stage N to finish before starting Stage N+1.
- The processing engine can implement dynamic work repartitioning (the ability to re-parallelize work due to slow workers or data skew).
- Keeping less state in processing nodes makes workloads more resilient to individual node issues.
- The service can utilize available resources much more efficiently across compute as well as shuffle.
- BigQuery’s current implementation of the Dremel distributed query engine deviates from the architecture described in the Dremel paper from 2010. Subsequent versions of Dremel and BigQuery leverage dynamic processing trees and a separate in-memory shuffle component.
- While Cloud Dataflow has historically implemented shuffle within the processing nodes, its new shuffle service (in beta at the time of this writing) can deliver up to a 5x performance improvement.
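The pipelined-execution point above can be sketched in a few lines of Python. This is a toy illustration, not how BigQuery or Cloud Dataflow are implemented: a bounded queue stands in for the external shuffle component, and Stage N+1 begins consuming intermediate results while Stage N is still producing them.

```python
import queue
import threading

# A bounded queue stands in for the shuffle subservice that sits
# between two stages. Neither stage holds the intermediate state
# itself, so Stage 2 can start consuming as soon as Stage 1 emits
# its first result -- pipelined execution.
shuffle = queue.Queue(maxsize=4)
SENTINEL = object()  # marks the end of Stage 1's output

def stage1(rows):
    for row in rows:
        shuffle.put(row * 2)      # emit each intermediate result as produced
    shuffle.put(SENTINEL)

def stage2(results):
    while True:
        item = shuffle.get()      # consume without waiting for Stage 1 to finish
        if item is SENTINEL:
            break
        results.append(item + 1)

results = []
t1 = threading.Thread(target=stage1, args=(range(5),))
t2 = threading.Thread(target=stage2, args=(results,))
t2.start(); t1.start()
t1.join(); t2.join()
print(results)  # [1, 3, 5, 7, 9]
```

Because the queue is bounded, Stage 1 also naturally slows down if Stage 2 falls behind, which is the same back-pressure property a real shuffle service provides.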
The result is faster performance for jobs in high-concurrency environments, where jobs by definition compete for resources. For this reason, complex workloads (that is, large, multi-stage queries with numerous JOINs executed at high concurrency) perform exceptionally well on services that separate compute and state. (Note: because intra-stage state is generally kept in compute, it sometimes makes sense to bypass shuffle entirely, as in the case of broadcast joins.)
Separation of compute and state in action
To demonstrate this concept, let’s take a look at a BigQuery query presented by Jordan Tigani (BigQuery tech lead) in his Google Cloud Next ‘17 session:
```sql
#standardSQL
SELECT
  language,
  SUM(views) AS views
FROM (
  SELECT
    title,
    language,
    MAX(views) AS views
  FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`
  WHERE title LIKE "G%o%"
  GROUP BY title, language
)
GROUP BY language
ORDER BY views DESC
```
(This is a public dataset, so you’re free to replicate the query on your own.)
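To make the two aggregation stages concrete, here is a toy Python re-implementation of the query's logic on a handful of made-up rows (the data is invented for illustration). The point to notice is the change of grouping key between stages, from (title, language) to language alone, which is exactly the re-keying the shuffle performs.

```python
from collections import defaultdict

# Made-up (title, language, views) rows standing in for the Wikipedia table.
rows = [
    ("Go", "en", 10), ("Go", "en", 30),
    ("Gato", "es", 5), ("Go", "de", 7),
]

# Stage 1: MAX(views) per (title, language).
# The first shuffle groups rows by this composite key.
stage1 = {}
for title, lang, views in rows:
    key = (title, lang)
    stage1[key] = max(stage1.get(key, 0), views)

# Stage 2: SUM the per-title maxima per language.
# The second shuffle re-keys the intermediate results by language alone.
stage2 = defaultdict(int)
for (title, lang), views in stage1.items():
    stage2[lang] += views

print(sorted(stage2.items(), key=lambda kv: -kv[1]))
# [('en', 30), ('de', 7), ('es', 5)]
```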
QUERY EXPLAIN for this query reveals a shuffle after both Stage 1 and Stage 2. However, rather than moving data from Stage 1 nodes to Stage 2 nodes, the query uses a separate, in-memory subservice to keep the shuffled data. This is a much cleaner architecture — BigQuery avoids the resource sloshing and bottlenecking associated with classical shuffle. It can also quickly recover from unexpected out-of-memory conditions associated with data skew in shuffle.
If we re-draw this QUERY EXPLAIN as a series of mini-steps across disparate storage, compute, and shuffle components, the query looks more like this:
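That decoupled layout can be caricatured as a tiny keyed store that producer workers write into and consumer workers read back out of. The `ShuffleService` class below is purely illustrative (no such API is exposed by BigQuery), but it shows the recovery property: because intermediate state lives outside the workers, a consumer that dies can simply re-read its partition rather than forcing upstream recomputation.

```python
from collections import defaultdict

class ShuffleService:
    """Illustrative stand-in for a shuffle kept in a separate subservice."""

    def __init__(self, num_partitions):
        self.partitions = [defaultdict(list) for _ in range(num_partitions)]

    def write(self, key, value):
        # Producers hash-partition their keyed output into the service.
        part = hash(key) % len(self.partitions)
        self.partitions[part][key].append(value)

    def read(self, part):
        # Reads are idempotent: re-reading after a consumer failure
        # returns the same data, with no upstream work redone.
        return dict(self.partitions[part])

svc = ShuffleService(num_partitions=2)
for key, val in [("en", 30), ("es", 5), ("de", 7), ("en", 12)]:
    svc.write(key, val)

# A consumer (or its replacement, after a crash) pulls its partitions back.
recovered = [svc.read(p) for p in range(2)]
total = sum(v for part in recovered for vals in part.values() for v in vals)
print(total)  # 54
```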
Now that the concepts involved here are clearer, let’s examine some real use cases.
Concurrent workloads at Motorola Mobility
As an avid user of BigQuery, Motorola Mobility has experienced the difference in performance before and after BigQuery implemented separation of compute and state. As a customer on BigQuery’s Flat Rate pricing model, it maintains its cluster at a consistent size.
Since BigQuery rolled out its most recent iteration of separation of compute and state in 2015, BigQuery usage at Motorola Mobility has nearly tripled. At the same time, both performance variability and query execution speeds improved dramatically.
Specifically, Motorola Mobility observed a significant decrease in performance variability for jobs averaging around 45 seconds of runtime (a typical query). Light queries (the lowest 25% in terms of data scanned) complete in 3 seconds on average. Meanwhile, Motorola Mobility’s heaviest 5% of queries now average around 100 seconds, down from almost 200 seconds a year and a half ago.
Benchmarking by AtScale
AtScale is a big data analytics software vendor whose founders experienced the power of first-generation big data technologies at Yahoo!. AtScale’s architects built what they could not buy, having tired of moving data out of Apache Hadoop into smaller relational engines to support analytic and business intelligence (BI) workloads. HDFS, Apache Spark SQL, and Apache Impala were the ideal starting points for bringing interactive queries to Hadoop.
In November 2016, AtScale added support for BigQuery and Cloud Dataflow. Soon after, its performance guru Trystan Leftwich (author of the BI on SQL-on-Hadoop benchmark) implemented a real-world BI workload test on BigQuery. Leftwich found the following advantages:
- Data loading — The process of moving data to Google Cloud and loading it into BigQuery is simple, scalable, and well-documented.
- Out-of-the-box performance — The BigQuery engine performs well out-of-the-box with minimal query tuning and no system configuration.
- Impressive concurrency — BigQuery’s serverless model shows no query degradation on small data sets, even at volumes above 25 concurrent BI users.
Separation of compute and state is a relatively new architectural consideration, but one that lets BigQuery and Cloud Dataflow push the boundaries of performance, efficiency, reliability, and scalability. Ultimately, customers benefit from more powerful services and greater competition.
To experiment with BigQuery and Cloud Dataflow, we recommend checking out Google Cloud’s perpetual free tier. You’ll get 10GB of BigQuery storage and 1TB of BigQuery processing for free every month.