Google Cloud Platform
Separation of compute and state in Google BigQuery and Cloud Dataflow (and why it matters)
As previously discussed on this blog, one benefit of cloud-native data analytics technologies is the separation of storage and compute. Using cloud-native storage services instead of instance-attached storage can be easier to manage and scale, and potentially more affordable as well. Furthermore, by avoiding long-term state in your instances, you can rapidly scale your processing power and leverage cheap ephemeral compute capacity.
Two Google Cloud Platform (GCP) services — Google BigQuery (a fully managed, petabyte-scale, low-cost enterprise data warehouse) and Google Cloud Dataflow (for running batch and streaming data processing pipelines with equal expressiveness and reliability) — push this envelope even further by separating compute and state. This approach can improve performance and efficiency, especially for workloads with high contention and concurrency (which are typical of real-world analytics use cases). Separation of compute and state is the “secret sauce” that makes BigQuery so good at concurrency.
Defining separation of compute and state
Separation of compute and state refers to the ability to maintain intermediate state between processing stages in a high-performance component that is separate from either the compute cluster or storage. You may recognize its utility in the shuffle step; the key difference is that the shuffle state is maintained by a separate sub-service. Whereas the shuffle task is often the source of processing bottlenecks, the separation of compute and state turns this weakness into a strength.
- Less state in compute means compute becomes more ephemeral and scalable. It’s easier to re-parallelize processing both intra-stage and inter-stage, and easier to recover from a lost node.
- Processing is more streamlined: processing stages no longer compete for resources within the same compute nodes, which would otherwise cause contention and bottlenecks.
- It’s easier for the processing engine to re-partition workloads between stages.
- Your processing engine can take advantage of pipelined execution. In other words, it doesn’t have to wait for Stage N to finish before starting Stage N+1.
- The processing engine can implement dynamic work repartitioning (the ability to re-parallelize work due to slow workers or data skew).
- Keeping less state in processing nodes makes workloads more resilient to individual node issues.
- The service can utilize available resources much more efficiently across compute as well as shuffle.
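To make the idea concrete, here is a minimal, single-process Python sketch of an external shuffle layer. This is an illustration only, not BigQuery's or Dataflow's actual implementation; the `ShuffleService` class and the sample data are invented for this example. The point is that producers write intermediate state into a standalone component rather than onto peer workers, so any worker can pick up any partition.

```python
from collections import defaultdict

# Hypothetical sketch of a standalone in-memory "shuffle service" that holds
# intermediate state between stages, so worker nodes stay stateless.
class ShuffleService:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._partitions = defaultdict(list)  # partition id -> (key, value) rows

    def write(self, key, value):
        # Producers hash-partition their output into the service,
        # not into a peer worker's local disk or memory.
        self._partitions[hash(key) % self.num_partitions].append((key, value))

    def read(self, partition):
        # Any consumer can claim any partition; if a worker dies,
        # another worker simply re-reads the same partition.
        return self._partitions[partition]

# Stage 1 workers emit (language, views) pairs into the shuffle service.
shuffle = ShuffleService(num_partitions=4)
stage1_output = [("en", 10), ("fr", 3), ("en", 7), ("de", 5)]
for key, value in stage1_output:
    shuffle.write(key, value)

# Stage 2 workers each aggregate one partition, independently and in parallel.
totals = defaultdict(int)
for p in range(shuffle.num_partitions):
    for key, value in shuffle.read(p):
        totals[key] += value

print(dict(totals))  # totals: en=17, fr=3, de=5 (partition order varies)
```

Because the shuffled state lives outside the workers, Stage 2 consumers can start reading completed partitions without waiting for the whole of Stage 1, which is what makes pipelined execution and dynamic repartitioning straightforward.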
- BigQuery’s current implementation of the Dremel distributed query engine deviates from the architecture described in the Dremel paper from 2010. Subsequent versions of Dremel and BigQuery leverage dynamic processing trees and a separate in-memory shuffle component.
- While historically Cloud Dataflow has implemented shuffle within the processing nodes, using its new shuffle service functionality (in beta at the time of this writing) leads to up to a 5x performance improvement.
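For Cloud Dataflow, opting into the service-based shuffle is a matter of pipeline options. The sketch below uses the launch flag documented for the beta (`--experiments=shuffle_mode=service`); the module name, project, bucket, and region values are placeholders, not from the original post.

```shell
# Launch a batch Apache Beam (Python) pipeline using the service-based
# Dataflow Shuffle instead of worker-based shuffle.
# my_pipeline, my-project, and my-bucket are placeholders.
python -m my_pipeline \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --staging_location=gs://my-bucket/staging \
  --temp_location=gs://my-bucket/temp \
  --experiments=shuffle_mode=service
```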
```sql
SELECT language, SUM(views) AS views
FROM (
  SELECT title, language, MAX(views) AS views
  -- (source table elided in the original)
  WHERE title LIKE "G%o%"
  GROUP BY title, language
)
GROUP BY language
ORDER BY views DESC
```
A QUERY EXPLAIN for this query reveals a shuffle after both Stage 1 and Stage 2. However, rather than moving data from Stage 1 nodes to Stage 2 nodes, the query uses a separate, in-memory sub-service to keep the shuffled data. This is a much cleaner architecture: BigQuery avoids the resource sloshing and bottlenecking associated with classical shuffle, and it can quickly recover from unexpected out-of-memory conditions associated with data skew in shuffle.
If we re-draw this QUERY EXPLAIN as a series of mini-steps across disparate storage, compute, and shuffle components, the query looks more like this:
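Those mini-steps can also be traced in code. The toy Python below is a single-process stand-in (the sample rows are invented, and the filter merely approximates LIKE "G%o%"); it is meant to show the shape of the stages and the external shuffle between them, not Dremel's actual mechanics.

```python
from collections import defaultdict

# Storage: sample (title, language, views) rows standing in for the source table.
rows = [
    ("Go", "en", 120), ("Go", "en", 90),
    ("Gecko", "en", 40), ("Gelato", "it", 70),
    ("Go", "de", 30), ("Gelato", "it", 25),
]

# Stage 1 (compute): filter titles and take MAX(views) per (title, language).
stage1 = defaultdict(int)
for title, language, views in rows:
    if title.startswith("G") and "o" in title[1:]:  # stands in for LIKE "G%o%"
        stage1[(title, language)] = max(stage1[(title, language)], views)

# Shuffle (separate state): repartition the Stage 1 results by language.
shuffle = defaultdict(list)
for (title, language), views in stage1.items():
    shuffle[language].append(views)

# Stage 2 (compute): SUM(views) per language, then ORDER BY views DESC.
result = sorted(((lang, sum(v)) for lang, v in shuffle.items()),
                key=lambda kv: kv[1], reverse=True)
print(result)  # [('en', 160), ('it', 70), ('de', 30)]
```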
Now that the concepts involved here are clearer, let’s examine some real use cases.
Concurrent workloads at Motorola Mobility
As an avid user of BigQuery, Motorola Mobility has experienced the differences in performance before and after BigQuery implemented separation of compute and state. As a consumer of BigQuery’s Flat Rate pricing model, its allocation of compute resources stays constant in size.
Since BigQuery rolled out its most recent iteration of separation of compute and state in 2015, BigQuery usage at Motorola Mobility has nearly tripled. Over the same period, performance variability decreased and query execution speeds improved dramatically.
Specifically, Motorola Mobility observed a significant decrease in the performance variability of jobs with an average runtime of around 45 seconds (a typical query). Light queries (the lowest 25% in terms of data scanned) complete in 3 seconds on average. Meanwhile, Motorola Mobility’s heaviest 5% of queries now average around 100 seconds, down from almost 200 seconds a year and a half ago.
Benchmarking by AtScale
AtScale is a big data analytics software vendor whose founders experienced the power of first-generation big data technologies at Yahoo!. AtScale’s architects built what they could not buy once they tired of moving data out of Apache Hadoop into small relational engines to support analytics and business intelligence (BI) workloads. HDFS, Apache Spark SQL, and Apache Impala were the ideal starting points for bringing interactive queries to Hadoop.
- Data loading — The process of moving data to the Google cloud and loading it into BigQuery is simple, scalable, and well-documented.
- Out-of-the-box performance — The BigQuery engine performs well out-of-the-box with minimal query tuning and no system configuration.
- Impressive concurrency — BigQuery’s serverless model shows no query degradation on small data sets, even at query volumes above 25 concurrent BI users.
(Chart courtesy of AtScale, used with permission.)