Google Cloud

BigQuery and Dataproc shine in independent big data platform comparison

May 17, 2016

Felipe Hoffa

Developer Advocate, Google Cloud Platform

I'm a big fan of Mark Litwintschik's latest series of blog posts. He's been measuring every big data platform he can find using the billion taxi trips made available as open data by the NYC TLC. Spoiler alert: I love the stellar results for Google BigQuery and Google Cloud Dataproc — but before we get there, here’s my visualization of Mark’s findings in a nutshell:

https://storage.googleapis.com/gweb-cloudblog-publish/images/BigQuery-shines-19hsv.max-700x700.PNG

Reviewing Mark's first series of findings:

PostgreSQL

Initial results for a simple query (COUNT+GROUP BY): >1 hour
Same query using a columnar storage extension: >2 minutes
A COUNT+AVG query: ~3 minutes

ElasticSearch

Initial experiments weren't able to load all the data ("There currently isn't a single 2TB SSD drive big enough to store 1,711 GB of index data and have enough space to not trigger the high watermarks")
Reading and indexing the data took "just under 70 hours"
COUNT+GROUP BY query: ~18 seconds
COUNT+AVG query: ~8 seconds
A month later Mark found a way to load the full dataset into Elasticsearch, but results were slower

Spark on AWS EMR

A Spark on Amazon EMR "cluster can take 20 - 30 minutes to finish provisioning and bootstrapping all of the nodes" (meanwhile Google Cloud Dataproc clusters are available in less than 90 seconds)
Moving data to Parquet can "take around 2 hours"
Queries are 4x to 40x slower than Presto on AWS

Presto on AWS EMR, 5 nodes

Starting a cluster on AWS EMR takes more than 20 minutes (why not 90 seconds?)
Loading data "took 3 hours and 45 minutes to complete"
The COUNT(*) and AVG() queries took ~70 seconds

Hive vs Presto, on a local machine

Queries between 2 and 5 minutes; Presto ~4x faster than Hive

AWS Redshift, within the free tier

Loading at the measured rate: "the whole dataset should finish in about 6 - 8 hours”
Instead, to keep Redshift within the free trial tier only 20% of the data was loaded. "The 200M records loaded in 1 hour, 4 minutes and 25 seconds”
This post doesn't mention any benchmarks for the queries, but it does mention how important it is to keep running VACUUM and ANALYZE commands to ensure Redshift performs well (these operations slow down the database, prevent users from using it, and Redshift still charges for the server/hours spent on these tasks)

AWS Redshift, large cluster:

Loading the full dataset on Redshift took around 10 hours, while experts gave their best advice on Twitter
Queries took around a minute

All results looked pretty normal (until now). But then — on April 6th 2016 — Mark decided to give BigQuery a try:

Google BigQuery:

Loading a billion rows took 25 minutes
Queries took ~2 seconds

Wow. Why?

No cluster to deploy, no sizing decision to make
Loading data into BigQuery is fast (getting faster), and free (you only pay for storage once it's loaded), unlike other platforms that charge CPU hours to load data
Queries are many times faster than anything else tested so far
No optimizations were needed. BigQuery is fast and efficient by default
Want to try it yourself? The one billion rows are loaded into BigQuery as open data, so you don't need to wait 25 minutes. Just use your free monthly quota to query this and other open datasets.

The quest could have ended there. We got something way faster and simpler than anything else, but Mark had a bigger goal: He wanted to test every platform and configuration possible. Here's what happened next:

Presto on AWS EMR with 50 nodes

Again, AWS EMR takes around 20 minutes to get a cluster ready to run (note, GCP takes ~90 seconds for a similar task)
The data was already loaded as ORC files on S3, so there was no need to load it again
With 10 times the nodes (compared to the previous Presto-on-EMR run), queries only got 1.3x to 3x faster (~40 seconds)

Presto on Google Cloud Dataproc

Mark doesn't mention how long it took to launch the cluster, but based on earlier results, 90 seconds, plus installing Presto (~30 seconds)
Transforming the data from CSV to ORC took ~3 hours to complete
Queries were initially 4x times slower than on AWS EMR

That's right. Mark's initial test of Dataproc got slower queries than the equivalent AWS EMR configuration. But guess what, the story doesn't end here. One of my favorite elements of this series of blog posts, is how it has attracted some big data players to Mark's inbox. So far I've seen tweets from Rahul Pathak, head of EMR at AWS, Jeff Barr, head evangelist at AWS as well as ElasticSearch engineer, Zachary Tong, which is great. We are engineers. We see benchmarks and we jump to improve them. On Google's side, Dennis Huo, tech lead and OSS software engineer for big data also jumped in. So the race got even more interesting:

S3 vs HDFS (on AWS):

There are many advantages for keeping data on S3 (or the equivalent GCS), but Mark discovered a ~1.5x performance advantage for having a copy of the data loaded on HDFS (some queries down to ~40 seconds)

Google Cloud Dataproc revisited: 33x faster

10 nodes, ~2 minutes to launch (I know I said ~90 seconds earlier, but in this case we are also installing Presto, all within these ~120 seconds)
Playing with compression alternatives (Zlib, Snappy), index stride sizes and index strip sizes brought queries down to 44 seconds
And then he tried HDFS instead of GCS: Some queries went down to ~10 seconds

Did you see that? Queries are down to 10 seconds! We haven't seen similar performance for Presto on any other configuration or service . . . so far.

Google Cloud Dataproc with 50 nodes (win):

Woohoo! Query results in 4 seconds

Finally! After many tries and optimizations, Mark found a configuration that could run these queries on Presto just half as slow as on BigQuery. And only on Google Cloud — so far. As I mentioned earlier, the most interesting part of this race is how many players and platforms are coming forward looking for the best tweaks to offer a solution that can get as fast as BigQuery.

https://storage.googleapis.com/gweb-cloudblog-publish/images/BigQuery-shines-2lcs2.max-700x700.PNG

In summary, Google Cloud Dataproc is winning "the Presto configuration race" with 4 second queries. No doubt there will be very precisely-tuned configurations that may improve processing efficiency further. But from Mark’s results, it’s clear that BigQuery is the best choice on price and performance and if you want to run Presto, then GCP is the best place to run it.

This kind of competition is awesome. When everyone competes to get you faster and more cost effective results, everyone wins — especially you.

Posted in