Posted by Felipe Hoffa, Developer Advocate, Google Cloud Platform
I'm a big fan of Mark Litwintschik's latest series of blog posts. He's been measuring every big data platform he can find using the billion taxi trips made available as open data by the NYC TLC. Spoiler alert: I love the stellar results for Google BigQuery and Google Cloud Dataproc — but before we get there, here’s my visualization of Mark’s findings in a nutshell:
Reviewing Mark's first series of findings:
- Initial results for a simple query (COUNT+GROUP BY): >1 hour
- Same query using a columnar storage extension: >2 minutes
- A COUNT+AVG query: ~3 minutes
- Initial experiments weren't able to load all the data ("There currently isn't a single 2TB SSD drive big enough to store 1,711 GB of index data and have enough space to not trigger the high watermarks")
- Reading and indexing the data took "just under 70 hours"
- COUNT+GROUP BY query: ~18 seconds
- COUNT+AVG query: ~8 seconds
- A month later Mark found a way to load the full dataset into Elasticsearch, but results were slower
- A Spark on Amazon EMR "cluster can take 20 - 30 minutes to finish provisioning and bootstrapping all of the nodes" (meanwhile Google Cloud Dataproc clusters are available in less than 90 seconds)
- Moving data to Parquet can "take around 2 hours"
- Queries are 4x to 40x slower than Presto on AWS
- Starting a cluster on AWS EMR takes more than 20 minutes (why not 90 seconds?)
- Loading data "took 3 hours and 45 minutes to complete"
- The COUNT(*) and AVG() queries took ~70 seconds
- Queries between 2 and 5 minutes; Presto ~4x faster than Hive
- Loading at the measured rate: "the whole dataset should finish in about 6 - 8 hours”
- Instead, to keep Redshift within the free trial tier only 20% of the data was loaded. "The 200M records loaded in 1 hour, 4 minutes and 25 seconds”
- This post doesn't mention any benchmarks for the queries, but it does mention how important it is to keep running VACUUM and ANALYZE commands to ensure Redshift performs well (these operations slow down the database, prevent users from using it, and Redshift still charges for the server/hours spent on these tasks)
- Loading the full dataset on Redshift took around 10 hours, while experts gave their best advice on Twitter
- Queries took around a minute
All results looked pretty normal (until now). But then — on April 6th 2016 — Mark decided to give BigQuery a try:
- Loading a billion rows took 25 minutes
- Queries took ~2 seconds
- No cluster to deploy, no sizing decision to make
- Loading data into BigQuery is fast (getting faster), and free (you only pay for storage once it's loaded), unlike other platforms that charge CPU hours to load data
- Queries are many times faster than anything else tested so far
- No optimizations were needed. BigQuery is fast and efficient by default
- Want to try it yourself? The one billion rows are loaded into BigQuery as open data, so you don't need to wait 25 minutes. Just use your free monthly quota to query this and other open datasets.
- Again, AWS EMR takes around 20 minutes to get a cluster ready to run (note, GCP takes ~90 seconds for a similar task)
- The data was already loaded as ORC files on S3, so there was no need to load it again
- With 10 times the nodes (compared to the previous Presto-on-EMR run), queries only got 1.3x to 3x faster (~40 seconds)
- Mark doesn't mention how long it took to launch the cluster, but based on earlier results, 90 seconds, plus installing Presto (~30 seconds)
- Transforming the data from CSV to ORC took ~3 hours to complete
- Queries were initially 4x times slower than on AWS EMR
That's right. Mark's initial test of Dataproc got slower queries than the equivalent AWS EMR configuration. But guess what, the story doesn't end here. One of my favorite elements of this series of blog posts, is how it has attracted some big data players to Mark's inbox. So far I've seen tweets from Rahul Pathak, head of EMR at AWS, Jeff Barr, head evangelist at AWS as well as ElasticSearch engineer, Zachary Tong, which is great. We are engineers. We see benchmarks and we jump to improve them. On Google's side, Dennis Huo, tech lead and OSS software engineer for big data also jumped in. So the race got even more interesting:
- There are many advantages for keeping data on S3 (or the equivalent GCS), but Mark discovered a ~1.5x performance advantage for having a copy of the data loaded on HDFS (some queries down to ~40 seconds)
- 10 nodes, ~2 minutes to launch (I know I said ~90 seconds earlier, but in this case we are also installing Presto, all within these ~120 seconds)
- Playing with compression alternatives (Zlib, Snappy), index stride sizes and index strip sizes brought queries down to 44 seconds
- And then he tried HDFS instead of GCS: Some queries went down to ~10 seconds
- Woohoo! Query results in 4 seconds
Finally! After many tries and optimizations, Mark found a configuration that could run these queries on Presto just half as slow as on BigQuery. And only on Google Cloud — so far. As I mentioned earlier, the most interesting part of this race is how many players and platforms are coming forward looking for the best tweaks to offer a solution that can get as fast as BigQuery.
In summary, Google Cloud Dataproc is winning "the Presto configuration race" with 4 second queries. No doubt there will be very precisely-tuned configurations that may improve processing efficiency further. But from Mark’s results, it’s clear that BigQuery is the best choice on price and performance and if you want to run Presto, then GCP is the best place to run it.
This kind of competition is awesome. When everyone competes to get you faster and more cost effective results, everyone wins — especially you.