Celebrating a decade of data: BigQuery turns 10

May 20, 2020

Jordan Tigani

Director of Product Management, BigQuery

Editor’s note: Today we’re hearing from some of the team members involved in building BigQuery over the past decade, and even before. Our thanks go to Jeremy Condit, Dan Delorey, Sudhir Hasbe, Felipe Hoffa, Chad Jennings, Jing Jing Long, Mosha Pasumansky, Tino Tereshko, William Vambenepe, and Alicia Williams.

This month, Google’s cloud data warehouse BigQuery turns 10. From its infancy as an internal Google product to its current status as a petabyte-scale data warehouse helping customers make informed business decisions, it’s been in a class of its own. We got together to reflect on some of the technical milestones and memorable moments along the way. Here are a few of them:

Applying SQL to big data was a big deal.

When we started developing BigQuery, the ability to perform big data tasks using SQL was a huge step. At that time, either you had a small database that used SQL, or you used MapReduce. Hadoop was just emerging then, so for large queries, you had to put on your spelunking hat and use MapReduce.

Since MapReduce was too hard to use for complex problems, we developed Sawzall to run on top of MapReduce to simplify and optimize those tasks. But Sawzall still wasn’t interactive. We then built Dremel, BigQuery’s forerunner, to serve Google’s internal data analysis needs. When we were designing it, we aimed for high performance, since users needed fast results, along with richer semantics and more effective execution than MapReduce. At the time, people expected to wait hours to get query results, and we wanted to see how we could get queries processed in seconds. That’s important technically, but it’s really a way to encourage people to get more out of their data. If you can get query results quickly, that engenders more questions and more exploration.
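
To make the contrast between hand-written MapReduce and SQL concrete, here is a minimal sketch of the same aggregation written both ways. It is only an illustration: the sample records and function names are hypothetical, Python's built-in sqlite3 module stands in for a SQL engine, and none of this reflects how MapReduce, Sawzall, or Dremel are actually implemented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log records: (country, bytes_served). Illustrative data only.
events = [("US", 120), ("DE", 300), ("US", 80), ("FR", 50), ("DE", 10)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for country, nbytes in records:
        yield country, nbytes


def shuffle_phase(pairs):
    """Group all values that share a key, as a shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


# MapReduce style: three explicit phases the author has to write and wire up.
mapreduce_totals = reduce_phase(shuffle_phase(map_phase(events)))

# SQL style: the same question expressed as one declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_totals = dict(
    conn.execute("SELECT country, SUM(bytes) FROM events GROUP BY country")
)
conn.close()
```

Both approaches produce the same per-country totals. The difference is that the SQL version states what is wanted and leaves the grouping and summing to the engine, which is part of what makes quick follow-up questions practical.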

Our internal community cheered us on.

Dremel was something we had developed internally at Google to analyze data faster, in turn improving our Search product. Dremel became BigQuery’s query engine, and by the time we launched BigQuery, Dremel was a popular product that many Google employees relied on. It powered data search beyond server logs, such as for dashboards, reports, emails, spreadsheets, and more.

A lot of the value of Dremel was its operating model, where users focused on sending queries and getting results without being responsible for any technical or operational back end. (We call that “serverless” now, though it didn’t have a term back then.) A Dremel user put data on a shared storage platform and could immediately query it, any time. Faster performance was an added bonus.

We built Dremel as a cloud-based data engine, similar to what we had done with App Engine, for internal users. We saw how useful it was for Google employees, and wanted to use those concepts for a broader external audience. To build BigQuery into an enterprise data warehouse, we kept the focus on the value of serverless, which is a lot more convenient and doesn’t require management overhead.

In those early days of BigQuery, we heard from users frequently on StackOverflow. We’d see a comment and address it that afternoon. We started out scrappy, and really closely looped in with the community. Those early fans were the ones who helped us mature and expand our support team.

We also worked closely with our first hyperscale customer as they ramped up to using thousands of slots (BigQuery’s unit of computational capacity), then the next customer after that. This kind of hyperscale has been possible because of Google’s networking infrastructure, which allowed us to build into Dremel a shuffler that used disaggregated memory.

The team also launched two file formats, one of which inspired emulation by developers outside Google. ColumnIO inspired the column encoding of open-source Parquet, a columnar storage format. And the Capacitor format used a columnar approach that supports semistructured data. The idea of using a columnar format for this type of analytics work was new in the industry back then, but already popular.
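
As a rough illustration of why a columnar layout suits analytics work, the sketch below stores the same records row by row and column by column. The field names and data are hypothetical, and this is not how ColumnIO, Capacitor, or Parquet encode data on disk; it only shows the basic idea that an aggregate over one field needs to touch just that field's values.

```python
from typing import Any, Dict, List

# Hypothetical sample records in a row-oriented layout (one dict per row).
rows: List[Dict[str, Any]] = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list of values per field."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def sum_row_oriented(records: List[Dict[str, Any]], field: str) -> int:
    """Visit every whole record, even though only one field is needed."""
    return sum(record[field] for record in records)


def sum_column_oriented(cols: Dict[str, List[Any]], field: str) -> int:
    """Read only the single column the aggregate asks for."""
    return sum(cols[field])


row_total = sum_row_oriented(rows, "bytes")
column_total = sum_column_oriented(columns, "bytes")
```

Both totals are identical. In the column-oriented version the `user` and `country` values are never read, which is the property that matters when a table has many columns and a query touches only a few.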

Tech concepts and assumptions changed quickly.

Ten years ago in the data warehouse market, high scalability meant high cost. But BigQuery brought a new way of thinking about big data into a data warehouse format that could scale quickly at low cost. The user can be front and center and doesn’t have to worry about infrastructure—and that’s defined BigQuery from the start. Separating storage and processing was a big shift. The method ten years ago was essentially just to throw compute resources at big data problems, so that users often ran out of room in their data warehouse, thus running out of querying ability. In 2020, it’s become much cheaper to keep a lot of data in a ready-to-query store, even if it isn’t queried often.

Along the way, we’ve added lots of features to BigQuery, making it a mature and scalable data warehousing platform. We’ve also really enjoyed hearing from BigQuery customers about the projects they’ve used it for. BigQuery users have run more than 10,000 concurrent queries across their organization. We’ve heard over the years about projects like DNA analysis, astronomical queries, and more, and we see businesses across industries using BigQuery today.

We also had our founding engineering team record a video celebrating BigQuery’s decade in data, talking about some of their memorable moments, naming the product, and favorite facts and innovations—plus usage tips. Check it out here:

What’s next for data analytics?

Ten years later, a lot has changed. What we used to call big data is now, essentially, just data. It’s an embedded part of business and IT teams.

When we started BigQuery, we asked ourselves, “What if all the world’s data looked like one giant database?” In the last 10 years, we’ve come a lot closer to achieving that goal than we had thought possible. Ten years from now, will we still even need different operational databases and data warehouses and data lakes and business intelligence tools? Do we still need to treat structured data and unstructured data differently… isn’t it all just “data”? And then, once you have all of your data in one place, why should you even need to figure out on your own what questions to ask? Advances in AI, ML and NLP will transform our interactions with data to a degree that we cannot fully imagine today.

No matter what brave new world of data lies ahead, we’ll be developing and dreaming to help you bring your data to life. We’re looking forward to lots more exploration. And you can join the community monthly in the BigQuery Data Challenge.

We’re also excited to announce the BigQuery Trial Slots promotional offer for new and returning BigQuery customers. This lets you purchase 500 slots for $500 per month for six months. This is a 95% discount from current monthly pricing. This limited-time offer is subject to available capacity and qualification criteria, and is available while supplies last. Learn more here. To express interest in this promotion, fill out this form and we'll be in touch with the next steps.
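
For readers who want to check the numbers, here is a small sketch of the arithmetic implied by the figures above. It takes the stated price, slot count, discount, and duration at face value. The derived list price is an inference from those figures, not a quoted rate, and the variable names are illustrative.

```python
from fractions import Fraction

# Figures stated in the offer above.
promo_slots = 500
promo_monthly_usd = Fraction(500)
discount = Fraction(95, 100)
promo_months = 6

# If the promotional price is 5% of the regular price, the implied regular
# monthly price is the promotional price divided by (1 - discount).
implied_list_monthly_usd = promo_monthly_usd / (1 - discount)

# Per-slot monthly cost under each price.
promo_per_slot_usd = promo_monthly_usd / promo_slots
implied_list_per_slot_usd = implied_list_monthly_usd / promo_slots

# Total spend across the six promotional months.
promo_total_usd = promo_monthly_usd * promo_months

print(f"Implied regular price: ${int(implied_list_monthly_usd):,} per month")
print(f"Promotional cost per slot: ${int(promo_per_slot_usd)} per month")
print(f"Implied regular cost per slot: ${int(implied_list_per_slot_usd)} per month")
print(f"Total over {promo_months} months: ${int(promo_total_usd):,}")
```

Exact fractions are used instead of floats so that the division by 0.05 does not introduce rounding error.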

We’re also hosting a very special BigQuery live event today, May 20, at 12PM PDT with hosts Felipe Hoffa and Yufeng Guo. Check it out.
