Google BigQuery public datasets now include Stack Overflow Q&A

December 15, 2016
Felipe Hoffa

Developer Advocate, Google Cloud Platform

Exploring hidden trends and relationships in Stack Overflow data is a good lesson in doing SQL analytics with BigQuery.

Great news: we’ve just added Stack Overflow's history of questions and answers to the collection of public datasets on BigQuery. This means that anyone with a Google Cloud Platform account can use SQL queries (or some other favorite tool) to dig into this treasure trove of data.

You can find some sample queries on the Stack Overflow Data documentation page, for example:

  • "What percentage of questions have been answered over the years?"
  • "What is the reputation and badge count of users across different tenures on Stack Overflow?"
  • "What are the 10 ‘easiest’ gold badges to earn?"
  • "Which day of the week has the most questions answered within an hour?"

Take these questions as a starting point, then feel free to share your results and query variations with us via reddit.com/r/bigquery. And if you have any questions, ask the community on Stack Overflow.
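
As a rough sketch of the first question above, the query below computes the yearly share of questions with at least one answer. It assumes the public table `bigquery-public-data.stackoverflow.posts_questions` with `creation_date` and `answer_count` columns, and the `google-cloud-bigquery` Python client with default credentials. The helper name `run_query` is made up for illustration, and the query on the documentation page may differ.

```python
# Standard SQL sketch: yearly share of questions with at least one answer.
# Assumes the public table and its creation_date / answer_count columns.
PERCENT_ANSWERED_SQL = """
SELECT
  EXTRACT(YEAR FROM creation_date) AS year,
  COUNT(*) AS number_of_questions,
  ROUND(100 * COUNTIF(answer_count > 0) / COUNT(*), 1) AS percent_answered
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY year
ORDER BY year
"""


def run_query(sql):
    """Run a query with default credentials and return the rows as dicts."""
    # Imported here so this module loads even if the client is not installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

Calling `run_query(PERCENT_ANSWERED_SQL)` from an authenticated environment should return one dict per year. The same SQL can be pasted directly into the BigQuery web UI.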

Diving into the data

You might be wondering: What's so special about querying Stack Overflow with BigQuery? After all, Stack Overflow already refers users to Stack Exchange Data Explorer (SEDE), a data-focused site where users have shared and prioritized thousands of questions, and that works really well. So, let's review some of the advantages of having Stack Overflow data in BigQuery too:

  • Surpass the 50,000 row limit. SEDE can only output up to 50,000 rows. This is not a problem for BigQuery.
  • Robots welcome. SEDE protects itself from abuse with CAPTCHAs, and has no API. With BigQuery, no CAPTCHAs are needed to log in, and its REST API allows a variety of tools to leverage its power. Feel free to connect Tableau, re:dash, Looker, R, pandas, and your favorite tools to it.
  • JOIN everything. There are plenty of other datasets shared on BigQuery, and there’s nothing stopping you from loading even more, privately or for public consumption. Imagine the questions you could answer by querying across them.

Let’s look at an example of joining. We have terabytes of GitHub's open source code shared on BigQuery. Let’s find out which Stack Overflow questions are most referenced in the GitHub code, specifically in JavaScript.
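
The query itself is not reproduced here. As a small local illustration of the matching step such a join depends on, the Python sketch below pulls Stack Overflow question IDs out of source text with a regular expression and counts them. The pattern, the function name `count_question_references`, and the sample snippets are all assumptions made for illustration, not the original query.

```python
import re
from collections import Counter

# Matches links such as stackoverflow.com/questions/12345/some-title
# and the short form stackoverflow.com/q/12345; captures the numeric ID.
SO_QUESTION_RE = re.compile(r"stackoverflow\.com/(?:questions|q)/(\d+)")


def count_question_references(sources):
    """Count how often each Stack Overflow question ID appears in the sources."""
    counts = Counter()
    for text in sources:
        counts.update(SO_QUESTION_RE.findall(text))
    return counts


# Made-up snippets standing in for JavaScript file contents.
samples = [
    "// see http://stackoverflow.com/questions/111/example-title\nvar a = 1;",
    "/* stackoverflow.com/q/111 and stackoverflow.com/questions/222 */",
]
top = count_question_references(samples).most_common(1)
```

In BigQuery, the same idea would presumably be expressed with a function such as `REGEXP_EXTRACT` over the file contents, followed by a join of the extracted IDs against the Stack Overflow questions table.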

Here are the most referenced Stack Overflow questions within JavaScript code on GitHub:

https://storage.googleapis.com/gweb-cloudblog-publish/images/google-bigquery-public-datasets-stack-overfl.max-800x800.PNG

Or, we can look at GitHub pull-request comments from GHTorrent (also on BigQuery):

Here are the results:

https://storage.googleapis.com/gweb-cloudblog-publish/images/google-bigquery-public-datasets-stack-overfl.max-900x900_Blec08z.PNG

Or, let's look at Hacker News. What are the most popular tags of Stack Overflow questions that have been posted there since 2014?
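
Once the linked questions are identified, ranking tags comes down to splitting each question's tag list and counting. The Python sketch below shows that counting step locally. It assumes tags arrive as a pipe-delimited string such as `c++|performance`; the function name `top_tags` and the sample rows are made up and are not real query results.

```python
from collections import Counter


def top_tags(tag_strings, n=3):
    """Split pipe-delimited tag strings and return the n most common tags."""
    counts = Counter()
    for tags in tag_strings:
        # Skip empty pieces so a blank tag string contributes nothing.
        counts.update(tag for tag in tags.split("|") if tag)
    return counts.most_common(n)


# Made-up tag strings for illustration only.
rows = ["c++|performance", "haskell", "c++|c"]
result = top_tags(rows, n=1)
```

Running the same function over all Stack Overflow questions, rather than only those linked from Hacker News, would give the baseline for a comparison between the two.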

Here are the most popular tags on Stack Overflow questions linked from Hacker News since 2014:

https://storage.googleapis.com/gweb-cloudblog-publish/images/google-bigquery-public-datasets-stack-overfl.max-300x300_nKCLF8C.PNG

How does that compare to the rest of Stack Overflow?

It would seem that the Hacker News community cares a lot more about Haskell, C, C++, and performance than Stack Overflow as a whole, which lists php, android, jquery, and css among its most popular tags:

https://storage.googleapis.com/gweb-cloudblog-publish/images/google-bigquery-public-datasets-stack-overfl.max-300x300_ly0KuKx.PNG

Next steps

If you haven't tried BigQuery yet, follow this Beginner’s Tutorial, which shows how to analyze 50 billion page views in 5 seconds. Then feel free to play with any other query or dataset you like: for example, our official public BigQuery datasets, datasets that other users have shared, and of course your very own.
