A public dataset is any dataset that is stored in BigQuery and made available to the general public. This page lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1 TB per month is free, subject to query pricing details).
A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
NYC TLC Trips
Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
A dataset that contains all stories and comments from Hacker News since its launch in 2006.
USA Disease Surveillance
A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
GDELT Book Corpus
A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. This dataset draws from more than 20 sources, including some data from every year since 1763.
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.
This public dataset contains GitHub activity data for more than 2.8 million open source GitHub repositories, more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files.
Major League Baseball Data
This public dataset contains pitch-by-pitch activity data for Major League Baseball (MLB) in 2016.
How to query public data sets using BigQuery
BigQuery is a fully managed data warehouse and analytics platform. The public datasets listed on this page are available for you to analyze using SQL queries. You can access BigQuery public datasets using the web UI, the command-line tool, or calls to the BigQuery REST API through client libraries for languages such as Java, .NET, or Python.
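As a rough illustration of the client-library route, the sketch below wraps a query call using the `google-cloud-bigquery` Python package. The helper name `run_query` is hypothetical, and the sketch assumes the package is installed and that default credentials and a project are already configured; treat it as one possible shape rather than the recommended one.

```python
from typing import Any, Dict, List, Optional


def run_query(sql: str, client: Optional[Any] = None) -> List[Dict[str, Any]]:
    """Run a SQL query and return the result rows as plain dictionaries.

    `run_query` is a hypothetical helper for illustration. If no client is
    passed, a default BigQuery client is created.
    """
    if client is None:
        # Imported lazily so the helper can also be exercised with a
        # stand-in client when the library is not installed.
        from google.cloud import bigquery

        client = bigquery.Client()  # assumes default project and credentials

    # query() starts a query job; result() blocks until the job finishes.
    rows = client.query(sql).result()
    return [dict(row) for row in rows]


# Example call (requires network access and credentials):
#   for row in run_query("SELECT 1 AS answer"):
#       print(row)
```

Passing the client in as an argument keeps the helper easy to test and lets one client be reused across several queries.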
The first terabyte of data processed per month is free, so you can start querying datasets without enabling billing. To get started running some sample queries, select or create a project and then run the example queries on the NOAA GSOD weather dataset.
- Select or create a Cloud Platform Console project.
- Go to the NOAA GSOD dataset in the BigQuery Web UI.
- Click the COMPOSE QUERY button.
- Copy and paste the SQL examples on the NOAA GSOD page.
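The same steps can also be approximated from code. The sketch below builds a sample query against one year of GSOD data and checks a dry-run size estimate against the free monthly allowance. The table name `bigquery-public-data.noaa_gsod.gsodYYYY`, the column names, the helper names, and the choice to count 1 TB as 2**40 bytes are all assumptions made for illustration; check the NOAA GSOD page and the query pricing details for the authoritative values.

```python
from typing import Any, Optional

# Assumption: the free "1 TB" is counted as 2**40 bytes.
FREE_TIER_BYTES = 2**40


def gsod_coldest_days_sql(year: int, limit: int = 10) -> str:
    """Build a query for the coldest station-days in one GSOD year."""
    if not 1929 <= year <= 2016:
        raise ValueError("GSOD coverage described here is 1929 to 2016")
    # Assumed table layout: one table per year, with station id (stn),
    # month (mo), day (da), and mean temperature (temp) columns.
    return (
        "SELECT stn, mo, da, temp\n"
        f"FROM `bigquery-public-data.noaa_gsod.gsod{year}`\n"
        "ORDER BY temp ASC\n"
        f"LIMIT {int(limit)}"
    )


def fits_free_tier(bytes_processed: int, used_this_month: int = 0) -> bool:
    """Return True if the query stays within the free monthly allowance."""
    return used_this_month + bytes_processed <= FREE_TIER_BYTES


def estimate_bytes(sql: str, client: Optional[Any] = None) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    from google.cloud import bigquery  # lazy import; needs google-cloud-bigquery

    client = client or bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed


# Example call (requires network access and credentials):
#   sql = gsod_coldest_days_sql(2016)
#   print(fits_free_tier(estimate_bytes(sql)))
```

A dry run processes no data, so it is a cheap way to see how much of the free allowance a query would consume before running it.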
Other Public Datasets
Many other public datasets are available for you to query. Some are also hosted by Google, but many more are hosted by third parties. You can share any of your own datasets with the public by changing the dataset's sharing permissions. For more information about sharing datasets, see Access Control.
- Sample Tables
- Google Genomics Public Data
- Datasets publicly available on Google BigQuery (reddit.com)
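For the sharing option mentioned above, the following sketch shows one way a public read grant might be added to a dataset's access list. It assumes the access list uses the REST API dataset resource form, where each entry is a dictionary with a `role` plus one grantee field, and that the `allAuthenticatedUsers` special group is the intended audience. The helper name `with_public_read` is hypothetical; see Access Control for the supported options.

```python
from typing import Dict, List

# Assumed entry shape: REST-style access entry granting read access to
# any signed-in user.
PUBLIC_READ: Dict[str, str] = {
    "role": "READER",
    "specialGroup": "allAuthenticatedUsers",
}


def with_public_read(access: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the access list that includes a public read grant.

    The input list is left unchanged, and the grant is not duplicated
    if it is already present.
    """
    updated = list(access)
    if PUBLIC_READ not in updated:
        updated.append(dict(PUBLIC_READ))
    return updated


# Example: start from an owner-only access list.
current = [{"role": "OWNER", "userByEmail": "owner@example.com"}]
print(with_public_read(current))
```

The updated list would still need to be written back to the dataset, for example through the web UI or a dataset update call; that step is omitted here.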
How to list your public data set on BigQuery
If you have any questions about listing a public data set on this page, please contact us at firstname.lastname@example.org.