for users to access and analyze big data in the cloud."/>

Access and Analyze Data

Public Datasets on Google Cloud Platform make it easy to access and analyze data in the cloud. The datasets are hosted free of charge and can be accessed with a variety of data warehouse and analytics software, from open source Apache Spark to Google technologies such as Google BigQuery and Google Cloud Dataflow. From structured genomic or encyclopedic data to unstructured climate data, Public Datasets provide a playground for those new to big data and data analysis and a powerful repository for skilled researchers. You can also integrate these datasets into your application to add valuable insights for your users. Whatever your use case, these datasets are freely available on GCP.


Google BigQuery Public Datasets

BigQuery hosts a variety of public datasets that can be analyzed using familiar SQL. You can query this data directly in the BigQuery web UI or programmatically using the BigQuery REST API. These datasets are freely hosted and accessible to everyone. You can query up to 1 TB of this data per month for free, and you pay only for queries above this free quota, subject to query pricing.

Video: How to run a terabyte of Google BigQuery queries each month without a credit card
Querying BigQuery Public Datasets
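
As a rough illustration of programmatic access, the sketch below builds a SQL query and uses the `google-cloud-bigquery` Python client to estimate how many bytes it would scan (a dry run) before running it, which helps when tracking usage against the free quota. The table name `bigquery-public-data.usa_names.usa_1910_current` and its `name` and `number` columns are assumptions that do not come from this page, so check them in the BigQuery web UI first. The helper function names are hypothetical.

```python
# Assumed table and column names; confirm them in the BigQuery web UI.
USA_NAMES_TABLE = "bigquery-public-data.usa_names.usa_1910_current"


def build_top_names_query(limit=10):
    """Return SQL that ranks names by total number of applications."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT name, SUM(number) AS total\n"
        f"FROM `{USA_NAMES_TABLE}`\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


def estimate_bytes(sql):
    """Dry-run the query and return the bytes it would process.

    Requires `pip install google-cloud-bigquery` and configured credentials.
    """
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def run_query(sql):
    """Run the query and return (name, total) tuples."""
    from google.cloud import bigquery

    client = bigquery.Client()
    return [(row["name"], row["total"]) for row in client.query(sql).result()]
```

With credentials configured, calling `estimate_bytes(build_top_names_query(5))` reports the scan size without running the query, and `run_query(...)` then returns the rows.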

Google Genomics Public Datasets

Google collaborates with the genomics community to host select genomic data, like the 1000 Genomes Project, as a public resource. You can access these datasets through the Google Genomics API, the BigQuery web interface and open source examples.


Geo Imagery Datasets

Landsat and Sentinel satellite imagery datasets are available on Google Cloud Storage. You can use GCP to perform analysis and develop new products without needing to worry about the cost of storing the data or the time and cost required to download very large datasets.

In addition to these datasets hosted on Google Cloud Storage, a wide variety of standard Earth science raster datasets are also available in Earth Engine. Earth Engine provides a convenient web-based code editor designed to make developing complex geospatial workflows fast and easy.
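
To illustrate reading imagery in place rather than downloading the full archive, here is a minimal sketch using the `google-cloud-storage` Python client with an anonymous client. The bucket name `gcp-public-data-landsat` is an assumption that this page does not state, and the bucket may not allow anonymous listing, so verify both before use. The helper names are hypothetical.

```python
from urllib.parse import quote

# Assumed bucket name for the public Landsat data; verify before use.
LANDSAT_BUCKET = "gcp-public-data-landsat"


def public_object_url(bucket, object_name):
    """Return the HTTPS URL at which a publicly readable object is served."""
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


def list_object_names(bucket, prefix="", max_results=5):
    """List a few object names from a public bucket without credentials.

    Requires `pip install google-cloud-storage` and network access.
    """
    from google.cloud import storage  # imported lazily; optional dependency

    client = storage.Client.create_anonymous_client()
    blobs = client.list_blobs(bucket, prefix=prefix, max_results=max_results)
    return [blob.name for blob in blobs]
```

Calling `list_object_names(LANDSAT_BUCKET)` would return a handful of object names, each of which can be turned into a download URL with `public_object_url`.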


BigQuery Datasets

Bay Area Bike Share Trips
This data includes all Bay Area Bike Share trips from August 2013 to the present and is updated daily.
GDELT Book Corpus
A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
GitHub Data
This public dataset contains GitHub activity data for more than 2.8 million open source GitHub repositories, more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files.
IRS Form 990 Data
A dataset that contains financial information about nonprofit/exempt organizations in the United States, gathered by the Internal Revenue Service (IRS) using Form 990.
Stack Overflow Data
This public dataset contains an archive of Stack Overflow content, including posts, votes, tags, and badges.
San Francisco Street Trees Data
This data lists the street trees maintained by the San Francisco Department of Public Works, including planting date, species, and location.
San Francisco Police Reports Data
This data includes incidents from the San Francisco Police Department (SFPD) Crime Incident Reporting system, from January 2003 to the present.
San Francisco Fire Department Service Calls Data
This data includes fire unit responses to calls from April 2000 to the present and is updated daily. Data contains the call number, incident number, address, unit identifier, call type, and disposition.
San Francisco 311 Service Requests Data
This data includes all San Francisco 311 service requests from July 2008 to the present and is updated daily.
USA Names
A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
USA Disease Surveillance
A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
USA Bureau of Labor Statistics
This dataset includes economic statistics on inflation, prices, unemployment, and pay & benefits provided by the Bureau of Labor Statistics (BLS).
Hacker News
A dataset that contains all stories and comments from Hacker News since its launch in 2006.
Major League Baseball Data
This public data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
Medicare Data
This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries.
NOAA GSOD Weather Data
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9,000 stations.
NOAA GHCN
This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. This dataset draws from more than 20 sources, including some data from every year since 1763.
NYC TLC Trips
Data collected by the NYC Taxi and Limousine Commission (TLC) that includes records of all trips completed in yellow and green taxis in NYC from 2009 to the present.
NYC 311 Service Requests
This public data includes all 311 service requests from 2010 to the present and is updated daily. 311 is a non-emergency number that provides access to non-emergency municipal services.
NYC Citi Bike Trips
Data collected by the NYC Citi Bike bicycle sharing program that includes trip records for 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City since Citi Bike launched in September 2013.
NYC Tree Census
The NYC street tree data includes data from the 1995, 2005, and 2015 Street Tree Censuses, which are conducted by volunteers organized by the NYC Department of Parks and Recreation.
NYPD Motor Vehicle Collisions
This dataset includes details of motor vehicle collisions in New York City provided by the Police Department (NYPD) from 2012 to the present.
Open Images Data
A dataset consisting of approximately 9 million URLs to images that have been annotated with labels spanning over 6,000 categories.

Geo Imagery Datasets

Landsat
A satellite image dataset from the United States Geological Survey (USGS) that includes millions of multispectral images of the Earth's land surface, at resolutions between 15 and 60 meters per pixel, from 1982 through the present.
Earth Engine datasets
Earth Engine’s public data catalog includes a variety of standard Earth science raster datasets.
Sentinel-2
A satellite image dataset from the European Space Agency (ESA) that includes multispectral images of the Earth's land surface, with a resolution of 10–60 meters per pixel, from 2015 through the present.

Genomics Datasets

1000 Genomes
This dataset comprises roughly 2,500 genomes from 25 populations around the world.
Reference Genomes
Reference genomes such as GRCh37, GRCh37lite, GRCh38, hg19, hs37d5, and b37.
Illumina Platinum Genomes
This dataset comprises the 17-member CEPH pedigree 1463.
Personal Genome Project Data
This dataset comprises roughly 180 Complete Genomics genomes.
ICGC-TCGA DREAM Mutation Calling Challenge synthetic genomes
This dataset comprises the three public synthetic tumor/normal pairs created for the ICGC-TCGA DREAM Mutation Calling Challenge.
Simons Genome Diversity Project
This dataset comprises 25 genomes from 13 diverse populations, serving as the pilot project dataset for the Simons Genome Diversity Project.
TCGA Cancer Genomics Data in the Cloud
Open-access TCGA data including somatic mutation calls, clinical data, mRNA and miRNA expression, DNA methylation, and protein expression from 33 different tumor types.

Public Datasets Pricing

Google Cloud Public Datasets are freely accessible with a Google account. Charges may be incurred for large queries and certain use cases.

  • BigQuery - Public Datasets hosted in BigQuery come with up to 1 TB per month of free queries. Queries beyond 1 TB per month are subject to query pricing.
  • Google Cloud Storage - Public Datasets hosted in Google Cloud Storage, like raster and Genomics data, are free to access. You pay only for GCP resources used to analyze the data, such as compute resources or additional storage you use for your own applications.
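
The BigQuery free quota described above can be worked through as simple arithmetic. The sketch below is illustrative only. It assumes a decimal terabyte (10^12 bytes), whereas the billing definition may differ, and it takes the price per terabyte as a required argument because this page does not state a rate. The function names are hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte; check the billing definition


def billable_bytes(scanned_bytes, free_bytes=TB):
    """Bytes scanned in a month beyond the free quota (never negative)."""
    return max(0, scanned_bytes - free_bytes)


def monthly_query_cost(scanned_bytes, price_per_tb, free_bytes=TB):
    """Estimated monthly charge; price_per_tb is supplied by the caller."""
    return billable_bytes(scanned_bytes, free_bytes) / TB * price_per_tb
```

For example, scanning half a terabyte in a month leaves nothing billable, while scanning three terabytes leaves two terabytes subject to query pricing.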