Expanding our Public Datasets for geospatial and ML-based analytics

August 30, 2018
Shane Glass

Developer Advocate

The Google Cloud Public Datasets Program, launched in 2016, works with public data providers to store copies of high-value, high-demand public datasets in Google Cloud to make them more accessible and discoverable. We currently host more than 100 of these public datasets in BigQuery. They span a wide variety of domains to create the only ML-ready public dataset program. Additionally, we have made more than 3 petabytes (PB) of data, such as Landsat data from the United States Geological Survey (USGS), available in Cloud Storage so users can access and analyze huge volumes of meteorological satellite data faster than ever before.

Public Datasets provide an experimentation toolbox for those new to data analysis, and they offer a powerful repository for skilled researchers and organizations looking to join and augment their own datasets. Users can query up to 1 terabyte (TB) of data per month at no charge, and more than 95 petabytes (PB) of public data have been queried in BigQuery since the program launched.

We have worked hard over the past three years to continually listen to our users’ needs so we can host datasets that help improve their workflows and maximize their ability to realize the value of this data. We have also developed collaborative relationships with data providers, like the National Oceanic and Atmospheric Administration (NOAA), to ensure that these datasets are stored and described according to appropriate best practices as defined by subject-matter experts. After receiving positive feedback from users, we decided to expand the program.

At Google Cloud Next ’18, we announced an additional 5 PB of BigQuery storage for public datasets, and that announcement has already generated a lot of excitement. We believe this expansion will enable us to host more of the datasets our users would like to use. We are particularly focused on making datasets available that can support BigQuery’s new GIS capabilities like BigQuery Geo Viz. Since Next ’18, we have onboarded seven datasets that define boundaries in the United States by parameters such as zip code tabulation area (ZCTA) to support geospatial queries. These types of datasets support a broad range of use cases and help our users better understand the geospatial impact of their data. Most importantly, hosting these datasets significantly reduces the time and effort users need to conduct geospatial analysis with BigQuery Geo Viz.
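To make the boundary use case concrete, here is a minimal sketch of a geospatial join that labels each point in a private table with the ZCTA that contains it. It only builds the standard SQL text, so it runs without credentials. The boundary table and its `zip_code` and `zip_code_geom` columns are assumptions about how the ZCTA dataset is published, and the points table with `id`, `longitude`, and `latitude` columns is hypothetical; check the catalog for the actual names before running the query.

```python
# Assumed location of the ZCTA boundaries in the public datasets project;
# verify the table and column names in the catalog before relying on them.
ZCTA_TABLE = "bigquery-public-data.geo_us_boundaries.zip_codes"


def build_zcta_lookup_query(points_table: str, zcta_table: str = ZCTA_TABLE) -> str:
    """Return a standard SQL query that labels each point with its ZCTA.

    `points_table` is a hypothetical private table with `id`, `longitude`,
    and `latitude` columns.
    """
    return (
        "SELECT p.id, z.zip_code\n"
        f"FROM `{points_table}` AS p\n"
        f"JOIN `{zcta_table}` AS z\n"
        # ST_GEOGPOINT takes longitude first, then latitude.
        "  ON ST_CONTAINS(z.zip_code_geom, ST_GEOGPOINT(p.longitude, p.latitude))"
    )


if __name__ == "__main__":
    # Hypothetical private table name, used only for illustration.
    print(build_zcta_lookup_query("my-project.my_dataset.store_locations"))
```

The printed query can be pasted into the BigQuery console or BigQuery Geo Viz. Because the boundaries are already hosted, the only data a user has to bring is their own points table.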

We are also continuing to curate and host datasets in BigQuery so users can leverage BigQuery Machine Learning to analyze data with machine learning using standard SQL queries. We are continually working to minimize the effort required to use geospatial datasets and apply machine learning, so that our users can JOIN their private data and the world’s public data with as little time and effort as possible.
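As an illustration of what machine learning with standard SQL looks like, the sketch below assembles a BigQuery ML `CREATE MODEL` statement for a linear regression. It only builds the query text and does not contact BigQuery. The model name `my_dataset.natality_model` is hypothetical, and the choice of the `bigquery-public-data.samples.natality` table and its columns is an assumption made for the example, not something this post prescribes.

```python
from typing import Sequence


def build_linear_model_query(
    model_name: str,
    training_table: str,
    label: str,
    features: Sequence[str],
) -> str:
    """Return a BigQuery ML statement that trains a linear regression.

    The label column is selected alongside the feature columns, and rows
    with a missing label are filtered out before training.
    """
    columns = ",\n  ".join([label, *features])
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = 'linear_reg', input_label_cols = ['{label}']) AS\n"
        f"SELECT\n  {columns}\n"
        f"FROM `{training_table}`\n"
        f"WHERE {label} IS NOT NULL"
    )


if __name__ == "__main__":
    # Hypothetical model name; the table and columns are assumed for illustration.
    print(
        build_linear_model_query(
            model_name="my_dataset.natality_model",
            training_table="bigquery-public-data.samples.natality",
            label="weight_pounds",
            features=["mother_age", "gestation_weeks"],
        )
    )
```

The training data stays in BigQuery and the model is defined in the same SQL dialect used for ordinary queries, so no separate export step is needed.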


To do all this at a bigger scale, we needed more help from the public data providers with whom we are working. And while Google Cloud has recently made significant investments in expanding the Public Datasets Program, we were continually asked one question when we talked to our collaborators: how can we provide our users with a consistent data source or repository?


That is why, in addition to announcing the additional storage volume, we made a second announcement at Google Cloud Next ’18: this additional storage will be available for the next five years. We believe this is truly an unparalleled commitment to public data. It excites us, our users, and our collaborators because it assures everyone that the effort they put into our public datasets will be matched by our commitment going forward. This means that data users can spend their time on analysis instead of searching for data.


We are excited to continue our work going forward to eliminate as many barriers as possible to using high-value, high-demand public datasets. If you have a dataset you would like to make publicly available through the Cloud Public Datasets Program, we would love to hear from you! Reach out to us via email. If you want to get started working with public datasets, check out our catalog of available datasets on Google Cloud Platform Marketplace. If you need help getting started, our very own Lak Lakshmanan will present on how to use these Public Datasets with BigQuery GIS and BigQuery ML at our online webinar, Cloud OnAir, on September 18. Find out more and register here.