Most popular public datasets to enrich your BigQuery analyses
Michael Hamamoto Tribble
Head of Datasets for Google Cloud
From rice genomes to historical hurricane data, Google Cloud Public Datasets offer a world of exploration and insight. With more than 20 PB of data across 200+ datasets, our Public Dataset Program helps you explore big data and data analytics without a lot of cost, setup, or overhead. You can query up to 1 TB per month at no cost, and you don’t even need a billing account to start using the BigQuery sandbox. Joining public datasets with your own data gets you insights right away, such as adding location data for better transportation management or incorporating NOAA’s climate data into forecasting models. Retailers can use census demographics for market analysis, and their analysts can map users with census block, zip code, and county boundary geometries.
These datasets can help you start exploring and layering data points, and they also make data analytics a lot easier for enterprise customers. As utility datasets, they give you a set of valid, clean data to build on, rather than having to start from scratch.
You can access Google Cloud’s public datasets through BigQuery, using either legacy or standard SQL queries, and through Cloud Storage. Researchers can also use BigQuery ML to train advanced machine learning models with this data right inside BigQuery at no additional cost. BigQuery GIS provides convenient, built-in capabilities to ingest, process, and analyze geospatial data when you want a location component in your data analysis.
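As a rough illustration of querying a public dataset with standard SQL, the Python sketch below estimates how much data a query would scan before running it, then checks that figure against the 1 TB monthly allowance. It assumes the `google-cloud-bigquery` client library is installed and credentials are configured. The table and column names (from the COVID-19 Open Data dataset) and the reading of 1 TB as 10^12 bytes are assumptions to verify, not details from this post.

```python
# Assumption: the 1 TB monthly allowance is treated as 10**12 bytes here;
# check the current BigQuery pricing page for the exact definition.
FREE_TIER_BYTES = 10**12

# Assumed table and column names; verify against the dataset's current schema.
SAMPLE_SQL = """
SELECT date, new_confirmed
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_code = 'US'
  AND aggregation_level = 0
  AND date BETWEEN '2020-03-01' AND '2020-03-31'
ORDER BY date
"""


def fits_free_tier(query_bytes, used_bytes_this_month=0):
    """Return True if scanning query_bytes stays within the monthly allowance."""
    return used_bytes_this_month + query_bytes <= FREE_TIER_BYTES


def estimate_query_bytes(sql):
    """Dry-run the query so BigQuery reports bytes scanned without executing it."""
    from google.cloud import bigquery  # imported lazily; needs credentials to run

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

With credentials in place, `fits_free_tier(estimate_query_bytes(SAMPLE_SQL))` gives a quick check before you run a query for real.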
Here, we’ll explore some common datasets and how they’re used.
Expanding access to data for healthcare and research. This year, COVID-19 public datasets have been incredibly important to researchers looking to understand and combat the virus. As the pandemic began in March, we announced an initial set of free public datasets to help researchers, data scientists, and analysts combat the coronavirus. These include the COVID-19 Open Data dataset, the Global Health Data from the World Bank, and OpenStreetMap data. The COVID-19 datasets are free to access and query through September 15, 2021. Looker customers can also install the COVID-19 block, which includes the Community Mobility Data Block, from the Marketplace, where they can accelerate their analyses of the public datasets using curated explore environments and purposeful dashboards. The dashboards and explore environments are open to anyone. The Looker Demographic data block contains demographic information from the American Community Survey.
Building the right tools to bring COVID-19 data to all. Google Cloud and partner SADA also collaborated earlier this year on building the National Response Portal, an open data platform that combines multiple datasets for an on-the-ground view of the pandemic. The Oklahoma State Department of Health and governor’s office used COVID-19 public datasets and Looker data blocks to build a dashboard on the state website to monitor cases and update residents.
Layering weather, climate, and GIS datasets for a better understanding of nature. Weather and climate are popular datasets to explore. Within BigQuery, you can explore climate simulation data from a collaboration with the Lamont-Doherty Earth Observatory of Columbia University and the Pangeo Project. In addition, the World Climate Research Programme released the Coupled Model Intercomparison Project Phase 6 (CMIP6) data archive. This dataset will be continuously updated and may eventually contain 20 PB of data. Other climate-related datasets include those from NOAA on lightning and hurricanes, and Looker’s Weather data block that contains daily weather reporting in the United States at the zip code level from 1920 until now.
You can see how GlideFinder built a platform that ingests satellite data to monitor wildfires, using data characteristics like temperature. You can also use a Colab notebook to analyze data on daily temperature readings from around the world. In Looker, users can use the weather block to analyze weather data and join it back onto their own data sources to get a complete picture of how climate may be affecting their business.
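To make the location component concrete, here is a minimal Python sketch that builds a BigQuery GIS query for hurricane track points within a given distance of a location. The table name and its columns (`name`, `season`, `iso_time`, `latitude`, `longitude`) are assumptions about the NOAA hurricanes public dataset and should be checked against the current schema. The helper names are hypothetical, and running the query requires the `google-cloud-bigquery` library and credentials.

```python
def hurricane_points_near_sql(radius_km):
    """Build a standard SQL query for storm track points within radius_km.

    The center point is supplied at run time through the @lon and @lat
    query parameters.
    """
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")
    radius_m = int(radius_km * 1000)  # ST_DWITHIN takes its distance in meters
    # Assumed table and column names; verify before relying on them.
    return f"""
SELECT name, season, iso_time, latitude, longitude
FROM `bigquery-public-data.noaa_hurricanes.hurricanes`
WHERE ST_DWITHIN(
  ST_GEOGPOINT(longitude, latitude),
  ST_GEOGPOINT(@lon, @lat),
  {radius_m})
ORDER BY iso_time
"""


def run_points_near(lon, lat, radius_km):
    """Execute the query with the center point bound as query parameters."""
    from google.cloud import bigquery  # imported lazily; needs credentials to run

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("lon", "FLOAT64", lon),
        bigquery.ScalarQueryParameter("lat", "FLOAT64", lat),
    ])
    sql = hurricane_points_near_sql(radius_km)
    return list(client.query(sql, job_config=config).result())
```

Note that `ST_GEOGPOINT` takes longitude first, then latitude, which is easy to get backwards.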
Using genomics data to improve food security. Our rice genome dataset derives from the Rice 3K dataset, which analyzes genetic variation, population structure, and diversity among more than 3,000 diverse Asian cultivated rice genomes. Our researchers then used DeepVariant to re-analyze that dataset with the goal of improving food security by speeding up genetic enhancement to increase rice crop yield.
Get to know cryptocurrencies using blockchain datasets. Our Public Datasets Program includes a set of cryptocurrency blockchain datasets, so you can start to better understand this modern concept. The datasets consist of the blockchain transaction history of Bitcoin and Ethereum, plus others, and you’ll also find a set of queries and views to enable multi-chain meta analysis and integration with conventional financial record processing systems.
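As a small starting point for exploring these datasets, the Python sketch below builds the same daily transaction count query for either Bitcoin or Ethereum. The table names and the `block_timestamp` column are assumptions about the public crypto datasets, and the helper names are hypothetical. Running the query needs the `google-cloud-bigquery` library and credentials.

```python
# Assumed table names for the two chains named above; other chains are omitted.
CHAIN_TABLES = {
    "bitcoin": "bigquery-public-data.crypto_bitcoin.transactions",
    "ethereum": "bigquery-public-data.crypto_ethereum.transactions",
}


def daily_tx_count_sql(chain):
    """Build a query counting transactions per day for one chain."""
    if chain not in CHAIN_TABLES:
        raise ValueError(f"unsupported chain: {chain}")
    return f"""
SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
FROM `{CHAIN_TABLES[chain]}`
WHERE DATE(block_timestamp) BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day
"""


def run_daily_tx_count(chain, start_date, end_date):
    """Execute the query; both dates are datetime.date values, inclusive."""
    from google.cloud import bigquery  # imported lazily; needs credentials to run

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
        bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
    ])
    rows = client.query(daily_tx_count_sql(chain), job_config=config).result()
    return [(row.day, row.tx_count) for row in rows]
```

Because both chains share one query shape here, comparing them is a matter of calling `run_daily_tx_count` once per chain with the same date range. These tables are large, so a narrow date range helps keep the amount of data scanned down.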
Putting public datasets to use
We’re always interested to hear how analysts and researchers use public datasets to further understanding of so many different causes and topics. 2020 has brought fascinating, hopeful stories of how data, including our COVID-specific datasets and other public health datasets, has helped fight COVID-19. Google Cloud has been able to help with COVID-19 academic research by offering high-performance compute and other technology resources along with public datasets.
One important note is that the contents of these datasets are provided to the public strictly for educational and research purposes. We are not onboarding or managing PHI or PII data as part of our COVID-19 public datasets. Google has practices and policies in place to ensure that data is handled in accordance with widely recognized patient privacy and data security policies.
Get started with geospatial data exploration in this beginner’s guide to BigQuery GIS.
See how a cross-industry team of AI practitioners ramped up data use to fight COVID.
Check out the latest Kaggle competitions to test your skills.