COVID-19 public datasets: our continued commitment to open, accessible data

August 13, 2020

Michael Hamamoto Tribble

Head of Datasets for Google Cloud

Donny Cheung

Tech Lead/Engineering Manager, Healthcare & Life Sciences AI, Google Cloud

Back in March, we announced that new COVID-19 public datasets would be joining our Google Cloud Public Datasets program to increase access to critical data in support of the global response to the novel coronavirus. While the program initially focused on COVID-19 case data, we’ve since expanded our dataset offering to provide additional value to members of the research community and public decision makers. In addition, we’re extending our initial offering of free querying of COVID-19 public datasets for an additional year, through September 15, 2021.

These expanded datasets would not have been possible without numerous partnerships with data providers working closely with Google Cloud to onboard their data to BigQuery. By onboarding public data to BigQuery, these data providers remove barriers and increase the speed with which users can access and query these large data files. With the COVID-19 public datasets hosted in BigQuery, the data can be found and queried in one place.

As we continue supporting our users, we want to help ensure that a lack of resources does not limit anyone’s ability to make sense of this data. That’s why we’re expanding dataset access, and we hope that this will expand the pool of contributors who are finding solutions to this pandemic, whether that’s students and faculty querying these datasets through distance learning in the fall or public decision makers gauging when their communities can safely reopen. We hope that these datasets continue to provide universally accessible and useful information in the fight against COVID-19.

How Google has worked with organizations to make COVID-19 datasets available

Since the beginning of the pandemic, The New York Times has tracked and visualized cases across the United States. They have publicly shared aggregated case data at the county and state level, allowing researchers to track, model, and visualize the spread of the virus. These rich datasets provide U.S. national-level, state-level, and county-level cases and deaths, beginning with the first reported coronavirus case in Washington State on January 21, 2020. As deaths began to increase across the United States and abroad, The New York Times published the data behind their excess deaths tracker to provide researchers and the public with a better record of the true toll of the pandemic globally. We worked with them to make this data accessible on BigQuery. The New York Times also estimated the prevalence of mask-wearing in counties in the United States and made that data available to provide researchers a way to better understand the role of mask-wearing in the course of the pandemic.
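As a rough illustration of what querying this county-level data in BigQuery can look like, the Python sketch below builds a query that rolls county figures up to state totals for a single day. The table name `bigquery-public-data.covid19_nyt.us_counties` and the column names (`date`, `state_name`, `confirmed_cases`, `deaths`) are assumptions that are not stated in this post, so check them against the dataset listing before use. Running the query also assumes the `google-cloud-bigquery` package and valid Google Cloud credentials.

```python
from datetime import date

# Assumed table name (not given in this post); verify it in the
# BigQuery public datasets listing before relying on it.
NYT_COUNTIES_TABLE = "bigquery-public-data.covid19_nyt.us_counties"


def build_state_totals_query(table: str, as_of: date) -> str:
    """Return SQL that sums county-level cumulative counts per state for one day."""
    # Column names below are assumptions about the table schema.
    return (
        "SELECT state_name, "
        "SUM(confirmed_cases) AS confirmed_cases, "
        "SUM(deaths) AS deaths\n"
        f"FROM `{table}`\n"
        f"WHERE date = DATE '{as_of.isoformat()}'\n"
        "GROUP BY state_name\n"
        "ORDER BY confirmed_cases DESC"
    )


def run_query(sql: str) -> list:
    """Run the SQL with the BigQuery client and return rows as dictionaries."""
    # Imported lazily so the query builder can be used without the package.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Print the generated SQL; call run_query(...) to execute it.
    print(build_state_totals_query(NYT_COUNTIES_TABLE, date(2020, 8, 1)))
```

Keeping the query builder separate from the client call means the SQL can be inspected, or pasted into the BigQuery console, without any local setup.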

To complement this and many other efforts to better understand the impact of policy actions, Google also released the COVID-19 Community Mobility Reports, which provide data on community movement trends, and made the data available on BigQuery. We also recently announced our COVID-19 Public Forecasts to help first responders and other impacted organizations in healthcare and the public sector project metrics, such as case counts and deaths, into the future. This data is also available on BigQuery.

Next, we prioritized data that could help in understanding the varying effects of COVID-19 in our communities and healthcare systems by publishing datasets relating to social determinants of health. To expand the scope of COVID-19 related queries that qualify for free querying, we included existing datasets like the American Community Survey from the U.S. Census Bureau and OpenStreetMap. We also worked with organizations like BroadStreet to make datasets like the U.S. Area Deprivation Index available on BigQuery. This dataset provides a measure of community vulnerability to public health issues at a highly granular level. Finally, we are publishing aggregated hospital capacity data from the American Hospital Association to help decision makers better understand their community’s ability to handle a surge in hospitalizations.

We also recognize that the scientific community’s response to COVID-19 often depends on the availability and accessibility of high-quality scientific data. We’ve worked to include the Immune Epitope Database (Vita et al., Nucleic Acids Research, 2018) on BigQuery as a resource to researchers investigating the immune response to the SARS-CoV-2 virus. We have also published a series of articles to show how researchers can explore and build predictive models from this dataset using Google Cloud AI Platform. As an additional resource to the scientific community, we’ve created the COVID-19 Open Data dataset, which combines numerous publicly available COVID-19 and related datasets at a fine geographic level and makes them available both in BigQuery and in CSV and JSON formats. The code used to create this dataset is open-source and available on GitHub.

As we expand the list of COVID-19 public datasets, we will continue to release new datasets aligned with these four established focus areas:

Epidemiology and health response, such as case and testing statistics and hospital data
Government policy response and effects, such as mobility and mask compliance
Social determinants of health and community response
Biomedical and other research data

For those attending Google Cloud Next ’20: OnAir, be sure to check out Data vs. COVID-19: How Public Data is Helping Flatten the Curve. This session will highlight how public data and the Google Cloud COVID-19 public datasets are helping combat the pandemic and helping everyone make informed decisions about the spread and risks of the virus, and will explain how cooperative efforts could help flatten the curve.
