Google Cloud

Public datasets: how nonprofits can drive social impact with planetary-scale data

Editor’s Note: We are thrilled to announce our sponsorship of the Data for Development Festival, which is taking place this week in Bristol. There, we are hosting a session for nonprofits to come and learn more about how they can access and implement GCP and Kaggle’s Public Dataset programs in order to help drive social impact.

Public datasets on Google Cloud Platform democratize access to planetary-scale data. These datasets are hosted free of charge and are accessible through Google BigQuery and Cloud Storage. Every authenticated user gets up to one terabyte of free BigQuery queries per month, and data providers are not charged for egress from Cloud Storage.
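To make the one-terabyte allowance concrete, here is a minimal Python sketch that estimates how much of the monthly free tier a single query would use. The helper names are hypothetical, and reading "one terabyte" as 10^12 bytes is an assumption, so check current BigQuery pricing before relying on the number. The optional dry-run helper assumes the google-cloud-bigquery client library is installed and that default credentials are configured.

```python
# Assumption: "one terabyte" is read as 10**12 bytes; billing may define it differently.
FREE_TIER_BYTES = 10**12


def free_tier_fraction(bytes_processed, free_tier_bytes=FREE_TIER_BYTES):
    """Return the share of the monthly free tier that a query would consume."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / free_tier_bytes


def estimate_query_bytes(sql):
    """Dry-run a query to learn how many bytes it would scan, without running it.

    Assumes google-cloud-bigquery is installed and default credentials exist.
    """
    # Imported lazily so free_tier_fraction works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    # A hypothetical query scanning 25 GB uses 2.5% of the assumed allowance.
    print(f"{free_tier_fraction(25 * 10**9):.1%}")
```

A dry run does not execute the query, so a nonprofit could use it to check the size of a query against the free tier before running it.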

While public datasets are available to enterprise and nonprofit users alike, nonprofit organizations can particularly benefit from transparency at scale. GCP currently hosts climate, geographical, air quality, and public health datasets, among others, which can be combined, compared, or contrasted for joint analysis. For example, a nonprofit organization that wants to analyze historical climate data (e.g., temperature) in relation to rising water levels might draw upon the historical records captured by the NOAA GSOD and NASA Landsat datasets. Similarly, it could query the EPA and OpenAQ datasets to assess air quality and its relation to carbon emissions. These are just two examples of what nonprofits can accomplish by, in the words of Isaac Newton, “standing upon the shoulders of giants.”


Information about the above EPA dataset is available here, and its Kernel is available here.

Public datasets on Kaggle

In March 2017, Google Cloud acquired Kaggle, a platform that engages over 1.5 million data scientists worldwide, in order to further data accessibility and implementation. Kaggle not only hosts crowd-sourced competitions for collaborative problem solving, but also hosts data through its Kaggle Datasets platform. Each of the thousands of datasets shared on Kaggle is available through Kernels, a data science and machine learning workbench that comes pre-loaded with Python and R, popular libraries, and the ability to fork and comment on other Kagglers’ analyses, all at no charge.

Public data combined with reproducible code samples allows nonprofits to start their investigation from a common basis, but then repurpose community-vetted data science for their own particular needs. Kaggle Datasets aspires to become the go-to community for vibrant discussion and collaboration. Already, users from around the world come to Kaggle to share analyses and learn from each other. These kinds of interactions are especially important in fields like medicine, where new methods frequently emerge and advancements like early detection have the potential to save millions of lives. The CT Medical Image dataset published by the Cancer Imaging Archive is just one example of how easy it is to share and learn from peers when the data and code examples coexist in the same space.

For Google Cloud and Kaggle, realizing the full potential of public datasets means making the world’s data accessible and useful to people everywhere. Many of GCP’s Public Datasets, including the aforementioned EPA and OpenAQ datasets, are now accessible through Kaggle’s recent integration with BigQuery. These datasets have been viewed by tens of thousands of Kaggle data scientists, who have analyzed the data in over 1,000 Kernels. Kagglers have also used these datasets to create introductory guides to understanding air pollution and mapping its impact around the world.
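As an illustration of what such an analysis might start from, the Python sketch below builds a parameterized BigQuery query that ranks countries by their average reading for one pollutant in the OpenAQ data. The table ID and the column names (country, pollutant, value, unit) are assumptions about the public dataset's schema, and the function names are hypothetical. Running the query assumes the google-cloud-bigquery client library and valid credentials. This is not necessarily how the Kaggle integration itself is accessed.

```python
# Assumed table ID and column names for the OpenAQ public dataset.
OPENAQ_TABLE = "bigquery-public-data.openaq.global_air_quality"


def build_pollutant_query(pollutant, limit=10):
    """Build SQL and parameters for average readings of one pollutant per country."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Group by unit as well, since readings may be reported in different units.
    sql = f"""
        SELECT country, unit, AVG(value) AS avg_value, COUNT(*) AS readings
        FROM `{OPENAQ_TABLE}`
        WHERE pollutant = @pollutant
        GROUP BY country, unit
        ORDER BY avg_value DESC
        LIMIT {limit}
    """
    return sql, {"pollutant": pollutant}


def run_query(sql, params):
    """Run the query and return rows as dictionaries (requires credentials)."""
    # Imported lazily so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, "STRING", value)
            for name, value in params.items()
        ]
    )
    rows = client.query(sql, job_config=config).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    sql, params = build_pollutant_query("pm25", limit=5)
    print(sql)
```

Passing the pollutant as a query parameter rather than interpolating it into the SQL string keeps user-supplied values out of the query text.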

Data science for good

Google Cloud and Kaggle are committed to creating many more opportunities for data scientists everywhere to solve problems vital to their communities and to the world as a whole, and addressing the needs of nonprofit organizations is an essential part of this strategy.

If you’re a data scientist who wants to access public datasets and/or volunteer your time and expertise, join the Kaggle community and participate in an upcoming Data Science for Good event.