New climate model data now in Google Public Datasets

December 9, 2019

Shane Glass

Developer Advocate, Google Cloud

Public datasets are an important resource for modern data analytics, and the data they gather can help us understand our world. At Google Cloud, we maintain a collection of public datasets, and we’re pleased to collaborate with the Lamont-Doherty Earth Observatory (LDEO) of Columbia University and the Pangeo Project to host the latest climate simulation data in the cloud.

The World Climate Research Programme (WCRP) recently began releasing the Coupled Model Intercomparison Project Phase 6 (CMIP6) data archive, aggregating the climate models created across approximately 30 working groups and 1,000 researchers investigating the urgent environmental problem of climate change. The CMIP6 climate model datasets include rich details on many aspects of the climate system, including historical and future simulations. The data are now accessible in Cloud Storage and will be in BigQuery soon. Along with making CMIP6 available on Google Cloud, the Pangeo Project develops software and infrastructure to make it easier to analyze and visualize climate data using cloud computing.

On Google Cloud, this dataset will be continuously updated and available to researchers around the globe to use for their own projects, without the constraints of downloading terabytes or even petabytes of data. The entire archive may eventually contain 20 PB of data, of which about 100 TB are currently available in the cloud. You can request data from Pangeo’s CMIP6 Google Cloud Collection using this form.

“It’s a very live data set. It's going to be updated over the next year as the data come online and as people’s needs arise,” says Ryan Abernathey, associate professor of Earth and environmental sciences at Columbia University and LDEO. He emphasizes the practical impact of this project. “What people actually care about most is not the global mean temperature because no one lives in the ‘global mean world.’ People care about the local impacts of drought or extreme rainfall, which can cause severe hardship for society. With these high-resolution simulations of rare events, we get much better information for planning in response to expected changes in the climate.”

What you’ll find in the CMIP6 data

The models in CMIP6’s data range from high-resolution simulations based on historical data from 1850 onward to hypothetical scenarios that manipulate key variables. For example, Abernathey asks, “What if carbon dioxide (CO2) were to instantaneously quadruple its concentration overnight? That's a very useful experiment, not because it helps us make a detailed projection about the future, but because it helps us probe our physical understanding of how the climate system responds to CO2.” Each of the CMIP6 models includes dozens of variables, ensemble members, and scenarios, leading to large, unwieldy datasets. But Pangeo, an ensemble of open-source Python tools for big data analysis, makes it easier to perform large-scale computations on CMIP6 and other similar large datasets.

To help researchers work with the multidimensional datasets of climate research, Abernathey and his colleagues at LDEO and the National Center for Atmospheric Research (NCAR) drew on funding from the National Science Foundation (NSF) and computing support from Google Cloud to develop Pangeo, which is an open-source platform aimed at accelerating geoscience data analysis. Pangeo can be run on nearly any high-performance computing system, including Google Kubernetes Engine (GKE), which supports easy deployment with autoscaling (both up and down) and integration with other Google Cloud tools such as Cloud Storage and BigQuery. The Pangeo community shares expertise, such as use cases for different domain-specific applications, and contributes to the development of open-source tools, like a cloud-optimized data storage format called Zarr.
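To give a sense of why a chunked, cloud-optimized format helps, here is a minimal sketch in plain Python of the bookkeeping involved. It is not the Zarr library's implementation. It assumes an array split into fixed-size chunks stored as separate objects whose keys follow the dot-separated convention used by Zarr version 2 (for example `10.0.0`). The array shape, chunk shape, and the `chunks_for_selection` helper are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def chunks_for_selection(selection, chunk_shape):
    """Return dot-separated chunk keys that overlap a selection.

    selection: list of (start, stop) index ranges, one per dimension
               (stop is exclusive, as in Python slices).
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        per_dim.append(range(first, last + 1))
    return [".".join(str(i) for i in idx) for idx in product(*per_dim)]


# Hypothetical monthly field with dimensions (time, lat, lon),
# stored in chunks of 120 time steps covering the whole globe.
shape = (1980, 180, 360)
chunk_shape = (120, 180, 360)

# Select ten years of months (time indices 1200 to 1319) for the full grid.
keys = chunks_for_selection([(1200, 1320), (0, 180), (0, 360)], chunk_shape)

# Number of chunks in the whole array (ceiling division per dimension).
total_chunks = math.prod(-(-n // c) for n, c in zip(shape, chunk_shape))

print(f"objects to fetch: {keys} out of {total_chunks} chunks")
```

In this made-up layout, the selection touches a single object rather than the whole array, which is the property that lets many workers read different parts of a dataset from object storage in parallel.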

"The CMIP project has grown since its early days, and now is seeing tremendous growth beyond the U.S. and E.U. into the developing world,” says V. Balaji, a computational climate scientist on leave from Princeton University. Currently at the Institut Pierre-Simon Laplace in Paris, Balaji has been involved with all aspects of CMIP, from defining the experiments and running the simulations to analyzing the output and designing the Earth System Grid Federation (ESGF), a network of services that underpin the global data infrastructure enabling this critical research enterprise. “For new entrants, and for academic researchers worldwide, Pangeo in the cloud represents an exciting new opportunity to broaden the user base of very large-scale climate data, without the need to acquire supercomputer-scale storage and analysis facilities,” says Balaji. “It bridges what I call the gap between 'inspiration-driven' and 'industrial strength' science, enabling a scientist to explore the data and design their own analysis, and immediately apply their findings at very large scale. The progress of Pangeo in the cloud will inform our own architectural choices in designing the future of the global climate data infrastructure."

The Pangeo team at LDEO and NCAR recently hosted a hackathon to jumpstart the analysis of the CMIP6 data on Google Cloud for pressing scientific questions. One participant—Henri Drake, a Ph.D. candidate in MIT’s Program in Atmospheres, Oceans, and Climate—created a tutorial for analyzing simulations of global warming in state-of-the-art CMIP6 models, under the worst-case scenario of uncontrolled greenhouse gas emissions. These CMIP6 model projections “reflect millions of lines of model code and represent everything from forest transpiration in the Amazon rainforest and thunderstorms in the U.S. Midwest to the formation of meltwater ponds on Arctic sea ice,” says Drake. “We would need a huge supercomputer to run the simulations from the model source code ourselves. Thankfully, the climate modeling community does this for us by making their output publicly available.”

Drake used these tutorials as a teaching assistant for the Climate Change course at MIT to demonstrate the ease of cloud computing for data-intensive climate science research, and also the value of open-source tools like the Pangeo software stack on Google Cloud. “The CMIP6 dataset was already technically publicly available, it just was not very accessible,” says Drake. “The cloud-based data and computation, when combined with the Pangeo software stack, enabled me to make calculations in just a few hours that could have taken weeks using more conventional methods. Using the Pangeo binder, it was easy to make these calculations available to the rest of the world.”
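As a rough illustration of the kind of calculation involved in summarizing warming across a model grid, the sketch below computes an area-weighted global mean in plain Python. It is not taken from Drake's tutorial, and the grid and temperature anomalies are synthetic values invented for the example, not CMIP6 output. It assumes a regular latitude-longitude grid, where each row is weighted by the cosine of its latitude because grid cells shrink toward the poles.

```python
import math


def area_weighted_mean(field, lats_deg):
    """Mean of a (lat x lon) grid, weighting each row by cos(latitude)."""
    total = 0.0
    weight_sum = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        total += w * sum(row) / len(row)   # weighted mean of this latitude band
        weight_sum += w
    return total / weight_sum


# Synthetic anomaly field (degrees C): 3.0 poleward of 60 degrees, 1.0 elsewhere.
lats = [-75.0, -45.0, -15.0, 15.0, 45.0, 75.0]
n_lon = 8
anomaly = [[3.0 if abs(lat) > 60 else 1.0] * n_lon for lat in lats]

simple_mean = sum(sum(row) for row in anomaly) / (len(lats) * n_lon)
weighted_mean = area_weighted_mean(anomaly, lats)

print(f"unweighted mean: {simple_mean:.2f}, area-weighted mean: {weighted_mean:.2f}")
```

The unweighted average overstates the contribution of the small polar cells, so the area-weighted value comes out lower in this example. It also shows how a single global number can hide large regional differences, which echoes Abernathey's point about local impacts.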

The CMIP6 data join many other weather and climate-related datasets available through Google’s Public Dataset program at no charge. By making data more accessible and usable with BigQuery and Cloud Storage, we support academic research by accelerating discoveries and promoting innovative solutions to complex problems. For Abernathey, the benefits of cloud computing are a particularly good match for the needs of scientific research: “With Google Cloud, you've essentially got a supercomputer just sitting right there, so you can directly process the data at a very high speed.”
