ISB-Cancer Gateway in the Cloud: Sharing terabytes of cancer data with the power of BigQuery
About The Institute for Systems Biology-Cancer Gateway in the Cloud
The Institute for Systems Biology-Cancer Gateway in the Cloud (ISB-CGC) is part of the National Cancer Institute’s (NCI) cloud-based data science infrastructure, the Cancer Research Data Commons (CRDC). A partnership between ISB and General Dynamics Information Technology (GDIT), ISB-CGC uses Google Cloud to provide public cancer data and compute resources to researchers.
Open, cloud-based data and analytics enable the National Cancer Institute’s Institute for Systems Biology-Cancer Gateway in the Cloud to securely store and safely share up-to-the-minute research.
Google Cloud results
- Hosts terabytes of cancer genomic and proteomic data
- Provides analytical tools including user-defined functions to help speed analysis as well as Python and R notebooks
- Enables complex computations to be completed affordably and fast—in minutes or hours instead of days
Open data, compute, and analytics resources for the global cancer research community
Breast cancer is the world’s most prevalent cancer. According to a 2021 overview from the World Health Organization, in 2020 alone, over two million women were diagnosed with breast cancer worldwide. With such a large number of women impacted, each with complex biological features and personal paths through the disease, breast cancer research is particularly challenging—and data intensive.
Medical researchers who analyze genomic data for breast cancer, or for any cancer, must sift through very large volumes of data, and they grapple with a broad range of challenges in doing so. Foremost, they must determine how to analyze data at a global scale while keeping it secure and compliant with a range of national and international standards. To compound the difficulty, data is often siloed within different organizations, managed on incompatible platforms, or trapped in local storage on the machines of individual researchers. And the costs to manage and maintain on-premises infrastructure can be daunting.
“The novelty of it, and the ease with which researchers can quickly integrate BigQuery with familiar tools, has made it a seamless addition to their everyday workflow.”
Dr. Kawther Abdilleh, Lead Bioinformatics Scientist, GDIT
Even with significant resources, any one of these issues can become an insurmountable barrier for research organizations. It is precisely these barriers that the ISB-Cancer Gateway in the Cloud platform, and its team of engineers and bioinformaticians, including Dr. Boris Aguilar and Dr. Kawther Abdilleh, are breaking down with cloud technology.
The National Cancer Institute (NCI) is the United States federal government’s principal agency for cancer research and training. As cloud technology emerged, its forward-thinking researchers understood the benefits and established the Cancer Research Data Commons (CRDC), which includes ISB-CGC. Google Cloud became a partner to ISB-CGC in 2014 to support its mission and deliver the cloud capabilities necessary to carry out cancer research at what Aguilar calls an “unprecedented scale.” Using Google Cloud technology, ISB-CGC has developed a large-scale cloud-based platform that connects researchers to a wide collection of cancer datasets, along with the analytical and computational infrastructure to analyze that data at scale.
“We’re trying to spread the message of the cost-effectiveness of the cloud. And I think we’ve illustrated, especially with BigQuery, that researchers can analyze a lot of data and it’s not as expensive as they imagined.”
Dr. Kawther Abdilleh, Lead Bioinformatics Scientist, GDIT
Make sharing terabyte-scale cancer data fast, secure, and affordable
The ability to affordably and securely share data, and the seamless cross-platform integration of tools like BigQuery, have enabled the ISB-CGC team to fundamentally change how cancer investigators conduct research.
The greater goal of ISB-CGC is to make an ever-larger pool of data more widely accessible and useful to cancer researchers around the world. Abdilleh reports that her team has “heard back from researchers that the ability to rapidly explore and analyze large collections of data, using some simple SQL queries, has allowed them to very quickly gain meaningful biological discoveries and insights.” Working with large-scale databases is often something new to biologists, but thanks to intuitive training resources, Abdilleh notes, “the novelty of it, and the ease with which researchers can quickly integrate BigQuery with familiar tools, has made it a seamless addition to their everyday workflow.”
Adoption is growing quickly. According to Abdilleh, “researchers are more eager to work in the cloud because they realize that the amount of data is just impossible to work with locally.” And it’s not only due to its ease of use—they’re pleasantly surprised that the big advantages of the cloud don’t come with a big price tag. Abdilleh continues, “We're trying to spread the message of the cost-effectiveness of the cloud. I think we’ve illustrated, especially with BigQuery, that they can analyze a lot of data, and it’s not as expensive as they imagined.” To make it even easier to get started, researchers trying out the ISB-CGC platform can request free cloud credits.
Access to that speed and scale has made all the difference. As Aguilar says of the data in their September 2020 paper, “Multi-omics data integration in the Cloud: Analysis of Statistically Significant Associations Between Clinical and Molecular Features in Breast Cancer,” “BigQuery can really accelerate and facilitate large-scale statistical analysis commonly used in cancer research.” The analyses in that publication were part of a proof-of-concept study that demonstrated how cloud-based data analysis can be used to identify novel biological associations between clinical and molecular features of breast cancer. These kinds of analyses can be conducted on all types of cancers using the rich datasets provided in BigQuery tables by ISB-CGC.
In preparation for the above publication, Aguilar and Abdilleh worked with Google Solutions Architect Ross Thomson to build their own tools. As Aguilar explains, “We developed a set of BigQuery user-defined functions (UDFs) to perform statistical tests using the genomic data of breast cancer. This data is publicly available as BigQuery tables in the ISB-CGC projects. By using these UDFs, we were able to compute hundreds of millions of statistical tests … analysis that typically requires several days of computation and access to supercomputers; but with BigQuery, we were able to complete the analysis in minutes.” The team has now made their UDFs part of ISB-CGC’s resources and available for use by the broader research community.
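To make the idea of running one statistical test per molecular feature concrete, here is a minimal Python sketch of that per-gene pattern. It is not the team's BigQuery UDF code: the choice of Welch's t statistic, the function names, and the toy values are all assumptions made for illustration, and the source does not say which tests the UDFs implement.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic for two independent samples."""
    var_a, var_b = variance(group_a), variance(group_b)
    standard_error = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


def t_statistics_by_gene(expression, groups):
    """Compute one t statistic per gene.

    expression: {gene: {sample_id: value}}
    groups:     {sample_id: "A" or "B"}  (a hypothetical clinical label)
    """
    results = {}
    for gene, values in expression.items():
        a = [v for sample, v in values.items() if groups[sample] == "A"]
        b = [v for sample, v in values.items() if groups[sample] == "B"]
        results[gene] = welch_t(a, b)
    return results


# Toy, made-up data: six samples split into two clinical groups.
groups = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "B"}
expression = {
    "GENE_X": {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0, "s5": 5.0, "s6": 6.0},
    "GENE_Y": {"s1": 2.0, "s2": 4.0, "s3": 6.0, "s4": 2.0, "s5": 4.0, "s6": 6.0},
}
stats = t_statistics_by_gene(expression, groups)
```

On a laptop this loop handles a handful of genes. The point of pushing the same computation into BigQuery, as the team describes, is that the warehouse can evaluate it across very large numbers of feature pairs in parallel.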
“The AI platform, for example, allows us to easily create notebooks to use R or Python in combination with BigQuery or machine learning to perform large-scale statistical analysis of genomic data, all in the cloud. This type of analysis is particularly effective when the data is large and heterogenous, which is the case for cancer-related data.”
Dr. Boris Aguilar, Senior Research Scientist, Institute for Systems Biology
Breaking down silos, integrating diverse tools and datasets
Traditionally, researchers have downloaded source data and performed analysis locally on their machines using tools like R and Python. But as the volume and complexity of cancer data continues to grow, this method has become unsustainable. The good news? Google Cloud fully integrates a range of popular tools, so researchers can still use their favorite languages, including R and Python, to analyze data on the ISB-CGC platforms, directly in the cloud—without the need to download data.
Currently, ISB-CGC has terabytes of publicly available multi-omics data from various cancer datasets available in the cloud on BigQuery, such as The Cancer Genome Atlas (TCGA), and offers example notebooks and workflows in several languages. Aguilar finds that, “Google Cloud offers several user-friendly environments that integrate diverse tools necessary for cancer research. The AI platform, for example, allows us to easily create notebooks to use R or Python in combination with BigQuery or machine learning to perform large-scale statistical analysis of genomic data, all in the cloud. This type of analysis is particularly effective when the data is large and heterogenous, which is the case for cancer-related data.” And making cancer data more broadly available is precisely the aim of the NCI.
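As a rough sketch of what querying BigQuery from a Python notebook can look like, the example below builds a parameterized query and runs it with the google-cloud-bigquery client. The table path and the column names `gene_name` and `expression` are hypothetical placeholders, not actual ISB-CGC table or column names. Running the query assumes Google Cloud credentials plus the google-cloud-bigquery and pandas packages.

```python
def build_gene_query(table, gene_column="gene_name", value_column="expression"):
    """Return SQL that averages a value column for one gene, bound at run time via @gene."""
    return (
        f"SELECT {gene_column} AS gene, AVG({value_column}) AS mean_value, COUNT(*) AS n "
        f"FROM `{table}` "
        f"WHERE {gene_column} = @gene "
        f"GROUP BY gene"
    )


def run_gene_query(table, gene):
    """Run the query in BigQuery and return the result as a pandas DataFrame.

    Needs Google Cloud credentials; nothing is downloaded except the aggregated rows.
    """
    from google.cloud import bigquery  # imported lazily so the SQL builder works offline

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("gene", "STRING", gene)]
    )
    sql = build_gene_query(table)
    return client.query(sql, job_config=job_config).to_dataframe()


# Hypothetical table path, used only to show the generated SQL.
sql = build_gene_query("my-project.my_dataset.expression")
```

In a notebook, a call such as `run_gene_query("my-project.my_dataset.expression", "TP53")` would return a small DataFrame ready for plotting or further analysis. The aggregation runs in BigQuery rather than on the researcher's machine.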
Opening access to critical cancer data—and capturing the scope of its impact
ISB-CGC’s success using Google Cloud as the backbone of its infrastructure is evident in the many publications that cite it, which demonstrate the scope of the research being conducted on the platform. Abdilleh says those publications may be the greatest measure of ISB-CGC’s impact: “We’ve seen a lot of publications being generated; there’s a lot of cool science that’s being done using the platform.”