About National Institute on Aging
Established in 1974, the National Institute on Aging (NIA) is part of the National Institutes of Health, where it supports research to understand the biology of aging and how to improve health and activity as people age. NIA-funded research ranges from studying age-related cellular changes to elucidating age-related conditions such as Alzheimer’s and Parkinson’s disease.
By accessing Google Cloud Platform through Google Genomics, researchers at the National Institute on Aging can more securely store, process, explore, and share large biological datasets.
Google Cloud Results
- Uses Broad Institute’s GATK on Google Genomics for exome data processing
- Processes nearly 200 TB of data for 6,500 exomes in just 3.5 weeks, compared to months on local infrastructure
- Plans to share data with researchers at 50+ institutions around the world, with user access control to enhance security
New discoveries in weeks versus months
The National Institute on Aging works with the International Parkinson’s Disease Genomics Consortium, a broad collaboration of scientists striving to characterize molecular changes associated with the debilitating disease. A recent study involved compiling information from thousands of exomes (an exome is the DNA sequence of all protein-coding regions in an individual’s genome) generated at various research institutes on different sequencing platforms over a period of several years.
To make real scientific discoveries possible from so many sources of data, the data had to be reanalyzed for consistency. To reduce the possibility of technical artifacts, scientists had to perform realignment, recalibration, and re-genotyping of the exomes. But there was a problem: none of the consortium members had enough local computational resources to process all 6,500 exomes.
“Cloud computing allowed us to speed up discovery. We collaborated with Google Genomics to test varying implementations of the standard processing pipeline for exome sequence data on the cohort and population scale. The cloud was really our only option for this.”—Mike Nalls, PhD, Scientist, National Institute on Aging
The team decided to use Google Genomics, a fully managed service on Google Cloud Platform. Scientist Mike Nalls ran Broad Institute’s GATK Best Practices pipeline using Google Genomics, processing the full set of 6,500 exomes (nearly 200 TB of data), from raw, unaligned sequence data through to a set of variant calls, in just three and a half weeks. The dataset was subsequently used to identify six new risk loci for Parkinson’s disease, helping scientists better understand genetic risks for the disease.
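To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above (nearly 200 TB, 6,500 exomes, three and a half weeks) and assumes decimal units (1 TB = 1,000 GB) and continuous processing over the whole period; the results are averages for illustration, not measurements reported by the team.

```python
# Figures quoted in the case study; the 200 TB total is approximate.
TOTAL_TB = 200
EXOMES = 6_500
WEEKS = 3.5

# Assumption (not from the case study): decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000

# Assumption: processing ran continuously for the full period.
days = WEEKS * 7

# Average data volume attributable to each exome.
gb_per_exome = TOTAL_TB * GB_PER_TB / EXOMES

# Average throughput over the run.
exomes_per_day = EXOMES / days
tb_per_day = TOTAL_TB / days

print(f"Elapsed days:    {days:.1f}")
print(f"Data per exome:  {gb_per_exome:.1f} GB")
print(f"Exomes per day:  {exomes_per_day:.0f}")
print(f"Data per day:    {tb_per_day:.1f} TB")
```

Under these assumptions the run averages roughly 31 GB per exome and about 265 exomes per day.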
Analyzing massive genetic datasets
Mike could have run the analysis even faster, but opted to limit the number of virtual machines and disks to take advantage of sustained use discounts and reduce costs. Even if hardware could have been procured, the effort would have taken months of compute time using local infrastructure. With Google Genomics on Google Cloud Platform, the National Institute on Aging can now analyze massive datasets, giving scientists access to virtually unlimited compute resources for large-scale projects.
Extensive controls for data access
Because the consortium spans more than 50 research institutes across Europe and the U.S., cloud computing was helpful in providing access to the dataset and analysis results. That’s important for a large consortium, where members may not have equal access to all data.
“We used Google Cloud Platform to share data between sites,” adds Mike. “The partitioning of data in the cloud, in terms of permissions for different buckets, allows us to have control over who can see what data. We can maintain privacy of individual samples and how they need to be treated in the cloud.”
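The idea of partitioning data by bucket permissions can be illustrated with a small, self-contained Python model. This is a toy sketch, not the Google Cloud Storage API and not the consortium's actual setup: the `Bucket` class, the bucket names, and the site names are all hypothetical. In a real deployment the same idea would be expressed through the storage service's own access controls.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Toy stand-in for a storage bucket with a read-access list."""
    name: str
    readers: set = field(default_factory=set)

    def grant_read(self, institution: str) -> None:
        # Add an institution to this bucket's read-access list.
        self.readers.add(institution)

    def can_read(self, institution: str) -> bool:
        return institution in self.readers


def visible_buckets(buckets, institution):
    """Return the sorted names of the buckets an institution may read."""
    return sorted(b.name for b in buckets if b.can_read(institution))


# Hypothetical partitioning: individual-level data is restricted,
# aggregate results are shared more widely.
raw_reads = Bucket("raw-reads")
variant_calls = Bucket("variant-calls")
summary_stats = Bucket("summary-stats")
buckets = [raw_reads, variant_calls, summary_stats]

# Hypothetical sites: site-a contributed samples, site-b only analyzes results.
raw_reads.grant_read("site-a")
variant_calls.grant_read("site-a")
summary_stats.grant_read("site-a")
summary_stats.grant_read("site-b")

print(visible_buckets(buckets, "site-a"))
print(visible_buckets(buckets, "site-b"))
print(visible_buckets(buckets, "site-c"))
```

Keeping permissions at the bucket level means access decisions follow the way the data is partitioned: a site with no grant sees nothing, and individual-level data can stay restricted while aggregate results are shared.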
The cloud environment also allows for greater flexibility in manipulating data. Mike, for instance, could run analyses on the Parkinson’s dataset and check their status from any computer with Internet access, or even from his cell phone, rather than relying on a massive cluster.
Powering future studies
Today, the scientists of the International Parkinson’s Disease Genomics Consortium have a high-quality dataset that is securely accessible and will power a number of future studies into the biological underpinnings of the disease. With cloud computing, the consortium can begin generating results much sooner and more cost-effectively than it could with local compute resources.
“We’re using the dataset for a number of projects that will attempt to identify and refine both novel and known risk loci for Parkinson’s disease,” says Mike.