International Parkinson’s consortium uses Google Genomics to analyze massive genetic dataset

Mike Nalls, a staff scientist at the National Institute on Aging, works with the International Parkinson’s Disease Genomics Consortium, a broad collaboration of scientists striving to characterize molecular changes associated with the debilitating disease. A recent study involved compiling 6,500 exomes (an exome is the DNA sequence of all protein-coding regions in an individual’s genome) from data generated at various research institutes on different sequencing platforms over a period of several years. The new dataset will be used “for a number of projects that will attempt to identify and refine both novel and known risk loci for Parkinson’s disease,” Nalls says.

Before the data could support real scientific discoveries, it had to be reanalyzed for consistency. “It’s so many sources of data. To reduce the possibility of technical artifacts, we had to perform realignment, recalibration and re-genotyping of the exomes,” Nalls says. But there was a problem: none of the consortium members had enough local computational resources to process all 6,500 exomes. He estimates that even with enough local infrastructure, the effort would have taken months of compute time.

Using cloud computing for faster discovery

Nalls and his collaborators decided to use Google Genomics, which gives scientists access to virtually unlimited compute resources for large-scale projects. By accessing Google Cloud Platform through Google Genomics, researchers can securely store, process, explore and share biological datasets. “The cloud was really the only option for this,” Nalls says. “We collaborated with Google Genomics to test varying implementations of the standard processing pipeline for exome sequence data on the cohort and population scale.”

Nalls ran the Broad Institute’s GATK Best Practices pipeline using the fully managed service offered by Google Genomics. With it, he processed the full set of 6,500 exomes, starting with raw, unaligned sequence data and ending with a set of variant calls, in just three and a half weeks. He could have run the analysis even faster, but opted to limit the number of virtual machines and disks to take advantage of sustained use discounts. “Cloud computing allowed us to speed up discovery,” he says.

Extensive controls for data access

Because the consortium spans institutes across Europe and the U.S., cloud computing was helpful in providing access to the dataset and analysis results. “We used Google Cloud Platform to share data between sites that make up these collaborations,” Nalls says. “The partitioning of data in the cloud, in terms of permissions for different buckets, allows us to have control over who can see what data.”

That’s important for a large consortium, where members may not have equal access to all data. “We can maintain privacy of the individual samples and how they need to be treated in the cloud,” Nalls adds.

The cloud environment also allows for greater flexibility in manipulating data. Nalls, for instance, could perform analyses on and check the status of the Parkinson’s dataset from any computer with internet access or even his cell phone, rather than relying on a massive cluster.

Today, the scientists who make up the International Parkinson’s Disease Genomics Consortium have a high-quality dataset that is securely accessible and will power a number of future studies into the biological underpinnings of the disease. With cloud computing, the consortium can begin generating results much sooner and more cost-effectively than it could have with local compute resources.