Stanford Center for Genomics and Personalized Medicine: Building a mega-scale genetic variation analysis pipeline
About Stanford Center for Genomics and Personalized Medicine
The Stanford Center for Genomics and Personalized Medicine (SCGPM) focuses on functional genomics and proteomics and has developed many technologies in this area. Using next-gen DNA sequencing, SCGPM researchers are devising new approaches to study genomic changes in cancers to understand cancer origins and progression.
Using Google Genomics and Google BigQuery, SCGPM can analyze hundreds of entire genomes in days and return query results in seconds while providing reliable security for DNA data.
Google Cloud Platform Results
- Processed 500 genomes from raw data to variant calls for the Million Veteran Program pilot data in days
- Implemented a variant analysis pipeline that returns query results in less than 10 seconds
- Established security best practices to help genomics labs confidently store and share data in the cloud
Returning genomics results at unprecedented speed
The genomics revolution is forever changing how healthcare providers understand and treat disease and giving patients new options for healing. At the core of the revolution is the ability to process and manage extremely large and complex DNA sequences to understand how each individual’s genome sequence impacts their health.
Mike Snyder, PhD, Director of the Stanford Center for Genomics and Personalized Medicine (SCGPM) and Chair of Genetics at Stanford University, has been crunching massive genomic data sets for years. Mike focuses on functional genomics and proteomics and has developed many technologies in this area. He conducted the first integrated analysis of a person using multiple ‘omics’ technologies. Using next-gen DNA sequencing and transcriptomics, his laboratory discovered that far more of the human genome is active than was previously expected.
“We’re entering an era where people are working with tens of thousands or even millions of genome projects, and you’re never going to easily do that on a local cluster. Cloud computing is where the field is going.”
—Mike Snyder, PhD, Director, Stanford Center for Genomics and Personalized Medicine

The SCGPM has been involved in a number of massive efforts, from the ENCODE Encyclopedia of DNA Elements to the Million Veteran Program, a national research partnership with one million veteran volunteers to study how genes affect health.
Scientists at the SCGPM lab work on a very large scale. Naturally, the team needs computational resources and tools suited to power bioinformatics users. Working closely with the Google Genomics team, Mike’s group built a genetic variation, or variant, analysis pipeline on Google Cloud Platform that uses Google BigQuery to return results at unprecedented speed.
This group was also among the first users of the Global Alliance for Genomics and Health (GA4GH) APIs implemented by Google Genomics, showcasing the use of community-developed data standards at large scale.
“We’re entering an era where people are working with thousands, tens of thousands, or even millions of genome projects, and you’re never going to do that on a local cluster very easily,” says Mike. “Cloud computing is where the field is going.”
“What you can do with Google Genomics—and can’t do in-house—is run 1,000 genomes in parallel. From our point of view, it’s almost infinite resources. A single user can boot up 5,000 machines.”
—Somalee Datta, PhD, Director of Research IT at Stanford School of Medicine

Scaling to virtually infinite resources
With Google Cloud Platform, scientists can securely store, process, explore, and share biological datasets. The team recently ran a pilot project for the Million Veteran Program, uploading and analyzing 500 genomes in Google Genomics. Initial processing of each genome with on-premises resources took about 36 hours, and the team was limited to processing one or two genomes at a time. But with Google Cloud Platform, they ran all 500 genomes in a matter of days.
“What you can do with Google Genomics—and can’t do in-house—is run 1,000 genomes in parallel,” says Somalee Datta, PhD, Director of Research IT at Stanford School of Medicine. “From our point of view, it’s almost infinite resources. A single user can boot up 5,000 machines.”
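The fan-out Somalee describes can be sketched as a simple worker pool. This is an illustrative sketch only, not the actual pipeline: `call_variants` and the sample IDs below are hypothetical stand-ins for the real per-genome alignment and variant-calling jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def call_variants(genome_id: str) -> str:
    """Hypothetical stand-in for one genome's alignment and
    variant-calling job (about 36 hours each on the local cluster)."""
    return f"{genome_id}.vcf"

genomes = [f"sample_{i:03d}" for i in range(500)]

# In the cloud, the 500 jobs run side by side on separate machines,
# so total wall time approaches that of a single job rather than 500x.
with ThreadPoolExecutor(max_workers=16) as pool:
    vcf_files = list(pool.map(call_variants, genomes))

print(len(vcf_files))  # 500 variant-call files
```

On premises, the lab was limited to one or two such jobs at a time; the same pattern with hundreds of cloud workers is what turned weeks of serial processing into days.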
Letting scientists ask more questions
At volumes where typical systems start to get slow—500 to 1,000 genomes—it might take a few hours or even overnight for scientists to get answers back. That delay dramatically limits the number of questions a scientist can ask of the data. By building a variant analysis pipeline in the cloud, scientists were able to quickly mine DNA variants found in patients’ genomes and compare them to variants in a host of publicly accessible databases using Google BigQuery.
“With Google BigQuery, we get answers back in 10 seconds, even with 500 genomes and millions of variants,” says Somalee. “It’s amazingly fast. And the faster the system is, the more questions we can ask.”
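The actual pipeline issues SQL over variant tables in BigQuery; purely as an illustration of the shape of such a query, the in-memory analogue below joins a patient's variants against a public annotation set keyed by (chromosome, position, alternate allele). All positions, alleles, and rsIDs here are invented.

```python
# Invented example data: a patient's called variants...
patient_variants = {
    ("chr1", 881627, "A"),
    ("chr7", 140453136, "T"),
    ("chr17", 41245466, "G"),
}
# ...and a public database mapping known variants to identifiers.
public_db = {
    ("chr1", 881627, "A"): "rs2272757",
    ("chr17", 41245466, "G"): "rs1799966",
}

# The join: which patient variants are already annotated publicly,
# and which are novel? In BigQuery this is a JOIN across tables
# holding millions of rows, returning in seconds.
known = {v: public_db[v] for v in patient_variants if v in public_db}
novel = patient_variants - known.keys()

print(sorted(known.values()))  # ['rs1799966', 'rs2272757']
print(len(novel))              # 1
```

The speed matters precisely because this join is interactive: a researcher can reformulate the question and re-run it immediately rather than waiting hours for a batch job.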
Heavy-duty security for DNA data
In DNA research, data security is paramount. When moving to a cloud environment, the team needed to ensure that data would be as safe as it had been in the well-protected lab. Working with Google Genomics, they created a series of best practices to keep data safe and sound. Those guidelines will be especially useful to the genomics community now that National Institutes of Health (NIH) and other agencies are allowing scientists to upload their data to the cloud.
“We brought the biological questions, and the Google folks were super computational.”
—Mike Snyder, PhD, Director, Stanford Center for Genomics and Personalized Medicine

The team’s recommendations have been carefully documented and published for anyone to follow; they cover factors such as encryption, user permissions, and limiting data access. This documentation has since been published as a peer-reviewed commentary in Nature Biotechnology, June 2016 (10.1038/nbt.3496). While the steps are not trivial, labs with access to IT experts shouldn’t have trouble complying with the guidelines. Having a clear path to security will be immensely helpful to a community actively looking for ways to share data safely and efficiently.
“We take security very seriously, and there are a lot of rules to follow,” says Somalee. “The moment we knew that NIH would be approving the process, we jumped on the chance to integrate Google Cloud Platform with our internal systems in a way that would be very secure.”
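As a rough sketch of the kinds of Cloud Storage controls such guidelines cover (the bucket, project, key, and account names below are hypothetical; the published commentary holds the actual recommendations):

```shell
# Limit data access: grant read-only access on a single bucket to a
# single analyst, rather than broad project-wide roles.
gsutil iam ch user:analyst@example.edu:objectViewer gs://mvp-pilot-genomes

# User permissions: audit who currently holds access to the bucket.
gsutil iam get gs://mvp-pilot-genomes

# Encryption: Cloud Storage encrypts objects at rest by default; a
# customer-managed key can be attached for tighter key control.
gsutil kms encryption \
    -k projects/my-proj/locations/us/keyRings/genomics/cryptoKeys/dna \
    gs://mvp-pilot-genomes
```

These commands are a configuration sketch under assumed names, not a complete security posture; audit logging, network controls, and organizational policy sit alongside them in practice.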
“A terrific experience” in the cloud
With the costs of cloud computing dropping significantly and demand for ever-larger genomics studies growing, Mike thinks fewer labs will continue relying on local infrastructure. The protocol, which is open source and well documented, allows for consistent and comparable variant analysis across all genomes, even in the largest-scale projects. The protocol has since been published in the peer-reviewed journal Bioinformatics, July 2017 (10.1093/bioinformatics/btx468).
“We brought the biological questions, and the Google folks were super computational,” adds Mike. “It was a terrific experience.”
Google Cloud Platform is now broadly available to Stanford genomics researchers. SCGPM operates a Bioinformatics Core that provides integrated Google Cloud Platform services including security and invoice management to participating labs. Over a petabyte of genomic data is currently stored and analyzed on Google Cloud Platform.