Stanford genomics center builds mega-scale variant analysis pipeline on Google Genomics

At Stanford University’s Center for Genomics and Personalized Medicine, scientists work at scale. Director Mike Snyder and his lab have been involved in a number of massive efforts, from ENCODE to the Million Veteran Program. Naturally, the team needs computational resources and tools suited to power users of bioinformatics.

Scaling to virtually infinite resources

Snyder and his colleagues turned to Google Genomics, which gives scientists access to Google Cloud Platform to securely store, process, explore and share biological datasets. With the costs of cloud computing dropping significantly and demand for ever-larger genomics studies growing, Snyder thinks fewer labs will continue relying on local infrastructure. “We’re entering an era where people are working with thousands or tens of thousands or even million-genome projects, and you’re never going to do that on a local cluster very easily,” he says. “Cloud computing is where the field is going.”

Snyder’s team recently ran a pilot project for the Million Veteran Program, uploading and analyzing 500 genomes in Google Genomics. Initial processing of each genome with on-premise resources took about 36 hours, and the team was limited to processing one or two genomes at a time. But with Google Cloud Platform, they ran all 500 genomes in a matter of days. “What you can do with Google Genomics — and you can’t do in-house — is run 1,000 genomes in parallel,” says Somalee Datta, director of bioinformatics at Stanford’s Center for Genomics and Personalized Medicine. “From our point of view, it’s almost infinite resources. A single user can boot up 5,000 machines.”
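
As a rough back-of-the-envelope illustration of why that parallelism matters (the 36-hour and 500-genome figures come from the pilot above; the perfect-batching model is a simplifying assumption), wall-clock time falls roughly in proportion to how many genomes run at once:

    import math

    # Figures from the pilot described above: ~36 hours of initial processing
    # per genome, 500 genomes in total. Perfect batching is assumed purely
    # for illustration.
    GENOMES = 500
    HOURS_PER_GENOME = 36

    def wall_clock_days(parallel_jobs: int) -> float:
        """Approximate wall-clock days if genomes run in batches of `parallel_jobs`."""
        batches = math.ceil(GENOMES / parallel_jobs)
        return batches * HOURS_PER_GENOME / 24

    print(wall_clock_days(2))    # ~375 days when limited to two genomes at a time
    print(wall_clock_days(500))  # ~1.5 days when all 500 genomes run in parallel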

Fast results let scientists ask more questions

Working closely with the Google Genomics team, Snyder’s group built a variant analysis pipeline in the cloud that took advantage of Google’s existing big data tools to return results at unprecedented speed.

The scientists were looking for a fast way to mine DNA variants found in patients’ genomes and compare them to variants in a host of publicly accessible databases. “If you want to do this for two patients or five patients, you could do it pretty much anywhere,” Datta says. “But when you want to do this for 500 or 1,000 patients, that’s when your typical systems start to get slow — slow enough that you run a query and it takes a few hours or even overnight to get your answers back.”

That kind of delay dramatically limits the number of questions a scientist can ask of the data. But Google BigQuery, developed to analyze massive log files, provided a completely different approach. “With BigQuery, we get answers back in 10 seconds, even with 500 genomes and millions of variants,” Datta says. “It’s amazingly fast. And the faster the system is, the more questions we can ask.”
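
As a hedged sketch of what such a query can look like (this is not the Stanford pipeline itself; the public genomics-public-data.platinum_genomes.variants table and its column names are assumptions based on Google Genomics’ public BigQuery variant exports), a few lines of Python can count variant records per chromosome across every genome in a table:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Count variant records per chromosome across all samples in a public
    # Google Genomics variant export. Rows with no alternate bases are
    # non-variant (reference) segments, so they are filtered out.
    query = """
    SELECT
      reference_name,
      COUNT(1) AS variant_count
    FROM
      `genomics-public-data.platinum_genomes.variants`
    WHERE
      ARRAY_LENGTH(alternate_bases) > 0
    GROUP BY
      reference_name
    ORDER BY
      variant_count DESC
    """

    for row in client.query(query).result():
        print(f"{row.reference_name}\t{row.variant_count}")

Because BigQuery scans the table in parallel on Google’s infrastructure, the same query pattern scales from a handful of genomes to hundreds without modification.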

The pipeline was developed jointly by Google Genomics and Snyder’s team. “We brought the biological questions, and the Google folks were super computational. It was a terrific experience,” Snyder says. The protocol, which is open source and well documented, allows for consistent and comparable variant analysis across all genomes, even in the largest-scale projects.

Heavy-duty security for DNA data

At Snyder’s center, security is paramount. When it came to moving to a cloud environment, the team needed to ensure that data would be as safe as it had been in the well-protected lab. “We take security very seriously,” Datta says. “There are a lot of rules to follow.”

Working with Google Genomics, they created a series of best practices to keep data safe and sound. Those guidelines will be especially useful to the genomics community now that the National Institutes of Health (NIH) and other agencies are allowing scientists to upload their data to the cloud. “The moment we knew that NIH would be approving the process, we jumped on the chance to integrate Google Cloud Platform with our internal systems in a way that would be very secure,” Datta says.

The team’s recommendations have been carefully documented and published for anyone to follow; they cover factors such as encryption, user permissions and limits on data access. While the steps are not trivial, Datta says that labs with access to IT experts shouldn’t have problems complying with the guidelines.
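
As one illustrative sketch of the kinds of controls such guidelines cover (the bucket name and group address below are hypothetical, and this is not the center’s published procedure), access to a Cloud Storage bucket holding genomic data can be limited to a single read-only analyst group through IAM:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("example-genomics-data")  # hypothetical bucket name

    # Cloud Storage encrypts objects at rest by default; access control is the
    # main knob. Grant read-only object access to one analyst group and nothing
    # broader, following the principle of least privilege.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        "role": "roles/storage.objectViewer",
        "members": {"group:genome-analysts@example.org"},  # hypothetical group
    })
    bucket.set_iam_policy(policy)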

Snyder says that having a clear path to achieve security will be immensely helpful to a community actively looking for ways to share data more efficiently. “In the old days we used to pass around hard drives, and the old days weren’t that long ago,” he says. “If you have this in the cloud, people can easily share data.”