DNAstack tackles massive, complex DNA datasets with Google Genomics

Focus on scientific innovation, not data storage

DNAstack was founded to help scientists around the world interpret genomic data faster and more easily. The DNAstack platform runs on Google Cloud Platform, which frees up the team to design new apps to expand the functionality available to users. “With Google involved, our expertise does not have to be in data storage anymore — it’s going to be in making the data accessible, and that’s what we’re good at,” says Marc Fiume, CEO and founder of DNAstack.

Eliminate friction points

Without a platform like DNAstack, scientists must complete several complicated steps to make sense of the data produced by a DNA sequencer and ultimately figure out how someone’s genetic variation might be associated with disease risk. The process requires many people with different kinds of expertise, spanning bioinformatics, genomics, and technical skills. “In our experience, the communication between all these different individuals with all these skill sets and different jargon is a significant bottleneck,” Fiume says.

In collaboration with its partners, DNAstack aims to build a “sequencer-to-scientist” data processing and analysis workflow that automatically handles routine steps, such as read alignment and variant calling. The goal is to eliminate the friction points and get clear results to scientists faster, without needing all those other experts along the way. The first DNAstack application identifies disease-causing genes.

Google’s industrial strength

Fiume and his team had been working on a data tool that would let them scale their database easily from 100 million genetic variants to 1 billion, 10 billion, or more. They built a proof-of-concept solution and realized it was similar to Google’s BigQuery tool. “When we saw the Google system that was basically the industrial-strength version of what we had built, it was a no-brainer to transfer our platform to Google,” Fiume says.

Google Genomics is dedicated to helping the life science community organize the world’s genomic information and make it accessible and useful. Through extensions to Google Cloud Platform, Fiume and his team can apply the same technologies that power Google Search and Maps to securely store, process, explore and share large, complex biological datasets.

Using BigQuery has already made a big difference in how much data the DNAstack team can process. Their proof-of-concept solution had worked well up to 100 million genetic variants, but then “our performance just fell apart,” Fiume says. Their custom-built solution simply could not stay fast at that scale. With Google Genomics, DNAstack is able to crank through much larger datasets. “We’re getting 4-second turnaround times on very complex searches with BigQuery, compared to tens of seconds or even minutes previously,” he adds. “That fulfills our need for a scalable system.”

The DNAstack team is also using the Google Genomics API, which they tested by building a simple genome browser in a single day. Fiume was impressed that the API was developed in accordance with Global Alliance for Genomics and Health standards to promote interoperability and data sharing. Allowing scientists to compute across genomic data repositories with a single API is an important step in making existing data more valuable and in giving scientists access to previously siloed information.

‘Days are numbered’ for in-house pipelines

Fiume is no stranger to the challenges of building in-house data solutions for genomics workflows. “When you consider the cost of purchasing, setting up, and maintaining these systems, compared to what is possible through cloud-based solutions, the days of running in-house bioinformatics pipelines for serious genomics applications are probably numbered,” he says. “I was so excited when Google stepped in and provided their solution.”