Tute Genomics makes genetic variant database publicly accessible with Google Genomics

The Tute Genomics team has scoured repositories of DNA data to build one of the world’s largest databases of genetic variants, complete with carefully curated annotations. The database of more than 8.5 billion variants is a key part of Tute’s genome interpretation platform.

Sharing data in a useful, accessible way

Tute decided to contribute the dataset to the biomedical community so even more people could use it, but they needed a way to organize and store the data. “Dumping data somewhere doesn’t actually make it accessible to others unless it’s easy to get and is structured and organized in a way that it can be queried,” says David Mittelman, chief scientific officer at Tute. “You need an environment where you can store, organize, and interact with your data — all in the same place.”

After evaluating their options, they chose Google Genomics for its scalability and easy-to-use ecosystem. “With the backbone of Google Cloud Platform, Google Genomics seemed like the perfect way to share this data,” Mittelman says.

Processing more genome data faster

As more genomes are sequenced, scientists and clinicians will need variant information to make sense of them. “Those genomes aren’t useful unless you have the annotation data tied to them,” says Reid Robison, Tute’s CEO. “We see combining patient genomes with the annotation database as a way to enable rapid progress for genome-guided medicine.”

Crunching data across large numbers of genomes, though, is beyond the capacity of most in-house compute infrastructure. “What happens when you have 10,000 whole genomes? No one has the computing resources in their office for that,” Robison says. “That’s where Google Genomics comes in as a key partner and resource.”

Without Google Genomics, scientists would have to download the genome data, write a program to annotate the variants, and write scripts to detect variants with certain attributes, Mittelman says. “The idea is to combine everything in the same environment — to put in a really good foundation — so you spend less time writing code or moving files back and forth, and more time asking cool scientific questions,” he adds. For instance, Google BigQuery lets you compare variants in different datasets, explore variants common to multiple datasets and filter variants in one dataset based on variants in another.
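The three kinds of cross-dataset questions mentioned above (variants common to several datasets, variants that differ between datasets, and filtering one dataset by another) can be illustrated with a minimal in-memory Python sketch. This is not the BigQuery or Google Genomics API, and it does not reflect the Tute database schema. The `Variant` key, the function names, and the toy values are all hypothetical, chosen only to show the underlying set logic.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant key (an assumption, not Tute's schema)."""
    chrom: str  # chromosome name
    pos: int    # position on the chromosome
    ref: str    # reference allele
    alt: str    # alternate allele


def common_variants(datasets: Iterable[Set[Variant]]) -> Set[Variant]:
    """Return the variants present in every dataset (set intersection)."""
    datasets = list(datasets)
    if not datasets:
        return set()
    return set.intersection(*datasets)


def only_in_first(first: Set[Variant], second: Set[Variant]) -> Set[Variant]:
    """Compare two datasets: variants found in the first but not the second."""
    return first - second


def annotate(sample: Set[Variant],
             annotations: Dict[Variant, str]) -> Dict[Variant, str]:
    """Filter a sample to the variants that have an annotation, and attach it."""
    return {v: annotations[v] for v in sample if v in annotations}


if __name__ == "__main__":
    # Toy values for illustration only; they are not real findings.
    sample_a = {Variant("1", 100, "A", "G"), Variant("1", 200, "C", "T")}
    sample_b = {Variant("1", 200, "C", "T"), Variant("2", 300, "G", "A")}
    toy_annotations = {Variant("1", 200, "C", "T"): "toy annotation"}

    print(common_variants([sample_a, sample_b]))
    print(only_in_first(sample_a, sample_b))
    print(annotate(sample_a, toy_annotations))
```

At the scale described in this article, the same intersections, differences, and joins would be expressed as queries over hosted tables rather than Python sets held in memory; the sketch only shows what those queries compute.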

Gaining deeper insights with BigQuery

BigQuery makes the Tute dataset more useful by letting users query massive genomic datasets and get results quickly. “Google is improving our ability to use genomic information,” Mittelman says. “There’s a huge barrier to entry for analyzing gigantic sequencing datasets. BigQuery lowers that barrier by letting you immediately start asking questions.”

For example, the Tute team processed 88 GB of genetic variant data from a high-quality genome repository and had high-quality results within 30 seconds, compared with the minutes or hours the analysis would have taken without BigQuery.

To use the Tute Genomics dataset, scientists can upload their own genomic data or query data already hosted by Google Genomics, such as the 1000 Genomes Project data or the Autism Speaks MSSNG Project data.

The team believes that making this data publicly available will help spark interest in genomics and cloud computing and encourage development of more cloud-based tools. “It’s a whole new world with genomic data,” Robison says. “There’s so much that can be learned, and with Google Genomics, it’s amazing what can be accomplished.”