Broad Institute: Powering the next generation of genomics research with Google Cloud Platform

Broad Institute is using genomics — the science of sequencing, mapping, and studying genetic code — to understand the biology of human disease and help lay the groundwork for a new generation of therapies. The Institute was founded in 2004 in part to build on the success of the Human Genome Project, an international effort that deciphered the human genome.

Genomics requires more than scientists’ creativity and hard work. It relies on advances in computing technology that allow researchers to gather, store, analyze, and share vast amounts of data. For a decade, Broad Institute used on-premises storage and servers to do that. But as it gathered ever-increasing amounts of information by sequencing hundreds of thousands of genetic samples, its data centers could not keep up with demand. The Institute decided to move to the cloud and designed Genomes in the Cloud, a pipeline for processing and analyzing genomic sequencing data.

The Institute built Genomes in the Cloud on Google Cloud Platform. In this way, the Institute can quickly scale its computing capabilities to match the often changing but always immense volume of data produced by its sequencing facilities, and help its scientists, their collaborators, and other researchers around the world rapidly turn raw DNA sequence data into genomic insights and biological knowledge.

“The pace and volume of data being produced for our research was increasing and we needed a place where it could be managed professionally and securely. With Google Cloud Platform we can achieve this.” — Niall Lennon, Senior Director, Translational Genomics for the Genomics Platform of Broad Institute

Building a powerful sequence-analysis pipeline

Modern DNA sequencing machines produce about 100 gigabytes of raw data for every whole human genome sequenced. Broad Institute’s sequencing facility produces, on average, a human genome’s worth of sequence data every 12 minutes, equivalent to roughly 12 terabytes of data every day. Taming such volumes of data requires an immense amount of storage and processing.
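The daily figure follows directly from the two numbers quoted above. As a rough sketch of that arithmetic (the constant and function names are illustrative, and decimal units of 1 TB = 1,000 GB are assumed):

```python
# Back-of-the-envelope check of the throughput figures quoted in the text.
# Names are illustrative; decimal units (1 TB = 1,000 GB) are an assumption.

GB_PER_GENOME = 100        # from the text: about 100 GB of raw data per genome
MINUTES_PER_GENOME = 12    # from the text: one genome's worth every 12 minutes
MINUTES_PER_DAY = 24 * 60


def genomes_per_day(minutes_per_genome: float = MINUTES_PER_GENOME) -> float:
    """Number of genome-equivalents produced in 24 hours."""
    return MINUTES_PER_DAY / minutes_per_genome


def terabytes_per_day(
    gb_per_genome: float = GB_PER_GENOME,
    minutes_per_genome: float = MINUTES_PER_GENOME,
) -> float:
    """Raw data volume per day, in decimal terabytes."""
    return genomes_per_day(minutes_per_genome) * gb_per_genome / 1000


if __name__ == "__main__":
    print(f"{genomes_per_day():.0f} genomes/day, {terabytes_per_day():.0f} TB/day")
```

At one genome every 12 minutes, the facility produces 120 genome-equivalents a day, which at about 100 gigabytes each comes to roughly 12 terabytes.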

That’s where Genomes in the Cloud and Google Cloud Platform come in. Together, they allow Broad Institute’s scientists and collaborators to continue to process and analyze genome data from thousands of samples each year.

Every sequencing study starts with a sample, such as blood from a person or cells from a laboratory culture. After extraction and initial prep work, the cells’ DNA is loaded into sequencing machines, which read the DNA and churn out snippets of DNA sequence in digital form. The machines send these data to Google Cloud Storage.

By design, Broad’s sequencing facility frequently operates at close to full capacity. As new projects and new sequencing instruments come online, that capacity often needs to grow dramatically in a short timeframe. In addition, individual projects sometimes need to process and analyze large batches (e.g., thousands to tens of thousands) of genomes on tight deadlines. With Google Cloud Storage, Broad’s computing team can scale its capacity on the fly to match changing storage demands.

Once in storage, the snippets of sequence need to be put back in the right order. In the cloud, the Broad Institute uses the Google Genomics API to queue incoming data and feed them to Genome Analysis Toolkit (GATK) and Picard — two Broad-written, Java-based genomics analysis programs running in Docker containers on Google Compute Engine — for processing and analysis.

Keeping genetic data secure

Broad Institute recognizes the importance of keeping genomic data secure, and relies on Google Cloud Platform’s security features to help it do that.

“We believe the cloud can offer security advantages compared to typical on-premises architecture,” says Lennon. Google Cloud Platform’s implementation has features for privacy control, daily use control, data access control, sharing, permissions, and overall security.

More sequencing, better data sharing

Broad Institute previously had a fixed capacity for processing and analyzing sequence data. Once it reached that limit, new analysis requests had to wait in a queue. Now Broad’s analysts and engineers can complete analyses roughly four times faster than was possible using on-premises computing clusters.

Genomes in the Cloud also allows the nonprofit’s scientists to share genomics data securely and easily with collaborators and other researchers around the world. Today, Broad Institute needs only to send a researcher a link to the cloud-accessible data. Broad has also built a public-facing platform that allows researchers to share the methods they use for analyzing data, as well as the data themselves. This will allow scientists around the world to collaborate more effectively on genomic research.

“We’re able to do important research more quickly than before. That will lead to a greater understanding of the human genome and the links between genetics and human disease.” — Geraldine Van der Auwera, Associate Director of Outreach and Support, Data Sciences and Data Engineering at Broad Institute