Broad Institute: Discovering human health revelations hidden in DNA

About Broad Institute

Founded in 2004 by MIT, Harvard University, Harvard-affiliated hospitals, and the visionary Los Angeles philanthropists Eli and Edythe L. Broad, the Broad Institute seeks to describe all the molecular components of life and their connections; discover the molecular basis of major human diseases; develop effective new approaches to diagnostics and therapeutics; and disseminate discoveries, tools, methods, and data openly to the entire scientific community.

Industries: Healthcare

Location: United States

Products: Compute Engine, Cloud Storage, Cloud Life Sciences

Broad Institute replaced its in-house genome sequence analysis computers and storage with Google Cloud Platform, which delivers greater speed, scalability, and data security.

Google Cloud Platform Results

Speeds up the computation needed to sequence a complete human genome fourfold
Protects genomic data by controlling privacy, daily use, data access, sharing, and permissions
Scales to meet spikes in demand for data processing and storage

Cloud platform analyzes human genomes four times faster

Human genomics, the science of studying patterns within human DNA, increasingly relies on high-performance compute and storage resources.

Broad Institute studies the human genome to reveal the secrets behind the origins of diseases and to help find new cures and therapies. A collaboration of MIT, Harvard University, and Harvard-affiliated hospitals, it builds on the success of the Human Genome Project, the international research effort to sequence and map the genetic instructions inside each of us.

A single human genome contains more than 3 billion base pairs of genetic material. For accuracy, researchers typically read each base pair approximately 30 times during sequencing, so they gather almost 100 billion base pairs' worth of raw data (approximately 100 gigabytes) per person.
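
The arithmetic behind the per-person figure can be checked with a short Python sketch. The genome size and the 30-fold read depth come from the text; the figure of roughly one byte of raw data per sequenced base is an assumption made only for illustration, since real file sizes depend on format and compression.

```python
# Back-of-the-envelope estimate of raw sequencing data per person.
GENOME_BASE_PAIRS = 3_000_000_000  # "more than 3 billion" base pairs (from the text)
COVERAGE_DEPTH = 30                # each base pair read about 30 times (from the text)
BYTES_PER_BASE = 1                 # assumption: about 1 byte of raw data per base


def raw_bases(genome_bp: int = GENOME_BASE_PAIRS, depth: int = COVERAGE_DEPTH) -> int:
    """Total number of bases read when the genome is covered `depth` times."""
    return genome_bp * depth


def raw_gigabytes(bases: int, bytes_per_base: float = BYTES_PER_BASE) -> float:
    """Convert a base count to gigabytes under the bytes-per-base assumption."""
    return bases * bytes_per_base / 1e9


bases = raw_bases()
print(f"{bases / 1e9:.0f} billion bases, about {raw_gigabytes(bases):.0f} GB")
# prints "90 billion bases, about 90 GB"
```

Because a genome is slightly larger than 3 billion base pairs, the estimate lands close to the "almost 100 billion" figure quoted in the text.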

For its first decade, the Broad Institute used onsite servers and storage to sequence genomes. But with more scientists conducting research on more samples, the Broad Institute found itself having to store and process data volumes far beyond what its on-premises infrastructure could handle. Not only was the institute referring to existing data on an ongoing basis, its rate of new data generation was doubling every year.

"The pace and volume of data produced for our research was increasing and we needed a place where it could be managed professionally and securely," says Niall Lennon, Senior Director, Translational Genomics for the Genomics Platform of Broad Institute.

Genomes in the Cloud

With an eye on future demands, the Broad Institute envisioned Genomes in the Cloud, a secure, cloud-based infrastructure for processing and analyzing genomic data. The institute built Genomes in the Cloud on Google Cloud Platform to quickly scale compute and storage capabilities.

Broad Institute generates one human genome's worth of sequence data every 12 minutes—roughly 12 terabytes of data every day. That's where Genomes in the Cloud and Google Cloud Platform come in. Together, they let Broad Institute researchers continually analyze data from thousands of samples each year without having to worry about delays or interruptions to their potentially life-saving work.
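
The daily volume follows from the two figures quoted in this case study: one genome every 12 minutes and roughly 100 gigabytes of raw data per genome. A minimal sketch of that calculation, using decimal units (1 TB = 1,000 GB):

```python
# Daily sequencing output implied by the figures in the text.
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_GENOME = 12   # one genome's worth of data every 12 minutes (from the text)
GB_PER_GENOME = 100       # approximate raw data per genome (from the text)


def genomes_per_day(minutes_per_genome: float = MINUTES_PER_GENOME) -> float:
    """Number of genomes sequenced in 24 hours of continuous operation."""
    return MINUTES_PER_DAY / minutes_per_genome


def terabytes_per_day(
    minutes_per_genome: float = MINUTES_PER_GENOME,
    gb_per_genome: float = GB_PER_GENOME,
) -> float:
    """Daily raw data volume in terabytes (1 TB = 1,000 GB)."""
    return genomes_per_day(minutes_per_genome) * gb_per_genome / 1000


print(genomes_per_day())    # prints 120.0
print(terabytes_per_day())  # prints 12.0
```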

Dynamically adjusting to demand

Broad Institute's sequencing operations typically run 24/7 to make the best use of available resources. As new projects and new sequencing instruments come online, compute and storage capacity needs to grow quickly. Similarly, individual projects sometimes have to process and analyze large batches (thousands or tens of thousands) of DNA samples on tight deadlines. Google Cloud Storage lets the institute's computing team scale capacity on the fly to accommodate spikes in resource demand.
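
To illustrate why batch deadlines drive the need for elastic capacity, the sketch below estimates how many machines must run in parallel to finish a batch on time. The per-sample processing time and the deadline are hypothetical inputs, not figures reported by Broad Institute, and the calculation assumes samples are independent and parallelize perfectly.

```python
import math


def workers_needed(samples: int, hours_per_sample: float, deadline_hours: float) -> int:
    """Minimum number of parallel workers needed to finish a batch by the deadline.

    Assumes each sample is processed independently on one worker.
    """
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    total_worker_hours = samples * hours_per_sample
    return max(1, math.ceil(total_worker_hours / deadline_hours))


# Hypothetical batch: 10,000 samples, 4 hours each, due in 72 hours.
print(workers_needed(10_000, 4.0, 72.0))  # prints 556
```

Halving the deadline roughly doubles the number of workers required, which is the kind of spike a fixed on-premises cluster cannot absorb.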

Every sequencing study starts with a DNA sample extracted from a source such as blood, hair, or saliva. From a single specimen, scientists divide long strands of DNA into millions of fragments, each around 400 base pairs long. Sequencing machines then take pictures of each base pair and send the data to Google Cloud Storage.

Broad Institute then reassembles the DNA fragments in the correct order, using Cloud Life Sciences to queue incoming data and feed it to the Genome Analysis Toolkit (GATK) and Picard. These two Java-based genomics analysis programs, both developed by the Broad Institute, run in Docker containers on Google Compute Engine for processing and analysis.
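
The queue-then-container pattern described here can be sketched in a few lines of Python. This is an illustration only, not Broad Institute's pipeline: an in-memory queue stands in for the managed queueing service, the image name `broadinstitute/gatk`, the mount point, and the sample IDs are assumptions, and `MarkDuplicates` is used merely as an example of a Picard tool available through the `gatk` launcher. The code builds the container commands without running them.

```python
import queue

GATK_IMAGE = "broadinstitute/gatk"  # assumption: public GATK image on Docker Hub
DATA_DIR = "/data"                  # hypothetical mount point inside the container


def build_command(sample_id: str, host_dir: str) -> list[str]:
    """Build (but do not run) a docker command that processes one sample."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{DATA_DIR}",  # expose the host data directory to the container
        GATK_IMAGE,
        "gatk", "MarkDuplicates",        # example tool only
        "-I", f"{DATA_DIR}/{sample_id}.bam",
        "-O", f"{DATA_DIR}/{sample_id}.dedup.bam",
        "-M", f"{DATA_DIR}/{sample_id}.metrics.txt",
    ]


def drain(pending: queue.Queue, host_dir: str) -> list[list[str]]:
    """Take every queued sample ID and turn it into a container command."""
    commands = []
    while not pending.empty():
        commands.append(build_command(pending.get(), host_dir))
    return commands


incoming = queue.Queue()
for sample_id in ("sample_001", "sample_002"):  # hypothetical sample IDs
    incoming.put(sample_id)

commands = drain(incoming, "/mnt/sequencing")
print(" ".join(commands[0]))
```

Keeping each sample's work in its own container means more machines can be added when the queue grows, without changing the analysis code.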

Keeping genetic data secure

Researchers must keep study subjects' genetic and medical information private, so the confidentiality and security of test data are essential. Broad Institute relies on security features in Google Cloud Platform to protect its genomic data.

"We believe the cloud can offer security advantages compared to typical on-premises architecture," says Lennon. Google Cloud Platform provides controls for privacy, daily use, data access, sharing, and permissions.

More sequencing, better data sharing

When Broad Institute had limited capacity for processing and analyzing sequence data, new requests had to wait in a queue. Now Broad Institute analysts and engineers can complete analyses roughly four times faster than they could with on-site computing clusters.

Genomes in the Cloud also allows the not-for-profit institute's scientists to share genomics data securely and easily with collaborators and researchers worldwide. Today, Broad Institute simply sends a link to the cloud-accessible data. Broad Institute has also built a public-facing platform that lets researchers share their data analysis methods and data they've collected. Moving to the cloud supports faster and more efficient collaboration among researchers in their efforts to find new cures and therapies.

"We can do important research faster than ever," says Geraldine Van der Auwera, Associate Director of Outreach and Communications, Data Sciences Platform at Broad Institute. "That will lead to a greater understanding of the human genome and the links between genetics and human disease."
