Broad Institute speeds scientific research with Cloud SQL
Kristian Cibulskis
Director of Technology - Data Sciences Platform, Broad Institute of MIT and Harvard
Editor’s note: The Broad Institute of MIT and Harvard, a nonprofit biomedical research organization that develops genomics software, needs to keep pace with the latest scientific discoveries. Here’s how they use managed database services from Google Cloud to move fast and stay on the cutting edge.
The Broad Institute of MIT and Harvard is a nonprofit biomedical research organization focused on advancing the understanding and treatment of human disease. One of our major initiatives is developing genomics tools and making them available across the scientific ecosystem. The rapid pace of discovery means our data sciences team has to keep up so that our software products enable the best research. Our ability to move fast is critical, and when we pivoted our focus during the pandemic to build a testing operation and process tens of millions of COVID-19 tests, speed was a driving factor. Fully managed database services and analytics solutions from Google Cloud helped us accelerate our pace of development.
Accelerating genomics insights with Cloud SQL
One of our main products that uses Google Cloud services is Terra, a secure, scalable, open-source platform for biomedical research. We co-developed it with Microsoft and Verily to help researchers access public datasets, manage their private data, organize their research, and collaborate with others. Given our long history of working with Google Cloud, it was natural for us to build Terra's control plane on Google Cloud services.
For the backend, we use a number of cloud services, including Cloud SQL for PostgreSQL and MySQL as well as Firestore, to let users track their data assets, methods, and research results, and to power the Terra control plane. Cloud SQL helps us accelerate development in two key areas. First, our developers can get these database services up and running quickly, without going through a centralized system that might become a bottleneck. Second, using Cloud SQL lowers our operational burden: we can keep managed services running and performing well with fewer of our own developers, so those teams can focus on building new features for users.
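To make that pattern concrete, here is a minimal sketch of how a backend service can talk to a Cloud SQL for PostgreSQL instance using the Cloud SQL Python Connector and SQLAlchemy. The project, instance, credentials, and table names are placeholders for illustration, not Terra's actual configuration.

```python
# Minimal sketch: connect a backend service to Cloud SQL for PostgreSQL
# using the Cloud SQL Python Connector and SQLAlchemy. All names below
# (instance, user, database, table) are hypothetical examples.
import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # "my-project:us-central1:terra-metadata" is a placeholder instance
    # connection name of the form project:region:instance.
    return connector.connect(
        "my-project:us-central1:terra-metadata",
        "pg8000",
        user="app-user",
        password="app-password",
        db="workspace_metadata",
    )

engine = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)

with engine.connect() as conn:
    # Hypothetical table tracking a user's data assets.
    rows = conn.execute(
        sqlalchemy.text(
            "SELECT asset_id, asset_type, created_at "
            "FROM data_assets WHERE owner = :owner"
        ),
        {"owner": "researcher@example.org"},
    )
    for row in rows:
        print(row)
```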
Optimizing cloud spend with BigQuery analytics
For much of our genomics analysis, we use BigQuery, Compute Engine, and Dataproc, but understanding the detailed costs of that research has been challenging. Billing data can be exported into BigQuery, but the costs aren't attributed to the specific analyses being performed. By adding billing labels to each cloud resource we use and joining that information with the detailed metadata in our relational Cloud SQL databases, we can provide extremely fine-grained cost information. As a result, we're able to tell a researcher, for example, that their virtual machine cost 17 cents as part of a certain analysis, research project, or sample. With these insights, our researchers have visibility into their costs and can decide where to focus their optimizations.
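As an illustration of the billing side of that join, the sketch below queries a standard billing export table in BigQuery and aggregates cost by resource label. The dataset, table, and label keys (analysis_id, sample_id) are assumptions made for the example; in practice the results would then be matched against the analysis metadata held in Cloud SQL.

```python
# Illustrative sketch: attribute exported billing costs to analyses via
# resource labels. Dataset, table, and label keys are placeholders and
# not necessarily the names used in production.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'analysis_id') AS analysis_id,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sample_id')   AS sample_id,
  service.description AS service,
  ROUND(SUM(cost), 2) AS total_cost_usd
FROM `my-project.billing.gcp_billing_export_v1_000000`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY analysis_id, sample_id, service
ORDER BY total_cost_usd DESC
"""

for row in client.query(query).result():
    print(row.analysis_id, row.sample_id, row.service, row.total_cost_usd)
```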
Pivoting to process COVID-19 tests
When the global pandemic hit, the Broad Institute volunteered to make our clinical testing and diagnostics facilities available to serve public health needs. We created a novel, scalable, modular, high-throughput automation system for COVID-19 test processing for the Commonwealth of Massachusetts and surrounding areas. In the first several months of the pandemic, Broad processed more than 10 percent of all PCR tests in the United States, and to date has processed more than 30 million tests, with turnaround times of less than 24 hours. Using serverless components with a Cloud SQL for PostgreSQL database at its core, we built a testing solution, going from an idea to launching our large-scale COVID-19 operation in just two weeks. On our first day, we delivered 140 tests; a year later we were delivering up to 150,000 tests a day. That's in part because our database solution was able to scale up quickly.
With a few CLI commands, we enabled high availability and read replicas for our database, while backups and maintenance upgrades were handled automatically. That scalability made a big difference, given that we were a small team working on very tight timelines.
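For readers who want to see what those operations look like programmatically, here is a rough sketch using the Cloud SQL Admin API from Python, roughly equivalent to the gcloud commands we ran. Project, instance, region, and machine-size values are placeholders, and the exact fields required can vary with your configuration.

```python
# Rough sketch: enable high availability and add a read replica for a
# Cloud SQL instance via the Cloud SQL Admin API. All identifiers and
# sizing values below are illustrative placeholders.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")

# Enable high availability (regional) on an existing instance.
sqladmin.instances().patch(
    project="my-project",
    instance="covid-testing-db",
    body={"settings": {"availabilityType": "REGIONAL"}},
).execute()

# Create a read replica of the primary instance.
sqladmin.instances().insert(
    project="my-project",
    body={
        "name": "covid-testing-db-replica-1",
        "masterInstanceName": "covid-testing-db",
        "region": "us-central1",
        "settings": {"tier": "db-custom-4-15360"},  # illustrative machine size
    },
).execute()
```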