ETH Zurich: Deciphering life with the largest-ever DNA search engine
About ETH Zurich
ETH Zurich aims to find solutions for the defining challenges of our time, while cultivating a team of innovative and critical researchers. Its Biomedical Informatics (BMI) Group combines medicine and biology with computer science to model and make sense of molecular processes and diseases and contribute to improving treatment options together with medical collaborators.
The BMI Group is processing 4 petabytes of sequencing data to create the world’s largest-ever DNA search index, making the world’s genetic code more accessible for medical and scientific research.
Google Cloud results
- Supports fast access to all publicly available DNA sequencing data from the National Center for Biotechnology Information (NCBI) database
- Enables processing of 4 petabytes of data to construct the largest-ever search index for DNA sequencing data
- Broadens the scope of projects that can be undertaken by optimizing costs with infrastructure flexibility
- Allows researchers to focus on science by eliminating the risk of infrastructure shortages with Compute Engine
Petabyte-scale DNA sequences now efficiently searchable
What is life? It’s one of the world’s oldest questions, affecting entire branches of biology, genetics, and biochemistry, not to mention philosophy. And yet, in many ways, there’s no conclusive answer. That’s not to say we don’t know a great deal about life: so far, we know that all living organisms store genetic information in nucleotide sequences (DNA or RNA), for example. For some species, including our own, we even have a complete blueprint, written in the letters A, C, G, and T. These letters represent the four base types found in DNA: adenine (A), cytosine (C), guanine (G), and thymine (T). Institutions such as the National Center for Biotechnology Information (NCBI) currently maintain about 15 quadrillion nucleotides of sequence information.
But despite everything we know about life’s underlying code, we’re still scratching the surface of many fundamental questions. How does life emerge? Why does it end? How do we survive diseases? These are just some of the questions that researchers at the Biomedical Informatics (BMI) Group at ETH Zurich are working to address.
Prof. Dr. Gunnar Rätsch and his team are combining machine learning, health informatics, and bioinformatics with clinical data science, bridging medicine and biology with computer science to streamline the analysis of large genomic and medical datasets. Beyond making the data more accessible to researchers and companies all around the world, the BMI Group aims to develop and apply methods and tools that address specific biomedical questions or solve practical problems, thereby helping to optimize treatments for genomic diseases such as cancer, as well as genetic disorders. By doing this, it aims to advance personalized medicine, which targets illnesses based on an individual’s DNA.
Faster access to large-scale data with Google Cloud
Methods developed by the BMI Group can help to uncover the biological processes behind the development and progression of cancer, integrate signal data from intensive care units to build early warning systems, and address many more important issues. To develop some of the groundbreaking algorithms that make these initiatives possible, the BMI Group relies heavily on sequencing data. But despite access to a vast amount of information in the NCBI repository, existing methods don’t allow for the most effective use of these datasets.
“Each time we needed to access a dataset, we had to download it from the repository and apply our algorithms locally,” says Dr. Andre Kahles, Senior Researcher and leader of the metagraph project at the BMI Group. “What’s more, these repositories are not fully searchable. They cannot yet answer the simple question, ‘have I ever seen this piece of DNA sequence before?’”
“Just imagine the World Wide Web without search engines,” adds Dr. Daniel Danciu, Software Engineer at the BMI Group. “That’s the state of DNA sequence databases today. Instead of single web pages, we have sequenced genomes or metagenomes. The problem is, there are millions of them, but no way to search the information contained within them. That’s what we’re trying to change, by developing a search engine for DNA sequences.”
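The question "have I ever seen this piece of DNA sequence before?" can be illustrated with a toy k-mer index. This is only a sketch and not the MetaGraph implementation, whose internals are not described here. It assumes a plain in-memory dictionary, a hypothetical `KmerIndex` class, and a very small k for readability. A query returns candidate samples that contain every k-mer of the query, which is a necessary but not sufficient condition for containing the full sequence.

```python
from __future__ import annotations

from collections import defaultdict


class KmerIndex:
    """Toy inverted index from k-mers to the labels of samples containing them."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self._index: dict[str, set[str]] = defaultdict(set)

    def _kmers(self, sequence: str):
        # Slide a window of length k over the sequence.
        seq = sequence.upper()
        for i in range(len(seq) - self.k + 1):
            yield seq[i:i + self.k]

    def add(self, label: str, sequence: str) -> None:
        """Register every k-mer of `sequence` under the sample `label`."""
        for kmer in self._kmers(sequence):
            self._index[kmer].add(label)

    def query(self, sequence: str) -> set[str]:
        """Return labels of samples that contain every k-mer of the query."""
        hits = None
        for kmer in self._kmers(sequence):
            labels = self._index.get(kmer, set())
            hits = set(labels) if hits is None else hits & labels
            if not hits:
                return set()
        return hits or set()


# Hypothetical sample labels and sequences, for illustration only.
index = KmerIndex(k=4)
index.add("sample_a", "ACGTACGGA")
index.add("sample_b", "TTGACGTA")

seen_in = index.query("ACGTA")   # k-mers ACGT and CGTA occur in both samples
only_a = index.query("ACGG")     # occurs only in sample_a
unseen = index.query("GGGG")     # never indexed
```

A dictionary of sets does not scale to petabytes; the point is only the shape of the lookup, from a short sequence to the samples in which it has been observed.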
This search engine, the BMI Group’s Metagraph project, will hold all available sequencing data for eukaryotes, or organisms with a membrane-bound nucleus, including fungi, plants, and animals. Eventually, the researchers plan to extend its scope to the rest of cellular life. It’s an enormous task that has never been achieved before, and the team’s ambitions were curtailed by a second major obstacle: efficient access to the data.
Fortunately, the NCBI decided to make all sequencing data available within Google Cloud in 2019, which allowed the researchers to bring the algorithms to the data, instead of the other way around. Today, the BMI Group uses Cloud Storage to store sequencing information and Compute Engine VM instances to process the data.
Empowering scientists to explore new ideas
Before the switch to Google Cloud, the BMI Group had to limit its operations to smaller sequencing datasets of several terabytes in size, just to keep download and processing times manageable. “We downloaded the data onto our premises and used local compute to process them,” says Andre. “That works well for small datasets, but when you’re dealing with petabytes, it’s no longer feasible.”
The tree of life contains 11 petabytes of sequencing data, 4 of them on the public eukaryote branch alone. Previously, even at an average speed of one gigabit per second, downloading that 4-petabyte dataset would take a whole year. The availability of this data in Google Cloud was a game changer, removing bottlenecks while fast-tracking data processing. The elasticity of cloud computing allowed for optimal parallelization of compute power, increasing the throughput. “In the past, downloading 10 terabytes of data could easily take a week,” says Andre. “In the Google Cloud, processing this data is a matter of hours. We’re more than 10 times faster.”
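The one-year download estimate can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 petabyte = 10^15 bytes, 1 gigabit = 10^9 bits) and a perfectly sustained link with no protocol overhead; `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_bytes: float, bits_per_second: float) -> float:
    """Days needed to move `size_bytes` over a link sustaining `bits_per_second`."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15            # decimal petabyte (assumption)
GIGABIT_PER_SECOND = 10**9   # decimal gigabit (assumption)

days = transfer_days(4 * PETABYTE, GIGABIT_PER_SECOND)
print(f"{days:.0f} days")
```

Under these assumptions the result is roughly 370 days, which matches the "whole year" figure quoted above.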
This significant increase in efficiency did not require an overhaul of established workflows. In the past, the group had to download data in small increments onto local disk and process it in memory, an approach no longer feasible in a petabyte-scale environment. By slightly adjusting the data journey, streaming the data into VM memory rather than downloading it first, the BMI Group made a smooth transition to the new setup. “Moving to Google Cloud was an effortless process,” says Daniel. “We didn’t really have to change our philosophy, just tweak our existing systems a little.”
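The difference between downloading and streaming can be sketched as follows. This is an illustration under stated assumptions, not the group's pipeline: an `io.BytesIO` object stands in for a remote object opened as a readable stream, and `count_bases_streaming` and the chunk size are hypothetical. The data is consumed chunk by chunk, so it is never written to disk and never held in memory all at once.

```python
import io
from collections import Counter
from typing import BinaryIO


def count_bases_streaming(stream: BinaryIO, chunk_size: int = 1 << 20) -> Counter:
    """Count A/C/G/T by reading fixed-size chunks from a readable stream."""
    counts: Counter = Counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # an empty read signals end of stream
            break
        for base in b"ACGT":   # iterating bytes yields integer byte values
            counts[chr(base)] += chunk.count(bytes([base]))
    return counts


# Stand-in for a remote object; a real stream would come from a storage client.
remote_object = io.BytesIO(b"ACGTACGGTTA" * 1000)
counts = count_bases_streaming(remote_object, chunk_size=64)
```

Because each chunk is processed and then discarded, peak memory depends on the chunk size rather than on the size of the dataset.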
At peak times, the team is using 4,000 CPUs and 15 terabytes of RAM to process all this data, adapting the customizable VM setup to current needs. In the past, the BMI team had to be constantly mindful of disk space, without the ability to expand it quickly upon reaching limits. “Sometimes, it’s hard to predict exactly how much space you need,” says Daniel. “With our on-prem setup, buying additional space is often a lengthy process. With Google Cloud, we can just increase the space on demand, making sure we’ll never run out.”
This new cloud flexibility extends far beyond disk space. “With local clusters, we were tied to a few default configurations,” says Daniel. “Compute Engine allows us to create custom machine types for each job. If I need four CPUs and 160 gigabytes of RAM, I can start using them instantly. This way, we can leverage our existing resources much more efficiently.”
Creating new research opportunities by optimizing costs
To maximize efficiency, the BMI team built a custom server infrastructure, with one central server node distributing worker jobs across the available instances. “It distributes the tasks out to the nodes that compute and remembers what's been done and what hasn’t, creating checkpoints,” says Daniel.
This checkpointing feature adds resilience to the group’s operations, minimizing the risk of losing progress due to technical failures or errors. “Sometimes, jobs die for no discernible reason,” says Daniel. “In that case, you lose all your progress. With our distributed cloud setup, we can rely on the ability to resume operations whenever they’re interrupted.”
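The idea of a central node that hands out work and remembers what has been done can be sketched in a few lines. The code below is a minimal single-process illustration, not the BMI Group's server: the `Coordinator` class, the JSON checkpoint file, and the task names are all assumptions made for the example. A production system would also need concurrent workers, atomic checkpoint writes, and handling for tasks that were handed out but never finished.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


class Coordinator:
    """Hands out task IDs and records finished ones in a JSON checkpoint file."""

    def __init__(self, task_ids: list[str], checkpoint: Path) -> None:
        self.checkpoint = checkpoint
        self.done: set[str] = set()
        if checkpoint.exists():
            # Resume: reload the set of tasks finished by an earlier run.
            self.done = set(json.loads(checkpoint.read_text()))
        self.pending = [t for t in task_ids if t not in self.done]

    def next_task(self) -> str | None:
        """Return the next unfinished task, or None when everything is done."""
        return self.pending[0] if self.pending else None

    def mark_done(self, task_id: str) -> None:
        """Record completion and persist the checkpoint immediately."""
        self.done.add(task_id)
        self.pending.remove(task_id)
        self.checkpoint.write_text(json.dumps(sorted(self.done)))


tasks = ["sample_1", "sample_2", "sample_3", "sample_4"]  # hypothetical task IDs
checkpoint_path = Path(tempfile.mkdtemp()) / "done.json"

# First run: simulate a process that dies after finishing two tasks.
first = Coordinator(tasks, checkpoint_path)
for _ in range(2):
    first.mark_done(first.next_task())

# Second run: a fresh coordinator picks up where the first one stopped.
second = Coordinator(tasks, checkpoint_path)
resumed_from = second.next_task()
while (task := second.next_task()) is not None:
    second.mark_done(task)
```

Persisting the checkpoint after every completed task means an interruption costs at most the task in flight, which is what makes interruptible compute nodes tolerable.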
Cost-effectiveness is crucial for publicly funded research. To lower the overall compute cost, the ETH team used Compute Engine Preemptible VMs, which are cheaper because the provider can reclaim any compute node for other duties at any time. “Thanks to checkpointing, we don’t mind if nodes are switched off or repurposed for various reasons,” says Daniel. “With Preemptible VMs, we’re more resilient at a cheaper price.” Using this strategy, the ETH team cut the overall compute cost by 75%.
In addition, the cost-effective dynamism of Google Cloud has expanded the scope of future projects for the BMI Group. “IT procurement in universities is often optimized for long research projects,” says Andre. “You’re locked into infrastructure for four to five years, lacking the flexibility needed in fast-paced projects. Google Cloud lets us readjust the setup to our needs, creating opportunities and preventing us from spending on infrastructure we can’t use optimally.”
Daniel adds: “Without moving our computation into the cloud, we simply wouldn’t have been able to process this incredible amount of data within a feasible time frame. We’re one petabyte in and can confidently say we’re going to complete our 4-petabyte metagraph project successfully.”
The index will be available for use both as an API and an easily accessible data structure in the cloud; its success could transform the field of bioinformatics, changing the way we engage with DNA. “Researchers will be able to query any samples they have collected, such as organisms or environmental samples, against our search graph of trillions of nodes, to find every piece of information that’s known about it already,” says Andre. “By making this data truly accessible for the first time, we’re using Google Cloud to advance science, streamline biological research, and, hopefully, spark many new areas of research.”