Single-cell genomic analysis accelerated by NVIDIA on Google Cloud
Randi Cowin, Ph.D.
Customer Engineering Lead Genomics Specialist, Google Cloud
Corey J. Nolet
Data Scientist & Senior Software Engineer, NVIDIA RAPIDS AI
Try Google Cloud
Start building on Google Cloud with $300 in free credits and 20+ always free products.Free trial
In the past decade, the Healthcare and Life Sciences industry has enjoyed a boon in technological and scientific advancement. New insights and possibilities are revealed almost daily. At Google Cloud, driving innovation in cloud computing is in our DNA. Our team is dedicated to sharing ways Google Cloud can be used to accelerate scientific discovery. For example, the recent announcement of AlphaFold2 showcases a scientific breakthrough, powered by Google Cloud, that will promote a quantum leap in the field of proteomics. In this blog, we’ll review another omics use case, single-cell analysis, and how Google Cloud’s Dataproc and NVIDIA GPUs can help accelerate that analysis.
The Need for Performance in Scientific Analysis
The ability to understand the causal relationship between genotypes and phenotypes is one of the long-standing challenges in biology and medicine. Understanding and drawing insights from the complexity of biological systems abounds from the actual code of life (DNA) through to expression of genes (RNA) to translation of gene transcripts into proteins that function in different pathways, cells, and tissues within an organism. Even the smallest of changes in our DNA can have large impacts on protein expression, structure, and function, which ultimately drives development and response - at both cellular and organism levels. And, as the omics space becomes increasingly data- and compute-intensive, research requires an adequate informatics infrastructure. An infrastructure that scales with growing data demands, enables a diverse range of resource-intensive computational activities, and is affordable and efficient - reducing data bottlenecks and enabling researchers to maximize insight.
But where do all these data and compute challenges come from and what makes scientific study so arduous? The layers of biological complexity begin to be made apparent immediately when looking at not just the genes themselves, but their expression. Although all the cells in our body share nearly identical genotypes, our many diverse cell types (e.g. hepatocytes versus melanocytes) express a unique subset of genes necessary for specific functions, making transcriptomics a more powerful method of analysis by allowing researchers to map gene expression to observable traits. Studies have shown that gene expression is heterogeneous, even in similar cell types. Yet, conventional sequencing methods require DNA or RNA extracted from a cell population. The development of single-cell sequencing was pivotal to the omics field. Single-cell RNA sequencing has been critical in allowing scientists to study transcriptomes across large numbers of individual cells.
Despite its potential, and the increasing availability of single-cell sequencing technology, there are several obstacles: an ever increasing volume of high-dimensionality data, the need to integrate data across different types of measurements (e.g. genetic variants, transcript and protein expression, epigenetics) and across samples or conditions, as well as varying levels of resolution and the granularity needed to map specific cell types or states. These challenges present themselves in a number of ways including background noise, signal dropouts requiring imputation, and limited bioinformatics pipelines that lack statistical flexibility. These and other challenges result in analysis workflows that are very slow, prohibiting the iterative, visual, and interactive analysis required to detect differential gene activity.
Cloud computing can help not only with data challenges, but with some of the biggest obstacles: scalability, performance, and automation of analysis. To address several of the data and infrastructure challenges facing single-cell analysis, NVIDIA developed end-to-end accelerated single-cell RNA sequencing workflows that can be paired with Google Cloud Dataproc, a fully-managed service for running open source frameworks like Spark, Hadoop, and RAPIDS. The Jupyter notebooks that power these workflows include examples using samples like human lung cells and mouse brains cells and demonstrate acceleration between CPU-based processing compared to GPU-based workflows.
Google Cloud Dataproc powers the NVIDIA GPU-based approach and demonstrates data processing capabilities and acceleration, which in turn have the potential of delivering considerable performance gains. When paired with RAPIDS, practitioners can accelerate data science pipelines on NVIDIA GPUs, reducing operations like data loading, processing, and training from hours to seconds. RAPIDS abstracts the complexities of accelerated data science by building upon popular Python and Java libraries effortlessly. When applying RAPIDS and NVIDIA accelerated compute to single-cell genomics use cases, practitioners can churn through analysis of a million cells in only a few minutes.
Give it a Try
The journey to realizing the full potential of omics is long; but through collaboration with industry experts, customers, and partners like NVIDIA, Google Cloud is here to help shine a light on the road ahead. To learn more about the notebook provided for single-cell genomic analysis, please take a look at NVIDIA’s walkthrough. To give this pattern a try on Dataproc, please visit our technical reference guide.