
Scaling reaction-based enumeration for next-gen drug discovery using Google Cloud

May 25, 2023

Shivam Patel

Head of Data Science and Machine Learning, Psivant

Vincent Beltrani

Enterprise Customer Engineer, Google Cloud

Discovering new drugs is at the heart of modern medicine, yet finding a “needle in the haystack” is immensely challenging due to the enormous number of possible drug-like compounds (estimated at 10^60 or more). To increase our chances of finding breakthrough medicines for patients with unmet medical needs, we need to explore the vast universe of chemical compounds and use predictive in silico methods to select the best compounds for lab-based experiments. Enter reaction-based enumeration, a powerful technique that generates novel, synthetically accessible molecules. Our team at Psivant has been pushing the boundaries of this process to an unprecedented scale, implementing reaction-based enumeration on Google Cloud. By tapping into Google Cloud’s robust infrastructure and scalability, we're unlocking the potential of this technique to uncover new chemical entities, leading to groundbreaking advancements and life-altering therapeutics.

Our journey began with a Python-based prototype, leveraging RDKit for chemistry and Ray for distributed computing. Despite initial progress, we encountered a roadblock: our on-premises computing resources were limited, holding back our prototype's potential. While we could explore millions of compounds, our ambition was to explore billions and beyond. To address this limitation, we sought a solution that offered greater flexibility and scalability, leading us to the powerful ecosystem provided by Google Cloud.
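To illustrate why moving from millions to billions of compounds changes the engineering problem, here is a minimal sketch of how a combinatorial library grows and how it could be split into independent batches for distributed workers. It is not Psivant's implementation: the function names, building-block counts, and batch size are hypothetical, and the actual chemistry (which the prototype handled with RDKit) is left out entirely.

```python
from math import prod


def library_size(block_counts):
    """Candidate products for one reaction: the product of the
    building-block counts available for each reactant slot."""
    return prod(block_counts)


def batch_ranges(total, batch_size):
    """Yield (start, stop) index ranges that together cover range(total)."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def decode(index, block_counts):
    """Map a flat product index back to one building-block index per
    reactant slot (mixed-radix decoding, last slot varies fastest)."""
    picks = []
    for count in reversed(block_counts):
        index, pick = divmod(index, count)
        picks.append(pick)
    return tuple(reversed(picks))


# Hypothetical two-component reaction: 2,000 acids x 5,000 amines.
counts = (2_000, 5_000)
print(f"{library_size(counts):,}")  # 10,000,000

# A third reactant slot multiplies the space again.
print(f"{library_size((100_000, 100_000, 10)):,}")  # 100,000,000,000
```

Because every flat index decodes to a unique combination of building blocks, each batch range can be handed to a separate worker with no coordination between workers, which is the property that makes this kind of workload straightforward to distribute.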

Leveraging Google Cloud infrastructure

Google Cloud's technologies allowed us to supercharge our pipelines and conduct chemical compound exploration at scale. By integrating Dataflow, Google Workflows, and Compute Engine, we built a sophisticated, high-performance system that is both flexible and resilient.

Dataflow is a managed batch and streaming system that provides real-time, fault-tolerant, and parallel processing capabilities to manage and manipulate massive datasets effectively. Google Workflows orchestrates the complex, multi-stage processes involved in enumeration, ensuring smooth transitions and error handling across various tasks. Finally, Compute Engine provides us with scalable, customizable infrastructure to run our demanding computational workloads, ensuring optimal performance and cost-effectiveness. Together, these technologies laid the foundation for our cutting-edge solution to explore the endless possibilities of reaction-based enumeration.

We built a cloud-native solution to achieve the scalability we sought, taking advantage of Dataflow. Dataflow runs pipelines written with Apache Beam, a versatile programming model with its own data structures, such as the PCollection, a distributed, immutable collection of elements that Beam processes in parallel across workers.
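As a rough illustration of the Beam programming model, the sketch below builds a small enumeration-style pipeline in which each step consumes one PCollection and produces another. It is an assumption-laden toy, not the pipeline described in this post: the helper names, step labels, and placeholder identifiers such as "acid_0" are hypothetical, and products are represented as joined strings rather than molecules produced by a chemistry toolkit.

```python
def pair_with_amines(acid, amines):
    """Emit one placeholder product ID per (acid, amine) pair.
    A real pipeline would run a reaction here instead of joining strings."""
    for amine in amines:
        yield f"{acid}.{amine}"


def is_valid(product_id):
    """Placeholder filter standing in for chemistry-based checks."""
    return bool(product_id)


def run(acids, amines, output_prefix):
    """Build and run the pipeline; requires the apache-beam package."""
    # Imported lazily so the helpers above can be used and tested without Beam.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateAcids" >> beam.Create(acids)          # PCollection of acids
            | "Enumerate" >> beam.FlatMap(pair_with_amines, amines)
            | "Filter" >> beam.Filter(is_valid)
            | "Deduplicate" >> beam.Distinct()
            | "Write" >> beam.io.WriteToText(output_prefix)
        )


# Example call (hypothetical inputs and output path):
# run(["acid_0", "acid_1"], ["amine_0", "amine_1"], "/tmp/products")
```

With no extra options, Beam executes this on its local runner. The same pipeline code can target Dataflow by supplying pipeline options that select the Dataflow runner, which is what lets a small local prototype scale out without being rewritten.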

Enter Dataflow

Balancing performance and cost-efficiency was crucial during pipeline development. That is where Dataflow came in: it let us optimize resource utilization and control costs without compromising performance.

Building our pipeline required a deep understanding of both the chemistry libraries and the Google Cloud ecosystem. We started with a simple, highly distributed enumeration pipeline, then added various chemistry operations while ensuring scalability and performance at every step. Google Cloud's team played a pivotal role in our success, providing expert guidance and troubleshooting support.

To 100 billion and beyond

Our journey implementing reaction-based enumeration at scale on Google Cloud has been a testament to collaboration, relentless innovation, and the pursuit of excellence. With smart cloud-native engineering and cutting-edge technologies, our workflow scales rapidly: it can deploy thousands of workers within minutes, enabling us to explore an astounding 100 billion compounds in under a day. Looking ahead, we're excited to integrate Vertex AI into our workflow as our go-to MLOps solution, and to supercharge our high-throughput virtual screening experiments with the robust capabilities of Batch, further enhancing our capacity to innovate.
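To put the 100 billion figure in perspective, the back-of-the-envelope calculation below converts it into a sustained throughput. Since the run finishes in under a day, dividing by a full 24 hours gives a lower bound on the overall rate. The worker counts in the loop are hypothetical values chosen only to show the per-worker arithmetic; the post says "thousands of workers" without giving an exact number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_rate(compounds, seconds):
    """Average compounds per second needed to finish within `seconds`."""
    return compounds / seconds


def per_worker_rate(compounds, seconds, workers):
    """Average compounds per second each worker must sustain."""
    return required_rate(compounds, seconds) / workers


total = 100_000_000_000  # 100 billion compounds, from the text above

overall = required_rate(total, SECONDS_PER_DAY)
print(f"{overall:,.0f} compounds/s overall")  # 1,157,407 compounds/s overall

for workers in (1_000, 5_000):  # hypothetical worker counts
    rate = per_worker_rate(total, SECONDS_PER_DAY, workers)
    print(f"{workers:,} workers -> {rate:,.0f} compounds/s per worker")
```

Seen this way, the target is more than a million compounds per second across the fleet, which no single machine could plausibly sustain but which becomes a modest per-worker rate once the work is spread over thousands of workers.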

We'd like to extend our heartfelt thanks to Javier Tordable for his guidance in distributed computing, enriching our understanding of building a massively scalable pipeline.

As we persistently push the boundaries of computational chemistry and drug discovery, we are continuously motivated by the immense potential of reaction-based enumeration. This potential is driven by the powerful and flexible infrastructure of Google Cloud, combined with the comprehensive capabilities of Psivant's QUAISAR platform. Together, they empower us to design the next generation of groundbreaking medicines to combat the most challenging diseases.
