Stanford center makes comprehensive COVID dataset available on Google Cloud
Brent Mitchell
Vice President, Google Public Sector
As an interdisciplinary research center, Stanford’s Center for Population Health Sciences (PHS) aims to improve the health of populations by bringing together researchers and data to understand and address social, environmental, behavioral, and biological factors on both a domestic and global scale. This entails making large-scale biomedical datasets available for research and analysis while keeping personal health information and electronic health records private and secure. Recently, PHS collaborated with the Centers for Disease Control and Prevention (CDC) to de-identify, standardize, and manage access and permissions for the American Family Cohort (AFC) medical records, which represent over 6.6 million patients from over 800 primary care practices across 47 states. This comprehensive, longitudinal dataset can provide a unique window into the impact of the COVID-19 pandemic throughout the U.S. With the AFC dataset now hosted through PHS on Google Cloud, researchers can analyze COVID-19 disease patterns, progression, and health outcomes; evaluate COVID-19 clinical guidelines uptake, treatments, and interventions; and conduct public health surveillance for COVID-19 and related conditions.
Analyzing high-value, high-risk data at scale
Based on the American Board of Family Medicine’s extensive clinical records since the pandemic began, the AFC dataset comprises three terabytes of medical data, ranging from lab values, medications, procedures, diagnoses, insurance type, vital signs, and social history to about one billion notes by clinicians. It is particularly valuable because of its breadth: it represents populations that are underserved and often missing from other data sources, including rural, low-income, and racial and ethnic minority populations. It includes patients on Medicare and Medicaid as well as private insurance plans, making it a more representative sampling of the overall U.S. population.
But the challenges of managing data at this scale are daunting. “Because the datasets we work with are both large and high risk, we needed flexible, scalable, and customizable computational resources for our users,” says David Rehkopf, Director of PHS and Associate Professor in the Department of Epidemiology and Population Health and Department of Medicine at Stanford. The tools also need to be accessible to epidemiologists without a data science background.
Accelerating workflows from four days to 30 seconds
By managing the AFC data on Google Cloud, PHS makes them secure and easy to analyze with cutting-edge AI and machine learning tools. “Features which are standard in Google would be prohibitively expensive to develop in a bespoke fashion for research use,” says Rehkopf. “With Natural Language Processing, we can start to examine those clinical notes for signs of long COVID before there were even any diagnostic codes for it. With BigQuery, we can cross-reference demographics to look for risk factors we wouldn’t see otherwise.” Rehkopf reports that the preliminary results are promising: in fact, long COVID may not be as prevalent as other studies have predicted. The team also noticed that workloads that took four days to run on servers now run in about 30 seconds on Google Cloud.
PHS was an early adopter of Google Cloud at Stanford. For the past eight years, the center has managed more than 74 datasets on its Secure Data Ecosystem, which was built on Google Cloud for its affordability, scalability, and stability. Rehkopf says that “the culture is an excellent fit with research and science in the public interest and the continual improvements are invaluable. It’s very difficult to replicate the quality and quantity of compute, and especially the stability, offered by Google. During the COVID-19 pandemic, many on-premises systems were overwhelmed by an influx of users, but Google systems remained stable.”
The AFC project is just one example of how PHS uses cloud technology to accelerate biomedical research and develop evidence-based health policies. Rehkopf says that “as we move into machine learning, natural language processing, and transforming our data to synthetic data, we rely on the power and scalability of commercial cloud.” With secure access to real-world data, researchers can address complex community health issues and improve patient outcomes.
If you’re a researcher interested in exploring the benefits of the cloud for your projects, apply here for access to the Google Cloud research credits program in eligible countries. To find out how you can get started with gen AI for higher education, sign up for an interactive half-day workshop with Google Cloud and partners Nuvalence and Carahsoft. Participants will work with experts in small groups to design a gen AI strategy package customized for their needs.