Providing open access to the Genome Aggregation Database (gnomAD) on Google Cloud
Johanna Katz
Program Manager, Healthcare & Life Sciences, Google Cloud
Grace Tiao
Associate Director, Computational Genomics, Broad Institute
Today, we are excited to announce a collaboration between Google Cloud Healthcare & Life Sciences and the Broad Institute of MIT and Harvard to provide free access to one of the world's most comprehensive public genomic datasets, the Genome Aggregation Database (gnomAD).
gnomAD brings together data from numerous large-scale sequencing projects, including population and disease-specific genetic studies. With more than 241 million unique short human genetic variants and 335,000 structural variants observed in more than 141,000 healthy adult individuals across a diverse range of genetic ancestry groups, this dataset is a near-ubiquitous resource for human genetics research and clinical variant interpretation. It is used in clinical genetic diagnostic pipelines worldwide.
gnomAD data is hosted in several formats to address a broad range of biomedical and healthcare use cases. It is available as Hail-formatted tables and Variant Call Format (VCF) files in Google Cloud Storage, and in BigQuery as part of the Public Datasets Program. Users receive 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Google Cloud users can securely access the data in any of these formats from their bioinformatics pipelines in any Google Cloud region without paying egress charges.
To make gnomAD available in BigQuery, the Google Cloud team used Variant Transforms to ingest the VCF files. The ingested variants were then sharded so that the output tables are split by chromosome. In addition, we used integer range partitioning and clustering to reduce the cost of queries. This work enables researchers to explore gnomAD quickly and efficiently, without needing to request or pay for dedicated cloud compute resources. Queries that target a smaller genomic region are expected to cost significantly less than queries over the whole dataset. This application of Variant Transforms has been leveraged by partners and customers like the Mayo Clinic and Color Genomics to accelerate their genomics research. More information on using gnomAD in BigQuery is available in this tutorial.
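As a rough illustration of a region-targeted query against one per-chromosome table, the Python sketch below only builds the SQL text and does not contact BigQuery. The table name and the column names (reference_name, start_position, end_position, reference_bases) are assumptions based on typical Variant Transforms output, not details confirmed in this post, and so is the idea that the tables are partitioned on start_position. Check the tutorial for the actual schema before running anything.

```python
import re


def build_region_query(table: str, start: int, end: int, limit: int = 100) -> str:
    """Build SQL selecting variants whose start position lies in [start, end).

    Column names are assumed, not taken from the gnomAD schema documentation.
    """
    # Reject anything that does not look like a plain project.dataset.table name.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", table):
        raise ValueError("unexpected characters in table name")
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Filtering on the (assumed) partitioning column is what allows BigQuery
    # to skip partitions outside the requested region.
    return (
        "SELECT reference_name, start_position, end_position, reference_bases\n"
        f"FROM `{table}`\n"
        f"WHERE start_position >= {start} AND start_position < {end}\n"
        "ORDER BY start_position\n"
        f"LIMIT {limit}"
    )


# Hypothetical table name, used for illustration only.
EXAMPLE_TABLE = "bigquery-public-data.gnomAD.v3_genomes__chr17"

if __name__ == "__main__":
    print(build_region_query(EXAMPLE_TABLE, 43_000_000, 43_200_000))
```

The resulting string can be pasted into the BigQuery console or passed to a BigQuery client library. A dry run first is a reasonable way to see how many bytes the query would process.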
The data in the Google Cloud Storage bucket also includes standard truth sets used to assess and validate variant calls, data from the Broad Institute’s papers in Nature, interval lists, and other annotation resources.
To access gnomAD on Google Cloud, explore the documentation here. Files can also be browsed and downloaded using the Cloud Console or the command-line tool gsutil. After installing gsutil, start browsing with:
$ gsutil ls gs://gcp-public-data--gnomad
Explore additional Healthcare and Life Sciences dataset offerings on Google Cloud here.