The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects.
These public datasets are available in VCF format in Cloud Storage buckets and in BigQuery as integer range partitioned tables. Each dataset is sharded by chromosome, meaning variants are distributed across 24 tables (indicated by the "__chr*" suffix). Querying only the shard for the chromosome of interest, rather than the whole dataset, reduces query costs significantly.
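As a rough sketch of what querying a single shard might look like, the Python below builds the table name for one chromosome and a region query against it. The dataset path `bigquery-public-data.gnomAD`, the release prefixes such as `v3_genomes`, and the column names `reference_name`, `start_position`, and `reference_bases` are assumptions based on the usual Variant Transforms schema, not taken from this page; check them in the BigQuery console before relying on them. The helper names are illustrative. The optional dry run uses the `google-cloud-bigquery` client to report how many bytes a query would scan without running it.

```python
# Assumed naming scheme: <project>.<dataset>.<release>__chr<chromosome>
PROJECT = "bigquery-public-data"  # assumption, not stated on this page
DATASET = "gnomAD"                # assumption, not stated on this page
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]  # 24 shards


def shard_table(release, chromosome):
    """Return the fully qualified table name for one chromosome shard."""
    chrom = str(chromosome)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    chrom = chrom.upper()
    if chrom not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    return f"{PROJECT}.{DATASET}.{release}__chr{chrom}"


def region_query(release, chromosome, start, end):
    """Build a query restricted to one shard and one position range."""
    start, end = int(start), int(end)
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return (
        "SELECT reference_name, start_position, reference_bases\n"
        f"FROM `{shard_table(release, chromosome)}`\n"
        f"WHERE start_position BETWEEN {start} AND {end}"
    )


def estimate_bytes(sql):
    """Dry-run the query and return the bytes it would scan.

    Needs the google-cloud-bigquery package and valid credentials.
    """
    from google.cloud import bigquery  # imported lazily: optional dependency

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed


if __name__ == "__main__":
    # Illustrative coordinates only.
    print(region_query("v3_genomes", "chr17", 43044295, 43125483))
```

Filtering on the position column is a deliberate choice here: if the tables are range partitioned on that column, as assumed, a bounded filter lets BigQuery skip partitions outside the range, which a dry run with `estimate_bytes` can confirm.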
Variant Transforms was used to process these VCF files and import them into BigQuery. Its annotation support was used to parse the VEP annotations into separate columns for easier analysis.
Cloud Storage folders
The following files are available in the Cloud Storage bucket:
- Full gnomAD data: gs://gcp-public-data--gnomad
- Release 2.1.1 exomes and genomes: gs://gcp-public-data--gnomad/release/2.1.1
- Release 3.0 genomes: gs://gcp-public-data--gnomad/release/3.0
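The release folders above can also be listed programmatically. The sketch below splits a `gs://` path into bucket and prefix, then lists object names under it with the `google-cloud-storage` client. It assumes the bucket allows anonymous reads, which is typical of public datasets but not stated here, and the helper names are illustrative.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, prefix


def list_release(uri, limit=20):
    """Return up to `limit` object names under a gs:// folder.

    Needs the google-cloud-storage package and network access.
    """
    from google.cloud import storage  # imported lazily: optional dependency

    bucket, prefix = split_gs_uri(uri)
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # treat the path as a folder
    # Assumption: the public bucket can be read without credentials.
    client = storage.Client.create_anonymous_client()
    blobs = client.list_blobs(bucket, prefix=prefix or None, max_results=limit)
    return [blob.name for blob in blobs]
```

For example, `list_release("gs://gcp-public-data--gnomad/release/3.0")` would return the first few object names under the release 3.0 folder, provided anonymous access works as assumed.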
You can access the gnomAD dataset in BigQuery for data exploration and querying of the following:
- Release 2.1.1 exomes
- Release 2.1.1 genomes
- Release 3.0 genomes
The dataset is also available in the following regions:
About the dataset
The v2 dataset (GRCh37/hg19) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. The v3 dataset (GRCh38) spans 71,702 genomes, selected as in v2.
More information about the BigQuery dataset, along with sample queries, is available in the Google Cloud Marketplace.
- gnomAD is hosted by the Broad Institute on its gnomAD site