Genome Aggregation Database

The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects.

These public datasets are available in VCF format in Cloud Storage buckets and in BigQuery as integer range partitioned tables. Each dataset is sharded by chromosome, so variants are distributed across 24 tables (indicated by the "__chr*" suffix). Querying only the shard for the chromosome of interest, rather than scanning every table, reduces query costs significantly.
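The sharding scheme described above can be sketched in Python as follows. This is a minimal illustration, not an official client. The project and dataset names (`bigquery-public-data`, `gnomAD`), the release identifier `v2_1_1_exomes`, the chromosome labels (1 to 22, X, Y), and the `start_position` column are all assumptions made for the example; check the actual table names and schema in the BigQuery console before running a query.

```python
# Sketch: build the name of a per-chromosome shard and a query against it.
# All identifiers below are assumptions for illustration; verify them
# against the real dataset before use.

# Assumed shard labels: 22 autosomes plus X and Y, giving 24 tables.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def shard_table(release, chromosome,
                project="bigquery-public-data", dataset="gnomAD"):
    """Return the fully qualified table name for one chromosome shard."""
    chrom = str(chromosome)
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # accept both "chr1" and "1"
    if chrom not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    # The "__chr*" suffix selects a single shard.
    return f"{project}.{dataset}.{release}__chr{chrom}"


def count_variants_query(release, chromosome, start, end):
    """Build SQL that counts variants in a position range on one shard.

    Filtering on an integer position column (assumed here to be
    `start_position`) lets BigQuery prune range partitions.
    """
    table = shard_table(release, chromosome)
    return (
        "SELECT COUNT(*) AS variant_count\n"
        f"FROM `{table}`\n"
        f"WHERE start_position BETWEEN {int(start)} AND {int(end)}"
    )


if __name__ == "__main__":
    print(count_variants_query("v2_1_1_exomes", "chr17", 43044295, 43125483))
```

The generated SQL can be pasted into the BigQuery console or passed to a BigQuery client library. Because the query names a single shard and restricts the position range, it scans only a fraction of the data that a query over all 24 tables would.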

Variant Transforms was used to process these VCF files and import them into BigQuery. Using the annotation support in Variant Transforms, VEP annotations were parsed into separate columns for easier analysis.

Dataset access

Cloud Storage folders

The following files are available in the gcp-public-data--gnomad Cloud Storage bucket:

BigQuery datasets

The following gnomAD releases are available in BigQuery for data exploration and querying:

  • Release 2.1.1 exomes
  • Release 2.1.1 genomes
  • Release 3.0 genomes

The dataset is also available in the following regions:

About the dataset

The v2 dataset (GRCh37/hg19) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. The v3 dataset (GRCh38) spans 71,702 genomes, selected as in v2.

More information about the BigQuery dataset, along with sample queries, is available in the Google Cloud Marketplace.

Dataset source:

Use: See the Broad Institute's site for full terms of use for the dataset. The data are provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the datasets.