This dataset is provided by the Simons Genome Diversity Project (SGDP) and comprises 279 publicly available genomes from 127 diverse populations. See the following publications for full details:
- Pilot publication: The complete genome sequence of a Neanderthal from the Altai Mountains
- Full dataset publication: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations
Dataset access
Cloud Storage folders
The dataset files are available in the genomics-public-data Cloud Storage bucket, under gs://genomics-public-data/simons-genome-diversity-project.
BigQuery datasets
You can access the following datasets in BigQuery for data exploration and querying:
- Variants: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants
- Sample attributes: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes
- Sample metadata: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata
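The table names above use the legacy `project:dataset.table` form, while standard SQL queries expect `project.dataset.table` wrapped in backticks. The following is a minimal sketch of that conversion, followed by a simple row-count query against the sample attributes table. The helper name `to_standard_sql` is an illustrative assumption and not part of any Google tool. The final `bq query` call is left commented out because it needs the Cloud SDK and an authenticated project.

```shell
#!/bin/sh
# Hypothetical helper: turn a legacy reference (project:dataset.table)
# into the standard SQL form (project.dataset.table) by replacing the
# first colon with a dot.
to_standard_sql() {
  printf '%s\n' "$1" | sed 's/:/./'
}

attributes_legacy='bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes'
attributes_table="$(to_standard_sql "$attributes_legacy")"

# Backticks are required around the name because the project ID contains hyphens.
query="SELECT COUNT(*) AS row_count FROM \`${attributes_table}\`"
printf '%s\n' "$query"

# To run the query (requires the Cloud SDK and credentials):
# bq query --use_legacy_sql=false "$query"
```

The same conversion applies to the variants and sample metadata tables.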
About the dataset
Full dataset containing 279 genomes
The public VCF files from the SGDP README were extracted to the gs://genomics-public-data/simons-genome-diversity-project Cloud Storage folder.
The files were then imported into Cloud Life Sciences, and the variants were exported to the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants BigQuery table.
The sample metadata was loaded into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata BigQuery table by running the following commands:
    wget http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/10_24_2014_SGDP_metainformation_update.txt
    # Strip blank lines from end of file and white space from end of lines.
    sed ':a;/^[\t\r\n]*$/{$d;N;ba}' 10_24_2014_SGDP_metainformation_update.txt \
      | sed 's/\s*$//g' > 10_24_2014_SGDP_metainformation_update.tsv
    bq load --autodetect \
      simons_genome_diversity_project.sample_metadata 10_24_2014_SGDP_metainformation_update.tsv
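To see what the cleanup stage does before loading, the two `sed` passes can be tried on a small made-up file. This is a sketch that assumes GNU sed (the `\s` class and the one-line label syntax are GNU extensions) and writes the first pattern with an unescaped `*` so that it matches runs of whitespace. The function name `clean_tsv` and the sample rows are invented for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper around the two cleanup passes; reads stdin, writes stdout.
# Pass 1: drop blank lines at the end of the input.
# Pass 2: strip trailing whitespace from every remaining line.
clean_tsv() {
  sed ':a;/^[\t\r\n]*$/{$d;N;ba}' | sed 's/\s*$//g'
}

# Made-up sample: a header with trailing spaces, a row with a trailing tab,
# and two blank lines at the end of the file.
printf 'id\tname  \nS1\tFrench\t\n\n\n' \
  | clean_tsv \
  | sed -n 'l'   # print each line unambiguously, with "$" marking the line end
```

The two data lines should come out with no trailing tab or spaces, and the blank lines at the end should be gone.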
The sample metadata does not use the same sample identifiers as the VCFs, and it is also missing one row. The sample attributes were downloaded from http://www.ebi.ac.uk/ena/data/view/PRJEB9586 and reshaped into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes BigQuery table using the wrangle-simons-sample-attributes.R script. The script remaps three samples whose IDs in the source VCFs did not match the corresponding Illumina ID attribute on EBI.
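The remapping idea can be illustrated with a small lookup-table sketch. This is not the R script itself: the function name `remap_ids`, the two-column mapping file, and every ID shown are hypothetical placeholders, and the actual three sample IDs are not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: remap_ids MAP_FILE < attributes.tsv
# MAP_FILE holds two tab-separated columns: "old_id<TAB>new_id".
# Rows whose first column is not listed in MAP_FILE pass through unchanged.
remap_ids() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { map[$1] = $2; next }   # first input: load the mapping
    ($1 in map) { $1 = map[$1] }       # second input: rewrite matching IDs
    { print }
  ' "$1" -
}

# Made-up example: one ID needs remapping, the other is left alone.
map_file="$(mktemp)"
printf 'OLD_ID_A\tNEW_ID_A\n' > "$map_file"
printf 'OLD_ID_A\tpopulation1\nID_B\tpopulation2\n' | remap_ids "$map_file"
rm -f "$map_file"
```

Keeping the exceptions in a separate mapping file makes it easy to review exactly which identifiers were changed.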
Use: This dataset is publicly available for anyone to use under the terms provided by the dataset sources (https://www.hms.harvard.edu, https://www.simonsfoundation.org/simons-genome-diversity-project/) and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.