Simons Genome Diversity Project

This dataset is provided by the Simons Genome Diversity Project (SGDP) and comprises 279 publicly available genomes from 127 diverse populations. See the following publications for full details:

Dataset access

Cloud Storage folders

The following files are available in the genomics-public-data Cloud Storage bucket:

BigQuery datasets

You can access the following datasets in BigQuery for data exploration and querying:

About the dataset

Full dataset containing 279 genomes

The public VCF files listed in the SGDP README were extracted to the gs://genomics-public-data/simons-genome-diversity-project Cloud Storage folder.

The files were then imported into Cloud Life Sciences, and the variants were exported to the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants BigQuery table.

The sample metadata was loaded into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata BigQuery table by running the following commands:

# Strip blank lines from end of file and white space from end of lines.
sed ':a;/^[\t\r\n]*$/{$d;N;ba}' 10_24_2014_SGDP_metainformation_update.txt \
    | sed 's/\s*$//g' > 10_24_2014_SGDP_metainformation_update.tsv
bq load --autodetect \
    simons_genome_diversity_project.sample_metadata 10_24_2014_SGDP_metainformation_update.tsv
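
To see what the two sed passes above do before loading anything, the same pipeline can be run on a small throwaway file. This is only a sketch: it assumes GNU sed (the `\t`, `\r`, `\n`, and `\s` escapes are GNU extensions), and the file names and sample rows are hypothetical, not part of the SGDP metadata.

```shell
# Work in a temporary directory so nothing is left behind in the current one.
workdir=$(mktemp -d)

# Hypothetical input: trailing spaces and a trailing tab on the data lines,
# followed by three blank or tab-only lines at the end of the file.
printf 'id\tpop  \nS1\tFrench\t\n\n\t\n\n' > "$workdir/sample_meta.txt"

# Pass 1 drops the run of blank lines at the end of the file.
# Pass 2 strips trailing whitespace from every remaining line.
sed ':a;/^[\t\r\n]*$/{$d;N;ba}' "$workdir/sample_meta.txt" \
    | sed 's/\s*$//g' > "$workdir/sample_meta.tsv"

# Show the cleaned line count; the two data lines should be all that remain.
wc -l < "$workdir/sample_meta.tsv"
```

Blank lines in the middle of a file are kept by the first pass; only the run at the very end is deleted, which is what keeps `bq load --autodetect` from seeing empty trailing rows.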

The sample metadata does not use the same sample identifiers as the VCFs, and it is also missing one row. Sample attributes were downloaded from EBI and reshaped into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes BigQuery table using the wrangle-simons-sample-attributes.R script. The script remaps three samples whose IDs in the source VCFs did not match the corresponding Illumina ID attribute on EBI.
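
As a quick check that the resulting table is reachable, a row count can be requested from the bq command-line tool. The sketch below assumes the Google Cloud SDK is installed and authenticated, and it rewrites the table ID from the `project:dataset.table` form used above to the `project.dataset.table` form that standard SQL expects. The variable names are illustrative only, and no column names are assumed.

```shell
# Table ID from the text above, in standard SQL form (dots instead of a colon).
TABLE='bigquery-public-data.human_genome_variants.simons_genome_diversity_project_sample_attributes'

# COUNT(*) avoids relying on any particular column being present.
QUERY="SELECT COUNT(*) AS n_rows FROM \`${TABLE}\`"
echo "$QUERY"

# Run the query only when the bq CLI is available on this machine.
if command -v bq >/dev/null 2>&1; then
    bq query --use_legacy_sql=false "$QUERY" \
        || echo "query failed (check authentication and project settings)"
else
    echo "bq not found; query not run"
fi
```

The same pattern works for the sample_metadata and sample_variants tables by changing the value of `TABLE`.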

