Cloud Life Sciences is deprecated and will no longer be available on Google Cloud after July 8, 2025. Use cases for Cloud Life Sciences are now supported by Batch. To learn how to migrate your workload, see Migrate to Batch.

Simons Genome Diversity Project

This dataset is provided by the Simons Genome Diversity Project (SGDP) and comprises 279 publicly available genomes from 127 diverse populations. See the following publications for full details:

Pilot publication: The complete genome sequence of a Neanderthal from the Altai Mountains
Full dataset publication: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Dataset access

Cloud Storage folders

The following files are available in the genomics-public-data Cloud Storage bucket:

gs://genomics-public-data/simons-genome-diversity-project

BigQuery datasets

You can access the following datasets in BigQuery for data exploration and querying:

Variants: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants
Sample attributes: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes
Sample metadata: bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata
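
The table names above use the project:dataset.table form, while GoogleSQL queries reference tables as backticked project.dataset.table. As a minimal sketch, assuming the bq command-line tool is installed and authenticated, the helpers below convert a reference and count the rows of a table. The names to_standard_ref and count_rows are hypothetical and are not part of any tool.

```shell
# Hypothetical helper: turn project:dataset.table into `project.dataset.table`.
to_standard_ref() {
  # Only the first colon (the project separator) is replaced.
  printf '`%s`\n' "$(printf '%s' "$1" | sed 's/:/./')"
}

# Hypothetical helper: count the rows of a table given in project:dataset.table form.
# Assumes bq is on PATH and a billing project is configured.
count_rows() {
  bq query --use_legacy_sql=false \
    "SELECT COUNT(*) AS row_count FROM $(to_standard_ref "$1")"
}

# Usage (not run here, because it needs network access and credentials):
# count_rows bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata
```

Counting rows is only a starting point. Inspect the table schema in BigQuery before writing queries against specific columns.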

About the dataset

Full dataset containing 279 genomes

The public VCF files from the SGDP README were extracted to the gs://genomics-public-data/simons-genome-diversity-project Cloud Storage folder.

The files were then imported into Cloud Life Sciences, and the variants were exported to the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants BigQuery table.

The sample metadata was loaded into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata BigQuery table by running the following commands:

wget http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/10_24_2014_SGDP_metainformation_update.txt
# Strip blank lines from end of file and white space from end of lines.
sed ':a;/^[\t\r\n]*$/{$d;N;ba}' 10_24_2014_SGDP_metainformation_update.txt \
    | sed 's/\s*$//g' > 10_24_2014_SGDP_metainformation_update.tsv
bq load --autodetect \
    simons_genome_diversity_project.sample_metadata 10_24_2014_SGDP_metainformation_update.tsv
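
To see what the two sed passes in the commands above do, the cleanup can be tried on a small sample before running it on the real file. The following is a minimal sketch, not the exact command used to build the table. The function name strip_trailing_blanks is hypothetical, the POSIX class [[:space:]] stands in for the \t\r\n and \s escapes, and the first script is split across -e options so that the closing brace sits on its own line.

```shell
# Hypothetical helper: reads text on stdin, writes the cleaned text to stdout.
# Pass 1 collects runs of whitespace-only lines and deletes a run only when it
# reaches the end of the input, so interior blank lines are kept.
# Pass 2 removes white space from the end of every remaining line.
strip_trailing_blanks() {
  sed -e ':a' -e '/^[[:space:]]*$/{$d;N;ba' -e '}' |
    sed 's/[[:space:]]*$//'
}

# Sample input: two tab-separated rows with trailing white space, one interior
# blank line, and two whitespace-only lines at the end of the file.
printf 'a\tb  \n\nc\td\t\n\n  \n' | strip_trailing_blanks
```

On this sample, the interior blank line survives, while the trailing blank lines and the white space at the end of each row are removed.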

The sample metadata does not use the same sample identifiers as the VCFs, and it is also missing one row. Sample attributes were downloaded from http://www.ebi.ac.uk/ena/data/view/PRJEB9586 and reshaped into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes BigQuery table using the wrangle-simons-sample-attributes.R script. The script remaps three samples whose IDs in the source VCFs did not match the corresponding Illumina ID attribute on EBI.

Use: This dataset is publicly available for anyone to use under the terms provided by the dataset sources (https://www.hms.harvard.edu, https://www.simonsfoundation.org/simons-genome-diversity-project/) and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.