Cloud Life Sciences is deprecated and will no longer be available on Google Cloud after July 8, 2025. Batch now supports Cloud Life Sciences use cases. To learn how to migrate your workloads, see Migrate to Batch.

Simons Genome Diversity Project

This dataset, provided by the Simons Genome Diversity Project (SGDP), contains 279 publicly available genomes from 127 distinct populations. For details, see the following publications:

Pilot publication: The complete genome sequence of a Neanderthal from the Altai Mountains
Full dataset publication: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Accessing the dataset

Cloud Storage folder

The files are located in the following folder of the genomics-public-data Cloud Storage bucket:

gs://genomics-public-data/simons-genome-diversity-project
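
As a minimal sketch, the folder can be listed with `gsutil`, assuming the Google Cloud CLI is installed and the bucket is still publicly readable. The function name `list_sgdp_files` is a hypothetical helper introduced here for illustration; only the folder path comes from this page.

```shell
# Folder path taken from the text above.
SGDP_FOLDER="gs://genomics-public-data/simons-genome-diversity-project"

# Hypothetical helper: list the objects and subfolders under the SGDP folder.
# Requires gsutil (part of the Google Cloud CLI) and network access.
list_sgdp_files() {
    gsutil ls "${SGDP_FOLDER}/"
}
```

Calling `list_sgdp_files` from the same shell session runs the listing; nothing is contacted until the function is invoked.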

BigQuery datasets

You can access the following datasets in BigQuery for data exploration and querying:

About the dataset

Full dataset of 279 genomes

The public VCF files listed in the SGDP README were extracted to the gs://genomics-public-data/simons-genome-diversity-project Cloud Storage folder.

These files were then imported into Cloud Life Sciences, and the variants were exported to the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants BigQuery table.
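
The table IDs on this page use the legacy `project:dataset.table` form, while GoogleSQL queries expect `project.dataset.table` wrapped in backticks. The sketch below converts one form to the other and builds a row-count query. The helper `to_googlesql_ref` and the alias `row_count` are names introduced here for illustration, and no column names are assumed because this page does not describe the table schema.

```shell
# Table ID exactly as written in the text (legacy project:dataset.table form).
LEGACY_TABLE="bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_variants"

# Hypothetical helper: swap the first colon for a dot to get the GoogleSQL form.
to_googlesql_ref() {
    printf '%s\n' "$1" | sed 's/:/./'
}

SQL_TABLE="$(to_googlesql_ref "${LEGACY_TABLE}")"

# Backticks are required because the project ID contains hyphens.
QUERY="SELECT COUNT(*) AS row_count FROM \`${SQL_TABLE}\`"
echo "${QUERY}"

# To inspect or query the table (requires the bq CLI, credentials, and a
# billing project), something like the following should work:
#   bq show --schema --format=prettyjson "${LEGACY_TABLE}"
#   bq query --use_legacy_sql=false "${QUERY}"
```

Inspecting the schema with `bq show` first is the safer route, since it reveals the actual column names before any query is written against them.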

The sample metadata was loaded into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_metadata BigQuery table by running the following commands:

wget http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/10_24_2014_SGDP_metainformation_update.txt
# Strip blank lines from end of file and white space from end of lines.
sed ':a;/^[\t\r\n]*$/{$d;N;ba}' 10_24_2014_SGDP_metainformation_update.txt \
    | sed 's/\s*$//g' > 10_24_2014_SGDP_metainformation_update.tsv
bq load --autodetect \
    simons_genome_diversity_project.sample_metadata 10_24_2014_SGDP_metainformation_update.tsv
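
To see what the two `sed` steps do without downloading anything, the sketch below applies them to a tiny stand-in file. The file contents and the names `sample.txt` and `sample.tsv` are made up for illustration. GNU sed is assumed (for `\s` and for `\t`, `\r`, `\n` inside a bracket expression), and the quantifier is written as a plain `*`.

```shell
# Build a small stand-in for the metadata file: some lines end in spaces or
# a tab, and the file ends with two blank lines. The contents are made up.
tmpdir="$(mktemp -d)"
printf 'id\tpopulation  \nS1\tPopA\t\nS2\tPopB\n\n\n' > "${tmpdir}/sample.txt"

# Step 1: when a blank line is found, keep appending the next line (N) and
#         looping (ba); if the end of the file is reached while the buffer
#         is still blank, delete it ($d). Blank lines in the middle survive.
# Step 2: strip trailing whitespace from every line.
sed ':a;/^[\t\r\n]*$/{$d;N;ba}' "${tmpdir}/sample.txt" \
    | sed 's/\s*$//' > "${tmpdir}/sample.tsv"

# Show the result with tabs and line ends made visible.
sed -n 'l' "${tmpdir}/sample.tsv"
```

The cleaned file keeps the three data lines, with no trailing whitespace and no blank lines at the end, which is the shape `bq load --autodetect` is then given.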

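The sample metadata does not use the sample identifiers used in the VCF files, and it is also missing one row. The sample attributes were therefore downloaded from http://www.ebi.ac.uk/ena/data/view/PRJEB9586 and reshaped into the bigquery-public-data:human_genome_variants.simons_genome_diversity_project_sample_attributes BigQuery table. This was done with the wrangle-simons-sample-attributes.R script. The script also remaps three samples whose IDs in the source VCF files do not match the corresponding Illumina ID attribute at EBI.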
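
The remapping idea can be illustrated with a small override table. This is only a sketch of the general technique, not the logic of wrangle-simons-sample-attributes.R. Every ID, file name, and column layout below is hypothetical.

```shell
workdir="$(mktemp -d)"

# Hypothetical override table: Illumina ID at EBI -> ID used in the VCF.
printf 'ILLUMINA_X\tVCF_X\n' > "${workdir}/id_overrides.tsv"

# Hypothetical attributes file: Illumina ID, population.
printf 'ILLUMINA_A\tPopA\nILLUMINA_X\tPopB\n' > "${workdir}/attributes.tsv"

# First file: load the overrides into a lookup table.
# Second file: prepend the VCF ID, falling back to the Illumina ID when no
# override exists for that sample.
awk -F'\t' -v OFS='\t' '
    NR == FNR { override[$1] = $2; next }
    { vcf_id = ($1 in override) ? override[$1] : $1; print vcf_id, $0 }
' "${workdir}/id_overrides.tsv" "${workdir}/attributes.tsv" \
    > "${workdir}/attributes_with_vcf_id.tsv"

cat "${workdir}/attributes_with_vcf_id.tsv"
```

Keeping the overrides in a separate file leaves the downloaded attributes untouched and makes the handful of exceptions easy to audit.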

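Use: This dataset is publicly available for anyone to use, subject to the terms specified by the dataset sources (https://www.hms.harvard.edu and https://www.simonsfoundation.org/simons-genome-diversity-project/). Google provides the dataset "AS IS", without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of this dataset.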