使用 BigQuery,您可以对带有基因组区域间隔所描述数据的变体运行 JOIN
查询。此页面介绍如何使用复杂 JOIN
查询获取基因名称列表并执行以下操作:
- 查找与基因重叠的罕见 SNP
- 在整个基因组样本中基因的任一侧找到 100000 个碱基对
以下介绍了三个查询,每个查询都演示了 BigQuery 如何根据不同大小的基因组数据进行扩缩:
数据来自近 90 亿行的 Tute Genomics Annotation 表和 Illumina Platinum Genomes 数据集。如果您不熟悉这些数据集,请参阅以下链接:
- Google Cloud 博客中的通过 Google Genomics 和 Tute 探索基因变异
- 公共数据集内的 Illumina Platinum Genomes 部分
使用内嵌表运行间隔 JOIN
本示例使用查询中定义的间隔表(称为 intervals
),并演示如何对包含来自 Illumina Platinum Genomes 的变体的表运行 JOIN
查询:
转到 BigQuery 界面。
点击编写查询。
在新查询框中,输入以下查询:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- Define an inline table that uses five rows -- selected from silver-wall-555.TuteTable.hg19. intervals AS ( SELECT * FROM UNNEST ([ STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64> ('PRCC', '1', 156736274, 156771607, 156636274, 156871607), ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640), ('PAX8', '2', 113972574, 114037496, 113872574, 114137496), ('FHIT', '3', 59734036, 61238131, 59634036, 61338131), ('PPARG', '3', 12328349, 12476853, 12228349, 12576853) ])), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。
使用来自 1000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。
使用具体化表
在大规模处理大数据时,您可以具体化间隔表并对新表运行 JOIN
查询。
运行包含特定基因的间隔 JOIN
下面的示例介绍如何具体化一个新的间隔表,该表中包含来自 silver-wall-555:TuteTable.hg19 表的特定基因列表:
转到 BigQuery 界面。
通过完成以下步骤创建数据集:
在导航窗格中,点击 Google Cloud 项目名称旁边的向下箭头图标 ,然后点击新建数据集。
在数据集 ID 部分,输入
genomics
。保留其他默认设置,然后点击确定。
通过完成以下步骤创建间隔表:
在 BigQuery 中,点击编写查询。
在新查询框中,输入以下查询,该查询将 silver-wall-555:TuteTable.hg19 表的一部分具体化为新的
genomics.myIntervalTable
间隔表:#standardSQL CREATE TABLE `genomics.myIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start)-100000 AS region_start, MAX(`End`)+100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4', 'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3', 'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA', 'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11', 'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX') GROUP BY Chr, Gene );
- 点击运行查询。
在新查询框中,输入以下查询。
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.myIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。
使用来自 1,000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。
运行包含随机基因的间隔 JOIN
下面的示例介绍如何对具体化表运行间隔 JOIN
,该具体化表中包含从 silver-wall-555:TuteTable.hg19 表中随机选出的 250 个基因:
转到 BigQuery 界面。
通过完成以下步骤创建数据集:
在导航窗格中,点击 Google Cloud 项目名称旁边的向下箭头图标 ,然后点击新建数据集。
在数据集 ID 部分,输入
genomics
。保留其他默认设置,然后点击确定。
通过完成以下步骤创建间隔表:
在 BigQuery 中,点击编写查询。
在新查询框中,输入以下查询,该查询将 silver-wall-555:TuteTable.hg19 表的一部分具体化为新的
genomics.randomGenesIntervalTable
间隔表:#standardSQL CREATE TABLE `genomics.randomGenesIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start) - 100000 AS region_start, MAX(`End`) + 100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250) GROUP BY Chr, Gene );
- 点击运行查询。
在新查询框中,输入以下查询。
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.randomGenesIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。
使用来自 1,000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。