使用 BigQuery 运行间隔联接

使用 BigQuery,您可以对带有基因组区域间隔所描述数据的变体运行 JOIN 查询。此页面介绍如何使用复杂 JOIN 查询获取基因名称列表并执行以下操作:

  • 查找与基因重叠的罕见 SNP
  • 在整个基因组样本中基因的任一侧找到 100000 个碱基对

以下介绍了三个查询,每个查询都演示了 BigQuery 如何根据不同大小的基因组数据进行扩缩:

数据来自近 90 亿行的 Tute Genomics Annotation 表和 Illumina Platinum Genomes 数据集。如果您不熟悉这些数据集,请参阅以下链接:

使用内嵌表运行间隔 JOIN

本示例使用查询中定义的间隔表(称为 intervals),并演示如何对包含来自 Illumina Platinum Genomes 的变体的表运行 JOIN 查询:

  1. 转到 BigQuery 界面。

    转到 BigQuery 界面

  2. 点击编写查询

  3. 新查询框中,输入以下查询:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- Define an inline table that uses five rows
      -- selected from silver-wall-555.TuteTable.hg19.
      intervals AS (
        SELECT * FROM UNNEST ([
        STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
        ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
        ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
        ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
        ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
        ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
      ])),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  4. 点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。

    使用来自 1000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。

使用具体化表

在大规模处理大数据时,您可以具体化间隔表并对新表运行 JOIN 查询。

运行包含特定基因的间隔 JOIN

下面的示例介绍如何具体化一个新的间隔表,该表中包含来自 silver-wall-555:TuteTable.hg19 表的特定基因列表:

  1. 转到 BigQuery 界面。

    转到 BigQuery 界面

  2. 通过完成以下步骤创建数据集:

    1. 在导航窗格中,点击 Google Cloud 项目名称旁边的向下箭头图标 ,然后点击新建数据集

    2. 数据集 ID 部分,输入 genomics

    3. 保留其他默认设置,然后点击确定

  3. 通过完成以下步骤创建间隔表:

    1. 在 BigQuery 中,点击编写查询

    2. 新查询框中,输入以下查询,该查询将 silver-wall-555:TuteTable.hg19 表的一部分具体化为新的 genomics.myIntervalTable 间隔表:

      #standardSQL
      CREATE TABLE `genomics.myIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start)-100000 AS region_start,
        MAX(`End`)+100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4',
        'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3',
        'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA',
        'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11',
        'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX')
      GROUP BY
        Chr,
        Gene );
      
      1. 点击运行查询
  4. 新查询框中,输入以下查询。

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.myIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  5. 点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。

    使用来自 1,000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。

运行包含随机基因的间隔 JOIN

下面的示例介绍如何对具体化表运行间隔 JOIN,该具体化表中包含从 silver-wall-555:TuteTable.hg19 表中随机选出的 250 个基因:

  1. 转到 BigQuery 界面。

    转到 BigQuery 界面

  2. 通过完成以下步骤创建数据集:

    1. 在导航窗格中,点击 Google Cloud 项目名称旁边的向下箭头图标 ,然后点击新建数据集

    2. 数据集 ID 部分,输入 genomics

    3. 保留其他默认设置,然后点击确定

  3. 通过完成以下步骤创建间隔表:

    1. 在 BigQuery 中,点击编写查询

    2. 新查询框中,输入以下查询,该查询将 silver-wall-555:TuteTable.hg19 表的一部分具体化为新的 genomics.randomGenesIntervalTable 间隔表:

      #standardSQL
      CREATE TABLE `genomics.randomGenesIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start) - 100000 AS region_start,
        MAX(`End`) + 100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250)
      GROUP BY
        Chr,
        Gene );
      
      1. 点击运行查询
  4. 新查询框中,输入以下查询。

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.randomGenesIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  5. 点击运行查询。查询大约需要运行十秒钟,并处理约 334 GB 的数据。查询结果标识同类群组中与目标区域重叠的罕见变体。

    使用来自 1,000 Genomes Phase 3 的数据的类似查询大约需要运行 90 秒,并处理约 3.38 TB 的数据。