Run interval JOINs with BigQuery

You can use BigQuery to run a JOIN query on variants with data described by genomic region intervals, or overlaps. This page shows how to use a complex JOIN query to take a list of gene names and do the following:

  • Find the rare SNPs overlapping the genes.
  • Find 100,000 base pairs on either side of a gene for the whole genome samples.

This guide shows examples for three queries. Each query demonstrates how BigQuery scales over different sizes of genomic data:

The data comes from the nearly 9 billion-row Tute Genomics Annotation table and the Illumina Platinum Genomes dataset. If you're not familiar with these datasets, see the following links:

Query an inline table

The following example uses an interval table defined in the query, called intervals, and shows how to run a JOIN query with a table containing variants from Illumina Platinum Genomes:

  1. Go to the BigQuery page in Google Cloud console.

    Go to the BigQuery page

  2. Click Compose query.

  3. In the New query field, run the following query:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- Define an inline table that uses five rows
      -- selected from silver-wall-555.TuteTable.hg19.
      intervals AS (
        SELECT * FROM UNNEST ([
        STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
        ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
        ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
        ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
        ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
        ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
      ])),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  4. Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.

    Expand the following section to see the results of the query:

    Query results

    Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronic RRNAD1 0.002 2 4
    1 156705390 C T intronic RRNAD1 8.0E-4 0 2
    1 156714207 T C intronic HDGF 0.003 0 6
    1 156714440 A C intronic HDGF 0.0068 0 12
    1 156723870 C T intergenic HDGF,PRCC 0.006 1 2
    1 156724456 C T intergenic HDGF,PRCC 0.002 2 4
    1 156733988 C T intergenic HDGF,PRCC 0.001 1 2
    1 156742258 T G intronic PRCC 0.001 2 4
    1 156744826 T G intronic PRCC 0.002 0 8
    1 156779764 G A intronic SH2D2A 0.001 2 4
    1 156783454 A C intronic SH2D2A 0.0014 1 2
    1 156786144 C T intronic NTRK1,SH2D2A 0.0031 2 4
    1 156790510 A T intronic NTRK1 0.002 1 2
    1 156815332 A C intronic INSRR,NTRK1 0.003 0 2
    1 156830778 G A exonic NTRK1 0.0067 missense 2 4
    1 156842064 C T intronic NTRK1 0.0014 1 2
    1 156843438 C A exonic NTRK1 0.0032 missense 1 2
    1 156845773 C T intronic NTRK1 0.001 2 4
    1 156873318 T C intronic PEAR1 0.01 4 8
    1 156922740 G A intronic ARHGEF11 0.007 1 2
    1 156930100 C T intronic ARHGEF11 0.001 2 4
    2 113901230 G A intergenic IL1RN,PSD4 0.0082 1 2
    2 113953418 C A intronic PSD4 0.001 2 4
    2 113967621 G C intergenic PSD4,PAX8 0.002 0 6
    2 113967624 T C intergenic PSD4,PAX8 0.002 0 2
    2 113980967 G A intronic PAX8 0.002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4
    2 114061327 T C intergenic PAX8,CBWD2 0.001 2 4
    2 114084018 A C intergenic PAX8,CBWD2 0.0045 0 4
    2 114099037 G A intergenic PAX8,CBWD2 0.0051 1 2
    2 114105670 A T intergenic PAX8,CBWD2 0.001 1 2
    2 114111325 G T intergenic PAX8,CBWD2 0.001 1 2
    3 12265797 C T intergenic SYN2,PPARG 0.0089 2 4
    3 12277958 A G intergenic SYN2,PPARG 0.002 1 2
    3 12296019 G A intergenic SYN2,PPARG 0.002 2 4
    3 12316549 G C intergenic SYN2,PPARG 0.002 1 2
    3 12335681 T G intronic PPARG 0.0092 2 4
    3 12348795 T C intronic PPARG 0.0014 1 2
    3 12353106 T C intronic PPARG 0.001 2 4
    3 12403825 G A intronic PPARG 0.0051 2 4
    3 12404394 G A intronic PPARG 0.001 1 2
    3 12410289 G A intronic PPARG 0.008 2 4
    3 12431381 C T intronic PPARG 0.0061 2 4
    3 12447267 G A intronic PPARG 0.0089 2 4
    3 12449379 C T intronic PPARG 0.0092 2 4
    3 12450848 C A intronic PPARG 0.0092 2 4
    3 12462847 T C intronic PPARG 0.002 1 2
    3 12492797 G A intergenic PPARG,TSEN2 0.01 1 2
    3 12503201 G A intergenic PPARG,TSEN2 0.0099 2 4
    3 12530460 A G intronic TSEN2 0.0092 2 4
    3 12531167 A G intronic TSEN2 0.0099 2 4
    3 12557737 A G intronic TSEN2 0.001 2 4
    3 59636143 A G intergenic C3orf67,FHIT 0.003 3 6
    3 59645934 A C intergenic C3orf67,FHIT 0.004 1 2
    3 59646893 G A intergenic C3orf67,FHIT 0.002 1 2
    3 59697024 A G intergenic C3orf67,FHIT 0.0072 1 2
    3 59701013 G A intergenic C3orf67,FHIT 0.004 2 4
    3 59733945 A G intergenic C3orf67,FHIT 0.001 2 4
    3 59747482 C T intronic FHIT 0.001 2 4
    3 59750635 A G intronic FHIT 0.003 1 2
    3 59757776 C T intronic FHIT 0.001 2 4
    3 59770612 G A intronic FHIT 0.001 2 4
    3 59804444 G C intronic FHIT 0.001 2 4
    3 59819769 T C intronic FHIT 0.001 2 4
    3 59884396 C T intronic FHIT 0.001 2 4
    3 59960728 A C intronic FHIT 0.01 1 2
    3 59970345 G A intronic FHIT 0.002 1 2
    3 59972417 T A intronic FHIT 0.0072 0 2
    3 60104328 C A intronic FHIT 0.01 2 4
    3 60139062 G A intronic FHIT 0.01 0 2
    3 60158066 C T intronic FHIT 0.001 1 2
    3 60169285 C T intronic FHIT 0.005 1 2
    3 60216185 T C intronic FHIT 0.002 1 2
    3 60226380 G A intronic FHIT 0.007 2 4
    3 60234539 C A intronic FHIT 0.002 1 2
    3 60247464 A C intronic FHIT 0.004 2 4
    3 60269926 A G intronic FHIT 0.007 2 4
    3 60271228 G T intronic FHIT 0.007 2 4
    3 60286972 T C intronic FHIT 0.001 2 4
    3 60301412 C G intronic FHIT 0.001 1 2
    3 60312251 C T intronic FHIT 0.0099 1 2
    3 60317682 A G intronic FHIT 0.008 1 2
    3 60328557 C G intronic FHIT 0.0043 2 4
    3 60342562 C T intronic FHIT 0.006 1 2
    3 60400033 G A intronic FHIT 0.004 2 4
    3 60435819 C T intronic FHIT 0.006 2 4
    3 60435820 G T intronic FHIT 0.004 1 2
    3 60441288 T C intronic FHIT 0.006 2 4
    3 60444465 C A intronic FHIT 0.01 1 2
    3 60444575 C T intronic FHIT 0.001 1 2
    3 60450581 T C intronic FHIT 0.01 1 2
    3 60456571 G A intronic FHIT 0.001 2 4
    3 60473568 C G intronic FHIT 0.001 1 2
    3 60487557 T C intronic FHIT 0.001 1 2
    3 60559705 A G intronic FHIT 0.002 2 4
    3 60570764 T C intronic FHIT 0.008 2 4
    3 60582100 C T intronic FHIT 0.001 1 2
    3 60587192 G A intronic FHIT 0.004 1 2
    3 60599869 G A intronic FHIT 0.0086 2 4
    3 60603091 C T intronic FHIT 0.001 2 4
    3 60603250 A T intronic FHIT 0.0099 1 2
    3 60609831 T G intronic FHIT 0.001 2 4
    3 60619756 G T intronic FHIT 0.0015 2 4
    3 60680758 C T intronic FHIT 0.0089 2 4
    3 60702243 G C intronic FHIT 0.001 2 4
    3 60702532 A G intronic FHIT 0.001 1 2
    3 60714328 A T intronic FHIT 0.004 1 2
    3 60725297 G A intronic FHIT 0.001 1 2
    3 60726640 G A intronic FHIT 0.01 2 4
    3 60795144 A G intronic FHIT 0.001 2 4
    3 60807171 A G intronic FHIT 0.001 1 2
    3 60813868 T C intronic FHIT 0.001 1 2
    3 60826546 C G intronic FHIT 0.0023 1 2
    3 60837392 C T intronic FHIT 0.001 1 2
    3 60846310 A G intronic FHIT 0.01 0 2
    3 60850985 C T intronic FHIT 0.004 1 2
    3 60852559 T C intronic FHIT 0.008 1 2
    3 60871759 T C intronic FHIT 0.004 1 2
    3 60884396 C T intronic FHIT 0.002 2 4
    3 60897092 C A intronic FHIT 0.001 2 4
    3 60940759 C T intronic FHIT 0.0089 1 2
    3 60982595 A G intronic FHIT 0.003 2 4
    3 60999283 G A intronic FHIT 0.001 1 2
    3 61042977 A G intronic FHIT 0.001 2 4
    3 61043349 T C intronic FHIT 0.001 2 4
    3 61044789 A C intronic FHIT 0.001 2 4
    3 61141621 G A intronic FHIT 0.003 1 2
    3 61148655 G C intronic FHIT 0.001 2 4
    3 61170747 C T intronic FHIT 0.003 1 2
    3 61189473 C G intronic FHIT 0.0099 1 2
    3 61190425 C T intronic FHIT 0.0023 2 4
    3 61193853 C T intronic FHIT 0.0099 0 2
    3 61194793 C T intronic FHIT 0.007 0 2
    3 61194840 A G intronic FHIT 0.0099 0 2
    3 61194886 T A intronic FHIT 0.0099 0 2
    3 61201777 C T intronic FHIT 0.001 2 4
    3 61202292 T C intronic FHIT 0.007 1 2
    3 61232806 G C intronic FHIT 0.0099 1 2
    3 61232910 C T intronic FHIT 0.0099 1 2
    3 61235824 A T intronic FHIT 0.001 2 4
    3 61283810 A C intergenic FHIT,PTPRG 0.0089 1 2
    3 61293731 T A intergenic FHIT,PTPRG 0.0089 2 4
    3 61296730 C T intergenic FHIT,PTPRG 0.001 1 2
    3 61326341 C T intergenic FHIT,PTPRG 0.004 2 4
    3 61326620 T C intergenic FHIT,PTPRG 0.01 1 2
    3 61327649 G C intergenic FHIT,PTPRG 0.001 2 4
    3 61330545 G C intergenic FHIT,PTPRG 0.001 2 4
    3 61335803 G A intergenic FHIT,PTPRG 0.001 2 4

    A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.

Use a materialized table

When working with big data at scale, you can materialize an interval table and run a JOIN query against the new table. Before continuing with the rest of this section, create a dataset by following these steps:

  1. Open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the Explorer panel, select the project where you want to create the dataset.

  3. Expand the Actions option and click Create dataset.

  4. On the Create dataset page:

    1. For Dataset ID, enter genomics.
    2. Leave the other default settings in place.
    3. Click Create dataset.

Query a materialized table with specific genes

The following steps show how to materialize a new interval table that contains a list of specific genes from the silver-wall-555:TuteTable.hg19 table.

  1. To create the interval table, follow these steps:

    1. Open the BigQuery page in the Google Cloud console.

      Go to the BigQuery page

    2. Click Compose query.

    3. In the New query field, run the following query. The query materializes part of the silver-wall-555:TuteTable.hg19 table to a new genomics.myIntervalTable interval table.

      #standardSQL
      CREATE TABLE `genomics.myIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start)-100000 AS region_start,
        MAX(`End`)+100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4',
        'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3',
        'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA',
        'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11',
        'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX')
      GROUP BY
        Chr,
        Gene );
      
    4. Click Run query. The query returns the following result:

    This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
    
  2. In the New query field, run the following query:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.myIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  3. Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.

    Expand the following section to see the results of the query:

    Query results

    Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronic RRNAD1 0.002 2 4
    1 156705390 C T intronic RRNAD1 8.0E-4 0 2
    1 156714207 T C intronic HDGF 0.003 0 6
    1 156714440 A C intronic HDGF 0.0068 0 12
    1 156723870 C T intergenic HDGF,PRCC 0.006 1 2
    1 156724456 C T intergenic HDGF,PRCC 0.002 2 4
    1 156733988 C T intergenic HDGF,PRCC 0.001 1 2
    1 156742258 T G intronic PRCC 0.001 2 4
    1 156744826 T G intronic PRCC 0.002 0 8
    1 156779764 G A intronic SH2D2A 0.001 2 4
    1 156783454 A C intronic SH2D2A 0.0014 1 2
    1 156786144 C T intronic NTRK1,SH2D2A 0.0031 2 4
    1 156790510 A T intronic NTRK1 0.002 1 2
    1 156815332 A C intronic INSRR,NTRK1 0.003 0 2
    1 156830778 G A exonic NTRK1 0.0067 missense 2 4
    1 156842064 C T intronic NTRK1 0.0014 1 2
    1 156843438 C A exonic NTRK1 0.0032 missense 1 2
    1 156845773 C T intronic NTRK1 0.001 2 4
    1 156873318 T C intronic PEAR1 0.01 4 8
    1 156922740 G A intronic ARHGEF11 0.007 1 2
    1 156930100 C T intronic ARHGEF11 0.001 2 4
    2 113901230 G A intergenic IL1RN,PSD4 0.0082 1 2
    2 113953418 C A intronic PSD4 0.001 2 4
    2 113967621 G C intergenic PSD4,PAX8 0.002 0 6
    2 113967624 T C intergenic PSD4,PAX8 0.002 0 2
    2 113980967 G A intronic PAX8 0.002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4
    2 114061327 T C intergenic PAX8,CBWD2 0.001 2 4
    2 114084018 A C intergenic PAX8,CBWD2 0.0045 0 4
    2 114099037 G A intergenic PAX8,CBWD2 0.0051 1 2
    2 114105670 A T intergenic PAX8,CBWD2 0.001 1 2
    2 114111325 G T intergenic PAX8,CBWD2 0.001 1 2
    3 12265797 C T intergenic SYN2,PPARG 0.0089 2 4
    3 12277958 A G intergenic SYN2,PPARG 0.002 1 2
    3 12296019 G A intergenic SYN2,PPARG 0.002 2 4
    3 12316549 G C intergenic SYN2,PPARG 0.002 1 2
    3 12335681 T G intronic PPARG 0.0092 2 4
    3 12348795 T C intronic PPARG 0.0014 1 2
    3 12353106 T C intronic PPARG 0.001 2 4
    3 12403825 G A intronic PPARG 0.0051 2 4
    3 12404394 G A intronic PPARG 0.001 1 2
    3 12410289 G A intronic PPARG 0.008 2 4
    3 12431381 C T intronic PPARG 0.0061 2 4
    3 12447267 G A intronic PPARG 0.0089 2 4
    3 12449379 C T intronic PPARG 0.0092 2 4
    3 12450848 C A intronic PPARG 0.0092 2 4
    3 12462847 T C intronic PPARG 0.002 1 2
    3 12492797 G A intergenic PPARG,TSEN2 0.01 1 2
    3 12503201 G A intergenic PPARG,TSEN2 0.0099 2 4
    3 12530460 A G intronic TSEN2 0.0092 2 4
    3 12531167 A G intronic TSEN2 0.0099 2 4
    3 12557737 A G intronic TSEN2 0.001 2 4
    3 59636143 A G intergenic C3orf67,FHIT 0.003 3 6
    3 59645934 A C intergenic C3orf67,FHIT 0.004 1 2
    3 59646893 G A intergenic C3orf67,FHIT 0.002 1 2
    3 59697024 A G intergenic C3orf67,FHIT 0.0072 1 2
    3 59701013 G A intergenic C3orf67,FHIT 0.004 2 4
    3 59733945 A G intergenic C3orf67,FHIT 0.001 2 4
    3 59747482 C T intronic FHIT 0.001 2 4
    3 59750635 A G intronic FHIT 0.003 1 2
    3 59757776 C T intronic FHIT 0.001 2 4
    3 59770612 G A intronic FHIT 0.001 2 4
    3 59804444 G C intronic FHIT 0.001 2 4
    3 59819769 T C intronic FHIT 0.001 2 4
    3 59884396 C T intronic FHIT 0.001 2 4
    3 59960728 A C intronic FHIT 0.01 1 2
    3 59970345 G A intronic FHIT 0.002 1 2
    3 59972417 T A intronic FHIT 0.0072 0 2
    3 60104328 C A intronic FHIT 0.01 2 4
    3 60139062 G A intronic FHIT 0.01 0 2
    3 60158066 C T intronic FHIT 0.001 1 2
    3 60169285 C T intronic FHIT 0.005 1 2
    3 60216185 T C intronic FHIT 0.002 1 2
    3 60226380 G A intronic FHIT 0.007 2 4
    3 60234539 C A intronic FHIT 0.002 1 2
    3 60247464 A C intronic FHIT 0.004 2 4
    3 60269926 A G intronic FHIT 0.007 2 4
    3 60271228 G T intronic FHIT 0.007 2 4
    3 60286972 T C intronic FHIT 0.001 2 4
    3 60301412 C G intronic FHIT 0.001 1 2
    3 60312251 C T intronic FHIT 0.0099 1 2
    3 60317682 A G intronic FHIT 0.008 1 2
    3 60328557 C G intronic FHIT 0.0043 2 4
    3 60342562 C T intronic FHIT 0.006 1 2
    3 60400033 G A intronic FHIT 0.004 2 4
    3 60435819 C T intronic FHIT 0.006 2 4
    3 60435820 G T intronic FHIT 0.004 1 2
    3 60441288 T C intronic FHIT 0.006 2 4
    3 60444465 C A intronic FHIT 0.01 1 2
    3 60444575 C T intronic FHIT 0.001 1 2
    3 60450581 T C intronic FHIT 0.01 1 2
    3 60456571 G A intronic FHIT 0.001 2 4
    3 60473568 C G intronic FHIT 0.001 1 2
    3 60487557 T C intronic FHIT 0.001 1 2
    3 60559705 A G intronic FHIT 0.002 2 4
    3 60570764 T C intronic FHIT 0.008 2 4
    3 60582100 C T intronic FHIT 0.001 1 2
    3 60587192 G A intronic FHIT 0.004 1 2
    3 60599869 G A intronic FHIT 0.0086 2 4
    3 60603091 C T intronic FHIT 0.001 2 4
    3 60603250 A T intronic FHIT 0.0099 1 2
    3 60609831 T G intronic FHIT 0.001 2 4
    3 60619756 G T intronic FHIT 0.0015 2 4
    3 60680758 C T intronic FHIT 0.0089 2 4
    3 60702243 G C intronic FHIT 0.001 2 4
    3 60702532 A G intronic FHIT 0.001 1 2
    3 60714328 A T intronic FHIT 0.004 1 2
    3 60725297 G A intronic FHIT 0.001 1 2
    3 60726640 G A intronic FHIT 0.01 2 4
    3 60795144 A G intronic FHIT 0.001 2 4
    3 60807171 A G intronic FHIT 0.001 1 2
    3 60813868 T C intronic FHIT 0.001 1 2
    3 60826546 C G intronic FHIT 0.0023 1 2
    3 60837392 C T intronic FHIT 0.001 1 2
    3 60846310 A G intronic FHIT 0.01 0 2
    3 60850985 C T intronic FHIT 0.004 1 2
    3 60852559 T C intronic FHIT 0.008 1 2
    3 60871759 T C intronic FHIT 0.004 1 2
    3 60884396 C T intronic FHIT 0.002 2 4
    3 60897092 C A intronic FHIT 0.001 2 4
    3 60940759 C T intronic FHIT 0.0089 1 2
    3 60982595 A G intronic FHIT 0.003 2 4
    3 60999283 G A intronic FHIT 0.001 1 2
    3 61042977 A G intronic FHIT 0.001 2 4
    3 61043349 T C intronic FHIT 0.001 2 4
    3 61044789 A C intronic FHIT 0.001 2 4
    3 61141621 G A intronic FHIT 0.003 1 2
    3 61148655 G C intronic FHIT 0.001 2 4
    3 61170747 C T intronic FHIT 0.003 1 2
    3 61189473 C G intronic FHIT 0.0099 1 2
    3 61190425 C T intronic FHIT 0.0023 2 4
    3 61193853 C T intronic FHIT 0.0099 0 2
    3 61194793 C T intronic FHIT 0.007 0 2
    3 61194840 A G intronic FHIT 0.0099 0 2
    3 61194886 T A intronic FHIT 0.0099 0 2
    3 61201777 C T intronic FHIT 0.001 2 4
    3 61202292 T C intronic FHIT 0.007 1 2
    3 61232806 G C intronic FHIT 0.0099 1 2
    3 61232910 C T intronic FHIT 0.0099 1 2
    3 61235824 A T intronic FHIT 0.001 2 4
    3 61283810 A C intergenic FHIT,PTPRG 0.0089 1 2
    3 61293731 T A intergenic FHIT,PTPRG 0.0089 2 4
    3 61296730 C T intergenic FHIT,PTPRG 0.001 1 2
    3 61326341 C T intergenic FHIT,PTPRG 0.004 2 4
    3 61326620 T C intergenic FHIT,PTPRG 0.01 1 2
    3 61327649 G C intergenic FHIT,PTPRG 0.001 2 4
    3 61330545 G C intergenic FHIT,PTPRG 0.001 2 4
    3 61335803 G A intergenic FHIT,PTPRG 0.001 2 4

    A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.

Query a materialized table with 250 random genes

The following example shows how to run an interval JOIN on a materialized table that contains 250 genes randomly selected from the silver-wall-555:TuteTable.hg19 table.

  1. To create the interval table, follow these steps:

    1. Open the BigQuery page in the Google Cloud console.

      Go to the BigQuery page

    2. Click Compose query.

    3. In the New query field, run the following query, which materializes part of the silver-wall-555:TuteTable.hg19 table to a new genomics.randomGenesIntervalTable interval table.

      #standardSQL
      CREATE TABLE `genomics.randomGenesIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start) - 100000 AS region_start,
        MAX(`End`) + 100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250)
      GROUP BY
        Chr,
        Gene );
      
      1. Click Run query. The query returns the following result:
      This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
      
  2. In the New query field, run the following query:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.randomGenesIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  3. Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.

    Expand the following section to see the truncated results of the query:

    Query results

    Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 2925355 C A intergenic TTC34,ACTRT2 0.001 2 4
    1 2933170 G A intergenic TTC34,ACTRT2 0.0083 0 4
    1 2944477 G A intergenic ACTRT2,LINC00982 0.003 4 6
    1 2967591 A T intergenic ACTRT2,LINC00982 0.0092 1 2
    1 2975255 T C downstream LINC00982 0.0082 1 2
    1 2977223 C T ncRNA_intronic LINC00982 0.0072 1 2
    1 2978803 G C ncRNA_exonic LINC00982 0.002 4 6
    1 3006466 G A intronic PRDM16 0.0098 1 2
    1 3011333 G T intronic PRDM16 0.004 1 2
    1 3019659 C T intronic PRDM16 0.0031 1 2
    1 3036896 G A intronic PRDM16 0.001 1 2
    1 3037388 G A intronic PRDM16 0.002 2 4
    1 3041250 T G intronic PRDM16 0.006 2 4
    1 3042502 A T intronic PRDM16 0.003 4 6
    1 3053713 A C intronic PRDM16 0.002 1 2
    1 3063109 C T intronic PRDM16 0.002 0 2
    1 3063593 T C intronic PRDM16 0.003 1 2
    1 3076439 C T intronic PRDM16 0.001 2 4
    1 3078960 G A intronic PRDM16 0.007 2 4
    1 3084268 A C intronic PRDM16 0.005 0 2
    1 3084492 T C intronic PRDM16 0.0015 0 2
    1 3084786 T C intronic PRDM16 0.0015 0 4
    1 3111119 G A intronic PRDM16 0.003 1 2
    1 3111643 C T intronic PRDM16 0.0041 1 2
    1 3114807 G A intronic PRDM16 0.0041 1 2
    1 3165530 C T intronic PRDM16 0.0089 1 2
    1 3169325 G A intronic PRDM16 0.008 2 4
    1 3179623 C T intronic PRDM16 0.003 2 4
    1 3181097 C T intronic PRDM16 0.001 2 4
    1 3194000 G C intronic PRDM16 0.005 2 4
    1 3195769 T C intronic PRDM16 0.002 1 2
    1 3197351 C T intronic PRDM16 0.0061 1 2
    1 3224100 C A intronic PRDM16 0.003 2 4
    1 3228644 G T intronic PRDM16 0.001 2 4
    1 3234045 G A intronic PRDM16 0.002 1 2
    1 3235971 G A intronic PRDM16 0.0089 1 2
    1 3274115 C T intronic PRDM16 0.001 2 4
    1 3291388 G A intronic PRDM16 0.002 2 4
    1 3295658 A C intronic PRDM16 0.0068 0 6
    1 3295937 A C intronic PRDM16 0.0068 0 2
    1 3296205 T C intronic PRDM16 0.0083 0 2
    1 3315690 G A intronic PRDM16 0.001 2 4
    1 3329212 G A exonic PRDM16 0.0031 missense 1 2
    1 3331787 C T intronic PRDM16 0.0099 1 2
    1 3370316 G C upstream ARHGEF16 0.001 2 4
    1 3379560 A G intronic ARHGEF16 0.0051 0 6
    1 3391174 C T intronic ARHGEF16 0.006 1 2
    1 3413873 G A exonic MEGF6 0.003 missense 1 2
    1 3416272 C T exonic MEGF6 0.0072 silent 2 4
    1 3417122 G A intronic MEGF6 0.0038 2 4
    1 3436219 G A intronic MEGF6 0.0046 2 4
    1 12907456 A G exonic HNRNPCL1,LOC649330 0.006 missense 0 10
    1 12907518 C A exonic HNRNPCL1,LOC649330 1.0E-4 missense 0 10
    1 12908499 G C intronic HNRNPCL1 0.0031 0 8
    1 12931660 G C intergenic PRAMEF2,PRAMEF4 0.004 1 2
    1 12937721 G T intergenic PRAMEF2,PRAMEF4 0.0038 0 2
    1 12940827 G T intronic PRAMEF4 0.007 2 4
    1 12942759 T G intronic PRAMEF4 0.0076 0 10
    1 12942805 T G intronic PRAMEF4 0.0061 0 12
    1 12942812 G A intronic PRAMEF4 0.0061 0 12
    1 12942875 A G intronic PRAMEF4 0.0068 0 6
    1 12942912 G C intronic PRAMEF4 2.0E-4 0 2
    1 12942937 A T exonic PRAMEF4 0.0029 missense 0 2
    1 12942940 T G exonic PRAMEF4 0.0038 missense 0 2
    1 12943940 T C intronic PRAMEF4 0.0015 0 12
    1 12944138 A G intronic PRAMEF4 8.0E-4 0 12
    1 12944234 G A intronic PRAMEF4 0.0015 0 12
    1 12944589 T G intronic PRAMEF4 0.003 0 4
    1 12944845 A C intronic PRAMEF4 0.0014 0 6
    1 12946439 T C upstream PRAMEF4 0.0029 0 10
    1 12946833 G A upstream PRAMEF4 0.001 0 8
    1 12946835 T A upstream PRAMEF4 0.004 0 12
    1 12995204 G T intergenic PRAMEF8,PRAMEF6 0.003 1 4
    1 12997638 T C downstream PRAMEF6,PRAMEF9 0.003 2 4
    1 13007841 G C upstream PRAMEF6 0.0043 0 8
    1 13019228 T A intergenic PRAMEF6,LOC391003 0.0015 0 10
    1 13038503 G A UTR3 LOC391003 0.0072 1 2
    1 13051650 C T intergenic LOC391003,PRAMEF5 0.002 2 4
    1 15706063 G A intronic FHAD1 0.0029 1 2
    1 15713292 C T intronic FHAD1 0.001 1 2
    1 15766541 G C intronic CTRC 0.001 1 2
    1 15782601 T C upstream CELA2A 0.0038 1 2
    1 15828125 G A intronic CASP9 0.0014 2 4
    1 15831037 G A intronic CASP9 0.0099 1 2
    1 15840513 T G intronic CASP9 0.0043 2 4
    1 15868742 G A intronic DNAJC16 0.001 1 2
    1 15876704 G A intronic DNAJC16 0.001 1 2
    1 15900342 C A intronic AGMAT 0.001 1 2
    1 15906257 T C intronic AGMAT 8.0E-4 1 2
    1 15911897 A G upstream AGMAT 0.0043 2 4
    1 22764178 C T intergenic WNT4,ZBTB40 0.001 2 4
    1 22791939 C T intronic ZBTB40 0.0089 2 4
    1 22874394 C G intergenic ZBTB40,EPHA8 0.007 1 2
    1 22875103 C G intergenic ZBTB40,EPHA8 0.007 1 2
    1 22906403 C T intronic EPHA8 0.008 2 4
    1 22912956 G A intronic EPHA8 0.001 1 2
    1 22917007 C T intronic EPHA8 0.001 2 4
    1 22927240 G A exonic EPHA8 0.0013 missense 2 4
    1 22932265 G A intergenic EPHA8,MIR6127 0.0089 2 4
    1 22944057 C T intergenic EPHA8,MIR6127 0.0089 2 4
    1 22978799 A G upstream C1QB 0.0099 2 4
    1 35170588 C T intergenic C1orf94,GJB5 0.01 1 2
    1 35172426 C T intergenic C1orf94,GJB5 0.008 1 2
    1 35172447 G A intergenic C1orf94,GJB5 0.001 1 2
    1 35175302 C T intergenic C1orf94,GJB5 0.008 1 2
    1 35177410 A T intergenic C1orf94,GJB5 0.001 1 2
    1 35178768 C T intergenic C1orf94,GJB5 0.0014 2 4
    1 35179362 G A intergenic C1orf94,GJB5 0.0014 2 4
    1 35186166 G A intergenic C1orf94,GJB5 0.0099 2 4
    1 35186520 A C intergenic C1orf94,GJB5 0.002 2 4
    1 35196361 G A intergenic C1orf94,GJB5 0.0099 2 4
    1 35223545 C T exonic GJB5 0.001 silent 1 2
    1 35224029 G A UTR3 GJB5 0.003 1 2
    1 35227895 T C UTR3 GJB4 5.0E-4 1 2
    1 35230455 G T intergenic GJB4,GJB3 0.0043 1 2
    1 35232954 T C intergenic GJB4,GJB3 0.003 1 2
    1 35237986 G A intergenic GJB4,GJB3 0.0014 1 2
    1 35245522 C T intergenic GJB4,GJB3 0.001 1 2
    1 35256979 C T intergenic GJB3,GJA4 0.002 2 4
    1 35263872 C T intergenic GJA4,SMIM12 5.0E-4 2 4
    1 35323895 A C intronic SMIM12 0.0027 2 4
    1 35369676 G A intronic DLGAP3 0.007 2 4
    1 35371634 T A upstream DLGAP3 0.0015 0 4
    1 39253519 G A intergenic LINC01343,RRAGC 0.005 1 2
    1 39288829 G A intergenic LINC01343,RRAGC 0.0051 1 2
    1 39289832 A C intergenic LINC01343,RRAGC 0.002 0 2
    1 39312638 G A intronic RRAGC 0.0038 2 4
    1 39361372 G A intronic RHBDL2 0.005 1 2
    1 39363826 T G intronic RHBDL2 0.0029 1 2
    1 39367555 T C intronic RHBDL2 0.007 2 4
    1 39369531 T C intronic RHBDL2 0.001 2 4
    1 39370202 T C intronic RHBDL2 0.01 1 2
    1 39449101 A G intergenic RHBDL2,AKIRIN1 0.001 2 4
    1 39475057 G A intergenic AKIRIN1,NDUFS5 0.01 1 2
    1 39485016 C T intergenic AKIRIN1,NDUFS5 0.001 2 4
    1 39488137 A G intergenic AKIRIN1,NDUFS5 0.001 2 4
    1 39499212 A C intronic NDUFS5 0.001 0 2
    1 39500605 C G downstream NDUFS5 0.002 0 10
    1 46813814 T C intronic NSUN4 0.0014 1 2
    1 46817258 A G intronic NSUN4 0.005 0 2
    1 46843158 T C intergenic NSUN4,FAAH 0.001 1 2
    1 46933509 A G intergenic LINC01398,DMBX1 0.002 1 2
    1 46935021 G A intergenic LINC01398,DMBX1 0.004 2 4
    1 46939253 T A intergenic LINC01398,DMBX1 0.004 2 4
    1 46951788 C A intergenic LINC01398,DMBX1 0.002 2 4
    1 46980864 G C downstream DMBX1 0.003 1 2
    1 46989657 T C intergenic DMBX1,MKNK1-AS1 0.007 1 2
    1 46994678 C T intergenic DMBX1,MKNK1-AS1 0.002 1 2
    1 46999438 T C intergenic DMBX1,MKNK1-AS1 0.002 1 2
    1 92761505 A G intronic GLMN 0.001 2 4
    1 92764270 G C intronic GLMN 0.001 2 4
    1 92802210 G A intronic RPAP2 0.0072 1 2
    1 92820663 T A intronic RPAP2 0.0058 1 2
    1 92820664 G T intronic RPAP2 0.0058 1 2
    1 92820953 G A intronic RPAP2 0.007 2 4
    1 92824766 A G intronic RPAP2 0.0058 1 2
    1 92849183 C A intronic RPAP2 0.01 2 4
    1 92850696 C G intronic RPAP2 0.0023 1 2
    1 92861357 T C intergenic RPAP2,GFI1 0.01 2 4
    1 92877460 C G intergenic RPAP2,GFI1 0.002 1 2
    1 92880643 A G intergenic RPAP2,GFI1 0.001 2 4
    1 92911540 G A intergenic RPAP2,GFI1 0.004 2 4
    1 92911721 A C intergenic RPAP2,GFI1 0.0031 0 8
    1 92918277 C T intergenic RPAP2,GFI1 0.001 2 4
    1 92950920 G A intronic GFI1 0.008 2 4
    1 92964788 G A intergenic GFI1,EVI5 0.0023 1 2
    1 92977480 C T UTR3 EVI5 0.002 1 2
    1 92985213 C T intronic EVI5 0.001 2 4
    1 92988342 C T intronic EVI5 0.008 2 4
    1 92992283 G A intronic EVI5 0.01 2 4
    1 92999760 C T intronic EVI5 0.003 1 2
    1 93005149 G C intronic EVI5 0.003 0 4
    1 93018543 A T intronic EVI5 0.01 2 4
    1 93033744 C T intronic EVI5 0.001 2 4
    1 111400296 G A intergenic KCNA3,CD53 0.0014 2 4
    1 111411924 C T intergenic KCNA3,CD53 0.003 1 2
    1 111441850 C G UTR3 CD53 0.003 2 4
    1 111451527 C T intergenic CD53,LRIF1 0.008 2 4
    1 111454082 C A intergenic CD53,LRIF1 0.001 2 4
    1 111466506 A G intergenic CD53,LRIF1 0.001 2 4
    1 111525974 G A intergenic LRIF1,DRAM2 0.002 2 4
    1 111574573 G T intergenic LRIF1,DRAM2 0.0072 2 4
    1 111574594 T A intergenic LRIF1,DRAM2 0.005 1 2
    1 111574647 G A intergenic LRIF1,DRAM2 0.005 1 2
    1 111591746 T A intergenic LRIF1,DRAM2 0.005 1 2
    1 111601459 A G intergenic LRIF1,DRAM2 0.005 1 2
    1 111604748 G C intergenic LRIF1,DRAM2 0.005 1 2
    1 112191526 T G intronic RAP1A 0.001 2 4
    1 112206765 A G intronic RAP1A 0.0043 1 2
    1 112226517 G A intronic RAP1A 0.001 0 2
    1 112263324 G T intergenic RAP1A,FAM212B 0.003 2 4
    1 112264843 G A UTR3 FAM212B 0.001 1 2
    1 112285810 C T ncRNA_intronic FAM212B-AS1 0.004 1 2
    1 112304285 T C intronic DDX20 0.0043 1 2
    1 112307213 A C intronic DDX20 0.0043 1 2
    1 112309436 G T exonic DDX20 0.0 missense 1 2
    1 112317384 T C intergenic DDX20,KCND3 0.0014 1 2
    1 112381367 C T intronic KCND3 0.002 1 2
    1 112396571 G T ncRNA_exonic KCND3-IT1 0.001 1 2
    1 113520038 G A intergenic SLC16A1-AS1,LOC100996251 0.0023 1 2

    A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.