Using BigQuery, you can run a JOIN
query on variants with
data described by genomic region intervals, or overlaps. This page shows how
to use a complex JOIN
query to take a list of gene names and do the
following:
- Find the rare SNPs overlapping the genes
- Find 100,000 base pairs on either side of a gene for the whole genome samples
There are three queries presented, each of which demonstrates how BigQuery scales over different sizes of genomic data:
- Querying an inline table
- Querying a materialized table with specific genes
- Querying a materialized table with 250 random genes
The data comes from the nearly 9 billion row Tute Genomics Annotation table and the Illumina Platinum Genomes dataset. If you're not familiar with these datasets, see the following links:
- Exploring genetic variation with Google Genomics and Tute from the Google Cloud blog
- Illumina Platinum Genomes in the section on public datasets
Running the interval JOIN
using an inline table
This example uses an interval table defined in the query, called
intervals
, and shows how to run a JOIN
query with a table
containing variants from
Illumina Platinum Genomes:
Go to the BigQuery UI.
Click Compose Query.
In the New Query box, enter the following query:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- Define an inline table that uses five rows -- selected from silver-wall-555.TuteTable.hg19. intervals AS ( SELECT * FROM UNNEST ([ STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64> ('PRCC', '1', 156736274, 156771607, 156636274, 156871607), ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640), ('PAX8', '2', 113972574, 114037496, 113872574, 114137496), ('FHIT', '3', 59734036, 61238131, 59634036, 61338131), ('PPARG', '3', 12328349, 12476853, 12228349, 12576853) ])), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run Query. The query takes roughly ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
A similar query using data from 1,000 Genomes Phase 3 takes roughly 90 seconds to run and processes about 3.38 TB of data.
Using a materialized table
When working with big data at scale, you can materialize an interval table
and run a JOIN
query against the new table.
Running the interval JOIN
with specific genes
The following example shows how to materialize a new interval table that contains a list of specific genes from the silver-wall-555:TuteTable.hg19 table:
Go to the BigQuery UI.
Create a dataset by completing the following steps:
Click the down arrow icon next to your Google Cloud project in the navigation and click Create new dataset.
For Dataset ID, enter
genomics
.Leave the other default settings in place and click OK.
Create the interval table by completing the following steps:
In BigQuery, click Compose Query.
In the New Query box, enter the following query, which materializes part of the silver-wall-555:TuteTable.hg19 table to a new
genomics.myIntervalTable
interval table:#standardSQL CREATE TABLE `genomics.myIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start)-100000 AS region_start, MAX(`End`)+100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4', 'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3', 'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA', 'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11', 'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX') GROUP BY Chr, Gene );
- Click Run Query.
In the New Query box, enter the following query.
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.myIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run Query. The query takes roughly ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
A similar query using data from 1,000 Genomes Phase 3 takes roughly 90 seconds to run and processes about 3.38 TB of data.
Running the interval JOIN
with random genes
The following example shows how to run an interval JOIN
on a materialized
table that contains 250 genes randomly selected from the
silver-wall-555:TuteTable.hg19 table:
Go to the BigQuery UI.
Create a dataset by completing the following steps:
Click the down arrow icon next to your Google Cloud project in the navigation and click Create new dataset.
For Dataset ID, enter
genomics
.Leave the other default settings in place and click OK.
Create the interval table by completing the following steps:
In BigQuery, click Compose Query.
In the New Query box, enter the following query, which materializes part of the silver-wall-555:TuteTable.hg19 table to a new
genomics.randomGenesIntervalTable
interval table:#standardSQL CREATE TABLE `genomics.randomGenesIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start) - 100000 AS region_start, MAX(`End`) + 100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250) GROUP BY Chr, Gene );
- Click Run Query.
In the New Query box, enter the following query.
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.randomGenesIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run Query. The query takes roughly ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
A similar query using data from 1,000 Genomes Phase 3 takes roughly 90 seconds to run and processes about 3.38 TB of data.