You can use BigQuery to run a JOIN
query on variants with
data described by genomic region intervals, or overlaps. This page shows how
to use a complex JOIN
query to take a list of gene names and do the
following:
- Find the rare SNPs overlapping the genes.
- Find 100,000 base pairs on either side of a gene for the whole genome samples.
This guide shows examples for three queries. Each query demonstrates how BigQuery scales over different sizes of genomic data:
- Query an inline table.
- Query a materialized table with specific genes.
- Query a materialized table with 250 random genes.
The data comes from the nearly 9 billion-row Tute Genomics Annotation table and the Illumina Platinum Genomes dataset. If you're not familiar with these datasets, see the following links:
- Exploring genetic variation with Google Genomics and Tute from the Google Cloud blog
- Illumina Platinum Genomes in the section on public datasets
Query an inline table
The following example uses an interval table defined in the query, called
intervals
, and shows how to run a JOIN
query with a table
containing variants from
Illumina Platinum Genomes:
Go to the BigQuery page in Google Cloud console.
Click Compose query.
In the New query field, run the following query:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- Define an inline table that uses five rows -- selected from silver-wall-555.TuteTable.hg19. intervals AS ( SELECT * FROM UNNEST ([ STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64> ('PRCC', '1', 156736274, 156771607, 156636274, 156871607), ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640), ('PAX8', '2', 113972574, 114037496, 113872574, 114137496), ('FHIT', '3', 59734036, 61238131, 59634036, 61338131), ('PPARG', '3', 12328349, 12476853, 12228349, 12576853) ])), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
Expand the following section to see the results of the query:
Query results
Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 156699757 T C intronic RRNAD1 0.002 2 4 1 156705390 C T intronic RRNAD1 8.0E-4 0 2 1 156714207 T C intronic HDGF 0.003 0 6 1 156714440 A C intronic HDGF 0.0068 0 12 1 156723870 C T intergenic HDGF,PRCC 0.006 1 2 1 156724456 C T intergenic HDGF,PRCC 0.002 2 4 1 156733988 C T intergenic HDGF,PRCC 0.001 1 2 1 156742258 T G intronic PRCC 0.001 2 4 1 156744826 T G intronic PRCC 0.002 0 8 1 156779764 G A intronic SH2D2A 0.001 2 4 1 156783454 A C intronic SH2D2A 0.0014 1 2 1 156786144 C T intronic NTRK1,SH2D2A 0.0031 2 4 1 156790510 A T intronic NTRK1 0.002 1 2 1 156815332 A C intronic INSRR,NTRK1 0.003 0 2 1 156830778 G A exonic NTRK1 0.0067 missense 2 4 1 156842064 C T intronic NTRK1 0.0014 1 2 1 156843438 C A exonic NTRK1 0.0032 missense 1 2 1 156845773 C T intronic NTRK1 0.001 2 4 1 156873318 T C intronic PEAR1 0.01 4 8 1 156922740 G A intronic ARHGEF11 0.007 1 2 1 156930100 C T intronic ARHGEF11 0.001 2 4 2 113901230 G A intergenic IL1RN,PSD4 0.0082 1 2 2 113953418 C A intronic PSD4 0.001 2 4 2 113967621 G C intergenic PSD4,PAX8 0.002 0 6 2 113967624 T C intergenic PSD4,PAX8 0.002 0 2 2 113980967 G A intronic PAX8 0.002 2 4 2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4 2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4 2 114061327 T C intergenic PAX8,CBWD2 0.001 2 4 2 114084018 A C intergenic PAX8,CBWD2 0.0045 0 4 2 114099037 G A intergenic PAX8,CBWD2 0.0051 1 2 2 114105670 A T intergenic PAX8,CBWD2 0.001 1 2 2 114111325 G T intergenic PAX8,CBWD2 0.001 1 2 3 12265797 C T intergenic SYN2,PPARG 0.0089 2 4 3 12277958 A G intergenic SYN2,PPARG 0.002 1 2 3 12296019 G A intergenic SYN2,PPARG 0.002 2 4 3 12316549 G C intergenic SYN2,PPARG 0.002 1 2 3 12335681 T G intronic PPARG 0.0092 2 4 3 12348795 T C intronic PPARG 0.0014 1 2 3 12353106 T C intronic PPARG 0.001 2 4 3 12403825 G A intronic PPARG 0.0051 2 4 3 12404394 G A intronic PPARG 0.001 1 2 3 12410289 G A intronic PPARG 0.008 2 4 3 12431381 C T intronic PPARG 0.0061 2 4 3 12447267 G A intronic PPARG 0.0089 2 4 3 12449379 C T intronic PPARG 0.0092 2 4 3 12450848 C A intronic PPARG 0.0092 2 4 3 12462847 T C intronic PPARG 0.002 1 2 3 12492797 G A intergenic PPARG,TSEN2 0.01 1 2 3 12503201 G A intergenic PPARG,TSEN2 0.0099 2 4 3 12530460 A G intronic TSEN2 0.0092 2 4 3 12531167 A G intronic TSEN2 0.0099 2 4 3 12557737 A G intronic TSEN2 0.001 2 4 3 59636143 A G intergenic C3orf67,FHIT 0.003 3 6 3 59645934 A C intergenic C3orf67,FHIT 0.004 1 2 3 59646893 G A intergenic C3orf67,FHIT 0.002 1 2 3 59697024 A G intergenic C3orf67,FHIT 0.0072 1 2 3 59701013 G A intergenic C3orf67,FHIT 0.004 2 4 3 59733945 A G intergenic C3orf67,FHIT 0.001 2 4 3 59747482 C T intronic FHIT 0.001 2 4 3 59750635 A G intronic FHIT 0.003 1 2 3 59757776 C T intronic FHIT 0.001 2 4 3 59770612 G A intronic FHIT 0.001 2 4 3 59804444 G C intronic FHIT 0.001 2 4 3 59819769 T C intronic FHIT 0.001 2 4 3 59884396 C T intronic FHIT 0.001 2 4 3 59960728 A C intronic FHIT 0.01 1 2 3 59970345 G A intronic FHIT 0.002 1 2 3 59972417 T A intronic FHIT 0.0072 0 2 3 60104328 C A intronic FHIT 0.01 2 4 3 60139062 G A intronic FHIT 0.01 0 2 3 60158066 C T intronic FHIT 0.001 1 2 3 60169285 C T intronic FHIT 0.005 1 2 3 60216185 T C intronic FHIT 0.002 1 2 3 60226380 G A intronic FHIT 0.007 2 4 3 60234539 C A intronic FHIT 0.002 1 2 3 60247464 A C intronic FHIT 0.004 2 4 3 60269926 A G intronic FHIT 0.007 2 4 3 60271228 G T intronic FHIT 0.007 2 4 3 60286972 T C intronic FHIT 0.001 2 4 3 60301412 C G intronic FHIT 0.001 1 2 3 60312251 C T intronic FHIT 0.0099 1 2 3 60317682 A G intronic FHIT 0.008 1 2 3 60328557 C G intronic FHIT 0.0043 2 4 3 60342562 C T intronic FHIT 0.006 1 2 3 60400033 G A intronic FHIT 0.004 2 4 3 60435819 C T intronic FHIT 0.006 2 4 3 60435820 G T intronic FHIT 0.004 1 2 3 60441288 T C intronic FHIT 0.006 2 4 3 60444465 C A intronic FHIT 0.01 1 2 3 60444575 C T intronic FHIT 0.001 1 2 3 60450581 T C intronic FHIT 0.01 1 2 3 60456571 G A intronic FHIT 0.001 2 4 3 60473568 C G intronic FHIT 0.001 1 2 3 60487557 T C intronic FHIT 0.001 1 2 3 60559705 A G intronic FHIT 0.002 2 4 3 60570764 T C intronic FHIT 0.008 2 4 3 60582100 C T intronic FHIT 0.001 1 2 3 60587192 G A intronic FHIT 0.004 1 2 3 60599869 G A intronic FHIT 0.0086 2 4 3 60603091 C T intronic FHIT 0.001 2 4 3 60603250 A T intronic FHIT 0.0099 1 2 3 60609831 T G intronic FHIT 0.001 2 4 3 60619756 G T intronic FHIT 0.0015 2 4 3 60680758 C T intronic FHIT 0.0089 2 4 3 60702243 G C intronic FHIT 0.001 2 4 3 60702532 A G intronic FHIT 0.001 1 2 3 60714328 A T intronic FHIT 0.004 1 2 3 60725297 G A intronic FHIT 0.001 1 2 3 60726640 G A intronic FHIT 0.01 2 4 3 60795144 A G intronic FHIT 0.001 2 4 3 60807171 A G intronic FHIT 0.001 1 2 3 60813868 T C intronic FHIT 0.001 1 2 3 60826546 C G intronic FHIT 0.0023 1 2 3 60837392 C T intronic FHIT 0.001 1 2 3 60846310 A G intronic FHIT 0.01 0 2 3 60850985 C T intronic FHIT 0.004 1 2 3 60852559 T C intronic FHIT 0.008 1 2 3 60871759 T C intronic FHIT 0.004 1 2 3 60884396 C T intronic FHIT 0.002 2 4 3 60897092 C A intronic FHIT 0.001 2 4 3 60940759 C T intronic FHIT 0.0089 1 2 3 60982595 A G intronic FHIT 0.003 2 4 3 60999283 G A intronic FHIT 0.001 1 2 3 61042977 A G intronic FHIT 0.001 2 4 3 61043349 T C intronic FHIT 0.001 2 4 3 61044789 A C intronic FHIT 0.001 2 4 3 61141621 G A intronic FHIT 0.003 1 2 3 61148655 G C intronic FHIT 0.001 2 4 3 61170747 C T intronic FHIT 0.003 1 2 3 61189473 C G intronic FHIT 0.0099 1 2 3 61190425 C T intronic FHIT 0.0023 2 4 3 61193853 C T intronic FHIT 0.0099 0 2 3 61194793 C T intronic FHIT 0.007 0 2 3 61194840 A G intronic FHIT 0.0099 0 2 3 61194886 T A intronic FHIT 0.0099 0 2 3 61201777 C T intronic FHIT 0.001 2 4 3 61202292 T C intronic FHIT 0.007 1 2 3 61232806 G C intronic FHIT 0.0099 1 2 3 61232910 C T intronic FHIT 0.0099 1 2 3 61235824 A T intronic FHIT 0.001 2 4 3 61283810 A C intergenic FHIT,PTPRG 0.0089 1 2 3 61293731 T A intergenic FHIT,PTPRG 0.0089 2 4 3 61296730 C T intergenic FHIT,PTPRG 0.001 1 2 3 61326341 C T intergenic FHIT,PTPRG 0.004 2 4 3 61326620 T C intergenic FHIT,PTPRG 0.01 1 2 3 61327649 G C intergenic FHIT,PTPRG 0.001 2 4 3 61330545 G C intergenic FHIT,PTPRG 0.001 2 4 3 61335803 G A intergenic FHIT,PTPRG 0.001 2 4 A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.
Use a materialized table
When working with big data at scale, you can materialize an interval table
and run a JOIN
query against the new table. Before continuing with the rest
of this section, create a dataset by following these steps:
Open the BigQuery page in the Google Cloud console.
In the Explorer panel, select the project where you want to create the dataset.
Expand the
Actions option and click Create dataset.On the Create dataset page:
- For Dataset ID, enter
genomics
. - Leave the other default settings in place.
- Click Create dataset.
- For Dataset ID, enter
Query a materialized table with specific genes
The following steps show how to materialize a new interval table that contains a list of specific genes from the silver-wall-555:TuteTable.hg19 table.
To create the interval table, follow these steps:
Open the BigQuery page in the Google Cloud console.
Click Compose query.
In the New query field, run the following query. The query materializes part of the silver-wall-555:TuteTable.hg19 table to a new
genomics.myIntervalTable
interval table.#standardSQL CREATE TABLE `genomics.myIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start)-100000 AS region_start, MAX(`End`)+100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4', 'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3', 'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA', 'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11', 'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX') GROUP BY Chr, Gene );
Click Run query. The query returns the following result:
This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
In the New query field, run the following query:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.myIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
Expand the following section to see the results of the query:
Query results
Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 156699757 T C intronic RRNAD1 0.002 2 4 1 156705390 C T intronic RRNAD1 8.0E-4 0 2 1 156714207 T C intronic HDGF 0.003 0 6 1 156714440 A C intronic HDGF 0.0068 0 12 1 156723870 C T intergenic HDGF,PRCC 0.006 1 2 1 156724456 C T intergenic HDGF,PRCC 0.002 2 4 1 156733988 C T intergenic HDGF,PRCC 0.001 1 2 1 156742258 T G intronic PRCC 0.001 2 4 1 156744826 T G intronic PRCC 0.002 0 8 1 156779764 G A intronic SH2D2A 0.001 2 4 1 156783454 A C intronic SH2D2A 0.0014 1 2 1 156786144 C T intronic NTRK1,SH2D2A 0.0031 2 4 1 156790510 A T intronic NTRK1 0.002 1 2 1 156815332 A C intronic INSRR,NTRK1 0.003 0 2 1 156830778 G A exonic NTRK1 0.0067 missense 2 4 1 156842064 C T intronic NTRK1 0.0014 1 2 1 156843438 C A exonic NTRK1 0.0032 missense 1 2 1 156845773 C T intronic NTRK1 0.001 2 4 1 156873318 T C intronic PEAR1 0.01 4 8 1 156922740 G A intronic ARHGEF11 0.007 1 2 1 156930100 C T intronic ARHGEF11 0.001 2 4 2 113901230 G A intergenic IL1RN,PSD4 0.0082 1 2 2 113953418 C A intronic PSD4 0.001 2 4 2 113967621 G C intergenic PSD4,PAX8 0.002 0 6 2 113967624 T C intergenic PSD4,PAX8 0.002 0 2 2 113980967 G A intronic PAX8 0.002 2 4 2 113994010 A C ncRNA_exonic PAX8-AS1 0.001 0 4 2 113997745 C A ncRNA_exonic PAX8-AS1 0.001 2 4 2 114061327 T C intergenic PAX8,CBWD2 0.001 2 4 2 114084018 A C intergenic PAX8,CBWD2 0.0045 0 4 2 114099037 G A intergenic PAX8,CBWD2 0.0051 1 2 2 114105670 A T intergenic PAX8,CBWD2 0.001 1 2 2 114111325 G T intergenic PAX8,CBWD2 0.001 1 2 3 12265797 C T intergenic SYN2,PPARG 0.0089 2 4 3 12277958 A G intergenic SYN2,PPARG 0.002 1 2 3 12296019 G A intergenic SYN2,PPARG 0.002 2 4 3 12316549 G C intergenic SYN2,PPARG 0.002 1 2 3 12335681 T G intronic PPARG 0.0092 2 4 3 12348795 T C intronic PPARG 0.0014 1 2 3 12353106 T C intronic PPARG 0.001 2 4 3 12403825 G A intronic PPARG 0.0051 2 4 3 12404394 G A intronic PPARG 0.001 1 2 3 12410289 G A intronic PPARG 0.008 2 4 3 12431381 C T intronic PPARG 0.0061 2 4 3 12447267 G A intronic PPARG 0.0089 2 4 3 12449379 C T intronic PPARG 0.0092 2 4 3 12450848 C A intronic PPARG 0.0092 2 4 3 12462847 T C intronic PPARG 0.002 1 2 3 12492797 G A intergenic PPARG,TSEN2 0.01 1 2 3 12503201 G A intergenic PPARG,TSEN2 0.0099 2 4 3 12530460 A G intronic TSEN2 0.0092 2 4 3 12531167 A G intronic TSEN2 0.0099 2 4 3 12557737 A G intronic TSEN2 0.001 2 4 3 59636143 A G intergenic C3orf67,FHIT 0.003 3 6 3 59645934 A C intergenic C3orf67,FHIT 0.004 1 2 3 59646893 G A intergenic C3orf67,FHIT 0.002 1 2 3 59697024 A G intergenic C3orf67,FHIT 0.0072 1 2 3 59701013 G A intergenic C3orf67,FHIT 0.004 2 4 3 59733945 A G intergenic C3orf67,FHIT 0.001 2 4 3 59747482 C T intronic FHIT 0.001 2 4 3 59750635 A G intronic FHIT 0.003 1 2 3 59757776 C T intronic FHIT 0.001 2 4 3 59770612 G A intronic FHIT 0.001 2 4 3 59804444 G C intronic FHIT 0.001 2 4 3 59819769 T C intronic FHIT 0.001 2 4 3 59884396 C T intronic FHIT 0.001 2 4 3 59960728 A C intronic FHIT 0.01 1 2 3 59970345 G A intronic FHIT 0.002 1 2 3 59972417 T A intronic FHIT 0.0072 0 2 3 60104328 C A intronic FHIT 0.01 2 4 3 60139062 G A intronic FHIT 0.01 0 2 3 60158066 C T intronic FHIT 0.001 1 2 3 60169285 C T intronic FHIT 0.005 1 2 3 60216185 T C intronic FHIT 0.002 1 2 3 60226380 G A intronic FHIT 0.007 2 4 3 60234539 C A intronic FHIT 0.002 1 2 3 60247464 A C intronic FHIT 0.004 2 4 3 60269926 A G intronic FHIT 0.007 2 4 3 60271228 G T intronic FHIT 0.007 2 4 3 60286972 T C intronic FHIT 0.001 2 4 3 60301412 C G intronic FHIT 0.001 1 2 3 60312251 C T intronic FHIT 0.0099 1 2 3 60317682 A G intronic FHIT 0.008 1 2 3 60328557 C G intronic FHIT 0.0043 2 4 3 60342562 C T intronic FHIT 0.006 1 2 3 60400033 G A intronic FHIT 0.004 2 4 3 60435819 C T intronic FHIT 0.006 2 4 3 60435820 G T intronic FHIT 0.004 1 2 3 60441288 T C intronic FHIT 0.006 2 4 3 60444465 C A intronic FHIT 0.01 1 2 3 60444575 C T intronic FHIT 0.001 1 2 3 60450581 T C intronic FHIT 0.01 1 2 3 60456571 G A intronic FHIT 0.001 2 4 3 60473568 C G intronic FHIT 0.001 1 2 3 60487557 T C intronic FHIT 0.001 1 2 3 60559705 A G intronic FHIT 0.002 2 4 3 60570764 T C intronic FHIT 0.008 2 4 3 60582100 C T intronic FHIT 0.001 1 2 3 60587192 G A intronic FHIT 0.004 1 2 3 60599869 G A intronic FHIT 0.0086 2 4 3 60603091 C T intronic FHIT 0.001 2 4 3 60603250 A T intronic FHIT 0.0099 1 2 3 60609831 T G intronic FHIT 0.001 2 4 3 60619756 G T intronic FHIT 0.0015 2 4 3 60680758 C T intronic FHIT 0.0089 2 4 3 60702243 G C intronic FHIT 0.001 2 4 3 60702532 A G intronic FHIT 0.001 1 2 3 60714328 A T intronic FHIT 0.004 1 2 3 60725297 G A intronic FHIT 0.001 1 2 3 60726640 G A intronic FHIT 0.01 2 4 3 60795144 A G intronic FHIT 0.001 2 4 3 60807171 A G intronic FHIT 0.001 1 2 3 60813868 T C intronic FHIT 0.001 1 2 3 60826546 C G intronic FHIT 0.0023 1 2 3 60837392 C T intronic FHIT 0.001 1 2 3 60846310 A G intronic FHIT 0.01 0 2 3 60850985 C T intronic FHIT 0.004 1 2 3 60852559 T C intronic FHIT 0.008 1 2 3 60871759 T C intronic FHIT 0.004 1 2 3 60884396 C T intronic FHIT 0.002 2 4 3 60897092 C A intronic FHIT 0.001 2 4 3 60940759 C T intronic FHIT 0.0089 1 2 3 60982595 A G intronic FHIT 0.003 2 4 3 60999283 G A intronic FHIT 0.001 1 2 3 61042977 A G intronic FHIT 0.001 2 4 3 61043349 T C intronic FHIT 0.001 2 4 3 61044789 A C intronic FHIT 0.001 2 4 3 61141621 G A intronic FHIT 0.003 1 2 3 61148655 G C intronic FHIT 0.001 2 4 3 61170747 C T intronic FHIT 0.003 1 2 3 61189473 C G intronic FHIT 0.0099 1 2 3 61190425 C T intronic FHIT 0.0023 2 4 3 61193853 C T intronic FHIT 0.0099 0 2 3 61194793 C T intronic FHIT 0.007 0 2 3 61194840 A G intronic FHIT 0.0099 0 2 3 61194886 T A intronic FHIT 0.0099 0 2 3 61201777 C T intronic FHIT 0.001 2 4 3 61202292 T C intronic FHIT 0.007 1 2 3 61232806 G C intronic FHIT 0.0099 1 2 3 61232910 C T intronic FHIT 0.0099 1 2 3 61235824 A T intronic FHIT 0.001 2 4 3 61283810 A C intergenic FHIT,PTPRG 0.0089 1 2 3 61293731 T A intergenic FHIT,PTPRG 0.0089 2 4 3 61296730 C T intergenic FHIT,PTPRG 0.001 1 2 3 61326341 C T intergenic FHIT,PTPRG 0.004 2 4 3 61326620 T C intergenic FHIT,PTPRG 0.01 1 2 3 61327649 G C intergenic FHIT,PTPRG 0.001 2 4 3 61330545 G C intergenic FHIT,PTPRG 0.001 2 4 3 61335803 G A intergenic FHIT,PTPRG 0.001 2 4 A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.
Query a materialized table with 250 random genes
The following example shows how to run an interval JOIN
on a materialized
table that contains 250 genes randomly selected from the
silver-wall-555:TuteTable.hg19 table.
To create the interval table, follow these steps:
Open the BigQuery page in the Google Cloud console.
Click Compose query.
In the New query field, run the following query, which materializes part of the silver-wall-555:TuteTable.hg19 table to a new
genomics.randomGenesIntervalTable
interval table.#standardSQL CREATE TABLE `genomics.randomGenesIntervalTable` AS ( SELECT Gene, Chr, MIN(Start) AS gene_start, MAX(`End`) AS gene_end, MIN(Start) - 100000 AS region_start, MAX(`End`) + 100000 AS region_end FROM `silver-wall-555.TuteTable.hg19` WHERE Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250) GROUP BY Chr, Gene );
- Click Run query. The query returns the following result:
This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
In the New query field, run the following query:
#standardSQL WITH -- -- Retrieve the variants in this cohort, flattening by alternate bases and -- counting affected alleles. variants AS ( SELECT REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt, (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles, (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles FROM `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v, UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ), -- -- JOIN the variants with the genomic intervals overlapping -- the genes of interest. -- -- The JOIN criteria is complicated because the task is to see if -- an SNP overlaps an interval. With standard SQL you can use complex -- JOIN predicates, including arbitrary expressions. gene_variants AS ( SELECT reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles FROM variants INNER JOIN `genomics.randomGenesIntervalTable` AS intervals ON variants.reference_name = intervals.Chr AND intervals.region_start <= variants.start_position AND intervals.region_end >= variants.end_position ) -- -- And finally JOIN the variants in the regions of interest -- with annotations for rare variants. SELECT DISTINCT Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles FROM `silver-wall-555.TuteTable.hg19` AS annots INNER JOIN gene_variants AS vars ON vars.reference_name = annots.Chr AND vars.start_position = annots.Start AND vars.reference_bases = annots.Ref AND vars.alt = annots.Alt WHERE -- Retrieve annotations for rare variants only. PopFreqMax <= 0.01 ORDER BY Chr, Start;
Click Run query. The query takes about ten seconds to run, and processes about 334 GB of data. The results of the query identify rare variants within the cohort that overlap the regions of interest.
Expand the following section to see the truncated results of the query:
Query results
Chr Start Ref Alt Func Gene PopFreqMax ExonicFunc num_variant_alleles total_num_alleles 1 2925355 C A intergenic TTC34,ACTRT2 0.001 2 4 1 2933170 G A intergenic TTC34,ACTRT2 0.0083 0 4 1 2944477 G A intergenic ACTRT2,LINC00982 0.003 4 6 1 2967591 A T intergenic ACTRT2,LINC00982 0.0092 1 2 1 2975255 T C downstream LINC00982 0.0082 1 2 1 2977223 C T ncRNA_intronic LINC00982 0.0072 1 2 1 2978803 G C ncRNA_exonic LINC00982 0.002 4 6 1 3006466 G A intronic PRDM16 0.0098 1 2 1 3011333 G T intronic PRDM16 0.004 1 2 1 3019659 C T intronic PRDM16 0.0031 1 2 1 3036896 G A intronic PRDM16 0.001 1 2 1 3037388 G A intronic PRDM16 0.002 2 4 1 3041250 T G intronic PRDM16 0.006 2 4 1 3042502 A T intronic PRDM16 0.003 4 6 1 3053713 A C intronic PRDM16 0.002 1 2 1 3063109 C T intronic PRDM16 0.002 0 2 1 3063593 T C intronic PRDM16 0.003 1 2 1 3076439 C T intronic PRDM16 0.001 2 4 1 3078960 G A intronic PRDM16 0.007 2 4 1 3084268 A C intronic PRDM16 0.005 0 2 1 3084492 T C intronic PRDM16 0.0015 0 2 1 3084786 T C intronic PRDM16 0.0015 0 4 1 3111119 G A intronic PRDM16 0.003 1 2 1 3111643 C T intronic PRDM16 0.0041 1 2 1 3114807 G A intronic PRDM16 0.0041 1 2 1 3165530 C T intronic PRDM16 0.0089 1 2 1 3169325 G A intronic PRDM16 0.008 2 4 1 3179623 C T intronic PRDM16 0.003 2 4 1 3181097 C T intronic PRDM16 0.001 2 4 1 3194000 G C intronic PRDM16 0.005 2 4 1 3195769 T C intronic PRDM16 0.002 1 2 1 3197351 C T intronic PRDM16 0.0061 1 2 1 3224100 C A intronic PRDM16 0.003 2 4 1 3228644 G T intronic PRDM16 0.001 2 4 1 3234045 G A intronic PRDM16 0.002 1 2 1 3235971 G A intronic PRDM16 0.0089 1 2 1 3274115 C T intronic PRDM16 0.001 2 4 1 3291388 G A intronic PRDM16 0.002 2 4 1 3295658 A C intronic PRDM16 0.0068 0 6 1 3295937 A C intronic PRDM16 0.0068 0 2 1 3296205 T C intronic PRDM16 0.0083 0 2 1 3315690 G A intronic PRDM16 0.001 2 4 1 3329212 G A exonic PRDM16 0.0031 missense 1 2 1 3331787 C T intronic PRDM16 0.0099 1 2 1 3370316 G C upstream ARHGEF16 0.001 2 4 1 3379560 A G intronic ARHGEF16 0.0051 0 6 1 3391174 C T intronic ARHGEF16 0.006 1 2 1 3413873 G A exonic MEGF6 0.003 missense 1 2 1 3416272 C T exonic MEGF6 0.0072 silent 2 4 1 3417122 G A intronic MEGF6 0.0038 2 4 1 3436219 G A intronic MEGF6 0.0046 2 4 1 12907456 A G exonic HNRNPCL1,LOC649330 0.006 missense 0 10 1 12907518 C A exonic HNRNPCL1,LOC649330 1.0E-4 missense 0 10 1 12908499 G C intronic HNRNPCL1 0.0031 0 8 1 12931660 G C intergenic PRAMEF2,PRAMEF4 0.004 1 2 1 12937721 G T intergenic PRAMEF2,PRAMEF4 0.0038 0 2 1 12940827 G T intronic PRAMEF4 0.007 2 4 1 12942759 T G intronic PRAMEF4 0.0076 0 10 1 12942805 T G intronic PRAMEF4 0.0061 0 12 1 12942812 G A intronic PRAMEF4 0.0061 0 12 1 12942875 A G intronic PRAMEF4 0.0068 0 6 1 12942912 G C intronic PRAMEF4 2.0E-4 0 2 1 12942937 A T exonic PRAMEF4 0.0029 missense 0 2 1 12942940 T G exonic PRAMEF4 0.0038 missense 0 2 1 12943940 T C intronic PRAMEF4 0.0015 0 12 1 12944138 A G intronic PRAMEF4 8.0E-4 0 12 1 12944234 G A intronic PRAMEF4 0.0015 0 12 1 12944589 T G intronic PRAMEF4 0.003 0 4 1 12944845 A C intronic PRAMEF4 0.0014 0 6 1 12946439 T C upstream PRAMEF4 0.0029 0 10 1 12946833 G A upstream PRAMEF4 0.001 0 8 1 12946835 T A upstream PRAMEF4 0.004 0 12 1 12995204 G T intergenic PRAMEF8,PRAMEF6 0.003 1 4 1 12997638 T C downstream PRAMEF6,PRAMEF9 0.003 2 4 1 13007841 G C upstream PRAMEF6 0.0043 0 8 1 13019228 T A intergenic PRAMEF6,LOC391003 0.0015 0 10 1 13038503 G A UTR3 LOC391003 0.0072 1 2 1 13051650 C T intergenic LOC391003,PRAMEF5 0.002 2 4 1 15706063 G A intronic FHAD1 0.0029 1 2 1 15713292 C T intronic FHAD1 0.001 1 2 1 15766541 G C intronic CTRC 0.001 1 2 1 15782601 T C upstream CELA2A 0.0038 1 2 1 15828125 G A intronic CASP9 0.0014 2 4 1 15831037 G A intronic CASP9 0.0099 1 2 1 15840513 T G intronic CASP9 0.0043 2 4 1 15868742 G A intronic DNAJC16 0.001 1 2 1 15876704 G A intronic DNAJC16 0.001 1 2 1 15900342 C A intronic AGMAT 0.001 1 2 1 15906257 T C intronic AGMAT 8.0E-4 1 2 1 15911897 A G upstream AGMAT 0.0043 2 4 1 22764178 C T intergenic WNT4,ZBTB40 0.001 2 4 1 22791939 C T intronic ZBTB40 0.0089 2 4 1 22874394 C G intergenic ZBTB40,EPHA8 0.007 1 2 1 22875103 C G intergenic ZBTB40,EPHA8 0.007 1 2 1 22906403 C T intronic EPHA8 0.008 2 4 1 22912956 G A intronic EPHA8 0.001 1 2 1 22917007 C T intronic EPHA8 0.001 2 4 1 22927240 G A exonic EPHA8 0.0013 missense 2 4 1 22932265 G A intergenic EPHA8,MIR6127 0.0089 2 4 1 22944057 C T intergenic EPHA8,MIR6127 0.0089 2 4 1 22978799 A G upstream C1QB 0.0099 2 4 1 35170588 C T intergenic C1orf94,GJB5 0.01 1 2 1 35172426 C T intergenic C1orf94,GJB5 0.008 1 2 1 35172447 G A intergenic C1orf94,GJB5 0.001 1 2 1 35175302 C T intergenic C1orf94,GJB5 0.008 1 2 1 35177410 A T intergenic C1orf94,GJB5 0.001 1 2 1 35178768 C T intergenic C1orf94,GJB5 0.0014 2 4 1 35179362 G A intergenic C1orf94,GJB5 0.0014 2 4 1 35186166 G A intergenic C1orf94,GJB5 0.0099 2 4 1 35186520 A C intergenic C1orf94,GJB5 0.002 2 4 1 35196361 G A intergenic C1orf94,GJB5 0.0099 2 4 1 35223545 C T exonic GJB5 0.001 silent 1 2 1 35224029 G A UTR3 GJB5 0.003 1 2 1 35227895 T C UTR3 GJB4 5.0E-4 1 2 1 35230455 G T intergenic GJB4,GJB3 0.0043 1 2 1 35232954 T C intergenic GJB4,GJB3 0.003 1 2 1 35237986 G A intergenic GJB4,GJB3 0.0014 1 2 1 35245522 C T intergenic GJB4,GJB3 0.001 1 2 1 35256979 C T intergenic GJB3,GJA4 0.002 2 4 1 35263872 C T intergenic GJA4,SMIM12 5.0E-4 2 4 1 35323895 A C intronic SMIM12 0.0027 2 4 1 35369676 G A intronic DLGAP3 0.007 2 4 1 35371634 T A upstream DLGAP3 0.0015 0 4 1 39253519 G A intergenic LINC01343,RRAGC 0.005 1 2 1 39288829 G A intergenic LINC01343,RRAGC 0.0051 1 2 1 39289832 A C intergenic LINC01343,RRAGC 0.002 0 2 1 39312638 G A intronic RRAGC 0.0038 2 4 1 39361372 G A intronic RHBDL2 0.005 1 2 1 39363826 T G intronic RHBDL2 0.0029 1 2 1 39367555 T C intronic RHBDL2 0.007 2 4 1 39369531 T C intronic RHBDL2 0.001 2 4 1 39370202 T C intronic RHBDL2 0.01 1 2 1 39449101 A G intergenic RHBDL2,AKIRIN1 0.001 2 4 1 39475057 G A intergenic AKIRIN1,NDUFS5 0.01 1 2 1 39485016 C T intergenic AKIRIN1,NDUFS5 0.001 2 4 1 39488137 A G intergenic AKIRIN1,NDUFS5 0.001 2 4 1 39499212 A C intronic NDUFS5 0.001 0 2 1 39500605 C G downstream NDUFS5 0.002 0 10 1 46813814 T C intronic NSUN4 0.0014 1 2 1 46817258 A G intronic NSUN4 0.005 0 2 1 46843158 T C intergenic NSUN4,FAAH 0.001 1 2 1 46933509 A G intergenic LINC01398,DMBX1 0.002 1 2 1 46935021 G A intergenic LINC01398,DMBX1 0.004 2 4 1 46939253 T A intergenic LINC01398,DMBX1 0.004 2 4 1 46951788 C A intergenic LINC01398,DMBX1 0.002 2 4 1 46980864 G C downstream DMBX1 0.003 1 2 1 46989657 T C intergenic DMBX1,MKNK1-AS1 0.007 1 2 1 46994678 C T intergenic DMBX1,MKNK1-AS1 0.002 1 2 1 46999438 T C intergenic DMBX1,MKNK1-AS1 0.002 1 2 1 92761505 A G intronic GLMN 0.001 2 4 1 92764270 G C intronic GLMN 0.001 2 4 1 92802210 G A intronic RPAP2 0.0072 1 2 1 92820663 T A intronic RPAP2 0.0058 1 2 1 92820664 G T intronic RPAP2 0.0058 1 2 1 92820953 G A intronic RPAP2 0.007 2 4 1 92824766 A G intronic RPAP2 0.0058 1 2 1 92849183 C A intronic RPAP2 0.01 2 4 1 92850696 C G intronic RPAP2 0.0023 1 2 1 92861357 T C intergenic RPAP2,GFI1 0.01 2 4 1 92877460 C G intergenic RPAP2,GFI1 0.002 1 2 1 92880643 A G intergenic RPAP2,GFI1 0.001 2 4 1 92911540 G A intergenic RPAP2,GFI1 0.004 2 4 1 92911721 A C intergenic RPAP2,GFI1 0.0031 0 8 1 92918277 C T intergenic RPAP2,GFI1 0.001 2 4 1 92950920 G A intronic GFI1 0.008 2 4 1 92964788 G A intergenic GFI1,EVI5 0.0023 1 2 1 92977480 C T UTR3 EVI5 0.002 1 2 1 92985213 C T intronic EVI5 0.001 2 4 1 92988342 C T intronic EVI5 0.008 2 4 1 92992283 G A intronic EVI5 0.01 2 4 1 92999760 C T intronic EVI5 0.003 1 2 1 93005149 G C intronic EVI5 0.003 0 4 1 93018543 A T intronic EVI5 0.01 2 4 1 93033744 C T intronic EVI5 0.001 2 4 1 111400296 G A intergenic KCNA3,CD53 0.0014 2 4 1 111411924 C T intergenic KCNA3,CD53 0.003 1 2 1 111441850 C G UTR3 CD53 0.003 2 4 1 111451527 C T intergenic CD53,LRIF1 0.008 2 4 1 111454082 C A intergenic CD53,LRIF1 0.001 2 4 1 111466506 A G intergenic CD53,LRIF1 0.001 2 4 1 111525974 G A intergenic LRIF1,DRAM2 0.002 2 4 1 111574573 G T intergenic LRIF1,DRAM2 0.0072 2 4 1 111574594 T A intergenic LRIF1,DRAM2 0.005 1 2 1 111574647 G A intergenic LRIF1,DRAM2 0.005 1 2 1 111591746 T A intergenic LRIF1,DRAM2 0.005 1 2 1 111601459 A G intergenic LRIF1,DRAM2 0.005 1 2 1 111604748 G C intergenic LRIF1,DRAM2 0.005 1 2 1 112191526 T G intronic RAP1A 0.001 2 4 1 112206765 A G intronic RAP1A 0.0043 1 2 1 112226517 G A intronic RAP1A 0.001 0 2 1 112263324 G T intergenic RAP1A,FAM212B 0.003 2 4 1 112264843 G A UTR3 FAM212B 0.001 1 2 1 112285810 C T ncRNA_intronic FAM212B-AS1 0.004 1 2 1 112304285 T C intronic DDX20 0.0043 1 2 1 112307213 A C intronic DDX20 0.0043 1 2 1 112309436 G T exonic DDX20 0.0 missense 1 2 1 112317384 T C intergenic DDX20,KCND3 0.0014 1 2 1 112381367 C T intronic KCND3 0.002 1 2 1 112396571 G T ncRNA_exonic KCND3-IT1 0.001 1 2 1 113520038 G A intergenic SLC16A1-AS1,LOC100996251 0.0023 1 2 A similar query using data from 1,000 Genomes Phase 3 takes about 90 seconds to run and processes about 3.38 TB of data.