Jalankan JOIN interval dengan BigQuery

Anda dapat menggunakan BigQuery untuk menjalankan kueri JOIN pada varian dengan data yang dijelaskan berdasarkan interval wilayah genomik, atau tumpang-tindih. Halaman ini menunjukkan cara menggunakan kueri JOIN yang kompleks untuk mengambil daftar nama gen dan melakukan hal berikut:

  • Temukan TPM langka yang tumpang tindih dengan gen.
  • Temukan 100.000 pasangan basa di kedua sisi gen untuk seluruh sampel genom.

Panduan ini menunjukkan contoh untuk tiga kueri. Setiap kueri menunjukkan cara BigQuery melakukan penskalaan pada berbagai ukuran data genomik:

Datanya berasal dari tabel Anotasi Genomics Tute Genomics sebanyak hampir 9 miliar baris dan set data Illumina Platinum Genomes. Jika Anda tidak terbiasa dengan {i>dataset<i} ini, lihat tautan berikut:

Membuat kueri tabel inline

Contoh berikut menggunakan tabel interval yang ditentukan dalam kueri, yang disebut intervals, dan menunjukkan cara menjalankan kueri JOIN dengan tabel yang berisi varian dari Illumina Platinum Genomes:

  1. Buka halaman BigQuery di Konsol Google Cloud.

    Buka halaman BigQuery

  2. Klik Compose query.

  3. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- Define an inline table that uses five rows
      -- selected from silver-wall-555.TuteTable.hg19.
      intervals AS (
        SELECT * FROM UNNEST ([
        STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
        ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
        ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
        ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
        ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
        ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
      ])),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  4. Klik Jalankan kueri. Kueri membutuhkan waktu sekitar sepuluh detik untuk dijalankan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kohor yang tumpang-tindih dengan wilayah minat.

    Luaskan bagian berikut untuk melihat hasil kueri:

    Hasil kueri

    Chr Mulai Ref Alt Funk Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronik RRNAD1 0,002 2 4
    1 156705390 C T intronik RRNAD1 8,0E-4 0 2
    1 156714207 T C intronik HDGF 0,003 0 6
    1 156714440 A C intronik HDGF 0,0068 0 12
    1 156723870 C T intergenik HDGF,PRCC 0,006 1 2
    1 156724456 C T intergenik HDGF,PRCC 0,002 2 4
    1 156733988 C T intergenik HDGF,PRCC 0,001 1 2
    1 156742258 T G intronik PR 0,001 2 4
    1 156744826 T G intronik PR 0,002 0 8
    1 156779764 G A intronik SH2D2A 0,001 2 4
    1 156783454 A C intronik SH2D2A 0,0014 1 2
    1 156786144 C T intronik NTRK1,SH2D2A 0,0031 2 4
    1 156790510 A T intronik NTRK1 0,002 1 2
    1 156815332 A C intronik INSRR,NTRK1 0,003 0 2
    1 156830778 G A eksonik NTRK1 0,0067 Misense 2 4
    1 156842064 C T intronik NTRK1 0,0014 1 2
    1 156843438 C A eksonik NTRK1 0,0032 Misense 1 2
    1 156845773 C T intronik NTRK1 0,001 2 4
    1 156873318 T C intronik PEAR1 0,01 4 8
    1 156922740 G A intronik ARHGEF11 0,007 1 2
    1 156930100 C T intronik ARHGEF11 0,001 2 4
    2 113901230 G A intergenik IL1RN,PSD4 0,0082 1 2
    2 113953418 C A intronik PSD4 0,001 2 4
    2 113967621 G C intergenik PSD4,PAX8 0,002 0 6
    2 113967624 T C intergenik PSD4,PAX8 0,002 0 2
    2 113980967 G A intronik PAX8 0,002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0,001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0,001 2 4
    2 114061327 T C intergenik PAX8,CBWD2 0,001 2 4
    2 114084018 A C intergenik PAX8,CBWD2 0,0045 0 4
    2 114099037 G A intergenik PAX8,CBWD2 0,0051 1 2
    2 114105670 A T intergenik PAX8,CBWD2 0,001 1 2
    2 114111325 G T intergenik PAX8,CBWD2 0,001 1 2
    3 12265797 C T intergenik SYN2,PPARG 0,0089 2 4
    3 12277958 A G intergenik SYN2,PPARG 0,002 1 2
    3 12296019 G A intergenik SYN2,PPARG 0,002 2 4
    3 12316549 G C intergenik SYN2,PPARG 0,002 1 2
    3 12335681 T G intronik PPARG 0,0092 2 4
    3 12348795 T C intronik PPARG 0,0014 1 2
    3 12353106 T C intronik PPARG 0,001 2 4
    3 12403825 G A intronik PPARG 0,0051 2 4
    3 12404394 G A intronik PPARG 0,001 1 2
    3 12410289 G A intronik PPARG 0,008 2 4
    3 12431381 C T intronik PPARG 0,0061 2 4
    3 12447267 G A intronik PPARG 0,0089 2 4
    3 12449379 C T intronik PPARG 0,0092 2 4
    3 12450848 C A intronik PPARG 0,0092 2 4
    3 12462847 T C intronik PPARG 0,002 1 2
    3 12492797 G A intergenik PPARG,TSEN2 0,01 1 2
    3 12503201 G A intergenik PPARG,TSEN2 0,0099 2 4
    3 12530460 A G intronik TSEN2 0,0092 2 4
    3 12531167 A G intronik TSEN2 0,0099 2 4
    3 12557737 A G intronik TSEN2 0,001 2 4
    3 59636143 A G intergenik C3orf67,FHIT 0,003 3 6
    3 59645934 A C intergenik C3orf67,FHIT 0,004 1 2
    3 59646893 G A intergenik C3orf67,FHIT 0,002 1 2
    3 59697024 A G intergenik C3orf67,FHIT 0,0072 1 2
    3 59701013 G A intergenik C3orf67,FHIT 0,004 2 4
    3 59733945 A G intergenik C3orf67,FHIT 0,001 2 4
    3 59747482 C T intronik FHIT 0,001 2 4
    3 59750635 A G intronik FHIT 0,003 1 2
    3 59757776 C T intronik FHIT 0,001 2 4
    3 59770612 G A intronik FHIT 0,001 2 4
    3 59804444 G C intronik FHIT 0,001 2 4
    3 59819769 T C intronik FHIT 0,001 2 4
    3 59884396 C T intronik FHIT 0,001 2 4
    3 59960728 A C intronik FHIT 0,01 1 2
    3 59970345 G A intronik FHIT 0,002 1 2
    3 59972417 T A intronik FHIT 0,0072 0 2
    3 60104328 C A intronik FHIT 0,01 2 4
    3 60139062 G A intronik FHIT 0,01 0 2
    3 60158066 C T intronik FHIT 0,001 1 2
    3 60169285 C T intronik FHIT 0,005 1 2
    3 60216185 T C intronik FHIT 0,002 1 2
    3 60226380 G A intronik FHIT 0,007 2 4
    3 60234539 C A intronik FHIT 0,002 1 2
    3 60247464 A C intronik FHIT 0,004 2 4
    3 60269926 A G intronik FHIT 0,007 2 4
    3 60271228 G T intronik FHIT 0,007 2 4
    3 60286972 T C intronik FHIT 0,001 2 4
    3 60301412 C G intronik FHIT 0,001 1 2
    3 60312251 C T intronik FHIT 0,0099 1 2
    3 60317682 A G intronik FHIT 0,008 1 2
    3 60328557 C G intronik FHIT 0,0043 2 4
    3 60342562 C T intronik FHIT 0,006 1 2
    3 60400033 G A intronik FHIT 0,004 2 4
    3 60435819 C T intronik FHIT 0,006 2 4
    3 60435820 G T intronik FHIT 0,004 1 2
    3 60441288 T C intronik FHIT 0,006 2 4
    3 60444465 C A intronik FHIT 0,01 1 2
    3 60444575 C T intronik FHIT 0,001 1 2
    3 60450581 T C intronik FHIT 0,01 1 2
    3 60456571 G A intronik FHIT 0,001 2 4
    3 60473568 C G intronik FHIT 0,001 1 2
    3 60487557 T C intronik FHIT 0,001 1 2
    3 60559705 A G intronik FHIT 0,002 2 4
    3 60570764 T C intronik FHIT 0,008 2 4
    3 60582100 C T intronik FHIT 0,001 1 2
    3 60587192 G A intronik FHIT 0,004 1 2
    3 60599869 G A intronik FHIT 0,0086 2 4
    3 60603091 C T intronik FHIT 0,001 2 4
    3 60603250 A T intronik FHIT 0,0099 1 2
    3 60609831 T G intronik FHIT 0,001 2 4
    3 60619756 G T intronik FHIT 0,0015 2 4
    3 60680758 C T intronik FHIT 0,0089 2 4
    3 60702243 G C intronik FHIT 0,001 2 4
    3 60702532 A G intronik FHIT 0,001 1 2
    3 60714328 A T intronik FHIT 0,004 1 2
    3 60725297 G A intronik FHIT 0,001 1 2
    3 60726640 G A intronik FHIT 0,01 2 4
    3 60795144 A G intronik FHIT 0,001 2 4
    3 60807171 A G intronik FHIT 0,001 1 2
    3 60813868 T C intronik FHIT 0,001 1 2
    3 60826546 C G intronik FHIT 0,0023 1 2
    3 60837392 C T intronik FHIT 0,001 1 2
    3 60846310 A G intronik FHIT 0,01 0 2
    3 60850985 C T intronik FHIT 0,004 1 2
    3 60852559 T C intronik FHIT 0,008 1 2
    3 60871759 T C intronik FHIT 0,004 1 2
    3 60884396 C T intronik FHIT 0,002 2 4
    3 60897092 C A intronik FHIT 0,001 2 4
    3 60940759 C T intronik FHIT 0,0089 1 2
    3 60982595 A G intronik FHIT 0,003 2 4
    3 60999283 G A intronik FHIT 0,001 1 2
    3 61042977 A G intronik FHIT 0,001 2 4
    3 61043349 T C intronik FHIT 0,001 2 4
    3 61044789 A C intronik FHIT 0,001 2 4
    3 61141621 G A intronik FHIT 0,003 1 2
    3 61148655 G C intronik FHIT 0,001 2 4
    3 61170747 C T intronik FHIT 0,003 1 2
    3 61189473 C G intronik FHIT 0,0099 1 2
    3 61190425 C T intronik FHIT 0,0023 2 4
    3 61193853 C T intronik FHIT 0,0099 0 2
    3 61194793 C T intronik FHIT 0,007 0 2
    3 61194840 A G intronik FHIT 0,0099 0 2
    3 61194886 T A intronik FHIT 0,0099 0 2
    3 61201777 C T intronik FHIT 0,001 2 4
    3 61202292 T C intronik FHIT 0,007 1 2
    3 61232806 G C intronik FHIT 0,0099 1 2
    3 61232910 C T intronik FHIT 0,0099 1 2
    3 61235824 A T intronik FHIT 0,001 2 4
    3 61283810 A C intergenik FHIT,PTPRG 0,0089 1 2
    3 61293731 T A intergenik FHIT,PTPRG 0,0089 2 4
    3 61296730 C T intergenik FHIT,PTPRG 0,001 1 2
    3 61326341 C T intergenik FHIT,PTPRG 0,004 2 4
    3 61326620 T C intergenik FHIT,PTPRG 0,01 1 2
    3 61327649 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61330545 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61335803 G A intergenik FHIT,PTPRG 0,001 2 4

    Kueri serupa yang menggunakan data dari 1.000 Genom Fase 3 memerlukan waktu sekitar 90 detik untuk dijalankan dan memproses sekitar 3,38 TB data.

Gunakan tabel terwujud

Saat menangani big data dalam skala besar, Anda dapat mewujudkan tabel interval dan menjalankan kueri JOIN terhadap tabel baru. Sebelum melanjutkan ke sisa bagian ini, buat {i>dataset<i} dengan mengikuti langkah-langkah berikut:

  1. Buka halaman BigQuery di Konsol Google Cloud.

    Buka halaman BigQuery

  2. Di panel Explorer, pilih project tempat Anda ingin membuat set data.

  3. Luaskan opsi Actions, lalu klik Create dataset.

  4. Di halaman Create dataset:

    1. Untuk Dataset ID, masukkan genomics.
    2. Biarkan setelan default lainnya.
    3. Klik Create dataset.

Mengkueri tabel terwujud dengan gen tertentu

Langkah-langkah berikut menunjukkan cara mewujudkan tabel interval baru yang berisi daftar gen spesifik dari tabel silver-wall-555:TuteTable.hg19.

  1. Untuk membuat tabel interval, ikuti langkah-langkah berikut:

    1. Buka halaman BigQuery di Konsol Google Cloud.

      Buka halaman BigQuery

    2. Klik Compose query.

    3. Di kolom New query, jalankan kueri berikut. Kueri ini mewujudkan bagian dari tabel silver-wall-555:TuteTable.hg19 menjadi tabel interval genomics.myIntervalTable baru.

      #standardSQL
      CREATE TABLE `genomics.myIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start)-100000 AS region_start,
        MAX(`End`)+100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4',
        'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3',
        'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA',
        'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11',
        'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX')
      GROUP BY
        Chr,
        Gene );
      
    4. Klik Jalankan kueri. Kueri tersebut menampilkan hasil berikut:

    This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
    
  2. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.myIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  3. Klik Jalankan kueri. Kueri membutuhkan waktu sekitar sepuluh detik untuk dijalankan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kohor yang tumpang-tindih dengan wilayah minat.

    Luaskan bagian berikut untuk melihat hasil kueri:

    Hasil kueri

    Chr Mulai Ref Alt Funk Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 T C intronik RRNAD1 0,002 2 4
    1 156705390 C T intronik RRNAD1 8,0E-4 0 2
    1 156714207 T C intronik HDGF 0,003 0 6
    1 156714440 A C intronik HDGF 0,0068 0 12
    1 156723870 C T intergenik HDGF,PRCC 0,006 1 2
    1 156724456 C T intergenik HDGF,PRCC 0,002 2 4
    1 156733988 C T intergenik HDGF,PRCC 0,001 1 2
    1 156742258 T G intronik PR 0,001 2 4
    1 156744826 T G intronik PR 0,002 0 8
    1 156779764 G A intronik SH2D2A 0,001 2 4
    1 156783454 A C intronik SH2D2A 0,0014 1 2
    1 156786144 C T intronik NTRK1,SH2D2A 0,0031 2 4
    1 156790510 A T intronik NTRK1 0,002 1 2
    1 156815332 A C intronik INSRR,NTRK1 0,003 0 2
    1 156830778 G A eksonik NTRK1 0,0067 Misense 2 4
    1 156842064 C T intronik NTRK1 0,0014 1 2
    1 156843438 C A eksonik NTRK1 0,0032 Misense 1 2
    1 156845773 C T intronik NTRK1 0,001 2 4
    1 156873318 T C intronik PEAR1 0,01 4 8
    1 156922740 G A intronik ARHGEF11 0,007 1 2
    1 156930100 C T intronik ARHGEF11 0,001 2 4
    2 113901230 G A intergenik IL1RN,PSD4 0,0082 1 2
    2 113953418 C A intronik PSD4 0,001 2 4
    2 113967621 G C intergenik PSD4,PAX8 0,002 0 6
    2 113967624 T C intergenik PSD4,PAX8 0,002 0 2
    2 113980967 G A intronik PAX8 0,002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0,001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0,001 2 4
    2 114061327 T C intergenik PAX8,CBWD2 0,001 2 4
    2 114084018 A C intergenik PAX8,CBWD2 0,0045 0 4
    2 114099037 G A intergenik PAX8,CBWD2 0,0051 1 2
    2 114105670 A T intergenik PAX8,CBWD2 0,001 1 2
    2 114111325 G T intergenik PAX8,CBWD2 0,001 1 2
    3 12265797 C T intergenik SYN2,PPARG 0,0089 2 4
    3 12277958 A G intergenik SYN2,PPARG 0,002 1 2
    3 12296019 G A intergenik SYN2,PPARG 0,002 2 4
    3 12316549 G C intergenik SYN2,PPARG 0,002 1 2
    3 12335681 T G intronik PPARG 0,0092 2 4
    3 12348795 T C intronik PPARG 0,0014 1 2
    3 12353106 T C intronik PPARG 0,001 2 4
    3 12403825 G A intronik PPARG 0,0051 2 4
    3 12404394 G A intronik PPARG 0,001 1 2
    3 12410289 G A intronik PPARG 0,008 2 4
    3 12431381 C T intronik PPARG 0,0061 2 4
    3 12447267 G A intronik PPARG 0,0089 2 4
    3 12449379 C T intronik PPARG 0,0092 2 4
    3 12450848 C A intronik PPARG 0,0092 2 4
    3 12462847 T C intronik PPARG 0,002 1 2
    3 12492797 G A intergenik PPARG,TSEN2 0,01 1 2
    3 12503201 G A intergenik PPARG,TSEN2 0,0099 2 4
    3 12530460 A G intronik TSEN2 0,0092 2 4
    3 12531167 A G intronik TSEN2 0,0099 2 4
    3 12557737 A G intronik TSEN2 0,001 2 4
    3 59636143 A G intergenik C3orf67,FHIT 0,003 3 6
    3 59645934 A C intergenik C3orf67,FHIT 0,004 1 2
    3 59646893 G A intergenik C3orf67,FHIT 0,002 1 2
    3 59697024 A G intergenik C3orf67,FHIT 0,0072 1 2
    3 59701013 G A intergenik C3orf67,FHIT 0,004 2 4
    3 59733945 A G intergenik C3orf67,FHIT 0,001 2 4
    3 59747482 C T intronik FHIT 0,001 2 4
    3 59750635 A G intronik FHIT 0,003 1 2
    3 59757776 C T intronik FHIT 0,001 2 4
    3 59770612 G A intronik FHIT 0,001 2 4
    3 59804444 G C intronik FHIT 0,001 2 4
    3 59819769 T C intronik FHIT 0,001 2 4
    3 59884396 C T intronik FHIT 0,001 2 4
    3 59960728 A C intronik FHIT 0,01 1 2
    3 59970345 G A intronik FHIT 0,002 1 2
    3 59972417 T A intronik FHIT 0,0072 0 2
    3 60104328 C A intronik FHIT 0,01 2 4
    3 60139062 G A intronik FHIT 0,01 0 2
    3 60158066 C T intronik FHIT 0,001 1 2
    3 60169285 C T intronik FHIT 0,005 1 2
    3 60216185 T C intronik FHIT 0,002 1 2
    3 60226380 G A intronik FHIT 0,007 2 4
    3 60234539 C A intronik FHIT 0,002 1 2
    3 60247464 A C intronik FHIT 0,004 2 4
    3 60269926 A G intronik FHIT 0,007 2 4
    3 60271228 G T intronik FHIT 0,007 2 4
    3 60286972 T C intronik FHIT 0,001 2 4
    3 60301412 C G intronik FHIT 0,001 1 2
    3 60312251 C T intronik FHIT 0,0099 1 2
    3 60317682 A G intronik FHIT 0,008 1 2
    3 60328557 C G intronik FHIT 0,0043 2 4
    3 60342562 C T intronik FHIT 0,006 1 2
    3 60400033 G A intronik FHIT 0,004 2 4
    3 60435819 C T intronik FHIT 0,006 2 4
    3 60435820 G T intronik FHIT 0,004 1 2
    3 60441288 T C intronik FHIT 0,006 2 4
    3 60444465 C A intronik FHIT 0,01 1 2
    3 60444575 C T intronik FHIT 0,001 1 2
    3 60450581 T C intronik FHIT 0,01 1 2
    3 60456571 G A intronik FHIT 0,001 2 4
    3 60473568 C G intronik FHIT 0,001 1 2
    3 60487557 T C intronik FHIT 0,001 1 2
    3 60559705 A G intronik FHIT 0,002 2 4
    3 60570764 T C intronik FHIT 0,008 2 4
    3 60582100 C T intronik FHIT 0,001 1 2
    3 60587192 G A intronik FHIT 0,004 1 2
    3 60599869 G A intronik FHIT 0,0086 2 4
    3 60603091 C T intronik FHIT 0,001 2 4
    3 60603250 A T intronik FHIT 0,0099 1 2
    3 60609831 T G intronik FHIT 0,001 2 4
    3 60619756 G T intronik FHIT 0,0015 2 4
    3 60680758 C T intronik FHIT 0,0089 2 4
    3 60702243 G C intronik FHIT 0,001 2 4
    3 60702532 A G intronik FHIT 0,001 1 2
    3 60714328 A T intronik FHIT 0,004 1 2
    3 60725297 G A intronik FHIT 0,001 1 2
    3 60726640 G A intronik FHIT 0,01 2 4
    3 60795144 A G intronik FHIT 0,001 2 4
    3 60807171 A G intronik FHIT 0,001 1 2
    3 60813868 T C intronik FHIT 0,001 1 2
    3 60826546 C G intronik FHIT 0,0023 1 2
    3 60837392 C T intronik FHIT 0,001 1 2
    3 60846310 A G intronik FHIT 0,01 0 2
    3 60850985 C T intronik FHIT 0,004 1 2
    3 60852559 T C intronik FHIT 0,008 1 2
    3 60871759 T C intronik FHIT 0,004 1 2
    3 60884396 C T intronik FHIT 0,002 2 4
    3 60897092 C A intronik FHIT 0,001 2 4
    3 60940759 C T intronik FHIT 0,0089 1 2
    3 60982595 A G intronik FHIT 0,003 2 4
    3 60999283 G A intronik FHIT 0,001 1 2
    3 61042977 A G intronik FHIT 0,001 2 4
    3 61043349 T C intronik FHIT 0,001 2 4
    3 61044789 A C intronik FHIT 0,001 2 4
    3 61141621 G A intronik FHIT 0,003 1 2
    3 61148655 G C intronik FHIT 0,001 2 4
    3 61170747 C T intronik FHIT 0,003 1 2
    3 61189473 C G intronik FHIT 0,0099 1 2
    3 61190425 C T intronik FHIT 0,0023 2 4
    3 61193853 C T intronik FHIT 0,0099 0 2
    3 61194793 C T intronik FHIT 0,007 0 2
    3 61194840 A G intronik FHIT 0,0099 0 2
    3 61194886 T A intronik FHIT 0,0099 0 2
    3 61201777 C T intronik FHIT 0,001 2 4
    3 61202292 T C intronik FHIT 0,007 1 2
    3 61232806 G C intronik FHIT 0,0099 1 2
    3 61232910 C T intronik FHIT 0,0099 1 2
    3 61235824 A T intronik FHIT 0,001 2 4
    3 61283810 A C intergenik FHIT,PTPRG 0,0089 1 2
    3 61293731 T A intergenik FHIT,PTPRG 0,0089 2 4
    3 61296730 C T intergenik FHIT,PTPRG 0,001 1 2
    3 61326341 C T intergenik FHIT,PTPRG 0,004 2 4
    3 61326620 T C intergenik FHIT,PTPRG 0,01 1 2
    3 61327649 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61330545 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61335803 G A intergenik FHIT,PTPRG 0,001 2 4

    Kueri serupa yang menggunakan data dari 1.000 Genom Fase 3 memerlukan waktu sekitar 90 detik untuk dijalankan dan memproses sekitar 3,38 TB data.

Mengkueri tabel terwujud dengan 250 gen acak

Contoh berikut menunjukkan cara menjalankan interval JOIN pada tabel terwujud yang berisi 250 gen yang dipilih secara acak dari tabel silver-wall-555:TuteTable.hg19.

  1. Untuk membuat tabel interval, ikuti langkah-langkah berikut:

    1. Buka halaman BigQuery di Konsol Google Cloud.

      Buka halaman BigQuery

    2. Klik Compose query.

    3. Di kolom New query, jalankan kueri berikut, yang mewujudkan bagian dari tabel silver-wall-555:TuteTable.hg19 menjadi tabel interval genomics.randomGenesIntervalTable baru.

      #standardSQL
      CREATE TABLE `genomics.randomGenesIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start) - 100000 AS region_start,
        MAX(`End`) + 100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250)
      GROUP BY
        Chr,
        Gene );
      
      1. Klik Jalankan kueri. Kueri tersebut menampilkan hasil berikut:
      This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
      
  2. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.randomGenesIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
    
  3. Klik Jalankan kueri. Kueri membutuhkan waktu sekitar sepuluh detik untuk dijalankan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kohor yang tumpang-tindih dengan wilayah minat.

    Luaskan bagian berikut untuk melihat hasil kueri yang terpotong:

    Hasil kueri

    Chr Mulai Ref Alt Funk Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 2925355 C A intergenik TTC34,ACTRT2 0,001 2 4
    1 2933170 G A intergenik TTC34,ACTRT2 0,0083 0 4
    1 2944477 G A intergenik ACTRT2,LINC00982 0,003 4 6
    1 2967591 A T intergenik ACTRT2,LINC00982 0,0092 1 2
    1 2975255 T C hilir LINC00982 0,0082 1 2
    1 2977223 C T ncRNA_intronic LINC00982 0,0072 1 2
    1 2978803 G C ncRNA_exonic LINC00982 0,002 4 6
    1 3006466 G A intronik PRDM16 0,0098 1 2
    1 3011333 G T intronik PRDM16 0,004 1 2
    1 3019659 C T intronik PRDM16 0,0031 1 2
    1 3036896 G A intronik PRDM16 0,001 1 2
    1 3037388 G A intronik PRDM16 0,002 2 4
    1 3041250 T G intronik PRDM16 0,006 2 4
    1 3042502 A T intronik PRDM16 0,003 4 6
    1 3053713 A C intronik PRDM16 0,002 1 2
    1 3063109 C T intronik PRDM16 0,002 0 2
    1 3063593 T C intronik PRDM16 0,003 1 2
    1 3076439 C T intronik PRDM16 0,001 2 4
    1 3078960 G A intronik PRDM16 0,007 2 4
    1 3084268 A C intronik PRDM16 0,005 0 2
    1 3084492 T C intronik PRDM16 0,0015 0 2
    1 3084786 T C intronik PRDM16 0,0015 0 4
    1 3111119 G A intronik PRDM16 0,003 1 2
    1 3111643 C T intronik PRDM16 0,0041 1 2
    1 3114807 G A intronik PRDM16 0,0041 1 2
    1 3165530 C T intronik PRDM16 0,0089 1 2
    1 3169325 G A intronik PRDM16 0,008 2 4
    1 3179623 C T intronik PRDM16 0,003 2 4
    1 3181097 C T intronik PRDM16 0,001 2 4
    1 3194000 G C intronik PRDM16 0,005 2 4
    1 3195769 T C intronik PRDM16 0,002 1 2
    1 3197351 C T intronik PRDM16 0,0061 1 2
    1 3224100 C A intronik PRDM16 0,003 2 4
    1 3228644 G T intronik PRDM16 0,001 2 4
    1 3234045 G A intronik PRDM16 0,002 1 2
    1 3235971 G A intronik PRDM16 0,0089 1 2
    1 3274115 C T intronik PRDM16 0,001 2 4
    1 3291388 G A intronik PRDM16 0,002 2 4
    1 3295658 A C intronik PRDM16 0,0068 0 6
    1 3295937 A C intronik PRDM16 0,0068 0 2
    1 3296205 T C intronik PRDM16 0,0083 0 2
    1 3315690 G A intronik PRDM16 0,001 2 4
    1 3329212 G A eksonik PRDM16 0,0031 Misense 1 2
    1 3331787 C T intronik PRDM16 0,0099 1 2
    1 3370316 G C hulu ARHGEF16 0,001 2 4
    1 3379560 A G intronik ARHGEF16 0,0051 0 6
    1 3391174 C T intronik ARHGEF16 0,006 1 2
    1 3413873 G A eksonik MEGF6 0,003 Misense 1 2
    1 3416272 C T eksonik MEGF6 0,0072 senyap 2 4
    1 3417122 G A intronik MEGF6 0,0038 2 4
    1 3436219 G A intronik MEGF6 0,0046 2 4
    1 12907456 A G eksonik HNRNPCL1,LOC649330 0,006 Misense 0 10
    1 12907518 C A eksonik HNRNPCL1,LOC649330 1,0E-4 Misense 0 10
    1 12908499 G C intronik HNRNPCL1 0,0031 0 8
    1 12931660 G C intergenik PRAMEF2,PRAMEF4 0,004 1 2
    1 12937721 G T intergenik PRAMEF2,PRAMEF4 0,0038 0 2
    1 12940827 G T intronik PRAMEF4 0,007 2 4
    1 12942759 T G intronik PRAMEF4 0,0076 0 10
    1 12942805 T G intronik PRAMEF4 0,0061 0 12
    1 12942812 G A intronik PRAMEF4 0,0061 0 12
    1 12942875 A G intronik PRAMEF4 0,0068 0 6
    1 12942912 G C intronik PRAMEF4 2,0E-4 0 2
    1 12942937 A T eksonik PRAMEF4 0,0029 Misense 0 2
    1 12942940 T G eksonik PRAMEF4 0,0038 Misense 0 2
    1 12943940 T C intronik PRAMEF4 0,0015 0 12
    1 12944138 A G intronik PRAMEF4 8,0E-4 0 12
    1 12944234 G A intronik PRAMEF4 0,0015 0 12
    1 12944589 T G intronik PRAMEF4 0,003 0 4
    1 12944845 A C intronik PRAMEF4 0,0014 0 6
    1 12946439 T C hulu PRAMEF4 0,0029 0 10
    1 12946833 G A hulu PRAMEF4 0,001 0 8
    1 12946835 T A hulu PRAMEF4 0,004 0 12
    1 12995204 G T intergenik PRAMEF8,PRAMEF6 0,003 1 4
    1 12997638 T C hilir PRAMEF6,PRAMEF9 0,003 2 4
    1 13007841 G C hulu PRAMEF6 0,0043 0 8
    1 13019228 T A intergenik PRAMEF6,LOC391003 0,0015 0 10
    1 13038503 G A UTR3 LOC391003 0,0072 1 2
    1 13051650 C T intergenik LOC391003,PRAMEF5 0,002 2 4
    1 15706063 G A intronik FHAD1 0,0029 1 2
    1 15713292 C T intronik FHAD1 0,001 1 2
    1 15766541 G C intronik CTRC 0,001 1 2
    1 15782601 T C hulu CELA2A 0,0038 1 2
    1 15828125 G A intronik CASP9 0,0014 2 4
    1 15831037 G A intronik CASP9 0,0099 1 2
    1 15840513 T G intronik CASP9 0,0043 2 4
    1 15868742 G A intronik DNAJC16 0,001 1 2
    1 15876704 G A intronik DNAJC16 0,001 1 2
    1 15900342 C A intronik AGMA 0,001 1 2
    1 15906257 T C intronik AGMA 8,0E-4 1 2
    1 15911897 A G hulu AGMA 0,0043 2 4
    1 22764178 C T intergenik WNT4,ZBTB40 0,001 2 4
    1 22791939 C T intronik ZBTB40 0,0089 2 4
    1 22874394 C G intergenik ZBTB40,EPHA8 0,007 1 2
    1 22875103 C G intergenik ZBTB40,EPHA8 0,007 1 2
    1 22906403 C T intronik EPHA8 0,008 2 4
    1 22912956 G A intronik EPHA8 0,001 1 2
    1 22917007 C T intronik EPHA8 0,001 2 4
    1 22927240 G A eksonik EPHA8 0,0013 Misense 2 4
    1 22932265 G A intergenik EPHA8,MIR6127 0,0089 2 4
    1 22944057 C T intergenik EPHA8,MIR6127 0,0089 2 4
    1 22978799 A G hulu C1QB 0,0099 2 4
    1 35170588 C T intergenik C1orf94,GJB5 0,01 1 2
    1 35172426 C T intergenik C1orf94,GJB5 0,008 1 2
    1 35172447 G A intergenik C1orf94,GJB5 0,001 1 2
    1 35175302 C T intergenik C1orf94,GJB5 0,008 1 2
    1 35177410 A T intergenik C1orf94,GJB5 0,001 1 2
    1 35178768 C T intergenik C1orf94,GJB5 0,0014 2 4
    1 35179362 G A intergenik C1orf94,GJB5 0,0014 2 4
    1 35186166 G A intergenik C1orf94,GJB5 0,0099 2 4
    1 35186520 A C intergenik C1orf94,GJB5 0,002 2 4
    1 35196361 G A intergenik C1orf94,GJB5 0,0099 2 4
    1 35223545 C T eksonik GJB5 0,001 senyap 1 2
    1 35224029 G A UTR3 GJB5 0,003 1 2
    1 35227895 T C UTR3 GJB4 5,0E-4 1 2
    1 35230455 G T intergenik GJB4,GJB3 0,0043 1 2
    1 35232954 T C intergenik GJB4,GJB3 0,003 1 2
    1 35237986 G A intergenik GJB4,GJB3 0,0014 1 2
    1 35245522 C T intergenik GJB4,GJB3 0,001 1 2
    1 35256979 C T intergenik GJB3,GJA4 0,002 2 4
    1 35263872 C T intergenik GJA4,SMIM12 5,0E-4 2 4
    1 35323895 A C intronik SMIM12 0,0027 2 4
    1 35369676 G A intronik DLGAP3 0,007 2 4
    1 35371634 T A hulu DLGAP3 0,0015 0 4
    1 39253519 G A intergenik LINC01343,RRAGC 0,005 1 2
    1 39288829 G A intergenik LINC01343,RRAGC 0,0051 1 2
    1 39289832 A C intergenik LINC01343,RRAGC 0,002 0 2
    1 39312638 G A intronik RRAGC 0,0038 2 4
    1 39361372 G A intronik RHBDL2 0,005 1 2
    1 39363826 T G intronik RHBDL2 0,0029 1 2
    1 39367555 T C intronik RHBDL2 0,007 2 4
    1 39369531 T C intronik RHBDL2 0,001 2 4
    1 39370202 T C intronik RHBDL2 0,01 1 2
    1 39449101 A G intergenik RHBDL2,AKIRIN1 0,001 2 4
    1 39475057 G A intergenik AKIRIN1,NDUFS5 0,01 1 2
    1 39485016 C T intergenik AKIRIN1,NDUFS5 0,001 2 4
    1 39488137 A G intergenik AKIRIN1,NDUFS5 0,001 2 4
    1 39499212 A C intronik NDUFS5 0,001 0 2
    1 39500605 C G hilir NDUFS5 0,002 0 10
    1 46813814 T C intronik NSUN4 0,0014 1 2
    1 46817258 A G intronik NSUN4 0,005 0 2
    1 46843158 T C intergenik NSUN4,FAAH 0,001 1 2
    1 46933509 A G intergenik LINC01398,DMBX1 0,002 1 2
    1 46935021 G A intergenik LINC01398,DMBX1 0,004 2 4
    1 46939253 T A intergenik LINC01398,DMBX1 0,004 2 4
    1 46951788 C A intergenik LINC01398,DMBX1 0,002 2 4
    1 46980864 G C hilir DMBX1 0,003 1 2
    1 46989657 T C intergenik DMBX1,MKNK1-AS1 0,007 1 2
    1 46994678 C T intergenik DMBX1,MKNK1-AS1 0,002 1 2
    1 46999438 T C intergenik DMBX1,MKNK1-AS1 0,002 1 2
    1 92761505 A G intronik GLMN 0,001 2 4
    1 92764270 G C intronik GLMN 0,001 2 4
    1 92802210 G A intronik RPAP2 0,0072 1 2
    1 92820663 T A intronik RPAP2 0,0058 1 2
    1 92820664 G T intronik RPAP2 0,0058 1 2
    1 92820953 G A intronik RPAP2 0,007 2 4
    1 92824766 A G intronik RPAP2 0,0058 1 2
    1 92849183 C A intronik RPAP2 0,01 2 4
    1 92850696 C G intronik RPAP2 0,0023 1 2
    1 92861357 T C intergenik RPAP2,GFI1 0,01 2 4
    1 92877460 C G intergenik RPAP2,GFI1 0,002 1 2
    1 92880643 A G intergenik RPAP2,GFI1 0,001 2 4
    1 92911540 G A intergenik RPAP2,GFI1 0,004 2 4
    1 92911721 A C intergenik RPAP2,GFI1 0,0031 0 8
    1 92918277 C T intergenik RPAP2,GFI1 0,001 2 4
    1 92950920 G A intronik GFI1 0,008 2 4
    1 92964788 G A intergenik GFI1,EVI5 0,0023 1 2
    1 92977480 C T UTR3 EVI5 0,002 1 2
    1 92985213 C T intronik EVI5 0,001 2 4
    1 92988342 C T intronik EVI5 0,008 2 4
    1 92992283 G A intronik EVI5 0,01 2 4
    1 92999760 C T intronik EVI5 0,003 1 2
    1 93005149 G C intronik EVI5 0,003 0 4
    1 93018543 A T intronik EVI5 0,01 2 4
    1 93033744 C T intronik EVI5 0,001 2 4
    1 111400296 G A intergenik KCNA3,CD53 0,0014 2 4
    1 111411924 C T intergenik KCNA3,CD53 0,003 1 2
    1 111441850 C G UTR3 CD53 0,003 2 4
    1 111451527 C T intergenik CD53,LRIF1 0,008 2 4
    1 111454082 C A intergenik CD53,LRIF1 0,001 2 4
    1 111466506 A G intergenik CD53,LRIF1 0,001 2 4
    1 111525974 G A intergenik LRIF1,DRAM2 0,002 2 4
    1 111574573 G T intergenik LRIF1,DRAM2 0,0072 2 4
    1 111574594 T A intergenik LRIF1,DRAM2 0,005 1 2
    1 111574647 G A intergenik LRIF1,DRAM2 0,005 1 2
    1 111591746 T A intergenik LRIF1,DRAM2 0,005 1 2
    1 111601459 A G intergenik LRIF1,DRAM2 0,005 1 2
    1 111604748 G C intergenik LRIF1,DRAM2 0,005 1 2
    1 112191526 T G intronik RAP1A 0,001 2 4
    1 112206765 A G intronik RAP1A 0,0043 1 2
    1 112226517 G A intronik RAP1A 0,001 0 2
    1 112263324 G T intergenik RAP1A,FAM212B 0,003 2 4
    1 112264843 G A UTR3 FAM212B 0,001 1 2
    1 112285810 C T ncRNA_intronic FAM212B-AS1 0,004 1 2
    1 112304285 T C intronik DDX20 0,0043 1 2
    1 112307213 A C intronik DDX20 0,0043 1 2
    1 112309436 G T eksonik DDX20 0.0 Misense 1 2
    1 112317384 T C intergenik DDX20,KCND3 0,0014 1 2
    1 112381367 C T intronik KCND3 0,002 1 2
    1 112396571 G T ncRNA_exonic KCND3-IT1 0,001 1 2
    1 113520038 G A intergenik SLC16A1-AS1,LOC100996251 0,0023 1 2

    Kueri serupa yang menggunakan data dari 1.000 Genom Fase 3 memerlukan waktu sekitar 90 detik untuk dijalankan dan memproses sekitar 3,38 TB data.