Jalankan JOIN interval dengan BigQuery

Anda dapat menggunakan BigQuery untuk menjalankan kueri JOIN pada varian dengan data yang dijelaskan oleh interval region genomik, atau tumpang-tindih. Halaman ini menunjukkan cara menggunakan kueri JOIN yang kompleks untuk mengambil daftar nama gen dan melakukan hal berikut:

  • Temukan SNP langka yang tumpang-tindih dengan gen.
  • Temukan 100.000 pasangan basa di kedua sisi gen untuk seluruh contoh genom.

Panduan ini menunjukkan contoh untuk tiga kueri. Setiap kueri menunjukkan cara BigQuery diskalakan pada berbagai ukuran data genomik:

Data berasal dari tabel Anotasi Tute Genomics yang memiliki hampir 9 miliar baris dan set data Illumina Platinum Genomes. Jika Anda tidak memahami set data ini, lihat link berikut:

Membuat kueri tabel inline

Contoh berikut menggunakan tabel interval yang ditentukan dalam kueri, yang disebut intervals, dan menunjukkan cara menjalankan kueri JOIN dengan tabel yang berisi varian dari Illumina Platinum Genomes:

  1. Buka halaman BigQuery di konsol Google Cloud.

    Buka halaman BigQuery

  2. Klik Compose query.

  3. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- Define an inline table that uses five rows
      -- selected from silver-wall-555.TuteTable.hg19.
      intervals AS (
        SELECT * FROM UNNEST ([
        STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
        ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
        ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
        ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
        ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
        ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
      ])),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  4. Klik Run query. Kueri ini membutuhkan waktu sekitar sepuluh detik untuk berjalan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kelompok yang tumpang-tindih dengan area minat.

    Luaskan bagian berikut untuk melihat hasil kueri:

    Hasil kueri

    Chr Mulai Ref Alt Func Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 S C intron RRNAD1 0,002 2 4
    1 156705390 C S intron RRNAD1 8,0E-4 0 2
    1 156714207 S C intron HDGF 0,003 0 6
    1 156714440 A C intron HDGF 0,0068 0 12
    1 156723870 C S intergenik HDGF,PRCC 0,006 1 2
    1 156724456 C S intergenik HDGF,PRCC 0,002 2 4
    1 156733988 C S intergenik HDGF,PRCC 0,001 1 2
    1 156742258 S G intron PRCC 0,001 2 4
    1 156744826 S G intron PRCC 0,002 0 8
    1 156779764 G A intron SH2D2A 0,001 2 4
    1 156783454 A C intron SH2D2A 0,0014 1 2
    1 156786144 C S intron NTRK1,SH2D2A 0,0031 2 4
    1 156790510 A S intron NTRK1 0,002 1 2
    1 156815332 A C intron INSRR,NTRK1 0,003 0 2
    1 156830778 G A ekson NTRK1 0,0067 missense 2 4
    1 156842064 C S intron NTRK1 0,0014 1 2
    1 156843438 C A ekson NTRK1 0,0032 missense 1 2
    1 156845773 C S intron NTRK1 0,001 2 4
    1 156873318 S C intron PEAR1 0,01 4 8
    1 156922740 G A intron ARHGEF11 0,007 1 2
    1 156930100 C S intron ARHGEF11 0,001 2 4
    2 113901230 G A intergenik IL1RN,PSD4 0,0082 1 2
    2 113953418 C A intron PSD4 0,001 2 4
    2 113967621 G C intergenik PSD4,PAX8 0,002 0 6
    2 113967624 S C intergenik PSD4,PAX8 0,002 0 2
    2 113980967 G A intron PAX8 0,002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0,001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0,001 2 4
    2 114061327 S C intergenik PAX8,CBWD2 0,001 2 4
    2 114084018 A C intergenik PAX8,CBWD2 0,0045 0 4
    2 114099037 G A intergenik PAX8,CBWD2 0,0051 1 2
    2 114105670 A S intergenik PAX8,CBWD2 0,001 1 2
    2 114111325 G S intergenik PAX8,CBWD2 0,001 1 2
    3 12265797 C S intergenik SYN2,PPARG 0,0089 2 4
    3 12277958 A G intergenik SYN2,PPARG 0,002 1 2
    3 12296019 G A intergenik SYN2,PPARG 0,002 2 4
    3 12316549 G C intergenik SYN2,PPARG 0,002 1 2
    3 12335681 S G intron PPARG 0,0092 2 4
    3 12348795 S C intron PPARG 0,0014 1 2
    3 12353106 S C intron PPARG 0,001 2 4
    3 12403825 G A intron PPARG 0,0051 2 4
    3 12404394 G A intron PPARG 0,001 1 2
    3 12410289 G A intron PPARG 0,008 2 4
    3 12431381 C S intron PPARG 0,0061 2 4
    3 12447267 G A intron PPARG 0,0089 2 4
    3 12449379 C S intron PPARG 0,0092 2 4
    3 12450848 C A intron PPARG 0,0092 2 4
    3 12462847 S C intron PPARG 0,002 1 2
    3 12492797 G A intergenik PPARG,TSEN2 0,01 1 2
    3 12503201 G A intergenik PPARG,TSEN2 0,0099 2 4
    3 12530460 A G intron TSEN2 0,0092 2 4
    3 12531167 A G intron TSEN2 0,0099 2 4
    3 12557737 A G intron TSEN2 0,001 2 4
    3 59636143 A G intergenik C3orf67,FHIT 0,003 3 6
    3 59645934 A C intergenik C3orf67,FHIT 0,004 1 2
    3 59646893 G A intergenik C3orf67,FHIT 0,002 1 2
    3 59697024 A G intergenik C3orf67,FHIT 0,0072 1 2
    3 59701013 G A intergenik C3orf67,FHIT 0,004 2 4
    3 59733945 A G intergenik C3orf67,FHIT 0,001 2 4
    3 59747482 C S intron FHIT 0,001 2 4
    3 59750635 A G intron FHIT 0,003 1 2
    3 59757776 C S intron FHIT 0,001 2 4
    3 59770612 G A intron FHIT 0,001 2 4
    3 59804444 G C intron FHIT 0,001 2 4
    3 59819769 S C intron FHIT 0,001 2 4
    3 59884396 C S intron FHIT 0,001 2 4
    3 59960728 A C intron FHIT 0,01 1 2
    3 59970345 G A intron FHIT 0,002 1 2
    3 59972417 S A intron FHIT 0,0072 0 2
    3 60104328 C A intron FHIT 0,01 2 4
    3 60139062 G A intron FHIT 0,01 0 2
    3 60158066 C S intron FHIT 0,001 1 2
    3 60169285 C S intron FHIT 0,005 1 2
    3 60216185 S C intron FHIT 0,002 1 2
    3 60226380 G A intron FHIT 0,007 2 4
    3 60234539 C A intron FHIT 0,002 1 2
    3 60247464 A C intron FHIT 0,004 2 4
    3 60269926 A G intron FHIT 0,007 2 4
    3 60271228 G S intron FHIT 0,007 2 4
    3 60286972 S C intron FHIT 0,001 2 4
    3 60301412 C G intron FHIT 0,001 1 2
    3 60312251 C S intron FHIT 0,0099 1 2
    3 60317682 A G intron FHIT 0,008 1 2
    3 60328557 C G intron FHIT 0,0043 2 4
    3 60342562 C S intron FHIT 0,006 1 2
    3 60400033 G A intron FHIT 0,004 2 4
    3 60435819 C S intron FHIT 0,006 2 4
    3 60435820 G S intron FHIT 0,004 1 2
    3 60441288 S C intron FHIT 0,006 2 4
    3 60444465 C A intron FHIT 0,01 1 2
    3 60444575 C S intron FHIT 0,001 1 2
    3 60450581 S C intron FHIT 0,01 1 2
    3 60456571 G A intron FHIT 0,001 2 4
    3 60473568 C G intron FHIT 0,001 1 2
    3 60487557 S C intron FHIT 0,001 1 2
    3 60559705 A G intron FHIT 0,002 2 4
    3 60570764 S C intron FHIT 0,008 2 4
    3 60582100 C S intron FHIT 0,001 1 2
    3 60587192 G A intron FHIT 0,004 1 2
    3 60599869 G A intron FHIT 0,0086 2 4
    3 60603091 C S intron FHIT 0,001 2 4
    3 60603250 A S intron FHIT 0,0099 1 2
    3 60609831 S G intron FHIT 0,001 2 4
    3 60619756 G S intron FHIT 0,0015 2 4
    3 60680758 C S intron FHIT 0,0089 2 4
    3 60702243 G C intron FHIT 0,001 2 4
    3 60702532 A G intron FHIT 0,001 1 2
    3 60714328 A S intron FHIT 0,004 1 2
    3 60725297 G A intron FHIT 0,001 1 2
    3 60726640 G A intron FHIT 0,01 2 4
    3 60795144 A G intron FHIT 0,001 2 4
    3 60807171 A G intron FHIT 0,001 1 2
    3 60813868 S C intron FHIT 0,001 1 2
    3 60826546 C G intron FHIT 0,0023 1 2
    3 60837392 C S intron FHIT 0,001 1 2
    3 60846310 A G intron FHIT 0,01 0 2
    3 60850985 C S intron FHIT 0,004 1 2
    3 60852559 S C intron FHIT 0,008 1 2
    3 60871759 S C intron FHIT 0,004 1 2
    3 60884396 C S intron FHIT 0,002 2 4
    3 60897092 C A intron FHIT 0,001 2 4
    3 60940759 C S intron FHIT 0,0089 1 2
    3 60982595 A G intron FHIT 0,003 2 4
    3 60999283 G A intron FHIT 0,001 1 2
    3 61042977 A G intron FHIT 0,001 2 4
    3 61043349 S C intron FHIT 0,001 2 4
    3 61044789 A C intron FHIT 0,001 2 4
    3 61141621 G A intron FHIT 0,003 1 2
    3 61148655 G C intron FHIT 0,001 2 4
    3 61170747 C S intron FHIT 0,003 1 2
    3 61189473 C G intron FHIT 0,0099 1 2
    3 61190425 C S intron FHIT 0,0023 2 4
    3 61193853 C S intron FHIT 0,0099 0 2
    3 61194793 C S intron FHIT 0,007 0 2
    3 61194840 A G intron FHIT 0,0099 0 2
    3 61194886 S A intron FHIT 0,0099 0 2
    3 61201777 C S intron FHIT 0,001 2 4
    3 61202292 S C intron FHIT 0,007 1 2
    3 61232806 G C intron FHIT 0,0099 1 2
    3 61232910 C S intron FHIT 0,0099 1 2
    3 61235824 A S intron FHIT 0,001 2 4
    3 61283810 A C intergenik FHIT,PTPRG 0,0089 1 2
    3 61293731 S A intergenik FHIT,PTPRG 0,0089 2 4
    3 61296730 C S intergenik FHIT,PTPRG 0,001 1 2
    3 61326341 C S intergenik FHIT,PTPRG 0,004 2 4
    3 61326620 S C intergenik FHIT,PTPRG 0,01 1 2
    3 61327649 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61330545 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61335803 G A intergenik FHIT,PTPRG 0,001 2 4

    Kueri serupa yang menggunakan data dari 1,000 Genomes Phase 3 memerlukan waktu sekitar 90 detik untuk berjalan dan memproses sekitar 3,38 TB data.

Menggunakan tabel yang diwujudkan

Saat menangani big data dalam skala besar, Anda dapat mewujudkan tabel interval dan menjalankan kueri JOIN terhadap tabel baru. Sebelum melanjutkan ke bagian lainnya, buat set data dengan mengikuti langkah-langkah berikut:

  1. Buka halaman BigQuery di konsol Google Cloud.

    Buka halaman BigQuery

  2. Di panel Explorer, pilih project tempat Anda ingin membuat set data.

  3. Luaskan opsi Actions, lalu klik Create dataset.

  4. Di halaman Create dataset:

    1. Untuk Dataset ID, masukkan genomics.
    2. Tetap gunakan setelan default lainnya.
    3. Klik Create dataset.

Membuat kueri tabel yang diwujudkan dengan gen tertentu

Langkah-langkah berikut menunjukkan cara mewujudkan tabel interval baru yang berisi daftar gen tertentu dari tabel silver-wall-555:TuteTable.hg19.

  1. Untuk membuat tabel interval, ikuti langkah-langkah berikut:

    1. Buka halaman BigQuery di konsol Google Cloud.

      Buka halaman BigQuery

    2. Klik Compose query.

    3. Di kolom New query, jalankan kueri berikut. Kueri mewujudkan bagian dari tabel silver-wall-555:TuteTable.hg19 ke tabel interval genomics.myIntervalTable baru.

      #standardSQL
      CREATE TABLE `genomics.myIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start)-100000 AS region_start,
        MAX(`End`)+100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN ('APC', 'ATM', 'BMPR1A', 'BRCA1', 'BRCA2', 'CDK4',
        'CDKN2A', 'CREBBP', 'EGFR', 'EP300', 'ETV6', 'FHIT', 'FLT3',
        'HRAS', 'KIT', 'MET', 'MLH1', 'NTRK1', 'PAX8', 'PDGFRA',
        'PPARG', 'PRCC', 'PRKAR1A', 'PTEN', 'RET', 'STK11',
        'TFE3', 'TGFB1', 'TGFBR2', 'TP53', 'WWOX')
      GROUP BY
        Chr,
        Gene );
    4. Klik Run query. Kueri akan menampilkan hasil berikut:

    This statement created a new table named PROJECT_ID:genomics.myIntervalTable.
  2. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.myIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  3. Klik Run query. Kueri ini membutuhkan waktu sekitar sepuluh detik untuk berjalan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kelompok yang tumpang-tindih dengan area minat.

    Luaskan bagian berikut untuk melihat hasil kueri:

    Hasil kueri

    Chr Mulai Ref Alt Func Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 156699757 S C intron RRNAD1 0,002 2 4
    1 156705390 C S intron RRNAD1 8,0E-4 0 2
    1 156714207 S C intron HDGF 0,003 0 6
    1 156714440 A C intron HDGF 0,0068 0 12
    1 156723870 C S intergenik HDGF,PRCC 0,006 1 2
    1 156724456 C S intergenik HDGF,PRCC 0,002 2 4
    1 156733988 C S intergenik HDGF,PRCC 0,001 1 2
    1 156742258 S G intron PRCC 0,001 2 4
    1 156744826 S G intron PRCC 0,002 0 8
    1 156779764 G A intron SH2D2A 0,001 2 4
    1 156783454 A C intron SH2D2A 0,0014 1 2
    1 156786144 C S intron NTRK1,SH2D2A 0,0031 2 4
    1 156790510 A S intron NTRK1 0,002 1 2
    1 156815332 A C intron INSRR,NTRK1 0,003 0 2
    1 156830778 G A ekson NTRK1 0,0067 missense 2 4
    1 156842064 C S intron NTRK1 0,0014 1 2
    1 156843438 C A ekson NTRK1 0,0032 missense 1 2
    1 156845773 C S intron NTRK1 0,001 2 4
    1 156873318 S C intron PEAR1 0,01 4 8
    1 156922740 G A intron ARHGEF11 0,007 1 2
    1 156930100 C S intron ARHGEF11 0,001 2 4
    2 113901230 G A intergenik IL1RN,PSD4 0,0082 1 2
    2 113953418 C A intron PSD4 0,001 2 4
    2 113967621 G C intergenik PSD4,PAX8 0,002 0 6
    2 113967624 S C intergenik PSD4,PAX8 0,002 0 2
    2 113980967 G A intron PAX8 0,002 2 4
    2 113994010 A C ncRNA_exonic PAX8-AS1 0,001 0 4
    2 113997745 C A ncRNA_exonic PAX8-AS1 0,001 2 4
    2 114061327 S C intergenik PAX8,CBWD2 0,001 2 4
    2 114084018 A C intergenik PAX8,CBWD2 0,0045 0 4
    2 114099037 G A intergenik PAX8,CBWD2 0,0051 1 2
    2 114105670 A S intergenik PAX8,CBWD2 0,001 1 2
    2 114111325 G S intergenik PAX8,CBWD2 0,001 1 2
    3 12265797 C S intergenik SYN2,PPARG 0,0089 2 4
    3 12277958 A G intergenik SYN2,PPARG 0,002 1 2
    3 12296019 G A intergenik SYN2,PPARG 0,002 2 4
    3 12316549 G C intergenik SYN2,PPARG 0,002 1 2
    3 12335681 S G intron PPARG 0,0092 2 4
    3 12348795 S C intron PPARG 0,0014 1 2
    3 12353106 S C intron PPARG 0,001 2 4
    3 12403825 G A intron PPARG 0,0051 2 4
    3 12404394 G A intron PPARG 0,001 1 2
    3 12410289 G A intron PPARG 0,008 2 4
    3 12431381 C S intron PPARG 0,0061 2 4
    3 12447267 G A intron PPARG 0,0089 2 4
    3 12449379 C S intron PPARG 0,0092 2 4
    3 12450848 C A intron PPARG 0,0092 2 4
    3 12462847 S C intron PPARG 0,002 1 2
    3 12492797 G A intergenik PPARG,TSEN2 0,01 1 2
    3 12503201 G A intergenik PPARG,TSEN2 0,0099 2 4
    3 12530460 A G intron TSEN2 0,0092 2 4
    3 12531167 A G intron TSEN2 0,0099 2 4
    3 12557737 A G intron TSEN2 0,001 2 4
    3 59636143 A G intergenik C3orf67,FHIT 0,003 3 6
    3 59645934 A C intergenik C3orf67,FHIT 0,004 1 2
    3 59646893 G A intergenik C3orf67,FHIT 0,002 1 2
    3 59697024 A G intergenik C3orf67,FHIT 0,0072 1 2
    3 59701013 G A intergenik C3orf67,FHIT 0,004 2 4
    3 59733945 A G intergenik C3orf67,FHIT 0,001 2 4
    3 59747482 C S intron FHIT 0,001 2 4
    3 59750635 A G intron FHIT 0,003 1 2
    3 59757776 C S intron FHIT 0,001 2 4
    3 59770612 G A intron FHIT 0,001 2 4
    3 59804444 G C intron FHIT 0,001 2 4
    3 59819769 S C intron FHIT 0,001 2 4
    3 59884396 C S intron FHIT 0,001 2 4
    3 59960728 A C intron FHIT 0,01 1 2
    3 59970345 G A intron FHIT 0,002 1 2
    3 59972417 S A intron FHIT 0,0072 0 2
    3 60104328 C A intron FHIT 0,01 2 4
    3 60139062 G A intron FHIT 0,01 0 2
    3 60158066 C S intron FHIT 0,001 1 2
    3 60169285 C S intron FHIT 0,005 1 2
    3 60216185 S C intron FHIT 0,002 1 2
    3 60226380 G A intron FHIT 0,007 2 4
    3 60234539 C A intron FHIT 0,002 1 2
    3 60247464 A C intron FHIT 0,004 2 4
    3 60269926 A G intron FHIT 0,007 2 4
    3 60271228 G S intron FHIT 0,007 2 4
    3 60286972 S C intron FHIT 0,001 2 4
    3 60301412 C G intron FHIT 0,001 1 2
    3 60312251 C S intron FHIT 0,0099 1 2
    3 60317682 A G intron FHIT 0,008 1 2
    3 60328557 C G intron FHIT 0,0043 2 4
    3 60342562 C S intron FHIT 0,006 1 2
    3 60400033 G A intron FHIT 0,004 2 4
    3 60435819 C S intron FHIT 0,006 2 4
    3 60435820 G S intron FHIT 0,004 1 2
    3 60441288 S C intron FHIT 0,006 2 4
    3 60444465 C A intron FHIT 0,01 1 2
    3 60444575 C S intron FHIT 0,001 1 2
    3 60450581 S C intron FHIT 0,01 1 2
    3 60456571 G A intron FHIT 0,001 2 4
    3 60473568 C G intron FHIT 0,001 1 2
    3 60487557 S C intron FHIT 0,001 1 2
    3 60559705 A G intron FHIT 0,002 2 4
    3 60570764 S C intron FHIT 0,008 2 4
    3 60582100 C S intron FHIT 0,001 1 2
    3 60587192 G A intron FHIT 0,004 1 2
    3 60599869 G A intron FHIT 0,0086 2 4
    3 60603091 C S intron FHIT 0,001 2 4
    3 60603250 A S intron FHIT 0,0099 1 2
    3 60609831 S G intron FHIT 0,001 2 4
    3 60619756 G S intron FHIT 0,0015 2 4
    3 60680758 C S intron FHIT 0,0089 2 4
    3 60702243 G C intron FHIT 0,001 2 4
    3 60702532 A G intron FHIT 0,001 1 2
    3 60714328 A S intron FHIT 0,004 1 2
    3 60725297 G A intron FHIT 0,001 1 2
    3 60726640 G A intron FHIT 0,01 2 4
    3 60795144 A G intron FHIT 0,001 2 4
    3 60807171 A G intron FHIT 0,001 1 2
    3 60813868 S C intron FHIT 0,001 1 2
    3 60826546 C G intron FHIT 0,0023 1 2
    3 60837392 C S intron FHIT 0,001 1 2
    3 60846310 A G intron FHIT 0,01 0 2
    3 60850985 C S intron FHIT 0,004 1 2
    3 60852559 S C intron FHIT 0,008 1 2
    3 60871759 S C intron FHIT 0,004 1 2
    3 60884396 C S intron FHIT 0,002 2 4
    3 60897092 C A intron FHIT 0,001 2 4
    3 60940759 C S intron FHIT 0,0089 1 2
    3 60982595 A G intron FHIT 0,003 2 4
    3 60999283 G A intron FHIT 0,001 1 2
    3 61042977 A G intron FHIT 0,001 2 4
    3 61043349 S C intron FHIT 0,001 2 4
    3 61044789 A C intron FHIT 0,001 2 4
    3 61141621 G A intron FHIT 0,003 1 2
    3 61148655 G C intron FHIT 0,001 2 4
    3 61170747 C S intron FHIT 0,003 1 2
    3 61189473 C G intron FHIT 0,0099 1 2
    3 61190425 C S intron FHIT 0,0023 2 4
    3 61193853 C S intron FHIT 0,0099 0 2
    3 61194793 C S intron FHIT 0,007 0 2
    3 61194840 A G intron FHIT 0,0099 0 2
    3 61194886 S A intron FHIT 0,0099 0 2
    3 61201777 C S intron FHIT 0,001 2 4
    3 61202292 S C intron FHIT 0,007 1 2
    3 61232806 G C intron FHIT 0,0099 1 2
    3 61232910 C S intron FHIT 0,0099 1 2
    3 61235824 A S intron FHIT 0,001 2 4
    3 61283810 A C intergenik FHIT,PTPRG 0,0089 1 2
    3 61293731 S A intergenik FHIT,PTPRG 0,0089 2 4
    3 61296730 C S intergenik FHIT,PTPRG 0,001 1 2
    3 61326341 C S intergenik FHIT,PTPRG 0,004 2 4
    3 61326620 S C intergenik FHIT,PTPRG 0,01 1 2
    3 61327649 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61330545 G C intergenik FHIT,PTPRG 0,001 2 4
    3 61335803 G A intergenik FHIT,PTPRG 0,001 2 4

    Kueri serupa yang menggunakan data dari 1,000 Genomes Phase 3 memerlukan waktu sekitar 90 detik untuk berjalan dan memproses sekitar 3,38 TB data.

Membuat kueri tabel yang diwujudkan dengan 250 gen acak

Contoh berikut menunjukkan cara menjalankan JOIN interval pada tabel yang diwujudkan yang berisi 250 gen yang dipilih secara acak dari tabel silver-wall-555:TuteTable.hg19.

  1. Untuk membuat tabel interval, ikuti langkah-langkah berikut:

    1. Buka halaman BigQuery di konsol Google Cloud.

      Buka halaman BigQuery

    2. Klik Compose query.

    3. Di kolom New query, jalankan kueri berikut, yang mewujudkan bagian dari tabel silver-wall-555:TuteTable.hg19 ke tabel interval genomics.randomGenesIntervalTable baru.

      #standardSQL
      CREATE TABLE `genomics.randomGenesIntervalTable` AS (
      SELECT
        Gene,
        Chr,
        MIN(Start) AS gene_start,
        MAX(`End`) AS gene_end,
        MIN(Start) - 100000 AS region_start,
        MAX(`End`) + 100000 AS region_end
      FROM
        `silver-wall-555.TuteTable.hg19`
      WHERE
        Gene IN (SELECT Gene FROM `silver-wall-555.TuteTable.hg19` GROUP BY Gene LIMIT 250)
      GROUP BY
        Chr,
        Gene );
      1. Klik Run query. Kueri akan menampilkan hasil berikut:
      This statement created a new table named PROJECT_ID:genomics.randomGenesIntervalTable.
  2. Di kolom New query, jalankan kueri berikut:

    #standardSQL
    WITH
      --
      -- Retrieve the variants in this cohort, flattening by alternate bases and
      -- counting affected alleles.
      variants AS (
      SELECT
        REPLACE(reference_name, 'chr', '') as reference_name,
        start_position,
        end_position,
        reference_bases,
        alternate_bases.alt AS alt,
        (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
        (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
      FROM
        `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
        UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
      --
      -- JOIN the variants with the genomic intervals overlapping
      -- the genes of interest.
      --
      -- The JOIN criteria is complicated because the task is to see if
      -- an SNP overlaps an interval.  With standard SQL you can use complex
      -- JOIN predicates, including arbitrary expressions.
      gene_variants AS (
      SELECT
        reference_name,
        start_position,
        reference_bases,
        alt,
        num_variant_alleles,
        total_num_alleles
      FROM
        variants
      INNER JOIN
        `genomics.randomGenesIntervalTable` AS intervals ON
        variants.reference_name = intervals.Chr
        AND intervals.region_start <= variants.start_position
        AND intervals.region_end >= variants.end_position )
      --
      -- And finally JOIN the variants in the regions of interest
      -- with annotations for rare variants.
    SELECT DISTINCT
      Chr,
      annots.Start AS Start,
      Ref,
      annots.Alt,
      Func,
      Gene,
      PopFreqMax,
      ExonicFunc,
      num_variant_alleles,
      total_num_alleles
    FROM
      `silver-wall-555.TuteTable.hg19` AS annots
    INNER JOIN
      gene_variants AS vars
    ON
      vars.reference_name = annots.Chr
      AND vars.start_position = annots.Start
      AND vars.reference_bases = annots.Ref
      AND vars.alt = annots.Alt
    WHERE
      -- Retrieve annotations for rare variants only.
      PopFreqMax <= 0.01
    ORDER BY
      Chr,
      Start;
  3. Klik Run query. Kueri ini membutuhkan waktu sekitar sepuluh detik untuk berjalan, dan memproses sekitar 334 GB data. Hasil kueri mengidentifikasi varian langka dalam kelompok yang tumpang-tindih dengan area minat.

    Luaskan bagian berikut untuk melihat hasil kueri yang terpotong:

    Hasil kueri

    Chr Mulai Ref Alt Func Gen PopFreqMax ExonicFunc num_variant_alleles total_num_alleles
    1 2925355 C A intergenik TTC34,ACTRT2 0,001 2 4
    1 2933170 G A intergenik TTC34,ACTRT2 0,0083 0 4
    1 2944477 G A intergenik ACTRT2,LINC00982 0,003 4 6
    1 2967591 A S intergenik ACTRT2,LINC00982 0,0092 1 2
    1 2975255 S C downstream LINC00982 0,0082 1 2
    1 2977223 C S ncRNA_intronic LINC00982 0,0072 1 2
    1 2978803 G C ncRNA_exonic LINC00982 0,002 4 6
    1 3006466 G A intron PRDM16 0,0098 1 2
    1 3011333 G S intron PRDM16 0,004 1 2
    1 3019659 C S intron PRDM16 0,0031 1 2
    1 3036896 G A intron PRDM16 0,001 1 2
    1 3037388 G A intron PRDM16 0,002 2 4
    1 3041250 S G intron PRDM16 0,006 2 4
    1 3042502 A S intron PRDM16 0,003 4 6
    1 3053713 A C intron PRDM16 0,002 1 2
    1 3063109 C S intron PRDM16 0,002 0 2
    1 3063593 S C intron PRDM16 0,003 1 2
    1 3076439 C S intron PRDM16 0,001 2 4
    1 3078960 G A intron PRDM16 0,007 2 4
    1 3084268 A C intron PRDM16 0,005 0 2
    1 3084492 S C intron PRDM16 0,0015 0 2
    1 3084786 S C intron PRDM16 0,0015 0 4
    1 3111119 G A intron PRDM16 0,003 1 2
    1 3111643 C S intron PRDM16 0,0041 1 2
    1 3114807 G A intron PRDM16 0,0041 1 2
    1 3165530 C S intron PRDM16 0,0089 1 2
    1 3169325 G A intron PRDM16 0,008 2 4
    1 3179623 C S intron PRDM16 0,003 2 4
    1 3181097 C S intron PRDM16 0,001 2 4
    1 3194000 G C intron PRDM16 0,005 2 4
    1 3195769 S C intron PRDM16 0,002 1 2
    1 3197351 C S intron PRDM16 0,0061 1 2
    1 3224100 C A intron PRDM16 0,003 2 4
    1 3228644 G S intron PRDM16 0,001 2 4
    1 3234045 G A intron PRDM16 0,002 1 2
    1 3235971 G A intron PRDM16 0,0089 1 2
    1 3274115 C S intron PRDM16 0,001 2 4
    1 3291388 G A intron PRDM16 0,002 2 4
    1 3295658 A C intron PRDM16 0,0068 0 6
    1 3295937 A C intron PRDM16 0,0068 0 2
    1 3296205 S C intron PRDM16 0,0083 0 2
    1 3315690 G A intron PRDM16 0,001 2 4
    1 3329212 G A ekson PRDM16 0,0031 missense 1 2
    1 3331787 C S intron PRDM16 0,0099 1 2
    1 3370316 G C upstream ARHGEF16 0,001 2 4
    1 3379560 A G intron ARHGEF16 0,0051 0 6
    1 3391174 C S intron ARHGEF16 0,006 1 2
    1 3413873 G A ekson MEGF6 0,003 missense 1 2
    1 3416272 C S ekson MEGF6 0,0072 senyap 2 4
    1 3417122 G A intron MEGF6 0,0038 2 4
    1 3436219 G A intron MEGF6 0,0046 2 4
    1 12907456 A G ekson HNRNPCL1,LOC649330 0,006 missense 0 10
    1 12907518 C A ekson HNRNPCL1,LOC649330 1,0E-4 missense 0 10
    1 12908499 G C intron HNRNPCL1 0,0031 0 8
    1 12931660 G C intergenik PRAMEF2,PRAMEF4 0,004 1 2
    1 12937721 G S intergenik PRAMEF2,PRAMEF4 0,0038 0 2
    1 12940827 G S intron PRAMEF4 0,007 2 4
    1 12942759 S G intron PRAMEF4 0,0076 0 10
    1 12942805 S G intron PRAMEF4 0,0061 0 12
    1 12942812 G A intron PRAMEF4 0,0061 0 12
    1 12942875 A G intron PRAMEF4 0,0068 0 6
    1 12942912 G C intron PRAMEF4 2,0E-4 0 2
    1 12942937 A S ekson PRAMEF4 0,0029 missense 0 2
    1 12942940 S G ekson PRAMEF4 0,0038 missense 0 2
    1 12943940 S C intron PRAMEF4 0,0015 0 12
    1 12944138 A G intron PRAMEF4 8,0E-4 0 12
    1 12944234 G A intron PRAMEF4 0,0015 0 12
    1 12944589 S G intron PRAMEF4 0,003 0 4
    1 12944845 A C intron PRAMEF4 0,0014 0 6
    1 12946439 S C upstream PRAMEF4 0,0029 0 10
    1 12946833 G A upstream PRAMEF4 0,001 0 8
    1 12946835 S A upstream PRAMEF4 0,004 0 12
    1 12995204 G S intergenik PRAMEF8,PRAMEF6 0,003 1 4
    1 12997638 S C downstream PRAMEF6,PRAMEF9 0,003 2 4
    1 13007841 G C upstream PRAMEF6 0,0043 0 8
    1 13019228 S A intergenik PRAMEF6,LOC391003 0,0015 0 10
    1 13038503 G A UTR3 LOC391003 0,0072 1 2
    1 13051650 C S intergenik LOC391003,PRAMEF5 0,002 2 4
    1 15706063 G A intron FHAD1 0,0029 1 2
    1 15713292 C S intron FHAD1 0,001 1 2
    1 15766541 G C intron CTRC 0,001 1 2
    1 15782601 S C upstream CELA2A 0,0038 1 2
    1 15828125 G A intron CASP9 0,0014 2 4
    1 15831037 G A intron CASP9 0,0099 1 2
    1 15840513 S G intron CASP9 0,0043 2 4
    1 15868742 G A intron DNAJC16 0,001 1 2
    1 15876704 G A intron DNAJC16 0,001 1 2
    1 15900342 C A intron AGMAT 0,001 1 2
    1 15906257 S C intron AGMAT 8,0E-4 1 2
    1 15911897 A G upstream AGMAT 0,0043 2 4
    1 22764178 C S intergenik WNT4,ZBTB40 0,001 2 4
    1 22791939 C S intron ZBTB40 0,0089 2 4
    1 22874394 C G intergenik ZBTB40,EPHA8 0,007 1 2
    1 22875103 C G intergenik ZBTB40,EPHA8 0,007 1 2
    1 22906403 C S intron EPHA8 0,008 2 4
    1 22912956 G A intron EPHA8 0,001 1 2
    1 22917007 C S intron EPHA8 0,001 2 4
    1 22927240 G A ekson EPHA8 0,0013 missense 2 4
    1 22932265 G A intergenik EPHA8,MIR6127 0,0089 2 4
    1 22944057 C S intergenik EPHA8,MIR6127 0,0089 2 4
    1 22978799 A G upstream C1QB 0,0099 2 4
    1 35170588 C S intergenik C1orf94,GJB5 0,01 1 2
    1 35172426 C S intergenik C1orf94,GJB5 0,008 1 2
    1 35172447 G A intergenik C1orf94,GJB5 0,001 1 2
    1 35175302 C S intergenik C1orf94,GJB5 0,008 1 2
    1 35177410 A S intergenik C1orf94,GJB5 0,001 1 2
    1 35178768 C S intergenik C1orf94,GJB5 0,0014 2 4
    1 35179362 G A intergenik C1orf94,GJB5 0,0014 2 4
    1 35186166 G A intergenik C1orf94,GJB5 0,0099 2 4
    1 35186520 A C intergenik C1orf94,GJB5 0,002 2 4
    1 35196361 G A intergenik C1orf94,GJB5 0,0099 2 4
    1 35223545 C S ekson GJB5 0,001 senyap 1 2
    1 35224029 G A UTR3 GJB5 0,003 1 2
    1 35227895 S C UTR3 GJB4 5,0E-4 1 2
    1 35230455 G S intergenik GJB4,GJB3 0,0043 1 2
    1 35232954 S C intergenik GJB4,GJB3 0,003 1 2
    1 35237986 G A intergenik GJB4,GJB3 0,0014 1 2
    1 35245522 C S intergenik GJB4,GJB3 0,001 1 2
    1 35256979 C S intergenik GJB3,GJA4 0,002 2 4
    1 35263872 C S intergenik GJA4,SMIM12 5,0E-4 2 4
    1 35323895 A C intron SMIM12 0,0027 2 4
    1 35369676 G A intron DLGAP3 0,007 2 4
    1 35371634 S A upstream DLGAP3 0,0015 0 4
    1 39253519 G A intergenik LINC01343,RRAGC 0,005 1 2
    1 39288829 G A intergenik LINC01343,RRAGC 0,0051 1 2
    1 39289832 A C intergenik LINC01343,RRAGC 0,002 0 2
    1 39312638 G A intron RRAGC 0,0038 2 4
    1 39361372 G A intron RHBDL2 0,005 1 2
    1 39363826 S G intron RHBDL2 0,0029 1 2
    1 39367555 S C intron RHBDL2 0,007 2 4
    1 39369531 S C intron RHBDL2 0,001 2 4
    1 39370202 S C intron RHBDL2 0,01 1 2
    1 39449101 A G intergenik RHBDL2,AKIRIN1 0,001 2 4
    1 39475057 G A intergenik AKIRIN1,NDUFS5 0,01 1 2
    1 39485016 C S intergenik AKIRIN1,NDUFS5 0,001 2 4
    1 39488137 A G intergenik AKIRIN1,NDUFS5 0,001 2 4
    1 39499212 A C intron NDUFS5 0,001 0 2
    1 39500605 C G downstream NDUFS5 0,002 0 10
    1 46813814 S C intron NSUN4 0,0014 1 2
    1 46817258 A G intron NSUN4 0,005 0 2
    1 46843158 S C intergenik NSUN4,FAAH 0,001 1 2
    1 46933509 A G intergenik LINC01398,DMBX1 0,002 1 2
    1 46935021 G A intergenik LINC01398,DMBX1 0,004 2 4
    1 46939253 S A intergenik LINC01398,DMBX1 0,004 2 4
    1 46951788 C A intergenik LINC01398,DMBX1 0,002 2 4
    1 46980864 G C downstream DMBX1 0,003 1 2
    1 46989657 S C intergenik DMBX1,MKNK1-AS1 0,007 1 2
    1 46994678 C S intergenik DMBX1,MKNK1-AS1 0,002 1 2
    1 46999438 S C intergenik DMBX1,MKNK1-AS1 0,002 1 2
    1 92761505 A G intron GLMN 0,001 2 4
    1 92764270 G C intron GLMN 0,001 2 4
    1 92802210 G A intron RPAP2 0,0072 1 2
    1 92820663 S A intron RPAP2 0,0058 1 2
    1 92820664 G S intron RPAP2 0,0058 1 2
    1 92820953 G A intron RPAP2 0,007 2 4
    1 92824766 A G intron RPAP2 0,0058 1 2
    1 92849183 C A intron RPAP2 0,01 2 4
    1 92850696 C G intron RPAP2 0,0023 1 2
    1 92861357 S C intergenik RPAP2,GFI1 0,01 2 4
    1 92877460 C G intergenik RPAP2,GFI1 0,002 1 2
    1 92880643 A G intergenik RPAP2,GFI1 0,001 2 4
    1 92911540 G A intergenik RPAP2,GFI1 0,004 2 4
    1 92911721 A C intergenik RPAP2,GFI1 0,0031 0 8
    1 92918277 C S intergenik RPAP2,GFI1 0,001 2 4
    1 92950920 G A intron GFI1 0,008 2 4
    1 92964788 G A intergenik GFI1,EVI5 0,0023 1 2
    1 92977480 C S UTR3 EVI5 0,002 1 2
    1 92985213 C S intron EVI5 0,001 2 4
    1 92988342 C S intron EVI5 0,008 2 4
    1 92992283 G A intron EVI5 0,01 2 4
    1 92999760 C S intron EVI5 0,003 1 2
    1 93005149 G C intron EVI5 0,003 0 4
    1 93018543 A S intron EVI5 0,01 2 4
    1 93033744 C S intron EVI5 0,001 2 4
    1 111400296 G A intergenik KCNA3,CD53 0,0014 2 4
    1 111411924 C S intergenik KCNA3,CD53 0,003 1 2
    1 111441850 C G UTR3 CD53 0,003 2 4
    1 111451527 C S intergenik CD53,LRIF1 0,008 2 4
    1 111454082 C A intergenik CD53,LRIF1 0,001 2 4
    1 111466506 A G intergenik CD53,LRIF1 0,001 2 4
    1 111525974 G A intergenik LRIF1,DRAM2 0,002 2 4
    1 111574573 G S intergenik LRIF1,DRAM2 0,0072 2 4
    1 111574594 S A intergenik LRIF1,DRAM2 0,005 1 2
    1 111574647 G A intergenik LRIF1,DRAM2 0,005 1 2
    1 111591746 S A intergenik LRIF1,DRAM2 0,005 1 2
    1 111601459 A G intergenik LRIF1,DRAM2 0,005 1 2
    1 111604748 G C intergenik LRIF1,DRAM2 0,005 1 2
    1 112191526 S G intron RAP1A 0,001 2 4
    1 112206765 A G intron RAP1A 0,0043 1 2
    1 112226517 G A intron RAP1A 0,001 0 2
    1 112263324 G S intergenik RAP1A,FAM212B 0,003 2 4
    1 112264843 G A UTR3 FAM212B 0,001 1 2
    1 112285810 C S ncRNA_intronic FAM212B-AS1 0,004 1 2
    1 112304285 S C intron DDX20 0,0043 1 2
    1 112307213 A C intron DDX20 0,0043 1 2
    1 112309436 G S ekson DDX20 0,0 missense 1 2
    1 112317384 S C intergenik DDX20,KCND3 0,0014 1 2
    1 112381367 C S intron KCND3 0,002 1 2
    1 112396571 G S ncRNA_exonic KCND3-IT1 0,001 1 2
    1 113520038 G A intergenik SLC16A1-AS1,LOC100996251 0,0023 1 2

    Kueri serupa yang menggunakan data dari 1,000 Genomes Phase 3 memerlukan waktu sekitar 90 detik untuk berjalan dan memproses sekitar 3,38 TB data.