Analyzing Variants with BigQuery

This page describes how to use BigQuery to analyze variants. The example below shows how to compute the ratio of transitions to transversions in SNPs in each chromosome for each sample.

Analyzing variants from the Illumina Platinum Genomes dataset

The following example uses data from the Illumina Platinum Genomes project that was exported from Cloud Genomics to BigQuery. The name of the table in BigQuery that holds the data is platinum_genomes_deepvariant_variants_20180823.

To analyze the variants in the table:

  1. Go to the BigQuery UI.

    Go to the BigQuery UI

  2. Click Compose query.

  3. Copy and paste the following query into the New Query text area:

    #standardSQL
     --
     -- Compute the transition/transversion ratio per sample and reference name.
     --
     WITH filtered_snp_calls AS (
       SELECT
         reference_name,
         c.name,
         CONCAT(reference_bases, '->', alternate_bases[ORDINAL(1)].alt) AS mutation
       FROM
         `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` AS v, UNNEST(v.call) AS c
       WHERE
         # Only include biallelic SNPs.
         reference_bases IN ('A','C','G','T')
         AND alternate_bases[ORDINAL(1)].alt IN ('A','C','G','T')
         AND (ARRAY_LENGTH(alternate_bases) = 1
           OR (ARRAY_LENGTH(alternate_bases) = 2 AND alternate_bases[ORDINAL(2)].alt = '<*>'))
         # Skip homozygous reference calls and no-calls.
         AND EXISTS (SELECT g FROM UNNEST(c.genotype) AS g WHERE g > 0)
         AND NOT EXISTS (SELECT g FROM UNNEST(c.genotype) AS g WHERE g < 0)
         # Include only high quality calls.
         AND NOT EXISTS (SELECT ft FROM UNNEST(c.filter) ft WHERE ft NOT IN ('PASS', '.'))
     ),

    mutation_type_counts AS ( SELECT reference_name, name, SUM(CAST(mutation IN ('A->G', 'G->A', 'C->T', 'T->C') AS INT64)) AS transitions, SUM(CAST(mutation IN ('A->C', 'C->A', 'G->T', 'T->G', 'A->T', 'T->A', 'C->G', 'G->C') AS INT64)) AS transversions FROM filtered_snp_calls GROUP BY reference_name, name )

    SELECT reference_name, name, transitions, transversions, transitions/transversions AS titv FROM mutation_type_counts WHERE transversions > 0 ORDER BY titv DESC, name

  4. Click Run query. Running the query returns:

    Row reference_name name transitions transversions titv
    1 chr22 NA12892 35299 15017 2.3506026503296265
    2 chr22 NA12889 34091 14624 2.331167943107221
    3 chr17 NA12892 67297 28885 2.3298251687727194
    4 chr22 NA12878 33627 14439 2.3289008934136715
    5 chr22 NA12877 34751 14956 2.3235490772933938
    6 chr22 NA12891 33534 14434 2.323264514341139
    7 chr17 NA12877 70600 30404 2.3220628864623074
    8 chr17 NA12878 66010 28475 2.3181738366988585
    9 chr17 NA12890 67242 29057 2.314141170802216
    10 chr17 NA12889 69767 30189 2.311007320547219
    ... ... ... ... ... ... ...

The titv column shows the transition to transversion ratio.

What's next

Was this page helpful? Let us know how we did:

Send feedback about...

Cloud Genomics