Skip to content

Latest commit

 

History

History
230 lines (181 loc) · 12.2 KB

variant_annotation.md

File metadata and controls

230 lines (181 loc) · 12.2 KB

Annotation Support in Variant Transforms

Overview

The annotation support in Variant Transforms has two goals:

  • First, we want to be able to properly parse annotation fields in input VCFs.
  • The second goal is to be able to annotate VCF files during the import process, using open annotation tools/databases like Variant Effect Predictor (VEP).

Parsing Annotation Fields

The annotation field format that is currently supported is based on this spec developed in the SnpEff project. VEP uses the same format and here is a trimmed example from a file annotated by VEP (taken from the gnomAD dataset):

VCF Header:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|ALLELE_NUM|...>

Corresponding annotations for a sample variant record where REF=AAC and ALT=TAC,A:

CSQ=-|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene|||2,...,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||1

For each variant record, multiple annotation lists are provided which are separated by comma. The first field of each set refers to an alternate allele or ALT (for details of the format see the spec). Note that the actual record of the above example, has more fields and also each alternate allele can be repeated multiple times (with different annotations); here we have only selected a small sub-set, one for each ALT.

When --annotation_fields flag is used, annotation lists are split and moved under their corresponding alternate_bases records. In other words, there will be a repeated record field CSQ under each alternate_bases:

A sample CSQ schema

Alternate Allele Matching

As it can be seen in the above example, the alternate alleles listed in the CSQ field are different from ALT strings of the variant record (for example the AAC->C variant is represented by a single - which means a deletion). This is because, the gnomAD dataset is run through VEP with the --minimal option. Variant Transforms supports matching alt fields in the minimal mode. There are actually three supported modes of ALT matching:

  • If no ALT matching flag is set, the spec is implemented by default (support for cancer samples is not currently implemented and is tracked by this issue).

  • With --use_allele_num, the value of the ALLELE_NUM annotation field is looked up, which is a feature of VEP. This is the case in the above example.

  • Variant Transforms can also simulate the --minimal mode of VEP with --minimal_vep_alt_matching but this is not recommended as it can result in ambiguous matching. The number of such ambiguous cases are counted and reported in the Dataflow dashboard. Also with this flag, a new field is added to the output BigQuery table to mark ambiguous cases. This should be used only as a last resort.

Running VEP through Variant Transforms

Summary

You can run VEP on input VCF files before importing them into BigQuery. The minimum number of flags to enable this feature is --run_annotation_pipeline and --annotation_output_dir [GCS_PATH] where [GCS_PATH] is a path in a GCS bucket that your project owns.

Variant annotation will start a separate Cloud Life Sciences pipeline to run the vep_runner. You can provide --location to specify the location to use for Cloud Life Sciences API. If not provided, it will default to us-central1. The compute region will come from the --region flag passed from docker.

Details

While parsing annotation fields and breaking them into separate fields is a useful feature but it still requires pre-annotated VCFs. To address this need, we have started an effort to provide options to annotate input files as part of the BigQuery import process, i.e., automatically done by Variant Transforms.

Currently, this option is only supported for VEP and that is done by bringing up Virtual Machines (VM) on Google Cloud Platform (GCP) and running VEP docker images. This is slow and improving this experience is on our roadmap.

Relevant flags for this feature are:

  • --run_annotation_pipeline enables the automatic annotation feature.

  • --annotation_output_dir a path on Google Cloud Storage (GCS) where VEP output files are copied. The file hierarchy of input files is replicated at this location with the same file names followed by _vep_output.vcf. Note that if this directory already exists, then Variant Transforms fails. This is to prevent unintentional overwriting of old annotated VCFs.

  • --shard_variants by default, the input files are sharded into smaller temporary VCF files before running VEP annotation. If the input files are small, i.e., each VCF file contains less than 50,000 variants, setting this flag can be computationally wasteful.

  • --number_of_variants_per_shard the maximum number of variants written to each shard if shard_variants is true. The default value should work for most cases. You may change this flag to a smaller value if you have a dataset with a lot of samples. Notice that pipeline may take longer to finish for smaller value of this flag.

  • --vep_image_uri the docker image for VEP created using the Dockerfile in variant-annotation GitHub repo. By default gcr.io/cloud-lifesciences/vep:104 is used which is a public image that Google maintains (VEP version 104).

  • --vep_cache_path the GCS location that has the compressed version of VEP cache. This file can be created using build_vep_cache.sh script. By default gs://cloud-lifesciences/vep/vep_cache_homo_sapiens_GRCh38_104.tar.gz is used which is good for human genome aligned with GRCh38 reference sequence.

  • --vep_info_field by default, an INFO field called CSQ_VT is added to hold the new annotations; use this field to override the field name.

  • --vep_num_fork number of forks when running VEP, see --fork option of VEP.

For other parameters, like how many VMs to use or where the VMs should be located, the same parameters for Dataflow pipeline execution are reused, e.g., --num_workers. It is recommended to provide one worker per 50,000 variants. For instance, for a dataset with 5,000,000 variants, you may set --num_workers to 100.

Caveats and troubleshooting

In the annotation_output_dir, beside output annotated VCFs, there is a logs directory which contains the logs from virtual machines on which VEP was run. If VEP fails, this is a good place to look for causes (in addition to usual Variant Transforms log files).

Finally, the --check_ref option of VEP is enabled for these runs and the above log files contain a report of variants that do not match the reference. If you are using the right VEP cache, there should be no mismatch. Note that mismatches are dropped and hence are not present in the final BigQuery output table.

Sample Queries

Querying annotations directly

We imported the gnomAD genomes table using the features described in Parsing Annotation Fields.

The gnomAD sample query notebook has additional queries you can run to practice querying annotation data.

Here is another sample query to extract high impact variants in the HBB gene which encodes the beta chain of Haemoglobin:

#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt, CSQ.*
FROM vcf_imports_external.gnomad_genomes_chr_hg19 AS T, T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE CSQ.SYMBOL = "HBB" AND CSQ.IMPACT = "HIGH"
ORDER BY start_position

and here is a snapshot from the BigQuery UI for this query:

A sample query on the gnomAD genomes table

Note that the gnomAD dataset does not have any calls or samples but it is fairly simple practice to extend this query to pick samples that have these high impact variants.

Joining with other databases

You can also join with other databases that are available in BigQuery. For instance, there are several pathway databases that are available publicly in BigQuery. Here is a sample query to extract high impact variants that are in the double-strand break repair pathway as defined by the GO (Gene Ontology) biological process (GO:0006302).

#standardSQL
SELECT
  reference_name, start_position, reference_bases, ALT.alt,
  CSQ.Consequence, CSQ.Impact, CSQ.SYMBOL
FROM
  `vcf_imports_external.gnomad_genomes_chr_hg19` AS T,
  T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE
  # Note: Matching based on symbol is "best effort" as the names may not be
  # standardized across sources.
  CSQ.SYMBOL IN (SELECT DB_Object_Symbol
                 FROM `isb-cgc.genome_reference.GO_Annotations`
                 WHERE GO_ID = 'GO:0006302')
  AND CSQ.IMPACT = "HIGH"
ORDER BY
  start_position

A sample query joining annotations with GO database

As part of the ISB-CGC project, several more public databases are also made available in BigQuery including Reactome (table) and WikiPathways (table).