Parsing and Annotating VCF Files

This page describes how to:

  • Parse annotation fields in input VCF files.
  • Annotate VCF files while they're being imported, using open source annotation tools and databases like Variant Effect Predictor (VEP).

Parsing annotation fields

The Variant Transforms tool supports an annotation field format specified in Variant annotations in VCF format. VEP uses this same format. The following example shows a VCF header from a file that was annotated by VEP:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|ALLELE_NUM|...>

The corresponding annotations for a sample variant record where REF=AAC and ALT=TAC,A look like this:

CSQ=-|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene|||2,...,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||1

Multiple comma-separated annotation lists are provided for each variant record. The first field in each set corresponds to an alternate allele (ALT). The example above only uses a subset for each ALT. The actual file has more fields, and each alternate allele can be repeated multiple times with different annotations.

You can pass the --annotation_fields when running the Variant Transforms tool, which splits annotation lists and moves them under their corresponding alternate_bases records. For example, you can see below that there is a repeated record field, CSQ, under each alternate_bases:

Name Type Mode Description
alternate_bases.CSQ RECORD REPEATED List of CSQ annotations for this alternate.
alternate_bases.CSQ.allele STRING NULLABLE The ALT part of the annotation field.
alternate_bases.CSQ.Consequence STRING Nullable None
alternate_bases.CSQ.IMPACT STRING Nullable None
alternate_bases.CSQ.SYMBOL STRING Nullable None
alternate_bases.CSQ.Gene STRING Nullable None
alternate_bases.CSQ.Feature_type STRING Nullable None
alternate_bases.CSQ.Feature STRING Nullable None
alternate_bases.CSQ.BIOTYPE STRING Nullable None
alternate_bases.CSQ.EXON STRING Nullable None
alternate_bases.CSQ.INTRON STRING Nullable None

Alternate allele matching

In the above example, the alternate alleles listed in the CSQ column are different from the ALT strings of the variant record. For example, the AAC->C variant is represented by a -, which signifies a deletion. This occurs because VEP runs on the dataset with the --minimal flag. Variant Transforms supports matching ALT fields in this minimal mode.

The following modes of ALT matching are supported:

  • By default, if no ALT matching flag is set, the Variant annotations in VCF format specification is used.

  • As shown in the above example, when using the --use_allele_num flag, the Variant Transforms tool looks up the value of the ALLELE_NUM annotation field. This is a feature of VEP.

  • You can simulate the --minimal flag with the Variant Transforms tool by using the --minimal_vep_alt_matching flag. However, this can result in ambiguous matching. These ambiguous matches are counted and reported in the Cloud Dataflow Console. They are also added under a new field in the created table.

    For best results, use --minimal rather than--minimal_vep_alt_matching.

Automatically annotating VCF files using VEP

Typically, if you want to parse annotation fields and break them up into separate fields, you have to first pre-annotate your VCF files. However, you do not need to pre-annotate your VCF files when using the Variant Transforms tool.

The Variant Transforms can automatically annotate VCF files as they are being loaded into BigQuery. To do so, you must run VEP with the Variant Transforms tool, which requires creating and running Compute Engine virtual machine instances containing preconfigured Docker images.

Running VEP with the Variant Transforms tool

To run the Variant Transforms tool with VEP and automatically annotate VCF files as they are being loaded into BigQuery, you must run the Variant Transforms tool with at least the following flags:

  • --run_annotation_pipeline
  • Either --max_num_workers MAX_NUM_WORKERS or --num_workers NUM_WORKERS
  • --annotation_output_dir gs://CLOUD_STORAGE_BUCKET

    The bucket you choose for the CLOUD_STORAGE_BUCKET variable must be owned by your project.

A full list of the flags you can pass to the Variant Transforms tool when running VEP is provided below:

Flag Default Description
--run_annotation_pipeline False Enables automatic annotation.
--max_num_workers or --num_workers Automatically determined The maximum number of workers in the job to which Cloud Dataflow will autoscale or the number of workers that are initially assigned to the job, respectively.
--annotation_output_dir None A Cloud Storage path for the VEP output files. The hierarchy of input files is replicated in this location, but the new files are appended with _vep_output.vcf. If the directory already exists, then the Variant Transforms tool will fail.
--vep_image_uri gcr.io/cloud-lifesciences/vep_91 The Cloud Genomics VEP Docker image. Google maintains the default image (VEP version 91).
--vep_cache_path gs://cloud-lifesciences/vep/vep_cache_homo_sapiens_GRCh38_91.tar.gz (Recommended for a human genome aligned with a GRCh38 reference sequence) A Cloud Storage path for the compressed version of the VEP cache. You can create this file using the --build_vep_cache.sh script.
--vep_info_field CSQ_VT The name of a new INFO field that holds the new annotations.
--vep_num_fork 2 The number of local processes used when running VEP on a single file. See the --fork [num_forks] flag in the VEP documentation for more information.
--shard_variants True Determines whether input files are sharded into smaller, temporary VCF files before annotating the VCF files using VEP. If the input files are small (for example, if each VCF file contains less than 50,000 variants), then setting this flag to True can waste compute resources.
--number_of_variants_per_shard 2,000 If shard_variants is True, the maximum number of variants written to each shard. If you have a dataset with many samples, you might want to use a smaller value. The pipeline might take longer to finish when a smaller value is supplied. The default value is recommended in most cases.

Troubleshooting

If you run VEP with the Variant Transforms tool and the tool fails, a logs directory is created that provides logs from the virtual machines where the process ran. The logs directory is specified in the path used for the --annotation_output_dir flag. You can inspect these log files, in addition to the standard Variant Transforms tool logs, to determine the cause of the issue.

The VEP flag --check_ref is automatically enabled when you run VEP with the Variant Transforms tool. As a result, the files in the logs directory will contain a list of variants that do not match the reference. If you are using the correct VEP cache, then no mismatches should appear because mismatches are automatically dropped and do not appear in the BigQuery output table.

Sample queries

Extracting high impact variants

The following sample shows how to extract high impact variants in the HBB gene, which encodes the beta chain of hemoglobin:

#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt, CSQ.*
FROM vcf_imports_external.gnomad_genomes_chr_hg19 AS T, T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE CSQ.SYMBOL = "HBB" AND CSQ.IMPACT = "HIGH"
ORDER BY start_position

Running the query returns:

Row reference_name start_position reference_bases alt allele Consequence IMPACT SYMBOL Gene Feature_type Feature BIOTYPE EXON INTRON
1 11 5246947 G GC C frameshift_variant HIGH HBB ENSG00000244734 Transcript ENST00000335295 protein_coding 3/3
2 11 5246957 T C C splice_acceptor_variant HIGH HBB ENSG00000244734 Transcript ENST00000335295 proten_coding 2/2
3 11 5246957 T C C splice_acceptor_variant&non_coding_transcript_variant HIGH HBB ENSG00000244734 Transcript ENST00000475226 retained_intron 1/1
4 11 5247805 C T T splice_donor_variant HIGH HBB ENSG00000244734 Transcript ENST00000335295 protein_coding 2/2
5 11 5247805 C T T splice_donor_variant&non_coding_transcript_variant HIGH HBB ENSG00000244734 Transcript ENST00000475226 retained_intron 1/1
6 11 5247991 CAAAG C - frameshift_variant HIGH HBB ENSG00000244734 Transcript ENST00000335295 protein_coding 2/3
7 11 5247991 CAAAG C - frameshift_variant HIGH HBB ENSG00000244734 Transcript ENST00000380315 protein_coding 4/4
8 11 5248003 G A A stop_gained HIGH HBB ENSG00000244734 Transcript ENST00000335295 protein_coding 2/3
9 11 5248003 G A A stop_gained HIGH HBB ENSG00000244734 Transcript ENST00000380315 protein_coding 4/4
10 11 5248199 T A A stop_gained HIGH HBB ENSG00000244734 Transcript ENST00000335295 protein_coding 1/3
11 11 5248199 T A A stop_gained HIGH HBB ENSG00000244734 Transcript ENST00000380315 protein_coding 3/4

The gnomAD dataset does not have any calls or samples, but you can extend this query to select samples that have high impact variants.

Joining data with public BigQuery databases

The following sample shows how to join data with other databases, such as pathway databases, that are available in BigQuery. Some of the databases available through the ISB-CGC are:

The following sample shows how to extract high impact variants in the double-strand break repair pathway, as defined by the GO (Gene Ontology) biological process (GO:0006302):

#standardSQL
SELECT
  reference_name, start_position, reference_bases, ALT.alt,
  CSQ.Consequence, CSQ.Impact, CSQ.SYMBOL
FROM
  `vcf_imports_external.gnomad_genomes_chr_hg19` AS T,
  T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE
  # Note: Matching based on symbol is "best effort" as the names may not be
  # standardized across sources.
  CSQ.SYMBOL IN (SELECT DB_Object_Symbol
                 FROM `isb-cgc.genome_reference.GO_Annotations`
                 WHERE GO_ID = 'GO:0006302')
  AND CSQ.IMPACT = "HIGH"
ORDER BY
  start_position

Running the query returns:

Row reference_name start_position reference_bases alt Consequence IMPACT SYMBOL
1 5 917449 G T splice_acceptor_variant&non_coding_transcript_variant HIGH TRIP13
2 12 1022568 A C stop_gained HIGH RAD52
3 12 1022568 A C stop_gained HIGH RAD52
4 12 1022568 A C stop_gained HIGH RAD52
5 12 1023125 G A stop_gained HIGH RAD52
6 12 1023125 G A stop_gained HIGH RAD52
7 12 1023125 G A stop_gained HIGH RAD52
8 12 1023167 G A stop_gained HIGH RAD52
9 12 1023167 G A stop_gained HIGH RAD52
10 12 1023167 G A stop_gained HIGH RAD52
11 12 1023217 G T stop_gained HIGH RAD52
12 12 1023217 G T stop_gained HIGH RAD52
13 12 1023217 G T stop_gained HIGH RAD52
14 12 1025654 TGA T frameshift_variant HIGH RAD52
15 12 1025699 T G splice_acceptor_variant&non_coding_transcript_variant HIGH RAD52
16 12 1036004 GC G frameshift_variant HIGH RAD52
17 12 1038977 C CT frameshift_variant HIGH RAD52
¿Te ha resultado útil esta página? Enviar comentarios:

Enviar comentarios sobre...