This page describes how to:
- Parse annotation fields in input VCF files.
- Annotate VCF files while they're being imported, using open source annotation tools and databases like Variant Effect Predictor (VEP).
Parsing annotation fields
The Variant Transforms tool supports an annotation field format specified in Variant annotations in VCF format. VEP uses this same format. The following example shows a VCF header from a file that was annotated by VEP:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|ALLELE_NUM|...>
The corresponding annotations for a sample variant record where REF=AAC
and
ALT=TAC,A
look like this:
CSQ=-|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000423562|unprocessed_pseudogene|||2,...,T|upstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000518655|transcribed_unprocessed_pseudogene|||1
Multiple comma-separated annotation lists are provided for each variant record.
The first field in each set corresponds to an alternate allele (ALT
). The
example above only uses a subset for each ALT
. The actual file has more fields,
and each alternate allele can be repeated multiple times with different annotations.
You can pass the --annotation_fields
when running the Variant Transforms tool, which splits annotation lists and
moves them under their corresponding alternate_bases
records. For example,
you can see below that there is a repeated record field, CSQ
, under each
alternate_bases
:
Name | Type | Mode | Description |
---|---|---|---|
alternate_bases.CSQ | RECORD | REPEATED | List of CSQ annotations for this alternate. |
alternate_bases.CSQ.allele | STRING | NULLABLE | The ALT part of the annotation field. |
alternate_bases.CSQ.Consequence | STRING | Nullable | None |
alternate_bases.CSQ.IMPACT | STRING | Nullable | None |
alternate_bases.CSQ.SYMBOL | STRING | Nullable | None |
alternate_bases.CSQ.Gene | STRING | Nullable | None |
alternate_bases.CSQ.Feature_type | STRING | Nullable | None |
alternate_bases.CSQ.Feature | STRING | Nullable | None |
alternate_bases.CSQ.BIOTYPE | STRING | Nullable | None |
alternate_bases.CSQ.EXON | STRING | Nullable | None |
alternate_bases.CSQ.INTRON | STRING | Nullable | None |
Alternate allele matching
In the above example, the alternate alleles listed in the CSQ
column are
different from the ALT
strings of the variant record. For example, the
AAC->C
variant is represented by a -
, which signifies a deletion. This
occurs because VEP runs on the dataset with the --minimal
flag. Variant Transforms supports matching ALT
fields in this minimal
mode.
The following modes of ALT
matching are supported:
By default, if no
ALT
matching flag is set, the Variant annotations in VCF format specification is used.As shown in the above example, when using the
--use_allele_num
flag, the Variant Transforms tool looks up the value of theALLELE_NUM
annotation field. This is a feature of VEP.You can simulate the
--minimal
flag with the Variant Transforms tool by using the--minimal_vep_alt_matching
flag. However, this can result in ambiguous matching. These ambiguous matches are counted and reported in the Dataflow Console. They are also added under a new field in the created table.For best results, use
--minimal
rather than--minimal_vep_alt_matching
.
Automatically annotating VCF files using VEP
Typically, if you want to parse annotation fields and break them up into separate fields, you have to first pre-annotate your VCF files. However, you do not need to pre-annotate your VCF files when using the Variant Transforms tool.
The Variant Transforms can automatically annotate VCF files as they are being loaded into BigQuery. To do so, you must run VEP with the Variant Transforms tool, which requires creating and running Compute Engine virtual machine instances containing preconfigured Docker images.
Running VEP with the Variant Transforms tool
To run the Variant Transforms tool with VEP and automatically annotate VCF files as they are being loaded into BigQuery, you must run the Variant Transforms tool with at least the following flags:
--run_annotation_pipeline
- Either
--max_num_workers MAX_NUM_WORKERS
or--num_workers NUM_WORKERS
--annotation_output_dir gs://CLOUD_STORAGE_BUCKET
The bucket you choose for the CLOUD_STORAGE_BUCKET variable must be owned by your project.
A full list of the flags you can pass to the Variant Transforms tool when running VEP is provided below:
Flag | Default | Description |
---|---|---|
--run_annotation_pipeline |
False | Enables automatic annotation. |
--max_num_workers or --num_workers |
Automatically determined | The maximum number of workers in the job to which Cloud Dataflow will autoscale or the number of workers that are initially assigned to the job, respectively. |
--annotation_output_dir |
None | A Cloud Storage path for the VEP output files. The hierarchy of input files is replicated in this location, but the new files are appended with _vep_output.vcf . If the directory already exists, then the Variant Transforms tool will fail. |
--vep_image_uri |
gcr.io/gcp-variant-annotation/vep_91 |
The Cloud Genomics VEP Docker image. Google maintains the default image (VEP version 91). |
--vep_cache_path |
gs://gcp-variant-annotation-vep-cache/vep_cache_homo_sapiens_GRCh38_91.tar.gz (Recommended for a human genome aligned with a GRCh38 reference sequence) |
A Cloud Storage path for the compressed version of the VEP cache. You can create this file using the --build_vep_cache.sh script. |
--vep_info_field |
CSQ_VT |
The name of a new INFO field that holds the new annotations. |
--vep_num_fork |
2 | The number of local processes used when running VEP on a single file. See the --fork [num_forks] flag in the VEP documentation for more information. |
--shard_variants |
True | Determines whether input files are sharded into smaller, temporary VCF files before annotating the VCF files using VEP. If the input files are small (for example, if each VCF file contains less than 50,000 variants), then setting this flag to True can waste compute resources. |
--number_of_variants_per_shard |
2,000 | If shard_variants is True , the maximum number of variants written to each shard. If you have a dataset with many samples, you might want to use a smaller value. The pipeline might take longer to finish when a smaller value is supplied. The default value is recommended in most cases. |
Troubleshooting
If you run VEP with the Variant Transforms tool and the tool fails, a logs
directory
is created that provides logs from the virtual machines where the process ran. The
logs
directory is specified in the path used for the --annotation_output_dir
flag. You can inspect these log files, in addition to the standard Variant
Transforms tool logs, to determine the cause of the issue.
The VEP flag --check_ref
is automatically enabled when you run VEP with the Variant Transforms tool.
As a result, the files in the logs
directory will contain a list of variants
that do not match the reference. If you are using the correct VEP cache, then
no mismatches should appear because mismatches are automatically dropped and
do not appear in the BigQuery output table.
Sample queries
Extracting high impact variants
The following sample shows how to extract high impact variants in the HBB gene, which encodes the beta chain of hemoglobin:
#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt, CSQ.*
FROM vcf_imports_external.gnomad_genomes_chr_hg19 AS T, T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE CSQ.SYMBOL = "HBB" AND CSQ.IMPACT = "HIGH"
ORDER BY start_position
Running the query returns:
Row | reference_name | start_position | reference_bases | alt | allele | Consequence | IMPACT | SYMBOL | Gene | Feature_type | Feature | BIOTYPE | EXON | INTRON |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 11 | 5246947 | G | GC | C | frameshift_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | protein_coding | 3/3 | |
2 | 11 | 5246957 | T | C | C | splice_acceptor_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | proten_coding | 2/2 | |
3 | 11 | 5246957 | T | C | C | splice_acceptor_variant&non_coding_transcript_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000475226 | retained_intron | 1/1 | |
4 | 11 | 5247805 | C | T | T | splice_donor_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | protein_coding | 2/2 | |
5 | 11 | 5247805 | C | T | T | splice_donor_variant&non_coding_transcript_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000475226 | retained_intron | 1/1 | |
6 | 11 | 5247991 | CAAAG | C | - | frameshift_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | protein_coding | 2/3 | |
7 | 11 | 5247991 | CAAAG | C | - | frameshift_variant | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000380315 | protein_coding | 4/4 | |
8 | 11 | 5248003 | G | A | A | stop_gained | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | protein_coding | 2/3 | |
9 | 11 | 5248003 | G | A | A | stop_gained | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000380315 | protein_coding | 4/4 | |
10 | 11 | 5248199 | T | A | A | stop_gained | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000335295 | protein_coding | 1/3 | |
11 | 11 | 5248199 | T | A | A | stop_gained | HIGH | HBB | ENSG00000244734 | Transcript | ENST00000380315 | protein_coding | 3/4 |
The gnomAD dataset does not have any calls or samples, but you can extend this query to select samples that have high impact variants.
Joining data with public BigQuery databases
The following sample shows how to join data with other databases, such as pathway databases, that are available in BigQuery. Some of the databases available through the ISB-CGC are:
The following sample shows how to extract high impact variants in the double-strand break repair pathway, as defined by the GO (Gene Ontology) biological process (GO:0006302):
#standardSQL
SELECT
reference_name, start_position, reference_bases, ALT.alt,
CSQ.Consequence, CSQ.Impact, CSQ.SYMBOL
FROM
`vcf_imports_external.gnomad_genomes_chr_hg19` AS T,
T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE
# Note: Matching based on symbol is "best effort" as the names may not be
# standardized across sources.
CSQ.SYMBOL IN (SELECT DB_Object_Symbol
FROM `isb-cgc.genome_reference.GO_Annotations`
WHERE GO_ID = 'GO:0006302')
AND CSQ.IMPACT = "HIGH"
ORDER BY
start_position
Running the query returns:
Row | reference_name | start_position | reference_bases | alt | Consequence | IMPACT | SYMBOL |
---|---|---|---|---|---|---|---|
1 | 5 | 917449 | G | T | splice_acceptor_variant&non_coding_transcript_variant | HIGH | TRIP13 |
2 | 12 | 1022568 | A | C | stop_gained | HIGH | RAD52 |
3 | 12 | 1022568 | A | C | stop_gained | HIGH | RAD52 |
4 | 12 | 1022568 | A | C | stop_gained | HIGH | RAD52 |
5 | 12 | 1023125 | G | A | stop_gained | HIGH | RAD52 |
6 | 12 | 1023125 | G | A | stop_gained | HIGH | RAD52 |
7 | 12 | 1023125 | G | A | stop_gained | HIGH | RAD52 |
8 | 12 | 1023167 | G | A | stop_gained | HIGH | RAD52 |
9 | 12 | 1023167 | G | A | stop_gained | HIGH | RAD52 |
10 | 12 | 1023167 | G | A | stop_gained | HIGH | RAD52 |
11 | 12 | 1023217 | G | T | stop_gained | HIGH | RAD52 |
12 | 12 | 1023217 | G | T | stop_gained | HIGH | RAD52 |
13 | 12 | 1023217 | G | T | stop_gained | HIGH | RAD52 |
14 | 12 | 1025654 | TGA | T | frameshift_variant | HIGH | RAD52 |
15 | 12 | 1025699 | T | G | splice_acceptor_variant&non_coding_transcript_variant | HIGH | RAD52 |
16 | 12 | 1036004 | GC | G | frameshift_variant | HIGH | RAD52 |
17 | 12 | 1038977 | C | CT | frameshift_variant | HIGH | RAD52 |