This page describes how to use the VCF files preprocessor in the Variant Transforms tool to validate VCF files. You can run the preprocessor as a standalone validator, or as a helper when loading and transforming VCF files into BigQuery.
Overview of the VCF files preprocessor
The VCF files preprocessor validates datasets containing VCF files to identify inconsistencies between the files. The preprocessor reports these inconsistencies and tries to resolve them when loading data into BigQuery. If the preprocessor cannot resolve an inconsistency automatically, for example when a file is malformed, it suggests corrections that you can make manually.
When you run the preprocessor, it generates a report that identifies three types of inconsistencies:
Inconsistency | Description |
---|---|
(Default) Header conflicts | Indicates that there are different definitions for the same field in different VCF files. This inconsistency is common. The report lists the following: |
(Optional) Malformed variant records | Indicates that there are malformed records that could not be parsed by the VCF parser. The report lists the following: |
(Optional) Inferred headers | Indicates one of the following issues with a field: |
Running the preprocessor
You can run the tool using a Docker image that has all of the necessary binaries and dependencies installed. If you are preprocessing a large number of files, see Handling large inputs with the preprocessor tool.
To run the tool using a Docker image, complete the following steps:
Run the following script to start the preprocessor. Substitute the variables with the relevant resources from your Google Cloud project.
    #!/bin/bash
    # Parameters to replace:
    # The PROJECT_ID is the name of the GCP project that contains your BigQuery dataset.
    GOOGLE_CLOUD_PROJECT=PROJECT_ID
    INPUT_PATTERN=gs://BUCKET/*.vcf
    REPORT_PATH=gs://BUCKET/report.tsv
    RESOLVED_HEADERS_PATH=gs://BUCKET/resolved_headers.vcf
    TEMP_LOCATION=gs://BUCKET/temp

    COMMAND="vcf_to_bq_preprocess \
      --input_pattern ${INPUT_PATTERN} \
      --report_path ${REPORT_PATH} \
      --resolved_headers_path ${RESOLVED_HEADERS_PATH} \
      --report_all_conflicts true \
      --temp_location ${TEMP_LOCATION} \
      --job_name vcf-to-bigquery-preprocess \
      --runner DataflowRunner"

    docker run -v ~/.config:/root/.config \
      gcr.io/cloud-lifesciences/gcp-variant-transforms \
      --project "${GOOGLE_CLOUD_PROJECT}" \
      --zones us-west1-b \
      "${COMMAND}"
When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. Acceptable file formats include GZIP, BZIP, and VCF.
Note that the TEMP_LOCATION directory is used to store temporary files needed to run the tool. It can be any directory in Cloud Storage to which you have write access.
Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.
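Because every path in the launch script is expected to be a Cloud Storage location, a quick pre-flight check can catch a mistyped value before a job is submitted. The following is a minimal sketch, not part of Variant Transforms: the helper names `is_gcs_uri` and `check_gcs_uris` are hypothetical, and the check only inspects the shape of the string (`gs://BUCKET/OBJECT`). It does not verify that the bucket exists or that you have write access.

```shell
#!/bin/bash
# Hypothetical helper (not part of Variant Transforms): succeeds if the
# argument has the shape gs://BUCKET/OBJECT, fails otherwise.
is_gcs_uri() {
  case "$1" in
    gs://?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Checks every argument and reports each one that is not a gs:// URI.
# Returns 0 only if all arguments pass.
check_gcs_uris() {
  local uri status=0
  for uri in "$@"; do
    if ! is_gcs_uri "${uri}"; then
      echo "Not a gs:// URI: '${uri}'" >&2
      status=1
    fi
  done
  return "${status}"
}
```

One way to use it is to call `check_gcs_uris "${INPUT_PATTERN}" "${REPORT_PATH}" "${RESOLVED_HEADERS_PATH}" "${TEMP_LOCATION}" || exit 1` after the variable assignments and before the `docker run` command.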
Example output
The following example shows a report that was generated when the preprocessor tool was run on the 1000 Genomes dataset:
Header Conflicts
| ID | Category | Conflicts | File Paths | Proposed Resolution |
|---|---|---|---|---|
| GL | FORMAT | num=None type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | num=None type=Float |
| | | | gs://genomics-public-data/1000-genomes/vcf/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | |
| | | | gs://genomics-public-data/1000-genomes/vcf/ALL.chr21.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | |
| | | | gs://genomics-public-data/1000-genomes/vcf/ALL.chr8.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | |
| | | | gs://genomics-public-data/1000-genomes/vcf/ALL.chrX.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | |
| | | num=3 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf | |
| GQ | FORMAT | num=1 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf | num=1 type=Float |
| | | num=1 type=Integer | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf | |
Inferred Headers
ID | Category | Proposed Resolution |
---|---|---|
FT | FORMAT | num=1 type=String |
No Malformed Records Found.
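To triage a report like the one above without reading it line by line, you can scan a local copy for its section titles. The sketch below assumes that you have already copied the report to your machine (for example with `gsutil cp`) and that the section titles appear at the start of a line exactly as in the example: `Header Conflicts`, `Inferred Headers`, and `No Malformed Records Found.`. The function name `summarize_report` is hypothetical, and the report layout may differ between tool versions.

```shell
#!/bin/bash
# Hypothetical helper: prints a three-line summary for a local copy of the
# preprocessor report. Section titles are assumed to match the example
# output and to start at the beginning of a line.
summarize_report() {
  local report="$1"

  if grep -q '^Header Conflicts' "${report}"; then
    echo "header conflicts: found"
  else
    echo "header conflicts: none"
  fi

  if grep -q '^Inferred Headers' "${report}"; then
    echo "inferred headers: found"
  else
    echo "inferred headers: none"
  fi

  # Without the "No Malformed Records Found." line, the report needs a
  # manual look; this sketch does not parse the malformed records themselves.
  if grep -q '^No Malformed Records Found' "${report}"; then
    echo "malformed records: none reported"
  else
    echo "malformed records: see report"
  fi
}
```

For example, `summarize_report report.tsv` prints one status line per inconsistency type, which can be enough to decide whether the resolved headers file needs manual review.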