Validating VCF files with the preprocessor tool

This page describes how to use the VCF files preprocessor, which is part of the Variant Transforms tool, to validate VCF files. You can run the preprocessor as a standalone validator, or as a helper when loading and transforming VCF files into BigQuery.

Overview of the VCF files preprocessor

The VCF files preprocessor validates datasets containing VCF files to identify inconsistencies between the files. The preprocessor reports these inconsistencies and attempts to resolve them when loading data into BigQuery. If the preprocessor cannot determine how to resolve an inconsistency, for example when a file is malformed, it suggests corrections that you can make manually.

When you run the preprocessor, it generates a report that identifies three types of inconsistencies:

Inconsistency: Header conflicts (default)
Description: Indicates that different VCF files contain different definitions for the same field. This inconsistency is common. The report lists the following:
  • All conflicting definitions
  • The file paths of the corresponding files (maximum of five)
  • Suggested resolutions

Inconsistency: Malformed variant records (optional)
Description: Indicates that there are malformed records that the VCF parser could not parse. The report lists the following:
  • The file paths of the affected files
  • The malformed records

Inconsistency: Inferred headers (optional)
Description: Indicates one of the following issues with a field:
  • A field was used that has no definition in any of the VCF files.
  • A field with a definition was used, but the field value does not match the field definition. For example:
    • The field's defined type is an integer, but the provided value is a float. The preprocessor infers that the type should be float.
    • The defined num is A (meaning there is one value for each alternate base), but the provided values do not have the same cardinality as the alternate bases. The preprocessor infers that the num is unknown.
  In both cases, the preprocessor infers the type and num of the affected fields and provides them in the report.
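The two inferred-header checks described above can be illustrated with a small shell sketch. The helper names `infer_type`, `count_items`, and `infer_num_for_a` are hypothetical and are not part of Variant Transforms; the tool's actual inference logic may differ. The sketch only shows the idea of comparing a provided value against its declared type and num.

```shell
# Hypothetical helpers for illustration only; not part of Variant Transforms.

# Print the narrowest VCF type (Integer, Float, or String) that fits one value.
infer_type() {
  if printf '%s\n' "$1" | grep -Eq '^[+-]?[0-9]+$'; then
    echo "Integer"
  elif printf '%s\n' "$1" | grep -Eq '^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?$'; then
    echo "Float"
  else
    echo "String"
  fi
}

# Count comma-separated items: the number of commas plus one.
count_items() {
  echo $(( $(printf '%s' "$1" | tr -cd ',' | wc -c) + 1 ))
}

# Print "A" when there is one value per alternate base, otherwise "."
# (the VCF notation for an unknown number of values).
# $1: comma-separated alternate bases, $2: comma-separated field values.
infer_num_for_a() {
  if [ "$(count_items "$1")" -eq "$(count_items "$2")" ]; then
    echo "A"
  else
    echo "."
  fi
}

infer_type 7                    # prints Integer
infer_type 0.25                 # prints Float
infer_num_for_a "A,T" "0.1"     # prints .
```

A field declared as Integer whose value yields `Float` here, or a field declared with num A for which the last helper prints `.`, corresponds to the kind of mismatch that the report lists under inferred headers.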

Running the preprocessor

You can run the tool using a Docker image that has all of the necessary binaries and dependencies installed. If you are preprocessing a large number of files, see Handling large inputs with the preprocessor tool.

To run the tool using a Docker image, complete the following steps:

  1. Run the following script to start the preprocessor. Substitute the variables with the relevant resources from your Google Cloud project.

    # Parameters to replace:
    # GOOGLE_CLOUD_PROJECT is the name of the GCP project that contains your BigQuery dataset.
    # INPUT_PATTERN is the Cloud Storage location of your VCF files.
    # TEMP_LOCATION is a Cloud Storage directory for temporary files.
    COMMAND="vcf_to_bq_preprocess \
      --input_pattern ${INPUT_PATTERN} \
      --report_path ${REPORT_PATH} \
      --resolved_headers_path ${RESOLVED_HEADERS_PATH} \
      --report_all_conflicts true \
      --temp_location ${TEMP_LOCATION} \
      --job_name vcf-to-bigquery-preprocess \
      --runner DataflowRunner"
    docker run -v ~/.config:/root/.config \
      --project "${GOOGLE_CLOUD_PROJECT}" \
      --zones us-west1-b \
      "${COMMAND}"

    When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. Acceptable file formats include GZIP, BZIP, and VCF.

    Note that the TEMP_LOCATION directory is used to store temporary files needed to run the tool. It can be any directory in Cloud Storage to which you have write access.

  2. Wait for the job to complete. Depending on several factors, such as the size of your data, this can take anywhere from several minutes to an hour or more.
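The variables referenced in step 1 must be set before the script runs. The following is a minimal sketch of how they might be defined. The project name `my-project`, the bucket `my-bucket`, the directory layout, and the `.tsv` and `.vcf` file names are all placeholder assumptions, and `is_gcs_path` is a hypothetical helper that only checks for the `gs://` prefix.

```shell
# Placeholder values (assumptions); replace with your own project and bucket.
GOOGLE_CLOUD_PROJECT="my-project"
INPUT_PATTERN="gs://my-bucket/vcf/*.vcf"
REPORT_PATH="gs://my-bucket/preprocess/report.tsv"
RESOLVED_HEADERS_PATH="gs://my-bucket/preprocess/resolved_headers.vcf"
TEMP_LOCATION="gs://my-bucket/temp"

# Hypothetical helper: succeeds only when the argument starts with gs://
is_gcs_path() {
  case "$1" in
    gs://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Warn early if the input or temporary location is not a Cloud Storage path.
for path in "${INPUT_PATTERN}" "${TEMP_LOCATION}"; do
  is_gcs_path "${path}" || echo "Not a Cloud Storage path: ${path}" >&2
done
```

Quoting `INPUT_PATTERN` keeps the local shell from expanding the wildcard, so the pattern reaches the preprocessor unchanged.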

Example output

The following example shows a report that was generated when the preprocessor tool was run on the 1000 Genomes dataset:

Header Conflicts

ID | Category | Conflicts | File Paths | Proposed Resolution
GL | FORMAT | num=None type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf | num=None type=Float
   |        | num=3 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf |
GQ | FORMAT | num=1 type=Float | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf | num=1 type=Float
   |        | num=1 type=Integer | gs://genomics-public-data/1000-genomes/vcf/ALL.chrY.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf |
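The proposed resolutions for GL and GQ are consistent with choosing the more general of the conflicting definitions. The sketch below reproduces only those two outcomes. The function names `resolve_type` and `resolve_num` are hypothetical, the `UNRESOLVED` marker for other type mismatches is an assumption, and the preprocessor's actual rules may cover more cases.

```shell
# Hypothetical helpers for illustration only; not part of Variant Transforms.

# Resolve two conflicting type definitions.
# Integer and Float widen to Float; any other mismatch is left for manual review.
resolve_type() {
  if [ "$1" = "$2" ]; then
    echo "$1"
    return
  fi
  case "$1,$2" in
    Integer,Float|Float,Integer) echo "Float" ;;
    *) echo "UNRESOLVED" ;;
  esac
}

# Resolve two conflicting num definitions.
# Differing values fall back to None (unknown), as in the GL row of the report.
resolve_num() {
  if [ "$1" = "$2" ]; then
    echo "$1"
  else
    echo "None"
  fi
}

echo "GL: num=$(resolve_num None 3) type=$(resolve_type Float Float)"
# prints GL: num=None type=Float
echo "GQ: num=$(resolve_num 1 1) type=$(resolve_type Float Integer)"
# prints GQ: num=1 type=Float
```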

Inferred Headers

ID | Category | Proposed Resolution
FT | FORMAT | num=1 type=String

No Malformed Records Found.