Troubleshooting the Variant Transforms Tool

This page describes issues you might encounter when loading VCF files using the Variant Transforms tool.

If you are unable to load VCF files after reading this document, you can get additional support in the google-genomics-discuss group. Alternatively, you can open an issue in the Variant Transforms tool GitHub repository.

The Variant Transforms tool is too slow

You can run the Variant Transforms tool, but it's performing slowly.

  • Increase the number of workers in the job using the --max_num_workers flag.
  • Change the --worker_machine_type flag to use a larger machine, such as n1-standard-32. See Predefined vCPUs and memory for more information on Compute Engine predefined machine types.
  • Ensure that you have sufficient Compute Engine resource quotas in the region or zone where you are running the Variant Transforms tool. If necessary, you can change the zone or region where your job is running, or you can request additional Compute Engine quota.

    To view your available Compute Engine quota, see Checking your quota.

  • If you are loading GZIP- or BZIP-compressed files, the tool might slow down because compressed files cannot be sharded, so each file must be read by a single process. To work around this, decompress the files before loading them. You can use the dsub tool to write a script that decompresses files in a scalable manner.

    This issue is more likely to occur if you are running the tool with a small number of large files. Typically, running the tool with a large number of small files is fine because each file can be read by a separate process.

For more information, see Handling large inputs.
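
The decompression step suggested above could look something like the following minimal Python sketch. It uses only the standard library and handles one local file at a time. The `decompress` helper and the `OPENERS` table are hypothetical names introduced for illustration and are not part of the Variant Transforms tool. The sketch assumes the files use the `.gz` or `.bz2` suffix.

```python
import bz2
import gzip
import shutil
from pathlib import Path

# Map compressed-file suffixes to the standard-library opener that reads them.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def decompress(path):
    """Decompress a .gz or .bz2 file next to the original and return the new path."""
    source = Path(path)
    opener = OPENERS.get(source.suffix)
    if opener is None:
        raise ValueError(f"not a gzip or bzip2 file: {source}")
    target = source.with_suffix("")  # for example, sample.vcf.gz -> sample.vcf
    with opener(source, "rb") as src, open(target, "wb") as dst:
        # Stream in chunks so that large files are never held fully in memory.
        shutil.copyfileobj(src, dst)
    return target
```

To decompress many files in parallel, a helper like this could be run once per file by a tool such as dsub, as described above.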

The pipeline crashes due to an out-of-disk error

You can run the Variant Transforms tool, but it crashes because it runs out of disk space.

  • Increase the amount of disk space allocated to each worker using the --disk_size_gb flag.
  • Increase the number of workers in the job using the --max_num_workers flag.
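
If you launch the tool from a script, the scaling flags mentioned on this page can be collected in one place. The following is a minimal sketch only. The `scaling_flags` helper is a hypothetical name, the values 20 and 200 are arbitrary illustrations rather than recommendations, and how the resulting list is passed to the tool depends on how you launch it.

```python
def scaling_flags(max_num_workers=None, worker_machine_type=None, disk_size_gb=None):
    """Return Variant Transforms scaling flags for the settings that were provided."""
    options = {
        "--max_num_workers": max_num_workers,
        "--worker_machine_type": worker_machine_type,
        "--disk_size_gb": disk_size_gb,
    }
    flags = []
    for name, value in options.items():
        if value is not None:  # skip settings the caller left unset
            flags.extend([name, str(value)])
    return flags


# Illustrative values only; choose sizes that fit your data and quota.
print(scaling_flags(max_num_workers=20, disk_size_gb=200))
```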

A JSON parsing or BigQuery field error occurs

You can run the Variant Transforms tool, but it stops with one of the following messages:

  • Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: FIELD_NAME
  • BigQuery schema has no such field

These error messages mean that the field is missing from the BigQuery schema. The most likely cause is that the definition of the field is missing from the header of a VCF file. To resolve the issue, try one of the following:

  • Edit the VCF file containing the field and add an entry in the header that contains the correct definition. For example, if the error occurred for the field AF, you would add the following to the VCF file:

    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

    You must provide a valid Type and Number for the field. If you're unsure of these values, you can use a generic list of strings by passing in Type=String and Number=. as placeholders, which will match any field. Also check whether the missing field is an INFO field or a FORMAT field, and add the definition as a matching ##INFO or ##FORMAT header line.

  • If you can't edit the file, run the Variant Transforms tool with the --representative_header_file FILE_PATH flag. In the file you pass in, provide a merged view of all headers in all files. You can add any missing fields to that file.

  • Alternatively, run the Variant Transforms tool with the --infer_headers flag. This causes the tool to make two passes over the data. In doing so, it infers the definitions for undefined and mismatched headers (in mismatched headers, the header field definition does not match the field value). When you use this flag, you do not need to edit the VCF files or provide a representative header file. However, the flag causes the tool to use roughly 30% more Compute Engine resources.
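
To find out which definitions are missing before editing a file or writing a representative header file, you could scan the VCF lines yourself. The following Python sketch is not part of the Variant Transforms tool, and `find_undefined_fields` is a hypothetical helper name. It assumes standard tab-separated VCF records with INFO in column 8 and FORMAT in column 9, and it emits the generic Type=String, Number=. placeholders described above. Replace them with accurate definitions where you know them.

```python
import re

# Captures the kind (INFO or FORMAT) and the ID of a header definition line.
HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def find_undefined_fields(lines):
    """Return placeholder header lines for fields that are used but never defined."""
    defined = {"INFO": set(), "FORMAT": set()}
    used = {"INFO": set(), "FORMAT": set()}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = HEADER_ID.match(line)
            if match:
                defined[match.group(1)].add(match.group(2))
        elif line and not line.startswith("#"):
            columns = line.split("\t")
            # Column 8 (INFO): semicolon-separated KEY=VALUE pairs or bare flags.
            if len(columns) > 7 and columns[7] != ".":
                for entry in columns[7].split(";"):
                    if entry:
                        used["INFO"].add(entry.split("=", 1)[0])
            # Column 9 (FORMAT): colon-separated keys.
            if len(columns) > 8 and columns[8] != ".":
                used["FORMAT"].update(key for key in columns[8].split(":") if key)
    placeholders = []
    for kind in ("INFO", "FORMAT"):
        for field in sorted(used[kind] - defined[kind]):
            placeholders.append(
                f'##{kind}=<ID={field},Number=.,Type=String,Description="Placeholder">'
            )
    return placeholders
```

For example, you could call the helper as `find_undefined_fields(open(path))` on a decompressed VCF file and then add the returned lines to that file's header or to the representative header file.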
