Storing and Loading Genomic Variants

This page describes how to use the Variant Transforms tool to transform and load VCF files directly into BigQuery for large-scale analysis.

If you are loading a large number of files, see Handling Large Inputs for recommendations on improving performance and reducing costs.

Loading and transforming VCF files into BigQuery

Before you begin

To run the tool, you need:

  • A Google Cloud Platform project with billing enabled
  • The gcloud command-line tool installed and authenticated
  • A Cloud Storage bucket containing your VCF files
  • A BigQuery dataset in which to create the output table

Running the tool

You can run the tool using a Docker image that has all of the necessary binaries and dependencies installed.

To run the tool using a Docker image, complete the following steps:

  1. Run the following script to start the tool. Substitute the variables with the relevant resources from your GCP project.

    #!/bin/bash
    # Parameters to replace:
    # The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
    GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
    INPUT_PATTERN=gs://BUCKET/*.vcf
    OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
    TEMP_LOCATION=gs://BUCKET/temp
    
    COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
        --project ${GOOGLE_CLOUD_PROJECT} \
        --input_pattern ${INPUT_PATTERN} \
        --output_table ${OUTPUT_TABLE} \
        --temp_location ${TEMP_LOCATION} \
        --job_name vcf-to-bigquery \
        --runner DataflowRunner"
    gcloud alpha genomics pipelines run \
        --project "${GOOGLE_CLOUD_PROJECT}" \
        --logging "${TEMP_LOCATION}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
        --zones us-west1-b \
        --service-account-scopes https://www.googleapis.com/auth/cloud-platform \
        --docker-image gcr.io/gcp-variant-transforms/gcp-variant-transforms \
        --command-line "${COMMAND}"
    

    When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. The tool accepts uncompressed VCF files as well as files compressed with gzip or bzip2. For more information, see Loading multiple files.

    Keep in mind that the tool runs more slowly for compressed files because compressed files cannot be sharded. If you want to merge samples across files, see Variant merging.

    Note that the TEMP_LOCATION directory is used to store temporary files needed to run the tool. It can be any directory in Cloud Storage to which you have write access.

  2. The command returns an operation ID in the format Running [projects/GOOGLE_CLOUD_PROJECT/operations/OPERATION_ID]. You can use the operation ID to track the status of the job by running the following command, which returns a message when the operation completes:

    gcloud alpha genomics operations wait OPERATION_ID
    

    Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.
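
    If you prefer to check on the job without blocking, you can describe the operation instead. This is a sketch; the exact output fields depend on the API version and the operation's state:

    gcloud alpha genomics operations describe OPERATION_ID \
        --format='value(done)'

    The command prints True once the operation has finished.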

    Because the tool uses Cloud Dataflow, you can navigate to the Cloud Dataflow Console to see a detailed view of the job. For example, you can view the number of records processed, the number of workers, and detailed error logs.

  3. After the job completes, run the following command to list all of the tables in your dataset. Check that the new table containing your VCF data is in the list:

    bq ls --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET
    

    You can also view details about the table, such as the schema and when it was last modified:

    bq show --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
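
    As a quick sanity check, you can also count the records loaded into the new table. This standard SQL query assumes only the table name used in the steps above:

    bq query --use_legacy_sql=false \
        'SELECT COUNT(1) AS record_count
         FROM `GOOGLE_CLOUD_PROJECT.BIGQUERY_DATASET.BIGQUERY_TABLE`'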
    

Setting zones and regions

Google Cloud Platform uses regions, subdivided into zones, to define the geographic location of physical computing resources.

You can run the Variant Transforms tool in any region or zone where Cloud Dataflow is supported.

To change the region where the tool runs, update both the COMMAND variable in the script and the Pipelines API gcloud command. For example, to restrict job processing to Europe, the script from the previous section would look like the following:

#!/bin/bash
# Parameters to replace:
# The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
TEMP_LOCATION=gs://BUCKET/temp

COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
    --project ${GOOGLE_CLOUD_PROJECT} \
    --input_pattern ${INPUT_PATTERN} \
    --output_table ${OUTPUT_TABLE} \
    --temp_location ${TEMP_LOCATION} \
    --job_name vcf-to-bigquery \
    --runner DataflowRunner \
    --region europe-west1 \
    --zone europe-west1-b"
gcloud alpha genomics pipelines run \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --logging "${TEMP_LOCATION}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
    --zones europe-west1-b \
    --service-account-scopes https://www.googleapis.com/auth/cloud-platform \
    --docker-image gcr.io/gcp-variant-transforms/gcp-variant-transforms \
    --command-line "${COMMAND}"

For information on how to set the location of a BigQuery dataset, see Creating a dataset.

Loading multiple files

You can specify which VCF files you want to load into BigQuery using the --input_pattern flag in the script above. For example, to load all VCF files in the my-bucket Cloud Storage bucket, set the flag to the following:

--input_pattern=gs://my-bucket/*.vcf
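
The pattern can also match a single file or a narrower subset of files. For example (the bucket layout and file names here are hypothetical):

--input_pattern=gs://my-bucket/single-sample.vcf
--input_pattern=gs://my-bucket/cohort/chr*.vcf
--input_pattern=gs://my-bucket/*.vcf.gz

The last pattern loads gzip-compressed files, which are accepted but load more slowly because they cannot be sharded.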

When loading multiple files with the Variant Transforms tool, the following operations occur:

  1. A merged BigQuery schema is created that contains the fields from all VCF files matching the --input_pattern flag. For example, the INFO and FORMAT fields shared between the VCF files are merged. This step assumes that fields defined in multiple files with the same key are compatible.

  2. Records from all of the VCF files are loaded into a single table. Any missing fields are set to null in their associated column.

You can also merge samples as a third step. For more information, see Variant merging.
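
To illustrate steps 1 and 2, suppose two hypothetical input files each define an INFO field that the other lacks:

file1.vcf:  ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">
file2.vcf:  ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">

The merged schema contains columns derived from both DP and AF. Records loaded from file1.vcf have null values in the AF column, and records loaded from file2.vcf have null values in the DP column.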

The field definitions and values across the VCF files being loaded must be consistent, or the tool fails. If configured to do so, the tool can attempt to fix these inconsistencies. For more information, see Handling malformed files.

Appending data to existing BigQuery tables

You can append data to an existing BigQuery table by adding the --append flag when running the Variant Transforms tool.

For best results when appending data, the schema of the appended data should match the schema of the existing table. If the appended data contains a column with the same name as a column in the existing table, the two columns must have the same data type and mode. Otherwise, the Variant Transforms tool returns an error.

You can append data that has a different schema than the existing table by adding the --update_schema_on_append flag in addition to the --append flag. Any new columns in the appended data are added to the existing schema, and existing rows have NULL values in those new columns. Similarly, if the existing schema has columns that the appended data lacks, the appended rows have NULL values in those columns.
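
For example, to append records while allowing the schema to grow, the COMMAND variable from the earlier script could include both flags (the job name here is illustrative):

COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
    --project ${GOOGLE_CLOUD_PROJECT} \
    --input_pattern ${INPUT_PATTERN} \
    --output_table ${OUTPUT_TABLE} \
    --temp_location ${TEMP_LOCATION} \
    --job_name vcf-to-bigquery-append \
    --runner DataflowRunner \
    --append \
    --update_schema_on_append"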

Handling malformed files

There are multiple options for dealing with malformed or incompatible files. Before loading VCF files, you can check for malformed and incompatible files using the VCF files preprocessor tool.
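
The following is a minimal sketch of running the preprocessor through the same Docker image, assuming the binary is named vcf_to_bq_preprocess and accepts the flags shown; check the tool's documentation for its exact interface:

# Assumed binary path and flags; verify against the tool's documentation.
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq_preprocess \
    --input_pattern ${INPUT_PATTERN} \
    --report_path gs://BUCKET/preprocess_report.tsv \
    --temp_location ${TEMP_LOCATION}"

The report written to the --report_path location summarizes the malformed records and incompatible header definitions found across the input files.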

Handling field incompatibility

When loading multiple VCF files, the Variant Transforms tool merges all of the INFO and FORMAT header fields to generate a "representative header." The representative header is then used to create the BigQuery schema. If the same key is defined in multiple files, its definition must be compatible across all of the files. The compatibility rules are listed below; an example of conflicting definitions follows the list:

  • Fields are compatible if they have the same values in their Number and Type fields. Annotation fields, which are specified using the --annotation_fields flag, must also have the same value in their Description field.
  • Fields that contain different Type values are compatible in the following cases:

    • If one field uses the Integer type and the other uses the Float type. The merged field uses the Float type.
    • If you run the Variant Transforms tool with the --allow_incompatible_records flag, which automatically resolves conflicts between incompatible fields, such as String and Integer. This ensures that incompatible types are not silently ignored.
  • Fields with different values in their Number field are compatible in the following cases:

    • If the values contain "repeated" numbers that are compatible with one another, such as:

      • Number=. (unknown number)
      • Any Number larger than 1
      • Number=G (one value per genotype) and Number=R (one value for each alternate and reference)
      • Number=A (one value for each alternate), only if the tool is run with --split_alternate_allele_info_fields set to False.
    • If you run the Variant Transforms tool with the --allow_incompatible_records flag, which automatically resolves conflicts between incompatible fields, such as Number=1 and Number=.. This ensures that incompatible types are not silently ignored.
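
For example, suppose two hypothetical files define the same INFO key with different types:

##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">   (file 1)
##INFO=<ID=DP,Number=1,Type=Float,Description="Read depth">     (file 2)

Under the rules above, these definitions are compatible and the merged field uses the Float type. If file 2 instead declared Type=String, the definitions would be incompatible and the run would fail unless you pass the --allow_incompatible_records flag.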

Specifying a headers file

When running the Variant Transforms tool, you can pass the --representative_header_file flag with a headers file that is used to generate the BigQuery schema. The file specifies the merged headers from all of the files being loaded.

The Variant Transforms tool only reads the header information from the file and ignores any VCF records. This means that the file can either contain just the header fields or it can be an actual VCF file.
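
For example, a minimal headers file might look like the following (the field definitions here are hypothetical):

##fileformat=VCFv4.3
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT

You would then pass it to the tool with --representative_header_file=gs://BUCKET/headers.vcf.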

Providing a headers file has the following benefits:

  • The pipeline will run faster, especially if you are loading large numbers of files. The Variant Transforms tool uses the headers file to generate the BigQuery schema and skips the step of merging headers across files. This is particularly useful if all of the files have the same VCF headers.

  • You can provide definitions for any missing header fields.

  • You can resolve incompatible field definitions across files.

Inferring headers

When running the Variant Transforms tool, you might have fields that don't have a definition or you might want the tool to ignore header definitions that are incompatible with field values. In such a case, you might want the tool to infer the correct header definitions for those fields.

You can pass the --infer_headers flag and the tool will infer Type and Number values for undefined fields. It infers the values based on the field values across all of the VCF files.

Passing this flag also outputs a representative header that contains both the inferred definitions and the definitions from the original headers.
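
For example, adding the flag to the COMMAND variable from the earlier script:

COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq \
    --project ${GOOGLE_CLOUD_PROJECT} \
    --input_pattern ${INPUT_PATTERN} \
    --output_table ${OUTPUT_TABLE} \
    --temp_location ${TEMP_LOCATION} \
    --infer_headers \
    --job_name vcf-to-bigquery \
    --runner DataflowRunner"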

Allowing incompatible records

The Variant Transforms tool fails in either of the following cases:

  • If there is inconsistency between a field definition and the field's values
  • If a field has two inconsistent definitions in two different VCF files

In either case, you can pass the --allow_incompatible_records flag. This causes the tool to resolve conflicts in header definitions automatically. The tool also casts field values to match the BigQuery schema if there is an inconsistency between a field's definition and its value (for example, an Integer field value is cast to String to match a field schema of type String).
