This page describes how to use the Variant Transforms tool to transform and load VCF files directly into BigQuery for large-scale analysis.
If you are loading a large number of files, see Handling large inputs for recommendations on improving performance and reducing costs.
Loading and transforming VCF files into BigQuery
Before you begin
To run the tool, you need:
A Google Cloud project with billing and the Cloud Life Sciences, Compute Engine, Cloud Storage, and Dataflow APIs enabled.
An existing BigQuery dataset and a Cloud Storage bucket.
An existing VCF file or a GZIP or BZIP file containing VCF files in a Cloud Storage bucket. If you don't have a VCF file, you can find one in the available public datasets.
Running the tool
You can run the tool using a Docker image that has all of the necessary binaries and dependencies installed.
To run the tool using a Docker image, complete the following steps:
Use the following command to get the latest version of Variant Transforms.
docker pull gcr.io/cloud-lifesciences/gcp-variant-transforms
Copy the following text and save it to a file named script.sh, substituting the variables with the relevant resources from your Google Cloud project.

#!/bin/bash
# Parameters to replace:
# The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
TEMP_LOCATION=gs://BUCKET/temp

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --temp_location ${TEMP_LOCATION} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --zones us-west1-b \
  "${COMMAND}"
When specifying the location of your VCF files in a Cloud Storage bucket, you can specify a single file or use a wildcard (*) to load multiple files at once. Acceptable file formats include GZIP, BZIP, and VCF. For more information, see Loading multiple files.

Keep in mind that the tool runs more slowly for compressed files because compressed files cannot be sharded. If you want to merge samples across files, see Variant merging.
Note that the TEMP_LOCATION directory is used to store temporary files needed to run the tool. It can be any directory in Cloud Storage to which you have write access.

Run the following command to make script.sh executable:

chmod +x script.sh

Run script.sh:

./script.sh
Depending on several factors, such as the size of your data, it can take anywhere from several minutes to an hour or more for the job to complete.
Because the tool uses Dataflow, you can navigate to the Dataflow Console to see a detailed view of the job. For example, you can view the number of records processed, the number of workers, and detailed error logs.
After the job completes, run the following command to list all of the tables in your dataset. Check that the new table containing your VCF data is in the list:
bq ls --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET
You can also view details about the table, such as the schema and when it was last modified:
bq show --format=pretty GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
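Once the table is in place, you can query it. The following is a minimal sketch, assuming the default variants schema generated by the tool (which includes a reference_name column) and hypothetical dataset and table names; it builds a per-chromosome row count query that you could pass to the bq CLI:

```shell
# Hypothetical dataset and table names; replace with your own resources.
BIGQUERY_DATASET=my_dataset
BIGQUERY_TABLE=my_vcf_table

# Count loaded rows per chromosome. The reference_name column is part of
# the variants schema that the Variant Transforms tool generates.
QUERY="SELECT reference_name, COUNT(*) AS variant_count
FROM \`${BIGQUERY_DATASET}.${BIGQUERY_TABLE}\`
GROUP BY reference_name
ORDER BY variant_count DESC"

# Pass the query to the bq CLI, for example:
#   bq query --use_legacy_sql=false "${QUERY}"
echo "${QUERY}"
```

A quick sanity check like this is useful for confirming that every input file was picked up by the wildcard pattern before running heavier analyses.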
Setting zones and regions
Google Cloud uses regions, subdivided into zones, to define the geographic location of physical computing resources.
You can run the Variant Transforms tool in any region or zone where Dataflow is supported.
To change the region where the tool runs, update both the COMMAND variable and the --zones flag passed to the Docker image. For example, to restrict job processing to Europe, the Docker image script from the previous section would look like the following:
#!/bin/bash
# Parameters to replace:
# The GOOGLE_CLOUD_PROJECT is the project that contains your BigQuery dataset.
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
TEMP_LOCATION=gs://BUCKET/temp

COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --temp_location ${TEMP_LOCATION} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner \
  --region europe-west1 \
  --zone europe-west1-b"

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --zones europe-west1-b \
  "${COMMAND}"
For information on how to set the location of a BigQuery dataset, see Creating a dataset.
Loading multiple files
You can specify which VCF files you want to load into BigQuery using the --input_pattern flag in the script above. For example, to load all VCF files in the my-bucket Cloud Storage bucket, set the flag to the following:

--input_pattern=gs://my-bucket/*.vcf
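The pattern can point at a single object or match many. A short sketch of patterns you might use, with a hypothetical my-bucket bucket and paths:

```shell
# Hypothetical bucket and paths; replace with your own resources.

# A single uncompressed VCF file:
SINGLE_FILE_PATTERN=gs://my-bucket/sample.vcf

# All VCF files directly under a path:
WILDCARD_PATTERN=gs://my-bucket/vcfs/*.vcf

# Compressed inputs are also accepted, but they cannot be sharded,
# so they load more slowly:
COMPRESSED_PATTERN=gs://my-bucket/vcfs/*.vcf.gz

echo "--input_pattern=${WILDCARD_PATTERN}"
```

Note that the pattern is passed through to the tool as a string; the shell does not expand the wildcard locally.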
When loading multiple files with the Variant Transforms tool, the following operations occur:
A merged BigQuery schema is created that contains data from all matching VCF files listed in the --input_pattern flag. For example, the INFO and FORMAT fields shared between the VCF files are merged. This step assumes that fields defined in multiple files with the same key are compatible.

Records from all of the VCF files are loaded into a single table. Any missing fields are set to null in their associated column.
You can also merge samples as a third step. For more information, see Variant merging.
When loading the VCF files, their field definitions and values must be consistent or the tool fails. The tool can attempt to fix these inconsistencies if configured to do so. For more information, see Handling malformed files.
Appending data to existing BigQuery tables
You can append data to an existing BigQuery table by adding the --append flag when running the Variant Transforms tool.
For best results when appending data, the schema used for the appended data should be the same as the schema of the existing table. If the appended data's schema contains a column with the same name as a column in the existing table, then both columns must have the same data type and mode. Otherwise, the Variant Transforms tool returns an error.
You can append data that has a different schema than the existing table by adding the --update_schema_on_append flag in addition to the --append flag. Any new columns from the appended data are added to the existing schema, and the values of existing rows in those new columns are set to NULL. Similarly, if the existing schema has columns that the appended data does not, then the values of those columns in the appended rows are also set to NULL.
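As a sketch of how the earlier script changes, the following extends the COMMAND variable with both flags; the docker run invocation is unchanged, and the bucket, project, dataset, and table names here are hypothetical:

```shell
# Hypothetical resources; replace with your own.
INPUT_PATTERN=gs://my-bucket/new/*.vcf
OUTPUT_TABLE=my-project:my_dataset.my_vcf_table
TEMP_LOCATION=gs://my-bucket/temp

# --append loads into the existing table instead of creating a new one;
# --update_schema_on_append additionally allows the appended data to add
# new columns to the existing schema.
COMMAND="vcf_to_bq \
  --input_pattern ${INPUT_PATTERN} \
  --output_table ${OUTPUT_TABLE} \
  --temp_location ${TEMP_LOCATION} \
  --job_name vcf-to-bigquery-append \
  --runner DataflowRunner \
  --append \
  --update_schema_on_append"

echo "${COMMAND}"
```

Omit --update_schema_on_append if you want the tool to fail on schema mismatches rather than widen the table.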
Handling malformed files
There are multiple options for dealing with malformed or incompatible files. Before loading VCF files, you can check for malformed and incompatible files using the VCF files preprocessor tool.
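A sketch of a preprocessor run, assuming the tool's vcf_to_bq_preprocess command and hypothetical bucket paths (check the preprocessor documentation for the exact flags available in your version):

```shell
# Hypothetical paths; replace with your own resources.
INPUT_PATTERN=gs://my-bucket/vcfs/*.vcf
REPORT_PATH=gs://my-bucket/reports/report.tsv
RESOLVED_HEADERS_PATH=gs://my-bucket/reports/resolved_headers.vcf

# The preprocessor scans the inputs and writes a report of malformed
# records and incompatible header definitions without loading anything
# into BigQuery.
COMMAND="vcf_to_bq_preprocess \
  --input_pattern ${INPUT_PATTERN} \
  --report_path ${REPORT_PATH} \
  --resolved_headers_path ${RESOLVED_HEADERS_PATH} \
  --report_all_conflicts True"

echo "${COMMAND}"
```

This COMMAND string is passed to the same Docker image as in the earlier scripts.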
Handling field incompatibility
When loading multiple VCF files, the Variant Transforms tool merges all of the INFO and FORMAT fields to generate a "representative header." The representative header is then used to create the BigQuery schema. If the same key is defined in multiple files, its definition must be compatible across all of the files.
The compatibility rules are:
- Fields are compatible if they have the same values in their Number and Type fields. Annotation fields, which are specified using the --annotation_fields flag, must also have the same value in their Description field.
- Fields that contain different Type values are compatible in the following cases:
  - Integer and Float fields are compatible; the merged field uses the Float type.
  - If you run the Variant Transforms tool with the --allow_incompatible_records flag, which automatically resolves conflicts between incompatible fields, such as String and Integer. This ensures that incompatible types are not silently ignored.
- Fields with different values in their Number field are compatible in the following cases:
  - If the values contain "repeated" numbers that are compatible with one another, such as:
    - Number=. (unknown number)
    - Any Number larger than 1
    - Number=G (one value per genotype) and Number=R (one value for each alternate and reference)
    - Number=A (one value for each alternate), only if the tool is run with --split_alternate_allele_info_fields set to False
  - If you run the Variant Transforms tool with the --allow_incompatible_records flag, which automatically resolves conflicts between incompatible fields, such as Number=1 and Number=.. This ensures that incompatible values are not silently ignored.
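For example, here is a sketch of two hypothetical header definitions of the same INFO key, DP, that the tool can merge under these rules:

```shell
# Two hypothetical header lines for the same key, DP, from different files.
HEADER_A='##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'
HEADER_B='##INFO=<ID=DP,Number=1,Type=Float,Description="Total Depth">'

# Integer and Float are compatible Type values, so the representative
# header resolves the key to the Float type:
MERGED='##INFO=<ID=DP,Number=1,Type=Float,Description="Total Depth">'

echo "${MERGED}"
```

If HEADER_B had declared Type=String instead, the definitions would only be reconcilable with the --allow_incompatible_records flag.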
Specifying a headers file
When running the Variant Transforms tool, you can pass the --representative_header_file flag with a headers file that is used to generate the BigQuery schema. The file specifies the merged headers from all of the files being loaded.
The Variant Transforms tool only reads the header information from the file and ignores any VCF records. This means that the file can either contain just the header fields or it can be an actual VCF file.
Providing a headers file has the following benefits:
The pipeline will run faster, especially if you are loading large numbers of files. The Variant Transforms tool uses the headers file to generate the BigQuery schema and skips the step of merging headers across files. This is particularly useful if all of the files have the same VCF headers.
You can provide definitions for any missing header fields.
You can resolve incompatible field definitions across files.
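As a sketch, the COMMAND variable from the earlier script could pass the flag as follows; the headers file path, bucket, project, dataset, and table names are hypothetical:

```shell
# Hypothetical path to a headers-only file (or a full VCF whose records
# are ignored); replace with your own resources.
REPRESENTATIVE_HEADER_FILE=gs://my-bucket/headers.vcf

COMMAND="vcf_to_bq \
  --input_pattern gs://my-bucket/vcfs/*.vcf \
  --output_table my-project:my_dataset.my_vcf_table \
  --temp_location gs://my-bucket/temp \
  --representative_header_file ${REPRESENTATIVE_HEADER_FILE} \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

echo "${COMMAND}"
```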
Inferring headers
When running the Variant Transforms tool, you might have fields that don't have a definition or you might want the tool to ignore header definitions that are incompatible with field values. In such a case, you might want the tool to infer the correct header definitions for those fields.
You can pass the --infer_headers flag and the tool will infer Type and Number values for undefined fields. It infers the values based on the field values across all of the VCF files. Passing this flag also outputs a representative header that contains both the inferred definitions and the definitions from the original headers.
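A sketch of the flag in context, again using hypothetical bucket, project, dataset, and table names in the COMMAND variable from the earlier script:

```shell
# Hypothetical resources; replace with your own.
# --infer_headers tells the tool to infer Type and Number values for
# fields that have no header definition.
COMMAND="vcf_to_bq \
  --input_pattern gs://my-bucket/vcfs/*.vcf \
  --output_table my-project:my_dataset.my_vcf_table \
  --temp_location gs://my-bucket/temp \
  --infer_headers \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner"

echo "${COMMAND}"
```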
Allowing incompatible records
The Variant Transforms tool fails in both of the following cases:
- If there is inconsistency between a field definition and the field's values
- If a field has two inconsistent definitions in two different VCF files
In both cases, you can pass the --allow_incompatible_records flag. This causes the tool to resolve conflicts in header definitions automatically. The tool also casts field values to match the BigQuery schema if there is an inconsistency between a field's definition and its value (for example, an Integer field value is cast to String to match a field schema of type String).
Next steps
- Learn how to run the Variant Transforms preprocessor tool to validate VCF files.
- Read through Analyzing variants using BigQuery to analyze the data you've loaded.
- Learn about the BigQuery variants schema.
- Learn about uploading large local files in parallel by using a parallel composite upload.