Understanding the BigQuery variants schema

This page explains the structure of the schema generated in BigQuery when you use the Variant Transforms pipeline to transform and load VCF files directly into BigQuery.

Nomenclature

To understand the export formats and schemas, familiarize yourself with the following terms.

Genomics nomenclature

Sample
DNA collected and processed under a single identifier. A sample typically involves a single individual organism, but can also be a heterogeneous sample such as a cheek swab.
Reference name
The name of a reference segment of DNA. The reference name is typically a chromosome, but might be other named regions from a reference genome.
Variant
A region of the genome that has been identified as differing from the reference genome. A variant must have a reference name, start position, end position, and one or more reference bases.
Non-variant segment

A region of the genome that matches the reference genome. The non-variant segment is also called a "reference segment". Typically, genomic data does not include non-variant segments with variants.

For more on non-variant segments, see the gVCF documentation.

Call

An identified occurrence of a variant or non-variant segment for an individual sample. A call represents the determination of genotype regarding a particular variant. The call might include associated information such as quality and phasing.

INFO fields

Optional fields added to variant and call information. For example, all calls have a genotype field, but not all datasets have a "Genotype Quality" (GQ) field. The genotype field is a fixed part of the VariantCall schema, but it has no GQ field. You can import the GQ field and value as key/value pairs into the VariantCall info field.

For more genomic nomenclature, see the following documents:

BigQuery terms

Familiarize yourself with the following BigQuery terms:

Simple fields
Simple data elements in a BigQuery table, such as numbers and strings.
Nested fields
Complex data elements in a BigQuery table. A nested field can contain multiple fields, both simple and nested.
Repeated fields
Fields in a BigQuery table that can have multiple values, like a list. Repeated fields can be both simple and nested.

For more about BigQuery's complex data types, see the following:

Variants table structure

When you run the Variant Transforms pipeline, you specify the name of the BigQuery table. By convention, the name of the table is variants.

Variants table record structure

The top-level records of the variants table can be both variants and non-variant segments. Each variants table record contains one or more calls.

The following table illustrates the variants table record structure. The table shows the following variant records:

  • Variant1
  • Variant2

The BigQuery dataset contains the following samples:

  • Sample1
  • Sample2
  • Sample3

Variant1 has been called for Sample1 and Sample2. Variant2 has been called for Sample1 and Sample3.

Variant1 Sample1

Sample2
Variant2 Sample1

Sample3
... ...

Variant table field structure

Every variants table includes both a fixed set of fields and a variable set of fields. The structure of the table at a high level looks like the following:

     
Fixed Variant Fields Call Variant INFO Fields
... Fixed Call Fields          Call INFO Fields ...

The names of the variable fields are the following:

  • Variant Resource INFO field keys
  • VariantCall INFO field keys

Variant table fixed fields

The fixed record-level (variant) fields are the following:

  • reference_name
  • start_position
  • end_position
  • reference_bases
  • alternate_bases
  • names
  • quality
  • filter

The fixed call-level (VariantCall) fields are the following:

  • name
  • genotype
  • phaseset

Example schema

When the 1000 Genomes data was loaded into BigQuery, it included importing ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.

The VCF file includes various variant-level and call-level INFO fields, as described in the following INFO and FORMAT header directives:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype likelihoods">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Per-sample genotype filter">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
 [[trimmed]]
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
 [[trimmed]]

When the file was loaded into BigQuery the following occurred:

  • Fields marked as INFO, such as AC and AF, were added as INFO fields to the variant resources.
  • Fields marked as FORMAT, such as the GP and GQ fields, were added as INFO fields to the VariantCall resources.
  • The FORMAT field GT was not added as an INFO field. The value was converted into the genotype field.

Viewing the schema in BigQuery shows the following fixed fields:

Schema      
reference_name STRING NULLABLE Reference name.
start_position INTEGER NULLABLE Start position (0-based). Corresponds to the first base of the string of reference bases.
end_position INTEGER NULLABLE End position (0-based). Corresponds to the first base after the last base in the reference allele.
reference_bases STRING NULLABLE Reference bases.
alternate_bases RECORD REPEATED One record for each alternate base (if any). See Additional alternate_bases record information.
alternate_bases.alt STRING NULLABLE Alternate base.
names STRING REPEATED Variant names (for example, RefSNP ID).
quality FLOAT NULLABLE Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.
filter STRING REPEATED List of failed filters (if any) or "PASS" indicating the variant has passed all filters.
call RECORD REPEATED One record for each call.

Viewing the schema shows the following variable fields (the INFO fields):

Schema      
call.name STRING NULLABLE Name of the call.
call.genotype INTEGER REPEATED Genotype of the call. "-1" is used in cases where the genotype is not called.
call.phaseset STRING NULLABLE Phaseset of the call (if any). "*" is used in cases where the genotype is phased but no phase set ("PS" in FORMAT) was specified.

Additional alternate_bases record information

The alternate_bases record contains any INFO field with Number=A. The record simplifies querying by removing the need to map each field with its corresponding alternate record.

To use the previous BigQuery schema version, where Number=A fields are independent of alternate bases, complete the following steps:

  1. Run the Variant Transforms tool, and pass the --split_alternate_allele_info_fields flag
  2. Set the value of the --split_alternate_allele_info_fields flag to False

Automatically split records

In the following cases, the Variant Transforms tool automatically splits a record into multiple rows where each row is less than 100 MB:

  • A record has a large number of calls.
  • The large number of calls results in a BigQuery row that is bigger than 100 MB.

Automatic record splitting is necessary because of the BigQuery 100 MB per row limit.

Null values

If a float or integer repeated field contains a null value, then the Variant Transforms tool cannot create the BigQuery schema. BigQuery does not allow null values in repeated fields. The entire record can be null, but values within the record must be non-null.

For example:

  1. Suppose that a VCF file's INFO field has the values 1,.,2. The Variant Transforms tool cannot load 1,null,2 into BigQuery.
  2. A numeric replacement must be used for the null value. By default, the replacement value is -2^31, or -2147483648.

To set a custom numeric value, pass the --null_numeric_value_replacement flag with a value when running the Variant Transforms tool.

Alternatively, you can convert null values to a string and use . as the value. When doing so, the header must be specified as String.