This page explains the structure of the schema generated in BigQuery when you use the Variant Transforms tool. Read this page if you're using the Variant Transforms tool to transform and load VCF files directly into BigQuery and want to understand how the export table is structured.
Nomenclature
To understand the export formats and schemas, familiarize yourself with the following terms.
Genomics nomenclature
- Sample
- DNA collected and processed under a single identifier. A sample typically involves a single individual organism, but can also be a heterogeneous sample such as a cheek swab.
- Reference name
- The name of a reference segment of DNA. The reference name is typically a chromosome, but might be other named regions from a reference genome.
- Variant
A region of the genome that has been identified as differing from the reference genome. A variant must have the following properties:
- Reference name
- Start position
- End position
- One or more reference bases
- Non-variant segment
A region of the genome that matches the reference genome. The non-variant segment is also called a "reference segment." Typically, genomic data doesn't include non-variant segments with variants.
For more information on non-variant segments, see About gVCF.
- Call
An identified occurrence of a variant or non-variant segment for an individual sample. A call represents the determination of genotype regarding a particular variant. The call might include associated information such as quality and phasing.
- INFO fields
Optional fields added to variant and call information. For example, all calls have a
genotype
field, but not all datasets have a "Genotype Quality" (GQ
) field. Thegenotype
field is a fixed part of theVariantCall
schema, but it doesn't have aGQ
field. You can import theGQ
field and value as key-value pairs into theVariantCall
info
field.
For more genomic nomenclature, see the following:
BigQuery terms
Familiarize yourself with the following BigQuery terms:
- Simple fields
- Simple data elements in a BigQuery table, such as numbers and strings.
- Nested fields
- Complex data elements in a BigQuery table. A nested field can contain multiple fields, both simple and nested.
- Repeated fields
- Fields in a BigQuery table that can have multiple values, such as a list. Repeated fields can be both simple and nested.
For more about BigQuery's complex data types, see the following:
- BigQuery nested and repeated fields in standard SQL
- BigQuery explained: Working with joins, nested & repeated data
- Looker documentation on nesting
Variants table structure
When you run the Variant Transforms tool, you specify the name of the
BigQuery table. Typically, the name of the table is variants
.
Variants table record structure
The top-level records of the variants
table can be both variants and
non-variant segments. Each variants
table record contains one or more
calls.
The following table illustrates the variants table record structure. The table shows the following variant records:
Variant1
Variant2
The BigQuery dataset contains the following samples:
Sample1
Sample2
Sample3
In the following table, you can observe these results:
Variant1
has been called forSample1
andSample2
.Variant2
has been called forSample1
andSample3
.
Variant1 | Sample1 Sample2 |
Variant2 | Sample1 Sample3 |
... | ... |
Variant table field structure
Every variants
table includes both a fixed set of fields and a variable set of
fields. The structure of the table looks like the following:
Fixed Variant Fields | Call | Variant INFO Fields |
... | Fixed Call Fields Call INFO Fields | ... |
Variant table fixed fields
The fixed record-level (variant) fields are the following:
reference_name
start_position
end_position
reference_bases
alternate_bases
names
quality
filter
The fixed call-level (VariantCall
) fields are the following:
name
genotype
phaseset
Example schema
When the 1000 Genomes data was loaded into BigQuery, it included importing ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.
The VCF file includes various variant-level and call-level INFO
fields, as
described in the following INFO
and FORMAT
header directives:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype likelihoods"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=FT,Number=1,Type=String,Description="Per-sample genotype filter"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> [[trimmed]] ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data"> [[trimmed]]
When the file was loaded into BigQuery the following occurred:
- Fields marked as
INFO
, such asAC
andAF
, were added asINFO
fields to the variant resources. - Fields marked as
FORMAT
, such as theGP
andGQ
fields, were added asINFO
fields to theVariantCall
resources. - The FORMAT field
GT
was not added as an INFO field. The value was converted into thegenotype
field.
Viewing the schema in BigQuery shows the following fixed fields:
Field name | Type | Mode | Description |
---|---|---|---|
reference_name | STRING | NULLABLE | Reference name. |
start_position | INTEGER | NULLABLE | Start position (0-based). Corresponds to the first base of the string of reference bases. |
end_position | INTEGER | NULLABLE | End position (0-based). Corresponds to the first base after the last base in the reference allele. |
reference_bases | STRING | NULLABLE | Reference bases. |
alternate_bases | RECORD | REPEATED | One record for each alternate base (if any). See Additional alternate_bases record information. |
alternate_bases.alt | STRING | NULLABLE | Alternate base. |
names | STRING | REPEATED | Variant names (for example, RefSNP ID). |
quality | FLOAT | NULLABLE | Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality. |
filter | STRING | REPEATED | List of failed filters (if any) or "PASS" indicating the variant has passed all filters. |
call | RECORD | REPEATED | One record for each call. |
Viewing the schema shows the following variable fields (the INFO
fields):
Field name | Type | Mode | Description |
---|---|---|---|
call.name | STRING | NULLABLE | Name of the call. |
call.genotype | INTEGER | REPEATED | Genotype of the call. "-1" is used in cases where the genotype is not called. |
call.phaseset | STRING | NULLABLE | Phaseset of the call (if any). "*" is used in cases where the genotype is phased but no phase set ("PS" in FORMAT) was specified. |
Additional alternate_bases
record information
The alternate_bases
record contains any INFO
field with Number=A
. The record
simplifies querying by removing the need to map each field with its
corresponding alternate record.
To use the previous BigQuery
schema version, where Number=A
fields are independent of alternate bases,
complete the following steps:
- Run the Variant Transforms tool, and pass the
--split_alternate_allele_info_fields
flag. - Set the value of the
--split_alternate_allele_info_fields
flag toFalse
.
Automatically split records
In the following cases, the Variant Transforms tool automatically splits a record into multiple rows where each row is less than 100 MB:
- A record has a large number of calls.
- The large number of calls results in a BigQuery row that is bigger than 100 MB.
Automatic record splitting is necessary because of the BigQuery 100 MB per row limit.
Null values
If a float or integer repeated field contains a null value, then the Variant Transforms tool cannot create the BigQuery schema. BigQuery doesn't allow null values in repeated fields. The entire record can be null, but values within the record must be non-null.
For example:
- Suppose that a VCF file's
INFO
field has the values1,.,2
. The Variant Transforms tool cannot load1,null,2
into BigQuery. - A numeric replacement must be used for the null value. By default,
the replacement value is
-2^31
, or-2147483648
.
To set a custom numeric value, pass the --null_numeric_value_replacement
flag
with a value when running the Variant Transforms tool.
Alternatively, you can convert null values to a string and use .
as the
value. When doing so, the header must be specified as String
.