The Variant Transforms pipeline provides the ability to transform and load VCF files directly into BigQuery.
You can also use BigQuery to run ad-hoc interactive queries over genomic variants using hundreds or thousands of computers in parallel.
You can also browse existing published datasets already exported from Cloud Life Sciences to BigQuery.
Nomenclature
To understand the export formats and schemas, you'll want to be familiar with the terminology we use.
Genomics nomenclature
- Sample
- DNA collected and processed under a single identifier. A sample typically involves a single individual organism, but can also be a heterogeneous sample such as a cheek swab.
- Reference Name
- The name of a reference segment of DNA, this is typically a chromosome, but may be other named regions from a reference genome.
- Variant
- A region of the genome that has been identified as differing from the reference genome. A variant must have a reference name, start position, end position, and one or more reference bases.
- Non-variant segment
A region of the genome that matches the reference genome. This is sometimes referred to as a "reference segment". Traditionally, genomic data has not included non-variant segments with variants.
For more on non-variant segments see the gVCF documentation.
- Call
An identified occurrence of a variant or non-variant segment for an individual sample. It represents the determination of genotype with respect to a particular variant and may include associated information such as quality and phasing.
- INFO fields
Optional fields added to Variant and Call information. For example, while all Calls will have a
genotype
field, not all datasets will have a "Genotype Quality" (GQ
) field. Thus thegenotype
field is a fixed part of the VariantCall schema, but there is noGQ
field. TheGQ
field and value can be imported as key/value pairs into the VariantCallinfo
field.
For more genomic nomenclature, see the following documents:
BigQuery nomenclature
- Simple fields
- Simple data elements in a BigQuery table, such as numbers and strings.
- Nested fields
- Complex data elements in a BigQuery table. A nested field can contain multiple fields, both simple and nested.
- Repeated fields
- Fields in a BigQuery table that can have multiple values, like a list. Repeated fields can be both simple and nested.
For more about BigQuery's complex data types, see the documentation on nested and repeated fields in standard SQL.
Variants table structure
The name of the BigQuery table is specified when running the
Variant Transforms pipeline. By convention the name of this table is variants
.
Variants table record structure
The top-level records of the variants
table can be both variants and
non-variant segments. Each variants
table record will contain one or more
Calls.
The table below illustrates this structure, showing two variant records, Variant1 and Variant2. In this dataset, there are three samples, Sample1, Sample2, and Sample3. Variant1 has been called for Sample1 and Sample2, while Variant2 has been called for Sample1 and Sample3.
Variant1 | Sample1 Sample2 |
Variant2 | Sample1 Sample3 |
... | ... |
Variant table field structure
Every variants
table includes both a fixed set of fields and a variable set of
fields. The structure of the table at a high level looks like:
Fixed Variant Fields | Call | Variant INFO Fields |
... | Fixed Call Fields Call INFO Fields | ... |
The names of the variable fields are the Variant Resource INFO field keys and the VariantCall INFO field keys respectively.
Variant table fixed fields
The fixed record-level (Variant) fields are:
- reference_name
- start_position
- end_position
- reference_bases
- alternate_bases
- names
- quality
- filter
The fixed call-level (VariantCall) fields are:
- name
- genotype
- phaseset
Example
When the 1000 Genomes data was loaded into BigQuery, it included importing ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.
This VCF file includes a number of variant-level and call-level INFO fields, as described in the INFO and FORMAT header directives:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype likelihoods"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=FT,Number=1,Type=String,Description="Per-sample genotype filter"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> [[trimmed]] ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral allele"> ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data"> [[trimmed]]
When this file was loaded into BigQuery, the fields marked as
INFO, such as AC
and AF
, were added as INFO fields to the Variant resources.
When this file was loaded into BigQuery, the fields marked as
FORMAT, such as the GP
and GQ
fields, were added as INFO fields to the
VariantCall resources. The FORMAT field GT
was not added as an INFO field,
but instead the value was converted into the genotype
field.
Viewing the schema in the BigQuery UI shows the following fixed fields:
Schema | |||
---|---|---|---|
reference_name | STRING | NULLABLE | Reference name. |
start_position | INTEGER | NULLABLE | Start position (0-based). Corresponds to the first base of the string of reference bases. |
end_position | INTEGER | NULLABLE | End position (0-based). Corresponds to the first base after the last base in the reference allele. |
reference_bases | STRING | NULLABLE | Reference bases. |
alternate_bases | RECORD | REPEATED | One record for each alternate base (if any). See Additional alternate_bases record information. |
alternate_bases.alt | STRING | NULLABLE | Alternate base. |
names | STRING | REPEATED | Variant names (for example, RefSNP ID). |
quality | FLOAT | NULLABLE | Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality. |
filter | STRING | REPEATED | List of failed filters (if any) or "PASS" indicating the variant has passed all filters. |
call | RECORD | REPEATED | One record for each call. |
and then the variable fields (the INFO fields):
Schema | |||
---|---|---|---|
call.name | STRING | NULLABLE | Name of the call. |
call.genotype | INTEGER | REPEATED | Genotype of the call. "-1" is used in cases where the genotype is not called. |
call.phaseset | STRING | NULLABLE | Phaseset of the call (if any). "*" is used in cases where the genotype is phased, but no phase set ("PS" in FORMAT) was specified. |
Additional alternate_bases
record information
The alternate_bases
record contains any INFO
field with Number=A
. This
simplifies querying by removing the need to map each field with its
corresponding alternate record. To use the previous BigQuery
schema version, where Number=A
fields are independent of alternate bases,
then pass the --split_alternate_allele_info_fields
flag and set it to False
when running the Variant Transforms tool.
Automatically split records
If a record has a large number of calls, and if this results in a BigQuery row that is bigger than 100 MB, then the Variant Transforms tool will automatically split the record into multiple rows so that each row is less than 100 MB. This is required due to the BigQuery 100 MB per row limit.
Null values
If a float or integer repeated field contains a null value, then the BigQuery schema cannot be created. This is because BigQuery does not allow null values in repeated fields. The entire record can be null, but values within the record must be non-null.
For example:
Suppose that a VCF file's INFO field has the values
1,.,2
. The Variant Transforms tool cannot load1,null,2
into BigQuery.Instead, a numeric replacement must be used for the null value. By default, the replacement value is
-2^31
, or-2147483648
.
To set a custom numeric value, pass the --null_numeric_value_replacement
flag
with a value when running the Variant Transforms tool.
Alternatively, you can convert null values to a string and use .
as the
value. When doing so, the header must be specified as String
.