This page shows how to flatten a BigQuery table that uses the BigQuery variants schema.
Flattening a BigQuery table removes repeated rows. Flattening can be useful in the following situations:
- When working with third party tools, such as those used for data visualization, that require a table with a flattened structure.
- If you are querying multiple independently
repeated fields or calculating the cross product of such fields, and you
encounter the error message
Cannot query the cross product of repeated fields ...
.
Flattening a table
Suppose that you have a table in BigQuery that contains the
following row. The row contains two alternate bases, C
and T
(in the alternate_bases.alt
column), and two calls, NA12890
and NA12878
(in the call.name
column).
Row | reference_name | start_position | end_position | reference_bases | alternate_bases.alt | alternate_bases.AC | quality | filter | call.name | call.genotype |
---|---|---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | C | 1 | 577.59 | LowMQ | NA12890 | 1 |
T | 1 | 2 | ||||||||
NA12878 | 1 | |||||||||
2 |
Run the following query to flatten the table on the repeated call record
and return one row for each call.
Note the use of the comma (,
) operator, which is a
shorthand notation used for JOIN
.
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call
The above query uses a fully qualified path in t.call
, but you can instead
use the UNNEST
operator in place of t.call
, like so:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`, UNNEST(call) AS call
In both cases, the queries return one row for each call:
Row | reference_name | start_position | end_position | reference_bases | call_name |
---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 |
2 | chrY | 9909316 | 9909317 | A | NA12878 |
Flattening additional fields
You can include additional information for each call by adding fields to the
SELECT
clause. For example, the following query adds the call genotypes
(as an array of integers):
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, call.genotype FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | call_name | genotype |
---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 |
2 | ||||||
2 | chrY | 9909316 | 9909317 | A | NA12878 | 1 |
2 |
To flatten the BigQuery table further and return one genotype
per row, add another JOIN
with the call.genotype
field:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | call_name | genotype |
---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 |
Note that, in the output, each call_name
is duplicated because each call
contains two genotype
values.
The following query shows what happens when you add an independently repeated
record (alternate_bases
) to the SELECT
clause:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype, alternate_bases FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | call_name | genotype | alternate_bases.alt | alternate_bases.AC |
---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C | 1 |
T | 1 | |||||||
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 | C | 1 |
T | 1 | |||||||
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C | 1 |
T | 1 | |||||||
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 | C | 1 |
T | 1 |
Although the genotype
column was flattened, the output contains
both of the alternate_bases
columns. To return the particular alternate base
that matches the index of the genotype
column, use the ORDINAL
function, as in the following query:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype, IF(genotype > 0, alternate_bases[ORDINAL(genotype)], NULL) AS alternate_bases FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | call_name | genotype | alternate_bases.alt | alternate_bases.AC |
---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C | 1 |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 | T | 1 |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C | 1 |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 | T | 1 |
The semantics of the genotype
column in the SELECT
clause changed
because each row contains only a single alternate allele.
The next query shows how to
reformat the genotype
column using IF(genotype > 0, 1, genotype) AS alt_genotype
which results in the genotype value being set as follows:
1
if it matches the alternate base.0
if it matches the reference base.-1
if it is not called. Note that the Variant Transforms tool uses-1
to denote genotypes that are not called (such as.
in the VCF file).
The following query shows how to include only the alternate_bases.alt
column
by flattening the alternate_bases
record and using the index as a filtering
criteria:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, IF(genotype > 0, 1, genotype) AS alt_genotype, IF(genotype > 0, alts.alt, NULL) AS alt FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype LEFT JOIN t.alternate_bases AS alts WITH OFFSET AS a_index WHERE genotype IN (a_index + 1, 0, -1)
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | call_name | alt_genotype | alt |
---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | T |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | T |
The query uses a LEFT JOIN
to include any rows that do not have
an alternate base. To include only records that have at least one alternate
base, you can do one of the following:
- Use an
INNER JOIN
. - Include
t.alternate_bases AS alts WITH OFFSET as a_index
in theFROM
clause.
Additionally, the query returns extra rows for the genotype value 0
if
there is a multi-allelic site (meaning multiple alternate bases). This occurs
because the LEFT JOIN
does a CROSS JOIN
between each genotype and the alternate bases.
It then filters by a_index
. As a result, while non-0
genotypes are filtered
by a_index + 1
, the 0
value genotypes will be returned for all combinations
where there is an alternate_base
and a genotype with the value 0
.
Example query
The following query is a complete example of how to flatten a BigQuery variants table so that it does not contain any repeated records. The genotype value is set as follows:
1
if it matches the alternate base.0
if it matches the reference base.-1
if it is not set.
The query is performed on the following table:
Row | reference_name | start_position | end_position | reference_bases | alternate_bases.alt | alternate_bases.AC | quality | filter | call.name | call.genotype |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr9 | 139825375 | 139825376 | T | 120.24 | LowGQX | NA12878 | 0 | ||
PASS | NA12877 | -1 | ||||||||
NA12890 | -1 | |||||||||
NA12889 | -1 | |||||||||
NA12891 | 0 | |||||||||
0 | ||||||||||
2 | chr4 | 4070556 | 4070558 | A | null | LowGQX | NA12878 | 0 | ||
LowMQ | 0 | |||||||||
MaxDepth | ||||||||||
3 | chrX | 61702219 | 61702220 | T | C | 1 | 5.72 | LowGQX | NA12877 | 0 |
LowMQ | 1 | |||||||||
LowQD | ||||||||||
TruthSensitivityTranche99.90to100.00 | ||||||||||
4 | chrY | 9909316 | 9909317 | A | C | 1 | 577.59 | LowMQ | NA12890 | 1 |
T | 1 | 2 | ||||||||
NA1878 | 1 | |||||||||
2 |
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, IF(genotype > 0, alts.alt, NULL) AS alt, t.quality, ARRAY_TO_STRING(t.filter, ' ') AS filter, call.name AS call_name, IF(genotype > 0, 1, genotype) AS alt_genotype, call.phaseset FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype LEFT JOIN t.alternate_bases AS alts WITH OFFSET AS a_index WHERE genotype IN (a_index + 1, 0, -1)
Running the query returns:
Row | reference_name | start_position | end_position | reference_bases | alt | quality | filter | call_name | alt_genotype | phaseset |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12878 | 0 | null |
2 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12877 | -1 | null |
3 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12890 | -1 | null |
4 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12889 | -1 | null |
5 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12891 | 0 | null |
6 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12891 | 0 | null |
7 | chr4 | 4070556 | 4070558 | A | null | null | LowGQX LowMQ MaxDepth | NA12878 | 0 | null |
8 | chr4 | 4070556 | 4070558 | A | null | null | LowGQX LowMQ MaxDepth | NA12878 | 0 | null |
9 | chrX | 61702219 | 61702220 | T | null | 5.72 | LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 | NA12877 | 0 | null |
10 | chrX | 61702219 | 61702220 | T | C | 5.72 | LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 | NA12877 | 1 | null |
11 | chrY | 9909316 | 9909317 | A | C | 577.59 | LowMQ | NA12890 | 1 | null |
12 | chrY | 9909316 | 9909317 | A | T | 577.59 | LowMQ | NA12890 | 1 | null |
13 | chrY | 9909316 | 9909317 | A | C | 577.59 | LowMQ | NA12878 | 1 | null |
14 | chrY | 9909316 | 9909317 | A | T | 577.59 | LowMQ | NA12878 | 1 | null |
For other repeated fields, you can do either of the following:
- Concatenate them as a single field (using a function like
ARRAY_TO_STRING
). - Add them to the
FROM
orLEFT JOIN
clause to flatten them.
To materialize the result of the query into a new table, see Creating a table from a query result.