Flattening the BigQuery variants table

This page shows how to flatten a BigQuery table that uses the BigQuery variants schema.

Flattening a BigQuery table removes repeated rows. Flattening can be useful in the following situations:

  • When working with third party tools, such as those used for data visualization, that require a table with a flattened structure.
  • If you are querying multiple independently repeated fields or calculating the cross product of such fields, and you encounter the error message Cannot query the cross product of repeated fields ....

Flattening a table

Suppose that you have a table in BigQuery that contains the following row. The row contains two alternate bases, C and T (in the alternate_bases.alt column), and two calls, NA12890 and NA12878 (in the call.name column).

Row reference_name start_position end_position reference_bases alternate_bases.alt alternate_bases.AC quality filter call.name call.genotype
1 chrY 9909316 9909317 A C 1 577.59 LowMQ NA12890 1
T 1 2
NA12878 1
2

Run the following query to flatten the table on the repeated call record and return one row for each call. Note the use of the comma (,) operator, which is a shorthand notation used for JOIN.

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call

The above query uses a fully qualified path in t.call, but you can instead use the UNNEST operator in place of t.call, like so:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`,
  UNNEST(call) AS call

In both cases, the queries return one row for each call:

Row reference_name start_position end_position reference_bases call_name
1 chrY 9909316 9909317 A NA12890
2 chrY 9909316 9909317 A NA12878

Flattening additional fields

You can include additional information for each call by adding fields to the SELECT clause. For example, the following query adds the call genotypes (as an array of integers):

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, call.genotype
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call

Running the query returns:

Row reference_name start_position end_position reference_bases call_name genotype
1 chrY 9909316 9909317 A NA12890 1
2
2 chrY 9909316 9909317 A NA12878 1
2

To flatten the BigQuery table further and return one genotype per row, add another JOIN with the call.genotype field:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

Running the query returns:

Row reference_name start_position end_position reference_bases call_name genotype
1 chrY 9909316 9909317 A NA12890 1
2 chrY 9909316 9909317 A NA12890 2
3 chrY 9909316 9909317 A NA12878 1
4 chrY 9909316 9909317 A NA12878 2

Note that, in the output, each call_name is duplicated because each call contains two genotype values.

The following query shows what happens when you add an independently repeated record (alternate_bases) to the SELECT clause:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype, alternate_bases
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

Running the query returns:

Row reference_name start_position end_position reference_bases call_name genotype alternate_bases.alt alternate_bases.AC
1 chrY 9909316 9909317 A NA12890 1 C 1
T 1
2 chrY 9909316 9909317 A NA12890 2 C 1
T 1
3 chrY 9909316 9909317 A NA12878 1 C 1
T 1
4 chrY 9909316 9909317 A NA12878 2 C 1
T 1

Although the genotype column was flattened, the output contains both of the alternate_bases columns. To return the particular alternate base that matches the index of the genotype column, use the ORDINAL function, as in the following query:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype,
  IF(genotype > 0, alternate_bases[ORDINAL(genotype)], NULL) AS alternate_bases
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

Running the query returns:

Row reference_name start_position end_position reference_bases call_name genotype alternate_bases.alt alternate_bases.AC
1 chrY 9909316 9909317 A NA12890 1 C 1
2 chrY 9909316 9909317 A NA12890 2 T 1
3 chrY 9909316 9909317 A NA12878 1 C 1
4 chrY 9909316 9909317 A NA12878 2 T 1

The semantics of the genotype column in the SELECT clause changed because each row contains only a single alternate allele.

The next query shows how to reformat the genotype column using IF(genotype > 0, 1, genotype) AS alt_genotype which results in the genotype value being set as follows:

  • 1 if it matches the alternate base.
  • 0 if it matches the reference base.
  • -1 if it is not called. Note that the Variant Transforms tool uses -1 to denote genotypes that are not called (such as . in the VCF file).

The following query shows how to include only the alternate_bases.alt column by flattening the alternate_bases record and using the index as a filtering criteria:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  IF(genotype > 0, alts.alt, NULL) AS alt
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)

Running the query returns:

Row reference_name start_position end_position reference_bases call_name alt_genotype alt
1 chrY 9909316 9909317 A NA12890 1 C
2 chrY 9909316 9909317 A NA12890 1 T
3 chrY 9909316 9909317 A NA12878 1 C
4 chrY 9909316 9909317 A NA12878 1 T

The query uses a LEFT JOIN to include any rows that do not have an alternate base. To include only records that have at least one alternate base, you can do one of the following:

  • Use an INNER JOIN.
  • Include t.alternate_bases AS alts WITH OFFSET as a_index in the FROM clause.

Additionally, the query returns extra rows for the genotype value 0 if there is a multi-allelic site (meaning multiple alternate bases). This occurs because the LEFT JOIN does a CROSS JOIN between each genotype and the alternate bases. It then filters by a_index. As a result, while non-0 genotypes are filtered by a_index + 1, the 0 value genotypes will be returned for all combinations where there is an alternate_base and a genotype with the value 0.

Example query

The following query is a complete example of how to flatten a BigQuery variants table so that it does not contain any repeated records. The genotype value is set as follows:

  • 1 if it matches the alternate base.
  • 0 if it matches the reference base.
  • -1 if it is not set.

The query is performed on the following table:

Row reference_name start_position end_position reference_bases alternate_bases.alt alternate_bases.AC quality filter call.name call.genotype
1 chr9 139825375 139825376 T 120.24 LowGQX NA12878 0
PASS NA12877 -1
NA12890 -1
NA12889 -1
NA12891 0
0
2 chr4 4070556 4070558 A null LowGQX NA12878 0
LowMQ 0
MaxDepth
3 chrX 61702219 61702220 T C 1 5.72 LowGQX NA12877 0
LowMQ 1
LowQD
TruthSensitivityTranche99.90to100.00
4 chrY 9909316 9909317 A C 1 577.59 LowMQ NA12890 1
T 1 2
NA1878 1
2
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  IF(genotype > 0, alts.alt, NULL) AS alt,
  t.quality,
  ARRAY_TO_STRING(t.filter, ' ') AS filter,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  call.phaseset
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)

Running the query returns:

Row reference_name start_position end_position reference_bases alt quality filter call_name alt_genotype phaseset
1 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12878 0 null
2 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12877 -1 null
3 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12890 -1 null
4 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12889 -1 null
5 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12891 0 null
6 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12891 0 null
7 chr4 4070556 4070558 A null null LowGQX LowMQ MaxDepth NA12878 0 null
8 chr4 4070556 4070558 A null null LowGQX LowMQ MaxDepth NA12878 0 null
9 chrX 61702219 61702220 T null 5.72 LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 NA12877 0 null
10 chrX 61702219 61702220 T C 5.72 LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 NA12877 1 null
11 chrY 9909316 9909317 A C 577.59 LowMQ NA12890 1 null
12 chrY 9909316 9909317 A T 577.59 LowMQ NA12890 1 null
13 chrY 9909316 9909317 A C 577.59 LowMQ NA12878 1 null
14 chrY 9909316 9909317 A T 577.59 LowMQ NA12878 1 null

For other repeated fields, you can do either of the following:

  • Concatenate them as a single field (using a function like ARRAY_TO_STRING).
  • Add them to the FROM or LEFT JOIN clause to flatten them.

To materialize the result of the query into a new table, see Creating a table from a query result.

このページは役立ちましたか?評価をお願いいたします。

フィードバックを送信...