本页面介绍了如何展平使用 BigQuery 变体架构的 BigQuery 表。
展平 BigQuery 表会去除重复的行。 展平非常适合用于以下情况:
- 使用要求所访问的表采用展平结构的第三方工具(例如用于实现数据可视化的工具)时。
- 查询多个独立的重复字段或计算此类字段的叉积时,遇到错误消息
Cannot query the cross product of repeated fields ...
。
展平表
假设您的某个 BigQuery 表包含下面一行。该行包含两个交替碱基 C
和 T
(位于 alternate_bases.alt
列中),以及两个检出 NA12890
和 NA12878
(位于 call.name
列中)。
行 | reference_name | start_position | end_position | reference_bases | alternate_bases.alt | alternate_bases.AC | quality | 过滤器 | call.name | call.genotype |
---|---|---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | C | 1 | 577.59 | LowMQ | NA12890 | 1 |
T | 1 | 2 | ||||||||
NA12878 | 1 | |||||||||
2 |
运行下面的查询展平表中重复的检出记录,并为每个检出返回一行。
请注意,此处使用了英文逗号 (,
) 运算符,它是 JOIN
的简写表示法。
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call
上面的查询在 t.call
中使用了一个完全限定的路径,但您可以使用 UNNEST
运算符代替 t.call
,如下所示:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`, UNNEST(call) AS call
在这两种情况下,查询都会为每个检出返回一行:
行 | reference_name | start_position | end_position | reference_bases | call_name |
---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 |
2 | chrY | 9909316 | 9909317 | A | NA12878 |
展平其他字段
您可以通过向 SELECT
子句添加字段来为每个检出返回更多信息。例如,下面的查询添加了检出基因型(以整数数组的形式):
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, call.genotype FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | call_name | genotype |
---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 |
2 | ||||||
2 | chrY | 9909316 | 9909317 | A | NA12878 | 1 |
2 |
要进一步展平 BigQuery 表并每行返回一个基因型,请使用 call.genotype
字段添加另一个 JOIN
:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | call_name | genotype |
---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 |
请注意,在输出结果中,每个 call_name
都是重复的,因为每个检出包含两个 genotype
值。
下面的查询显示了将独立的重复记录 (alternate_bases
) 添加到 SELECT
子句会产生什么结果:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype, alternate_bases FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | call_name | genotype | alternate_bases.alt | alternate_bases.AC |
---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C | 1 |
T | 1 | |||||||
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 | C | 1 |
T | 1 | |||||||
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C | 1 |
T | 1 | |||||||
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 | C | 1 |
T | 1 |
尽管 genotype
列已展平,但这两个 alternate_bases
列都包含在输出中。要返回与 genotype
列的索引匹配的特定交替碱基,请使用 ORDINAL
函数,如下面的查询中所示:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, genotype, IF(genotype > 0, alternate_bases[ORDINAL(genotype)], NULL) AS alternate_bases FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | call_name | genotype | alternate_bases.alt | alternate_bases.AC |
---|---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C | 1 |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 2 | T | 1 |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C | 1 |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 2 | T | 1 |
SELECT
子句中 genotype
列的语义已更改,因为每行仅包含一个交替等位基因。
下一个查询显示如何使用 IF(genotype > 0, 1, genotype) AS alt_genotype
重新设置 genotype
列的格式,基因型的值将设置如下:
- 如果与交替碱基匹配,则为
1
。 - 如果与参考碱基匹配,则为
0
。 - 如果未被检出,则为
-1
。请注意,Variant Transforms 工具使用-1
表示未检出的基因型(例如 VCF 文件中的.
)。
以下查询展示了如何通过展平 alternate_bases
记录并使用索作为过滤条件来使查询结果中仅包含 alternate_bases.alt
列:
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, call.name AS call_name, IF(genotype > 0, 1, genotype) AS alt_genotype, IF(genotype > 0, alts.alt, NULL) AS alt FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype LEFT JOIN t.alternate_bases AS alts WITH OFFSET AS a_index WHERE genotype IN (a_index + 1, 0, -1)
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | call_name | alt_genotype | alt |
---|---|---|---|---|---|---|---|
1 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | C |
2 | chrY | 9909316 | 9909317 | A | NA12890 | 1 | T |
3 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | C |
4 | chrY | 9909316 | 9909317 | A | NA12878 | 1 | T |
该查询使用 LEFT JOIN
包含任何没有交替碱基的行。若要仅包含至少具有一个交替碱基的记录,可以执行以下操作之一:
- 使用
INNER JOIN
。 - 请在
FROM
子句中添加t.alternate_bases AS alts WITH OFFSET as a_index
。
此外,如果存在多等位基因(意味着多个交替碱基),查询会为基因型值 0
返回额外的行。这是因为 LEFT JOIN
在每个基因型与交替碱基之间执行了 CROSS JOIN
,然后按 a_index
进行过滤。因此,a_index + 1
会过滤掉非 0
基因型,而对于所有 alternate_base
与值为 0
的基因型的组合,将返回 0
值基因型。
查询示例
下面的查询是一个完整的示例,展示如何展平 BigQuery 变体表以使其不包含任何重复记录。基因型值设置如下:
- 如果与交替碱基匹配,则为
1
。 - 如果与参考碱基匹配,则为
0
。 - 如果未设置,则为
-1
。
查询针对下表执行:
行 | reference_name | start_position | end_position | reference_bases | alternate_bases.alt | alternate_bases.AC | quality | 过滤器 | call.name | call.genotype |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr9 | 139825375 | 139825376 | T | 120.24 | LowGQX | NA12878 | 0 | ||
PASS | NA12877 | -1 | ||||||||
NA12890 | -1 | |||||||||
NA12889 | -1 | |||||||||
NA12891 | 0 | |||||||||
0 | ||||||||||
2 | chr4 | 4070556 | 4070558 | A | null | LowGQX | NA12878 | 0 | ||
LowMQ | 0 | |||||||||
MAXDEPTH | ||||||||||
3 | chrX | 61702219 | 61702220 | T | C | 1 | 5.72 | LowGQX | NA12877 | 0 |
LowMQ | 1 | |||||||||
LowQD | ||||||||||
TruthSensitivityTranche99.90to100.00 | ||||||||||
4 | chrY | 9909316 | 9909317 | A | C | 1 | 577.59 | LowMQ | NA12890 | 1 |
T | 1 | 2 | ||||||||
NA1878 | 1 | |||||||||
2 |
#standardSQL SELECT reference_name, start_position, end_position, reference_bases, IF(genotype > 0, alts.alt, NULL) AS alt, t.quality, ARRAY_TO_STRING(t.filter, ' ') AS filter, call.name AS call_name, IF(genotype > 0, 1, genotype) AS alt_genotype, call.phaseset FROM `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t, t.call AS call, call.genotype AS genotype LEFT JOIN t.alternate_bases AS alts WITH OFFSET AS a_index WHERE genotype IN (a_index + 1, 0, -1)
运行查询将返回以下内容:
行 | reference_name | start_position | end_position | reference_bases | alt | quality | 过滤器 | call_name | alt_genotype | phaseset |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12878 | 0 | null |
2 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12877 | -1 | null |
3 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12890 | -1 | null |
4 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12889 | -1 | null |
5 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12891 | 0 | null |
6 | chr9 | 139825375 | 139825376 | T | null | 120.24 | LowGQX PASS | NA12891 | 0 | null |
7 | chr4 | 4070556 | 4070558 | A | null | null | LowGQX LowMQ MaxDepth | NA12878 | 0 | null |
8 | chr4 | 4070556 | 4070558 | A | null | null | LowGQX LowMQ MaxDepth | NA12878 | 0 | null |
9 | chrX | 61702219 | 61702220 | T | null | 5.72 | LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 | NA12877 | 0 | null |
10 | chrX | 61702219 | 61702220 | T | C | 5.72 | LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 | NA12877 | 1 | null |
11 | chrY | 9909316 | 9909317 | A | C | 577.59 | LowMQ | NA12890 | 1 | null |
12 | chrY | 9909316 | 9909317 | A | T | 577.59 | LowMQ | NA12890 | 1 | null |
13 | chrY | 9909316 | 9909317 | A | C | 577.59 | LowMQ | NA12878 | 1 | null |
14 | chrY | 9909316 | 9909317 | A | T | 577.59 | LowMQ | NA12878 | 1 | null |
对于其他重复字段,您可以执行以下任一操作:
- 将它们连接为单个字段(使用
ARRAY_TO_STRING
等函数)。 - 将它们添加到
FROM
或LEFT JOIN
子句以展平这些字段。
要将查询结果具体化为新表,请参阅基于查询结果创建表。