展平 BigQuery 变异片段表

本页面介绍了如何展平使用 BigQuery 变体架构的 BigQuery 表。

展平 BigQuery 表会去除重复的行。 展平非常适合用于以下情况:

  • 使用要求所访问的表采用展平结构的第三方工具(例如用于实现数据可视化的工具)时。
  • 查询多个独立的重复字段或计算此类字段的叉积时,遇到错误消息 Cannot query the cross product of repeated fields ...

展平表

假设您的某个 BigQuery 表包含下面一行。该行包含两个交替碱基 CT(位于 alternate_bases.alt 列中),以及两个检出 NA12890NA12878(位于 call.name 列中)。

reference_name start_position end_position reference_bases alternate_bases.alt alternate_bases.AC quality 过滤器 call.name call.genotype
1 chrY 9909316 9909317 A C 1 577.59 LowMQ NA12890 1
T 1 2
NA12878 1
2

运行下面的查询展平表中重复的检出记录,并为每个检出返回一行。 请注意,此处使用了英文逗号 (,) 运算符,它是 JOIN 的简写表示法。

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call

上面的查询在 t.call 中使用了一个完全限定的路径,但您可以使用 UNNEST 运算符代替 t.call,如下所示:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE`,
  UNNEST(call) AS call

在这两种情况下,查询都会为每个检出返回一行:

reference_name start_position end_position reference_bases call_name
1 chrY 9909316 9909317 A NA12890
2 chrY 9909316 9909317 A NA12878

展平其他字段

您可以通过向 SELECT 子句添加字段来为每个检出返回更多信息。例如,下面的查询添加了检出基因型(以整数数组的形式):

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, call.genotype
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call

运行查询将返回以下内容:

reference_name start_position end_position reference_bases call_name genotype
1 chrY 9909316 9909317 A NA12890 1
2
2 chrY 9909316 9909317 A NA12878 1
2

要进一步展平 BigQuery 表并每行返回一个基因型,请使用 call.genotype 字段添加另一个 JOIN

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

运行查询将返回以下内容:

reference_name start_position end_position reference_bases call_name genotype
1 chrY 9909316 9909317 A NA12890 1
2 chrY 9909316 9909317 A NA12890 2
3 chrY 9909316 9909317 A NA12878 1
4 chrY 9909316 9909317 A NA12878 2

请注意,在输出结果中,每个 call_name 都是重复的,因为每个检出包含两个 genotype 值。

下面的查询显示了将独立的重复记录 (alternate_bases) 添加到 SELECT 子句会产生什么结果:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype, alternate_bases
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

运行查询将返回以下内容:

reference_name start_position end_position reference_bases call_name genotype alternate_bases.alt alternate_bases.AC
1 chrY 9909316 9909317 A NA12890 1 C 1
T 1
2 chrY 9909316 9909317 A NA12890 2 C 1
T 1
3 chrY 9909316 9909317 A NA12878 1 C 1
T 1
4 chrY 9909316 9909317 A NA12878 2 C 1
T 1

尽管 genotype 列已展平,但这两个 alternate_bases 列都包含在输出中。要返回与 genotype 列的索引匹配的特定交替碱基,请使用 ORDINAL 函数,如下面的查询中所示:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype,
  IF(genotype > 0, alternate_bases[ORDINAL(genotype)], NULL) AS alternate_bases
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype

运行查询将返回以下内容:

reference_name start_position end_position reference_bases call_name genotype alternate_bases.alt alternate_bases.AC
1 chrY 9909316 9909317 A NA12890 1 C 1
2 chrY 9909316 9909317 A NA12890 2 T 1
3 chrY 9909316 9909317 A NA12878 1 C 1
4 chrY 9909316 9909317 A NA12878 2 T 1

SELECT 子句中 genotype 列的语义已更改,因为每行仅包含一个交替等位基因。

下一个查询显示如何使用 IF(genotype > 0, 1, genotype) AS alt_genotype 重新设置 genotype 列的格式,基因型的值将设置如下:

  • 如果与交替碱基匹配,则为 1
  • 如果与参考碱基匹配,则为 0
  • 如果未被检出,则为 -1。请注意,Variant Transforms 工具使用 -1 表示未检出的基因型(例如 VCF 文件中的 .)。

以下查询展示了如何通过展平 alternate_bases 记录并使用索作为过滤条件来使查询结果中仅包含 alternate_bases.alt 列:

#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  IF(genotype > 0, alts.alt, NULL) AS alt
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)

运行查询将返回以下内容:

reference_name start_position end_position reference_bases call_name alt_genotype alt
1 chrY 9909316 9909317 A NA12890 1 C
2 chrY 9909316 9909317 A NA12890 1 T
3 chrY 9909316 9909317 A NA12878 1 C
4 chrY 9909316 9909317 A NA12878 1 T

该查询使用 LEFT JOIN 包含任何没有交替碱基的行。若要仅包含至少具有一个交替碱基的记录,可以执行以下操作之一:

  • 使用 INNER JOIN
  • 请在 FROM 子句中添加 t.alternate_bases AS alts WITH OFFSET as a_index

此外,如果存在多等位基因(意味着多个交替碱基),查询会为基因型值 0 返回额外的行。这是因为 LEFT JOIN 在每个基因型与交替碱基之间执行了 CROSS JOIN,然后按 a_index 进行过滤。因此,a_index + 1 会过滤掉非 0 基因型,而对于所有 alternate_base 与值为 0 的基因型的组合,将返回 0 值基因型。

查询示例

下面的查询是一个完整的示例,展示如何展平 BigQuery 变体表以使其不包含任何重复记录。基因型值设置如下:

  • 如果与交替碱基匹配,则为 1
  • 如果与参考碱基匹配,则为 0
  • 如果未设置,则为 -1

查询针对下表执行:

reference_name start_position end_position reference_bases alternate_bases.alt alternate_bases.AC quality 过滤器 call.name call.genotype
1 chr9 139825375 139825376 T 120.24 LowGQX NA12878 0
PASS NA12877 -1
NA12890 -1
NA12889 -1
NA12891 0
0
2 chr4 4070556 4070558 A null LowGQX NA12878 0
LowMQ 0
MAXDEPTH
3 chrX 61702219 61702220 T C 1 5.72 LowGQX NA12877 0
LowMQ 1
LowQD
TruthSensitivityTranche99.90to100.00
4 chrY 9909316 9909317 A C 1 577.59 LowMQ NA12890 1
T 1 2
NA1878 1
2
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  IF(genotype > 0, alts.alt, NULL) AS alt,
  t.quality,
  ARRAY_TO_STRING(t.filter, ' ') AS filter,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  call.phaseset
FROM
  `PROJECT_ID.BIGQUERY_DATASET.BIGQUERY_TABLE` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)

运行查询将返回以下内容:

reference_name start_position end_position reference_bases alt quality 过滤器 call_name alt_genotype phaseset
1 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12878 0 null
2 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12877 -1 null
3 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12890 -1 null
4 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12889 -1 null
5 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12891 0 null
6 chr9 139825375 139825376 T null 120.24 LowGQX PASS NA12891 0 null
7 chr4 4070556 4070558 A null null LowGQX LowMQ MaxDepth NA12878 0 null
8 chr4 4070556 4070558 A null null LowGQX LowMQ MaxDepth NA12878 0 null
9 chrX 61702219 61702220 T null 5.72 LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 NA12877 0 null
10 chrX 61702219 61702220 T C 5.72 LowGQX LowMQ LowQD TruthSensitivityTranche99.90to100.00 NA12877 1 null
11 chrY 9909316 9909317 A C 577.59 LowMQ NA12890 1 null
12 chrY 9909316 9909317 A T 577.59 LowMQ NA12890 1 null
13 chrY 9909316 9909317 A C 577.59 LowMQ NA12878 1 null
14 chrY 9909316 9909317 A T 577.59 LowMQ NA12878 1 null

对于其他重复字段,您可以执行以下任一操作:

  • 将它们连接为单个字段(使用 ARRAY_TO_STRING 等函数)。
  • 将它们添加到 FROMLEFT JOIN 子句以展平这些字段。

要将查询结果具体化为新表,请参阅基于查询结果创建表