Use nested and repeated fields
BigQuery can be used with many different data modelling methods, and generally provides high performance across many data model methodologies. To further tune a data model for performance, one method you might consider is data denormalization, which means adding columns of data to a single table to reduce or remove table joins.
Best practice: Use nested and repeated fields to denormalize data storage and increase query performance.
Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships.
The storage savings from using normalized data has less of an effect in modern systems. Increases in storage costs are worth the performance gains of using denormalized data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots, so that execution can be done in parallel.
To maintain relationships while denormalizing your data, you can use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated
fields might require you to group the data by a field like order_id
(when there is a one-to-many relationship). Because of the shuffling involved,
grouping the data is less effective than denormalizing the data by using
nested and repeated fields.
In some circumstances, denormalizing your data and using nested and repeated fields doesn't result in increased performance. For example, star schemas are typically optimized schemas for analytics, and as a result, performance might not be significantly different if you attempt to denormalize further.
Using nested and repeated fields
BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.
Nesting data (
STRUCT
)- Nesting data lets you represent foreign entities inline.
- Querying nested data uses "dot" syntax to reference leaf fields, which is similar to the syntax using a join.
- Nested data is represented as a
STRUCT
type in GoogleSQL.
Repeated data (
ARRAY
)- Creating a field of type
RECORD
with the mode set toREPEATED
lets you preserve a one-to-many relationship inline (so long as the relationship isn't high cardinality). - With repeated data, shuffling is not necessary.
- Repeated data is represented as an
ARRAY
. You can use anARRAY
function in GoogleSQL when you query the repeated data.
- Creating a field of type
Nested and repeated data (
ARRAY
ofSTRUCT
s)- Nesting and repetition complement each other.
- For example, in a table of transaction records, you could include an array
of line item
STRUCT
s.
For more information, see Specify nested and repeated columns in table schemas.
For more information about denormalizing data, see Denormalization.
Example
Consider an Orders
table with a row for each line item sold:
Order_Id | Item_Name |
---|---|
001 | A1 |
001 | B1 |
002 | A1 |
002 | C1 |
If you wanted to analyze data from this table, you would need to use a
GROUP BY
clause, similar to the following:
SELECT COUNT (Item_Name) FROM Orders GROUP BY Order_Id;
The GROUP BY
clause involves additional computation overhead, but this can be
avoided by nesting repeated data. You can avoid using a GROUP BY
clause
by creating a table with one order per row, where the order line items are in a
nested field:
Order_Id | Item_Name |
---|---|
001 |
A1 B1 |
002 |
A1 C1 |
In BigQuery, you typically specify a nested schema as an ARRAY
of STRUCT
objects. You use the
UNNEST
operator
to flatten the nested data,
as shown in the following query:
SELECT * FROM UNNEST( [ STRUCT('001' AS Order_Id, ['A1', 'B1'] AS Item_Name), STRUCT('002' AS Order_Id, ['A1', 'C1'] AS Item_Name) ] );
This query yields results similar to the following:
If this data wasn't nested, you could potentially have several rows for
each order, one for each item sold in that order, which would result in a
large table and an expensive GROUP BY
operation.
Exercise
You can see the performance difference in queries that use nested fields as compared to those that don't by following the steps in this section.
Create a table based on the
bigquery-public-data.stackoverflow.comments
public dataset:CREATE OR REPLACE TABLE `PROJECT.DATASET.stackoverflow` AS ( SELECT user_id, post_id, creation_date FROM `bigquery-public-data.stackoverflow.comments` );
Using the
stackoverflow
table, run the following query to see the earliest comment for each user:SELECT user_id, ARRAY_AGG(STRUCT(post_id, creation_date AS earliest_comment) ORDER BY creation_date ASC LIMIT 1)[OFFSET(0)].* FROM `PROJECT.DATASET.stackoverflow` GROUP BY user_id ORDER BY user_id ASC;
This query takes about 25 seconds to run and processes 1.88 GB of data.
Create a second table with identical data that creates a
comments
field using aSTRUCT
type to store thepost_id
andcreation_date
data, instead of two individual fields:CREATE OR REPLACE TABLE `PROJECT.DATASET.stackoverflow_nested` AS ( SELECT user_id, ARRAY_AGG(STRUCT(post_id, creation_date) ORDER BY creation_date ASC) AS comments FROM `bigquery-public-data.stackoverflow.comments` GROUP BY user_id );
Using the
stackoverflow_nested
table, run the following query to see the earliest comment for each user:SELECT user_id, (SELECT AS STRUCT post_id, creation_date as earliest_comment FROM UNNEST(comments) ORDER BY creation_date ASC LIMIT 1).* FROM `PROJECT.DATASET.stackoverflow_nested` ORDER BY user_id ASC;
This query takes about 10 seconds to run and processes 1.28 GB of data.
Delete the
stackoverflow
andstackoverflow_nested
tables when you are finished with them.