Use nested and repeated fields
Best practice: Use nested and repeated fields to denormalize data storage and increase query performance.
Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships.
The storage savings from using normalized data has less of an effect in modern systems. Increases in storage costs are worth the performance gains of using denormalized data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots, so that execution can be done in parallel.
To maintain relationships while denormalizing your data, you can use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated
fields might require you to group the data by a field like
(when there is a one-to-many relationship). Because of the shuffling involved,
grouping the data is less effective than denormalizing the data by using
nested and repeated fields.
In some circumstances, denormalizing your data and using nested and repeated fields doesn't result in increased performance. Avoid denormalization in these use cases:
- You have a star schema with frequently changing dimensions.
- BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can't replace it.
Using nested and repeated fields
BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.
Nesting data (
- Nesting data lets you represent foreign entities inline.
- Querying nested data uses "dot" syntax to reference leaf fields, which is similar to the syntax using a join.
- Nested data is represented as a
STRUCTtype in standard SQL.
Repeated data (
- Creating a field of type
RECORDwith the mode set to
REPEATEDlets you preserve a one-to-many relationship inline (so long as the relationship isn't high cardinality).
- With repeated data, shuffling is not necessary.
- Repeated data is represented as an
ARRAY. You can use an
ARRAYfunction in standard SQL when you query the repeated data.
- Creating a field of type
Nested and repeated data (
- Nesting and repetition complement each other.
- For example, in a table of transaction records, you could include an array
of line item
For more information, see Specify nested and repeated columns in table schemas.
For a detailed example of how to denormalize data, see Denormalization.
Orders table with a row for each line item sold:
If you wanted to analyze data from this table, you would need to use a
GROUP BY clause, similar to the following:
SELECT COUNT (Item_Name) from Orders GROUP BY Order_Id
GROUP BY clause involves additional computation overhead, but this can be
avoided by nesting repeated data. You can avoid using a
GROUP BY clause
by creating a table with one order per row, where the order line items are in a
SELECT * from UNNEST( [ STRUCT('001' as Order_Id, ['A1', 'B1'] as Item_Name) , STRUCT('002' as Order_Id, ['A1', 'C1'] as Item_Name) ] )
This query yields results similar to the following:
If this data wasn't nested, you could potentially have several rows for
each order, one for each item sold in that order, which would result in a
large table and an expensive
GROUP BY operation.
You can see the performance difference in queries that use nested fields as compared to those that don't by following the steps in this section.
Create a table based on the
CREATE OR REPLACE TABLE `PROJECT.DATASET.stackoverflow` AS ( SELECT user_id, post_id, creation_date FROM `bigquery-public-data.stackoverflow.comments` );
stackoverflowtable, run the following query to see the earliest comment for each user:
SELECT user_id, ARRAY_AGG(STRUCT(post_id, creation_date AS earliest_comment) ORDER BY creation_date ASC LIMIT 1)[OFFSET(0)].* FROM `PROJECT.DATASET.stackoverflow` GROUP BY user_id ORDER BY user_id ASC
This query takes about 25 seconds to run and processes 1.88 GB of data.
Create a second table with identical data that creates a
commentsfield using a
STRUCTtype to store the
creation_datedata, instead of two individual fields:
CREATE OR REPLACE TABLE `PROJECT.DATASET.stackoverflow_nested` AS ( SELECT user_id, ARRAY_AGG(STRUCT(post_id, creation_date) ORDER BY creation_date ASC) AS comments FROM `bigquery-public-data.stackoverflow.comments` GROUP BY user_id )
stackoverflow_nestedtable, run the following query to see the earliest comment for each user:
SELECT user_id, (SELECT AS STRUCT post_id, creation_date as earliest_comment FROM UNNEST(comments) ORDER BY creation_date ASC LIMIT 1).* FROM `PROJECT.DATASET.stackoverflow_nested` ORDER BY user_id ASC
This query takes about 10 seconds to run and processes 1.28 GB of data.
stackoverflow_nestedtables when you are finished with them.