This document provides best practices for using BigQuery Materialized Views. Before you read this document, familiarize yourself with Introduction to materialized views and Creating and using materialized views.
Considerations when creating a materialized view
Ensure your materialized view definition reflects query patterns against the base table. Because there is a maximum of 20 materialized views per table, you should not create a materialized view for every permutation of a query. Instead, create materialized views to serve a broader set of queries.
For example, consider a query on a table where users often filter
by columns user_id
or department
. You can group by these
columns and optionally cluster by them, instead of adding filters like user_id
= 123
into the materialized view.
As another example, users often use date filters, either by specific date
(where order_date = current_date()
) or date range (where order_date between
'2019-10-01' and '2019-10-31'
). As a good practice, add a date range filter in
the materialized view that covers expected date ranges in the query:
CREATE MATERIALIZED VIEW ...
WHERE date > '2019-01-01' GROUP BY date
If the base table is partitioned and its materialized view has significant size, then the materialized view should be partitioned as well. As a general guideline, a materialized view has significant size if it is about the same size as, or larger than, one partition of the base table.
ARRAY_AGG
considerations
When you select ARRAY_AGG
columns from a materialized view, you're subject to the same
limitations as when you use an ARRAY_AGG
clause
to select from the base table. In the
case of RESPECT NULLS
treatment, which is set by default, output arrays cannot
contain NULL
elements. An error is raised if an output array contains NULL
for an element.
For this materialized view:
CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table AS
SELECT
column_1, ARRAY_AGG(column_2 ORDER BY column_3 DESC LIMIT 4) column_arr
FROM `project.dataset.base_table`
GROUP BY column_1
the following query fails:
SELECT * FROM project-id.my_dataset.my_mv_table
The failure details are as follows:
query: Array cannot have a null element; error in writing field
To avoid this failure, add IGNORE NULLS
to the ARRAY_AGG
clause:
ARRAY_AGG(col_2 ORDER BY col_3 DESC LIMIT 4 IGNORE NULLS) col
As another option, format the output array column with the
'%t'
or '%T'
specifier.
SELECT column_1, FORMAT('%T', column_arr) FROM project-id.my_dataset.my_mv_table
JOIN
considerations
Although joins are not currently supported, you might be able to leverage materialized views to reduce cost and latency of a query that performs aggregation on top of a join. For example, consider a case where you join a large fact table with a few small-dimension tables, and then aggregate on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys, then join the result with the dimension table, and then finally perform a post-aggregation.
To demonstrate this approach, let's use query #52 from the TPC-DS benchmark.
Here's the original query:
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) ext_price FROM project.tpcds_10T.date_dim dt, project.tpcds_10T.store_sales, project.tpcds_10T.item WHERE dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = item.i_item_sk and item.i_manager_id = 1 and dt.d_moy=12 and dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price DESC, brand_id LIMIT 100
Here's the alternative query:
WITH sales_summary AS ( SELECT ss_sold_date_sk, ss_item_sk, SUM(ss_ext_sales_price) AS ext_price FROM project.tpcds_10T.store_sales GROUP BY ss_sold_date_sk, ss_item_sk ) SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ext_price) AS ext_price FROM sales_summary, project.tpcds_10T.date_dim dt, project.tpcds_10T.item WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk AND sales_summary.ss_item_sk = item.i_item_sk AND item.i_manager_id = 1 AND dt.d_moy=12 AND dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price desc, brand_id LIMIT 100
Here's how to create the view:
CREATE OR REPLACE MATERIALIZED VIEW project-id.tpcds_10T.sales_summary_mv AS SELECT ss_sold_date_sk, ss_item_sk, SUM(ss_ext_sales_price) AS ext_price FROM project-id.tpcds_10T.store_sales GROUP BY ss_sold_date_sk, ss_item_sk
Next, you can join the materialized view with the dimension tables as follows:
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ext_price) as ext_price FROM sales_summary_mv, project.tpcds_10T.date_dim dt, project.tpcds_10T.item WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk AND sales_summary.ss_item_sk = item.i_item_sk AND item.i_manager_id = 1 AND dt.d_moy=12 AND dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price desc, brand_id LIMIT 100
Maintenance considerations for materialized views
Monitor the cost of the refresh job, and adjust the automatic refresh interval
if needed. Specifically, monitor total_bytes_processed
and total_slot_ms
.
For example, if the ingestion rate in the base table is relatively small, it makes sense to refresh the view less often. Otherwise, if the underlying data changes quickly, it makes sense to refresh more often.
If the base table ingests at predefined points in time, such as by a nightly extract, transform, and load (ETL) pipeline, consider taking full ownership of the materialized view maintenance schedule, as follows:
Perform a manual refresh, either as part of the ETL pipeline, or by configuring a scheduled query at specific times of the day.