This document provides best practices for using BigQuery Materialized Views. Before you read this document, familiarize yourself with Introduction to materialized views and Creating and using materialized views.
Considerations when creating a materialized view
Ensure your materialized view definition reflects query patterns against the base table. Because there is a maximum of 20 materialized views per table, you should not create a materialized view for every permutation of a query. Instead, create materialized views to serve a broader set of queries.
For example, consider a query on a table where users often filter
department. You can group by these
columns and optionally cluster by them, instead of adding filters like
= 123 into the materialized view.
As another example, users often use date filters, either by specific date
where order_date = current_date()) or date range (
where order_date between
'2019-10-01' and '2019-10-31'). As a good practice, add a date range filter in
the materialized view that covers expected date ranges in the query:
CREATE MATERIALIZED VIEW ... WHERE date > '2019-01-01' GROUP BY date
If the base table is partitioned and its materialized view has significant size, then the materialized view should be partitioned as well. As a general guideline, a materialized view has significant size if it is about the same size as, or larger than, one partition of the base table.
When you select
columns from a materialized view, you're subject to the same
limitations as when you use an
to select from the base table. In the
RESPECT NULLS treatment, which is set by default, output arrays cannot
elements. An error is raised if an output array contains
NULL for an element.
For this materialized view:
CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table AS SELECT column_1, ARRAY_AGG(column_2 ORDER BY column_3 DESC LIMIT 4) column_arr FROM `project.dataset.base_table` GROUP BY column_1
the following query fails:
SELECT * FROM project-id.my_dataset.my_mv_table
The failure details are as follows:
query: Array cannot have a null element; error in writing field
To avoid this failure, add
IGNORE NULLS to the
ARRAY_AGG(col_2 ORDER BY col_3 DESC LIMIT 4 IGNORE NULLS) col
As another option, format the output array column with the
SELECT column_1, FORMAT('%T', column_arr) FROM project-id.my_dataset.my_mv_table
Although joins are not currently supported, you might be able to leverage materialized views to reduce cost and latency of a query that performs aggregation on top of a join. For example, consider a case where you join a large fact table with a few small-dimension tables, and then aggregate on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys, then join the result with the dimension table, and then finally perform a post-aggregation.
To demonstrate this approach, let's use query #52 from the TPC-DS benchmark.
Here's the original query:
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_ext_sales_price) ext_price FROM project.tpcds_10T.date_dim dt, project.tpcds_10T.store_sales, project.tpcds_10T.item WHERE dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = item.i_item_sk and item.i_manager_id = 1 and dt.d_moy=12 and dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price DESC, brand_id LIMIT 100
Here's the alternative query:
WITH sales_summary AS ( SELECT ss_sold_date_sk, ss_item_sk, SUM(ss_ext_sales_price) AS ext_price FROM project.tpcds_10T.store_sales GROUP BY ss_sold_date_sk, ss_item_sk ) SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ext_price) AS ext_price FROM sales_summary, project.tpcds_10T.date_dim dt, project.tpcds_10T.item WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk AND sales_summary.ss_item_sk = item.i_item_sk AND item.i_manager_id = 1 AND dt.d_moy=12 AND dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price desc, brand_id LIMIT 100
Here's how to create the view:
CREATE OR REPLACE MATERIALIZED VIEW project-id.tpcds_10T.sales_summary_mv AS SELECT ss_sold_date_sk, ss_item_sk, SUM(ss_ext_sales_price) AS ext_price FROM project-id.tpcds_10T.store_sales GROUP BY ss_sold_date_sk, ss_item_sk
Next, you can join the materialized view with the dimension tables as follows:
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ext_price) as ext_price FROM sales_summary_mv, project.tpcds_10T.date_dim dt, project.tpcds_10T.item WHERE sales_summary_mv.ss_sold_date_sk = dt.d_date_sk AND sales_summary_mv.ss_item_sk = item.i_item_sk AND item.i_manager_id = 1 AND dt.d_moy=12 AND dt.d_year=1998 GROUP BY dt.d_year, item.i_brand, item.i_brand_id ORDER BY dt.d_year, ext_price desc, brand_id LIMIT 100
Maintenance considerations for materialized views
Monitor the cost of the refresh job, and adjust the automatic refresh interval
if needed. Specifically, monitor
For example, if the ingestion rate in the base table is relatively small, it makes sense to refresh the view less often. Otherwise, if the underlying data changes quickly, it makes sense to refresh more often.
If the base table ingests at predefined points in time, such as by a nightly extract, transform, and load (ETL) pipeline, consider taking full ownership of the materialized view maintenance schedule, as follows:
Perform a manual refresh, either as part of the ETL pipeline, or by configuring a scheduled query at specific times of the day.
To control access to views in BigQuery, see Controlling access to views.