Best practices for materialized views

This document provides best practices for using BigQuery Materialized Views. Before you read this document, familiarize yourself with Introduction to materialized views and Creating and using materialized views.

Considerations when creating a materialized view

Ensure your materialized view definition reflects query patterns against the base table. Because there is a maximum of 20 materialized views per table, you should not create a materialized view for every permutation of a query. Instead, create materialized views to serve a broader set of queries.

For example, consider a query on a table where users often filter by columns user_id or department. You can group by these columns and optionally cluster by them, instead of adding filters like user_id = 123 into the materialized view.

As another example, users often use date filters, either by specific date (where order_date = current_date()) or date range (where order_date between '2019-10-01' and '2019-10-31'). As a good practice, add a date range filter in the materialized view that covers expected date ranges in the query:

CREATE MATERIALIZED VIEW  ...
WHERE date > '2019-01-01' GROUP BY date

If the base table is partitioned and its materialized view has significant size, then the materialized view should be partitioned as well. As a general guideline, a materialized view has significant size if it is about the same size as, or larger than, one partition of the base table.

ARRAY_AGG considerations

When you select ARRAY_AGG columns from a materialized view, you're subject to the same limitations as when you use an ARRAY_AGG clause to select from the base table. In the case of RESPECT NULLS treatment, which is set by default, output arrays cannot contain NULL elements. An error is raised if an output array contains NULL for an element.

For this materialized view:

CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table AS
SELECT
  column_1, ARRAY_AGG(column_2 ORDER BY column_3 DESC LIMIT 4) column_arr
FROM `project.dataset.base_table`
GROUP BY column_1

the following query fails:

SELECT * FROM project-id.my_dataset.my_mv_table

The failure details are as follows:

query: Array cannot have a null element; error in writing field

To avoid this failure, add IGNORE NULLS to the ARRAY_AGG clause:

ARRAY_AGG(col_2 ORDER BY col_3 DESC LIMIT 4 IGNORE NULLS) col

As another option, format the output array column with the '%t' or '%T' specifier.

SELECT column_1, FORMAT('%T', column_arr) FROM project-id.my_dataset.my_mv_table

JOIN considerations

Although joins are not currently supported, you might be able to leverage materialized views to reduce cost and latency of a query that performs aggregation on top of a join. For example, consider a case where you join a large fact table with a few small-dimension tables, and then aggregate on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys, then join the result with the dimension table, and then finally perform a post-aggregation.

To demonstrate this approach, let's use query #52 from the TPC-DS benchmark.

Here's the original query:

SELECT  dt.d_year,
   item.i_brand_id brand_id,
   item.i_brand brand,
 SUM(ss_ext_sales_price) ext_price
 FROM project.tpcds_10T.date_dim dt,
   project.tpcds_10T.store_sales,
   project.tpcds_10T.item
 WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
    and store_sales.ss_item_sk = item.i_item_sk
    and item.i_manager_id = 1
    and dt.d_moy=12
    and dt.d_year=1998
 GROUP BY dt.d_year,
   item.i_brand,
   item.i_brand_id
 ORDER BY dt.d_year,
   ext_price DESC,
   brand_id
LIMIT 100

Here's the alternative query:

WITH sales_summary AS (
  SELECT ss_sold_date_sk,
    ss_item_sk,
    SUM(ss_ext_sales_price) AS ext_price
  FROM project.tpcds_10T.store_sales
  GROUP BY ss_sold_date_sk, ss_item_sk
)
SELECT dt.d_year,
  item.i_brand_id brand_id,
  item.i_brand brand,
  SUM(ext_price) AS ext_price
FROM sales_summary,
  project.tpcds_10T.date_dim dt,
  project.tpcds_10T.item
WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk
  AND sales_summary.ss_item_sk = item.i_item_sk
    AND item.i_manager_id = 1
    AND dt.d_moy=12
    AND dt.d_year=1998
GROUP BY dt.d_year,
  item.i_brand,
  item.i_brand_id
ORDER BY dt.d_year,
  ext_price desc,
  brand_id
LIMIT 100

Here's how to create the view:

CREATE OR REPLACE MATERIALIZED VIEW project-id.tpcds_10T.sales_summary_mv AS
  SELECT
    ss_sold_date_sk,
    ss_item_sk,
    SUM(ss_ext_sales_price) AS ext_price
  FROM project-id.tpcds_10T.store_sales
  GROUP BY ss_sold_date_sk, ss_item_sk

Next, you can join the materialized view with the dimension tables as follows:

SELECT dt.d_year,
  item.i_brand_id brand_id,
  item.i_brand brand,
  SUM(ext_price) as ext_price
FROM sales_summary_mv,
  project.tpcds_10T.date_dim dt,
  project.tpcds_10T.item
WHERE sales_summary_mv.ss_sold_date_sk = dt.d_date_sk
  AND sales_summary_mv.ss_item_sk = item.i_item_sk
    AND item.i_manager_id = 1
    AND dt.d_moy=12
    AND dt.d_year=1998
GROUP BY dt.d_year,
  item.i_brand,
  item.i_brand_id
ORDER BY dt.d_year,
  ext_price desc,
  brand_id
LIMIT 100

Maintenance considerations for materialized views

Monitor the cost of the refresh job, and adjust the automatic refresh interval if needed. Specifically, monitor total_bytes_processed and total_slot_ms.

For example, if the ingestion rate in the base table is relatively small, it makes sense to refresh the view less often. Otherwise, if the underlying data changes quickly, it makes sense to refresh more often.

If the base table ingests at predefined points in time, such as by a nightly extract, transform, and load (ETL) pipeline, consider taking full ownership of the materialized view maintenance schedule, as follows:

  1. Disable automatic refresh.

  2. Perform a manual refresh, either as part of the ETL pipeline, or by configuring a scheduled query at specific times of the day.

View security

To control access to views in BigQuery, see Controlling access to views.