Introduction to materialized views

This page provides an overview of BigQuery Materialized Views.

Introduction

Materialized views are precomputed views that periodically cache results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base table to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.

Queries that use materialized views are generally faster and consume less resources than queries that retrieve the same data only from the base table. Materialized views are helpful to significantly boost performance of workloads that have the characteristic of common and repeated queries.

The following are key characteristics of BigQuery Materialized Views:

  • Zero maintenance: A materialized view is recomputed in background once the base table has changed. All incremental data changes from the base tables are automatically added to the materialized views. No user inputs are required.

  • Always fresh: A materialized view is always consistent with the base table, including BigQuery streaming tables. If a base table is modified via update, merge, partition truncation, or partition expiration, BigQuery will invalidate the impacted portions of the materialized view and fully re-read the corresponding portion of the base table. For an unpartitioned materialized view, BigQuery will invalidate the entire materialized view and re-read the entire base table. For a partitioned materialized view, BigQuery will invalidate the affected partitions of the materialized view and re-read the entire corresponding partitions from the base table. Partitions that are append-only are not invalidated and are read in delta mode. In other words, there will never be a situation when querying a materialized view results in stale data.

  • Smart tuning: If a query or part of a query against the source table can instead be resolved by querying the materialized view, BigQuery will rewrite (reroute) the query to use the materialized view for better performance and/or efficiency.

Benefits

BigQuery Materialized Views provides the following benefits:

  • Reduction in the execution time and cost for queries with aggregate functions. The largest benefit is gained when a query's computation cost is high and the resulting data set is small.

  • Automatic and transparent BigQuery optimization because the optimizer uses a materialized view, if available, to improve the query execution plan. This optimization does not require any changes to the queries.

  • The same resilience and high availability as BigQuery tables.

Use cases

BigQuery Materialized Views targets the following use cases:

  • Faster query performance: If you have a raw table and you perform online analytical processing (OLAP) style aggregation that requires significant processing, and you have predictable and repeated queries, such as from extract, transform, load (ETL) or business intelligence (BI) pipelines, you should use BigQuery Materialized Views.

  • Aggregation of real-time data: If you want access to data for decisions in real-time, BigQuery natively supports streaming capabilities. BigQuery Materialized Views integrates with BigQuery streaming and performs aggregations in real-time to provide up-to-date information.

Creating materialized views

You can create a materialized view through the BigQuery web UI, the bq command-line tool, or the BigQuery API. As an example, in the BigQuery web UI you can use the following DDL statement:

CREATE MATERIALIZED VIEW  project-id.my_dataset.my_mv_table
AS SELECT product_id, SUM(clicks) AS sum_clicks
FROM  project-id.my_dataset.my_base_table
GROUP BY 1

For more details on how to use the BigQuery web UI, the bq command-line tool, or the BigQuery API to create a materialized view, see Create and use materialized views.

Costs

Costs are associated with BigQuery Materialized Views for the following components:

  • Querying materialized views
  • Maintaining materialized views, such as when the view is refreshed
  • Storing materialized view tables
Component On-demand pricing Flat-rate pricing
Querying Bytes processed by the materialized view and any necessary portion of the base table. 1 No extra cost. Slots are consumed during query time.
Maintenance Bytes processed. Slots are consumed for the incremental view refresh. No extra cost. Slots are consumed during query time.
Storage Bytes stored in a materialized view Bytes stored in a materialized view

1 If the base table is append-only, then only the delta since the last view refresh will be processed from the base table. For more information, see Incremental updates.

Storage cost details

For AVG, ARRAY_AGG, APPROX_COUNT_DISTINCT aggregate values in a materialized view, the final value is not directly stored. Instead, BigQuery internally stores a materialized view as an intermediate sketch used to produce the final value.

As an example, consider a materialized view created with the following command:

CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table AS
SELECT date, AVG(net_paid) AS avg_paid
FROM project-id.my_dataset.my_base_table
GROUP BY date

While avg_paid column will be rendered as NUMERIC or FLOAT64 to the user, internally it will be stored as BYTES, with its content being an intermediate sketch in proprietary format. For the purpose of data size calculation, the column will be treated as BYTES.

Access control

Access granted to a materialized view is set at the dataset level. If the user has access to the dataset, the user has access to the materialized views in the dataset.

Monitoring materialized views

You can determine information about materialized views and materialized view refresh jobs by using the BigQuery API.

For discovery of materialized views, list them by calling the tables.list method. To retrieve the schema and other properties of a table, call the tables.get method.

For discovery of materialized view refresh jobs, list them by calling the jobs.list method. To retrieve details about the jobs, call the jobs.get method. Automatic refresh jobs contain the materialized_view_refresh prefix within the job ID, and they are started by a BigQuery admin account.

Comparison to other BigQuery techniques

The following table summarizes the similarities and difference between BigQuery caching, scheduled queries, standard views, and materialized views.

Component Caching Scheduled queries Standard views Materialized views
Performance Yes No No Yes
Query supported All All All Aggregation — filtering and grouping
Partitioning and clustering No Yes NA Yes
Incremental refresh No No No Yes
Additional storage No Yes No Yes
Query rewrite No No No Yes
Maintenance costs No Yes NA Yes
Data staleness Never Yes Never Never
Limited to a single table No No No Yes

Interaction with other BigQuery features

The following BigQuery features work transparently with materialized views:

  • Query plan explanation: The query plan will reflect which materialized views were scanned (if any), and will show how many bytes were read from the materialized view and base table combined.

  • Query caching: Results of a query that gets rewritten using a materialized view can be cached subject to the usual limitations (using of deterministic functions, no streaming into the base table, etc.).

  • Cost calculation: With on-demand billing, if the query was rewritten using materialized view(s) and ended up scanning less bytes than scanning the base table, the query will end up being cheaper.

  • Cost restriction: If you have set a value for maximum bytes billed, and a query will read bytes beyond the limit, the query fails without incurring a charge, whether the query used the materialized view, the base table, or both.

  • Cost estimation via dry run: A dry run repeats query rewrite logic using the available materialized views and should provide an accurate cost estimate. You can use it as a way to test whether a specific query utilizes any materialized views.

Limitations

  • You cannot directly manipulate, or read raw content from, a materialized view. This means the following actions are not supported:

    • Copying a materialized view, either as a source of destination of a copy job.
    • Exporting a materialized view.
    • Loading data into a materialized view
    • Writing a query result into a materialized view
    • Running DML statements over a materialized view.
  • A materialized view is required to be in the same dataset as the base table.

  • A materialized view is limited to referencing only a single table, and is not able to use joins or UNNEST functionality.

  • Materialized views cannot be nested on other materialized views.

  • You can use only the standard SQL dialect for materialized views.

  • You can create a maximum of 20 materialized views per base table.

  • SHOW DDL statements are not supported on materialized views

  • ALTER DDL statements, other than ALTER with SET OPTIONS, are not supported on materialized views

  • If you delete the base table without first deleting the materialized view, queries over the materialized view will fail, as will refreshes. If you decide to re-create the base table, you will also need to recreate the materialized view.

  • Protecting data using Cloud Key Management Service keys is not supported. You can't protect the data in a materialized view using Cloud KMS keys, nor can you create a materialized view over a base table protected by Cloud KMS keys.

ARRAY_AGG considerations

Selecting ARRAY_AGG columns from a materialized view carry the same limitations as using ARRAY_AGG when selecting from the base table. In the case of RESPECT NULLS, which is set by default, output arrays cannot contain NULL elements. An error is raised if an output array contains NULL for an element.

For this materialized view:

CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table AS
SELECT
  column_1, ARRAY_AGG(column_2 ORDER BY column_3 DESC LIMIT 4) column_arr
FROM `project.dataset.base_table`
GROUP BY column_1

This query will fail.

SELECT * FROM project-id.my_dataset.my_mv_table

The failure details:

query: Array cannot have a null element; error in writing field

To avoid this failure, add IGNORE NULLS to ARRAY_AGG:

ARRAY_AGG(col_2 ORDER BY col_3 DESC LIMIT 4 IGNORE NULLS) col

As another option, format the output array column with the '%t' or '%T' specifier.

SELECT column_1, FORMAT('%T', column_arr) FROM project-id.my_dataset.my_mv_table

JOIN considerations

While joins are not currently supported, you may be able to leverage materialized views to reduce cost and latency of a query that does aggregation on top of join. For example, consider a case where you are joining a large fact table with few small-dimension tables and then aggregating on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys, joining the result with the dimension table, and then finally performing a post-aggregation.

To demonstrate this approach, let's use query #52 from the TPC-DS benchmark.

Here's the original query:

SELECT  dt.d_year,
   item.i_brand_id brand_id,
   item.i_brand brand,
 SUM(ss_ext_sales_price) ext_price
 FROM project.tpcds_10T.date_dim dt,
   project.tpcds_10T.store_sales,
   project.tpcds_10T.item
 WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
    and store_sales.ss_item_sk = item.i_item_sk
    and item.i_manager_id = 1
    and dt.d_moy=12
    and dt.d_year=1998
 GROUP BY dt.d_year,
   item.i_brand,
   item.i_brand_id
 ORDER BY dt.d_year,
   ext_price DESC,
   brand_id
LIMIT 100

Here's the alternative query:

WITH sales_summary AS (
  SELECT ss_sold_date_sk,
    ss_item_sk,
    SUM(ss_ext_sales_price) AS ext_price
  FROM project.tpcds_10T.store_sales
  GROUP BY ss_sold_date_sk, ss_item_sk
)
SELECT dt.d_year,
  item.i_brand_id brand_id,
  item.i_brand brand,
  SUM(ext_price) AS ext_price
FROM sales_summary,
  project.tpcds_10T.date_dim dt,
  project.tpcds_10T.item
WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk
  AND sales_summary.ss_item_sk = item.i_item_sk
    AND item.i_manager_id = 1
    AND dt.d_moy=12
    AND dt.d_year=1998
GROUP BY dt.d_year,
  item.i_brand,
  item.i_brand_id
ORDER BY dt.d_year,
  ext_price desc,
  brand_id
LIMIT 100

Here's how to create the view:

CREATE OR REPLACE MATERIALIZED VIEW project-id.tpcds_10T.sales_summary_mv AS
  SELECT
    ss_sold_date_sk,
    ss_item_sk,
    SUM(ss_ext_sales_price) AS ext_price
  FROM project-id.tpcds_10T.store_sales
  GROUP BY ss_sold_date_sk, ss_item_sk

Then, the following shows how to join the materialized view with the dimension tables:

SELECT dt.d_year,
  item.i_brand_id brand_id,
  item.i_brand brand,
  SUM(ext_price) as ext_price
FROM sales_summary_mv,
  project.tpcds_10T.date_dim dt,
  project.tpcds_10T.item
WHERE sales_summary.ss_sold_date_sk = dt.d_date_sk
  AND sales_summary.ss_item_sk = item.i_item_sk
    AND item.i_manager_id = 1
    AND dt.d_moy=12
    AND dt.d_year=1998
GROUP BY dt.d_year,
  item.i_brand,
  item.i_brand_id
ORDER BY dt.d_year,
  ext_price desc,
  brand_id
LIMIT 100

Next steps