Optimize query computation
This document provides the best practices for optimizing your query performance.
After the query is complete, you can
view the query plan
in the Google Cloud console. You can also request execution details by using
jobs.get REST API method.
The query plan includes details about query stages and steps. These details can help you identify ways to improve query performance. For example, if you notice a stage that writes a lot more output than other stages, it might mean that you need to filter earlier in the query.
To learn more about the query plan and see examples of how the query plan information can help you to improve query performance, see Get query performance insights. After addressing the query performance insights, you can further optimize your query by performing the following tasks:
- Reduce data that is to be processed
- Optimize query operations
- Reduce the output of your query
- Use a BigQuery BI Engine reservation
- Avoid anti-SQL patterns
- Specify constraints in table schema
Reduce data processed
You can reduce data that needs to be processed by using the options described in the following sections.
Best practice: Control projection by querying only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
- Use the data preview options. If you are experimenting with data or exploring data, use one of the
data preview options instead of
- Query specific columns. Applying a
LIMITclause to a
SELECT *query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota. Instead, query only the columns you need. For example, use
SELECT * EXCEPTto exclude one or more columns from the results.
- Use partitioned tables. If you do require queries against every column in a table, but only against a subset of data, consider:
SELECT * EXCEPT. Querying a subset of data or using
SELECT * EXCEPTcan greatly reduce the amount of data that is read by a query. In addition to the cost savings, performance is improved by reducing the amount of data I/O and the amount of materialization that is required for the query results.
SELECT * EXCEPT (col1, col2, col5) FROM mydataset.newtable
Avoid excessive wildcard tables
Best practice: When querying wildcard tables, you must use the most granular prefix.
Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains the following resources:
- Multiple, similarly named tables with compatible schemas
- Sharded tables
When you query a wildcard table, specify a wildcard (
*) after the common
table prefix. For example,
queries all tables from the 1940s.
More granular prefixes perform better than shorter prefixes. For example,
FROM performs better than
FROM because fewer tables match
Avoid tables sharded by date
Best practice: Don't use tables sharded by date (also called date-named tables) in place of time-partitioned tables.
Partitioned tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Avoid oversharding tables
Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.
Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead.
Because of the low cost of BigQuery storage, you don't need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.
Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.
The amount and source of data read by a query can impact query performance and cost.
Prune partitioned queries
Best practice: When querying a partitioned table, to filter with partitions on partitioned tables, use the following columns:
- For ingestion-time partitioned tables, use the pseudo-column
- For partitioned tables such as the time-unit column-based and integer-range, use the partitioning column.
For time-unit partitioned tables, filtering the data with
partitioning column lets you specify a date or range of dates. For example,
WHERE clause uses the
_PARTITIONTIME pseudo column to specify
partitions between January 1, 2016 and January 31, 2016:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("20160101") AND TIMESTAMP("20160131")
The query processes data only in the partitions that are indicated by the date range. Filtering your partitions improves query performance and reduces costs.
Reduce data before using a
Best practice: Reduce the amount of data that is processed before a
clause by performing aggregations.
GROUP BY clause
with aggregate functions
is computationally intensive, because these types of queries use
As these queries are computationally intensive, you must use a
clause only when necessary.
For queries with
GROUP BY and
JOIN, perform aggregation earlier in the query
to reduce the amount of data processed.
For example, the following query performs a
JOIN on two large tables without
any filtering beforehand:
WITH users_posts AS ( SELECT * FROM `bigquery-public-data`.stackoverflow.comments AS c JOIN `bigquery-public-data`.stackoverflow.users AS u ON c.user_id = u.id ) SELECT user_id, ANY_VALUE(display_name) AS display_name, ANY_VALUE(reputation) AS reputation, COUNT(text) AS comments_count FROM users_posts GROUP BY user_id ORDER BY comments_count DESC LIMIT 20;
This query pre-aggregates the comment counts which reduces the amount of data
read for the
WITH comments AS ( SELECT user_id, COUNT(text) AS comments_count FROM `bigquery-public-data`.stackoverflow.comments WHERE user_id IS NOT NULL GROUP BY user_id ORDER BY comments_count DESC LIMIT 20 ) SELECT user_id, display_name, reputation, comments_count FROM comments JOIN `bigquery-public-data`.stackoverflow.users AS u ON user_id = u.id ORDER BY comments_count DESC;
Best practice: Use a
to limit the amount of data a query returns. When possible, use
columns in the
DATE columns are typically faster
than operations on
BYTE columns. When possible, use a column that
uses one of these data types in the
WHERE clause to reduce the amount of
data returned by the query.
Optimize query operations
You can optimize your query operations by using the options described in the following sections.
Avoid repeatedly transforming data
Best practice: If you are using SQL to perform ETL operations, then avoid situations where you are repeatedly transforming the same data.
For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.
Avoid multiple evaluations of the same CTEs
When your query contains common table expressions (CTEs) that are used in multiple places in the query, they might end up being evaluated each time they are referenced. The query optimizer attempts to detect parts of the query that could be executed only once, but this might not always be possible. As a result, using a CTE might not help reduce internal query complexity and resource consumption.
You can store the result of a CTE in a scalar variable or a temporary table depending on the data that the CTE returns.
Avoid repeated joins and subqueries
Best practice: Avoid repeatedly joining the same tables and using the same subqueries.
Instead of repeatedly joining the data, it might be more performant for you to use nested repeated data to represent the relationships. Nested repeated data saves you the performance impact of the communication bandwidth that a join requires. It also saves you the I/O costs that you incur by repeatedly reading and writing the same data. For more information, see use nested and repeated fields.
Similarly, repeating the same subqueries affects performance through repetitive query processing. If you are using the same subqueries in multiple queries, consider materializing the subquery results in a table. Then consume the materialized data in your queries.
Materializing your subquery results improves performance and reduces the overall amount of data that BigQuery reads and writes. The small cost of storing the materialized data outweighs the performance impact of repeated I/O and query processing.
Optimize your join patterns
Best practice: For queries that join data from multiple tables, optimize your join patterns by starting with the largest table.
When you create a query by using a
JOIN clause, consider the order in which you are
merging the data. The GoogleSQL query optimizer determines which table
should be on which side of the join. As a best practice, place the
table with the largest number of rows first, followed by the table with the
fewest rows, and then place the remaining tables by decreasing size.
When you have a large table as the left side of the
JOIN and a small one on
the right side of the
JOIN, a broadcast join is created. A broadcast join
sends all the data in the smaller table to each slot that processes the larger
table. It is advisable to perform the broadcast join first.
To view the size of the tables in your
Get information about tables.
ORDER BY clause
Best practice: When you use the
ORDER BY clause, ensure that you follow
the best practices:
ORDER BYin the outermost query or within window clauses. Push complex operations to the end of the query. Placing an
ORDER BYclause in the middle of a query greatly impacts performance unless it is being used in a window function.
Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. This technique reduces the data to be processed before the complex operations are performed.
LIMITclause. If you are ordering a very large number of values but don't need to have all of them returned, use a
LIMITclause. For example, the following query orders a very large result set and throws a
Resources exceedederror. The query sorts by the
titlecolumn contains millions of values.
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title;
To remove the error, use a query like the following:
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC LIMIT 1000;
Use a window function. If you are ordering a very large number of values, use a window function, and limit data before calling the window function. For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:
SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM bigquery-public-data.stackoverflow.users ORDER BY user_rank ASC LIMIT 10;
This query takes approximately 15 seconds to run. This query uses
LIMITat the end of the query, but not in the
DENSE_RANK() OVERwindow function. Because of this, the query requires all of the data to be sorted on a single worker node.
Instead, you should limit the dataset before computing the window function in order to improve performance:
WITH users AS ( SELECT id, reputation, creation_date, FROM bigquery-public-data.stackoverflow.users ORDER BY creation_date ASC LIMIT 10) SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM users ORDER BY user_rank;
This query takes approximately 2 seconds to run, while returning the same results as the previous query.
One caveat is that the
DENSE_RANK()function ranks the data within years, so for ranking data that spans across multiple years, these queries don't give identical results.
Split complex queries into smaller ones
REGEX functions, and layered subqueries or joins can be slow
and resource intensive to run. Trying to fit all computations in one huge
SELECT statement, for example to make it a view, is sometimes an antipattern,
and it can result in a slow, resource-intensive query. In extreme cases, the
internal query plan becomes so complex that BigQuery is unable to
Splitting up a complex query allows for materializing intermediate results in variables or temporary tables. You can then use these intermediate results in other parts of the query. It is increasingly useful when these results are needed in more than one place of the query.
Often it lets you to better express the true intent of parts of the query with temporary tables being the data materialization points.
Use nested and repeated fields
For information about how to denormalize data storage using nested and repeated fields, see Use nested and repeated fields.
INT64 data types in joins
Best practice: Use
INT64 data types in joins instead of
types to reduce cost and improve comparison performance.
BigQuery doesn't index primary keys like traditional databases,
so the wider the join column is, the longer the comparison takes. Therefore,
INT64 data types in joins is cheaper and more efficient than using
STRING data types.
Reduce query outputs
You can reduce the query outputs by using the options described in the following the sections.
Materialize large result sets
Best practice: Consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.
BigQuery limits cached results to approximately 10 GB
compressed. Queries that return larger results overtake this limit and
frequently result in the following error:
Response too large.
This error often occurs when you select a large number of fields from a table with a considerable amount of data. Issues writing cached results can also occur in ETL-style queries that normalize data without reduction or aggregation.
You can overcome the limitation on cached result size by using the following options:
- Use filters to limit the result set
- Use a
LIMITclause to reduce the result set, especially if you are using an
- Write the output data to a destination table
You can page through the results using the BigQuery REST API. For more information, see Paging through table data.
Use BI Engine
To further accelerates your SQL queries by caching the data that you use most
frequently, consider adding a BI Engine
reservation to the project where the queries are being computed.
BigQuery BI Engine uses a vectorized query engine to accelerate the
Avoid anti-SQL patterns
The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.
Avoid self joins
Best practice: Instead of using self-joins, use a window (analytic) function.
Typically, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially squares the number of output rows. This increase in output data can cause poor performance.
To reduce the number of additional bytes that are generated by the query, use a window (analytic) function.
Avoid cross joins
Best practice: Avoid joins that generate more outputs than inputs. When a
CROSS JOIN is required, pre-aggregate your data.
Cross joins are queries where each row from the first table is joined to every row in the second table, with non-unique keys on both sides. The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
If the query job completes, the query plan explanation shows output rows
versus input rows. You can confirm a Cartesian product
by modifying the query to print the number of rows on each side of the
clause, grouped by the join key.
To avoid performance issues associated with joins that generate more outputs than inputs:
- Use a
GROUP BYclause to pre-aggregate the data.
- Use a window function. Window functions are often more efficient than using a cross join. For more information, see window functions.
Avoid DML statements that update or insert single rows
Best practice: Avoid DML statements that update or insert single rows. Batch your updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Cloud SQL.
BigQuery DML statements are intended for bulk updates.
DELETE DML statements in BigQuery are oriented towards
periodic rewrites of your data, not single row mutations. The
statement is intended to be used sparingly. Inserts consume the same
as load jobs. If your use case involves frequent single row inserts, consider
streaming your data instead.
If batching your
UPDATE statements yields many tuples in very long queries,
you might approach the query length limit of 256 KB. To work around the query
length limit, consider whether your updates can be handled based on a logical
criteria instead of a series of direct tuple replacements.
For example, you could load your set of replacement records into another table,
then write the DML statement to update all values in the original table if the
non-updated columns match. For example, if the original data is in table
the updates are staged in table
u, the query would look like the following:
UPDATE dataset.t t SET my_column = u.my_column FROM dataset.u u WHERE t.my_key = u.my_key
Filter data for skewed data
Best practice: If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
Partitions become large when your partition key has a value
that occurs more often than any other value. For example, grouping by a
user_id field where there are many entries for
When a slot's resources are overwhelmed, a
error results. Reaching the shuffle limit for a slot (2TB in memory
compressed) also causes the shuffle to write to disk and further impacts
performance. Customers with capacity-based pricing
can increase the number of allocated slots.
If you examine the query execution graph and see a significant difference between avg and max compute times, your data is probably skewed.
To avoid performance issues that result from data skew:
- Use an approximate aggregate function such as
APPROX_TOP_COUNTto determine if the data is skewed.
- Filter your data as early as possible.
Data skew can also appear when you use
JOIN clauses. Because
BigQuery shuffles data on each side of the join, all data with the
same join key goes to the same shard. This shuffling can overload the slot.
To avoid performance issues that are associated with unbalanced joins, you can perform the following tasks:
- Pre-filter rows from the table with the unbalanced key.
- If possible, split the query into two queries.
For example, instead of using the following clause that contains a
table1.my_id NOT IN ( SELECT my_id FROM table2 )
Use a clause that contains a
SELECT DISTINCTstatement instead:
table1.my_id NOT IN ( SELECT DISTINCT my_id FROM table2 )
Use alias names for similarly named columns
Best Practice: Use column and table aliases when you work with similarly named columns across queries, including subqueries.
Aliases help to identify which columns and tables are referenced in addition to your initial reference of the column. Using aliases can help you understand and address problems in your SQL query, including finding the columns that are used in subqueries.
Specify constraints in table schema
When table data contains constraints, specify the constraints in the table schema. The query engine can optimize query plans using table constraints.
Specify primary key and foreign key constraints
You should specify key constraints in the table schema when table data satisfies the data integrity requirements of primary key or foreign key constraints. The query engine can use the key constraints to optimize query plans. You can find detailed information in the blog post Join optimizations with BigQuery primary keys and foreign keys.
BigQuery doesn't automatically check for data integrity, so you must ensure that your data meets the constraints specified in the table schema. If you don't maintain data integrity in tables with specified constraints, your query results might be inaccurate.