Optimize query computation
The following best practices provide guidance on controlling query computation.
Avoid repeatedly transforming data through SQL queries
Best practice: If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data.
For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.
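As a rough sketch (the dataset, table, and column names here are hypothetical), you might materialize the trimmed and extracted values once so that later queries can read them without re-running the transformations:

-- Hypothetical source table with string columns that need cleanup.
CREATE OR REPLACE TABLE mydataset.normalized_events AS
SELECT
  id,
  TRIM(raw_name) AS name,
  REGEXP_EXTRACT(payload, r'order_id=(\d+)') AS order_id
FROM
  `mydataset.raw_events`;

-- Later queries read the materialized columns directly,
-- without paying for TRIM() and REGEXP_EXTRACT() again.
SELECT
  order_id,
  COUNT(*) AS order_count
FROM
  `mydataset.normalized_events`
GROUP BY
  order_id;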
Use approximate aggregation functions
Best practice: If your use case supports it, use an approximate aggregation function.
If the SQL aggregation function you're using has an equivalent approximation
function, the approximation function yields faster query performance. For
example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT(). For
more information, see approximate aggregation functions
in the standard SQL reference.
You can also use
HyperLogLog++ functions to do approximations (including custom
approximate aggregations). For more information, see HyperLogLog++ functions
in the standard SQL reference.
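For example, with a hypothetical `dataset.table` that has a user_id column, the exact and approximate forms look like this:

-- Exact distinct count (more expensive):
SELECT COUNT(DISTINCT user_id) FROM `dataset.table`;

-- Approximate distinct count (faster, with a small statistical error):
SELECT APPROX_COUNT_DISTINCT(user_id) FROM `dataset.table`;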
Use an aggregate analytic function to obtain the latest record
Best practice: To obtain the latest record, use the ARRAY_AGG() aggregate analytic function instead of numbering functions such as ROW_NUMBER().
Using ARRAY_AGG() allows a query to run more efficiently because the ORDER BY clause is allowed to drop everything except the top record on each GROUP BY. For example:
SELECT
  event.*
FROM (
  SELECT
    id,
    ARRAY_AGG(t ORDER BY t.created_at DESC LIMIT 1)[OFFSET(0)] event
  FROM
    `dataset.table` t
  GROUP BY
    id)
For more information, see Aggregate analytic functions in the standard SQL reference.
Order query operations to maximize performance
Best practice: Use
ORDER BY only in the outermost query or within window
clauses (analytic functions). Push complex operations to the end of the query.
If you need to sort data, filter first to reduce the number of values that you
need to sort. If you sort your data first, you sort much more data than is
necessary. It is preferable to sort a subset of the data rather than to sort all the
data and then apply a LIMIT clause.
When you use an ORDER BY clause, it should appear only in the outermost query. Placing an
ORDER BY clause in the middle of a query greatly impacts
performance unless it is being used in a
window (analytic) function.
Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. Again, this technique allows the data to be pruned as much as possible before the complex operations are performed.
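As an illustrative sketch (the table and columns are hypothetical), the following query filters and aggregates in the inner query, then sorts and applies the regular expression only on the reduced result in the outermost query:

SELECT
  name,
  REGEXP_REPLACE(name, r'\s+', ' ') AS cleaned_name,  -- complex operation applied last, on few rows
  total
FROM (
  SELECT
    name,
    SUM(amount) AS total
  FROM
    `dataset.orders`
  WHERE
    created_at >= TIMESTAMP '2024-01-01'  -- filter first to prune the data
  GROUP BY
    name)
ORDER BY total DESC  -- ORDER BY only in the outermost query
LIMIT 100;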
Optimize your join patterns
Best practice: For queries that join data from multiple tables, optimize your join patterns. Start with the largest table.
When you create a query by using a
JOIN, consider the order in which you are
merging the data. The standard SQL query optimizer can determine which table
should be on which side of the join, but it is still recommended to order your
joined tables appropriately. As a best practice, place the table with the
largest number of rows first, followed by the table with the fewest rows, and
then place the remaining tables by decreasing size.
When you have a large table as the left side of the
JOIN and a small one on
the right side of the
JOIN, a broadcast join is created. A broadcast join
sends all the data in the smaller table to each slot that processes the larger
table. It is advisable to perform the broadcast join first.
To view the size of the tables in your query, see Getting information about tables.
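For example (both tables are hypothetical), the largest table is listed first and the small lookup table second, which lets BigQuery broadcast the small table:

SELECT
  t.transaction_id,
  c.country_name
FROM
  `dataset.transactions` AS t   -- largest table first
JOIN
  `dataset.countries` AS c      -- small lookup table second
  ON t.country_code = c.country_code;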
Use INT64 data types in joins to reduce cost and improve comparison performance
Best practice: If your use case supports it, use INT64 data types in joins instead of STRING data types.
BigQuery does not index primary keys like traditional databases, so the
wider the join column is, the longer the comparison takes. Therefore, using
INT64 data types in joins is cheaper and more efficient than using STRING data types.
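As a sketch (using a hypothetical schema in which customer_id is an INT64 surrogate key), join on the integer key rather than on a wide STRING key such as a UUID or an email address:

SELECT
  o.order_id,
  c.name
FROM
  `dataset.orders` AS o
JOIN
  `dataset.customers` AS c
  ON o.customer_id = c.customer_id;  -- INT64 keys compare faster than STRING keys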
Prune partitioned queries
Best practice: When querying a partitioned table, use the following columns to filter partitions:
- For ingestion-time partitioned tables, use the pseudo-column _PARTITIONTIME.
- For time-unit column-based and integer-range partitioned tables, use the partitioning column.
For time-unit partitioned tables, filtering the data with the _PARTITIONTIME pseudo-column or the partitioning column lets you specify a date or range of dates. For example, the following
WHERE clause uses the
_PARTITIONTIME pseudo-column to specify
partitions between January 1, 2016 and January 31, 2016:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2016-01-01") AND TIMESTAMP("2016-01-31")
The query processes data only in the partitions that are indicated by the date range. Filtering your partitions improves query performance and reduces costs.
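Similarly, for a table partitioned on a time-unit column (here a hypothetical DATE column named order_date), filter directly on the partitioning column so that only the matching partitions are scanned:

SELECT *
FROM `dataset.orders`
WHERE order_date BETWEEN DATE "2016-01-01" AND DATE "2016-01-31";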
Avoid multiple evaluations of the same Common Table Expressions (CTEs)
When your query contains Common Table Expressions that are used in multiple places in the query, they are evaluated each time they are referenced. This may increase internal query complexity and resource consumption.
You can store the result of a CTE in a scalar variable or a temporary table depending on the data that the CTE returns. You are not charged for storage of temporary tables.
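For example (the dataset and columns are hypothetical), a single value can go into a scalar variable and a row set into a temporary table; both can then be referenced repeatedly without re-evaluating the original expression:

-- A CTE that returns a single value can become a scalar variable.
DECLARE cutoff TIMESTAMP;
SET cutoff = (
  SELECT TIMESTAMP_SUB(MAX(created_at), INTERVAL 7 DAY)
  FROM `dataset.events`);

-- A CTE that returns rows can be materialized as a temporary table.
CREATE TEMP TABLE recent_events AS
SELECT *
FROM `dataset.events`
WHERE created_at >= cutoff;

-- Reference the materialized result as often as needed.
SELECT COUNT(*) FROM recent_events;
SELECT MIN(created_at) FROM recent_events;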
Split complex queries into multiple smaller ones
Complex queries, REGEX functions, and layered subqueries or joins can be slow and resource intensive to run. Trying to fit all computations in one huge SELECT statement, for example to make it a view, is sometimes an antipattern, and it can result in a slow, resource-intensive query. In extreme cases the internal query plan will become so complex that BigQuery will be unable to execute it.
Splitting up a complex query lets you materialize intermediate results in variables or temporary tables. You can then use these intermediate results in other parts of the query, which is especially useful when a result is needed in more than one place. You are not charged for storage of temporary tables.
Splitting a query also often lets you express the true intent of each part more clearly, with temporary tables serving as the data materialization points.
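As a sketch (the table and column names are hypothetical), a deeply nested query can be rewritten as a multi-statement query in which each step materializes an intermediate result:

-- Step 1: prune and aggregate the raw data once.
CREATE TEMP TABLE recent_sessions AS
SELECT
  user_id,
  COUNT(*) AS session_count
FROM
  `dataset.sessions`
WHERE
  started_at >= TIMESTAMP '2024-01-01'
GROUP BY
  user_id;

-- Step 2: work with the much smaller intermediate result.
SELECT
  u.name,
  r.session_count
FROM
  recent_sessions AS r
JOIN
  `dataset.users` AS u
  ON u.user_id = r.user_id
ORDER BY
  r.session_count DESC
LIMIT 10;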