Best practices for filtering and ordering
This document describes how to optimize queries when filtering
and ordering using the
WHERE
and
ORDER BY
clauses.
Choosing columns to use in the WHERE
clause
Best practice: When possible, use BOOL
, INT
, FLOAT
, or DATE
columns in the WHERE
clause.
Operations on BOOL
, INT
, FLOAT
, and DATE
columns are typically faster
than operations on STRING
or BYTE
columns. When possible, use a column that
uses one of these data types in the WHERE
clause to reduce the amount of
data returned by the query.
Order query operations to maximize performance
Best practice: Use ORDER BY
only in the outermost query or within
window clauses.
Push complex operations to the end of the query.
If you need to sort data, filter first to reduce the number of values that you
need to sort. If you sort your data first, you sort much more data than is
necessary. It is preferable to sort on a subset of data than to sort all the
data and apply a LIMIT
clause.
When you use an ORDER BY
clause, it should appear only in the outermost query.
Placing an ORDER BY
clause in the middle of a query greatly impacts
performance unless it is being used in a
window function.
Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. Again, this technique allows the data to be pruned as much as possible before the complex operations are performed.
Use LIMIT
with an ORDER BY
clause
Best practice: If you are ordering a very large number of values but don't
need to have all of them returned, use a LIMIT
clause.
Use a LIMIT
clause with an ORDER BY
clause when returning a limited number
of records works for your scenario. When it doesn't, consider using a
window function instead.
For example, the following query orders a very large result set and throws a
Resources exceeded
error. The query sorts by the title
column in mytable
.
The title
column contains millions of values.
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title;
To remove the error, use a query like the following:
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC LIMIT 1000;
Use LIMIT
and a window function with an ORDER BY
clause
Best practice: If you are ordering a very large number of values, use a window function, and limit data before calling the window function.
For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:
SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM bigquery-public-data.stackoverflow.users ORDER BY user_rank ASC LIMIT 10;
This query takes approximately 15 seconds to run. This query uses LIMIT
at
the end of the query, but not in the DENSE_RANK() OVER
window function.
Because of this, the query requires all of the data to be sorted on a
single worker node.
Instead, you should limit the dataset before computing the window function in order to improve performance:
WITH users AS ( SELECT id, reputation, creation_date, FROM bigquery-public-data.stackoverflow.users ORDER BY creation_date ASC LIMIT 10) SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM users ORDER BY user_rank;
This query takes approximately 2 seconds to run, while returning the same results as the previous query.
One caveat is that the DENSE_RANK()
function ranks the data within years, so
if you want to rank data spanning multiple years, these queries would not give
identical results.