Best practices for filtering and ordering
This document describes how to optimize queries when filtering
and ordering using the
Choosing columns to use in the
Best practice: When possible, use
columns in the
DATE columns are typically faster
than operations on
BYTE columns. When possible, use a column that
uses one of these data types in the
WHERE clause to reduce the amount of
data returned by the query.
Order query operations to maximize performance
Best practice: Use
ORDER BY only in the outermost query or within
Push complex operations to the end of the query.
If you need to sort data, filter first to reduce the number of values that you
need to sort. If you sort your data first, you sort much more data than is
necessary. It is preferable to sort on a subset of data than to sort all the
data and apply a
When you use an
ORDER BY clause, it should appear only in the outermost query.
ORDER BY clause in the middle of a query greatly impacts
performance unless it is being used in a
Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. Again, this technique allows the data to be pruned as much as possible before the complex operations are performed.
LIMIT with an
ORDER BY clause
Best practice: If you are ordering a very large number of values but don't
need to have all of them returned, use a
LIMIT clause with an
ORDER BY clause when returning a limited number
of records works for your scenario. When it doesn't, consider using a
window function instead.
For example, the following query orders a very large result set and throws a
Resources exceeded error. The query sorts by the
title column in
title column contains millions of values.
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title;
To remove the error, use a query like the following:
SELECT title FROM `my-project.mydataset.mytable` ORDER BY title DESC LIMIT 1000;
LIMIT and a window function with an
ORDER BY clause
Best practice: If you are ordering a very large number of values, use a window function, and limit data before calling the window function.
For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:
SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM bigquery-public-data.stackoverflow.users ORDER BY user_rank ASC LIMIT 10;
This query takes approximately 15 seconds to run. This query uses
the end of the query, but not in the
DENSE_RANK() OVER window function.
Because of this, the query requires all of the data to be sorted on a
single worker node.
Instead, you should limit the dataset before computing the window function in order to improve performance:
WITH users AS ( SELECT id, reputation, creation_date, FROM bigquery-public-data.stackoverflow.users ORDER BY creation_date ASC LIMIT 10) SELECT id, reputation, creation_date, DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank FROM users ORDER BY user_rank;
This query takes approximately 2 seconds to run, while returning the same results as the previous query.
One caveat is that the
DENSE_RANK() function ranks the data within years, so
if you want to rank data spanning multiple years, these queries would not give