Best practices for filtering and ordering

This document describes how to optimize queries when filtering and ordering using the WHERE and ORDER BY clauses.

Choosing columns to use in the WHERE clause

Best practice: When possible, use BOOL, INT, FLOAT, or DATE columns in the WHERE clause.

Operations on BOOL, INT, FLOAT, and DATE columns are typically faster than operations on STRING or BYTE columns. When possible, use a column that uses one of these data types in the WHERE clause to reduce the amount of data returned by the query.

Order query operations to maximize performance

Best practice: Use ORDER BY only in the outermost query or within window clauses. Push complex operations to the end of the query.

If you need to sort data, filter first to reduce the number of values that you need to sort. If you sort your data first, you sort much more data than is necessary. It is preferable to sort on a subset of data than to sort all the data and apply a LIMIT clause.

When you use an ORDER BY clause, it should appear only in the outermost query. Placing an ORDER BY clause in the middle of a query greatly impacts performance unless it is being used in a window function.

Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions, to the end of the query. Again, this technique allows the data to be pruned as much as possible before the complex operations are performed.

Use LIMIT with an ORDER BY clause

Best practice: If you are ordering a very large number of values but don't need to have all of them returned, use a LIMIT clause.

Use a LIMIT clause with an ORDER BY clause when returning a limited number of records works for your scenario. When it doesn't, consider using a window function instead.

For example, the following query orders a very large result set and throws a Resources exceeded error. The query sorts by the title column in mytable. The title column contains millions of values.

SELECT
  title
FROM
  `my-project.mydataset.mytable`
ORDER BY
  title;

To remove the error, use a query like the following:

SELECT
  title
FROM
  `my-project.mydataset.mytable`
ORDER BY
  title DESC
LIMIT
  1000;

Use LIMIT and a window function with an ORDER BY clause

Best practice: If you are ordering a very large number of values, use a window function, and limit data before calling the window function.

For example, the following query lists the ten oldest Stack Overflow users and their ranking, with the oldest account being ranked lowest:

SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM bigquery-public-data.stackoverflow.users
ORDER BY user_rank ASC
LIMIT 10;

This query takes approximately 15 seconds to run. This query uses LIMIT at the end of the query, but not in the DENSE_RANK() OVER window function. Because of this, the query requires all of the data to be sorted on a single worker node.

Instead, you should limit the dataset before computing the window function in order to improve performance:

WITH users AS (
SELECT
  id,
  reputation,
  creation_date,
FROM bigquery-public-data.stackoverflow.users
ORDER BY creation_date ASC
LIMIT 10)
SELECT
  id,
  reputation,
  creation_date,
  DENSE_RANK() OVER (ORDER BY creation_date) AS user_rank
FROM users
ORDER BY user_rank;

This query takes approximately 2 seconds to run, while returning the same results as the previous query.

One caveat is that the DENSE_RANK() function ranks the data within years, so if you want to rank data spanning multiple years, these queries would not give identical results.