Reduce data processed in queries

The following best practices provide guidance on reducing the amount of data that your queries process, which helps improve query performance and reduce cost.

Avoid SELECT *

Best practice: Control projection — Query only the columns that you need.

Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).

Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.

If you are experimenting with or exploring data, use one of the data preview options instead of SELECT *.
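For example, the bq command-line tool's head command previews rows without running a query at all (the table name and row count here are illustrative):

bq head --max_rows=10 mydataset.newtable

Because no query runs, previewing rows this way does not incur query charges.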

Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota.
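You can confirm this before running a query: the bq command-line tool's --dry_run flag reports the estimated bytes a query would process without executing it (the table name is illustrative):

bq query --use_legacy_sql=false --dry_run 'SELECT * FROM mydataset.newtable LIMIT 10'

The dry run reports the same byte estimate with or without the LIMIT clause, because LIMIT restricts only the rows returned, not the bytes read.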

Instead, query only the columns you need. For example, use SELECT * EXCEPT to exclude one or more columns from the results.

If you do require queries against every column in a table, but only against a subset of the data, querying just that subset or using SELECT * EXCEPT can greatly reduce the amount of data that the query reads. In addition to the cost savings, performance improves because the query requires less data I/O and less materialization of query results.

The following examples illustrate this best practice.

Instead of selecting all columns:

SELECT *
FROM mydataset.newtable

Use the SELECT * EXCEPT statement to omit unnecessary columns:

SELECT * EXCEPT (col1, col2, col5)
FROM mydataset.newtable

Prune partitioned queries

Best practice: When querying a partitioned table, use the _PARTITIONDATE or _PARTITIONTIME pseudocolumns to filter the partitions. Filtering your partitions improves query performance and reduces costs.

Filtering the data using _PARTITIONTIME lets you specify a date or range of dates. For example, the following WHERE clause uses the _PARTITIONTIME pseudocolumn to specify partitions between January 1, 2016 and January 31, 2016:

WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
    AND TIMESTAMP("2016-01-31")

The query processes data only in the partitions that are indicated by the date range, reducing the amount of input data.
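Put together, a complete query against a hypothetical ingestion-time partitioned table (the table and column names are illustrative) might look like this:

SELECT field1
FROM mydataset.partitioned_table
WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
    AND TIMESTAMP("2016-01-31")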

If the partition granularity is daily, the table also contains a pseudocolumn named _PARTITIONDATE that you can use in the same way as _PARTITIONTIME. The _PARTITIONDATE value is equal to _PARTITIONTIME truncated to a DATE value.
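For a daily-partitioned table, the same filter can be written against _PARTITIONDATE using DATE literals instead of TIMESTAMP values:

WHERE _PARTITIONDATE
BETWEEN "2016-01-01"
    AND "2016-01-31"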

Avoid excessive wildcard tables

Best practice: When querying wildcard tables, use the most granular prefix possible.

Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains:

  • Multiple, similarly named tables with compatible schemas
  • Sharded tables

When you query a wildcard table, specify a wildcard (*) after the common table prefix. For example, FROM bigquery-public-data.noaa_gsod.gsod194* queries all tables from the 1940s.

More granular prefixes perform better than shorter prefixes. For example, FROM bigquery-public-data.noaa_gsod.gsod194* performs better than FROM bigquery-public-data.noaa_gsod.* because fewer tables match the wildcard.
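You can also filter on the _TABLE_SUFFIX pseudocolumn to further restrict which tables match the wildcard. For example, the following sketch limits the 1940s wildcard to 1940 through 1944 (the selected column is illustrative):

SELECT
  max,
  _TABLE_SUFFIX AS year_digit
FROM
  `bigquery-public-data.noaa_gsod.gsod194*`
WHERE
  _TABLE_SUFFIX BETWEEN '0' AND '4'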

Avoid tables sharded by date

Best practice: Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables.

Partitioned tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table, and might need to verify permissions for each queried table. This overhead adds to query time and degrades query performance.

Avoid oversharding tables

Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.

Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name.

Because of the low cost of BigQuery storage, you do not need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.

Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.
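Instead of creating one table per day (for example, mytable_20160101, mytable_20160102, and so on), you can create a single time-partitioned table. The following DDL is a sketch with illustrative table and column names:

CREATE TABLE mydataset.mytable (
  transaction_id INT64,
  transaction_date DATE
)
PARTITION BY transaction_date;

Queries then filter on the partitioning column (here, transaction_date) rather than constructing table names or wildcard expressions.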

The amount and source of data read by a query can affect query performance and cost. Use the best practices in this topic to reduce the amount of data that your queries read and improve query performance.

To examine how a query reads the input data, you can use the query plan explanation.

Reduce data before using a JOIN

Best practice: Reduce the amount of data that is processed before a JOIN clause by performing aggregations early.

Using a GROUP BY clause with aggregate functions is computationally intensive, because these queries require a shuffle of the data. For this reason, use a GROUP BY clause only when necessary.

For queries where you do need to use GROUP BY and where there is also a JOIN involved, perform aggregation earlier in the query when possible. Early aggregation is applied to the table prior to the JOIN and can substantially reduce the amount of data read by the query, thereby improving query performance.

For example, the following query performs a JOIN on two large tables without any filtering beforehand:

WITH
  users_posts AS (
    SELECT *
    FROM
      `bigquery-public-data`.stackoverflow.comments AS c
    JOIN
      `bigquery-public-data`.stackoverflow.users AS u
      ON
        c.user_id = u.id
  )
SELECT
  user_id,
  ANY_VALUE(display_name) AS display_name,
  ANY_VALUE(reputation) AS reputation,
  COUNT(text) AS comments_count
FROM users_posts
GROUP BY user_id
ORDER BY comments_count DESC
LIMIT 20;

This query pre-aggregates the comment counts, which reduces the amount of data read for the JOIN:

SELECT user_id, display_name, reputation, comments_count
FROM
  (
    (
      SELECT user_id, COUNT(text) AS comments_count
      FROM `bigquery-public-data`.stackoverflow.comments
      WHERE user_id IS NOT NULL
      GROUP BY user_id
      ORDER BY comments_count DESC
      LIMIT 20
    )
    JOIN `bigquery-public-data`.stackoverflow.users AS u
      ON user_id = u.id
  )
ORDER BY comments_count DESC;

Keep in mind that WITH clauses containing common table expressions (CTEs) are used primarily for query readability, not performance. A WITH clause does not cause BigQuery to materialize temporary intermediate tables; a CTE's SELECT statement may be evaluated each time it is referenced. For example, placing all of your queries in WITH clauses and then running UNION ALL is equivalent to placing the queries inline.
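For example, a CTE referenced twice may be evaluated twice. In the following sketch (the table name is illustrative), the aggregation in the CTE behaves the same as if the subquery were written inline in both branches of the UNION ALL:

WITH expensive AS (
  SELECT user_id, COUNT(*) AS n
  FROM mydataset.events
  GROUP BY user_id
)
SELECT * FROM expensive WHERE n > 100
UNION ALL
SELECT * FROM expensive WHERE n <= 100;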