Managing input data and data sources

When evaluating your input data, consider the required I/O. How many bytes does your query read? Are you properly limiting the amount of input data? Is your data in native BigQuery storage or an external data source? The amount of data read by a query and the source of the data impact query performance and cost.

To examine how a query reads the input data, you can use the query plan explanation.

The following best practices provide guidance to control your input data and choose a data source.

Control projection - Avoid SELECT *

Best practice: Control projection — Query only the columns that you need.

Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).

Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.

If you are experimenting with data or exploring data, use one of the data preview options instead of SELECT *.

Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota.

Instead, query only the columns you need. For example, use SELECT * EXCEPT to exclude one or more columns from the results.

If you do require queries against every column in a table, but only against a subset of data, consider:

  • Materializing results in a destination table and querying that table instead
  • Partitioning your tables by date and querying the relevant partition; for example, WHERE _PARTITIONDATE="2017-01-01" only scans the January 1, 2017 partition

Querying a subset of data or using SELECT * EXCEPT can greatly reduce the amount of data that is read by a query. In addition to the cost savings, performance is improved by reducing the amount of data I/O and the amount of materialization that is required for the query results.


The following examples illustrate this best practice.

Instead of selecting all columns:

FROM mydataset.newtable

Use the SELECT * EXCEPT statement to omit unnecessary columns:

SELECT * EXCEPT (col1, col2, col5)
FROM mydataset.newtable

Prune partitioned queries

Best practice: When querying a partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.

When you query partitioned tables, use the _PARTITIONTIME pseudo column. Filtering the data using _PARTITIONTIME lets you specify a date or range of dates. For example, the following WHERE clause uses the _PARTITIONTIME pseudo column to specify partitions between January 1, 2016 and January 31, 2016:

    AND TIMESTAMP("2016-01-31")

The query processes data only in the partitions that are indicated by the date range, reducing the amount of input data. Filtering your partitions improves query performance and reduces costs.

Denormalizing data

Best practice: Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships.

The storage savings from using normalized data has less of an effect in modern systems. Increases in storage costs are worth the performance gains of using denormalized data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots, so that execution can be done in parallel.

To maintain relationships while denormalizing your data, you can use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.

For example, denormalizing an orders schema without using nested and repeated fields might require you to group the data by a field like order_id (when there is a one-to-many relationship). Because of the shuffling involved, grouping the data is less effective than denormalizing the data by using nested and repeated fields.

In some circumstances, denormalizing your data and using nested and repeated fields doesn't result in increased performance. Avoid denormalization in these use cases:

  • You have a star schema with frequently changing dimensions.
  • BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can't replace it.

Using nested and repeated fields

BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.

  • Nesting data (STRUCT)

    • Nesting data lets you represent foreign entities inline.
    • Querying nested data uses "dot" syntax to reference leaf fields, which is similar to the syntax using a join.
    • Nested data is represented as a STRUCT type in standard SQL.
  • Repeated data (ARRAY)

    • Creating a field of type RECORD with the mode set to REPEATED lets you preserve a one-to-many relationship inline (so long as the relationship isn't high cardinality).
    • With repeated data, shuffling is not necessary.
    • Repeated data is represented as an ARRAY. You can use an ARRAY function in standard SQL when you query the repeated data.
  • Nested and repeated data (ARRAY of STRUCTs)

    • Nesting and repetition complement each other.
    • For example, in a table of transaction records, you could include an array of line item STRUCTs.

For a detailed example of how to denormalize data, see Denormalization.

Use external data sources appropriately

Best practice: If query performance is a top priority, do not use an external data source.

Querying tables in BigQuery managed storage is typically much faster than querying external tables in Cloud Storage, Drive, or Cloud Bigtable.

Use an external data source for these use cases:

  • Performing extract, transform, and load (ETL) operations when loading data
  • Frequently changing data
  • Periodic loads such as recurring ingestion of data from Cloud Bigtable

Avoid excessive wildcard tables

Best practice: When querying wildcard tables, use the most granular prefix possible.

Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains:

  • Multiple, similarly named tables with compatible schemas
  • Sharded tables

When you query a wildcard table, specify a wildcard (*) after the common table prefix. For example, FROM bigquery-public-data.noaa_gsod.gsod194* queries all tables from the 1940s.

More granular prefixes perform better than shorter prefixes. For example, FROM bigquery-public-data.noaa_gsod.gsod194* performs better than FROM bigquery-public-data.noaa_gsod.* because fewer tables match the wildcard.