Managing input data and data sources
When evaluating your input data, consider the required I/O. How many bytes does your query read? Are you properly limiting the amount of input data? Is your data in native BigQuery storage or an external data source? The amount of data read by a query and the source of the data impact query performance and cost.
To examine how a query reads the input data, you can use the query plan explanation.
The following best practices provide guidance to control your input data and choose a data source.
Control projection - Avoid
Best practice: Control projection — Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
SELECT * is the most expensive way to query data. When you use
BigQuery does a full scan of every column in the table.
If you are experimenting with data or exploring data, use one of the
data preview options instead of
LIMIT clause to a
SELECT * query does not affect the amount
of data read. You are billed for reading all bytes in the entire table, and the query counts against
your free tier quota.
Instead, query only the columns you need. For example, use
SELECT * EXCEPT to
exclude one or more columns from the results.
If you do require queries against every column in a table, but only against a subset of data, consider:
- Materializing results in a destination table and querying that table instead
- Partitioning your tables by date
and querying the relevant partition; for example,
WHERE _PARTITIONDATE="2017-01-01"only scans the January 1, 2017 partition
Querying a subset of data or using
SELECT * EXCEPT can greatly reduce the
amount of data that is read by a query. In addition to the cost savings,
performance is improved by reducing the amount of data I/O and the amount of
materialization that is required for the query results.
The following examples illustrate this best practice.
Instead of selecting all columns:
SELECT * FROM mydataset.newtable
SELECT * EXCEPT statement to omit unnecessary columns:
SELECT * EXCEPT (col1, col2, col5) FROM mydataset.newtable
Prune partitioned queries
Best practice: When querying a partitioned table,
_PARTITIONTIME pseudo column to filter the partitions.
When you query partitioned tables, use the
_PARTITIONTIME pseudo column.
Filtering the data using
_PARTITIONTIME lets you specify a date or range
of dates. For example, the following
WHERE clause uses the
pseudo column to specify partitions between January 1, 2016 and January 31,
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2016-01-01") AND TIMESTAMP("2016-01-31")
The query processes data only in the partitions that are indicated by the date range, reducing the amount of input data. Filtering your partitions improves query performance and reduces costs.
Best practice: Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It's best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships.
The storage savings from using normalized data has less of an effect in modern systems. Increases in storage costs are worth the performance gains of using denormalized data. Joins require data coordination (communication bandwidth). Denormalization localizes the data to individual slots, so that execution can be done in parallel.
To maintain relationships while denormalizing your data, you can use nested and repeated fields instead of completely flattening your data. When relational data is completely flattened, network communication (shuffling) can negatively impact query performance.
For example, denormalizing an orders schema without using nested and repeated
fields might require you to group the data by a field like
(when there is a one-to-many relationship). Because of the shuffling involved,
grouping the data is less effective than denormalizing the data by using
nested and repeated fields.
In some circumstances, denormalizing your data and using nested and repeated fields doesn't result in increased performance. Avoid denormalization in these use cases:
- You have a star schema with frequently changing dimensions.
- BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can't replace it.
Using nested and repeated fields
BigQuery doesn't require a completely flat denormalization. You can use nested and repeated fields to maintain relationships.
Nesting data (
- Nesting data lets you represent foreign entities inline.
- Querying nested data uses "dot" syntax to reference leaf fields, which is similar to the syntax using a join.
- Nested data is represented as a
STRUCTtype in standard SQL.
Repeated data (
- Creating a field of type
RECORDwith the mode set to
REPEATEDlets you preserve a one-to-many relationship inline (so long as the relationship isn't high cardinality).
- With repeated data, shuffling is not necessary.
- Repeated data is represented as an
ARRAY. You can use an
ARRAYfunction in standard SQL when you query the repeated data.
- Creating a field of type
Nested and repeated data (
- Nesting and repetition complement each other.
- For example, in a table of transaction records, you could include an array
of line item
For a detailed example of how to denormalize data, see Denormalization.
Use external data sources appropriately
Best practice: If query performance is a top priority, do not use an external data source.
Querying tables in BigQuery managed storage is typically much faster than querying external tables in Cloud Storage, Drive, or Cloud Bigtable.
Use an external data source for these use cases:
- Performing extract, transform, and load (ETL) operations when loading data
- Frequently changing data
- Periodic loads such as recurring ingestion of data from Cloud Bigtable
Avoid excessive wildcard tables
Best practice: When querying wildcard tables, use the most granular prefix possible.
Use wildcards to query multiple tables by using concise SQL statements. Wildcard tables are a union of tables that match the wildcard expression. Wildcard tables are useful if your dataset contains:
- Multiple, similarly named tables with compatible schemas
- Sharded tables
When you query a wildcard table, specify a wildcard (
*) after the common
table prefix. For example,
queries all tables from the 1940s.
More granular prefixes perform better than shorter prefixes. For example,
FROM performs better than
FROM because fewer tables match