Optimizing communication between slots
When evaluating your communication throughput, consider the amount of shuffling
that is required by your query. How many bytes are passed between stages? How
many bytes are passed to each slot? For example, a GROUP BY
clause passes like
values to the same slot for processing. The amount of data that is shuffled
directly impacts communication throughput and as a result, query performance.
The following best practices provide guidance on controlling communication between slots.
Reduce data before using a JOIN
Best practice: Reduce the amount of data that is processed before a JOIN
clause.
Trim the data as early in the query as possible, before the query performs a
JOIN
. If you reduce data early in the processing cycle, shuffling and other
complex operations only execute on the data that you need.
Do not treat WITH
clauses as prepared statements
Best practice: Use WITH
clauses primarily for readability.
WITH
clauses are used primarily for readability because they are
not materialized. For example, placing all your queries in WITH
clauses and
then running UNION ALL
is a misuse of the WITH
clause. If a query appears in
more than one WITH
clause, it executes in each clause.
Avoid tables sharded by date
Best practice: Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables.
Partitioned tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Avoid oversharding tables
Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.
Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead.
Because of the low cost of BigQuery storage, you do not need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.
Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.