Optimizing Communication Between Slots

When evaluating your communication throughput, consider the amount of shuffling that is required by your query. How many bytes are passed between stages? How many bytes are passed to each slot? For example, a GROUP BY clause passes like values to the same slot for processing. The amount of data that is shuffled directly impacts communication throughput and as a result, query performance.

The following best practices provide guidance on controlling communication between slots.

Reduce data before using a JOIN

Best practice: Reduce the amount of data that is processed before a JOIN clause.

Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.

Do not treat WITH clauses as prepared statements

Best practice: Use WITH clauses primarily for readability.

WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.

Avoid tables sharded by date

Best practice: Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables.

Partitioned Tables perform better than date-named tables. When you create tables sharded by date, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.

Avoid oversharding tables

Best practice: Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.

Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use time-partitioned tables instead.

Because of the low cost of BigQuery storage, you do not need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.

Sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.

Send feedback about...