Optimizing Communication Between Slots

When evaluating your communication throughput, consider the amount of shuffling that is required by your query. How many bytes are passed between stages? How many bytes are passed to each slot? For example, a GROUP BY clause passes like values to the same slot for processing. The amount of data that is shuffled directly impacts communication throughput and as a result, query performance.

The following best practices provide guidance on controlling communication between slots.

Reduce data before using a JOIN

Best practice: Reduce the amount of data that is processed before a JOIN clause.

Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.

For an example of using filter pushdown to trim data before a JOIN, see pushing down filters to improve performance.

Do not treat WITH clauses as prepared statements

Best practice: Use WITH clauses primarily for readability.

WITH clauses are used primarily for readability because they are not materialized. For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.

Avoid date-named tables

Best practice: Do not use date-named tables in place of date-partitioned tables.

Partitioned Tables perform better than date-named tables. When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.

Avoid oversharding tables

Best practice: Avoid creating too many table shards. If you are sharding tables by date, use date-partitioned tables instead.

Table sharding refers to dividing large datasets into separate tables and adding a suffix to each table name. If you are sharding tables by date, use date-partitioned tables instead.

Because of the low cost of BigQuery storage, you do not need to optimize your tables for cost as you would in a relational database system. Creating a large number of table shards has performance impacts that outweigh any cost benefits.

Like date-named tables, sharded tables require BigQuery to maintain schema, metadata, and permissions for each shard. Because of the added overhead required to maintain information on each shard, oversharding tables can impact query performance.

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...