Avoiding SQL Anti-Patterns

The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.

Self-joins

Best practice: Avoid self-joins. Use a window function instead.

Typcially, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially doubles the number of output rows. This increase in output data can cause poor performance.

Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query.

Data skew

Best practice: If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.

Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.

Partitions become large when your partition key has a value that occurs more often than any other value. For example, grouping by a user_id field where there are many entries for guest or null.

When a slot's resources are overwhelmed, a resources exceeded error results. Reaching the shuffle limit for a slot (2TB in memory compressed) also causes the shuffle to write to disk and further impacts performance. Customers with flat-rate pricing can increase the number of allocated slots.

If you examine the query explain plan and see a significant difference between avg and max compute times, your data is probably skewed.

To avoid peformance issues that result from data skew:

  • Use an approximate aggregate function such as APPROX_TOP_COUNT to determine if the data is skewed.
  • Filter your data as early as possible.

Unbalanced joins

Data skew can also appear when you use JOIN clauses. Because BigQuery shuffles data on each side of the join, all data with the same join key goes to the same shard. This shuffling can overload the slot.

To avoid performance issues that are associated with unbalanced joins:

  • Pre-filter rows from the table with the unbalanced key.
  • If possible, split the query into two queries.

Cross joins (Cartesian product)

Best practice: Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.

Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.

If the query job completes, the query plan explanation will show output rows versus input rows. You can confirm a Cartesian product by modifying the query to print the number of rows on each side of the JOIN clause, grouped by the join key.

To avoid performance issues associated with joins that generate more outputs than inputs:

  • Use a GROUP BY clause to pre-aggregate the data.
  • Use a window function. Window functions are often more efficient that using a cross join. For more information, see analytic functions.

DML statements that update or insert single rows

Best practice: Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.

Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Google Cloud SQL.

BigQuery DML statements are intended for bulk updates. UPDATE and DELETE DML statements in BigQuery are oriented towards periodic rewrites of your data, not single row mutations. The INSERT DML statement is intended to be used sparingly. Inserts consume the same modification quotas as load jobs. If your use case involves frequent single row inserts, consider streaming your data instead.

If batching your UPDATE statements yields many tuples in very long queries, you might approach the query length limit of 256K. To work around the query length limit, consider whether your updates can be handled based on a logical criteria instead of a series of direct tuple replacements.

For example, you could load your set of replacement records into another table, then write the DML statement to update all values in the original table if the non-updated columns match. For example, if the original data is in table t and the updates are staged in table u, the query would look like the following:

UPDATE dataset.t t
SET my_column = u.my_column
FROM dataset.u u
WHERE t.my_key = u.my_key

Monitor your resources on the go

Get the Google Cloud Console app to help you manage your projects.

Send feedback about...