Avoiding SQL anti-patterns
The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.
Self-joins
Best practice: Avoid self-joins. Use a window (analytic) function instead.
Typically, self-joins are used to compute row-dependent relationships. However, a self-join can potentially square the number of output rows, and this increase in output data can cause poor performance.
Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query.
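For example, here is a minimal sketch, assuming a hypothetical `dataset.sales` table with `customer_id`, `order_date`, and `amount` columns, of how a window function can replace a self-join that compares each row with earlier rows:

```sql
-- Hypothetical anti-pattern: a self-join that pairs each order with every
-- earlier order for the same customer, multiplying the number of output rows.
SELECT
  curr.customer_id,
  curr.order_date,
  prev.amount AS earlier_amount
FROM dataset.sales AS curr
JOIN dataset.sales AS prev
  ON prev.customer_id = curr.customer_id
  AND prev.order_date < curr.order_date;

-- Window-function alternative: LAG returns the previous order's amount
-- directly, producing exactly one output row per input row.
SELECT
  customer_id,
  order_date,
  LAG(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS previous_amount
FROM dataset.sales;
```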
Data skew
Best practice: If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash, the slot that processes the oversized partition.
Partitions become large when your partition key has a value that occurs more often than any other value. For example, grouping by a `user_id` field where there are many entries for `guest` or `NULL`.
When a slot's resources are overwhelmed, a `resources exceeded` error results. Reaching the shuffle limit for a slot (2 TB in memory compressed) also causes the shuffle to write to disk and further impacts performance. Customers with flat-rate pricing can increase the number of allocated slots.
If you examine the query explain plan and see a significant difference between avg and max compute times, your data is probably skewed.
To avoid performance issues that result from data skew:
- Use an approximate aggregate function such as `APPROX_TOP_COUNT` to determine if the data is skewed, as shown in the sketch after this list.
- Filter your data as early as possible.
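As a rough sketch, assuming a hypothetical `dataset.events` table with a `user_id` column, you could check for dominant keys and then filter them out before the heavy part of the query:

```sql
-- Inspect whether a few user_id values dominate the table. The second
-- argument is how many of the most frequent values to return.
SELECT APPROX_TOP_COUNT(user_id, 10) AS top_user_ids
FROM dataset.events;

-- If 'guest' and NULL dominate, filter them out as early as possible.
SELECT user_id, COUNT(*) AS event_count
FROM dataset.events
WHERE user_id IS NOT NULL
  AND user_id != 'guest'
GROUP BY user_id;
```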
Unbalanced joins
Data skew can also appear when you use `JOIN` clauses. Because BigQuery shuffles data on each side of the join, all data with the same join key goes to the same shard. This shuffling can overload the slot.
To avoid performance issues that are associated with unbalanced joins:
- Pre-filter rows from the table with the unbalanced key (a sketch follows the examples below).
- If possible, split the query into two queries.
- Use the `SELECT DISTINCT` statement when specifying a subquery in the `WHERE` clause, in order to evaluate unique field values only once.

For example, instead of using the following clause that contains a `SELECT` statement:

```sql
table1.my_id NOT IN (
  SELECT my_id
  FROM table2
)
```

Use a clause that contains a `SELECT DISTINCT` statement instead:

```sql
table1.my_id NOT IN (
  SELECT DISTINCT my_id
  FROM table2
)
```
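To illustrate the first item in the list above, here is a minimal pre-filtering sketch, assuming hypothetical `dataset.events` and `dataset.profiles` tables where `guest` and `NULL` values dominate the `user_id` join key:

```sql
-- Drop the heavily skewed key values before the join so that no single
-- shard receives a disproportionate share of the data.
SELECT e.user_id, e.event_time, p.plan_name
FROM dataset.events AS e
JOIN dataset.profiles AS p
  ON e.user_id = p.user_id
WHERE e.user_id IS NOT NULL
  AND e.user_id != 'guest';
```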
Cross joins (Cartesian product)
Best practice: Avoid joins that generate more outputs than inputs. When a `CROSS JOIN` is required, pre-aggregate your data.
Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
If the query job completes, the query plan explanation shows output rows versus input rows. You can confirm a Cartesian product by modifying the query to print the number of rows on each side of the `JOIN` clause, grouped by the join key.
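For example, a sketch of such a check, assuming hypothetical `dataset.table_a` and `dataset.table_b` tables joined on a `join_key` column:

```sql
-- Count rows per join key on each side of the join. A key that is
-- non-unique on both sides multiplies output rows when the tables are joined.
SELECT 'left' AS side, join_key, COUNT(*) AS row_count
FROM dataset.table_a
GROUP BY join_key
UNION ALL
SELECT 'right' AS side, join_key, COUNT(*) AS row_count
FROM dataset.table_b
GROUP BY join_key
ORDER BY row_count DESC;
```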
To avoid performance issues associated with joins that generate more outputs than inputs:
- Use a `GROUP BY` clause to pre-aggregate the data, as shown in the sketch after this list.
- Use a window function. Window functions are often more efficient than using a cross join. For more information, see window functions.
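As a sketch of the first item, assuming hypothetical `dataset.orders` and `dataset.visits` tables, pre-aggregating each side makes the join key unique so that the join no longer produces more output rows than input rows:

```sql
-- Aggregate each side to one row per date before joining, so the join
-- combines small summary row sets instead of multiplying raw event rows.
WITH daily_orders AS (
  SELECT order_date, COUNT(*) AS order_count
  FROM dataset.orders
  GROUP BY order_date
),
daily_visits AS (
  SELECT visit_date, COUNT(*) AS visit_count
  FROM dataset.visits
  GROUP BY visit_date
)
SELECT o.order_date, o.order_count, v.visit_count
FROM daily_orders AS o
JOIN daily_visits AS v
  ON o.order_date = v.visit_date;
```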
DML statements that update or insert single rows
Best practice: Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Cloud SQL.
BigQuery DML statements are intended for bulk updates. `UPDATE` and `DELETE` DML statements in BigQuery are oriented towards periodic rewrites of your data, not single-row mutations. The `INSERT` DML statement is intended to be used sparingly. Inserts consume the same modification quotas as load jobs. If your use case involves frequent single-row inserts, consider streaming your data instead.
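For instance, instead of issuing one `INSERT` per row, a batched statement groups many rows into a single DML job. This is a sketch, assuming a hypothetical `dataset.events` table:

```sql
-- A single INSERT statement can carry many rows, so it counts as one
-- DML statement rather than one statement per row.
INSERT INTO dataset.events (user_id, event_time, event_name)
VALUES
  ('a1', TIMESTAMP '2024-01-01 00:00:01 UTC', 'signup'),
  ('b2', TIMESTAMP '2024-01-01 00:00:05 UTC', 'login'),
  ('c3', TIMESTAMP '2024-01-01 00:00:09 UTC', 'purchase');
```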
If batching your `UPDATE` statements yields many tuples in very long queries, you might approach the query length limit of 256 KB. To work around the query length limit, consider whether your updates can be handled based on logical criteria instead of a series of direct tuple replacements.
For example, you could load your set of replacement records into another table, then write a DML statement that updates all values in the original table where the non-updated columns match. If the original data is in table `t` and the updates are staged in table `u`, the query would look like the following:
```sql
UPDATE dataset.t t
SET my_column = u.my_column
FROM dataset.u u
WHERE t.my_key = u.my_key
```
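Depending on your use case, the same staged-update pattern can also be expressed as a single `MERGE` statement. This sketch, using the same hypothetical `dataset.t` and `dataset.u` tables, additionally inserts staged rows that have no match in the target yet:

```sql
-- Apply all staged changes in one batched statement: update matching keys,
-- and insert keys that do not yet exist in the target table.
MERGE dataset.t AS t
USING dataset.u AS u
ON t.my_key = u.my_key
WHEN MATCHED THEN
  UPDATE SET my_column = u.my_column
WHEN NOT MATCHED THEN
  INSERT (my_key, my_column) VALUES (u.my_key, u.my_column);
```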