The following best practices provide guidance on avoiding query anti-patterns that impact performance in BigQuery.
Best practice: Avoid self-joins. Use a window function instead.
Typcially, self-joins are used to compute row-dependent relationships. The result of using a self-join is that it potentially doubles the number of output rows. This increase in output data can cause poor performance.
Instead of using a self-join, use a window (analytic) function to reduce the number of additional bytes that are generated by the query.
Best practice: If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can't share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
Partitions become large when your partition key has a value
that occurs more often than any other value. For example, grouping by a
user_id field where there are many entries for
When a slot's resources are overwhelmed, a
error results. Reaching the shuffle limit for a slot (2TB in memory
compressed) also causes the shuffle to write to disk and further impacts
performance. Customers with flat-rate pricing
can increase the number of allocated slots.
If you examine the query explain plan and see a significant difference between avg and max compute times, your data is probably skewed.
To avoid performance issues that result from data skew:
- Use an approximate aggregate function such as
APPROX_TOP_COUNTto determine if the data is skewed.
- Filter your data as early as possible.
Data skew can also appear when you use
JOIN clauses. Because BigQuery
shuffles data on each side of the join, all data with the same join key goes to
the same shard. This shuffling can overload the slot.
To avoid performance issues that are associated with unbalanced joins:
- Pre-filter rows from the table with the unbalanced key.
- If possible, split the query into two queries.
Cross joins (Cartesian product)
Best practice: Avoid joins that generate more outputs than inputs. When a
CROSS JOIN is required, pre-aggregate your data.
Cross joins are queries where each row from the first table is joined to every row in the second table (there are non-unique keys on both sides). The worst case output is the number of rows in the left table multiplied by the number of rows in the right table. In extreme cases, the query might not finish.
If the query job completes, the query plan explanation will show output rows
versus input rows. You can confirm a Cartesian product
by modifying the query to print the number of rows on each side of the
clause, grouped by the join key.
To avoid performance issues associated with joins that generate more outputs than inputs:
- Use a
GROUP BYclause to pre-aggregate the data.
- Use a window function. Window functions are often more efficient that using a cross join. For more information, see analytic functions.
DML statements that update or insert single rows
Best practice: Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.
Using point-specific DML statements is an attempt to treat BigQuery like an Online Transaction Processing (OLTP) system. BigQuery focuses on Online Analytical Processing (OLAP) by using table scans and not point lookups. If you need OLTP-like behavior (single-row updates or inserts), consider a database designed to support OLTP use cases such as Google Cloud SQL.
BigQuery DML statements are intended for bulk updates.
DELETE DML statements in BigQuery are oriented towards
periodic rewrites of your data, not single row mutations. The
statement is intended to be used sparingly. Inserts consume the same
as load jobs. If your use case involves frequent single row inserts, consider
streaming your data instead.
If batching your
UPDATE statements yields many tuples in very long queries,
you might approach the query length limit of 256K. To work around the query
length limit, consider whether your updates can be handled based on a logical
criteria instead of a series of direct tuple replacements.
For example, you could load your set of replacement records into another table,
then write the DML statement to update all values in the original table if the
non-updated columns match. For example, if the original data is in table
the updates are staged in table
u, the query would look like the following:
UPDATE dataset.t t SET my_column = u.my_column FROM dataset.u u WHERE t.my_key = u.my_key