Best practices for functions
This document describes how to optimize queries that use SQL functions.
Optimize string comparison
Best practice: When possible, use LIKE
instead of REGEXP_CONTAINS
.
In BigQuery, you can use the
REGEXP_CONTAINS
function or the LIKE
operator to compare strings. REGEXP_CONTAINS
provides more functionality,
but also has a slower execution time. Using LIKE
instead of REGEXP_CONTAINS
is faster, particularly if you don't need the full power of regular expressions
that REGEXP_CONTAINS
provides, for example wildcard matching.
Consider the following use of the REGEXP_CONTAINS
function:
SELECT dim1 FROM `dataset.table1` WHERE REGEXP_CONTAINS(dim1, '.*test.*');
You can optimize this query as follows:
SELECT dim1 FROM `dataset.table` WHERE dim1 LIKE '%test%';
Optimize aggregation functions
Best practice: If your use case supports it, use an approximate aggregation function.
If the SQL aggregation function you're using has an equivalent approximation
function, the approximation function yields faster query performance. For
example, instead of using
COUNT(DISTINCT)
,
use
APPROX_COUNT_DISTINCT
.
For more information, see
approximate aggregation functions.
You can also use HyperLogLog++
functions to do approximations (including custom
approximate aggregations). For more information, see
HyperLogLog++ functions
in the GoogleSQL reference.
Consider the following use of the COUNT
function:
SELECT dim1, COUNT(DISTINCT dim2) FROM `dataset.table` GROUP BY 1;
You can optimize this query as follows:
SELECT dim1, APPROX_COUNT_DISTINCT(dim2) FROM `dataset.table` GROUP BY 1;
Optimize quantile functions
Best practice: When possible, use APPROX_QUANTILE
instead of NTILE
.
Running a query that contains the
NTILE
function can fail with a
Resources exceeded
error if there are too many
elements to ORDER BY
in a single partition, which causes data volume to grow.
The analytic window isn't partitioned, so the NTILE
computation requires
a global ORDER BY
for all rows in the table to be processed
by a single worker/slot.
Try using
APPROX_QUANTILES
instead. This function allows the query to run more efficiently because it
doesn't require a global ORDER BY
for all rows in the table.
Consider the following use of the NTILE
function:
SELECT individual_id, NTILE(nbuckets) OVER (ORDER BY sales desc) AS sales_third FROM `dataset.table`;
You can optimize this query as follows:
WITH QuantInfo AS ( SELECT o, qval FROM UNNEST(( SELECT APPROX_QUANTILES(sales, nbuckets) FROM `dataset.table` )) AS qval WITH offset o WHERE o > 0 ) SELECT individual_id, (SELECT (nbuckets + 1) - MIN(o) FROM QuantInfo WHERE sales <= QuantInfo.qval ) AS sales_third FROM `dataset.table`;
The optimized version gives similar but not identical results to the original
query, because APPROX_QUANTILES
:
- Provides an approximate aggregation.
- Places the remainder values (the remainder of the number of rows divided by buckets) in a different way.
Optimize UDFs
Best practice: Use SQL UDFs for simple calculations because the query optimizer can apply optimizations to SQL UDF definitions. Use Javascript UDFs for complex calculations that are not supported by SQL UDF.
Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.
Persistent UDFs
It is better to create persistent user-defined SQL and JavaScript functions in a centralized BigQuery dataset that can be invoked across queries and in logical views, as opposed to creating and calling a UDF in code each time. Creating org-wide libraries of business logic within shared datasets helps optimize performance and use fewer resources.
The following example shows how a temporary UDF is invoked in a query:
CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y); WITH numbers AS (SELECT 1 as val UNION ALL SELECT 3 as val UNION ALL SELECT 4 as val UNION ALL SELECT 5 as val) SELECT val, addFourAndDivide(val, 2) AS result FROM numbers;
You can optimize this query by replacing the temporary UDF with a persistent one:
WITH numbers AS (SELECT 1 as val UNION ALL SELECT 3 as val UNION ALL SELECT 4 as val UNION ALL SELECT 5 as val) SELECT val, `your_project.your_dataset.addFourAndDivide`(val, 2) AS result FROM numbers;