Best practices for functions

This document describes how to optimize queries that use SQL functions.

Optimize string comparison

Best practice: When possible, use LIKE instead of REGEXP_CONTAINS.

In BigQuery, you can use the REGEXP_CONTAINS function or the LIKE operator to compare strings. REGEXP_CONTAINS provides more functionality, but also has a slower execution time. Using LIKE instead of REGEXP_CONTAINS is faster, particularly if you don't need the full power of regular expressions that REGEXP_CONTAINS provides, for example wildcard matching.

Consider the following use of the REGEXP_CONTAINS function:

SELECT
  dim1
FROM
  `dataset.table1`
WHERE
  REGEXP_CONTAINS(dim1, '.*test.*');

You can optimize this query as follows:

SELECT
  dim1
FROM
  `dataset.table`
WHERE
  dim1 LIKE '%test%';

Optimize aggregation functions

Best practice: If your use case supports it, use an approximate aggregation function.

If the SQL aggregation function you're using has an equivalent approximation function, the approximation function yields faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT. For more information, see approximate aggregation functions.

You can also use HyperLogLog++ functions to do approximations (including custom approximate aggregations). For more information, see HyperLogLog++ functions in the GoogleSQL reference.

Consider the following use of the COUNT function:

SELECT
  dim1,
  COUNT(DISTINCT dim2)
FROM
  `dataset.table`
GROUP BY 1;

You can optimize this query as follows:

SELECT
  dim1,
  APPROX_COUNT_DISTINCT(dim2)
FROM
  `dataset.table`
GROUP BY 1;

Optimize quantile functions

Best practice: When possible, use APPROX_QUANTILE instead of NTILE.

Running a query that contains the NTILE function can fail with a Resources exceeded error if there are too many elements to ORDER BY in a single partition, which causes data volume to grow. The analytic window isn't partitioned, so the NTILE computation requires a global ORDER BY for all rows in the table to be processed by a single worker/slot.

Try using APPROX_QUANTILES instead. This function allows the query to run more efficiently because it doesn't require a global ORDER BY for all rows in the table.

Consider the following use of the NTILE function:

SELECT
  individual_id,
  NTILE(nbuckets) OVER (ORDER BY sales desc) AS sales_third
FROM
  `dataset.table`;

You can optimize this query as follows:

WITH QuantInfo AS (
  SELECT
    o, qval
  FROM UNNEST((
     SELECT APPROX_QUANTILES(sales, nbuckets)
     FROM `dataset.table`
    )) AS qval
  WITH offset o
  WHERE o > 0
)
SELECT
  individual_id,
  (SELECT
     (nbuckets + 1) - MIN(o)
   FROM QuantInfo
   WHERE sales <= QuantInfo.qval
  ) AS sales_third
FROM `dataset.table`;

The optimized version gives similar but not identical results to the original query, because APPROX_QUANTILES:

Provides an approximate aggregation.
Places the remainder values (the remainder of the number of rows divided by buckets) in a different way.

Optimize UDFs

Best practice: Use SQL UDFs for simple calculations because the query optimizer can apply optimizations to SQL UDF definitions. Use Javascript UDFs for complex calculations that are not supported by SQL UDF.

Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.

Persistent UDFs

It is better to create persistent user-defined SQL and JavaScript functions in a centralized BigQuery dataset that can be invoked across queries and in logical views, as opposed to creating and calling a UDF in code each time. Creating org-wide libraries of business logic within shared datasets helps optimize performance and use fewer resources.

The following example shows how a temporary UDF is invoked in a query:

CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);

WITH numbers AS
  (SELECT 1 as val
  UNION ALL
  SELECT 3 as val
  UNION ALL
  SELECT 4 as val
  UNION ALL
  SELECT 5 as val)
SELECT val, addFourAndDivide(val, 2) AS result
FROM numbers;

You can optimize this query by replacing the temporary UDF with a persistent one:

WITH numbers AS
  (SELECT 1 as val
  UNION ALL
  SELECT 3 as val
  UNION ALL
  SELECT 4 as val
  UNION ALL
  SELECT 5 as val)
SELECT val, `your_project.your_dataset.addFourAndDivide`(val, 2) AS result
FROM numbers;