Optimizing Query Computation

When evaluating the computation that is required by a query, consider the amount of work that is required. How much CPU time is required? Are you using functions like JavaScript user-defined functions that require additional CPU resources?

The following best practices provide guidance on controlling query computation.

Avoid repeatedly transforming data via SQL queries

Best practice: If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data.

For example, if you are using SQL to trim strings or extract data by using regular expressions, it is more performant to materialize the transformed results in a destination table. Functions like regular expressions require additional computation. Querying the destination table without the added transformation overhead is much more efficient.

Avoid JavaScript user-defined functions

Best practice: Avoid using JavaScript user-defined functions. Use native UDFs instead.

Calling a JavaScript UDF requires the instantiation of a Java subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.

Use approximate aggregation functions

Best practice: If your use case supports it, use an approximate aggregation function.

If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance. For example, instead of using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT(). For more information, see approximate aggregation functions in the standard SQL reference.

You can also use HyperLogLog++ functions to do approximations (including custom approximate aggregations). For more information, see HyperLogLog functions in the standard SQL reference.

Order query operations to maximize performance

Best practice: Use ORDER BY only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.

If you need to sort data, filter first to reduce the number of values that you need to sort. If you sort your data first, you sort much more data than is necessary. It is preferable to sort on a subset of data than to sort all the data and apply a LIMIT clause.

When you use an ORDER BY clause, it should appear only in the outermost query. Placing an ORDER BY clause in the middle of a query greatly impacts performance unless it is being used in a window (analytic) function.

Another technique for ordering your query is to push complex operations, such as regular expressions and mathematical functions to the end of the query. Again, this technique allows the data to be pruned as much as possible before the complex operations are performed.

Optimize your join patterns

Best practice: For queries that join data from multiple tables, optimize your join patterns. Start with the largest table.

When you create a query by using a JOIN, consider the order in which you are merging the data. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order your joined tables appropriately. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

When you have a large table as the left side of the JOIN and a small one on the right side of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller table to each slot that processes the larger table. It is advisable to perform the broadcast join first.

To view the size of the tables in your JOIN, see getting information about tables.

Prune partitioned queries

Best practice: When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.

When you query partitioned tables, use the _PARTITIONTIME pseudo column. Filtering the data using _PARTITIONTIME allows you to specify a date or range of dates. For example, the following WHERE clause uses the _PARTITIONTIME pseudo column to specify partitions between January 1, 2016 and January 31, 2016:

    AND TIMESTAMP(“20160131”)

The query processes data only in the partitions that are indicated by the date range. Filtering your partitions improves query performance and reduces costs.

Cette page vous a-t-elle été utile ? Évaluez-la :

Envoyer des commentaires concernant…