Query optimizer

What is a query optimizer?

The Cloud Spanner SQL query optimizer converts a declarative SQL statement, that describes what data the query wants, into an imperative execution plan, that describes one way to precisely obtain that data. The process of transforming a declarative statement into a query execution plan involves performing transformations to tree structures used to represent the query. The optimizer, in the process of producing an execution plan, preserves the logical meaning of the original SQL query so that the correct rows are returned.

Another important role of the optimizer is to produce an execution plan that is efficient.

How does the Cloud Spanner optimizer produce efficient execution plans?

Cloud Spanner's optimizer uses a combination of well-established heuristics and cost-based optimization to produce efficient plans. Some heuristics are very straightforward, such as "perform filters on remote machines rather than pulling data to the local machine". Other heuristics are more complex, but still embody the principle of moving logic that reduces data volume closer to the data. That principle is particularly important in a system that shards data across multiple machines.

Not all execution decisions can be made effectively using such fixed rules, so Cloud Spanner's optimizer also makes decisions based on an estimated cost of alternatives. Such cost estimates are calculated using the structure of the query, the schema of the database, and estimates of the data volume that will be produced by fragments of the query. For example, Cloud Spanner will estimate how many rows of the Songs table qualify the filter SongGenre = "Country" if that filter appears in a query. To help the computation of such estimates, Cloud Spanner periodically gathers statistics on the user data.

To learn more about query execution plans and how they are used by Cloud Spanner to perform queries in a distributed environment, see Query execution plans.

Query optimizer versioning

Over time, the query optimizer will evolve, broadening the set of choices in the query execution plan and improving the accuracy of the estimates that inform those choices, leading to more efficient query execution plans.

Cloud Spanner rolls out optimizer updates as new query optimizer versions. By default, each database starts using the latest version of the optimizer no sooner than 30 days after that version has been released.

To put more control in your hands, we've introduced the ability to manage the query optimizer version that your queries use. Before committing to the latest version, you can compare query performance profiles between older versions and the latest version. To learn more, see Managing the query optimizer.

Query optimizer version history

The following is a summary of the updates made to the query optimizer in each release.

Version 3: August 1st, 2021 (latest)

  • Adds a new join algorithm, merge join, enabled using a new JOIN METHOD query hint value.

    Statement hint: SQL @{join_method=merge_join} SELECT ...

    Join hint: SQL SELECT ... FROM (...) JOIN@{join_method=merge_join} (...)

  • Adds a new join algorithm, push broadcast hash join, enabled using a new JOIN METHOD query hint value.

    Join hint: SQL SELECT ... FROM (...) JOIN@{join_method=push_broadcast_hash_join} (...)

  • Introduces the distributed merge union, which is enabled by default when applicable.

  • A small improvement to the performance of a scan under a GROUP BY when there is no MAX or MIN aggregate (or HAVING MAX/MAX) in the SELECT list. Prior to this change, Spanner loaded the extra non-grouped column even if it was not required by the query.

    Example:

    Consider the following table.

    CREATE TABLE myTable(
      a INT64,
      b INT64,
      c INT64,
      d INT64)
    PRIMARY KEY (a, b, c);
    

    Prior to this change, the following query would have loaded column c even though it is not required by the query.

    SELECT a, b
    FROM myTable
    GROUP BY a, b
    
  • Improves the performance of some queries with LIMIT when there is a CrossApply operator introduced by joins and the query asks for sorted results with LIMIT. After this change, the optimizer applies the sorting with the limit on the input side of CrossApply first.

    Example:

    SELECT a2.*
    FROM Albums@{FORCE_INDEX=_BASE_TABLE} a1
    JOIN Albums@{FORCE_INDEX=_BASE_TABLE} a2 USING(SingerId)
    ORDER BY a1.AlbumId
    LIMIT 2;
    
  • Improves query performance by pushing more computations through JOIN.

    Pushes more computations which may include a subquery or struct construction through join. That improves the query performance in a few ways such as: More computations can be done in a distributed fashion and more operations that depend on the pushed computations can be pushed down as well. For example, the query has a limit and the sort order depends on those computations, then the limit can be pushed through join as well.

    Example:

    SELECT
      t.ConcertDate,
      (
        SELECT COUNT(*) FROM UNNEST(t.TicketPrices) p WHERE p > 10
      ) AS expensive_tickets,
      u.VenueName
    FROM Concerts t
    JOIN Venues u ON t.VenueId = u.VenueId
    ORDER BY expensive_tickets
    LIMIT 2;
    

Version 2: March 1st, 2020 (default)

  • Adds optimizations in index selection.
  • Improves the performance of REGEXP_CONTAINS and LIKE predicates under certain circumstances.
  • Improves the performance of a scan under a GROUP BY in certain situations.

Version 1: June 18th, 2019

  • Includes many rule based optimizations such as predicate pushdown, limit pushdown, redundant join and redundant expression removal, to name a few.

  • Uses statistics on user data to select which index to use to access each table.

Query optimizer statistics packages

Cloud Spanner maintains statistics on the data distribution of table columns to help estimate how many rows will be produced by a query. The query optimizer uses these estimates to help choose the best query execution plan. These statistics are periodically updated by Cloud Spanner. Because the statistics are used to choose query execution plans, when the statistics are updated, it is possible for Cloud Spanner to change the query plan that it uses for a query.

By default, databases automatically use the latest generated statistics package. You can pin your database to an earlier statistics package version. You also have the option of running individual queries with a statistics package other than the latest.

Garbage collection of statistics packages

Cloud Spanner updates statistics packages every 3 days. Older packages are kept for a period of 30 days from their creation, after which they are subject to garbage collection.

The Cloud Spanner built-in INFORMATION_SCHEMA.SPANNER_STATISTICS table contains a list of available statistics packages. Each row in this table lists a statistics package by name, and the name contains the creation timestamp of the given package. Each entry also contains a field called ALLOW_GC which defines whether a package can be garbage collected or not.

You can pin your entire database to any one of the packages listed in that table. The pinned statistics package won't be garbage collected and the value of ALLOW_GC is ignored as long as the database is pinned to this package. To use a particular statistics package for an individual query, the package must be listed with ALLOW_GC=FALSE or pinned. This prevents queries from failing after the statistics package has been garbage collected. The value of ALLOW_GC can be changed using the ALTER STATISTICS DDL statement.

Package retention and PII

A statistics package contains histograms of the column data, as per standard industry practice. This helps the query optimizer select the optimal query plans. The histogram is constructed using a small sample of values. This small dataset can potentially contain PII.

Cloud Spanner creates a new statistics package on a regular basis and retains it for 30 days by default. Thus a small sample of values deleted from the database may be retained for additional 30 days in statistics histograms. Statistics packages pinned with optimizer_statistics_package database option or packages with ALLOW_GC=FALSE option will not be garbage collected. Histograms in these packages may contain values deleted from the database for a longer period. In addition, the content of statistics packages are included in database backups.

The optimizer statistics is stored encrypted in the same way as user data.

The total amount of storage required for these packages is usually less than 100 MB, and does count towards your total storage costs.

What's next