What is a query optimizer?
The Cloud Spanner SQL query optimizer converts a declarative SQL statement, that describes what data the query wants, into an imperative execution plan, that describes one way to precisely obtain that data. The process of transforming a declarative statement into a query execution plan involves performing transformations to tree structures used to represent the query. The optimizer, in the process of producing an execution plan, preserves the logical meaning of the original SQL query so that the correct rows are returned.
Another important role of the optimizer is to produce an execution plan that is efficient.
How does the Spanner optimizer produce efficient execution plans?
Spanner's optimizer uses a combination of well-established heuristics and cost-based optimization to produce efficient plans. Some heuristics are very straightforward, such as "perform filters on remote machines rather than pulling data to the local machine". Other heuristics are more complex, but still embody the principle of moving logic that reduces data volume closer to the data. That principle is particularly important in a system that shards data across multiple machines.
Not all execution decisions can be made effectively using such fixed rules, so
Spanner's optimizer also makes decisions based on an estimated
cost of alternatives. Such cost estimates are calculated using the structure of
the query, the schema of the database, and estimates of the data volume that
will be produced by fragments of the query. For example, Spanner
will estimate how many rows of the Songs table qualify the filter
SongGenre = "Country" if that filter appears in a query. To help the
computation of such estimates, Spanner periodically gathers
statistics to characterize the distribution of the data in a database.
To learn more about query execution plans and how they are used by Cloud Spanner to perform queries in a distributed environment, see Query execution plans.
Query optimizer versioning
Over time, the Cloud Spanner query optimizer will evolve, broadening the set of choices in the query execution plan and improving the accuracy of the estimates that inform those choices, leading to more efficient query execution plans.
Spanner releases optimizer updates as new query optimizer versions to improve the efficiency of its query execution plans. To learn more about the different versions, see Cloud Spanner query optimizer versions.
Query optimizer statistics packages
Spanner maintains statistics on the data distribution of table columns to help estimate how many rows will be produced by a query. The query optimizer uses these estimates to help choose the best query execution plan. These statistics are periodically updated by Spanner. Because the statistics are used to choose query execution plans, when the statistics are updated, it is possible for Spanner to change the query plan that it uses for a query.
By default, databases automatically use the latest generated statistics package. You can pin your database to an earlier statistics package version. You also have the option of running individual queries with a statistics package other than the latest.
Construct a new statistics package
Spanner automatically generates a new statistics packages every three
days. To construct a new statistics package manually, use the
Constructing a new statistics package can benefit query performance after significant changes to your database's data or schema, including the following:
- The database processes a large amount of inserts, updates, or deletes.
- You add a new index to the database.
- You add a new column to a table.
After Spanner finishes executing the statement, it takes up to ten minutes for the query optimizer to account for a new statistics package in its query planning.
Garbage collection of statistics packages
Statistics packages in Spanner are kept for a period of 30 days since their creation, after which they are subject to garbage collection.
The Spanner built-in
table contains a list of available statistics packages. Each row in this table
lists a statistics package by name, and the name contains the creation timestamp
of the given package. Each entry also contains a field called
defines whether a package can be garbage collected or not.
You can pin your entire database to any one of the packages listed in that
table. The pinned statistics package won't be garbage collected and the value of
ALLOW_GC is ignored as long as the database is pinned to this package. To use
a particular statistics package for an individual query, the package must be
ALLOW_GC=FALSE or pinned. This prevents queries from failing after
the statistics package has been garbage collected. The value of
be changed using the
ALTER STATISTICS DDL statement.
Package retention and Personally Identifiable Information (PII)
A statistics package contains histograms of the column data, as per standard industry practice. This helps the query optimizer select the optimal query plans. The histogram is constructed using a small sample of values. This small dataset can potentially contain PII.
Spanner creates a new statistics package on a regular basis and
retains it for 30 days by default. Thus a small sample of values deleted from
the database may be retained for additional 30 days in statistics histograms.
Statistics packages pinned with
optimizer_statistics_package database option
or packages with
ALLOW_GC=FALSE option will not be garbage collected.
Histograms in these packages may contain values deleted from the database for a
longer period. In addition, the content of statistics packages are included in
The optimizer statistics is stored encrypted in the same way as user data.
The total amount of storage required for these packages is usually less than 100 MB, and does count towards your total storage costs.
- To learn more about the history of the query optimizer, see Query optimizer version history.
- To manage both the optimizer version and statistics package for your scenario, see Manage the query optimizer.