SQL best practices


As described in Query execution plans, the SQL compiler transforms a SQL statement into a query execution plan, which is used to obtain the results of the query. This page describes best practices for constructing SQL statements to help Spanner find efficient execution plans.

The example SQL statements shown in this page use the following sample schema:

Google Standard SQL

CREATE TABLE Singers (
 SingerId   INT64 NOT NULL,
 FirstName  STRING(1024),
 LastName   STRING(1024),
 SingerInfo BYTES(MAX),
 BirthDate  DATE
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
 SingerId     INT64 NOT NULL,
 AlbumId      INT64 NOT NULL,
 AlbumTitle   STRING(MAX),
 ReleaseDate  DATE
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

For the complete SQL reference, refer to Statement syntax, Functions and operators, and Lexical structure and syntax.

PostgreSQL

CREATE TABLE Singers (
 SingerId   BIGINT PRIMARY KEY,
 FirstName  VARCHAR(1024),
 LastName   VARCHAR(1024),
 SingerInfo BYTEA,
 BirthDate  TIMESTAMPTZ
);

CREATE TABLE Albums (
 SingerId        BIGINT NOT NULL,
 AlbumId         BIGINT NOT NULL,
 AlbumTitle      VARCHAR(1024),
 ReleaseDate     DATE,
 PRIMARY KEY(SingerId, AlbumId),
 FOREIGN KEY (SingerId) REFERENCES Singers(SingerId)
) INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

For more information, refer to The PostgreSQL language in Cloud Spanner.

Use query parameters to speed up frequently executed queries

Parameterized queries are a technique of query execution that separates a query string from query parameter values. For example, suppose your application needs to retrieve singers who have released albums with certain titles in a given year. You might write a SQL statement like the following example to retrieve all albums titled "Love" released since 2017:

SELECT a.SingerId
FROM Albums AS a
WHERE a.AlbumTitle = 'Love' AND a.ReleaseDate >= '2017-01-01'

In another query, you might change the value of the album title to "Peace":

SELECT a.SingerId
FROM Albums AS a
WHERE a.AlbumTitle = 'Peace' AND a.ReleaseDate >= '2017-01-01'

If your application needs to execute many queries that are similar to this one, in which only a literal value changes between queries, use a parameter placeholder for that value. The resulting parameterized query can be cached and reused, which reduces compilation costs.

For example, the following rewritten query replaces Love with a parameter named title:

Google Standard SQL

SELECT a.SingerId
FROM Albums AS a
WHERE a.AlbumTitle = @title AND a.ReleaseDate >= '2017-01-01'

A parameter reference in the query uses the @ character followed by the parameter name, which can contain any combination of letters, numbers, and underscores. Specify the query parameter and the value to bind to it in the params field of the ExecuteSQL or ExecuteStreamingSQL request. Learn more about query parameter syntax in SQL lexical structure and syntax.

PostgreSQL

SELECT a.SingerId
FROM Albums AS a
WHERE a.AlbumTitle = $1 AND a.ReleaseDate >= '2017-01-01'

A parameter reference in the query uses the $ character followed by a number that represents the positional parameter in the statement.

Parameters can appear anywhere that a literal value is expected. The same parameter name can be used more than once in a single SQL statement.

In summary, query parameters support query execution in the following ways:

  • Pre-optimized plans: Queries that use parameters can be executed faster on each invocation because the parameterization makes it easier for Spanner to cache the execution plan.
  • Simplified query composition: You do not need to escape string values when providing them in query parameters. Query parameters also reduce the risk of syntax errors.
  • Security: Query parameters make your queries more secure by protecting you from various SQL injection attacks. This protection is especially important for queries that you construct from user input.
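The plan-caching benefit can be pictured with a small sketch (pure Python, not Spanner internals): a cache keyed on the exact SQL text gets a hit on every invocation only when the statement text is parameterized.

```python
# Conceptual sketch (not Spanner internals): a plan cache keyed on the
# exact SQL text. Parameterized statements keep the text stable across
# invocations, so repeated executions reuse one cached plan.

class PlanCache:
    def __init__(self):
        self._plans = {}
        self.compilations = 0

    def get_plan(self, sql):
        if sql not in self._plans:
            self.compilations += 1          # simulate an expensive compile
            self._plans[sql] = f"plan for: {sql}"
        return self._plans[sql]

cache = PlanCache()

# Literal values baked into the text: every new title forces a recompile.
for title in ("Love", "Peace", "Hope"):
    cache.get_plan(f"SELECT a.SingerId FROM Albums AS a WHERE a.AlbumTitle = '{title}'")
assert cache.compilations == 3

# Parameterized text stays identical; only the bound value changes.
cache = PlanCache()
sql = "SELECT a.SingerId FROM Albums AS a WHERE a.AlbumTitle = @title"
for title in ("Love", "Peace", "Hope"):
    plan = cache.get_plan(sql)             # params bound separately at execution
assert cache.compilations == 1
```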

Understand how Spanner executes queries

Spanner enables you to query databases using declarative SQL statements that specify what data you want to retrieve. If you want to also understand how Spanner obtains the results, use query execution plans. A query execution plan displays the computational cost associated with each step of the query. Using those costs, you can debug query performance issues and optimize your query.

You can retrieve query execution plans through the Google Cloud console or the client libraries.

To get a query plan using the Google Cloud console, do the following:

  1. Open the Spanner instances page.

    Go to Spanner instances

  2. Select the names of the Spanner instance and the database that you want to query.

  3. Click Query in the left navigation panel.

  4. Type the query in the text field, and then click Run query.

  5. Click Explanation. The Google Cloud console displays a visual execution plan for your query.

For more information on visual plans, see Tuning a query using the query plan visualizer.

For the complete query plan reference, refer to Query execution plans.

Use secondary indexes to speed up common queries

Like other relational databases, Spanner offers secondary indexes, which you can use to retrieve data using either a SQL statement or Spanner's read interface. The more common way to fetch data from an index is to use the SQL query interface. Using a secondary index in a SQL query enables you to specify how you want Spanner to obtain the results. Specifying a secondary index can speed up query execution.

For example, suppose you wanted to fetch the IDs of all the singers with a specific last name. One way to write such a SQL query is:

SELECT s.SingerId
FROM Singers AS s
WHERE s.LastName = 'Smith';

This query would return the results that you expect, but it might take a long time to return the results. The timing would depend on the number of rows in the Singers table and how many satisfy the predicate WHERE s.LastName = 'Smith'. If there is no secondary index that contains the LastName column to read from, the query plan would read the entire Singers table to find rows that match the predicate. Reading the entire table is called a full table scan. A full table scan is an expensive way to obtain the results when the table contains only a small percentage of Singers with that last name.

You can improve the performance of this query by defining a secondary index on the last name column:

CREATE INDEX SingersByLastName on Singers (LastName);

Because the secondary index SingersByLastName contains the indexed table column LastName and the primary key column SingerId, Spanner can fetch all the data from the much smaller index table instead of scanning the full Singers table.
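As a rough mental model (an illustration only, not Spanner's actual storage), the secondary index behaves like a list of (LastName, SingerId) pairs kept sorted by LastName, so matching IDs are found with a binary search rather than a scan of every row:

```python
# Illustrative sketch of why the index read is cheaper: SingersByLastName
# can be pictured as (LastName, SingerId) pairs sorted by LastName, so a
# lookup is a binary search instead of a full scan of Singers.
import bisect

singers = [
    (1, "Marc", "Richards"), (2, "Catalina", "Smith"),
    (3, "Alice", "Trentor"), (4, "Lea", "Martin"), (5, "David", "Smith"),
]

# The index stores the indexed column plus the table's primary key.
index = sorted((last, sid) for sid, _, last in singers)

def ids_by_last_name(last):
    lo = bisect.bisect_left(index, (last,))
    hi = bisect.bisect_right(index, (last, float("inf")))
    return [sid for _, sid in index[lo:hi]]

assert ids_by_last_name("Smith") == [2, 5]
```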

In this scenario, Spanner automatically uses the secondary index SingersByLastName when executing the query (as long as three days have passed since database creation; see A note about new databases). However, it's best to explicitly tell Spanner to use that index by specifying an index directive in the FROM clause:

Google Standard SQL

SELECT s.SingerId
FROM Singers@{FORCE_INDEX=SingersByLastName} AS s
WHERE s.LastName = 'Smith';

PostgreSQL

SELECT s.SingerId
FROM Singers /*@ FORCE_INDEX=SingersByLastName */ AS s
WHERE s.LastName = 'Smith';

Now suppose you also wanted to fetch the singer's first name in addition to the ID. Even though the FirstName column is not contained in the index, you should still specify the index directive as before:

Google Standard SQL

SELECT s.SingerId, s.FirstName
FROM Singers@{FORCE_INDEX=SingersByLastName} AS s
WHERE s.LastName = 'Smith';

PostgreSQL

SELECT s.SingerId, s.FirstName
FROM Singers /*@ FORCE_INDEX=SingersByLastName */ AS s
WHERE s.LastName = 'Smith';

You still get a performance benefit from using the index because Spanner doesn't need to do a full table scan when executing the query plan. Instead, it selects the subset of rows that satisfy the predicate from the SingersByLastName index, and then does a lookup from the base table Singers to fetch the first name for only that subset of rows.

If you want Spanner to not have to fetch any rows from the base table at all, you can store a copy of the FirstName column in the index itself:

Google Standard SQL

CREATE INDEX SingersByLastName on Singers (LastName) STORING (FirstName);

PostgreSQL

CREATE INDEX SingersByLastName ON Singers (LastName) INCLUDE (FirstName);

Using a STORING clause (for the Google Standard SQL dialect) or an INCLUDE clause (for the PostgreSQL dialect) like this costs extra storage but it provides the following advantages for queries and read calls using the index:

  • SQL queries that use the index and select columns stored in the STORING or INCLUDE clause do not require an extra join to the base table.
  • Read calls that use the index can read columns stored in the STORING or INCLUDE clause.

The preceding examples illustrate how secondary indexes can speed up queries when the rows chosen by the WHERE clause of a query can be quickly identified using the secondary index.

Another scenario in which secondary indexes can offer performance benefits is for certain queries that return ordered results. For example, suppose you want to fetch all album titles and their release dates in ascending order of release date and descending order of album title. You could write a SQL query as follows:

SELECT a.AlbumTitle, a.ReleaseDate
FROM Albums AS a
ORDER BY a.ReleaseDate, a.AlbumTitle DESC;

Without a secondary index, this query requires a potentially expensive sorting step in the execution plan. You could speed up query execution by defining this secondary index:

CREATE INDEX AlbumsByReleaseDateTitleDesc on Albums (ReleaseDate, AlbumTitle DESC);

Then, rewrite the query to use the secondary index:

Google Standard SQL

SELECT a.AlbumTitle, a.ReleaseDate
FROM Albums@{FORCE_INDEX=AlbumsByReleaseDateTitleDesc} AS a
ORDER BY a.ReleaseDate, a.AlbumTitle DESC;

PostgreSQL

SELECT a.AlbumTitle, a.ReleaseDate
FROM Albums /*@ FORCE_INDEX=AlbumsByReleaseDateTitleDesc */ AS a
ORDER BY a.ReleaseDate, a.AlbumTitle DESC;

This query and index definition meet both of the following criteria:

  • The column list in the ORDER BY clause is a prefix of the index key list.
  • The index covers all columns in the table that the query uses.

Because both these conditions are satisfied, the resulting query plan removes the sorting step and executes faster.
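The two criteria can be expressed as a small predicate; the column representation and function name here are hypothetical, chosen only to mirror the example index:

```python
# Sketch of the two conditions that let Spanner drop the sort step.
# Columns are modeled as (name, direction) pairs; both the ORDER BY list
# and the index key list use the same representation.

def sort_is_redundant(order_by, index_keys, index_columns, query_columns):
    prefix_ok = order_by == index_keys[:len(order_by)]   # ORDER BY is a key prefix
    covered = set(query_columns) <= set(index_columns)   # index covers the query
    return prefix_ok and covered

index_keys = [("ReleaseDate", "ASC"), ("AlbumTitle", "DESC")]
assert sort_is_redundant(
    order_by=[("ReleaseDate", "ASC"), ("AlbumTitle", "DESC")],
    index_keys=index_keys,
    index_columns={"ReleaseDate", "AlbumTitle"},
    query_columns={"AlbumTitle", "ReleaseDate"},
)
# Ordering by a non-prefix column would still need an explicit sort.
assert not sort_is_redundant(
    order_by=[("AlbumTitle", "DESC")],
    index_keys=index_keys,
    index_columns={"ReleaseDate", "AlbumTitle"},
    query_columns={"AlbumTitle"},
)
```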

Although secondary indexes can speed up common queries, adding secondary indexes can add latency to your commit operations, because each secondary index typically requires involving an extra node in each commit. For most workloads, having a few secondary indexes is fine. However, you should consider whether you care more about read or write latency, and consider which operations are most critical for your workload. Benchmark your workload to ensure that it's performing as you expect.

For the complete reference on secondary indexes, refer to Secondary indexes.

Write efficient queries for range key lookup

A common use of a SQL query is to read multiple rows from Spanner based on a list of known keys.

The following best practices help you write efficient queries when fetching data by a range of keys:

  • If the list of keys is sparse and not adjacent, use query parameters and UNNEST to construct your query.

    For example, if your key list is {1, 5, 1000}, write the query like this:

    Google Standard SQL

    SELECT *
    FROM Table AS t
    WHERE t.Key IN UNNEST(@KeyList)
    

    PostgreSQL

    SELECT *
    FROM Table AS t
    WHERE t.Key IN UNNEST($1)
    

    Notes:

    • The array UNNEST operator flattens an input array into rows of elements.

    • The query parameter, which is @KeyList for Google Standard SQL and $1 for PostgreSQL, can speed up your query as discussed in the preceding best practice.

  • If the list of keys is adjacent and within a range, specify the lower and higher limits of the key range in the WHERE clause.

    For example, if your key list is {1,2,3,4,5}, construct the query as follows:

    Google Standard SQL

    SELECT *
    FROM Table AS t
    WHERE t.Key BETWEEN @min AND @max
    

    PostgreSQL

    SELECT *
    FROM Table AS t
    WHERE t.Key BETWEEN $1 AND $2
    

    This query is only more efficient if the keys in the key range are adjacent. In other words, if your key list is {1, 5, 1000}, don't specify the lower and higher limits like in the preceding query because the resulting query would scan through every value between 1 and 1000.
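The choice between the two forms can be automated client-side; this sketch uses a hypothetical helper that inspects the key list and picks the appropriate parameterized statement, with parameter names (@min, @max, @KeyList) matching the examples above:

```python
# Sketch: use BETWEEN only when the key list is a contiguous range;
# otherwise pass the list as an array parameter and UNNEST it.
# `range_lookup` is a hypothetical helper for illustration.

def range_lookup(keys):
    keys = sorted(keys)
    contiguous = keys == list(range(keys[0], keys[-1] + 1))
    if contiguous:
        sql = "SELECT * FROM Table AS t WHERE t.Key BETWEEN @min AND @max"
        params = {"min": keys[0], "max": keys[-1]}
    else:
        sql = "SELECT * FROM Table AS t WHERE t.Key IN UNNEST(@KeyList)"
        params = {"KeyList": keys}
    return sql, params

sql, params = range_lookup([1, 2, 3, 4, 5])
assert "BETWEEN" in sql and params == {"min": 1, "max": 5}

sql, params = range_lookup([1, 5, 1000])   # sparse: BETWEEN would scan 1..1000
assert "UNNEST" in sql and params == {"KeyList": [1, 5, 1000]}
```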

Write efficient queries for joins

Join operations can be expensive because they can significantly increase the number of rows that your query needs to scan, which results in slower queries. In addition to the techniques that you're accustomed to using in other relational databases to optimize join queries, here are some best practices for a more efficient JOIN when using Spanner SQL:

  • If possible, join data in interleaved tables by primary key. For example:

    SELECT s.FirstName, a.ReleaseDate
    FROM Singers AS s JOIN Albums AS a ON s.SingerId = a.SingerId;
    

    The rows in the interleaved table Albums are guaranteed to be physically stored in the same splits as the parent row in Singers, as discussed in Schema and Data Model. Therefore, joins can be completed locally without sending lots of data across the network.

  • Use the join directive if you want to force the order of the join. For example:

    Google Standard SQL

    SELECT *
    FROM Singers AS s JOIN@{FORCE_JOIN_ORDER=TRUE} Albums AS a
    ON s.SingerId = a.SingerId
    WHERE s.LastName LIKE '%x%' AND a.AlbumTitle LIKE '%love%';
    

    PostgreSQL

    SELECT *
    FROM Singers AS s JOIN/*@ FORCE_JOIN_ORDER=TRUE */ Albums AS a
    ON s.SingerId = a.SingerId
    WHERE s.LastName LIKE '%x%' AND a.AlbumTitle LIKE '%love%';
    

    The join directive FORCE_JOIN_ORDER tells Spanner to use the join order specified in the query (that is, Singers JOIN Albums, not Albums JOIN Singers). The returned results are the same, regardless of the order that Spanner chooses. However, you might want to use this join directive if you notice in the query plan that Spanner has changed the join order and caused undesirable results, such as larger intermediate results, or has missed opportunities for seeking rows.

  • Use a join directive to choose a join implementation. When you use SQL to query multiple tables, Spanner automatically uses a join method that is likely to make the query more efficient. However, Google advises you to test with different join algorithms. Choosing the right join algorithm can improve latency, memory consumption, or both. This query demonstrates the syntax for using a JOIN directive with the JOIN_METHOD hint to choose a HASH JOIN:

    Google Standard SQL

    SELECT *
    FROM Singers s JOIN@{JOIN_METHOD=HASH_JOIN} Albums AS a
    ON s.SingerId = a.SingerId
    

    PostgreSQL

    SELECT *
    FROM Singers s JOIN/*@ JOIN_METHOD=HASH_JOIN */ Albums AS a
    ON s.SingerId = a.SingerId
    
  • If you're using a HASH JOIN or APPLY JOIN and if you have a WHERE clause that is highly selective on one side of your JOIN, put the table that produces the smallest number of rows as the first table in the FROM clause of the join. This structure helps because currently in HASH JOIN, Spanner always picks the left-hand side table as build and the right-hand side table as probe. Similarly, for APPLY JOIN, Spanner picks the left-hand side table as outer and the right-hand side table as inner. See more about these join types: Hash join and Apply join.

  • For queries that are critical for your workload, specify the most performant join method and join order in your SQL statements for more consistent performance.
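The build/probe asymmetry described above is easy to see in a toy hash join (a simplified model, not Spanner's implementation): the left input is materialized into an in-memory hash table, so keeping it small keeps memory consumption low.

```python
# Toy hash join: the left (build) side is loaded into a hash table, then
# the right (probe) side streams past it. A small, selective build side
# means a small hash table.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                      # build phase: left-hand side
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase: right-hand side
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

singers = [{"SingerId": 1, "LastName": "Smith"}]          # smaller side first
albums = [{"SingerId": 1, "AlbumTitle": "Love"},
          {"SingerId": 2, "AlbumTitle": "Peace"}]
joined = hash_join(singers, albums, "SingerId", "SingerId")
assert joined == [{"SingerId": 1, "LastName": "Smith", "AlbumTitle": "Love"}]
```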

Avoid large reads inside read-write transactions

Read-write transactions allow a sequence of zero or more reads or SQL queries, and can include a set of mutations, before a call to commit. To maintain the consistency of your data, Spanner acquires locks when reading and writing rows in your tables and indexes. For more information about locking, see Life of Reads and Writes.

Because of the way locking works in Spanner, performing a read or SQL query that reads a large number of rows (for example SELECT * FROM Singers) means that no other transactions can write to the rows that you've read until your transaction is either committed or aborted.

Furthermore, because your transaction is processing a large number of rows, it's likely to take longer than a transaction that reads a much smaller range of rows (for example SELECT LastName FROM Singers WHERE SingerId = 7), which further exacerbates the problem and reduces system throughput.

So, try to avoid large reads (for example, full table scans or massive join operations) in your transactions, unless you're willing to accept lower write throughput.

In some cases, the following pattern can yield better results:

  1. Do your large reads inside a read-only transaction. Read-only transactions allow for higher aggregate throughput because they do not use locks.
  2. [Optional] Do any processing required on the data you just read.
  3. Start a read-write transaction.
  4. Verify that the critical rows have not changed values since you performed the read-only transaction in step 1.
    • If the rows have changed, roll back your transaction and start again at step 1.
    • If everything looks okay, commit your mutations.
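The steps above can be sketched with an in-memory store and per-row version numbers standing in for real Spanner transactions; the store layout and helper names are hypothetical, but the shape of the loop is the point: read outside the write transaction, then re-check the critical rows before committing.

```python
# Pure-Python sketch of the read-verify-write pattern, with an in-memory
# versioned store in place of Spanner. `process` stands in for whatever
# work step 2 performs on the data.

store = {"singer:7": {"version": 1, "LastName": "Smith"}}

def process(rows):
    # Step 2: any expensive work on the data read in step 1.
    return {k: dict(v, LastName=v["LastName"].upper()) for k, v in rows.items()}

def read_process_write(keys, max_attempts=5):
    for _ in range(max_attempts):
        snapshot = {k: dict(store[k]) for k in keys}   # step 1: read-only read
        updated = process(snapshot)                    # step 2: process
        # Steps 3-4: inside a (simulated) read-write transaction, verify that
        # the critical rows are unchanged, then commit the mutations.
        if all(store[k]["version"] == snapshot[k]["version"] for k in keys):
            for k, row in updated.items():
                store[k] = dict(row, version=store[k]["version"] + 1)
            return True
        # A version changed underneath us: roll back and retry from step 1.
    return False

assert read_process_write(["singer:7"])
assert store["singer:7"]["LastName"] == "SMITH"
assert store["singer:7"]["version"] == 2
```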

One way to ensure that you're avoiding large reads in read-write transactions is to look at the execution plans that your queries generate.

Use ORDER BY to ensure the ordering of your SQL results

If you're expecting a certain ordering for the results of a SELECT query, explicitly include the ORDER BY clause. For example, if you want to list all singers in primary key order, use this query:

SELECT * FROM Singers
ORDER BY SingerId;

Spanner guarantees result ordering only if the ORDER BY clause is present in the query. In other words, consider this query without the ORDER BY:

SELECT * FROM Singers;

Spanner does not guarantee that the results of this query will be in primary key order. Furthermore, the ordering of results can change at any time and is not guaranteed to be consistent from invocation to invocation.

Use STARTS_WITH instead of LIKE to speed up parameterized SQL queries

Because Spanner does not evaluate parameterized LIKE patterns until execution time, Spanner must read all rows and evaluate them against the LIKE expression to filter out rows that do not match.

When a LIKE pattern looks for matches that are at the beginning of a value and the column is indexed, use STARTS_WITH instead of LIKE. This option allows Spanner to more effectively optimize the query execution plan.

Not recommended:

Google Standard SQL

SELECT a.AlbumTitle FROM Albums a
WHERE a.AlbumTitle LIKE @like_clause;

PostgreSQL

SELECT a.AlbumTitle FROM Albums a
WHERE a.AlbumTitle LIKE $1;

Recommended:

Google Standard SQL

SELECT a.AlbumTitle FROM Albums a
WHERE STARTS_WITH(a.AlbumTitle, @prefix);

PostgreSQL

SELECT a.AlbumTitle FROM Albums a
WHERE STARTS_WITH(a.AlbumTitle, $1);
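The reason STARTS_WITH is easier to optimize is that a known prefix maps to a contiguous key range, which an index on the column can range-scan; a parameterized LIKE pattern is opaque at plan time, so no such range can be derived. This sketch (simplified, and ignoring some Unicode edge cases) shows the mapping:

```python
# Sketch: a prefix match is equivalent to the key range
# [prefix, successor(prefix)), which a sorted index can range-scan.
import bisect

def prefix_range(prefix):
    # Smallest string greater than every string starting with `prefix`:
    # bump the last character. (Simplified; ignores max-codepoint edge cases.)
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

titles = sorted(["Go", "Green", "Greatest Hits", "Love", "Peace"])
lo, hi = prefix_range("Gre")
matches = titles[bisect.bisect_left(titles, lo):bisect.bisect_left(titles, hi)]
assert matches == ["Greatest Hits", "Green"]
```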

Use commit timestamps to query recent data

If your application needs to query data written after a particular time, add commit timestamp columns to the relevant tables. Commit timestamps enable a Spanner optimization that can reduce the I/O of queries whose WHERE clauses restrict results to rows written more recently than a specific time.

Learn more about this optimization with Google Standard SQL-dialect databases or with PostgreSQL-dialect databases.