Query overview

This page describes the SEARCH function and the enhanced query mode, which are used to perform full-text search queries on Spanner tables.

Query a search index

Spanner provides the SEARCH function for search index queries. The SEARCH function requires two arguments:

  • A search index name
  • An rquery

The SEARCH function only works when a search index is defined. The SEARCH function can be combined with any arbitrary SQL constructs, such as filters, aggregations, or joins.

The following query uses the SEARCH function to return all albums that have either friday or monday in the title:

SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'friday OR monday')

The rquery language

The second argument of the SEARCH function is a domain-specific language (DSL) called rquery. The rquery language is similar to what internet users are accustomed to using on google.com:

  • Multiple terms imply AND. For example, "Big Time" is equivalent to big AND time.
  • The OR operation implies disjunction between two terms, such as big OR time. The statement SEARCH(tl, 'big time OR fast car') is equivalent to:

    SEARCH(tl, 'big')
    AND (SEARCH(tl, 'time')
         OR SEARCH(tl, 'fast'))
    AND SEARCH(tl, 'cat');
    

    OR only applies to the two adjacent terms. For example, the search expression happy friday OR monday searches for all the documents that have the term happy and either the term friday or monday.

  • Double quotes mean a phrase search. For example, the rquery "big cat" matches "I have a big cat", but doesn't match "My cat is big".

  • The AROUND operator matches terms that are within a certain distance of each other. For example, rquery big AROUND cat matches "big, white, and fluffy cat", but doesn't match "big dog was sitting next to a small cat". The default is to match terms separated by, at most, five positions. To adjust the distance, pass an argument to the AROUND operator. Spanner supports two syntaxes for AROUND:

    • big AROUND(10) cat
    • big AROUND 10 cat
  • The AROUND operator, when used as part of a phrase, can match terms that are within a certain distance of each other, and in the same order (the default is five tokens). For example, the string big AROUND cat matches "big white and fluffy cat", but not "cat was big".

  • Negation of a single token is expressed with a dash (-). For example -dog matches all documents that don't contain the term dog.

  • Punctuation is generally ignored. For example, "Big Time!" is equivalent to "Big Time".

  • Search is case insensitive. For example, "Big Time" matches "big time".

  • The OR and AROUND operators are case sensitive. The pipe character (|) is a shortcut for OR.

How the rquery language works

The rquery language follows the same rules as the plain-text tokenizer when splitting the input string into distinct terms. This includes segmentation of Asian languages.

Many applications provide a way for their users to enter a search query in a search box. The easiest way to integrate such end user queries is to send the user input directly into the SEARCH function.

The following table explains the meaning of various rquery strings:

rquery Explanation
Big Cat Matches documents that contain both terms "big" and "cat".
cat OR dog Matches documents that contain at least one of the terms "cat" and "dog".
-cat Matches all documents that don't contain the term "cat".
"big dog" -"big cat" Matches documents that contain two adjacent terms "big" and "dog", but don't contain adjacent "big" and "cat". For example, this query matches "I have a big dog and a small cat", but doesn't match "I have a big dog and a big cat".
cat|dog This is the same as cat OR dog.
and OR or Matches documents that have either the term "and" or the term "or" (the OR operator must be upper case)

Enhanced query mode

In addition to a purely token-based full-text search, Spanner supports a richer mode called enhance_query. When enabled, this mode extends the search query to include more token variants. These rewrites increase search recall.

To enable this option, set the optional argument enhance_query=>true in the SEARCH function. For example, the search query hotl cal matches the album Hotel California.

SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'hotl cal', enhance_query=>true)

The enhance_query mode is a query-time option. It doesn't affect tokenization. You can use the same search index with or without enhance_query.

Google is continuously improving the query enhancement algorithms. As the result, a query with enhance_query == true might yield slightly different results over time..

When the enhance_query mode is enabled, it may increase the number of terms that the SEARCH function is looking for and may result in slightly elevated latency.

For example, the following query uses a three-second timeout and fails if enhance_query is unavailable:

@{require_enhance_query=true, enhance_query_timeout_ms=3000}
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'cat', enhance_query=>true)

SQL query requirements

There are several conditions that a SQL query must meet to use a search index. If these conditions aren't met, the query uses either an alternative query plan or fails if no alternative plan exists.

Queries must meet the following conditions:

  • SEARCH and SEARCH_SUBSTRING functions require a search index. Spanner doesn't support these functions in queries against the base table or secondary indexes. Queries against
  • Partitioned indexes must have all partition columns bound by an equality condition in the WHERE clause of the query.

    For example, if a search index is defined as PARTITION BY x, y, the query must have a conjunct in the WHERE clause of x = <parameter or constant> AND y = <parameter or constant>. That search index isn't considered by the query optimizer if such a condition is missing.

  • All TOKENLIST columns referenced by SEARCH and SEARCH_SUBSTRING operators must be indexed in the same search index.

    For example, consider the following table and index definition:

    CREATE TABLE Albums (
        AlbumId STRING(MAX) NOT NULL,
        AlbumTitle STRING(MAX),
        AlbumStudio STRING(MAX),
        AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
        AlbumStudio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumStudio)) HIDDEN
      ) PRIMARY KEY(AlbumId);
    
      CREATE SEARCH INDEX AlbumsTitleIndex ON Albums(AlbumTitle_Tokens);
      CREATE SEARCH INDEX AlbumsStudioIndex ON Albums(AlbumStudio_Tokens);
    

    The following query fails because there's no single search index that indexes both AlbumTitle_Tokens and AlbumStudio_Tokens:

    SELECT AlbumId
    FROM Albums
    WHERE SEARCH(AlbumTitle_Tokens, @p1) AND SEARCH(AlbumStudio_Tokens, @p2)
    
  • If the sort order column is nullable, both the schema and the query must exclude rows where the sort order column is NULL. See Search index sort order for details.

  • If the search index is NULL filtered, the query must include the same NULL-filtering expression that's used in an index. See NULL-filtered search indexes for details.

  • Search indexes aren't supported in DML, partitioned DML, and partitioned queries .

  • Search indexes are typically used in read-only transactions. If application requirements allow stale results, we recommend running search queries with a staleness duration of 10 seconds or longer. For more information, see Read stale data. This is particularly useful for large search queries that fan out to multiple Paxos groups.

    Search indexes aren't recommended in read-write transactions. During execution, search queries lock an entire index partition; as a result, a high rate of search queries in read-write transactions might cause lock conflicts leading to latency spikes. By default, search indexes aren't automatically selected in read-write transactions. If a query is forced to use a search index in a read-write transaction it fails by default. This behavior can be overridden with the @{ALLOW_SEARCH_INDEXES_IN_TRANSACTION=TRUE} statement-level hint (but queries are still prone to lock conflicts).

Once index eligibility conditions are met, the query optimizer tries to accelerate non-text query conditions (like Rating > 4). If the search index doesn't include the appropriate TOKENLIST column, the condition isn't accelerated and remains a residual condition.

Query parameters

rquery and other parameters like OFFSET are specified as either a literal or a query parameter. We recommend using query parameters rather than string literals. Parameterized queries have better query cache hit rates resulting in lower query latency and lower overall CPU usage.

For example, instead of using a query like the following:

SELECT AlbumId FROM Albums WHERE SEARCH(AlbumTitle_Tokens, 'cat')

use the following syntax:

SELECT AlbumId FROM Albums WHERE SEARCH(AlbumTitle_Tokens, @p)

Spanner runs the query optimizer on distinct SQL text. The fewer distinct SQL texts the application uses, the fewer times that the query optimization is invoked.

Search index sort order

Search indexes sort order behavior is different from secondary indexes.

For example, consider the following table:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  ReleaseTimestamp INT64 NOT NULL,
  AlbumName STRING(MAX),
  AlbumName_Token TOKENLIST AS (TOKEN(AlbumName)) HIDDEN
) PRIMARY KEY(AlbumId);

The application might define a secondary index to look up information using the AlbumName sorted by ReleaseTimestamp:

CREATE INDEX AlbumsSecondaryIndex ON Albums(AlbumName, ReleaseTimestamp DESC);

The equivalent search index looks as following (this uses exact-match tokenization, since secondary indexes don't support full-text searches):

CREATE SEARCH INDEX AlbumsSearchIndex
ON Albums(AlbumName_Token)
ORDER BY ReleaseTimestamp DESC;

Search index sort order must conform to the following rules:

  1. Only use INT64 columns for the sort order of a search index. Columns that have arbitrary sizes use too many resources on the search index because Spanner needs to store the docid next to every token. Specifically, the sort order column can't use the TIMESTAMP type because TIMESTAMP uses nanosecond precision which doesn't fit in a 64-bit integer.
  2. Sort order columns must not be NULL. There are two ways to meet this requirement:

    1. Declare the sort order column as NOT NULL.
    2. Configure the index to exclude NULL values.

In practice, a timestamp is often used to determine the sort order. A common practice is to use microseconds since the Unix epoch for such timestamps.

Applications usually retrieve the newest data first using a search index that's sorted in descending order.

Index selection

Spanner typically selects the most efficient index for a query using cost-based modeling. However, the FORCE_INDEX hint explicitly instructs Spanner to use a specific search index. For example, the following shows how to force Spanner to use the AlbumsIndex:

SELECT AlbumId
FROM Albums @{FORCE_INDEX=AlbumsIndex}
WHERE SEARCH(AlbumTitle_Tokens, @p1)

If the specified search index isn't eligible, the query fails, even if there are other eligible search indexes.

Snippets in search results

A snippet is a piece of text extracted from a given string that gives users a sense of what a search result contains, and the reason why the result is relevant to their query.

For example, Gmail uses snippets to indicate the portion of an email that matches the search query:

List of snippets

Having the database generate a snippet has several benefits:

  1. Convenience: You don't need to implement logic to generate snippets from a search query.
  2. Efficiency: Snippets reduce the output size from the server.

The SNIPPET function creates the snippet. It returns the relevant portion of the original string value along with positions of characters to highlight. The client can then choose how to display the snippet to the end user (for example, using highlighted or bold text). The SNIPPET function removes all HTML tags from the original string.

For example, the following uses SNIPPET to retrieve text from AlbumTitle:

SELECT AlbumId, SNIPPET(AlbumTitle, "Fast Car")
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, "Fast Car")

What's next