Combine TOKENLISTs

This page describes how to concatenate TOKENLISTs in either a search index when you set up your schema or in a search query when performing a full-text search in Spanner.

Combine TOKENLISTs in a search index

Sometimes, you need your application to search across individual fields. At other times, the application needs to search across all fields. For example, in a table with two string columns, you might want your application to search across both columns without differentiating which column the matches come from.

In Spanner, there are two ways to achieve this:

  1. Tokenize words separately and concatenate the resulting TOKENLISTs (recommended).
  2. Concatenate strings and tokenize the result.

With the second approach, there are two problems:

  1. If you want to index Title or Studio individually, in addition to indexing them in a combined TOKENLIST, the same text is tokenized twice. This causes transactions to use more resources.
  2. A phrase search spans both fields. For example, if @p is set to "Blue Note", it matches a row that contains both Title="Big Blue Note" and Studio="Blue Note Studios".

The first approach solves these problems because a phrase only matches one field and each string field is only tokenized once if both the individual and combined TOKENLISTs are indexed. Even though each string field is only tokenized once, the resulting TOKENLISTs are stored separately in the index.

Tokenize words separately and concatenate TOKENLISTs

The following example tokenizes each word and uses TOKENLIST_CONCAT to concatenate the TOKENLISTs:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
  Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
  Combined_Tokens TOKENLIST AS (TOKENLIST_CONCAT([Title_Tokens, Studio_Tokens])) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);

SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);

PostgreSQL uses spanner.tokenlist_concat for concatenation. The query parameter $1 is bound to 'Hatel Kaliphorn'.

CREATE TABLE albums (
  albumid character varying NOT NULL,
  title character varying,
  studio character varying,
  title_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(title)) VIRTUAL HIDDEN,
  studio_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(studio)) VIRTUAL HIDDEN,
  combined_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenlist_concat(ARRAY[spanner.tokenize_fulltext(title), spanner.tokenize_fulltext(studio)])) VIRTUAL HIDDEN,
PRIMARY KEY(albumid));

CREATE SEARCH INDEX albumsindex ON albums(combined_tokens);

SELECT albumid FROM albums WHERE spanner.search(combined_tokens, $1);

Note that tokenlist_concat doesn't call title_tokens or studio_tokens, but instead calls spanner.tokenize_fulltext(title) and spanner.tokenize_fulltext(studio). This is because PostgreSQL doesn't support referencing generated columns that are within other generated columns. spanner.tokenlist_concat needs to call tokenize functions and not reference tokenlist columns directly.

TOKENLIST concatenation can also be implemented entirely on the query side. For more information, see Query-side TOKENLIST concatenation.

TOKENLIST_CONCAT is supported for both full-text and substring searches. Spanner doesn't let you mix tokenization types, such as TOKENIZE_FULLTEXT and TOKENIZE_SUBSTRING in the same TOKENLIST_CONCAT call.

In GoogleSQL, the definition of text TOKENLIST columns can be changed in non-stored columns to add additional columns. This is useful when you want to add an additional column to TOKENLIST_CONCAT. Changing the generated column expression doesn't backfill existing rows in the index.

Concatenate strings and tokenize the result

The following example concatenates strings and tokenizes the result:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Combined_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title || " " || Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Combined_Tokens);

SELECT AlbumId FROM Albums WHERE SEARCH(Combined_Tokens, @p);
CREATE TABLE albums (
  albumid character varying NOT NULL,
  title character varying,
  studio character varying,
  combined_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(title || ' ' || studio)) VIRTUAL HIDDEN,
PRIMARY KEY(albumid));

CREATE SEARCH INDEX albumsindex ON albums(combined_tokens);

SELECT albumid FROM albums WHERE spanner.search(combined_tokens, $1);

Query-side TOKENLIST concatenation

The tradeoff with indexing the concatenated TOKENLIST is that it increases storage and write cost. Each token is now stored on the disk twice: once in a posting list of its original TOKENLIST, and once in a posting list of the combined TOKENLIST. Query-side concatenation of TOKENLIST columns avoids this cost but the query uses more compute resources.

To concatenate multiple TOKENLISTs, use the TOKENLIST_CONCAT function in the SEARCH query. For this section, we're using the following sample schema:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Studio STRING(MAX),
  Title_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Title)) HIDDEN,
  Studio_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Studio)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(Title_Tokens, Studio_Tokens);
CREATE TABLE albums (
  albumid character varying NOT NULL,
  title character varying,
  studio character varying,
  title_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(title)) VIRTUAL HIDDEN,
  studio_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(studio)) VIRTUAL HIDDEN,
 PRIMARY KEY(albumid));

CREATE SEARCH INDEX albumsindex ON albums(title_tokens, studio_tokens);

The following query searches for rows that have the tokens "blue" and "note" anywhere in the Title and Studio columns. This includes rows with both "blue" and "note" in the Title column, "blue" and "note" in the Studio column, and "blue" in the Title column and "note" in the Studio column, or the opposite.

SELECT AlbumId
FROM Albums
WHERE SEARCH(TOKENLIST_CONCAT([AlbumTitle_Tokens, Studio_Tokens]), 'blue note')

This example uses spanner.search with spanner.tokenlist_concat.

SELECT albumid
FROM albums
WHERE spanner.search(spanner.tokenlist_concat(ARRAY[albumtitle_tokens, studio_tokens]), 'blue note')

Write-side and query-side TOKENLIST concatenation produce identical results. The choice between the two is a trade-off between disk cost and query cost.

Alternatively, an application could search multiple TOKENLIST columns and use OR along with the SEARCH function:

SEARCH(AlbumTitle_Tokens, 'Blue Note') OR SEARCH(Studio_Tokens, 'Blue Note')
spanner.search(albumtitle_tokens, 'Blue Note') OR spanner.search(studio_tokens, 'Blue Note')

This, however, has different semantics. It doesn't match albums where AlbumTitle_Tokens has "blue", but not "note" and Studio_Tokens has "note", but not "blue".

What's next