GoogleSQL for Spanner supports the following search functions.
Categories
The search functions are grouped into the following categories, based on their behavior:
Category | Functions | Description |
---|---|---|
Indexing | `TOKEN`, `TOKENIZE_BOOL`, `TOKENIZE_FULLTEXT`, `TOKENIZE_NGRAMS`, `TOKENIZE_NUMBER`, `TOKENIZE_SUBSTRING`, `TOKENLIST_CONCAT` | Functions that you can use to create search indexes. |
Retrieval and presentation | `SCORE`, `SCORE_NGRAMS`, `SEARCH`, `SEARCH_NGRAMS`, `SEARCH_SUBSTRING`, `SNIPPET` | Functions that you can use to search for data, score the search result, or format the search result. |
Debugging | `DEBUG_TOKENLIST` | Functions that you can use for debugging. |
Function list
Name | Summary |
---|---|
`DEBUG_TOKENLIST` | Displays a human-readable representation of tokens present in the `TOKENLIST` value for debugging purposes. |
`SCORE` | Calculates a relevance score of a `TOKENLIST` for a full-text search query. The higher the score, the stronger the match. |
`SCORE_NGRAMS` | Calculates a relevance score of a `TOKENLIST` for a fuzzy search. The higher the score, the stronger the match. |
`SEARCH` | Returns `TRUE` if a full-text search query matches tokens. |
`SEARCH_NGRAMS` | Checks whether enough n-grams match the tokens in a fuzzy search. |
`SEARCH_SUBSTRING` | Returns `TRUE` if a substring query matches tokens. |
`SNIPPET` | Gets a list of snippets that match a full-text search query. |
`TOKEN` | Constructs an exact match `TOKENLIST` value by tokenizing a `BYTES` or `STRING` value verbatim to accelerate exact match expressions in SQL. |
`TOKENIZE_BOOL` | Constructs a boolean `TOKENLIST` value by tokenizing a `BOOL` value to accelerate boolean match expressions in SQL. |
`TOKENIZE_FULLTEXT` | Constructs a full-text `TOKENLIST` value by tokenizing text for full-text matching. |
`TOKENIZE_NGRAMS` | Constructs an n-gram `TOKENLIST` value by tokenizing a `STRING` value for matching n-grams. |
`TOKENIZE_NUMBER` | Constructs a numeric `TOKENLIST` value by tokenizing numeric values to accelerate numeric comparison expressions in SQL. |
`TOKENIZE_SUBSTRING` | Constructs a substring `TOKENLIST` value by tokenizing text for substring matching. |
DEBUG_TOKENLIST
DEBUG_TOKENLIST(tokenlist)
Description
Displays a human-readable representation of tokens present in a TOKENLIST
value for debugging purposes.
Definitions
`tokenlist`: The `TOKENLIST` value to display.
Details
The output of this function depends on the source of the `TOKENLIST` value provided as input.
Return type
STRING
Examples
The following query illustrates how attributes and positions are represented:
- In `hello(boundary)`, `hello` is the text of the token and `boundary` is an attribute of the token.
- Token `db` has no attributes.
- In `[#world, world](boundary)`, `#world` and `world` are both tokens added to the tokenlist at the same position, and `boundary` is the attribute for both of them. This can match either the `#world` or `world` query term.
SELECT DEBUG_TOKENLIST(TOKENIZE_FULLTEXT('Hello DB #World')) AS Result;
/*------------------------------------------------+
| Result |
+------------------------------------------------+
| hello(boundary), db, [#world, world](boundary) |
+------------------------------------------------*/
The following query illustrates how equality and range tokens are represented:
- `==1` and `==10` represent equality tokens for `1` and `10`.
- `[1, 1]` represents a range token with `1` as the lower bound and `1` as the upper bound.
SELECT DEBUG_TOKENLIST(TOKENIZE_NUMBER([1, 10], min=>1, max=>10)) AS Result;
/*--------------------------------------------------------------------------------+
| Result |
+--------------------------------------------------------------------------------+
| ==1, ==10, [1, 1], [1, 2], [1, 4], [1, 8], [9, 10], [9, 12], [9, 16], [10, 10] |
+--------------------------------------------------------------------------------*/
SCORE
SCORE(
tokens,
raw_search_query
[, language_tag => value ]
[, enhance_query => { TRUE | FALSE } ]
[, options => value ]
)
Description
Calculates a relevance score of a TOKENLIST
for a full-text search query. The
higher the score, the stronger the match.
Definitions

`tokens`: A `TOKENLIST` value that represents a list of full-text tokens.

`raw_search_query`: A `STRING` value that represents a raw search query.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `raw_search_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`enhance_query`: A named argument with a `BOOL` value. The value determines whether to enhance the search query. For example, if `enhance_query` is enabled, a search query containing the term `classic` can expand to include similar terms such as `classical`. The search query isn't enhanced if the `enhance_query` call takes longer than the timeout.
  - If `TRUE`, the search query is enhanced to improve search quality.
  - If `FALSE` (default), the search query isn't enhanced.

`options`: A named argument with a `JSON` value. The value represents the fine-tuning options for search scoring.
  - `bigram_weight`: A multiplier for bigrams, which have matching terms adjacent to each other. The default is 2.0.
  - `idf_weight`: A multiplier for term commonality. Hits on rare terms score relatively higher than hits on common terms. The default is 1.0.
  - `token_category_weights`: A multiplier for each HTML category. The available categories are `small`, `medium`, `large`, and `title`.
Details

- This function must reference a full-text `TOKENLIST` column in a table that is also indexed in a search index. To add a full-text `TOKENLIST` column to a table and to a search index, see the examples for this function.
- This function requires the `SEARCH` function in the same SQL query.
- This function returns `0` when `tokens` or `raw_search_query` is `NULL`.
Return type
FLOAT64
Examples
The following examples reference a table called Albums
and a search index
called AlbumsIndex
.
The `Albums` table contains a column called `DescriptionTokens`, which tokenizes the input added to the `Description` column and saves those tokens in the `DescriptionTokens` column. Finally, `AlbumsIndex` indexes `DescriptionTokens`. Once `DescriptionTokens` is indexed, it can be used with the `SCORE` function.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionTokens TOKENLIST AS (TOKENIZE_FULLTEXT(Description)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionTokens);
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 1, 'classical album');
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 2, 'classical and rock album');
The following query searches the `Description` column for the query `classical album`. The matching `Description` values are returned with their corresponding scores. Both `classical album` and `classical and rock album` contain the terms `classical` and `album`, but the first one has a higher score because the terms are adjacent.
SELECT
a.Description, SCORE(a.DescriptionTokens, 'classical album') AS Score
FROM
Albums a
WHERE
SEARCH(a.DescriptionTokens, 'classical album');
/*--------------------------+---------------------*
| Description | Score |
+--------------------------+---------------------+
| classical album | 1.2818930149078369 |
| classical and rock album | 0.50003194808959961 |
*--------------------------+---------------------*/
The following query is like the previous one. However, it increases `bigram_weight` to boost scores more when matching terms are adjacent.
SELECT
a.Description,
SCORE(
a.DescriptionTokens,
'classical album',
options=>JSON '{"bigram_weight": 3.0}'
) AS Score
FROM Albums a
WHERE SEARCH(a.DescriptionTokens, 'classical album');
/*--------------------------+---------------------*
| Description | Score |
+--------------------------+---------------------+
| classical album | 1.7417128086090088 |
| classical and rock album | 0.50003194808959961 |
*--------------------------+---------------------*/
The following query uses SCORE
in the ORDER BY
clause to get the row with
the highest score.
SELECT a.Description
FROM Albums a
WHERE SEARCH(a.DescriptionTokens, 'classical album')
ORDER BY SCORE(a.DescriptionTokens, 'classical album') DESC
LIMIT 1;
/*--------------------------*
| Description |
+--------------------------+
| classical album |
*--------------------------*/
SCORE_NGRAMS
SCORE_NGRAMS(
tokens,
ngrams_query
[, language_tag => value ]
[, algorithm => value ]
)
Description
Calculates a relevance score of a TOKENLIST
for a fuzzy search. The higher
the score, the stronger the match.
Definitions

`tokens`: A `TOKENLIST` value that contains a list of n-gram tokens. You can generate a `TOKENLIST` using either `TOKENIZE_SUBSTRING` or `TOKENIZE_NGRAMS`, which tokenizes the source column directly. A `TOKENLIST` generated from an expression isn't supported for scoring, for example, `TOKENIZE_SUBSTRING(REGEXP_REPLACE(col, 'foo', 'bar'))`.

`ngrams_query`: A `STRING` value that represents a fuzzy search query.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `ngrams_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`algorithm`: A named argument with a `STRING` value. The value specifies the scoring algorithm for the fuzzy search. The default value for this argument is `trigrams`, and currently it's the only supported algorithm.
  - `trigrams`: Generates trigrams (n-grams of size 3) without duplication from the query, and also generates trigrams without duplication from the source column of `tokens`. Matches are the intersection between the query trigrams and the source trigrams. The score is roughly calculated as `match_count / (query_trigrams + source_trigrams - match_count)`.
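The trigrams algorithm described above can be sketched in Python. This is an illustrative simplification, not Spanner's implementation: the exact trigram set Spanner derives from a value (for example, boundary handling) isn't specified here, so the sketch won't reproduce the exact scores in the examples for this function. Word-splitting and lowercasing are assumptions.

```python
def trigrams(text):
    """Distinct trigrams per whitespace-separated word; short words kept whole (assumption)."""
    grams = set()
    for word in text.lower().split():
        if len(word) < 3:
            grams.add(word)
        else:
            grams.update(word[i:i + 3] for i in range(len(word) - 2))
    return grams

def trigram_score(query, source_value):
    """Roughly: match_count / (query_trigrams + source_trigrams - match_count)."""
    q, s = trigrams(query), trigrams(source_value)
    matches = len(q & s)
    denom = len(q) + len(s) - matches
    return matches / denom if denom else 0.0
```

For instance, `trigram_score('clasic', 'classical')` is higher than `trigram_score('clasic', 'rock')`, mirroring how a fuzzy search favors the closer match.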
Details

- This function returns `0` when `tokens` or `ngrams_query` is `NULL`.
- Unlike `SEARCH_NGRAMS`, this function requires access to the source column of `tokens`. Therefore, it's often advantageous to include the source column in the `SEARCH INDEX`'s `STORING` clause to avoid a join with the base table. See index-only scans.
Return type
FLOAT64
Examples
The following examples reference a table called Albums
and a search index
called AlbumsIndex
.
The `Albums` table contains a column `DescriptionSubstrTokens`, which tokenizes the `Description` column using `TOKENIZE_SUBSTRING`. Finally, `AlbumsIndex` stores `Description` so that the following query doesn't have to join with the base table.
CREATE TABLE Albums (
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionSubstrTokens TOKENLIST AS
(TOKENIZE_SUBSTRING(Description, ngram_size_max=>3)) HIDDEN
) PRIMARY KEY (AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionSubstrTokens)
STORING(Description);
INSERT INTO Albums (AlbumId, Description) VALUES (1, 'rock album');
INSERT INTO Albums (AlbumId, Description) VALUES (2, 'classical album');
The following query scores `Description` against `clasic albun`, which is misspelled.
SELECT
a.Description, SCORE_NGRAMS(a.DescriptionSubstrTokens, 'clasic albun') AS Score
FROM
Albums a;
/*-----------------+---------------------*
| Description | Score |
+-----------------+---------------------+
| rock album | 0.14285714285714285 |
| classical album | 0.38095238095238093 |
*-----------------+---------------------*/
The following query uses SCORE_NGRAMS
in the ORDER BY
clause to produce the
row with the highest score.
SELECT a.Description
FROM Albums a
WHERE SEARCH_NGRAMS(a.DescriptionSubstrTokens, 'clasic albun')
ORDER BY SCORE_NGRAMS(a.DescriptionSubstrTokens, 'clasic albun') DESC
LIMIT 1
/*-----------------*
| Description |
+-----------------+
| classical album |
*-----------------*/
SEARCH
SEARCH(
tokens,
raw_search_query
[, language_tag => value]
[, enhance_query => { TRUE | FALSE }]
)
Description
Returns TRUE
if a full-text search query matches tokens.
Definitions

`tokens`: A `TOKENLIST` value that is a list of full-text tokens.

`raw_search_query`: A `STRING` value that is a raw search query. The raw query is written using a domain-specific language (DSL) called rquery. For an overview of rquery, see the rquery syntax overview. For rquery syntax rules, see rquery syntax.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `raw_search_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`enhance_query`: A named argument with a `BOOL` value. The value determines whether to enhance the search query. For example, if `enhance_query` is enabled, a search query containing the term `classic` can expand to include similar terms such as `classical`. The search query isn't enhanced if the `enhance_query` call takes longer than the timeout.
  - If `TRUE`, the search query is enhanced to improve search quality.
  - If `FALSE` (default), the search query isn't enhanced.
Details

- Returns `TRUE` if the `TOKENLIST` value `tokens` is a match for `raw_search_query`.
- This function must reference a full-text `TOKENLIST` column in a table that is also indexed in a search index. To add a full-text `TOKENLIST` column to a table and to a search index, see the examples for this function.
- This function returns `NULL` when `tokens` or `raw_search_query` is `NULL`.
- This function can only be used in the `WHERE` clause of a SQL query.
rquery syntax
The rquery syntax rules are as follows:
- Multiple terms imply `AND`. For example, `big time` is equivalent to `big AND time`.
- The `OR` operator implies disjunction between two terms, such as `big OR time`. The predicate `SEARCH(tl, 'big time OR fast car')` is equivalent to `SEARCH(tl, 'big') AND (SEARCH(tl, 'time') OR SEARCH(tl, 'fast')) AND SEARCH(tl, 'car')`. `OR` only applies to the two adjacent terms, so the search expression `big time OR fast car` searches for all the documents that have the terms `big` and `car`, and either `time` or `fast`. The `OR` operator is case sensitive.
- The pipe character (`|`) is a shortcut for `OR`.
- Double quotes mean a phrase search. For example, the rquery `"fast car"` matches "You got a fast car", but doesn't match "driving fast in my car".
- The `AROUND` operator matches terms that are within a certain distance of each other, and in the same order. For example, the rquery `fast AROUND car` matches "driving fast in my car", but doesn't match "driving fast in his small shiny metal Italian car". The default is to match terms separated by, at most, five positions. To adjust the distance, pass an argument to the `AROUND` operator. Two syntaxes are supported for `AROUND`: `fast AROUND(10) car` and `fast AROUND 10 car`. The `AROUND` operator is case sensitive.
- Negation of a single term is expressed with a dash (`-`). For example, `-dog` matches all documents that don't contain the term `dog`.
- Punctuation is generally ignored. For example, "Fast Car!" is equivalent to "Fast Car". For more information, see the `TOKENIZE_FULLTEXT` punctuation rules.
- Search is case insensitive. For example, "Fast Car" matches "fast car".
The following table explains the meaning of various rquery strings:
rquery | Explanation |
---|---|
Miles Davis | Matches documents that contain both terms "Miles" and "Davis". |
Miles OR Davis | Matches documents that contain at least one of the terms "Miles" and "Davis". |
-Davis | Matches all documents that don't contain the term "Davis". |
"Miles Davis" -"Miles Jaye" | Matches documents that contain the two adjacent terms "Miles" and "Davis", but don't contain adjacent "Miles" and "Jaye". For example, this query matches "I saw Miles Davis last night and Jaye earlier today", but doesn't match "I saw Miles Davis and Miles Jaye perform together". |
Davis\|Jaye | This is the same as Davis OR Jaye. |
and OR or | Matches documents that have either the term "and" or the term "or" (the OR operator must be uppercase). |
Return type
BOOL
Examples
The following examples reference a table called Albums
and a search index
called AlbumsIndex
.
The `Albums` table contains a column called `DescriptionTokens`, which tokenizes the `Description` column using `TOKENIZE_FULLTEXT` and saves those tokens in the `DescriptionTokens` column. Finally, `AlbumsIndex` indexes `DescriptionTokens`. Once `DescriptionTokens` is indexed, it can be used with the `SEARCH` function.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionTokens TOKENLIST AS (TOKENIZE_FULLTEXT(Description)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionTokens)
PARTITION BY SingerId;
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 1, 'rock album');
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 2, 'classical album');
The following query searches the column called Description
for a token called
classical
. If this token is found for singer ID 1
, the matching rows are
returned.
SELECT a.AlbumId, a.Description
FROM Albums a
WHERE a.SingerId = 1 AND SEARCH(a.DescriptionTokens, 'classical');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
*---------------------------*/
The following query is like the previous one. However, if Description
contains
the classical
or rock
token, the matching rows are returned.
SELECT a.AlbumId, a.Description
FROM Albums a
WHERE a.SingerId = 1 AND SEARCH(a.DescriptionTokens, 'classical OR rock');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
| 1 | rock album |
*---------------------------*/
The following query is like the previous ones. However, it searches for the terms `classic` and `albums`. When `enhance_query` is enabled, the search also includes similar matches such as `classical` and `album`, so the matching rows are returned.
SELECT a.AlbumId, a.Description
FROM Albums a
WHERE a.SingerId = 1 AND SEARCH(a.DescriptionTokens, 'classic albums', enhance_query => TRUE);
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
*---------------------------*/
SEARCH_NGRAMS
SEARCH_NGRAMS(
tokens,
ngrams_query
[, language_tag => value ]
[, min_ngrams => value ]
[, min_ngrams_percent => value ]
)
Description
Checks whether enough n-grams match the tokens in a fuzzy search.
Definitions

`tokens`: A `TOKENLIST` value that contains a list of n-grams. It must be a `TOKENLIST` generated by either `TOKENIZE_SUBSTRING` or `TOKENIZE_NGRAMS`.

`ngrams_query`: A `STRING` value that represents a fuzzy search query. This function generates n-gram query terms from this value, using the `ngram_size_max` of `tokens` as the n-gram size, or the query as is if the query is shorter than `ngram_size_max`. Then the function looks for those n-grams in `tokens`. When `tokens` is generated by `TOKENIZE_SUBSTRING`, the `ngrams_query` value is broken into words first, and then n-grams are generated from each word.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `ngrams_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`min_ngrams`: A named argument with an `INT64` value. The value specifies the minimum number of n-grams in `ngrams_query` that have to match in order for `SEARCH_NGRAMS` to return `TRUE`. This only counts distinct n-grams and ignores repeating n-grams. The default value for this argument is `2`.

`min_ngrams_percent`: A named argument with a `FLOAT64` value. The value specifies the minimum percentage of n-grams in `ngrams_query` that have to match in order for `SEARCH_NGRAMS` to return `TRUE`. This only counts distinct n-grams and ignores repeating n-grams.
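The `min_ngrams` and `min_ngrams_percent` thresholds can be sketched in Python. This is an illustrative simplification of the counting rule (distinct n-grams only), not Spanner's implementation:

```python
def enough_ngrams(query_ngrams, indexed_ngrams, min_ngrams=2, min_ngrams_percent=None):
    """Count distinct query n-grams found among the indexed n-grams."""
    distinct_query = set(query_ngrams)              # repeating n-grams are ignored
    matched = len(distinct_query & set(indexed_ngrams))
    if min_ngrams_percent is not None:
        return 100.0 * matched / len(distinct_query) >= min_ngrams_percent
    return matched >= min_ngrams
```

With the query n-grams `['asi', 'cla', 'las', 'sic']` (from `clasic`), an indexed value containing `cla`, `las`, and `sic` passes the default threshold of 2, while a value sharing only `las` doesn't.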
Details

- This function must reference a substring or n-gram `TOKENLIST` column in a table that is also indexed in a search index.
- This function returns `NULL` when `tokens` or `ngrams_query` is `NULL`.
- This function returns `FALSE` if the length of `ngrams_query` is smaller than the `ngram_size_min` of `tokens`.
- This function can only be used in the `WHERE` clause of a SQL query.
Return type
BOOL
Examples
The following examples reference a table called Albums
and a search index
called AlbumsIndex
.
The `Albums` table contains the columns `DescriptionSubstrTokens` and `DescriptionNgramsTokens`, which tokenize the `Description` column using `TOKENIZE_SUBSTRING` and `TOKENIZE_NGRAMS`, respectively. Finally, `AlbumsIndex` indexes `DescriptionSubstrTokens` and `DescriptionNgramsTokens`.
CREATE TABLE Albums (
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionSubstrTokens TOKENLIST AS
(TOKENIZE_SUBSTRING(Description, ngram_size_min=>3, ngram_size_max=>3)) HIDDEN,
DescriptionNgramsTokens TOKENLIST AS
(TOKENIZE_NGRAMS(Description, ngram_size_min=>3, ngram_size_max=>3)) HIDDEN
) PRIMARY KEY (AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionSubstrTokens, DescriptionNgramsTokens);
INSERT INTO Albums (AlbumId, Description) VALUES (1, 'rock album');
INSERT INTO Albums (AlbumId, Description) VALUES (2, 'classical album');
INSERT INTO Albums (AlbumId, Description) VALUES (3, 'last note');
The following query searches the column Description
for clasic
. The query
is misspelled, so querying with
SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'clasic')
doesn't return a row,
but the n-grams search is able to find similar matches.
SEARCH_NGRAMS
first transforms the query clasic
into n-grams of size 3 (the
value of DescriptionSubstrTokens
's ngram_size_max
), producing
['asi', 'cla', 'las', 'sic']
. Then it finds rows that have at least two of
these n-grams (the default value for min_ngrams
) in the
DescriptionSubstrTokens
column.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
SEARCH_NGRAMS(a.DescriptionSubstrTokens, 'clasic');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
*---------------------------*/
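The word-level n-gram generation described above can be sketched in Python. This is an illustrative simplification (lowercasing and whitespace word-splitting are assumed, and a word shorter than the n-gram size is kept as is):

```python
def word_ngrams(query, n=3):
    """Distinct n-grams of size n from each whitespace-separated word of the query."""
    grams = set()
    for word in query.lower().split():
        if len(word) < n:
            grams.add(word)        # words shorter than n are used as is
        else:
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

For example, `sorted(word_ngrams('clasic'))` yields `['asi', 'cla', 'las', 'sic']`, matching the n-gram set used in this example.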
If we change `min_ngrams` to 1, the query also returns the row with `last`, which has one n-gram match (`las`). This example illustrates the decreased relevancy of the returned results when this parameter is set low.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
SEARCH_NGRAMS(a.DescriptionSubstrTokens, 'clasic', min_ngrams=>1);
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
| 3 | last note |
*---------------------------*/
The following query searches the `Description` column for `clasic albun`. Because `DescriptionSubstrTokens` is tokenized by `TOKENIZE_SUBSTRING`, the query is first segmented into `['clasic', 'albun']`, and then n-gram tokens are generated from each word, producing `['alb', 'asi', 'bun', 'cla', 'las', 'lbu', 'sic']`.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
SEARCH_NGRAMS(a.DescriptionSubstrTokens, 'clasic albun');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
| 1 | rock album |
*---------------------------*/
The following query searches the `Description` column for `l al`, this time using `DescriptionNgramsTokens`. Because `DescriptionNgramsTokens` is generated by `TOKENIZE_NGRAMS`, there is no splitting into words before the n-gram tokens are made, so the query's n-gram tokens are `['%20al', 'l%20a']`, where `%20` represents a space.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
SEARCH_NGRAMS(a.DescriptionNgramsTokens, 'l al');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
*---------------------------*/
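The value-level n-gram generation used with `TOKENIZE_NGRAMS` can be sketched in Python. This is an illustrative simplification; the token representation above shows spaces as `%20`, while this sketch keeps literal spaces:

```python
def value_ngrams(text, n=3):
    """Distinct n-grams over the whole value, without word splitting."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}
```

`sorted(value_ngrams('l al'))` yields `[' al', 'l a']`, the two trigrams that must be found in the indexed value.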
SEARCH_SUBSTRING
SEARCH_SUBSTRING(
tokens,
substring_query
[, language_tag => value ]
[, relative_search_type => value ]
)
Description
Returns TRUE
if a substring query matches tokens.
Definitions

`tokens`: A `TOKENLIST` value that is a list of substring tokens.

`substring_query`: A `STRING` value that represents a substring query.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `substring_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`relative_search_type`: A named argument with a `STRING` value. The value refines the substring search result. To use a given `relative_search_type`, the substring `TOKENLIST` must have been generated with the corresponding type in its `TOKENIZE_SUBSTRING` `relative_search_types` argument. The following relative search types are supported:
  - `phrase`: The substring query terms must appear adjacent to one another and in order in the tokenized value (the value that was tokenized to produce the `tokens` argument).
  - `value_prefix`: The substring query terms must be found at the start of the tokenized value.
  - `value_suffix`: The substring query terms must be found at the end of the tokenized value.
  - `word_prefix`: The substring query terms must be found at the start of a word in the tokenized value.
  - `word_suffix`: The substring query terms must be found at the end of a word in the tokenized value.
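The distinction between the value-level and word-level types can be illustrated with a Python sketch. This is an assumed simplification that checks the raw value directly rather than evaluating tokens, so treat it as intuition only:

```python
def value_prefix_match(value, query):
    """value_prefix: the query must start the whole tokenized value."""
    return value.lower().startswith(query.lower())

def word_prefix_match(value, query):
    """word_prefix: the query must start some word in the tokenized value."""
    return any(word.startswith(query.lower()) for word in value.lower().split())
```

For the value `classical album`, the query `al` satisfies `word_prefix` (it starts the word `album`) but not `value_prefix`, which mirrors the example results for this function.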
Details

- Returns `TRUE` if `tokens` is a match for `substring_query`.
- This function must reference a substring `TOKENLIST` column in a table that is also indexed in a search index. To add a substring `TOKENLIST` column to a table and to a search index, see the examples for this function.
- This function returns `NULL` when `tokens` or `substring_query` is `NULL`.
- This function can only be used in the `WHERE` clause of a SQL query.
Return type
BOOL
Examples
The following examples reference a table called Albums
and a search index
called AlbumsIndex
.
The `Albums` table contains a column called `DescriptionSubstrTokens`, which tokenizes the input added to the `Description` column using `TOKENIZE_SUBSTRING` and saves those substring tokens in the `DescriptionSubstrTokens` column. Finally, `AlbumsIndex` indexes `DescriptionSubstrTokens`. Once `DescriptionSubstrTokens` is indexed, it can be used with the `SEARCH_SUBSTRING` function.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionSubstrTokens TOKENLIST AS (TOKENIZE_SUBSTRING(Description, support_relative_search=>TRUE)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionSubstrTokens)
PARTITION BY SingerId;
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 1, 'rock album');
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 2, 'classical album');
The following query searches the column called Description
for a token called
ssic
. If this token is found for singer ID 1
, the matching rows are
returned.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
a.SingerId = 1 AND SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'ssic');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
*---------------------------*/
The following query searches the column called `Description` for both the tokens `lbu` and `oc`. If both tokens are found for singer ID `1`, the matching rows are returned.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
a.SingerId = 1 AND SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'lbu oc');
/*-----------------------*
| AlbumId | Description |
+-----------------------+
| 1 | rock album |
*-----------------------*/
The following query searches the column called Description
for a token called
al
at the start of a word. If this token is found for singer ID 1
, the
matching rows are returned.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
a.SingerId = 1 AND SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'al', relative_search_type=>'word_prefix');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| 2 | classical album |
| 1 | rock album |
*---------------------------*/
The following query searches the column called `Description` for the token `al` at the start of the tokenized value for singer ID `1`. Because there are no matches, no rows are returned.
SELECT
a.AlbumId, a.Description
FROM
Albums a
WHERE
a.SingerId = 1 AND SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'al', relative_search_type=>'value_prefix');
/*---------------------------*
| AlbumId | Description |
+---------------------------+
| | |
*---------------------------*/
SNIPPET
SNIPPET(
data_to_search,
raw_search_query
[, language_tag => value ]
[, enhance_query => { TRUE | FALSE } ]
[, max_snippet_width => value ]
[, max_snippets => value ]
[, content_type => value ]
)
Description
Gets a list of snippets that match a full-text search query.
Definitions

`data_to_search`: A `STRING` value that represents the data to search over.

`raw_search_query`: A `STRING` value that represents the terms of a raw search query.

`language_tag`: A named argument with a `STRING` value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for `raw_search_query`. If the value for this argument is `NULL`, this function doesn't use a specific language. If this argument isn't specified, `NULL` is used by default.

`max_snippets`: A named argument with an `INT64` value. The value represents the maximum number of output snippets to produce.

`max_snippet_width`: A named argument with an `INT64` value. The value represents the width of the output snippet. The width is measured by the estimated number of average proportional-width characters. For example, a wide character like 'M' uses up more space than a narrow character like 'i'.

`enhance_query`: A named argument with a `BOOL` value. The value determines whether to enhance the search query. For example, if `enhance_query` is enabled, a search query containing the term `classic` can expand to include similar terms such as `classical`. The search query isn't enhanced if the `enhance_query` call takes longer than the timeout.
  - If `TRUE`, the search query is enhanced to improve search quality.
  - If `FALSE` (default), the search query isn't enhanced.

`content_type`: A named argument with a `STRING` value. The value represents the MIME type of the content. Currently, `"text/html"` is supported for this function.
Details
Each snippet contains a matching substring of the data_to_search
, and a list
of highlights for the location of matching terms.
This function returns NULL
when data_to_search
or raw_search_query
is
NULL
.
Return type
JSON
The JSON
value has this format and definitions:
{
  "snippets": [
    {
      "highlights": [
        {
          "begin": json_number,
          "end": json_number
        }
      ],
      "snippet": json_string
    }
  ]
}
`snippets`: A JSON array that contains snippets from `data_to_search`. These are snippets of text for `raw_search_query` from the provided `data_to_search` argument.

`highlights`: A JSON array that contains the position of each search term found in `snippet`.

`begin`: A JSON number that represents the position of a search term's first character in `snippet`.

`end`: A JSON number that represents the position of a search term's final character in `snippet`.

`snippet`: A JSON string that represents an individual snippet from `snippets`.
Examples
The following query produces a single snippet, Rock albums rock.
with two
highlighted positions for the matching raw search query term, rock
:
SELECT SNIPPET('Rock albums rock.', 'rock') AS Snippet
/*---------------------------------------------------*
| Snippet |
+---------------------------------------------------+
| {"snippets":[ |
| {"highlights":[ |
| {"begin":1, "end":5}, |
| {"begin":13,"end":17} |
| ], "snippet":"Rock albums rock."} |
| ]} |
*---------------------------------------------------*/
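Client code can apply the highlight offsets to the snippet text. The indexing convention below is inferred from the example above (`begin` looks 1-based and `end` exclusive); treat it as an assumption, not a documented contract, and verify against your own results:

```python
import json

def markup(snippet_json):
    """Wrap each highlighted range in [ ] markers, working right to left
    so earlier offsets stay valid as the text grows."""
    result = []
    for s in json.loads(snippet_json)["snippets"]:
        text = s["snippet"]
        for h in sorted(s["highlights"], key=lambda h: h["begin"], reverse=True):
            b, e = h["begin"] - 1, h["end"] - 1   # assumed 1-based begin, exclusive end
            text = text[:b] + "[" + text[b:e] + "]" + text[e:]
        result.append(text)
    return result
```

Applied to the example output above, this produces `['[Rock] albums [rock].']`.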
TOKEN
TOKEN(value_to_tokenize)
Description
Constructs an exact match `TOKENLIST` value by tokenizing a `BYTES` or `STRING` value verbatim to accelerate exact match expressions in SQL.
Definitions
`value_to_tokenize`: A `BYTES`, `ARRAY<BYTES>`, `STRING`, or `ARRAY<STRING>` value to tokenize for searching with exact match expressions.
Details
- This function returns `NULL` when `value_to_tokenize` is `NULL`.
Return type
TOKENLIST
Examples
The `Albums` table contains the columns `SingerNameToken` and `SongTitlesToken`, which tokenize the `SingerName` and `SongTitles` columns, respectively, using the `TOKEN` function. Finally, `AlbumsIndex` indexes `SingerNameToken` and `SongTitlesToken`, which makes it possible for Spanner to use the index to accelerate exact-match expressions in SQL.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
SingerName STRING(MAX),
SingerNameToken TOKENLIST AS (TOKEN(SingerName)) HIDDEN,
SongTitles ARRAY<STRING(MAX)>,
SongTitlesToken TOKENLIST AS (TOKEN(SongTitles)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(SingerNameToken, SongTitlesToken);
-- For example, the INSERT statement below generates SingerNameToken of
-- 'Catalina Smith', and SongTitlesToken of
-- ['Starting Again', 'The Second Time'].
INSERT INTO Albums (SingerId, AlbumId, SingerName, SongTitles)
VALUES (1, 1, 'Catalina Smith', ['Starting Again', 'The Second Time']);
The following query finds rows where the column SingerName is equal to
Catalina Smith. The query optimizer can choose to accelerate the condition
using AlbumsIndex with SingerNameToken. Optionally, the query can provide
@{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex.
SELECT a.AlbumId
FROM Albums @{force_index = AlbumsIndex} a
WHERE a.SingerName = 'Catalina Smith';
/*---------*
| AlbumId |
+---------+
| 1 |
*---------*/
The following query is like the previous ones. However, this time the query
searches for SongTitles values that contain the string Starting Again. Array
conditions must use the ARRAY_INCLUDES, ARRAY_INCLUDES_ANY, or
ARRAY_INCLUDES_ALL functions to be eligible for acceleration with a search
index.
SELECT a.AlbumId
FROM Albums a
WHERE ARRAY_INCLUDES(a.SongTitles, 'Starting Again');
/*---------*
| AlbumId |
+---------+
| 1 |
*---------*/
TOKENIZE_BOOL
TOKENIZE_BOOL(value_to_tokenize)
Description
Constructs a boolean TOKENLIST
value by tokenizing a BOOL
value to
accelerate boolean match expressions in SQL.
Definitions
- value_to_tokenize: A BOOL or ARRAY&lt;BOOL&gt; value to tokenize for boolean match.
Details
- This function returns NULL when value_to_tokenize is NULL.
Return type
TOKENLIST
Examples
The Albums table contains a column called IsAwardedToken, which tokenizes
the IsAwarded column using the TOKENIZE_BOOL function. Finally, AlbumsIndex
indexes IsAwardedToken, which makes it possible for Spanner
to use the index to accelerate boolean-match expressions in SQL.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
IsAwarded BOOL,
IsAwardedToken TOKENLIST AS (TOKENIZE_BOOL(IsAwarded)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(IsAwardedToken);
-- IsAwarded with TRUE generates IsAwardedToken with value 'y'.
INSERT INTO Albums (SingerId, AlbumId, IsAwarded) VALUES (1, 1, TRUE);
-- IsAwarded with FALSE generates IsAwardedToken with value 'n'.
INSERT INTO Albums (SingerId, AlbumId, IsAwarded) VALUES (1, 2, FALSE);
-- NULL IsAwarded generates IsAwardedToken with value NULL.
INSERT INTO Albums (SingerId, AlbumId) VALUES (1, 3);
The following query finds rows where the column IsAwarded is equal to TRUE.
The query optimizer can choose to accelerate the condition using AlbumsIndex
with IsAwardedToken. Optionally, the query can provide
@{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex.
SELECT a.AlbumId
FROM Albums @{force_index = AlbumsIndex} a
WHERE IsAwarded = TRUE;
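Similarly, for the rows inserted above, querying for IsAwarded = FALSE matches only the second album; the row with a NULL IsAwarded generates a NULL token and doesn't match either condition:

```sql
SELECT a.AlbumId
FROM Albums @{force_index = AlbumsIndex} a
WHERE IsAwarded = FALSE;

/*---------*
 | AlbumId |
 +---------+
 | 2       |
 *---------*/
```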
TOKENIZE_FULLTEXT
TOKENIZE_FULLTEXT(
value_to_tokenize
[, language_tag => value ]
[, content_type => { "text/plain" | "text/html" } ]
[, token_category => { "small" | "medium" | "large" | "title" } ]
)
Description
Constructs a full-text TOKENLIST
value by tokenizing text for full-text
matching.
Definitions
- value_to_tokenize: A STRING or ARRAY&lt;STRING&gt; value to tokenize for full-text search.
- language_tag: A named argument with a STRING value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for value_to_tokenize. If the value for this argument is NULL, this function doesn't use a specific language. If this argument is not specified, NULL is used by default.
- content_type: A named argument with a STRING value. Indicates the MIME type of value_to_tokenize. This can be:
  - "text/plain" (default): the value contains plain text. All tokens are assigned to the small token category.
  - "text/html": the value contains HTML. The HTML tags are removed. HTML-escaped entities are replaced with their unescaped equivalents (for example, &lt; becomes <). A token category is assigned to each token depending on its prominence in the HTML. For example, bolded text or text in an <h1> tag might have higher prominence than normal text and thus might be placed into a different token category. Token categories are used during scoring to boost the weight of high-prominence tokens.
- token_category: A named argument with a STRING value. Sets or overrides the token importance signals detected by the tokenizer and used by the scorer. Useful for cases where two or more tokenlists will be combined with TOKENLIST_CONCAT and one of the input columns is known to have higher or lower than usual importance. Allowed values:
  - "small": The category with the lowest importance.
  - "medium": The category with the second lowest importance.
  - "large": The category with the second highest importance.
  - "title": The category with the highest importance.
Details
- This function returns NULL when value_to_tokenize is NULL.
Return type
TOKENLIST
Examples
In the following example, a TOKENLIST
column is created using the
TOKENIZE_FULLTEXT
function:
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Title STRING(MAX),
Description STRING(MAX),
DescriptionTokens TOKENLIST AS (TOKENIZE_FULLTEXT(Description)) HIDDEN,
TitleTokens TOKENLIST AS (
TOKENIZE_FULLTEXT(Title, token_category=>"title")) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
-- DescriptionTokens and TitleTokens are generated from the Description and
-- Title values, using the TOKENIZE_FULLTEXT function. For example, the
-- following INSERT statement generates DescriptionTokens with the tokens
-- ['rock', 'album']. TitleTokens will contain ['abbey', 'road'], and these
-- tokens will be assigned to the "title" token category.
INSERT INTO Albums (SingerId, AlbumId, Title, Description)
VALUES (1, 1, 'Abbey Road', 'rock album');
-- Capitalization and delimiters are removed during tokenization. For example,
-- the following INSERT statement generates DescriptionTokens with the tokens
-- ['classical', 'albums'].
INSERT INTO Albums (SingerId, AlbumId, Description) VALUES (1, 2, 'Classical, Albums.');
To query a full-text TOKENLIST
column, see the
SEARCH function.
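As a sketch of how the SEARCH function pairs with these columns (the index name AlbumsTextIndex is an assumption; full-text TOKENLIST columns must be in a search index before they can be queried):

```sql
CREATE SEARCH INDEX AlbumsTextIndex ON Albums(DescriptionTokens, TitleTokens);

-- Matches rows whose Description contains both 'rock' and 'album'.
SELECT a.AlbumId
FROM Albums a
WHERE SEARCH(a.DescriptionTokens, 'rock album');
```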
TOKENIZE_NGRAMS
TOKENIZE_NGRAMS(
value_to_tokenize
[, ngram_size_min => value ]
[, ngram_size_max => value ]
[, remove_diacritics => { TRUE | FALSE } ]
)
Description
Constructs an n-gram TOKENLIST
value by tokenizing a STRING
value for
matching n-grams.
Definitions
- value_to_tokenize: A STRING value to tokenize for the n-gram match.
- remove_diacritics: A named argument with a BOOL value. If TRUE, diacritics are removed from value_to_tokenize before indexing. This is useful when you want to search a substring or n-gram regardless of diacritics. When a search query is called on a TOKENLIST value with remove_diacritics set to TRUE, the diacritics are also removed at query time from the search queries.
- ngram_size_min: A named argument with an INT64 value. The value is the minimum length of the n-gram tokens to generate. The default value for this argument is 1. This argument must be less than or equal to ngram_size_max. Queries or TOKENLIST values generated from input values shorter than ngram_size_min never match anything, since no n-grams are generated for such queries or values. We recommend tuning ngram_size_min only when the developer controls the queries and can ensure that the minimum query length is at least ngram_size_min.
- ngram_size_max: A named argument with an INT64 value. The value is the maximum size of each n-gram token to generate. A larger value can improve retrieval performance by reducing the number of non-matching records that have any given n-gram. However, larger values can substantially increase index sizes and write costs. When using the SEARCH_NGRAMS function, ngram_size_max is the length of n-grams shared between the query and the value that are counted when deciding whether a value reaches the min_ngrams or min_ngrams_percent threshold. For more information, see SEARCH_NGRAMS. The default value for this argument is 4. When using the TOKENLIST result with the SEARCH_NGRAMS function, we recommend setting this value to 3.
Details
- This function returns NULL when value_to_tokenize is NULL.
Return type
TOKENLIST
Examples
In the following example, a TOKENLIST column is created using the
TOKENIZE_NGRAMS function. The INSERT generates a TOKENLIST that contains
two sets of tokens. First, the whole string is broken up into n-grams with a
length in the range [ngram_size_min, ngram_size_max-1]. Capitalization and
whitespace are preserved in the n-grams. These n-grams are placed in the first
position in the tokenlist.
[" ", " M", " Me", "vy ", "y ", "y M", "H", "He", "Hea", "Heav", ...], ...
Second, any n-grams with length equal to ngram_size_max are stored in
sequence, with the first of these in the same position as the smaller n-grams.
(In this example, the "Heav" token is in the first position.)
..., "eavy", "avy ", "vy M", "y Me", " Met", "Meta", "etal"
CREATE TABLE Albums (
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionNgramTokens TOKENLIST AS (TOKENIZE_NGRAMS(Description)) HIDDEN
) PRIMARY KEY (AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(DescriptionNgramTokens);
INSERT INTO Albums (AlbumId, Description) VALUES (1, 'Heavy Metal');
To query an n-gram TOKENLIST
column, see the
SEARCH_NGRAMS function.
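As a sketch using the AlbumsIndex created above (default SEARCH_NGRAMS thresholds assumed), a fuzzy query can match despite a misspelling because the query and the stored value still share enough n-grams:

```sql
SELECT a.AlbumId
FROM Albums a
WHERE SEARCH_NGRAMS(a.DescriptionNgramTokens, 'Haevy Metal');
```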
TOKENIZE_NUMBER
TOKENIZE_NUMBER(
value_to_tokenize
[, comparison_type => { "all" | "equality" } ]
[, algorithm => { "logtree" | "prefixtree" | "floatingpoint" } ]
[, min => value ]
[, max => value ]
[, granularity => value ]
[, tree_base => value ]
[, precision => value ]
)
Description
Constructs a numeric TOKENLIST
value by tokenizing numeric values to
accelerate numeric comparison expressions in SQL.
Definitions
- value_to_tokenize: An INT64, FLOAT32, FLOAT64, or ARRAY of these types to tokenize for numeric comparison expressions.
- comparison_type: A named argument with a STRING value. The value represents the type of comparison to use for numeric expressions. Set to equality to save space if equality is the only required comparison. The default is all.
- algorithm: A named argument with a STRING value. The value indicates the indexing algorithm to use. Supported algorithms are limited, depending on the type of value being indexed. The default is logtree. FLOAT32 or FLOAT64 values must not use the default. They should specify the algorithm and must also use min and max when using the logtree or prefixtree algorithms.
  - logtree: Use for indexing uniformly distributed data. min and max must be specified if value_to_tokenize is FLOAT32 or FLOAT64.
  - prefixtree: Use when indexing exponentially distributed data and when the query predicate is of the form "@param > number" or "@param >= number" (ranges without an upper bound). Compared to logtree, this algorithm generates fewer index tokens for small numbers. For queries where the WHERE clause contains the predicate previously described, prefixtree generates fewer query tokens, which can improve performance. min and max must be specified if value_to_tokenize is FLOAT32 or FLOAT64.
  - floatingpoint: Use for indexing FLOAT32 or FLOAT64 values where the indexed data and queries often contain fractions. When tokenizing FLOAT32 or FLOAT64 using logtree or prefixtree, TOKENIZE_NUMBER might lose precision as the count of granularity buckets in the min to max range approaches the maximum resolution of floating point numbers. This can make queries less efficient, but it doesn't cause incorrect behavior. This loss of precision doesn't happen with the floatingpoint algorithm if the precision argument is set high enough. However, the floatingpoint algorithm generates more index tokens when precision is set to a larger value.
- min: A named argument with the same type as value_to_tokenize. Values less than min are indexed in the same index bucket. This doesn't cause incorrect results, but might cause significant over-retrieval for queries with a range that includes values less than min. Don't use min when comparison_type is equality.
- max: A named argument with the same type as value_to_tokenize. Values greater than max are indexed in the same index bucket. This doesn't cause incorrect results, but might cause significant over-retrieval for queries with a range that includes values greater than max. Don't use max when comparison_type is equality.
- granularity: A named argument with the same type as value_to_tokenize. The value represents the width of each indexing bucket. Values in the same bucket are indexed together, so larger buckets are more storage efficient, but might cause over-retrieval, causing high latency during query execution. granularity is only allowed when algorithm is logtree or prefixtree.
- tree_base: A named argument with an INT64 value. The value is the numerical base of a tree for tree-based algorithms. For example, the value 2 means that each tree token represents some power-of-two number of buckets. In the case of a value indexed in the 1024th bucket, there is a token for [1024, 1024], then a token for [1024, 1025], then a token for [1024, 1027], and so on. Increasing tree_base reduces the required number of index tokens and increases the required number of query tokens. The default value is 2. tree_base is only allowed when algorithm is logtree or prefixtree.
- precision: A named argument with an INT64 value. Reducing the precision reduces the number of index tokens, but increases over-retrieval when queries specify ranges with a high number of significant digits. The default value is 15. precision is only allowed when algorithm is floatingpoint.
Details
- This function returns NULL when value_to_tokenize is NULL.
- The tree_base parameter controls the width of each tree bucket in the logtree and prefixtree algorithms. Both algorithms generate tokens representing nodes in a base-ary tree, where the width of a node is base^distance_from_leaf. The algorithms differ in that prefixtree omits some of the tree nodes in favor of greater-than tokens that accelerate greater-than queries. When a larger base is selected, fewer index tokens are generated. However, larger base values increase the maximum number of query tokens required.
- Numbers that fall outside of the [min, max] range are all indexed into two buckets: one for all numbers less than min, and the other for all numbers greater than max. This might cause significant over-retrieval (retrieval of too many candidate results) when the range requested by the query also includes numbers outside of the range. For this reason, set min and max to the narrowest possible values that encompass all input numbers. Like all tokenization configurations, changing the min and max values requires a rebuild of the numeric index, so leave room to grow if the final domain of a column isn't known. The problem of over-retrieval is not a correctness problem, as all potential matches are checked against non-bucketized numbers at the end of the search process; it's only a potential efficiency issue.
- The granularity argument controls the rate of downsampling that's applied to numbers before they are indexed in the tree-based algorithms. Before each number is tokenized, it's sorted into buckets with a width equal to granularity. All the numbers in the same granularity bucket get the same tokens. This means that over-retrieval might occur if the granularity value is set to anything other than 1 for integral numbers. Over-retrieval is always possible for FLOAT64 numbers. It also means that if numeric values change by a small amount, most of their tokens don't need to be reindexed. Using a granularity higher than 1 also reduces the number of tokens that the algorithm needs to generate, but the effect is less significant than the effect of increasing the base. Therefore, we recommend setting granularity to 1.
Return type
TOKENLIST
Examples
The Albums table contains columns called RatingTokens and TrackRatingTokens,
which tokenize the Rating and TrackRating columns, respectively, using the
TOKENIZE_NUMBER function. Finally, AlbumsIndex indexes RatingTokens and
TrackRatingTokens, which makes it possible for Spanner to use the index to
accelerate numeric comparison expressions in SQL.
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Rating INT64,
RatingTokens TOKENLIST AS (TOKENIZE_NUMBER(Rating)) HIDDEN,
TrackRating ARRAY<INT64>,
TrackRatingTokens TOKENLIST AS (TOKENIZE_NUMBER(TrackRating)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
CREATE SEARCH INDEX AlbumsIndex ON Albums(RatingTokens, TrackRatingTokens);
-- RatingTokens and TrackRatingTokens are generated from Rating and TrackRating
-- values, respectively, using the TOKENIZE_NUMBER function.
INSERT INTO Albums (SingerId, AlbumId, Rating, TrackRating) VALUES (1, 1, 2, [2, 3]);
INSERT INTO Albums (SingerId, AlbumId, Rating, TrackRating) VALUES (1, 2, 5, [3, 5]);
The following query finds rows in which the column Rating is equal to 5. The
query optimizer might choose to accelerate the condition using AlbumsIndex
with RatingTokens. Optionally, the query can provide
@{force_index = AlbumsIndex} to force the optimizer to use AlbumsIndex.
SELECT a.AlbumId
FROM Albums @{force_index = AlbumsIndex} a
WHERE a.Rating = 5;
/*---------*
| AlbumId |
+---------+
| 2 |
*---------*/
The following query is like the previous one. However, this time the condition
is on the array column TrackRating. Array conditions must use the
ARRAY_INCLUDES, ARRAY_INCLUDES_ANY, or ARRAY_INCLUDES_ALL functions to be
eligible for acceleration with a search index.
SELECT a.AlbumId
FROM Albums a
WHERE ARRAY_INCLUDES_ALL(a.TrackRating, [2, 3]);
/*---------*
| AlbumId |
+---------+
| 1 |
*---------*/
SELECT a.AlbumId
FROM Albums a
WHERE ARRAY_INCLUDES_ANY(a.TrackRating, [3, 4, 5]);
/*---------*
| AlbumId |
+---------+
| 1 |
| 2 |
*---------*/
The following query is like the previous ones. However, this time the condition
is a range. This query can also be accelerated, as the default comparison_type
is all, which covers both equality and range comparisons.
SELECT a.AlbumId
FROM Albums a
WHERE a.Rating >= 2;
/*---------*
| AlbumId |
+---------+
| 1 |
| 2 |
*---------*/
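The examples above use INT64 columns, where the defaults apply. Per the definitions above, FLOAT32 and FLOAT64 columns must specify an algorithm, and the tree-based algorithms additionally require min and max. A minimal sketch (the Tracks table and its columns are hypothetical):

```sql
CREATE TABLE Tracks (
  TrackId INT64 NOT NULL,
  DurationSeconds FLOAT64,
  -- logtree over FLOAT64 requires explicit min and max bounds.
  DurationTokens TOKENLIST AS (
    TOKENIZE_NUMBER(
      DurationSeconds,
      algorithm => "logtree",
      min => 0.0,
      max => 3600.0)) HIDDEN
) PRIMARY KEY (TrackId);
```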
TOKENIZE_SUBSTRING
TOKENIZE_SUBSTRING(
value_to_tokenize
[, language_tag => value ]
[, ngram_size_min => value ]
[, ngram_size_max => value ]
[, relative_search_types => value ]
[, content_type => { "text/plain" | "text/html" } ]
[, remove_diacritics => { TRUE | FALSE } ]
[, short_tokens_only_for_anchors => { TRUE | FALSE } ]
)
Description
Constructs a substring TOKENLIST
value, which tokenizes text for substring
matching.
Definitions
- value_to_tokenize: A STRING or ARRAY&lt;STRING&gt; value to tokenize for a substring search.
- language_tag: A named argument with a STRING value. The value contains an IETF BCP 47 language tag. You can use this tag to specify the language for value_to_tokenize. If the value for this argument is NULL, this function doesn't use a specific language. If this argument is not specified, NULL is used by default.
- relative_search_types: A named argument with an ARRAY&lt;STRING&gt; value. The value determines which TOKENIZE_SUBSTRING relative search types are supported. See SEARCH_SUBSTRING for a list of the different relative search types. In addition to the relative search types from SEARCH_SUBSTRING, TOKENIZE_SUBSTRING accepts a special flag, all, which means that all relative search types are supported. If this argument is not used, then no relative search tokens are generated for the resulting TOKENLIST value. Setting this value causes extra anchor tokens to be generated to enable relative searches. A given relative search type can only be used in a query if that type, or all, is present in the relative_search_types argument. By default, relative_search_types is empty.
- content_type: A named argument with a STRING value. Indicates the MIME type of value_to_tokenize. This can be:
  - "text/plain" (default): the value contains plain text. All tokens are assigned to the small token category.
  - "text/html": the value contains HTML. The HTML tags are removed. HTML-escaped entities are replaced with their unescaped equivalents (for example, &lt; becomes <). A token category is assigned to each token depending on its prominence in the HTML. For example, bolded text or text in an <h1> tag might have higher prominence than normal text and thus might be placed into a different token category. Token categories are used during scoring to boost the weight of high-prominence tokens.
- remove_diacritics: A named argument with a BOOL value. If TRUE, diacritics are removed from value_to_tokenize before indexing. This is useful when you want to search a substring or n-gram regardless of diacritics. When a search query is called on a TOKENLIST value with remove_diacritics set to TRUE, the diacritics are also removed at query time from the search queries.
- ngram_size_min: A named argument with an INT64 value. The value is the minimum length of the n-gram tokens to generate. The default value for this argument is 1. This argument must be less than or equal to ngram_size_max. Whole words that are shorter than ngram_size_min ignore this argument and are emitted regardless of how short they are. SEARCH_SUBSTRING can retrieve values containing such words, but only if the query text exactly matches the whole word.
- ngram_size_max: A named argument with an INT64 value. The value is the maximum size of each n-gram token to generate. A larger value can improve retrieval performance by reducing the number of non-matching records that have any given n-gram. However, larger values can substantially increase index sizes and write costs. When using the SEARCH_NGRAMS function, ngram_size_max is the length of n-grams shared between the query and the value that are counted when deciding whether a value reaches the min_ngrams or min_ngrams_percent threshold. For more information, see SEARCH_NGRAMS. The default value for this argument is 4. When using the TOKENLIST result with the SEARCH_NGRAMS function, we recommend setting this value to 3. When using the TOKENLIST result with SEARCH_SUBSTRING, we recommend not tuning this value unless retrieval performance problems related to excessively common n-grams are detected.
- short_tokens_only_for_anchors: A named argument with a BOOL value. If TRUE, the TOKENLIST emitted by this function does not contain short n-grams (those with sizes less than ngram_size_max), except when those n-grams are part of one of the anchors used to support the prefix and suffix relative_search_types settings. The default value is FALSE. Setting this to TRUE can reduce the number of n-grams generated. However, it causes SEARCH_SUBSTRING to return FALSE for short query terms when relative_search_types is not one of the prefix or suffix modes. Therefore, we recommend setting this only when relative_search_types will always be set to a prefix or suffix mode.
Details
- This function returns NULL when value_to_tokenize is NULL.
Return type
TOKENLIST
Example
In the following example, a TOKENLIST column is created using the
TOKENIZE_SUBSTRING function. The INSERT generates a TOKENLIST that
contains two sets of tokens. First, each word is broken up into n-grams with a
length in the range [ngram_size_min, ngram_size_max-1], along with any whole
words shorter than ngram_size_max. All of these tokens are placed
in the first position in the tokenlist.
[a, al, av, avy, e, ea, eav, et, eta, h, he, hea, ...], ...
Second, any n-grams with length equal to ngram_size_max
are stored in
subsequent positions. These tokens are used when searching for words larger than
the maximum ngram size.
..., heav, eavy, <gap(1)>, meta, etal
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
Description STRING(MAX),
DescriptionSubstrTokens TOKENLIST
AS (TOKENIZE_SUBSTRING(Description, ngram_size_min=>1, ngram_size_max=>4)) HIDDEN
) PRIMARY KEY (SingerId, AlbumId);
INSERT INTO Albums (SingerId, AlbumId, Description)
VALUES (1, 1, 'Heavy Metal');
To query a substring TOKENLIST
column, see the
SEARCH_SUBSTRING or
SEARCH_NGRAMS function.
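As a sketch of how SEARCH_SUBSTRING pairs with this column (the index name AlbumsSubstrIndex is an assumption; substring TOKENLIST columns must be in a search index before they can be queried):

```sql
CREATE SEARCH INDEX AlbumsSubstrIndex ON Albums(DescriptionSubstrTokens);

-- Matches because 'eavy Met' is a substring of 'Heavy Metal'.
SELECT a.AlbumId
FROM Albums a
WHERE SEARCH_SUBSTRING(a.DescriptionSubstrTokens, 'eavy Met');
```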