GoogleSQL for BigQuery supports the following search functions.
Function list
Name | Summary |
---|---|
SEARCH
|
Checks to see whether a table or other search data contains a set of search terms. |
VECTOR_SEARCH
|
Performs a vector search on embeddings to find semantically similar entities. |
SEARCH
SEARCH(
data_to_search, search_query
[, json_scope=>{ 'JSON_VALUES' | 'JSON_KEYS' | 'JSON_KEYS_AND_VALUES' }]
[, analyzer=>{ 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER'}]
[, analyzer_options=>analyzer_options_values]
)
Description
The SEARCH
function checks to see whether a BigQuery table or other
search data contains a set of search terms (tokens). It returns TRUE
if all
search terms appear in the data, based on the text analysis
described in the text analyzer, and FALSE
otherwise.
Definitions
data_to_search
: The data to search over. The value can be:- Any GoogleSQL data type literal
- A list of columns
- A table reference
- A column of any type
A table reference is evaluated as a
STRUCT
whose fields are the columns of the table.data_to_search
can be any type, butSEARCH
will returnFALSE
for all types except those listed here:ARRAY<STRING>
ARRAY<STRUCT>
JSON
STRING
STRUCT
You can search for string literals in columns of the preceding types. For additional rules, see Search data rules.
search_query
: ASTRING
literal, or aSTRING
constant expression that represents the terms of the search query. Ifsearch_query
isNULL
, an error is returned. Ifsearch_query
contains no tokens and the text analyzer isLOG_ANALYZER
, an error is returned.json_scope
: Optional mandatory-named argument that takes one of the following values to indicate the scope of JSON data to be searched. It has no effect ifdata_to_search
isn't a JSON value or doesn't contain a JSON field.'JSON_VALUES'
(default): Only the JSON values are searched. Ifjson_scope
isn't provided, this is used by default.'JSON_KEYS'
: Only the JSON keys are searched.'JSON_KEYS_AND_VALUES'
: The JSON keys and values are searched.
analyzer
: Optional mandatory-named argument that takes one of the following values to indicate the text analyzer to use:'LOG_ANALYZER'
(default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. For more information, seeLOG_ANALYZER
.'NO_OP_ANALYZER'
: Extracts the text as a single term (token), but doesn't apply normalization. For more information about this analyzer, seeNO_OP_ANALYZER
.'PATTERN_ANALYZER'
: Breaks the input into terms that match a regular expression. For more information, seePATTERN_ANALYZER
text analyzer.
analyzer_options
: Optional mandatory-named argument that takes a list of text analysis rules as a JSON-formattedSTRING
. For more information, see Text analyzer options.
Details
The SEARCH
function is designed to work with search indexes to
optimize point lookups. Although the SEARCH
function works for
tables that aren't indexed, its performance will be greatly improved with a
search index. If both the analyzer and analyzer options match the one used
to create the index, the search index will be used.
Rules for search_query
Backtick rules for search_query
:
If the
LOG_ANALYZER
text analyzer is used, text enclosed in backticks forces an exact match.For example,
`Hello World` happy days
becomesHello World
,happy
, anddays
.Search terms enclosed in backticks must match exactly in
data_to_search
, subject to the following conditions:It appears at the start of
data_to_search
or is immediately preceded by a delimiter.It appears at the end of
data_to_search
or is immediately followed by a delimiter.
For example,
SEARCH('foo.bar', '`foo.`')
returnsFALSE
because the text enclosed in the backticksfoo.
is immediately followed by the characterb
in the search datafoo.bar
, rather than by a delimiter or the end of the string. However,SEARCH('foo..bar', '`foo.`')
returnsTRUE
becausefoo.
is immediately followed by the delimiter.
in the search data.The backtick itself can be escaped using a backslash, as in
\`foobar\`
.The following are reserved words and must be enclosed in backticks:
AND
,NOT
,OR
,IN
, andNEAR
Reserved character rules for search_query
:
Text not enclosed in backticks requires the following reserved characters to be escaped by a double backslash
\\
:[ ] < > ( ) { } | ! ' " * & ? + / : = - \ ~ ^
If the quoted string is preceded by the character
r
orR
, such asr"my\+string"
, then it is treated as a raw string and only a single backslash is required to escape the reserved characters. For more information about raw strings and escape sequences, see String and byte literals.
How search_query
is broken into searchable terms
The following table shows how search_query
is broken into
searchable terms by the LOG_ANALYZER
text analyzer. All entries are strings.
search_query | searchable terms |
---|---|
127.0.0.1 | 127 0 1 127.0.0.1 . 127.0.0 127.0 0.0 0.0.1 0.1 |
foobar@example.com | foobar example com foobar@example example.com foobar@example.com |
The fox. | the fox The The fox The fox. fox fox. |
The following table shows how search_query
is broken into query
terms by the LOG_ANALYZER
text analyzer. All entries are strings.
search_query | query terms |
---|---|
127.0.0.1 | 127 0 1 |
`127.0.0.1` | 127.0.0.1 |
foobar@example.com | foobar example com |
`foobar@example.com` | foobar@example.com |
Rules for data_to_search
General rules for data_to_search
:
data_to_search
must contain all terms, in any order, from thesearch_query
for the function to returnTRUE
.- To perform a cross-field search,
data_to_search
must be aSTRUCT
,ARRAY
, orJSON
data type. - Each
STRING
field in a compound data type is individually searched for terms. If at least one field in
data_to_search
includes all search terms in any order,SEARCH
returnsTRUE
. Otherwise it has the following behavior:If at least one
STRING
field isNULL
,SEARCH
returnsNULL
.Otherwise,
SEARCH
returnsFALSE
.
Return type
BOOL
Examples
The following queries show how tokens in search_query
are analyzed
by a SEARCH
function call using the default analyzer, LOG_ANALYZER
:
SELECT
-- ERROR: `search_query` is NULL.
SEARCH('foobarexample', NULL) AS a,
-- ERROR: `search_query` contains no tokens.
SEARCH('foobarexample', '') AS b,
SELECT
-- TRUE: '-' and ' ' are delimiters.
SEARCH('foobar-example', 'foobar example') AS a,
-- TRUE: The search query is a constant expression evaluated to 'foobar'.
SEARCH('foobar-example', CONCAT('foo', 'bar')) AS b,
-- FALSE: The search_query is not split.
SEARCH('foobar-example', 'foobarexample') AS c,
-- TRUE: The double backslash escapes the ampersand which is a delimiter.
SEARCH('foobar-example', 'foobar\\&example') AS d,
-- TRUE: The single backslash escapes the ampersand in a raw string.
SEARCH('foobar-example', R'foobar\&example')AS e,
-- FALSE: The backticks indicate that there must be an exact match for
-- foobar&example.
SEARCH('foobar-example', '`foobar&example`') AS f,
-- TRUE: An exact match is found.
SEARCH('foobar&example', '`foobar&example`') AS g
/*-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g |
+-------+-------+-------+-------+-------+-------+-------+
| true | true | false | true | true | false | true |
*-------+-------+-------+-------+-------+-------+-------*/
SELECT
-- TRUE: The order of terms doesn't matter.
SEARCH('foobar-example', 'example foobar') AS a,
-- TRUE: Tokens are made lower-case.
SEARCH('foobar-example', 'Foobar Example') AS b,
-- TRUE: An exact match is found.
SEARCH('foobar-example', '`foobar-example`') AS c,
-- FALSE: Backticks preserve capitalization.
SEARCH('foobar-example', '`Foobar`') AS d,
-- FALSE: Backticks don't have special meaning for search_data and are
-- not delimiters in the default LOG_ANALYZER.
SEARCH('`foobar-example`', '`foobar-example`') AS e,
-- TRUE: An exact match is found after the delimiter in search_data.
SEARCH('foobar@example.com', '`example.com`') AS f,
-- TRUE: An exact match is found between the space delimiters.
SEARCH('a foobar-example b', '`foobar-example`') AS g;
/*-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g |
+-------+-------+-------+-------+-------+-------+-------+
| true | true | true | false | false | true | true |
*-------+-------+-------+-------+-------+-------+-------*/
SELECT
-- FALSE: No single array entry matches all search terms.
SEARCH(['foobar', 'example'], 'foobar example') AS a,
-- FALSE: The search_query is equivalent to foobar\\=.
SEARCH('foobar=', '`foobar\\=`') AS b,
-- FALSE: This is equivalent to the previous example.
SEARCH('foobar=', R'`\foobar=`') AS c,
-- TRUE: The equals sign is a delimiter in the data and query.
SEARCH('foobar=', 'foobar\\=') AS d,
-- TRUE: This is equivalent to the previous example.
SEARCH('foobar=', R'foobar\=') AS e,
-- TRUE: An exact match is found.
SEARCH('foobar.example', '`foobar`') AS f,
-- FALSE: `foobar.\` is not analyzed because of backticks; it is not
-- followed by a delimiter in search_data 'foobar.example'.
SEARCH('foobar.example', '`foobar.\`') AS g,
-- TRUE: `foobar.` is not analyzed because of backticks; it is
-- followed by the delimiter '.' in search_data 'foobar..example'.
SEARCH('foobar..example', '`foobar.`') AS h;
/*-------+-------+-------+-------+-------+-------+-------+-------*
| a | b | c | d | e | f | g | h |
+-------+-------+-------+-------+-------+-------+-------+-------+
| false | false | false | true | true | true | false | true |
*-------+-------+-------+-------+-------+-------+-------+-------*/
The following query shows examples of calls to the SEARCH
function using the
NO_OP_ANALYZER
text analyzer and reasons for various return values:
SELECT
-- TRUE: exact match
SEARCH('foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') AS a,
-- FALSE: Backticks are not special characters for `NO_OP_ANALYZER`.
SEARCH('foobar', '\`foobar\`', analyzer=>'NO_OP_ANALYZER') AS b,
-- FALSE: The capitalization does not match.
SEARCH('foobar', 'Foobar', analyzer=>'NO_OP_ANALYZER') AS c,
-- FALSE: There are no delimiters for `NO_OP_ANALYZER`.
SEARCH('foobar example', 'foobar', analyzer=>'NO_OP_ANALYZER') AS d,
-- TRUE: An exact match is found.
SEARCH('', '', analyzer=>'NO_OP_ANALYZER') AS e;
/*-------+-------+-------+-------+-------*
| a | b | c | d | e |
+-------+-------+-------+-------+-------+
| true | false | false | false | true |
*-------+-------+-------+-------+-------*/
Consider the following table called meals
with columns breakfast
, lunch
,
and dinner
:
/*-------------------+-------------------------+------------------*
| breakfast | lunch | dinner |
+-------------------+-------------------------+------------------+
| Potato pancakes | Toasted cheese sandwich | Beef soup |
| Avocado toast | Tomato soup | Chicken soup |
*-------------------+-------------------------+------------------*/
The following query shows how to search single columns, multiple columns, and
whole tables, using the default LOG_ANALYZER
text analyzer
with the default analyzer options:
WITH
meals AS (
SELECT
'Potato pancakes' AS breakfast,
'Toasted cheese sandwich' AS lunch,
'Beef soup' AS dinner
UNION ALL
SELECT
'Avocado toast' AS breakfast,
'Tomato soup' AS lunch,
'Chicken soup' AS dinner
)
SELECT
SEARCH(lunch, 'soup') AS lunch_soup,
SEARCH((breakfast, dinner), 'soup') AS breakfast_or_dinner_soup,
SEARCH(meals, 'soup') AS anytime_soup
FROM meals;
/*------------+--------------------------+--------------*
| lunch_soup | breakfast_or_dinner_soup | anytime_soup |
+------------+--------------------------+--------------+
| false | true | true |
| true | true | true |
*------------+--------------------------+--------------*/
The following query shows additional ways to search, using the
default LOG_ANALYZER
text analyzer with
default analyzer options:
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'exam') AS a,
SEARCH(email, 'foobar') AS b,
SEARCH(email, 'example.com') AS c
FROM data;
/*-------+-------+-------*
| a | b | c |
+-------+-------+-------+
| false | true | true |
*-------+-------+-------*/
The following query shows additional ways to search, using the
default LOG_ANALYZER
text analyzer with custom
analyzer options. Terms are only split when a space or @
symbol is
encountered.
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'foobar', analyzer_options=>'{"delimiters": [" ", "@"]}') AS a,
SEARCH(email, 'example', analyzer_options=>'{"delimiters": [" ", "@"]}') AS b,
SEARCH(email, 'example.com', analyzer_options=>'{"delimiters": [" ", "@"]}') AS c,
SEARCH(email, 'foobar@example.com', analyzer_options=>'{"delimiters": [" ", "@"]}') AS d
FROM data;
/*-------+-------+-------+-------*
| a | b | c | d |
+-------+-------+-------+-------+
| true | false | true | true |
*-------+-------+-------+-------*/
The following query shows how to search, using the
NO_OP_ANALYZER
text analyzer:
WITH meals AS ( SELECT 'Tomato soup' AS lunch )
SELECT
SEARCH(lunch, 'Tomato soup', analyzer=>'NO_OP_ANALYZER') AS a,
SEARCH(lunch, 'soup', analyzer=>'NO_OP_ANALYZER') AS b,
SEARCH(lunch, 'tomato soup', analyzer=>'NO_OP_ANALYZER') AS c
FROM meals;
/*-------+-------+-------*
| a | b | c |
+-------+-------+-------+
| true | false | false |
*-------+-------+-------*/
The following query shows how to use the PATTERN_ANALYZER
text analyzer with default analyzer options:
WITH data AS ( SELECT 'Please use foobar@example.com as your email.' AS email )
SELECT
SEARCH(email, 'exam', analyzer=>'PATTERN_ANALYZER') AS a,
SEARCH(email, 'foobar', analyzer=>'PATTERN_ANALYZER') AS b,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER') AS c
FROM data;
/*-------+-------+-------*
| a | b | c |
+-------+-------+-------+
| false | true | true |
*-------+-------+-------*/
The following query shows additional ways to search, using the
PATTERN_ANALYZER
text analyzer with
custom analyzer options:
WITH data AS ( SELECT 'Please use foobar@EXAMPLE.com as your email.' AS email )
SELECT
SEARCH(email, 'EXAMPLE', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[A-Z]*"]}') AS a,
SEARCH(email, 'example', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-z]*"]}') AS b,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-z]*"]}') AS c,
SEARCH(email, 'example.com', analyzer=>'PATTERN_ANALYZER', analyzer_options=>'{"patterns": ["[a-zA-Z.]*"]}') AS d
FROM data;
/*-------+-------+-------+-------*
| a | b | c | d |
+-------+-------+-------+-------+
| true | false | false | true |
*-------+-------+-------+-------*/
For additional examples that include analyzer options, see the Text analysis reference guide.
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see the Search with text analyzers user guide.
VECTOR_SEARCH
To provide feedback or request support for this feature, send email to bq-vector-search@google.com.
VECTOR_SEARCH(
TABLE base_table,
column_to_search,
TABLE query_table
[, query_column_to_search => query_column_to_search_value]
[, top_k => top_k_value ]
[, distance_type => distance_type_value ]
[, options => options_value ]
)
VECTOR_SEARCH(
TABLE base_table,
column_to_search,
(query_statement)
[, query_column_to_search => query_column_to_search_value]
[, top_k => top_k_value ]
[, distance_type => distance_type_value ]
[, options => options_value ]
)
Description
The VECTOR_SEARCH
function lets you search embeddings to find semantically
similar entities.
Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar.
Definitions
base_table
: The table to search for nearest neighbor embeddings.column_to_search
: The name of the base table column to search for nearest neighbor embeddings. The column must have a type ofARRAY<FLOAT64>
. All elements in the array must be non-NULL
, and all values in the column must have the same array dimensions. If the column has a vector index, BigQuery attempts to use it. To determine if an index was used in the vector search, see Vector index usage.query_table
: The table that provides the embeddings for which to find nearest neighbors. All columns are passed through as output columns.query_statement
: A query that provides the embeddings for which to find nearest neighbors. All columns are passed through as output columns.query_column_to_search
: An optionalSTRING
positional-named argument.query_column_to_search_value
specifies the name of the column in the query table or statement that contains the embeddings for which to find nearest neighbors. The column must have a type ofARRAY<FLOAT64>
. All elements in the array must be non-NULL
and all values in the column must have the same array dimensions as the values in thecolumn_to_search
column. If you don't specifyquery_column_to_search_value
, the function uses thecolumn_to_search
value.top_k
: An optionalINT64
mandatory-named argument.top_k_value
specifies the number of nearest neighbors to return. The default is10
. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.distance_type
: An optionalSTRING
mandatory-named argument.distance_type_value
specifies the type of metric to use to compute the distance between two vectors. Supported distance types areEUCLIDEAN
andCOSINE
. The default isEUCLIDEAN
.If you don't specify
distance_type_value
and thecolumn_to_search
column has a vector index that is used,VECTOR_SEARCH
uses the distance type specified in thedistance_type
option of theCREATE VECTOR INDEX
statement.options
: An optional JSON-formattedSTRING
mandatory-named argument.options_value
is a literal that specifies the following vector search options:fraction_lists_to_search
: A JSON number that specifies the percentage of lists to search. For example,options => '{"fraction_lists_to_search":0.15}'
. Thefraction_lists_to_search
value must be in the range0.0
to1.0
, exclusive.Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage.
fraction_lists_to_search
is only used when a vector index is also used. If you don't specify afraction_lists_to_search
value but an index is matched, the default number of lists to scan is calculated asmin(0.002 * number_of_lists, 10)
.The number of available lists to search is determined by the
num_lists
option in theivf_options
option of theCREATE VECTOR INDEX
statement if that is specified. Otherwise, BigQuery calculates an appropriate number.You can't specify
fraction_lists_to_search
whenuse_brute_force
is set totrue
.use_brute_force
: A JSON boolean that determines whether to use brute force search by skipping the vector index if one is available. For example,options => '{"use_brute_force":true}'
. The default isfalse
. If you specifyuse_brute_force=false
and there is no useable vector index available, brute force is used anyway.
options
defaults to'{}'
to denote that all underlying options use their corresponding default values.
Details
You can optionally use VECTOR_SEARCH
with a vector index. When
a vector index is used, VECTOR_SEARCH
uses the Approximate Nearest
Neighbor search technique to help improve vector search performance, with
the trade-off of reducing recall and so returning more approximate
results. Brute force is used to return exact results when a vector index isn't
available, and you can choose to use brute force to get exact results even when
a vector index is available.
Output
For each row in the query data, the output contains multiple rows from the
base table that satisfy the search criteria. The number of results rows per
query table row is either 10 or the top_k
value if it is specified. The
order of the output isn't guaranteed.
The output includes the following columns:
query
: ASTRUCT
value that contains all selected columns from the query data.base
: ASTRUCT
value that contains all columns from the base table.distance
: AFLOAT64
value that represents the distance between the base data and the query data.
Limitations
BigQuery data security and governance rules apply to the use of
VECTOR_SEARCH
, which results in the following behavior:
- If the base table has
row-level security policies,
VECTOR_SEARCH
applies the row-level access policies to the query results. - If the indexed column from the base table has
data masking policies,
VECTOR_SEARCH
succeeds only if the user running the query has theFine-Grained Reader
role on the policy tags that are used. Otherwise,VECTOR_SEARCH
fails with an invalid query error. - If any base table column or any column in the query table or statement has
column-level security policies and you don't have appropriate
permissions to access the column,
VECTOR_SEARCH
fails with a permission denied error.
Examples
The following queries create test tables table1
and table2
to use in
subsequent query examples :
CREATE OR REPLACE TABLE mydataset.table1
(
id INT64,
my_embedding ARRAY<FLOAT64>
);
INSERT mydataset.table1 (id, my_embedding)
VALUES(1, [1.0, 2.0]),
(2, [2.0, 4.0]),
(3, [1.5, 7.0]),
(4, [1.0, 3.2]),
(5, [5.0, 5.4]),
(6, [3.7, 1.8]),
(7, [4.4, 2.9]);
CREATE OR REPLACE TABLE mydataset.table2
(
query_id STRING,
embedding ARRAY<FLOAT64>
);
INSERT mydataset.table2 (query_id, embedding)
VALUES('dog', [1.0, 2.0]),
('cat', [3.0, 5.2]);
The following example searches the my_embedding
column of table1
for the top
two embeddings that match each row of data in the embedding
column of
table2
:
SELECT *
FROM
VECTOR_SEARCH(
TABLE mydataset.table1,
'my_embedding',
(SELECT query_id, embedding FROM mydataset.table2),
'embedding',
top_k => 2);
/*------ --------+-----------------+---------+----------------------------------------*
| query.query_id | query.embedding | base.id | base.my_embedding | distance |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 1 | 1.0 | 0 |
| | 2.0 | | 2.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| dog | 1.0 | 4 | 1.0 | 1.2000000000000002 |
| | 2.0 | | 3.2 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 2 | 2.0 | 1.5620499351813311 |
| | 5.2 | | 4.0 | |
+----------------+-----------------+---------+-------------------+--------------------+
| cat | 3.0 | 5 | 5.0 | 2.0099751242241779 |
| | 5.2 | | 5.4 | |
*----------------+-----------------+---------+-------------------+--------------------*/
The following example searches the my_embedding
column of table1
for the top
two embeddings that match each row of data in the embedding
column of
table2
, and uses the COSINE
distance type to measure the distance between
the embeddings:
SELECT *
FROM
VECTOR_SEARCH(
TABLE mydataset.table1,
'my_embedding',
TABLE mydataset.table2,
'embedding',
top_k => 2,
distance_type => 'COSINE');
/*------ --------+-----------------+---------+-------------------------------------------+
| query.query_id | query.embedding | base.id | base.my_embedding | distance |
+----------------+-----------------+---------+-------------------+-----------------------+
| dog | 1.0 | 2 | 2.0 | 0 |
| | 2.0 | | 4.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| dog | 1.0 | 1 | 1.0 | 0 |
| | 2.0 | | 2.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| cat | 3.0 | 2 | 2.0 | 0.0017773842088002478 |
| | 5.2 | | 4.0 | |
+----------------+-----------------+---------+-------------------+-----------------------+
| cat | 3.0 | 1 | 1.0 | 0.0017773842088002478 |
| | 5.2 | | 2.0 | |
*----------------+-----------------+---------+-------------------+-----------------------*/