Search indexed data
This page provides examples of searching in BigQuery.
When you index your data, BigQuery can optimize some queries that use the SEARCH function or other functions and operators, such as =, IN, LIKE, and STARTS_WITH.
SQL queries return correct results from all ingested data, even if some of the data isn't indexed yet. However, query performance can be greatly improved with an index. Savings in bytes processed and slot milliseconds are maximized when the number of search results makes up a relatively small fraction of the total rows in your table, because less data is scanned. To determine whether an index was used for a query, see Search index usage.
Create a search index
The following table called Logs is used to show different ways of using the SEARCH function. This example table is quite small, but in practice the performance gains you get with SEARCH improve with the size of the table.
CREATE TABLE my_dataset.Logs (Level STRING, Source STRING, Message STRING) AS (
  SELECT 'INFO' as Level, '65.177.8.234' as Source, 'Entry Foo-Bar created' as Message
  UNION ALL
  SELECT 'WARNING', '132.249.240.10', 'Entry Foo-Bar already exists, created by 65.177.8.234'
  UNION ALL
  SELECT 'INFO', '94.60.64.181', 'Entry Foo-Bar deleted'
  UNION ALL
  SELECT 'SEVERE', '4.113.82.10', 'Entry Foo-Bar does not exist, deleted by 94.60.64.181'
  UNION ALL
  SELECT 'INFO', '181.94.60.64', 'Entry Foo-Baz created'
);
The table looks like the following:
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 65.177.8.234   | Entry Foo-Bar created                                 |
| WARNING | 132.249.240.10 | Entry Foo-Bar already exists, created by 65.177.8.234 |
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
| INFO    | 181.94.60.64   | Entry Foo-Baz created                                 |
+---------+----------------+-------------------------------------------------------+
Create a search index on the Logs table using the default text analyzer:
CREATE SEARCH INDEX my_index ON my_dataset.Logs(ALL COLUMNS);
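If you only plan to search certain columns, you can index just those columns instead of ALL COLUMNS. For example, the following sketch indexes only the Source and Message columns of the Logs table. The index name my_scoped_index is illustrative, and this is an alternative to the ALL COLUMNS index above rather than an addition to it:
-- Index only the columns you expect to search.
CREATE SEARCH INDEX my_scoped_index
ON my_dataset.Logs(Source, Message);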
For more information about search indexes, see Manage search indexes.
Use the SEARCH function
The SEARCH function provides tokenized search on data. SEARCH is designed to be used with an index to optimize lookups. You can use the SEARCH function to search an entire table or restrict your search to specific columns.
Search an entire table
The following query searches across all columns of the Logs table for the value bar and returns the rows that contain this value, regardless of capitalization. Since the search index uses the default text analyzer, you don't need to specify it in the SEARCH function.
SELECT * FROM my_dataset.Logs WHERE SEARCH(Logs, 'bar');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 65.177.8.234   | Entry Foo-Bar created                                 |
| WARNING | 132.249.240.10 | Entry Foo-Bar already exists, created by 65.177.8.234 |
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+
The following query searches across all columns of the Logs table for the value `94.60.64.181` and returns the rows that contain this value. The backticks allow for an exact search, which is why the last row of the Logs table, which contains 181.94.60.64, is omitted.
SELECT * FROM my_dataset.Logs WHERE SEARCH(Logs, '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+
Search a subset of columns
SEARCH makes it easy to specify a subset of columns within which to search for data. The following query searches the Message column of the Logs table for the value 94.60.64.181 and returns the rows that contain this value.
SELECT * FROM my_dataset.Logs WHERE SEARCH(Message, '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+
The following query searches both the Source and Message columns of the Logs table. It returns the rows that contain the value 94.60.64.181 in either column.
SELECT * FROM my_dataset.Logs WHERE SEARCH((Source, Message), '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+
Exclude columns from a search
If a table has many columns and you want to search most of them, it can be easier to specify only the columns to exclude from the search. The following query searches across all columns of the Logs table except for the Message column. It returns rows in which any column other than Message contains the value 94.60.64.181.
SELECT *
FROM my_dataset.Logs
WHERE SEARCH(
(SELECT AS STRUCT Logs.* EXCEPT (Message)), '`94.60.64.181`');
+---------+----------------+-----------------------+
| Level   | Source         | Message               |
+---------+----------------+-----------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted |
+---------+----------------+-----------------------+
Use a different text analyzer
The following example creates a table called contact_info with an index that uses the NO_OP_ANALYZER text analyzer:
CREATE TABLE my_dataset.contact_info (name STRING, email STRING) AS (
  SELECT 'Kim Lee' AS name, 'kim.lee@example.com' AS email
  UNION ALL
  SELECT 'Kim' AS name, 'kim@example.com' AS email
  UNION ALL
  SELECT 'Sasha' AS name, 'sasha@example.com' AS email
);

CREATE SEARCH INDEX noop_index
ON my_dataset.contact_info(ALL COLUMNS)
OPTIONS (analyzer = 'NO_OP_ANALYZER');
+---------+---------------------+
| name    | email               |
+---------+---------------------+
| Kim Lee | kim.lee@example.com |
| Kim     | kim@example.com     |
| Sasha   | sasha@example.com   |
+---------+---------------------+
The following query searches for Kim in the name column and kim in the email column. Since the search index doesn't use the default text analyzer, you must pass the name of the analyzer to the SEARCH function.
SELECT
  name,
  SEARCH(name, 'Kim', analyzer=>'NO_OP_ANALYZER') AS name_Kim,
  email,
  SEARCH(email, 'kim', analyzer=>'NO_OP_ANALYZER') AS email_kim
FROM my_dataset.contact_info;
The NO_OP_ANALYZER doesn't modify the text, so the SEARCH function only returns TRUE for exact matches:
+---------+----------+---------------------+-----------+
| name    | name_Kim | email               | email_kim |
+---------+----------+---------------------+-----------+
| Kim Lee | FALSE    | kim.lee@example.com | FALSE     |
| Kim     | TRUE     | kim@example.com     | FALSE     |
| Sasha   | FALSE    | sasha@example.com   | FALSE     |
+---------+----------+---------------------+-----------+
Configure text analyzer options
The LOG_ANALYZER and PATTERN_ANALYZER text analyzers can be customized by adding a JSON-formatted string to the configuration options. You can configure text analyzers in the SEARCH function, the CREATE SEARCH INDEX DDL statement, and the TEXT_ANALYZE function.
The following example creates a table called complex_table with an index that uses the LOG_ANALYZER text analyzer. It uses a JSON-formatted string to configure the analyzer options:
CREATE TABLE dataset.complex_table(
  a STRING,
  my_struct STRUCT<string_field STRING, int_field INT64>,
  b ARRAY<STRING>
);

CREATE SEARCH INDEX my_index
ON dataset.complex_table(a, my_struct, b)
OPTIONS (analyzer = 'LOG_ANALYZER', analyzer_options = '''{
  "token_filters": [
    {
      "normalization": {"mode": "NONE"}
    }
  ]
}''');
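You can pass the same JSON-formatted options to the SEARCH and TEXT_ANALYZE functions through the analyzer_options named argument. The following sketch assumes the complex_table and options shown above; the search term and input string are illustrative:
-- Search with the same analyzer configuration that the index uses.
SELECT a
FROM dataset.complex_table
WHERE SEARCH(a, 'Foo-Bar',
             analyzer => 'LOG_ANALYZER',
             analyzer_options => '{"token_filters": [{"normalization": {"mode": "NONE"}}]}');

-- Inspect how the configured analyzer tokenizes a string.
SELECT TEXT_ANALYZE('Entry Foo-Bar created',
                    analyzer => 'LOG_ANALYZER',
                    analyzer_options => '{"token_filters": [{"normalization": {"mode": "NONE"}}]}') AS tokens;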
The following tables show examples of calls to the SEARCH function with different text analyzers and their results. The first table calls the SEARCH function using the default text analyzer, LOG_ANALYZER:
Function call | Returns | Reason |
---|---|---|
SEARCH('foobarexample', NULL) | ERROR | The search_terms is `NULL`. |
SEARCH('foobarexample', '') | ERROR | The search_terms contains no tokens. |
SEARCH('foobar-example', 'foobar example') | TRUE | '-' and ' ' are delimiters. |
SEARCH('foobar-example', 'foobarexample') | FALSE | The search_terms isn't split. |
SEARCH('foobar-example', 'foobar\\&example') | TRUE | The double backslash escapes the ampersand which is a delimiter. |
SEARCH('foobar-example', R'foobar\&example') | TRUE | The single backslash escapes the ampersand in a raw string. |
SEARCH('foobar-example', '`foobar&example`') | FALSE | The backticks require an exact match for foobar&example. |
SEARCH('foobar&example', '`foobar&example`') | TRUE | An exact match is found. |
SEARCH('foobar-example', 'example foobar') | TRUE | The order of terms doesn't matter. |
SEARCH('Foobar-Example', 'foobar example') | TRUE | Tokens are made lower-case. |
SEARCH('foobar-example', '`foobar-example`') | TRUE | An exact match is found. |
SEARCH('Foobar-Example', '`foobar`') | FALSE | Backticks preserve capitalization. |
SEARCH('`foobar-example`', '`foobar-example`') | FALSE | Backticks don't have special meaning for data_to_search and aren't delimiters, so no exact match is found. |
SEARCH('foobar@example.com', '`example.com`') | TRUE | An exact match is found after the delimiter in data_to_search. |
SEARCH('a foobar-example b', '`foobar-example`') | TRUE | An exact match is found between the space delimiters. |
SEARCH(['foobar', 'example'], 'foobar example') | FALSE | No single array entry matches all search terms. |
SEARCH('foobar=', '`foobar\\=`') | FALSE | The search_terms is equivalent to foobar\=. |
SEARCH('foobar=', R'`foobar\=`') | FALSE | This is equivalent to the previous example. |
SEARCH('foobar=', 'foobar\\=') | TRUE | The equals sign is a delimiter in the data and query. |
SEARCH('foobar=', R'foobar\=') | TRUE | This is equivalent to the previous example. |
SEARCH('foobar.example', '`foobar`') | TRUE | An exact match is found. |
SEARCH('foobar.example', '`foobar.`') | FALSE | `foobar.` isn't analyzed because of backticks; it isn't followed by a delimiter in data_to_search. |
SEARCH('foobar..example', '`foobar.`') | TRUE | `foobar.` isn't analyzed because of backticks; it is followed by a delimiter ('.') in data_to_search. |
The following table shows examples of calls to the SEARCH function using the NO_OP_ANALYZER text analyzer and reasons for various return values:
Function call | Returns | Reason |
---|---|---|
SEARCH('foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') | TRUE | An exact match is found. |
SEARCH('foobar', '`foobar`', analyzer=>'NO_OP_ANALYZER') | FALSE | Backticks aren't special characters for NO_OP_ANALYZER. |
SEARCH('Foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') | FALSE | The capitalization doesn't match. |
SEARCH('foobar example', 'foobar', analyzer=>'NO_OP_ANALYZER') | FALSE | There are no delimiters for NO_OP_ANALYZER. |
SEARCH('', '', analyzer=>'NO_OP_ANALYZER') | TRUE | There are no delimiters for NO_OP_ANALYZER. |
Other operators and functions
In addition to the SEARCH function, search indexes can optimize queries that use several other operators, functions, and predicates.
Optimize with operators and comparison functions
BigQuery can optimize some queries that use the equal operator (=), IN operator, LIKE operator, or STARTS_WITH function to compare string literals with indexed data.
Optimize with string predicates
The following predicates are eligible for search index optimization:
column_name = 'string_literal'
'string_literal' = column_name
struct_column.nested_field = 'string_literal'
string_array_column[OFFSET(0)] = 'string_literal'
string_array_column[ORDINAL(1)] = 'string_literal'
column_name IN ('string_literal1', 'string_literal2', ...)
STARTS_WITH(column_name, 'prefix')
column_name LIKE 'prefix%'
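For example, the following queries against the Logs table created earlier use these predicate forms; the literals are illustrative:
-- Equality with a string literal.
SELECT * FROM my_dataset.Logs WHERE Source = '65.177.8.234';

-- IN with a list of string literals.
SELECT * FROM my_dataset.Logs WHERE Level IN ('WARNING', 'SEVERE');

-- Prefix matches with STARTS_WITH and LIKE.
SELECT * FROM my_dataset.Logs WHERE STARTS_WITH(Message, 'Entry Foo');
SELECT * FROM my_dataset.Logs WHERE Message LIKE 'Entry Foo%';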
Optimize with numeric predicates
For support during the preview, email bq-search-team@google.com.
If the search index was created with numeric data types, BigQuery can optimize some queries that use the equal operator (=) or IN operator with indexed data. The following predicates are eligible for search index optimization:
INT64(json_column.int64_field) = 1
int64_column = 1
int64_array_column[OFFSET(0)] = 1
int64_column IN (1, 2)
struct_column.nested_int64_field = 1
struct_column.nested_timestamp_field = TIMESTAMP "2024-02-15 21:31:40"
timestamp_column = "2024-02-15 21:31:40"
timestamp_column IN ("2024-02-15 21:31:40", "2024-02-16 21:31:40")
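For example, given a hypothetical indexed table my_dataset.events with columns named like the predicate list above, the following queries use these predicate forms:
-- Hypothetical table and columns; the values are illustrative.
SELECT * FROM my_dataset.events WHERE int64_column = 1;
SELECT * FROM my_dataset.events WHERE int64_column IN (1, 2);
SELECT * FROM my_dataset.events WHERE timestamp_column = "2024-02-15 21:31:40";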
Optimize functions that produce indexed data
BigQuery supports search index optimization when certain functions are applied to indexed data. If the search index uses the default LOG_ANALYZER text analyzer, you can apply the UPPER or LOWER functions to the column, such as UPPER(column_name) = 'STRING_LITERAL'.
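For example, the following sketch against the Logs table, which is indexed with the default text analyzer, uses this form:
-- UPPER applied to an indexed column, compared to a string literal.
SELECT * FROM my_dataset.Logs WHERE UPPER(Level) = 'SEVERE';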
For JSON scalar string data extracted from an indexed JSON column, you can apply the STRING function or its safe version, SAFE.STRING. If the extracted JSON value is not a string, then the STRING function produces an error and the SAFE.STRING function returns NULL.
For indexed JSON-formatted STRING (not JSON) data, you can apply JSON extraction functions such as JSON_VALUE.
For example, suppose you have the following indexed table called dataset.person_data with a JSON and a STRING column:
+-----------------------------------------------------------------+-----------------------------------------+
| json_column                                                      | string_column                           |
+-----------------------------------------------------------------+-----------------------------------------+
| { "name" : "Ariel", "email" : "cloudysanfrancisco@gmail.com" }   | { "name" : "Ariel", "job" : "doctor" }  |
+-----------------------------------------------------------------+-----------------------------------------+
The following queries are eligible for optimization:
SELECT * FROM dataset.person_data WHERE SAFE.STRING(json_column.email) = 'cloudysanfrancisco@gmail.com';
SELECT * FROM dataset.person_data WHERE JSON_VALUE(string_column, '$.job') IN ('doctor', 'lawyer', 'teacher');
Combinations of these functions are also optimized, such as UPPER(JSON_VALUE(json_string_expression)) = 'FOO'.
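For example, the following sketch against the dataset.person_data table combines UPPER with JSON_VALUE:
-- Combination of UPPER and JSON_VALUE on an indexed JSON-formatted STRING column.
SELECT *
FROM dataset.person_data
WHERE UPPER(JSON_VALUE(string_column, '$.job')) = 'DOCTOR';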
Search index usage
To determine whether a search index was used for a query, look at the query's Job Information in Query results. The Index Usage Mode and Index Unused Reasons fields provide detailed information on search index usage.
Information on search index usage is also available through the searchStatistics field in the Jobs.Get API method. The indexUsageMode field in searchStatistics indicates whether a search index was used with the following values:
UNUSED: no search index was used.
PARTIALLY_USED: part of the query used search indexes and part did not.
FULLY_USED: every SEARCH function in the query used a search index.
When indexUsageMode is UNUSED or PARTIALLY_USED, the indexUnusedReasons field contains information about why search indexes were not used in the query.
To view searchStatistics for a query, run the bq show command:
bq show --format=prettyjson -j JOB_ID
Example
Suppose you run a query that calls the SEARCH function on data in a table. You can view the job details of the query to find the job ID, then run the bq show command to see more information:
bq show --format=prettyjson -j my_project:US.bquijob_123x456_789y123z456c
The output contains many fields, including searchStatistics, which looks similar to the following. In this example, indexUsageMode indicates that the index was not used. The reason is that the table doesn't have a search index. To solve this problem, create a search index on the table. See the indexUnusedReason code field for a list of all reasons a search index might not be used in a query.
"searchStatistics": {
"indexUnusedReasons": [
{
"baseTable": {
"datasetId": "my_dataset",
"projectId": "my_project",
"tableId": "my_table"
},
"code": "INDEX_CONFIG_NOT_AVAILABLE",
"message": "There is no search index configuration for the base table `my_project:my_dataset.my_table`."
}
],
"indexUsageMode": "UNUSED"
},
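In this example, a minimal fix is to create the missing index. The following sketch assumes that indexing all string columns of my_table is appropriate, and the index name is an example:
-- Create the missing search index (my_table_index is an example name).
CREATE SEARCH INDEX my_table_index
ON my_dataset.my_table(ALL COLUMNS);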
Best practices
The following sections describe best practices for searching.
Search selectively
Searching works best when your search has few results. Make your searches as specific as possible.
ORDER BY LIMIT optimization
Queries that use SEARCH, =, IN, LIKE, or STARTS_WITH on a very large partitioned table can be optimized when you use an ORDER BY clause on the partitioned field and a LIMIT clause.
For queries that don't contain the SEARCH function, you can use the other operators and functions to take advantage of the optimization. The optimization is applied whether or not the table is indexed. This works well if you're searching for a common term.
For example, suppose the Logs table created earlier is partitioned on an additional DATE type column called day. The following query is optimized:
SELECT
Level, Source, Message
FROM
my_dataset.Logs
WHERE
SEARCH(Message, "foo")
ORDER BY
day
LIMIT 10;
Scope your search
When you use the SEARCH function, only search the columns of the table that you expect to contain your search terms. This improves performance and reduces the number of bytes that need to be scanned.
Use backticks
When you use the SEARCH function with the LOG_ANALYZER text analyzer, enclosing your search query in backticks forces an exact match. This is helpful if your search is case-sensitive or contains characters that shouldn't be interpreted as delimiters. For example, to search for the IP address 192.0.2.1, use `192.0.2.1`. Without the backticks, the search returns any row that contains the individual tokens 192, 0, 2, and 1, in any order.
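For example, the two query shapes against the Logs table look like the following. This is a sketch; because 192.0.2.1 doesn't appear in the sample data, both queries return no rows there:
-- Exact match on the full IP address.
SELECT * FROM my_dataset.Logs WHERE SEARCH(Message, '`192.0.2.1`');

-- Without backticks, matches rows whose Message contains the tokens
-- 192, 0, 2, and 1 in any order.
SELECT * FROM my_dataset.Logs WHERE SEARCH(Message, '192.0.2.1');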