Search indexed text

This page provides examples of searching in BigQuery. When you index your data, BigQuery can optimize some queries that use the SEARCH function or other functions and operators, such as =, IN, LIKE, and STARTS_WITH.

SQL queries return correct results from all ingested data, even if some of the data isn't indexed yet. However, query performance can be greatly improved with an index. Savings in bytes processed and slot milliseconds are maximized when the number of search results make up a relatively small fraction of the total rows in your table because less data is scanned. To determine whether an index was used for a query, see search index usage.

Create a search index

The following table called Logs is used to show different ways of using the SEARCH function. This example table is quite small, but in practice the performance gains you get with SEARCH improve with the size of the table.

CREATE TABLE my_dataset.Logs (Level STRING, Source STRING, Message STRING)
AS (
  SELECT 'INFO' as Level, '65.177.8.234' as Source, 'Entry Foo-Bar created' as Message
  UNION ALL
  SELECT 'WARNING', '132.249.240.10', 'Entry Foo-Bar already exists, created by 65.177.8.234'
  UNION ALL
  SELECT 'INFO', '94.60.64.181', 'Entry Foo-Bar deleted'
  UNION ALL
  SELECT 'SEVERE', '4.113.82.10', 'Entry Foo-Bar does not exist, deleted by 94.60.64.181'
  UNION ALL
  SELECT 'INFO', '181.94.60.64', 'Entry Foo-Baz created'
);

The table looks like the following:

+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 65.177.8.234   | Entry Foo-Bar created                                 |
| WARNING | 132.249.240.10 | Entry Foo-Bar already exists, created by 65.177.8.234 |
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
| INFO    | 181.94.60.64   | Entry Foo-Baz created                                 |
+---------+----------------+-------------------------------------------------------+

Create a search index on the Logs table using the default text analyzer:

CREATE SEARCH INDEX my_index ON my_dataset.Logs(ALL COLUMNS);

Use the SEARCH function

The SEARCH function provides tokenized search on data. SEARCH is designed to be used with an index to optimize lookups. You can use the SEARCH function to search an entire table or restrict your search to specific columns.

Search an entire table

The following query searches across all columns of the Logs table for the value bar and returns the rows that contain this value, regardless of capitalization. Since the search index uses the default text analyzer, you don't need to specify it in the SEARCH function.

SELECT * FROM my_dataset.Logs WHERE SEARCH(Logs, 'bar');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 65.177.8.234   | Entry Foo-Bar created                                 |
| WARNING | 132.249.240.10 | Entry Foo-Bar already exists, created by 65.177.8.234 |
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+

The following query searches across all columns of the Logs table for the value `94.60.64.181` and returns the rows that contain this value. The backticks allow for an exact search, which is why the last row of the Logs table which contains 181.94.60.64 is omitted.

SELECT * FROM my_dataset.Logs WHERE SEARCH(Logs, '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+

Search a subset of columns

SEARCH makes it easy to specify a subset of columns within which to search for data. The following query searches the Message column of the Logs table for the value 94.60.64.181 and returns the rows that contain this value.

SELECT * FROM my_dataset.Logs WHERE SEARCH(Message, '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+

The following query searches both the Source and Message columns of the Logs table. It returns the rows that contain the value 94.60.64.181 from either column.

SELECT * FROM my_dataset.Logs WHERE SEARCH((Source, Message), '`94.60.64.181`');
+---------+----------------+-------------------------------------------------------+
| Level   | Source         | Message                                               |
+---------+----------------+-------------------------------------------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                                 |
| SEVERE  | 4.113.82.10    | Entry Foo-Bar does not exist, deleted by 94.60.64.181 |
+---------+----------------+-------------------------------------------------------+

If a table table has many columns and you want to search most of them, it may be easier to specify only the columns to exclude from the search. The following query searches across all columns of the Logs table except for the Message column. It returns the rows of any columns other than Message that contains the value 94.60.64.181.

SELECT *
FROM my_dataset.Logs
WHERE SEARCH(
  (SELECT AS STRUCT Logs.* EXCEPT (Message)), '`94.60.64.181`');
+---------+----------------+---------------------------------------------------+
| Level   | Source         | Message                                           |
+---------+----------------+---------------------------------------------------+
| INFO    | 94.60.64.181   | Entry Foo-Bar deleted                             |
+---------+----------------+---------------------------------------------------+

Use a different text analyzer

The following example creates a table called contact_info with an index that uses the NO_OP_ANALYZER text analyzer:

CREATE TABLE my_dataset.contact_info (name STRING, email STRING)
AS (
  SELECT 'Kim Lee' AS name, 'kim.lee@example.com' AS email
  UNION ALL
  SELECT 'Kim' AS name, 'kim@example.com' AS email
  UNION ALL
  SELECT 'Sasha' AS name, 'sasha@example.com' AS email
);
CREATE SEARCH INDEX noop_index ON my_dataset.contact_info(ALL COLUMNS)
OPTIONS (analyzer = 'NO_OP_ANALYZER');
+---------+---------------------+
| name    | email               |
+---------+---------------------+
| Kim Lee | kim.lee@example.com |
| Kim     | kim@example.com     |
| Sasha   | sasha@example.com   |
+---------+---------------------+

The following query searches for Kim in the name column and kim in the email column. Since the search index doesn't use the default text analyzer, you must pass the name of the analyzer to the SEARCH function.

SELECT
  name,
  SEARCH(name, 'Kim', analyzer=>'NO_OP_ANALYZER') AS name_Kim,
  email,
  SEARCH(email, 'kim', analyzer=>'NO_OP_ANALYZER') AS email_kim
FROM
  my_dataset.contact_info;

The NO_OP_ANALYZER doesn't modify the text, so the SEARCH function only returns TRUE for exact matches:

+---------+----------+---------------------+-----------+
| name    | name_Kim | email               | email_kim |
+---------+----------+---------------------+-----------+
| Kim Lee | FALSE    | kim.lee@example.com | FALSE     |
| Kim     | TRUE     | kim@example.com     | FALSE     |
| Sasha   | FALSE    | sasha@example.com   | FALSE     |
+---------+----------+---------------------+-----------+

Configure text analyzer options

The LOG_ANALYZER and PATTERN_ANALYZER text analyzers can be customized by adding a JSON-formatted string to the configuration options. You can configure text analyzers in the SEARCH function, the CREATE SEARCH INDEX DDL statement, and the TEXT_ANALYZE function.

The following example creates a table called complex_table with an index that uses the LOG_ANALYZER text analyzer. It uses a JSON-formatted string to configure the analyzer options:

CREATE TABLE dataset.complex_table(
  a STRING,
  my_struct STRUCT<string_field STRING, int_field INT64>,
  b ARRAY<STRING>
);

CREATE SEARCH INDEX my_index
ON dataset.complex_table(a, my_struct, b)
OPTIONS (analyzer = 'LOG_ANALYZER', analyzer_options = '''{
  "token_filters": [
    {
      "normalization": {"mode": "NONE"}
    }
  ]
}''');

The following tables shows examples of calls to the SEARCH function with different text analyzers and their results. The first table calls the SEARCH function using the default text analyzer, the LOG_ANALYZER:

Function call Returns Reason
SEARCH('foobarexample', NULL) ERROR The search_terms is `NULL`.
SEARCH('foobarexample', '') ERROR The search_terms contains no tokens.
SEARCH('foobar-example', 'foobar example') TRUE '-' and ' ' are delimiters.
SEARCH('foobar-example', 'foobarexample') FALSE The search_terms isn't split.
SEARCH('foobar-example', 'foobar\\&example') TRUE The double backslash escapes the ampersand which is a delimiter.
SEARCH('foobar-example', R'foobar\&example') TRUE The single backslash escapes the ampersand in a raw string.
SEARCH('foobar-example', '`foobar&example`') FALSE The backticks require an exact match for foobar&example.
SEARCH('foobar&example', '`foobar&example`') TRUE An exact match is found.
SEARCH('foobar-example', 'example foobar') TRUE The order of terms doesn't matter.
SEARCH('foobar-example', 'foobar example') TRUE Tokens are made lower-case.
SEARCH('foobar-example', '`foobar-example`') TRUE An exact match is found.
SEARCH('foobar-example', '`foobar`') FALSE Backticks preserve capitalization.
SEARCH('`foobar-example`', '`foobar-example`') FALSE Backticks don't have special meaning for data_to_search and
SEARCH('foobar@example.com', '`example.com`') TRUE An exact match is found after the delimiter in data_to_search.
SEARCH('a foobar-example b', '`foobar-example`') TRUE An exact match is found between the space delimiters.
SEARCH(['foobar', 'example'], 'foobar example') FALSE No single array entry matches all search terms.
SEARCH('foobar=', '`foobar\\=`') FALSE The search_terms is equivalent to foobar\=.
SEARCH('foobar=', R'`foobar\=`') FALSE This is equivalent to the previous example.
SEARCH('foobar=', 'foobar\\=') TRUE The equals sign is a delimiter in the data and query.
SEARCH('foobar=', R'foobar\=') TRUE This is equivalent to the previous example.
SEARCH('foobar.example', '`foobar`') TRUE An exact match is found.
SEARCH('foobar.example', '`foobar.`') FALSE `foobar.` isn't analyzed because of backticks; it isn't
SEARCH('foobar..example', '`foobar.`') TRUE `foobar.` isn't analyzed because of backticks; it is followed

The following table shows examples of calls to the SEARCH function using the NO_OP_ANALYZER text analyzer and reasons for various return values:

Function call Returns Reason
SEARCH('foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') TRUE An exact match is found.
SEARCH('foobar', '`foobar`', analyzer=>'NO_OP_ANALYZER') FALSE Backticks aren't special characters for NO_OP_ANALYZER.
SEARCH('foobar', 'foobar', analyzer=>'NO_OP_ANALYZER') FALSE The capitalization doesn't match.
SEARCH('foobar example', 'foobar', analyzer=>'NO_OP_ANALYZER') FALSE There are no delimiters for NO_OP_ANALYZER.
SEARCH('', '', analyzer=>'NO_OP_ANALYZER') TRUE There are no delimiters for NO_OP_ANALYZER.

Other operators and functions

BigQuery can optimize some queries that use the equal operator (=), IN operator, LIKE operator, or STARTS_WITH function to compare string literals with indexed data.

The following predicates are eligible for search index optimization:

  • column_name = 'string_literal'
  • 'string_literal' = column_name
  • struct_column.nested_field = 'string_literal'
  • string_array_column[OFFSET(0)] = 'string_literal'
  • string_array_column[ORDINAL(1)] = 'string_literal'
  • column_name IN ('string_literal1', 'string_literal2', ...)
  • STARTS_WITH(column_name, 'prefix')
  • column_name LIKE 'prefix%'

BigQuery also supports optimization when certain functions are applied to indexed data. If the search index uses the default LOG_ANALYZER text analyzer then you can apply the UPPER or LOWER functions to the column, such as UPPER(column_name) = 'STRING_LITERAL'.

For JSON scalar string data extracted from an indexed JSON column, you can apply the STRING function or its safe version, SAFE.STRING. If the extracted JSON value is not a string, then the STRING function produces an error and the SAFE.STRING function returns NULL.

For indexed JSON-formatted STRING (not JSON) data, you can apply the following functions:

For example, suppose you have the following indexed table called dataset.person_data with a JSON and a STRING column:

+----------------------------------------------------------------+-----------------------------------------+
| json_column                                                    | string_column                           |
+----------------------------------------------------------------+-----------------------------------------+
| { "name" : "Ariel", "email" : "cloudysanfrancisco@gmail.com" } | { "name" : "Ariel", "job" : "doctor" }  |
+----------------------------------------------------------------+-----------------------------------------+

The following queries are eligible for optimization:

SELECT * FROM dataset.person_data
WHERE SAFE.STRING(json_column.email) = 'cloudysanfrancisco@gmail.com';
SELECT * FROM dataset.person_data
WHERE JSON_VALUE(string_column, '$.job') IN ('doctor', 'lawyer', 'teacher');

Combinations of these functions are also optimized, such as UPPER(JSON_VALUE(json_string_expression)) = 'FOO'.

Search index usage

To determine whether a search index was used for a query, look at the query's Job Information in Query results. The Index Usage Mode and Index Unused Reasons fields provide detailed information on search index usage.

Job information showing why a search index was unused.

Information on search index usage is also available through the searchStatistics field in the Jobs.Get API method. The indexUsageMode field in searchStatistics indicates whether a search index was used with the following values:

  • UNUSED: no search index was used.
  • PARTIALLY_USED: part of the query used search indexes and part did not.
  • FULLY_USED: every SEARCH function in the query used a search index.

When indexUsageMode is UNUSED or PARTIALLY_USED, the indexUnusuedReasons field contains information about why search indexes were not used in the query.

To view searchStatistics for a query, run the bq show command.

bq show --format=prettyjson -j JOB_ID

Example

Suppose you run a query that calls the SEARCH function on data in a table. You can view the job details of the query to find the job ID, then run the bq show command to see more information:

bq show --format=prettyjson --j my_project:US.bquijob_123x456_789y123z456c

The output contains many fields, including searchStatistics, which looks similar to the following. In this example, indexUsageMode indicates that the index was not used. The reason is that the table doesn't have a search index. To solve this problem, create a search index on the table. See the indexUnusedReason code field for a list of all reasons a search index might not be used in a query.

"searchStatistics": {
  "indexUnusedReasons": [
    {
      "baseTable": {
        "datasetId": "my_dataset",
        "projectId": "my_project",
        "tableId": "my_table"
      },
      "code": "INDEX_CONFIG_NOT_AVAILABLE",
      "message": "There is no search index configuration for the base table `my_project:my_dataset.my_table`."
    }
  ],
  "indexUsageMode": "UNUSED"
},

Best practices

The following sections describe best practices when using the SEARCH function.

Search selectively

Searching works best when your search has few results. Make your searches as specific as possible.

ORDER BY LIMIT optimization

Queries that use SEARCH, =, IN, LIKE or STARTS_WITH on a very large partitioned table can be optimized when you use an ORDER BY clause on the partitioned field and a LIMIT clause. For queries that don't contain the SEARCH function, you can use the other operators and functions to take advantage of the optimization. The optimization is applied whether or not the table is indexed. This works well if you're searching for a common term. For example, suppose the Logs table created earlier is partitioned on an additional DATE type column called day. The following query is optimized:

SELECT
  Level, Source, Message
FROM
  my_dataset.Logs
WHERE
  SEARCH(Message, "foo")
ORDER BY
  day
LIMIT 10;

When you use the SEARCH function, only search the columns of the table that you expect to contain your search terms. This improves performance and reduces the number of bytes that need to be scanned.

Use backticks

When you use the SEARCH function with the LOG_ANALYZER text analyzer, enclosing your search query in backticks forces an exact match. This is helpful if your search is case-sensitive or contains characters that shouldn't be interpreted as delimiters. For example, to search for the IP address 192.0.2.1, use `192.0.2.1`. Without the backticks, the search returns any row that contains the individual tokens 192, 0, 2, and 1, in any order.