Manage search indexes

A search index is a data structure designed to enable very efficient search with the SEARCH function. Much like the index you'd find in the back of a book, a search index for a column of string data acts like an auxiliary table that has one column for unique words and another for where in the data those words occur.
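
As a rough illustration of that analogy, the following query builds a word-to-rows mapping over two sample rows. It is only a conceptual sketch: the docs table, its row_id column, and its values are hypothetical, and the query does not reflect how BigQuery stores a search index internally.

```sql
-- Hypothetical sample data: two rows of string data.
WITH docs AS (
  SELECT 1 AS row_id, 'the quick fox' AS text
  UNION ALL
  SELECT 2 AS row_id, 'the lazy dog' AS text
)
SELECT
  word,                                                  -- one entry per unique word
  ARRAY_AGG(DISTINCT row_id ORDER BY row_id) AS row_ids  -- where the word occurs
FROM
  docs,
  UNNEST(SPLIT(text, ' ')) AS word
GROUP BY
  word
ORDER BY
  word;
```

Each output row pairs one unique word with the IDs of the rows that contain it, which is the lookup structure the book-index comparison describes.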

Create a search index

To create a search index, use the CREATE SEARCH INDEX DDL statement. You can create a search index on these column types:

  • STRING
  • ARRAY<STRING>
  • STRUCT containing at least one nested field of type STRING or ARRAY<STRING>
  • JSON

When you create a search index, you can specify the type of text analyzer to use. The text analyzer controls how data is tokenized for indexing and searching. The default is LOG_ANALYZER. This analyzer works well for machine-generated logs and has special rules for tokens commonly found in observability data, such as IP addresses or emails. Use the NO_OP_ANALYZER when you have pre-processed data that you want to match exactly. PATTERN_ANALYZER extracts tokens from text using a regular expression.
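
To see how the analyzers differ before you commit to one, you can compare their tokens with the TEXT_ANALYZE function. The following is a minimal sketch: the sample string is invented, and the analyzer_options value (a JSON string with a patterns list) is an assumption that you should check against the text analysis reference for your environment.

```sql
-- Compare how each analyzer tokenizes the same hypothetical log line.
SELECT
  TEXT_ANALYZE(sample) AS log_analyzer_tokens,  -- LOG_ANALYZER is the default
  TEXT_ANALYZE(sample, analyzer => 'NO_OP_ANALYZER') AS no_op_tokens,
  TEXT_ANALYZE(
    sample,
    analyzer => 'PATTERN_ANALYZER',
    -- Assumed option format: one regular expression that defines a token.
    analyzer_options => '{"patterns": ["[a-zA-Z]+"]}') AS pattern_tokens
FROM
  UNNEST(['Login failed for user@example.com from 192.0.2.10']) AS sample;
```

Comparing the three token arrays side by side can help you decide which analyzer matches the way you expect to search the data.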

The following example creates a search index on columns a and c of simple_table.

CREATE TABLE dataset.simple_table(a STRING, b INT64, c JSON);

CREATE SEARCH INDEX my_index
ON dataset.simple_table(a, c);

When you create a search index on ALL COLUMNS, all STRING or JSON data in the table is indexed. If the table contains no such data, for example if all columns contain integers, the index creation fails. When you specify a STRUCT column to be indexed, all nested subfields are indexed.

The following example creates a search index on a, c.e, and c.f.g that uses the NO_OP_ANALYZER text analyzer.

CREATE TABLE dataset.my_table(
  a STRING,
  b INT64,
  c STRUCT <d INT64,
            e ARRAY<STRING>,
            f STRUCT<g STRING, h INT64>>) AS
SELECT 'hello' AS a, 10 AS b, (20, ['x', 'y'], ('z', 30)) AS c;

CREATE SEARCH INDEX my_index
ON dataset.my_table(ALL COLUMNS)
OPTIONS (analyzer = 'NO_OP_ANALYZER');

Since the search index was created on ALL COLUMNS, any columns added to the table are automatically indexed if they contain STRING data.
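
A query against this table might then look like the following sketch. It assumes that the SEARCH call should name the same analyzer as the index so that the index is eligible to be used; the search term is taken from the sample row above.

```sql
-- Search column a for the exact value 'hello'.
-- The analyzer argument matches the analyzer of my_index (NO_OP_ANALYZER).
SELECT
  a,
  b
FROM
  dataset.my_table
WHERE
  SEARCH(a, 'hello', analyzer => 'NO_OP_ANALYZER');
```

The query returns correct results whether or not the index is populated; the index only affects how efficiently the search runs.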

Understand index refresh

Search indexes are fully managed by BigQuery and automatically refreshed when the table changes. The following schema changes to the table can trigger a full refresh:

  • A new indexable column is added to a table with a search index on ALL COLUMNS.
  • An indexed column is updated due to a table schema change.

If you delete the only indexed column in a table or rename the table itself, then the search index is deleted automatically.
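
As a sketch of the first case, the following statements add a STRING column to my_table, which has a search index on ALL COLUMNS, and then list the indexed fields by using the INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS view. The column name note is hypothetical, and the new field might not appear in the view immediately.

```sql
-- Add a new indexable column to a table indexed on ALL COLUMNS.
-- The column name "note" is a hypothetical example.
ALTER TABLE dataset.my_table
ADD COLUMN note STRING;

-- List the fields that the index covers for this table.
SELECT
  index_name,
  index_column_name,
  index_field_path
FROM
  my_project.dataset.INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS
WHERE
  table_name = 'my_table';
```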

Search indexes are designed for large tables. If you create a search index on a table that is smaller than 10GB, then the index is not populated. Similarly, if you delete data from an indexed table and the table size falls below 10GB, then the index is temporarily disabled. In this case, search queries do not use the index and the IndexUnusedReason code is BASE_TABLE_TOO_SMALL. This happens whether or not you use your own reservation for your index-management jobs. When an indexed table's size exceeds 10GB, then its index is populated automatically. You are not charged for storage until the search index is populated and active. Queries that use the SEARCH function always return correct results even if some data is not yet indexed.
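
To get a rough idea of whether a table is near the 10GB threshold, you can look at its size in the INFORMATION_SCHEMA.TABLE_STORAGE view. This is a sketch under assumptions: the region and dataset name are placeholders, and this page does not state which size measure BigQuery compares against the threshold, so treat logical bytes as an approximation.

```sql
-- Approximate table sizes for one dataset, largest first.
-- Assumption: logical bytes are a reasonable proxy for the size that the
-- 10GB threshold refers to.
SELECT
  table_name,
  ROUND(total_logical_bytes / POW(10, 9), 2) AS logical_gb
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  table_schema = 'my_dataset'
ORDER BY
  total_logical_bytes DESC;
```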

Get information about search indexes

You can verify the existence and the readiness of a search index by querying INFORMATION_SCHEMA. There are two views that contain metadata on search indexes. The INFORMATION_SCHEMA.SEARCH_INDEXES view has information on each search index created on a dataset. The INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS view has information on which columns of each table in the dataset are indexed.

The following example shows all active search indexes on tables in the dataset my_dataset, located in the project my_project. It includes their names, the DDL statements used to create them, their coverage percentage, and their text analyzer. If an indexed base table is less than 10GB, then its index is not populated, in which case coverage_percentage is 0.

SELECT table_name, index_name, ddl, coverage_percentage, analyzer
FROM my_project.my_dataset.INFORMATION_SCHEMA.SEARCH_INDEXES
WHERE index_status = 'ACTIVE';

The results should look like the following:

+-------------+-------------+--------------------------------------------------------------------------------------+---------------------+----------------+
| table_name  | index_name  | ddl                                                                                  | coverage_percentage | analyzer       |
+-------------+-------------+--------------------------------------------------------------------------------------+---------------------+----------------+
| small_table | names_index | CREATE SEARCH INDEX `names_index` ON `my_project.my_dataset.small_table`(names)      | 0                   | NO_OP_ANALYZER |
| large_table | logs_index  | CREATE SEARCH INDEX `logs_index` ON `my_project.my_dataset.large_table`(ALL COLUMNS) | 100                 | LOG_ANALYZER   |
+-------------+-------------+--------------------------------------------------------------------------------------+---------------------+----------------+

The following example creates a search index on all columns of my_table.

CREATE TABLE dataset.my_table(
  a STRING,
  b INT64,
  c STRUCT <d INT64,
            e ARRAY<STRING>,
            f STRUCT<g STRING, h INT64>>) AS
SELECT 'hello' AS a, 10 AS b, (20, ['x', 'y'], ('z', 30)) AS c;

CREATE SEARCH INDEX my_index
ON dataset.my_table(ALL COLUMNS);

The following query extracts information on which fields are indexed. The index_field_path indicates which field of a column is indexed. This differs from the index_column_name only in the case of a STRUCT, where the full path to the indexed field is given. In this example, column c contains an ARRAY<STRING> field e and another STRUCT called f which contains a STRING field g, each of which is indexed.

SELECT table_name, index_name, index_column_name, index_field_path
FROM my_project.dataset.INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS;

The result is similar to the following:

+------------+------------+-------------------+------------------+
| table_name | index_name | index_column_name | index_field_path |
+------------+------------+-------------------+------------------+
| my_table   | my_index   | a                 | a                |
| my_table   | my_index   | c                 | c.e              |
| my_table   | my_index   | c                 | c.f.g            |
+------------+------------+-------------------+------------------+

The following query joins the INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS view with the INFORMATION_SCHEMA.SEARCH_INDEXES and INFORMATION_SCHEMA.COLUMNS views to include the search index status and the data type of each column:

SELECT
  index_columns_view.INDEX_CATALOG AS project_name,
  index_columns_view.INDEX_SCHEMA AS dataset_name,
  indexes_view.TABLE_NAME AS table_name,
  indexes_view.INDEX_NAME AS index_name,
  indexes_view.INDEX_STATUS AS status,
  index_columns_view.INDEX_COLUMN_NAME AS column_name,
  index_columns_view.INDEX_FIELD_PATH AS field_path,
  columns_view.DATA_TYPE AS data_type
FROM
  my_dataset.INFORMATION_SCHEMA.SEARCH_INDEXES indexes_view
INNER JOIN
  my_dataset.INFORMATION_SCHEMA.SEARCH_INDEX_COLUMNS index_columns_view
  ON
    indexes_view.TABLE_NAME = index_columns_view.TABLE_NAME
    AND indexes_view.INDEX_NAME = index_columns_view.INDEX_NAME
LEFT OUTER JOIN
  my_dataset.INFORMATION_SCHEMA.COLUMNS columns_view
  ON
    indexes_view.INDEX_CATALOG = columns_view.TABLE_CATALOG
    AND indexes_view.INDEX_SCHEMA = columns_view.TABLE_SCHEMA
    AND index_columns_view.TABLE_NAME = columns_view.TABLE_NAME
    AND index_columns_view.INDEX_COLUMN_NAME = columns_view.COLUMN_NAME
ORDER BY
  project_name,
  dataset_name,
  table_name,
  column_name;

The result is similar to the following:

+--------------+--------------+------------+------------+--------+-------------+------------+---------------------------------------------------------------+
| project_name | dataset_name | table_name | index_name | status | column_name | field_path | data_type                                                     |
+--------------+--------------+------------+------------+--------+-------------+------------+---------------------------------------------------------------+
| my_project   | my_dataset   | my_table   | my_index   | ACTIVE | a           | a          | STRING                                                        |
| my_project   | my_dataset   | my_table   | my_index   | ACTIVE | c           | c.e        | STRUCT<d INT64, e ARRAY<STRING>, f STRUCT<g STRING, h INT64>> |
| my_project   | my_dataset   | my_table   | my_index   | ACTIVE | c           | c.f.g      | STRUCT<d INT64, e ARRAY<STRING>, f STRUCT<g STRING, h INT64>> |
+--------------+--------------+------------+------------+--------+-------------+------------+---------------------------------------------------------------+

Index management options

To create indexes and have BigQuery maintain them, you have two options:

  • Use the default shared slot pool: When the data you plan to index is below your per-organization limit, you can use the free shared slot pool for index management.
  • Use your own reservation: To achieve more predictable and consistent indexing progress on your larger production workloads, you can use your own reservations for index management.

Use shared slots

If you have not configured your project to use a dedicated reservation for indexing, index management is handled in the free, shared slot pool, subject to the following constraints.

If you add data to a table which causes the total size of indexed tables to exceed your organization's limit, BigQuery pauses index management for all indexed tables. When this happens, the index_status field in the INFORMATION_SCHEMA.SEARCH_INDEXES view displays PENDING DISABLEMENT and the index is queued for deletion. While the index is pending disablement, it is still used in queries and you are charged for the index storage. After an index is deleted, the index_status field shows the index as TEMPORARILY DISABLED. In this state, queries don't use the index, and you are not charged for index storage. In this case, the IndexUnusedReason code is BASE_TABLE_TOO_LARGE.

If you delete data from the table and the total size of indexed tables falls below the per-organization limit, then index management is resumed for all indexed tables. The index_status field in the INFORMATION_SCHEMA.SEARCH_INDEXES view is ACTIVE, queries can use the index, and you are charged for the index storage.
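
One way to watch for these state changes is to list every index that is not currently active. The following sketch reuses the project and dataset names from the earlier examples and only selects columns that appear elsewhere on this page.

```sql
-- List search indexes in the dataset that are not in the ACTIVE state,
-- for example indexes that are pending disablement or temporarily disabled.
SELECT
  table_name,
  index_name,
  index_status,
  coverage_percentage
FROM
  my_project.my_dataset.INFORMATION_SCHEMA.SEARCH_INDEXES
WHERE
  index_status != 'ACTIVE';
```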

BigQuery does not make guarantees about the available capacity of the shared pool or the throughput of indexing you see. For production applications, you might want to use dedicated slots for your index processing.

Use your own reservation

Instead of using the default shared slot pool, you can optionally designate your own reservation to index your tables. Using your own reservation ensures predictable and consistent performance of index-management jobs, such as creation, refresh, and background optimizations.

  • There are no table size limits when an indexing job runs in your reservation.
  • Using your own reservation gives you flexibility in your index management. If you need to create a very large index or make a major update to an indexed table, you can temporarily add more slots to the assignment.

To index the tables in a project with a designated reservation, create a reservation in the region where your tables are located. Then, assign the project to the reservation with the job_type set to BACKGROUND:

SQL

Use the CREATE ASSIGNMENT DDL statement.

  1. In the Google Cloud console, go to the BigQuery page.

  2. In the query editor, enter the following statement:

    CREATE ASSIGNMENT
      `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID`
    OPTIONS (
      assignee = 'projects/PROJECT_ID',
      job_type = 'BACKGROUND');
    

    Replace the following:

    • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
    • LOCATION: the location of the reservation
    • RESERVATION_NAME: the name of the reservation
    • ASSIGNMENT_ID: the ID of the assignment

      The ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.

    • PROJECT_ID: the ID of the project containing the tables to index. This project is assigned to the reservation.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq mk command:

bq mk \
    --project_id=ADMIN_PROJECT_ID \
    --location=LOCATION \
    --reservation_assignment \
    --reservation_id=RESERVATION_NAME \
    --assignee_id=PROJECT_ID \
    --job_type=BACKGROUND \
    --assignee_type=PROJECT

Replace the following:

  • ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
  • LOCATION: the location of the reservation
  • RESERVATION_NAME: the name of the reservation
  • PROJECT_ID: the ID of the project to assign to this reservation
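
After you create the assignment with either method, you might confirm it by querying the INFORMATION_SCHEMA.ASSIGNMENTS view from the administration project. This is a sketch that uses the same placeholders as above; the selected column names are given from memory of that view's schema and should be verified against its reference.

```sql
-- List assignments for background jobs, which include index management.
-- Replace ADMIN_PROJECT_ID and LOCATION as described above.
SELECT
  reservation_name,
  assignment_id,
  assignee_type,
  assignee_id,
  job_type
FROM
  `ADMIN_PROJECT_ID`.`region-LOCATION`.INFORMATION_SCHEMA.ASSIGNMENTS
WHERE
  job_type = 'BACKGROUND';
```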

View your indexing jobs

A new indexing job is created every time an index is created or updated on a single table. To view information about the job, query the INFORMATION_SCHEMA.JOBS* views. You can filter for indexing jobs by setting job_type IS NULL AND SEARCH(job_id, '`search_index`') in the WHERE clause of your query. The following example lists the five most recent indexing jobs in the project my_project:

SELECT *
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
ORDER BY
  creation_time DESC
LIMIT 5;

Choose your reservation size

To choose the right number of slots for your reservation, you should consider when index-management jobs are run, how many slots they use, and what your usage looks like over time. BigQuery triggers an index-management job in the following situations:

  • You create an index on a table.
  • Data is modified in an indexed table.
  • The schema of a table changes and this affects which columns are indexed.
  • Index data and metadata are periodically optimized or updated.

The number of slots you need for an index-management job on a table depends on the following factors:

  • The size of the table
  • The rate of data ingestion to the table
  • The rate of DML statements applied to the table
  • The acceptable delay for building and maintaining the index
  • The complexity of the index, typically determined by attributes of the data, such as the number of duplicate terms

Initial estimation

The following estimates can help you to approximate how many slots your reservation requires. Due to the highly variable nature of indexing workloads, you should re-evaluate your requirements after you start indexing data.

  • Existing data: With a 1000-slot reservation, an existing table in BigQuery can be indexed at an average rate of up to 4 GiB per second, which is approximately 336 TiB per day.
  • Newly ingested data: Indexing is typically more resource-intensive on newly ingested data, as the table and its index go through several rounds of transformative optimizations. On average, indexing newly ingested data consumes three times the resources compared to initial backfill-indexing of the same data.
  • Infrequently modified data: Indexed tables with little to no data modification need substantially fewer resources for continued index maintenance. A recommended starting point is to maintain 1/5 of the slots required for the initial backfill-indexing of the same data, and no fewer than 250 slots.
  • Indexing progress scales roughly linearly with the reservation size. However, we don't recommend using reservations smaller than 250 slots for indexing because it might lead to inefficiencies that can slow indexing progress.
  • These estimates may change as features, optimizations, and your actual usage vary.
  • If your organization's total table size exceeds your region's indexing limit, then you should maintain a non-zero reservation assigned for indexing. Otherwise, indexing might fall back to the default tier, resulting in unintended deletion of all indexes.
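
The estimates above can be turned into a quick calculation. The following sketch uses hypothetical inputs (a 50 TiB table that should be backfilled within 24 hours) and assumes purely linear scaling from the figure of 1000 slots for up to 4 GiB per second. Because that figure is an upper bound, treat the results as a lower bound on what you might need.

```sql
-- Hypothetical inputs: table size and the acceptable time to index it.
WITH inputs AS (
  SELECT 50.0 AS table_size_tib, 24.0 AS target_hours
),
backfill AS (
  SELECT
    -- Throughput needed to finish on time, in GiB per second.
    table_size_tib * 1024 / (target_hours * 3600) AS required_gib_per_sec
  FROM inputs
)
SELECT
  required_gib_per_sec,
  -- Linear scaling: 1000 slots index up to 4 GiB per second.
  CEIL(required_gib_per_sec / 4 * 1000) AS backfill_slots_unclamped,
  -- Apply the recommended minimum of 250 slots.
  GREATEST(250, CEIL(required_gib_per_sec / 4 * 1000)) AS backfill_slots,
  -- Newly ingested data: about three times the backfill resources.
  GREATEST(250, CEIL(3 * required_gib_per_sec / 4 * 1000)) AS new_data_slots,
  -- Infrequently modified data: 1/5 of the backfill slots, at least 250.
  GREATEST(250, CEIL(required_gib_per_sec / 4 * 1000 / 5)) AS maintenance_slots
FROM
  backfill;
```

With these inputs, the unclamped backfill estimate is 149 slots, so the 250-slot minimum applies to both backfill and maintenance, and the estimate for newly ingested data at the same rate is 445 slots.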

Monitor usage and progress

The best way to assess the number of slots you need to efficiently run your index-management jobs is to monitor your slot utilization and adjust the reservation size accordingly. The following query produces the daily slot usage for index-management jobs in the region us-west1. The results are limited to the 30 most recent days on which such jobs were created:

SELECT
  TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date,
  -- Aggregate total_slot_ms used for index-management jobs in a day and divide
  -- by the number of milliseconds in a day. This value is most accurate for
  -- days with consistent slot usage.
  SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage
FROM
  `region-us-west1`.INFORMATION_SCHEMA.JOBS job
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
GROUP BY
  usage_date
ORDER BY
  usage_date DESC
LIMIT 30;

When there are insufficient slots to run index-management jobs, an index can become out of sync with its table and indexing jobs might fail. In this case, BigQuery rebuilds the index from scratch. To avoid having an out-of-sync index, ensure you have enough slots to support index updates from data ingestion and optimization. For more information on monitoring slot usage, see admin resource charts.
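
To check whether indexing jobs are failing, you can filter the same view for jobs that recorded an error. This sketch reuses the region, project, and job filter from the query above; the limit of 20 rows is an arbitrary choice.

```sql
-- List recent index-management jobs that ended with an error.
SELECT
  job_id,
  creation_time,
  error_result.reason AS error_reason,
  error_result.message AS error_message
FROM
  `region-us-west1`.INFORMATION_SCHEMA.JOBS
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
  AND error_result IS NOT NULL
ORDER BY
  creation_time DESC
LIMIT 20;
```

A growing number of failed jobs over time can be a sign that the reservation is too small for the current ingestion rate.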

Best practices

  • Search indexes are designed for large tables. The performance gains from a search index increase with the size of the table.
  • Don't index columns that contain only a very small number of unique values.
  • Don't index columns that you never intend to call the SEARCH function on.
  • Be careful when creating a search index on ALL COLUMNS. Every time you add a column containing STRING or JSON data, it is indexed.
  • You should use your own reservation for index management in production applications. If you choose to use the default shared slot pool for your index-management jobs, then the per-organization sizing limits apply.

Delete a search index

When you no longer need a search index or want to change which columns are indexed on a table, you can delete the index currently on that table. To do this, use the DROP SEARCH INDEX DDL statement.

If an indexed table is deleted, its index is deleted automatically.

Example:

DROP SEARCH INDEX my_index ON dataset.simple_table;
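
To change which columns are indexed, you can drop the index and then create a new one. The following sketch assumes the simple_table example from earlier on this page and narrows the index from columns a and c to column a only; the new index is built from scratch.

```sql
-- Remove the existing index if it is present.
DROP SEARCH INDEX IF EXISTS my_index ON dataset.simple_table;

-- Create a replacement index that covers only column a.
CREATE SEARCH INDEX my_index
ON dataset.simple_table(a);
```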

What's next