Manage vector indexes

This document describes how to create and manage vector indexes.

A vector index is a data structure designed to let the VECTOR_SEARCH function perform a more efficient vector search of embeddings. When VECTOR_SEARCH is able to use a vector index, the function uses the Approximate Nearest Neighbor search technique to help improve search performance, with the trade-off of reducing recall and thus returning more approximate results.

Roles and permissions

To create a vector index, you need the bigquery.tables.createIndex IAM permission on the table where you're creating the index. To drop a vector index, you need the bigquery.tables.deleteIndex permission. Each of the following predefined IAM roles includes the permissions that you need to work with vector indexes:

BigQuery Data Owner (roles/bigquery.dataOwner)
BigQuery Data Editor (roles/bigquery.dataEditor)

Create a vector index

To create a vector index, use the CREATE VECTOR INDEX data definition language (DDL) statement:

Go to the BigQuery page.

Go to BigQuery
In the query editor, select an index type and run on the of the following SQL statements:

To create a IVF vector index:
```
CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME
ON DATASET_NAME.TABLE_NAME(COLUMN_NAME)
STORING(STORED_COLUMN_NAME [, ...])
OPTIONS(index_type = "IVF",
  distance_type = "DISTANCE_TYPE",
  ivf_options = '{"num_lists":NUM_LISTS}')
```
To create a TreeAH vector index:
```
CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME
ON DATASET_NAME.TABLE_NAME(COLUMN_NAME)
OPTIONS(index_type = "TREE_AH",
  distance_type = "DISTANCE_TYPE",
  tree_ah_options = '{"leaf_node_embedding_count":LEAF_NODE_EMBEDDING_COUNT,
    "normalization_type":"NORMALIZATION_TYPE"}')
```
See Choose a vector index type for more information.

Replace the following:
- INDEX_NAME: the name of the vector index you're creating. Since the index is always created in the same project and dataset as the base table, there is no need to specify these in the name.
- DATASET_NAME: the name of the dataset that contains the table.
- TABLE_NAME: the name of the table that contains the column with embeddings data.
- COLUMN_NAME: the name of a column that contains the embeddings data. The column must have a type of ARRAY<FLOAT64>. The column can't have any child fields. All elements in the array must be non-NULL, and all values in the column must have the same array dimensions.
- STORED_COLUMN_NAME: the name of a top-level column in the table to store in the vector index. The column type can't be RANGE. Stored columns are not used if the table has a row-level access policy or the column has a policy tag. For information about how to enable stored columns, see Store columns and pre-filter.
- DISTANCE_TYPE: specifies the default distance type to use when performing a vector search using this index. The supported values are EUCLIDEAN, COSINE, and DOT_PRODUCT. EUCLIDEAN is the default.
  
  The index creation itself always uses EUCLIDEAN distance for training but the distance used in the VECTOR_SEARCH function can be different.
  
  If you specify a value for the distance_type argument of the VECTOR_SEARCH function, that value is used instead of the DISTANCE_TYPE value.
- NUM_LISTS: an INT64 value less than or equal to 5,000 that determines how many lists the IVF algorithm creates. The IVF algorithm divides the whole data space into a number of lists equal to NUM_LISTS, with data points that are closer to each other being more likely to be put on the same list. If NUM_LISTS is small, you have fewer lists with more data points, while a larger value creates more lists with fewer data points.
  
  You can use NUM_LISTS in combination with the fraction_lists_to_search argument in the VECTOR_SEARCH function to create an efficient vector search. If you have data that is distributed in many small groups in the embedding space, then specify a high NUM_LISTS to create an index with more lists and specify a lower fraction_lists_to_search value to scan fewer lists in vector search. Use a lower NUM_LISTS and a higher fraction_lists_to_search value when your data is distributed in fewer, larger groups. Using a high num_lists value might make the vector index take longer to build.
  
  If you don't specify NUM_LISTS, BigQuery calculates an appropriate value.
- LEAF_NODE_EMBEDDING_COUNT: an INT64 value greater than or equal to 500 that specifies the approximate number of vectors in each leaf node of the tree that the TreeAH algorithm creates. The TreeAH algorithm divides the whole data space into a number of lists, with each list containing approximately LEAF_NODE_EMBEDDING_COUNT data points. A lower value creates more lists with fewer data points, while a larger value creates fewer lists with more data points. The default is 1,000.
- NORMALIZATION_TYPE: a STRING value. The supported values are NONE or L2. The default is NONE. Normalization happens before any processing, for both the base table data and the query data, but doesn't modify the embedding column COLUMN_NAME in TABLE_NAME. Depending on the dataset, the embedding model, and the distance type used during VECTOR_SEARCH, normalizing the embeddings might improve recall.

The following example creates a vector index on the embedding column of my_table:

CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF');

The following example creates a vector index on the embedding column of my_table, and specifies the distance type to use and the IVF options:

CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF', distance_type = 'COSINE',
ivf_options = '{"num_lists": 2500}')

The following example creates a vector index on the embedding column of my_table, and specifies the distance type to use and the TreeAH options:

CREATE TABLE my_dataset.my_table(id INT64, embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS (index_type = 'TREE_AH', distance_type = 'EUCLIDEAN',
tree_ah_options = '{"normalization_type": "L2"}');

Choose a vector index type

BigQuery offers two vector index types.

IVF Index

IVF is an inverted file index, which uses a k-means algorithm to cluster the vector data, and then partitions the vector data based on those clusters. When you use the VECTOR_SEARCH function to search the vector data, it can use these partitions to reduce the amount of data it needs to read in order to determine a result.

TreeAH Index

TreeAH is a type of vector index that uses Google's ScaNN algorithm. It works as follows:

The base table is divided into smaller, more manageable shards.
A clustering model is trained, with the number of clusters derived from the leaf_node_embedding_count option in tree_ah_options.
The vectors are product quantized and stored in the index tables.
During VECTOR_SEARCH, a candidate list for each query vector is efficiently computed using asymmetric hashing, which is hardware-optimized for approximate distance calculations. These candidates are then re-scored and re-ranked using exact embeddings.

The TreeAH algorithm is optimized for batch queries that process hundreds or more query vectors. The use of product quantization can significantly reduce latency and cost, potentially by orders of magnitude compared to IVF. However, due to increased overhead, the IVF algorithm might be better when you have a smaller number of query vectors.

We suggest you try the TreeAH index type if your use case meets the following criteria:

Your table contains 200 million rows or fewer.
You frequently execute large batch queries involving hundreds or more query vectors.

For small batch queries, VECTOR_SEARCH using TreeAH index type might fall back to brute force. In that case, a Vector Index Unused reason is populated to explain the reason.
Your workflow does not require the use of stored columns or pre-filtering. BigQuery treats pre-filters used with a TreeAH index as post-filters.

For all questions and concerns, notably on scalability, limits, and performance, please reach out to us at bq-vector-search@google.com.

Store columns and pre-filter

To further improve the efficiency of your vector index, you can specify columns from your base table to store in your vector index. Using stored columns can optimize queries that call the VECTOR_SEARCH function in the following ways:

The VECTOR_SEARCH function outputs a struct called base that contains all columns from the base table. Without stored columns, a potentially expensive join is needed to retrieve the columns stored in base. If your query only selects stored columns from base, then BigQuery optimizes your query to eliminate that join.
Instead of searching an entire table, you can call the VECTOR_SEARCH function on a query statement that pre-filters the base table with a WHERE clause. If your table has an index and you filter on only stored columns, then BigQuery optimizes the query by filtering the data before searching and then using the index to search the smaller result set. If you filter on columns that aren't stored, then BigQuery applies the filter after the table is searched, or post-filters.

Post-filtering is less efficient and can cause fewer than top_k matches in the result set. In some cases, pre-filtering can also reduce the size of the result set. If this happens, try increasing the value of fraction_lists_to_search in your call to VECTOR_SEARCH.

To store columns, list them in the STORING clause of the CREATE VECTOR INDEX DDL statement. Storing columns increases the size of the vector index, so it's best to store only the most frequently used or filtered columns.

The following example creates a vector index with stored columns and then explains the behavior of different types of vector searches:

-- Create a table that contains an embedding.
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>, type STRING, creation_time DATETIME, id INT64);

-- Create a query table that contains an embedding.
CREATE TABLE my_dataset.my_testdata(embedding ARRAY<FLOAT64>, test_id INT64);

-- Create a vector index with stored columns.
CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
STORING (type, creation_time)
OPTIONS (index_type = 'IVF');

-- Select only stored columns from a vector search to avoid an expensive join.
SELECT query, base.type, distance
FROM
  VECTOR_SEARCH(
    TABLE my_dataset.my_table,
    'embedding'
    TABLE my_dataset.my_testdata);

-- Pre-filter on a stored column. The index speeds up the query.
SELECT *
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE type = 'animal'),
    'embedding',
    TABLE my_dataset.my_testdata);

-- Filter on a column that isn't stored. The index is used to search the
-- entire table, and then the results are post-filtered. You might see fewer
-- than 5 matches returned for some embeddings.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE id = 123),
    'embedding',
    TABLE my_dataset.my_testdata,
    top_k => 5);

-- Use post-filters. The index is used, but the entire table is searched and
-- the post-filtering might reduce the number of results.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    TABLE my_dataset.my_table,
    'embedding',
    TABLE my_dataset.my_testdata,
    top_k => 5)
WHERE base.type = 'animal';

-- Use pre-filters with brute force. The data is filtered and then searched
-- with brute force for exact results.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE id = 123),
    'embedding',
    TABLE my_dataset.my_testdata,
    options => '{"use_brute_force":true}');

Limitations

You can't use logical views in your pre-filter.
If your pre-filter contains a subquery, it might interfere with index usage.
If the mode, type, or schema of a column is changed in the base table, and if it is a stored column in the vector index, then there can be a delay before that change is reflected in the vector index. Until the updates have been applied to the index, the vector search queries use the modified stored columns from the base table.
If you select a column of type STRUCT from the query output of a VECTOR_SEARCH query on a table that has an index with stored columns, then the whole query might fail.
Stored columns are not supported for TreeAH indexes.

Understand index refresh

Vector indexes are fully managed by BigQuery and automatically refreshed when the indexed table changes. If you delete the indexed column in a table or rename the table itself, the vector index is deleted automatically.

If you create a vector index on a table that is smaller than 10 MB, then the vector index isn't populated. Similarly, if you delete data from an indexed table and the table size falls below 10 MB, then the vector index is temporarily disabled. In this case, vector search queries don't use the index and the indexUnusedReasons code in the vectorSearchStatistics section of the Job resource is BASE_TABLE_TOO_SMALL. Without the index, VECTOR_SEARCH automatically falls back to using brute force to find the nearest neighbors of embeddings.

Queries that use the VECTOR_SEARCH function always return correct results, even if some portion of data is not yet indexed.

Get information about vector indexes

You can verify the existence and the readiness of a vector index by querying INFORMATION_SCHEMA. The following views contain metadata on vector indexes:

The INFORMATION_SCHEMA.VECTOR_INDEXES view has information about the vector indexes in a dataset.

After the CREATE VECTOR INDEX statement completes, the index must still be populated before you can use it. You can use the last_refresh_time and coverage_percentage columns to verify the readiness of a vector index. If the vector index isn't ready, you can still use the VECTOR_SEARCH function on a table, it just might run more slowly without the index.
The INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS view has information about the vector-indexed columns for all tables in a dataset.
The INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS view has information about the options used by the vector indexes in a dataset.

Vector index examples

The following example shows all active vector indexes on tables in the dataset my_dataset, located in the project my_project. It includes their names, the DDL statements used to create them, and their coverage percentage. If an indexed base table is less than 10 MB, then its index is not populated, in which case the coverage_percentage value is 0.

SELECT table_name, index_name, ddl, coverage_percentage
FROM my_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES
WHERE index_status = 'ACTIVE';

The result is similar to the following:

+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table_name | index_name | ddl                                                                                             | coverage_percentage |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table1     | indexa     | CREATE VECTOR INDEX `indexa` ON `my_project.my_dataset.table1`(embeddings)                      | 100                 |
|            |            | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"num_lists": 100}')   |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table2     | indexb     | CREATE VECTOR INDEX `indexb` ON `my_project.my_dataset.table2`(vectors)                         | 42                  |
|            |            | OPTIONS (distance_type = 'COSINE', index_type = 'IVF', ivf_options = '{"num_lists": 500}')      |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+
| table3     | indexc     | CREATE VECTOR INDEX `indexc` ON `my_project.my_dataset.table3`(vectors)                         | 98                  |
|            |            | OPTIONS (distance_type = 'DOT_PRODUCT', index_type = 'TREE_AH',                                 |                     |
|            |            |          tree_ah_options = '{"leaf_node_embedding_count": 1000, "normalization_type": "NONE"}') |                     |
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+

Vector index columns examples

The following query extracts information on columns that have vector indexes:

SELECT table_name, index_name, index_column_name, index_field_path
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;

The result is similar to the following:

+------------+------------+-------------------+------------------+
| table_name | index_name | index_column_name | index_field_path |
+------------+------------+-------------------+------------------+
| table1     | indexa     | embeddings        | embeddings       |
| table2     | indexb     | vectors           | vectors          |
| table3     | indexc     | vectors           | vectors          |
+------------+------------+-------------------+------------------+

Vector index options examples

The following query extracts information on vector index options:

SELECT table_name, index_name, option_name, option_type, option_value
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;

The result is similar to the following:

+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table_name | index_name | option_name      | option_type      | option_value                                                      |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table1     | indexa     | index_type       | STRING           | IVF                                                               |
| table1     | indexa     | distance_type    | STRING           | EUCLIDEAN                                                         |
| table1     | indexa     | ivf_options      | STRING           | {"num_lists": 100}                                                |
| table2     | indexb     | index_type       | STRING           | IVF                                                               |
| table2     | indexb     | distance_type    | STRING           | COSINE                                                            |
| table2     | indexb     | ivf_options      | STRING           | {"num_lists": 500}                                                |
| table3     | indexc     | index_type       | STRING           | TREE_AH                                                           |
| table3     | indexc     | distance_type    | STRING           | DOT_PRODUCT                                                       |
| table3     | indexc     | tree_ah_options  | STRING           | {"leaf_node_embedding_count": 1000, "normalization_type": "NONE"} |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+

Vector index usage

Information on vector index usage is available in the job metadata of the job that ran the vector search query. You can view job metadata by using the Google Cloud console, the bq command-line tool, the BigQuery API, or the client libraries.

When you use the Google Cloud console, you can find vector index usage information in the Vector Index Usage Mode and Vector Index Unused Reasons fields.

When you use the bq tool or the BigQuery API, you can find vector index usage information in the VectorSearchStatistics section of the Job resource.

The index usage mode indicates whether a vector index was used by providing one of the following values:

UNUSED: No vector index was used.
PARTIALLY_USED: Some VECTOR_SEARCH functions in the query used vector indexes and some didn't.
FULLY_USED: Every VECTOR_SEARCH function in the query used a vector index.

When the index usage mode value is UNUSED or PARTIALLY_USED, the index unused reasons indicate why vector indexes weren't used in the query.

For example, the following results returned by bq show --format=prettyjson -j my_job_id shows that the index was not used because the use_brute_force option was specified in the VECTOR_SEARCH function:

"vectorSearchStatistics": {
  "indexUnusedReasons": [
    {
      "baseTable": {
        "datasetId": "my_dataset",
        "projectId": "my_project",
        "tableId": "my_table"
      },
      "code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION",
      "message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified."
    }
  ],
  "indexUsageMode": "UNUSED"
}

Index management options

To create indexes and have BigQuery maintain them, you have two options:

Use the default shared slot pool: When the data you plan to index is below your per-organization limit, you can use the free shared slot pool for index management.
Use your own reservation: To achieve more predictable and consistent indexing progress on your larger production workloads, you can use your own reservations for index management.

Use shared slots

If you have not configured your project to use a dedicated reservation for indexing, index management is handled in the free, shared slot pool, subject to the following constraints.

If you add data to a table which causes the total size of indexed tables to exceed your organization's limit, BigQuery pauses index management for all indexed tables. When this happens, the index_status field in the INFORMATION_SCHEMA.VECTOR_INDEXES view displays PENDING DISABLEMENT and the index is queued for deletion. While the index is pending disablement, it is still used in queries and you are charged for the index storage. After an index is deleted, the index_status field shows the index as TEMPORARILY DISABLED. In this state, queries don't use the index, and you are not charged for index storage. In this case, the IndexUnusedReason code is BASE_TABLE_TOO_LARGE.

If you delete data from the table and the total size of indexed tables falls below the per-organization limit, then index management is resumed for all indexed tables. The index_status field in the INFORMATION_SCHEMA.VECTOR_INDEXES view is ACTIVE, queries can use the index, and you are charged for the index storage.

BigQuery does not make guarantees about the available capacity of the shared pool or the throughput of indexing you see. For production applications, you might want to use dedicated slots for your index processing.

Use your own reservation

Instead of using the default shared slot pool, you can optionally designate your own reservation to index your tables. Using your own reservation ensures predictable and consistent performance of index-management jobs, such as creation, refresh, and background optimizations.

There are no table size limits when an indexing job runs in your reservation.
Using your own reservation gives you flexibility in your index management. If you need to create a very large index or make a major update to an indexed table, you can temporarily add more slots to the assignment.

To index the tables in a project with a designated reservation, create a reservation in the region where your tables are located. Then, assign the project to the reservation with the job_type set to BACKGROUND:

SQL

Use the CREATE ASSIGNMENT DDL statement.

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the query editor, enter the following statement:
```
CREATE ASSIGNMENT
  `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID`
OPTIONS (
  assignee = 'projects/PROJECT_ID',
  job_type = 'BACKGROUND');
```
Replace the following:
- ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
- LOCATION: the location of the reservation
- RESERVATION_NAME: the name of the reservation
- ASSIGNMENT_ID: the ID of the assignment
  
  The ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.
- PROJECT_ID: the ID of the project containing the tables to index. This project is assigned to the reservation.
Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq mk command:

bq mk \
    --project_id=ADMIN_PROJECT_ID \
    --location=LOCATION \
    --reservation_assignment \
    --reservation_id=RESERVATION_NAME \
    --assignee_id=PROJECT_ID \
    --job_type=BACKGROUND \
    --assignee_type=PROJECT

Replace the following:

ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
LOCATION: the location of the reservation
RESERVATION_NAME: the name of the reservation
PROJECT_ID: the ID of the project to assign to this reservation

View your indexing jobs

A new indexing job is created every time an index is created or updated on a single table. To view information about the job, query the INFORMATION_SCHEMA.JOBS* views. You can filter for indexing jobs by setting job_type IS NULL AND SEARCH(job_id, '`search_index`') in the WHERE clause of your query. The following example lists the five most recent indexing jobs in the project my_project:

SELECT *
FROM
 region-us.INFORMATION_SCHEMA.JOBS
WHERE
  project_id  = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
ORDER BY
 creation_time DESC
LIMIT 5;

Choose your reservation size

To choose the right number of slots for your reservation, you should consider when index-management jobs are run, how many slots they use, and what your usage looks like over time. BigQuery triggers an index-management job in the following situations:

You create an index on a table.
Data is modified in an indexed table.
The schema of a table changes and this affects which columns are indexed.
Index data and metadata are periodically optimized or updated.

The number of slots you need for an index-management job on a table depends on the following factors:

The size of the table
The rate of data ingestion to the table
The rate of DML statements applied to the table
The acceptable delay for building and maintaining the index
The complexity of the index, typically determined by attributes of the data, such as the number of duplicate terms

Monitor Usage and Progress

The best way to assess the number of slots you need to efficiently run your index-management jobs is to monitor your slot utilization and adjust the reservation size accordingly. The following query produces the daily slot usage for index-management jobs. Only the past 30 days are included in the region us-west1:

SELECT
  TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date,
  -- Aggregate total_slots_ms used for index-management jobs in a day and divide
  -- by the number of milliseconds in a day. This value is most accurate for
  -- days with consistent slot usage.
  SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage
FROM
  `region-us-west1`.INFORMATION_SCHEMA.JOBS job
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
GROUP BY
  usage_date
ORDER BY
  usage_date DESC
limit 30;

When there are insufficient slots to run index-management jobs, an index can become out of sync with its table and indexing jobs might fail. In this case, BigQuery rebuilds the index from scratch. To avoid having an out-of-sync index, ensure you have enough slots to support index updates from data ingestion and optimization. For more information on monitoring slot usage, see admin resource charts.

Delete a vector index

When you no longer need a vector index or want to change which column is indexed on a table, you can delete the index on that table by using the DROP VECTOR INDEX DDL statement.

For example:

DROP VECTOR INDEX my_index ON my_dataset.indexed_table;

If an indexed table is deleted, its index is deleted automatically.

What's next

For an overview of vector index use cases, pricing, and limitations, see the Introduction to vector search.
Try the Search embeddings with vector search tutorial.