Manage vector indexes
This document describes how to create and manage vector indexes.
A vector index is a data structure designed to let the VECTOR_SEARCH function perform a more efficient vector search of embeddings. When VECTOR_SEARCH is able to use a vector index, the function uses the Approximate Nearest Neighbor search technique to help improve search performance, with the trade-off of reducing recall and thus returning more approximate results.
Roles and permissions
To create a vector index, you need the bigquery.tables.createIndex IAM permission on the table where you're creating the index. To drop a vector index, you need the bigquery.tables.deleteIndex permission. Each of the following predefined IAM roles includes the permissions that you need to work with vector indexes:
- BigQuery Data Owner (roles/bigquery.dataOwner)
- BigQuery Data Editor (roles/bigquery.dataEditor)
Create a vector index
To create a vector index, use the CREATE VECTOR INDEX data definition language (DDL) statement:
Go to the BigQuery page.
In the query editor, select an index type and run one of the following SQL statements:
To create an IVF vector index:
CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME
ON DATASET_NAME.TABLE_NAME(COLUMN_NAME)
STORING(STORED_COLUMN_NAME [, ...])
OPTIONS(index_type = "IVF",
  distance_type = "DISTANCE_TYPE",
  ivf_options = '{"num_lists":NUM_LISTS}')
To create a TreeAH vector index:
CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME
ON DATASET_NAME.TABLE_NAME(COLUMN_NAME)
OPTIONS(index_type = "TREE_AH",
  distance_type = "DISTANCE_TYPE",
  tree_ah_options = '{"leaf_node_embedding_count":LEAF_NODE_EMBEDDING_COUNT,
    "normalization_type":"NORMALIZATION_TYPE"}')
See Choose a vector index type for more information.
Replace the following:
- INDEX_NAME: the name of the vector index you're creating. Since the index is always created in the same project and dataset as the base table, there is no need to specify these in the name.
- DATASET_NAME: the name of the dataset that contains the table.
- TABLE_NAME: the name of the table that contains the column with embeddings data.
- COLUMN_NAME: the name of a column that contains the embeddings data. The column must have a type of ARRAY<FLOAT64>. The column can't have any child fields. All elements in the array must be non-NULL, and all values in the column must have the same array dimensions.
- STORED_COLUMN_NAME: the name of a top-level column in the table to store in the vector index. The column type can't be RANGE. Stored columns are not used if the table has a row-level access policy or the column has a policy tag. For information about how to enable stored columns, see Store columns and pre-filter.
- DISTANCE_TYPE: specifies the default distance type to use when performing a vector search using this index. The supported values are EUCLIDEAN, COSINE, and DOT_PRODUCT. EUCLIDEAN is the default.
  The index creation itself always uses EUCLIDEAN distance for training, but the distance used in the VECTOR_SEARCH function can be different.
  If you specify a value for the distance_type argument of the VECTOR_SEARCH function, that value is used instead of the DISTANCE_TYPE value.
- NUM_LISTS: an INT64 value less than or equal to 5,000 that determines how many lists the IVF algorithm creates. The IVF algorithm divides the whole data space into a number of lists equal to NUM_LISTS, with data points that are closer to each other being more likely to be put on the same list. If NUM_LISTS is small, you have fewer lists with more data points, while a larger value creates more lists with fewer data points.
  You can use NUM_LISTS in combination with the fraction_lists_to_search argument in the VECTOR_SEARCH function to create an efficient vector search. If you have data that is distributed in many small groups in the embedding space, then specify a high NUM_LISTS to create an index with more lists and specify a lower fraction_lists_to_search value to scan fewer lists in vector search. Use a lower NUM_LISTS and a higher fraction_lists_to_search value when your data is distributed in fewer, larger groups. Using a high num_lists value might make the vector index take longer to build.
  If you don't specify NUM_LISTS, BigQuery calculates an appropriate value.
- LEAF_NODE_EMBEDDING_COUNT: an INT64 value greater than or equal to 500 that specifies the approximate number of vectors in each leaf node of the tree that the TreeAH algorithm creates. The TreeAH algorithm divides the whole data space into a number of lists, with each list containing approximately LEAF_NODE_EMBEDDING_COUNT data points. A lower value creates more lists with fewer data points, while a larger value creates fewer lists with more data points. The default is 1,000.
- NORMALIZATION_TYPE: a STRING value. The supported values are NONE or L2. The default is NONE. Normalization happens before any processing, for both the base table data and the query data, but doesn't modify the embedding column COLUMN_NAME in TABLE_NAME. Depending on the dataset, the embedding model, and the distance type used during VECTOR_SEARCH, normalizing the embeddings might improve recall.
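As a rough illustration of how the placeholders above fit together, the following Python sketch assembles an IVF CREATE VECTOR INDEX statement as a string. The helper name build_ivf_index_ddl and its parameters are hypothetical and not part of any BigQuery client library. The lower bound of 1 on num_lists is an assumption (the text above states only the upper bound of 5,000), and identifiers are inserted as-is without quoting or escaping.

```python
import json


def build_ivf_index_ddl(index_name, dataset, table, column,
                        distance_type="EUCLIDEAN", num_lists=None,
                        stored_columns=()):
    """Return a CREATE VECTOR INDEX statement for an IVF index (illustrative only)."""
    if distance_type not in ("EUCLIDEAN", "COSINE", "DOT_PRODUCT"):
        raise ValueError(f"unsupported distance type: {distance_type}")
    # Upper bound comes from the documentation; the lower bound of 1 is an assumption.
    if num_lists is not None and not 1 <= num_lists <= 5000:
        raise ValueError("num_lists must be between 1 and 5,000")

    options = ["index_type = 'IVF'", f"distance_type = '{distance_type}'"]
    if num_lists is not None:
        # ivf_options is a JSON-formatted string literal.
        options.append("ivf_options = '" + json.dumps({"num_lists": num_lists}) + "'")

    # The STORING clause is optional and lists top-level columns to keep in the index.
    storing = f" STORING({', '.join(stored_columns)})" if stored_columns else ""
    return (f"CREATE VECTOR INDEX {index_name} "
            f"ON {dataset}.{table}({column}){storing} "
            f"OPTIONS({', '.join(options)})")


print(build_ivf_index_ddl("my_index", "my_dataset", "my_table", "embedding",
                          distance_type="COSINE", num_lists=2500))
```

Omitting num_lists leaves ivf_options out of the statement, which matches the case where BigQuery calculates a value itself.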
The following example creates a vector index on the embedding column of my_table:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF');
The following example creates a vector index on the embedding column of my_table, and specifies the distance type to use and the IVF options:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS(index_type = 'IVF', distance_type = 'COSINE',
  ivf_options = '{"num_lists": 2500}');
The following example creates a vector index on the embedding column of my_table, and specifies the distance type to use and the TreeAH options:
CREATE TABLE my_dataset.my_table(id INT64, embedding ARRAY<FLOAT64>);

CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
OPTIONS (index_type = 'TREE_AH', distance_type = 'EUCLIDEAN',
  tree_ah_options = '{"normalization_type": "L2"}');
Choose a vector index type
BigQuery offers two vector index types.
IVF Index
IVF is an inverted file index, which uses a k-means algorithm to cluster the vector data, and then partitions the vector data based on those clusters. When you use the VECTOR_SEARCH function to search the vector data, it can use these partitions to reduce the amount of data it needs to read in order to determine a result.
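The inverted-file idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (a basic k-means loop, Euclidean distance, in-memory tuples), not BigQuery's implementation. All function names are hypothetical, and the fraction_lists_to_search parameter only mirrors the role of the VECTOR_SEARCH option of the same name.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Put each point on the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        lists[nearest].append(p)
    return lists


def build_ivf(points, num_lists, iterations=10, seed=0):
    """Cluster points with a basic k-means loop; return (centroids, lists)."""
    centroids = random.Random(seed).sample(points, num_lists)
    for _ in range(iterations):
        lists = assign(points, centroids)
        # Move each centroid to the mean of its list; keep it if the list is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[c]
            for c, members in enumerate(lists)
        ]
    return centroids, assign(points, centroids)


def ivf_search(query, centroids, lists, top_k, fraction_lists_to_search):
    """Scan only the lists whose centroids are closest to the query."""
    n_probe = max(1, math.ceil(fraction_lists_to_search * len(centroids)))
    closest = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))[:n_probe]
    candidates = [p for c in closest for p in lists[c]]
    return sorted(candidates, key=lambda p: dist(query, p))[:top_k]


def brute_force(query, points, top_k):
    """Exact search: compare the query with every point."""
    return sorted(points, key=lambda p: dist(query, p))[:top_k]


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(500)]
centroids, lists = build_ivf(points, num_lists=10)
query = (0.5, 0.5)
exact = brute_force(query, points, top_k=10)
approx = ivf_search(query, centroids, lists, top_k=10, fraction_lists_to_search=0.2)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall with 20% of lists scanned: {recall:.2f}")
```

Scanning every list (a fraction of 1.0) reproduces the brute-force result in this sketch. Smaller fractions read less data but can miss neighbors that sit in lists that were not scanned, which is the recall trade-off.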
TreeAH Index
TreeAH is a type of vector index that uses Google's ScaNN algorithm. It works as follows:
- The base table is divided into smaller, more manageable shards.
- A clustering model is trained, with the number of clusters derived from the leaf_node_embedding_count option in tree_ah_options.
- The vectors are product quantized and stored in the index tables.
- During VECTOR_SEARCH, a candidate list for each query vector is efficiently computed using asymmetric hashing, which is hardware-optimized for approximate distance calculations. These candidates are then re-scored and re-ranked using exact embeddings.
The TreeAH algorithm is optimized for batch queries that process hundreds or more query vectors. The use of product quantization can significantly reduce latency and cost, potentially by orders of magnitude compared to IVF. However, due to increased overhead, the IVF algorithm might be better when you have a smaller number of query vectors.
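To make the quantize, shortlist, and re-rank flow more concrete, here is a toy Python sketch of product quantization with asymmetric distance lookups. It is an illustration of the general technique only, not ScaNN or BigQuery's implementation. The codebooks are built by sampling data subvectors rather than by training a clustering model, vector lengths are assumed to be divisible by the number of blocks m, and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec, m):
    """Split a vector into m equal-length subvectors (len(vec) must divide by m)."""
    step = len(vec) // m
    return [tuple(vec[i * step:(i + 1) * step]) for i in range(m)]


def train_codebooks(vectors, m, codes_per_block, seed=0):
    """Crude codebooks: sample codes_per_block data subvectors for each block."""
    rng = random.Random(seed)
    blocks = [split(v, m) for v in vectors]
    return [rng.sample([b[j] for b in blocks], codes_per_block) for j in range(m)]


def encode(vec, codebooks):
    """Replace each subvector with the index of its nearest codebook entry."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]


def asymmetric_search(query, codes, codebooks, vectors, n_candidates, top_k):
    """Shortlist with quantized distances, then re-rank with exact vectors."""
    # The query stays unquantized: precompute its distance to every codebook entry.
    tables = [[sq_dist(sub, entry) for entry in cb]
              for sub, cb in zip(split(query, len(codebooks)), codebooks)]
    # Approximate distance to a stored vector = one table lookup per block.
    ranked = sorted(range(len(codes)),
                    key=lambda i: sum(t[c] for t, c in zip(tables, codes[i])))
    candidates = ranked[:n_candidates]
    # Re-score the short candidate list with the exact embeddings.
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:top_k]


rng = random.Random(7)
vectors = [tuple(rng.random() for _ in range(4)) for _ in range(200)]
codebooks = train_codebooks(vectors, m=2, codes_per_block=8)
codes = [encode(v, codebooks) for v in vectors]
print(asymmetric_search((0.5, 0.5, 0.5, 0.5), codes, codebooks, vectors,
                        n_candidates=20, top_k=5))
```

The approximate pass touches only small integer codes and precomputed tables, which is why a batch of many query vectors can share the index work. The exact re-ranking step then corrects the order within the shortlist.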
We suggest you try the TreeAH index type if your use case meets the following criteria:
- Your table contains 200 million rows or fewer.
- You frequently execute large batch queries involving hundreds or more query vectors.
  For small batch queries, VECTOR_SEARCH using the TreeAH index type might fall back to brute force. In that case, a Vector Index Unused reason is populated to explain the reason.
- Your workflow does not require the use of stored columns or pre-filtering. BigQuery treats pre-filters used with a TreeAH index as post-filters.
For all questions and concerns, notably on scalability, limits, and performance, please reach out to us at bq-vector-search@google.com.
Store columns and pre-filter
To further improve the efficiency of your vector index, you can specify columns from your base table to store in your vector index. Using stored columns can optimize queries that call the VECTOR_SEARCH function in the following ways:
- The VECTOR_SEARCH function outputs a struct called base that contains all columns from the base table. Without stored columns, a potentially expensive join is needed to retrieve the columns stored in base. If your query only selects stored columns from base, then BigQuery optimizes your query to eliminate that join.
- Instead of searching an entire table, you can call the VECTOR_SEARCH function on a query statement that pre-filters the base table with a WHERE clause. If your table has an index and you filter on only stored columns, then BigQuery optimizes the query by filtering the data before searching and then using the index to search the smaller result set. If you filter on columns that aren't stored, then BigQuery applies the filter after the table is searched, or post-filters.
  Post-filtering is less efficient and can cause fewer than top_k matches in the result set. In some cases, pre-filtering can also reduce the size of the result set. If this happens, try increasing the value of fraction_lists_to_search in your call to VECTOR_SEARCH.
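The difference between pre-filtering and post-filtering can be seen in a small Python sketch. It uses an exact in-memory search and made-up rows, so it shows only the ordering of the filter and search steps, not how BigQuery executes them. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_search(rows, query, top_k):
    """Return the top_k rows whose embeddings are closest to the query."""
    return sorted(rows, key=lambda r: sq_dist(r["embedding"], query))[:top_k]


def pre_filter_search(rows, query, top_k, predicate):
    """Filter first, then search the smaller set."""
    return top_k_search([r for r in rows if predicate(r)], query, top_k)


def post_filter_search(rows, query, top_k, predicate):
    """Search everything, then drop matches that fail the filter."""
    return [r for r in top_k_search(rows, query, top_k) if predicate(r)]


# Made-up rows: three animals and two plants.
rows = [
    {"id": 1, "type": "animal", "embedding": (0.0, 0.0)},
    {"id": 2, "type": "plant", "embedding": (0.1, 0.0)},
    {"id": 3, "type": "plant", "embedding": (0.2, 0.0)},
    {"id": 4, "type": "animal", "embedding": (0.9, 0.0)},
    {"id": 5, "type": "animal", "embedding": (1.0, 0.0)},
]


def is_animal(row):
    return row["type"] == "animal"


query = (0.0, 0.0)
print([r["id"] for r in pre_filter_search(rows, query, 3, is_animal)])   # [1, 4, 5]
print([r["id"] for r in post_filter_search(rows, query, 3, is_animal)])  # [1]
```

With top_k set to 3, the pre-filtered search returns three animals, while the post-filtered search finds the three nearest rows overall and then discards the two plants, leaving a single match.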
To store columns, list them in the STORING clause of the CREATE VECTOR INDEX DDL statement. Storing columns increases the size of the vector index, so it's best to store only the most frequently used or filtered columns.
The following example creates a vector index with stored columns and then explains the behavior of different types of vector searches:
-- Create a table that contains an embedding.
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>, type STRING, creation_time DATETIME, id INT64);

-- Create a query table that contains an embedding.
CREATE TABLE my_dataset.my_testdata(embedding ARRAY<FLOAT64>, test_id INT64);

-- Create a vector index with stored columns.
CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding)
STORING (type, creation_time)
OPTIONS (index_type = 'IVF');

-- Select only stored columns from a vector search to avoid an expensive join.
SELECT query, base.type, distance
FROM
  VECTOR_SEARCH(
    TABLE my_dataset.my_table,
    'embedding',
    TABLE my_dataset.my_testdata);

-- Pre-filter on a stored column. The index speeds up the query.
SELECT *
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE type = 'animal'),
    'embedding',
    TABLE my_dataset.my_testdata);

-- Filter on a column that isn't stored. The index is used to search the
-- entire table, and then the results are post-filtered. You might see fewer
-- than 5 matches returned for some embeddings.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE id = 123),
    'embedding',
    TABLE my_dataset.my_testdata,
    top_k => 5);

-- Use post-filters. The index is used, but the entire table is searched and
-- the post-filtering might reduce the number of results.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    TABLE my_dataset.my_table,
    'embedding',
    TABLE my_dataset.my_testdata,
    top_k => 5)
WHERE base.type = 'animal';

-- Use pre-filters with brute force. The data is filtered and then searched
-- with brute force for exact results.
SELECT query.test_id, base.type, distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM my_dataset.my_table WHERE id = 123),
    'embedding',
    TABLE my_dataset.my_testdata,
    options => '{"use_brute_force":true}');
Limitations
- You can't use logical views in your pre-filter.
- If your pre-filter contains a subquery, it might interfere with index usage.
- If the mode, type, or schema of a column is changed in the base table, and if it is a stored column in the vector index, then there can be a delay before that change is reflected in the vector index. Until the updates have been applied to the index, the vector search queries use the modified stored columns from the base table.
- If you select a column of type STRUCT from the query output of a VECTOR_SEARCH query on a table that has an index with stored columns, then the whole query might fail.
- Stored columns are not supported for TreeAH indexes.
Understand index refresh
Vector indexes are fully managed by BigQuery and automatically refreshed when the indexed table changes. If you delete the indexed column in a table or rename the table itself, the vector index is deleted automatically.
If you create a vector index on a table that is smaller than 10 MB, then the vector index isn't populated. Similarly, if you delete data from an indexed table and the table size falls below 10 MB, then the vector index is temporarily disabled. In this case, vector search queries don't use the index and the indexUnusedReasons code in the vectorSearchStatistics section of the Job resource is BASE_TABLE_TOO_SMALL. Without the index, VECTOR_SEARCH automatically falls back to using brute force to find the nearest neighbors of embeddings.
Queries that use the VECTOR_SEARCH function always return correct results, even if some portion of data is not yet indexed.
Get information about vector indexes
You can verify the existence and the readiness of a vector index by querying INFORMATION_SCHEMA. The following views contain metadata on vector indexes:
- The INFORMATION_SCHEMA.VECTOR_INDEXES view has information about the vector indexes in a dataset.
  After the CREATE VECTOR INDEX statement completes, the index must still be populated before you can use it. You can use the last_refresh_time and coverage_percentage columns to verify the readiness of a vector index. If the vector index isn't ready, you can still use the VECTOR_SEARCH function on a table; it just might run more slowly without the index.
- The INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS view has information about the vector-indexed columns for all tables in a dataset.
- The INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS view has information about the options used by the vector indexes in a dataset.
Vector index examples
The following example shows all active vector indexes on tables in the dataset my_dataset, located in the project my_project. It includes their names, the DDL statements used to create them, and their coverage percentage. If an indexed base table is less than 10 MB, then its index is not populated, in which case the coverage_percentage value is 0.
SELECT table_name, index_name, ddl, coverage_percentage
FROM my_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES
WHERE index_status = 'ACTIVE';
The result is similar to the following:
+------------+------------+-----------------------------------------------------------------------------------------------+---------------------+
| table_name | index_name | ddl                                                                                           | coverage_percentage |
+------------+------------+-----------------------------------------------------------------------------------------------+---------------------+
| table1     | indexa     | CREATE VECTOR INDEX `indexa` ON `my_project.my_dataset.table1`(embeddings)                    | 100                 |
|            |            | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"num_lists": 100}') |                     |
+------------+------------+-----------------------------------------------------------------------------------------------+---------------------+
| table2     | indexb     | CREATE VECTOR INDEX `indexb` ON `my_project.my_dataset.table2`(vectors)                       | 42                  |
|            |            | OPTIONS (distance_type = 'COSINE', index_type = 'IVF', ivf_options = '{"num_lists": 500}')    |                     |
+------------+------------+-----------------------------------------------------------------------------------------------+---------------------+
| table3     | indexc     | CREATE VECTOR INDEX `indexc` ON `my_project.my_dataset.table3`(vectors)                       | 98                  |
|            |            | OPTIONS (distance_type = 'DOT_PRODUCT', index_type = 'TREE_AH',                               |                     |
|            |            | tree_ah_options = '{"leaf_node_embedding_count": 1000, "normalization_type": "NONE"}')        |                     |
+------------+------------+-----------------------------------------------------------------------------------------------+---------------------+
Vector index columns examples
The following query extracts information on columns that have vector indexes:
SELECT table_name, index_name, index_column_name, index_field_path
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;
The result is similar to the following:
+------------+------------+-------------------+------------------+
| table_name | index_name | index_column_name | index_field_path |
+------------+------------+-------------------+------------------+
| table1     | indexa     | embeddings        | embeddings       |
| table2     | indexb     | vectors           | vectors          |
| table3     | indexc     | vectors           | vectors          |
+------------+------------+-------------------+------------------+
Vector index options examples
The following query extracts information on vector index options:
SELECT table_name, index_name, option_name, option_type, option_value
FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;
The result is similar to the following:
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table_name | index_name | option_name      | option_type      | option_value                                                      |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
| table1     | indexa     | index_type       | STRING           | IVF                                                               |
| table1     | indexa     | distance_type    | STRING           | EUCLIDEAN                                                         |
| table1     | indexa     | ivf_options      | STRING           | {"num_lists": 100}                                                |
| table2     | indexb     | index_type       | STRING           | IVF                                                               |
| table2     | indexb     | distance_type    | STRING           | COSINE                                                            |
| table2     | indexb     | ivf_options      | STRING           | {"num_lists": 500}                                                |
| table3     | indexc     | index_type       | STRING           | TREE_AH                                                           |
| table3     | indexc     | distance_type    | STRING           | DOT_PRODUCT                                                       |
| table3     | indexc     | tree_ah_options  | STRING           | {"leaf_node_embedding_count": 1000, "normalization_type": "NONE"} |
+------------+------------+------------------+------------------+-------------------------------------------------------------------+
Vector index usage
Information on vector index usage is available in the job metadata of the job that ran the vector search query. You can view job metadata by using the Google Cloud console, the bq command-line tool, the BigQuery API, or the client libraries.
When you use the Google Cloud console, you can find vector index usage information in the Vector Index Usage Mode and Vector Index Unused Reasons fields.
When you use the bq tool or the BigQuery API, you can find vector index usage information in the VectorSearchStatistics section of the Job resource.
The index usage mode indicates whether a vector index was used by providing one of the following values:
- UNUSED: No vector index was used.
- PARTIALLY_USED: Some VECTOR_SEARCH functions in the query used vector indexes and some didn't.
- FULLY_USED: Every VECTOR_SEARCH function in the query used a vector index.
When the index usage mode value is UNUSED or PARTIALLY_USED, the index unused reasons indicate why vector indexes weren't used in the query.
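If you collect job metadata programmatically, a small helper can turn the usage mode and unused reasons into a compact summary. The following Python sketch assumes you have already extracted the vectorSearchStatistics object from the job JSON into a dict. The function name is hypothetical, and the sample message text is made up for illustration.

```python
import json


def summarize_index_usage(vector_search_statistics):
    """Return (usage_mode, [(table_path, code, message), ...]) from the stats dict."""
    mode = vector_search_statistics.get("indexUsageMode")
    reasons = []
    for reason in vector_search_statistics.get("indexUnusedReasons", []):
        table = reason.get("baseTable", {})
        # Join the table reference into project.dataset.table form.
        table_path = ".".join(
            table.get(key, "?") for key in ("projectId", "datasetId", "tableId"))
        reasons.append((table_path, reason.get("code"), reason.get("message")))
    return mode, reasons


# Sample fragment in the shape of the vectorSearchStatistics section.
# The message text here is invented for the example.
sample = json.loads("""
{
  "indexUnusedReasons": [
    {
      "baseTable": {
        "datasetId": "my_dataset",
        "projectId": "my_project",
        "tableId": "my_table"
      },
      "code": "BASE_TABLE_TOO_SMALL",
      "message": "Example message: the base table is too small."
    }
  ],
  "indexUsageMode": "UNUSED"
}
""")

mode, reasons = summarize_index_usage(sample)
print(mode)
for table_path, code, message in reasons:
    print(f"{table_path}: {code}")
```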
For example, the following results returned by bq show --format=prettyjson -j my_job_id show that the index was not used because the use_brute_force option was specified in the VECTOR_SEARCH function:
"vectorSearchStatistics": {
  "indexUnusedReasons": [
    {
      "baseTable": {
        "datasetId": "my_dataset",
        "projectId": "my_project",
        "tableId": "my_table"
      },
      "code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION",
      "message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified."
    }
  ],
  "indexUsageMode": "UNUSED"
}
Index management options
To create indexes and have BigQuery maintain them, you have two options:
- Use the default shared slot pool: When the data you plan to index is below your per-organization limit, you can use the free shared slot pool for index management.
- Use your own reservation: To achieve more predictable and consistent indexing progress on your larger production workloads, you can use your own reservations for index management.
Use shared slots
If you have not configured your project to use a dedicated reservation for indexing, index management is handled in the free, shared slot pool, subject to the following constraints.
If you add data to a table which causes the total size of indexed tables to exceed your organization's limit, BigQuery pauses index management for all indexed tables. When this happens, the index_status field in the INFORMATION_SCHEMA.VECTOR_INDEXES view displays PENDING DISABLEMENT and the index is queued for deletion. While the index is pending disablement, it is still used in queries and you are charged for the index storage. After an index is deleted, the index_status field shows the index as TEMPORARILY DISABLED. In this state, queries don't use the index, and you are not charged for index storage. In this case, the IndexUnusedReason code is BASE_TABLE_TOO_LARGE.
If you delete data from the table and the total size of indexed tables falls below the per-organization limit, then index management is resumed for all indexed tables. The index_status field in the INFORMATION_SCHEMA.VECTOR_INDEXES view is ACTIVE, queries can use the index, and you are charged for the index storage.
BigQuery does not make guarantees about the available capacity of the shared pool or the throughput of indexing you see. For production applications, you might want to use dedicated slots for your index processing.
Use your own reservation
Instead of using the default shared slot pool, you can optionally designate your own reservation to index your tables. Using your own reservation ensures predictable and consistent performance of index-management jobs, such as creation, refresh, and background optimizations.
- There are no table size limits when an indexing job runs in your reservation.
- Using your own reservation gives you flexibility in your index management. If you need to create a very large index or make a major update to an indexed table, you can temporarily add more slots to the assignment.
To index the tables in a project with a designated reservation, create a reservation in the region where your tables are located. Then, assign the project to the reservation with the job_type set to BACKGROUND:
SQL
Use the CREATE ASSIGNMENT DDL statement.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE ASSIGNMENT
  `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID`
OPTIONS (
  assignee = 'projects/PROJECT_ID',
  job_type = 'BACKGROUND');
Replace the following:
- ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
- LOCATION: the location of the reservation
- RESERVATION_NAME: the name of the reservation
- ASSIGNMENT_ID: the ID of the assignment
  The ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.
- PROJECT_ID: the ID of the project containing the tables to index. This project is assigned to the reservation.
Click Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq mk command:
bq mk \
    --project_id=ADMIN_PROJECT_ID \
    --location=LOCATION \
    --reservation_assignment \
    --reservation_id=RESERVATION_NAME \
    --assignee_id=PROJECT_ID \
    --job_type=BACKGROUND \
    --assignee_type=PROJECT
Replace the following:
- ADMIN_PROJECT_ID: the project ID of the administration project that owns the reservation resource
- LOCATION: the location of the reservation
- RESERVATION_NAME: the name of the reservation
- PROJECT_ID: the ID of the project to assign to this reservation
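The character rules for ASSIGNMENT_ID in the SQL option above can be checked locally before you run the statement. The following Python sketch encodes only those stated rules. The function name is hypothetical, it does not check uniqueness within the project and location, and it enforces no length limit because none is stated here.

```python
import re

# Lowercase letters, digits, and dashes; the first and last characters
# must be a lowercase letter or a digit.
_ASSIGNMENT_ID = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_assignment_id(assignment_id):
    """Return True if the ID satisfies the stated character rules."""
    return _ASSIGNMENT_ID.fullmatch(assignment_id) is not None


for candidate in ("index-assignment-1", "-leading-dash", "Upper-Case", "trailing-"):
    print(candidate, is_valid_assignment_id(candidate))
```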
View your indexing jobs
A new indexing job is created every time an index is created or updated on a single table. To view information about the job, query the INFORMATION_SCHEMA.JOBS* views. You can filter for indexing jobs by setting job_type IS NULL AND SEARCH(job_id, '`search_index`') in the WHERE clause of your query. The following example lists the five most recent indexing jobs in the project my_project:
SELECT *
FROM region-us.INFORMATION_SCHEMA.JOBS
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
ORDER BY creation_time DESC
LIMIT 5;
Choose your reservation size
To choose the right number of slots for your reservation, you should consider when index-management jobs are run, how many slots they use, and what your usage looks like over time. BigQuery triggers an index-management job in the following situations:
- You create an index on a table.
- Data is modified in an indexed table.
- The schema of a table changes and this affects which columns are indexed.
- Index data and metadata are periodically optimized or updated.
The number of slots you need for an index-management job on a table depends on the following factors:
- The size of the table
- The rate of data ingestion to the table
- The rate of DML statements applied to the table
- The acceptable delay for building and maintaining the index
- The complexity of the index, typically determined by attributes of the data, such as the number of duplicate terms
Monitor usage and progress
The best way to assess the number of slots you need to efficiently run your index-management jobs is to monitor your slot utilization and adjust the reservation size accordingly. The following query produces the daily slot usage for index-management jobs in the region us-west1. It returns at most the 30 most recent days that have indexing jobs:
SELECT
  TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date,
  -- Aggregate total_slot_ms used for index-management jobs in a day and divide
  -- by the number of milliseconds in a day. This value is most accurate for
  -- days with consistent slot usage.
  SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage
FROM
  `region-us-west1`.INFORMATION_SCHEMA.JOBS job
WHERE
  project_id = 'my_project'
  AND job_type IS NULL
  AND SEARCH(job_id, '`search_index`')
GROUP BY
  usage_date
ORDER BY
  usage_date DESC
LIMIT 30;
When there are insufficient slots to run index-management jobs, an index can become out of sync with its table and indexing jobs might fail. In this case, BigQuery rebuilds the index from scratch. To avoid having an out-of-sync index, ensure you have enough slots to support index updates from data ingestion and optimization. For more information on monitoring slot usage, see admin resource charts.
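The arithmetic behind average_daily_slot_usage is simple enough to reproduce by hand: sum the slot-milliseconds used in a day and divide by the number of milliseconds in a day. The following Python sketch performs the same calculation on made-up job rows. The function name and the sample values are hypothetical.

```python
from collections import defaultdict

MS_PER_DAY = 1000 * 60 * 60 * 24  # same divisor as in the SQL query


def average_daily_slot_usage(jobs):
    """jobs: iterable of (usage_date, total_slot_ms) pairs, one per indexing job."""
    totals = defaultdict(int)
    for usage_date, total_slot_ms in jobs:
        totals[usage_date] += total_slot_ms
    return {day: slot_ms / MS_PER_DAY for day, slot_ms in totals.items()}


# Hypothetical job rows; the dates and slot-millisecond values are made up.
jobs = [
    ("2024-05-01", 43_200_000),
    ("2024-05-01", 129_600_000),
    ("2024-05-02", 86_400_000),
]
print(average_daily_slot_usage(jobs))
# {'2024-05-01': 2.0, '2024-05-02': 1.0}
```

An average of 2.0 means the day's indexing work is equivalent to two slots running for the full 24 hours. If the work arrives in bursts, the peak demand can be well above this average, so treat the figure as a starting point for sizing rather than a guarantee.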
Delete a vector index
When you no longer need a vector index or want to change which column is indexed on a table, you can delete the index on that table by using the DROP VECTOR INDEX DDL statement.
For example:
DROP VECTOR INDEX my_index ON my_dataset.indexed_table;
If an indexed table is deleted, its index is deleted automatically.
What's next
- For an overview of vector index use cases, pricing, and limitations, see the Introduction to vector search.
- Try the Search embeddings with vector search tutorial.