Manage vector indexes
To provide feedback or request support for this feature, send email to bq-vector-search@google.com.
This document describes how to create and manage vector indexes.
A vector index is a data structure designed to let the
VECTOR_SEARCH
function
perform a more efficient vector search
of embeddings. When VECTOR_SEARCH
is able to use a vector index, the function
uses the
Approximate Nearest Neighbor
search technique to help improve search performance, with the trade-off of
reducing
recall
and thus returning more approximate results.
Roles and permissions
To create a vector index, you need the
bigquery.tables.createIndex
IAM permission
on the table where you're creating the index. To drop a vector index, you need
the bigquery.tables.deleteIndex
permission. Each of the following predefined
IAM roles includes the permissions that you need to work with
vector indexes:
- BigQuery Data Owner (
roles/bigquery.dataOwner
) - BigQuery Data Editor (
roles/bigquery.dataEditor
)
Create a vector index
To create a vector index, use the
CREATE VECTOR INDEX
data definition language (DDL) statement:
Go to the BigQuery page.
In the query editor, run the following SQL statement:
CREATE [ OR REPLACE ] VECTOR INDEX [ IF NOT EXISTS ] INDEX_NAME ON DATASET_NAME.TABLE_NAME(COLUMN_NAME) OPTIONS(index_type = INDEX_TYPE, distance_type = DISTANCE_TYPE, ivf_options = '{"num_lists":NUM_LISTS}')
Replace the following:
INDEX_NAME
: the name of the vector index you're creating. Since the index is always created in the same project and dataset as the base table, there is no need to specify these in the name.DATASET_NAME
: the name of the dataset that contains the table.TABLE_NAME
: the name of the table that contains the column with embeddings data.COLUMN_NAME
: the name of a column that contains the embeddings data. The column must have a type ofARRAY<FLOAT64>
. The column can't have any child fields. All elements in the array must be non-NULL
, and all values in the column must have the same array dimensions.INDEX_TYPE
: the algorithm to use to build the vector index.IVF
is the only supported value. SpecifyingIVF
builds the vector index as inverted file index (IVF). An IVF uses a k-means algorithm to cluster the vector data, and then partitions the vector data based on those clusters. When you use theVECTOR_SEARCH
function to search the vector data, it can use these partitions to reduce the amount of data it needs to read in order to determine a result.DISTANCE_TYPE
: specifies the default distance type to use when performing a vector search using this index. The supported values areEUCLIDEAN
andCOSINE
.EUCLIDEAN
is the default.The index creation itself always uses
EUCLIDEAN
distance for training but the distance used in theVECTOR_SEARCH
function can be different.If you specify a value for the
distance_type
argument of theVECTOR_SEARCH
function, that value is used instead of theDISTANCE_TYPE
value.NUM_LISTS
: anINT64
value less than or equal to 5,000 that determines how many lists the IVF algorithm creates. The IVF algorithm divides the whole data space into a number of lists equal toNUM_LISTS
, with data points that are closer to each other being more likely to be put on the same list. IfNUM_LISTS
is small, you have fewer lists with more data points, while a larger value creates more lists with fewer data points.You can use
NUM_LISTS
in combination with thefraction_lists_to_search
argument in theVECTOR_SEARCH
function to create an efficient vector search. If you have data that is distributed in many small groups in the embedding space, then specify a highNUM_LISTS
to create an index with more lists and specify a lowerfraction_lists_to_search
value to scan fewer lists in vector search. Use a lowerNUM_LISTS
and a higherfraction_lists_to_search
value when your data is distributed in fewer, larger groups. Using a highnum_lists
value might make the vector index take longer to build.If you don't specify
NUM_LISTS
, BigQuery calculates an appropriate value.
The following example creates a vector index on the embedding
column
of my_table
:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>); CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) OPTIONS(index_type = 'IVF');
The following example creates a vector index on the embedding
column
of my_table
, and specifies the distance type to use and the IVF options:
CREATE TABLE my_dataset.my_table(embedding ARRAY<FLOAT64>); CREATE VECTOR INDEX my_index ON my_dataset.my_table(embedding) OPTIONS(index_type = 'IVF', distance_type = 'COSINE', ivf_options = '{"num_lists": 2500}')
Understand index refresh
Vector indexes are fully managed by BigQuery and automatically refreshed when the indexed table changes. If you delete the indexed column in a table or rename the table itself, the vector index is deleted automatically.
If you create a vector index on a table that is smaller than 10 MB, then the
vector index isn't populated. Similarly, if you delete data from an indexed
table and the table size falls below 10 MB, then the vector index is
temporarily disabled. In this case, vector search queries don't use the index
and the indexUnusedReasons
code in the
vectorSearchStatistics
section of the Job
resource is BASE_TABLE_TOO_SMALL
. Without the index,
VECTOR_SEARCH
automatically falls back to using brute force to find the
nearest neighbors of embeddings.
Queries that use the VECTOR_SEARCH
function
always return correct results, even if some portion of data is not yet indexed.
Get information about vector indexes
You can verify the existence and the readiness of a vector index by querying
INFORMATION_SCHEMA
. The following views contain metadata on vector indexes:
The
INFORMATION_SCHEMA.VECTOR_INDEXES
view has information about the vector indexes in a dataset.After the
CREATE VECTOR INDEX
statement completes, the index must still be populated before you can use it. You can use thelast_refresh_time
andcoverage_percentage
columns to verify the readiness of a vector index. If the vector index isn't ready, you can still use theVECTOR_SEARCH
function on a table, it just might run more slowly without the index.The
INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS
view has information about the vector-indexed columns for all tables in a dataset.The
INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS
view has information about the options used by the vector indexes in a dataset.
Vector index examples
The following example shows all active vector indexes on tables in the dataset
my_dataset
, located in the project my_project
. It includes their names, the
DDL statements used to create them, and their coverage percentage. If an
indexed base table is less than 10 MB, then its index is not populated, in
which case the coverage_percentage
value is 0.
SELECT table_name, index_name, ddl, coverage_percentage FROM my_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXES WHERE index_status = 'ACTIVE';
The result is similar to the following:
+-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+ | table_name | index_name | ddl | coverage_percentage | +-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+ | small_table | myindex1 | CREATE VECTOR INDEX `myindex1` ON `my_project.my_dataset.small_table`(embeddings) | 100 | | | | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"numLists": 3}') | | +-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+ | large_table | myindex2 | CREATE VECTOR INDEX `myindex2` ON `my_project.my_dataset.large_table`(vectors) | 42 | | | | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"numLists": 12}') | | +-------------+-------------+-----------------------------------------------------------------------------------------------+---------------------+
Vector index columns examples
The following query extracts information on columns that have vector indexes:
SELECT table_name, index_name, index_column_name, index_field_path FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;
The result is similar to the following:
+------------+------------+-------------------+------------------+ | table_name | index_name | index_column_name | index_field_path | +------------+------------+-------------------+------------------+ | table1 | indexa | a | a | | table2 | indexb | b | b | | table3 | indexc | c | c | +------------+------------+-------------------+------------------+
Vector index options examples
The following query extracts information on vector index options:
SELECT table_name, index_name, option_name, option_type, option_value FROM my_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;
The result is similar to the following:
+------------+------------+------------------+------------------+--------------------+ | table_name | index_name | option_name | option_type | option_value | +------------+------------+------------------+------------------+--------------------+ | table1 | indexa | distance_type | STRING | EUCLIDEAN | | table1 | indexa | index_type | STRING | IVF | | table2 | indexb | ivf_options | STRING | {"num_lists": 100} | | table2 | indexb | index_type | STRING | IVF | +------------+------------+------------------+------------------+--------------------+
Vector index usage
Information on vector index usage is available in the job metadata of the job that ran the vector search query. You can view job metadata by using the Google Cloud console, the bq command-line tool, the BigQuery API, or the client libraries.
When you use the Google Cloud console, you can find vector index usage information in the Vector Index Usage Mode and Vector Index Unused Reasons fields.
When you use the bq tool or the BigQuery API, you can
find vector index usage information in the
VectorSearchStatistics
section of the Job
resource.
The index usage mode indicates whether a vector index was used by providing one of the following values:
UNUSED
: No vector index was used.PARTIALLY_USED
: SomeVECTOR_SEARCH
functions in the query used vector indexes and some didn't.FULLY_USED
: EveryVECTOR_SEARCH
function in the query used a vector index.
When the index usage mode value is UNUSED
or PARTIALLY_USED
,
the index unused reasons indicate why vector indexes weren't used in the query.
For example, the following results returned by
bq show --format=prettyjson -j my_job_id
shows that the index was not used
because the use_brute_force
option was specified in the VECTOR_SEARCH
function:
"vectorSearchStatistics": { "indexUnusedReasons": [ { "baseTable": { "datasetId": "my_dataset", "projectId": "my_project", "tableId": "my_table" }, "code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION", "message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified." } ], "indexUsageMode": "UNUSED" }
Index management options
To create indexes and have BigQuery maintain them, you have two options:
- Use the default shared slot pool: When the data you plan to index is below your per-organization limit, you can use the free shared slot pool for index management.
- Use your own reservation: To achieve more predictable and consistent indexing progress on your larger production workloads, you can use your own reservations for index management.
Use shared slots
If you have not configured your project to use a dedicated reservation for indexing, index management is handled in the free, shared slot pool, subject to the following constraints.
If you add data to a table which causes the total size of indexed
tables to exceed your organization's limit,
BigQuery pauses index management
for all indexed tables. When this happens, the index_status
field in the
INFORMATION_SCHEMA.VECTOR_INDEXES
view
displays PENDING DISABLEMENT
and the index is queued for deletion. While
the index is pending disablement, it is
still used in queries and you are charged for the index storage.
After an index is deleted, the index_status
field shows
the index as TEMPORARILY DISABLED
. In this state, queries don't use the index,
and you are not charged for index storage. In this case, the
IndexUnusedReason
code
is BASE_TABLE_TOO_LARGE
.
If you delete data from the table and the total size of indexed tables
falls below the per-organization limit, then index management is resumed for
all indexed tables. The index_status
field in the
INFORMATION_SCHEMA.VECTOR_INDEXES
view is ACTIVE
, queries can use the index, and you are charged for the
index storage.
BigQuery does not make guarantees about the available capacity of the shared pool or the throughput of indexing you see. For production applications, you might want to use dedicated slots for your index processing.
Use your own reservation
Instead of using the default shared slot pool, you can optionally designate your own reservation to index your tables. Using your own reservation ensures predictable and consistent performance of index-management jobs, such as creation, refresh, and background optimizations.
- There are no table size limits when an indexing job runs in your reservation.
- Using your own reservation gives you flexibility in your index management. If you need to create a very large index or make a major update to an indexed table, you can temporarily add more slots to the assignment.
To index the tables in a project with a designated reservation,
create a reservation
in the region where your tables are located. Then, assign the project to the
reservation with the job_type
set to BACKGROUND
:
SQL
Use the
CREATE ASSIGNMENT
DDL statement.
In the Google Cloud console, go to the BigQuery page.
In the query editor, enter the following statement:
CREATE ASSIGNMENT `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID` OPTIONS ( assignee = 'projects/PROJECT_ID', job_type = 'BACKGROUND');
Replace the following:
ADMIN_PROJECT_ID
: the project ID of the administration project that owns the reservation resourceLOCATION
: the location of the reservationRESERVATION_NAME
: the name of the reservationASSIGNMENT_ID
: the ID of the assignmentThe ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.
PROJECT_ID
: the ID of the project containing the tables to index. This project is assigned to the reservation.
Click
Run.
For more information about how to run queries, see Run an interactive query.
bq
Use the bq mk
command:
bq mk \ --project_id=ADMIN_PROJECT_ID \ --location=LOCATION \ --reservation_assignment \ --reservation_id=RESERVATION_NAME \ --assignee_id=PROJECT_ID \ --job_type=BACKGROUND \ --assignee_type=PROJECT
Replace the following:
ADMIN_PROJECT_ID
: the project ID of the administration project that owns the reservation resourceLOCATION
: the location of the reservationRESERVATION_NAME
: the name of the reservationPROJECT_ID
: the ID of the project to assign to this reservation
View your indexing jobs
A new indexing job is created every time an index is created or updated on
a single table. To view information about the job, query the
INFORMATION_SCHEMA.JOBS*
views. You
can filter for indexing jobs by
setting job_type IS NULL AND SEARCH(job_id, '`search_index`')
in the WHERE
clause of your query. The following example lists the five most recent indexing
jobs in the project my_project
:
SELECT * FROM region-us.INFORMATION_SCHEMA.JOBS WHERE project_id = 'my_project' AND job_type IS NULL AND SEARCH(job_id, '`search_index`') ORDER BY creation_time DESC LIMIT 5;
Choose your reservation size
To choose the right number of slots for your reservation, you should consider when index-management jobs are run, how many slots they use, and what your usage looks like over time. BigQuery triggers an index-management job in the following situations:
- You create an index on a table.
- Data is modified in an indexed table.
- The schema of a table changes and this affects which columns are indexed.
- Index data and metadata are periodically optimized or updated.
The number of slots you need for an index-management job on a table depends on the following factors:
- The size of the table
- The rate of data ingestion to the table
- The rate of DML statements applied to the table
- The acceptable delay for building and maintaining the index
- The complexity of the index, typically determined by attributes of the data, such as the number of duplicate terms
Monitor Usage and Progress
The best way to assess the number of slots you need to efficiently run your
index-management jobs is to monitor your slot utilization and adjust the
reservation size accordingly. The following query produces the daily slot usage
for index-management jobs. Only the past 30 days are included in the
region us-west1
:
SELECT TIMESTAMP_TRUNC(job.creation_time, DAY) AS usage_date, -- Aggregate total_slots_ms used for index-management jobs in a day and divide -- by the number of milliseconds in a day. This value is most accurate for -- days with consistent slot usage. SAFE_DIVIDE(SUM(job.total_slot_ms), (1000 * 60 * 60 * 24)) AS average_daily_slot_usage FROM `region-us-west1`.INFORMATION_SCHEMA.JOBS job WHERE project_id = 'my_project' AND job_type IS NULL AND SEARCH(job_id, '`search_index`') GROUP BY usage_date ORDER BY usage_date DESC limit 30;
When there are insufficient slots to run index-management jobs, an index can become out of sync with its table and indexing jobs might fail. In this case, BigQuery rebuilds the index from scratch. To avoid having an out-of-sync index, ensure you have enough slots to support index updates from data ingestion and optimization. For more information on monitoring slot usage, see admin resource charts.
Delete a vector index
When you no longer need a vector index or want to change which column is
indexed on a table, you can delete the index on that table by using the
DROP VECTOR INDEX
DDL statement.
For example:
DROP VECTOR INDEX my_index ON my_dataset.indexed_table;
If an indexed table is deleted, its index is deleted automatically.
What's next
- For an overview of vector index use cases, pricing, and limitations, see the Introduction to vector search.
- Try the Search embeddings with vector search tutorial.