Search embeddings with vector search
This tutorial shows you how to search embeddings stored in
BigQuery tables by using the
VECTOR_SEARCH
function
and optionally a vector index.
Required permissions
To run this tutorial, you need the following Identity and Access Management (IAM) permissions:
- To create a dataset, you need the
bigquery.datasets.create
permission. To create a table, you need the following permissions:
bigquery.tables.create
bigquery.tables.updateData
bigquery.jobs.create
To create a vector index, you need the
bigquery.tables.createIndex
permission on the table where you're creating the index.To drop a vector index, you need the
bigquery.tables.deleteIndex
permission on the table where you're dropping the index.
Each of the following predefined IAM roles includes the permissions that you need to work with vector indexes:
- BigQuery Data Owner (
roles/bigquery.dataOwner
) - BigQuery Data Editor (
roles/bigquery.dataEditor
)
Costs
In this document, you use the following billable components of Google Cloud:
- BigQuery: You incur costs for index storage and data processing in BigQuery.
To generate a cost estimate based on your projected usage,
use the pricing calculator.
For more information, see BigQuery pricing.
Before you begin
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery API.
Create a dataset
Create a BigQuery dataset:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
vector_search
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Create test tables
Create the
patents
table that contains patents embeddings, based on a subset of the Google Patents public dataset:CREATE TABLE vector_search.patents AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE ARRAY_LENGTH(embedding_v1) > 0 AND publication_number NOT IN ('KR-20180122872-A') LIMIT 1000000;
Create the
patents2
table that contains a patent embedding to find nearest neighbors for:CREATE TABLE vector_search.patents2 AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE publication_number = 'KR-20180122872-A';
Create a vector index
Create the
my_index
vector index on theembeddings_v1
column of thepatents
table:CREATE VECTOR INDEX my_index ON vector_search.patents(embedding_v1) OPTIONS(distance_type='COSINE', index_type='IVF', ivf_options='{"num_lists": 1000}');
Wait several minutes for the vector index to be created, then run the following query and confirm that the
coverage_percentage
value is100
:SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;
Use the VECTOR_SEARCH
function with an index
After the vector index is created and populated, use the VECTOR_SEARCH
function to find the nearest neighbor for the embedding in the embedding_v1
column in the patents2
table. This query uses the vector index in the search,
so VECTOR_SEARCH
uses an
Approximate Nearest Neighbor
method to find the embedding's nearest neighbor:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}');
The results look similar to the following:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.14471956347590609 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.17472108931171348 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
Use the VECTOR_SEARCH
function with brute force
Use the VECTOR_SEARCH
function to find the nearest neighbor for the embedding in the embedding_v1
column in the patents2
table. This query doesn't use the vector index in the
search, so VECTOR_SEARCH
finds the embedding's exact nearest neighbor:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}');
The results look similar to the following:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.1447195634759062 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.1747210893117136 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
Evaluate recall
When you perform a vector search with an index, it returns approximate results,
with the trade-off of reducing
recall. You can compute recall by
comparing the results returned by vector search with an index and by vector
search with brute force. In this dataset, the publication_number
value
uniquely identifies a patent, so it is used for comparison.
WITH approx_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}') ), exact_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}') ) SELECT a.query_publication_number, SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall FROM exact_results e LEFT JOIN approx_results a ON e.query_publication_number = a.query_publication_number GROUP BY a.query_publication_number
If the recall is lower than you would like, you can increase the
fraction_lists_to_search
value, with the downside of potentially higher
latency and resource usage. To tune your vector search, you can try multiple
runs of VECTOR_SEARCH
with different argument values, save the results to
tables, and then compare the results.
Clean up
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.