使用向量搜索来搜索嵌入
本教程介绍如何使用 VECTOR_SEARCH
函数以及可选的向量索引搜索存储在 BigQuery 表中的嵌入。
所需权限
如需运行本教程,您需要拥有以下 Identity and Access Management (IAM) 权限:
- 如需创建数据集,您需要拥有
bigquery.datasets.create
权限。 如需创建表,您需要拥有以下权限:
bigquery.tables.create
bigquery.tables.updateData
bigquery.jobs.create
如需创建向量索引,您需要拥有要创建索引的表的
bigquery.tables.createIndex
权限。如需删除向量索引,您需要拥有要删除索引的表的
bigquery.tables.deleteIndex
权限。
以下每个预定义的 IAM 角色都包含使用向量索引所需的权限:
- BigQuery Data Owner (
roles/bigquery.dataOwner
) - BigQuery Data Editor (
roles/bigquery.dataEditor
)
费用
在本文档中,您将使用 Google Cloud 的以下收费组件:
- BigQuery: You incur costs for index storage and data processing in BigQuery.
您可使用价格计算器根据您的预计使用情况来估算费用。
如需了解详情,请参阅 BigQuery 价格。
准备工作
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the BigQuery API.
创建数据集
创建 BigQuery 数据集:
在 Google Cloud 控制台中,转到 BigQuery 页面。
在探索器窗格中,点击您的项目名称。
点击
查看操作 > 创建数据集。在创建数据集页面上,执行以下操作:
在数据集 ID 部分,输入
vector_search
。在位置类型部分,选择多区域,然后选择 US (multiple regions in United States)(美国[美国的多个区域])。
公共数据集存储在
US
多区域中。为简单起见,请将数据集存储在同一位置。保持其余默认设置不变,然后点击创建数据集。
创建测试表
基于 Google 专利公共数据集的子集创建包含专利嵌入的
patents
表:CREATE TABLE vector_search.patents AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE ARRAY_LENGTH(embedding_v1) > 0 AND publication_number NOT IN ('KR-20180122872-A') LIMIT 1000000;
创建包含要查找最近邻的专利嵌入的
patents2
表:CREATE TABLE vector_search.patents2 AS SELECT * FROM `patents-public-data.google_patents_research.publications` WHERE publication_number = 'KR-20180122872-A';
创建向量索引
在
patents
表的embeddings_v1
列上创建my_index
向量索引:CREATE VECTOR INDEX my_index ON vector_search.patents(embedding_v1) OPTIONS(distance_type='COSINE', index_type='IVF', ivf_options='{"num_lists": 1000}');
等待几分钟,以便向量索引创建完成,然后运行以下查询并确认
coverage_percentage
值为100
:SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;
使用 VECTOR_SEARCH
函数和索引
创建并填充向量索引后,使用 VECTOR_SEARCH
函数在 patents2
表的 embedding_v1
列中查找嵌入的最近邻。此查询在搜索中使用向量索引,因此 VECTOR_SEARCH
使用近似最近邻方法查找嵌入的最近邻:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}');
结果类似于以下内容:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.14471956347590609 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.17472108931171348 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
使用 VECTOR_SEARCH
函数和暴力破解
使用 VECTOR_SEARCH
函数在 patents2
表的 embedding_v1
列中查找嵌入的最近邻。此查询在搜索中不使用向量索引,因此 VECTOR_SEARCH
会查找嵌入的精确最近邻:
SELECT query.publication_number AS query_publication_number, query.title AS query_title, base.publication_number AS base_publication_number, base.title AS base_title, distance FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}');
结果类似于以下内容:
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | query_publication_number | query_title | base_publication_number | base_title | distance | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+ | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-106599080-B | A kind of rapid generation for keeping away big vast transfer figure based on GIS | 0.1447195634759062 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-114118544-A | Urban waterlogging detection method and device | 0.1747210893117136 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-20200048143-A | Method and system for mornitoring dry stream using unmanned aerial vehicle | 0.17561990745619782 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | KR-101721695-B1 | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same | 0.17696129365559843 | | KR-20180122872-A | Rainwater management system based on rainwater keeping unit | CN-109000731-B | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 | +--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
评估召回率
使用索引执行向量搜索时,它会返回近似结果,代价是召回率降低。您可以通过将向量搜索和索引返回的结果与向量搜索和暴力破解返回的结果进行比较来计算召回率。在此数据集中,publication_number
值唯一标识专利,因此用于比较。
WITH approx_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"fraction_lists_to_search": 0.005}') ), exact_results AS ( SELECT query.publication_number AS query_publication_number, base.publication_number AS base_publication_number FROM VECTOR_SEARCH( TABLE vector_search.patents, 'embedding_v1', TABLE vector_search.patents2, top_k => 5, distance_type => 'COSINE', options => '{"use_brute_force":true}') ) SELECT a.query_publication_number, SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall FROM exact_results e LEFT JOIN approx_results a ON e.query_publication_number = a.query_publication_number GROUP BY a.query_publication_number
如果召回率低于您的预期,您可以提高 fraction_lists_to_search
值,但缺点是延迟时间可能增加且资源使用率可能变高。如需调整向量搜索,您可以尝试使用不同的参数值多次运行 VECTOR_SEARCH
,将结果保存到表中,然后比较结果。
清理
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.