使用向量搜索来搜索嵌入

本教程介绍如何使用 VECTOR_SEARCH 函数以及可选的向量索引搜索存储在 BigQuery 表中的嵌入。

所需权限

如需运行本教程,您需要拥有以下 Identity and Access Management (IAM) 权限:

  • 如需创建数据集,您需要拥有 bigquery.datasets.create 权限。
  • 如需创建表,您需要拥有以下权限:

    • bigquery.tables.create
    • bigquery.tables.updateData
    • bigquery.jobs.create
  • 如需创建向量索引,您需要拥有要创建索引的表的 bigquery.tables.createIndex 权限。

  • 如需删除向量索引,您需要拥有要删除索引的表的 bigquery.tables.deleteIndex 权限。

以下每个预定义的 IAM 角色都包含使用向量索引所需的权限:

  • BigQuery Data Owner (roles/bigquery.dataOwner)
  • BigQuery Data Editor (roles/bigquery.dataEditor)

费用

在本文档中,您将使用 Google Cloud 的以下收费组件:

  • BigQuery: You incur costs for index storage and data processing in BigQuery.

您可使用价格计算器根据您的预计使用情况来估算费用。 Google Cloud 新用户可能有资格申请免费试用

如需了解详情,请参阅 BigQuery 价格

准备工作

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery API.

    Enable the API

创建数据集

创建 BigQuery 数据集:

  1. 在 Google Cloud 控制台中,转到 BigQuery 页面。

    转到 BigQuery 页面

  2. 探索器窗格中,点击您的项目名称。

  3. 点击 查看操作 > 创建数据集

    创建数据集。

  4. 创建数据集页面上,执行以下操作:

    • 数据集 ID 部分,输入 vector_search

    • 位置类型部分,选择多区域,然后选择 US (multiple regions in United States)(美国[美国的多个区域])。

      公共数据集存储在 US 多区域中。为简单起见,请将数据集存储在同一位置。

    • 保持其余默认设置不变,然后点击创建数据集

创建测试表

  1. 基于 Google 专利公共数据集的子集创建包含专利嵌入的 patents 表:

    CREATE TABLE vector_search.patents AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE ARRAY_LENGTH(embedding_v1) > 0
     AND publication_number NOT IN ('KR-20180122872-A')
    LIMIT 1000000;
  2. 创建包含要查找最近邻的专利嵌入的 patents2 表:

    CREATE TABLE vector_search.patents2 AS
    SELECT * FROM `patents-public-data.google_patents_research.publications`
    WHERE publication_number = 'KR-20180122872-A';

创建向量索引

  1. patents 表的 embeddings_v1 列上创建 my_index 向量索引:

    CREATE VECTOR INDEX my_index ON vector_search.patents(embedding_v1)
    OPTIONS(distance_type='COSINE', index_type='IVF', ivf_options='{"num_lists": 1000}');
  2. 等待几分钟,以便向量索引创建完成,然后运行以下查询并确认 coverage_percentage 值为 100

    SELECT * FROM vector_search.INFORMATION_SCHEMA.VECTOR_INDEXES;

使用 VECTOR_SEARCH 函数和索引

创建并填充向量索引后,使用 VECTOR_SEARCH 函数在 patents2 表的 embedding_v1 列中查找嵌入的最近邻。此查询在搜索中使用向量索引,因此 VECTOR_SEARCH 使用近似最近邻方法查找嵌入的最近邻:

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"fraction_lists_to_search": 0.005}');

结果类似于以下内容:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        | 0.14471956347590609 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           | 0.17472108931171348 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642917 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

使用 VECTOR_SEARCH 函数和暴力破解

使用 VECTOR_SEARCH 函数在 patents2 表的 embedding_v1 列中查找嵌入的最近邻。此查询在搜索中不使用向量索引,因此 VECTOR_SEARCH 会查找嵌入的精确最近邻:

SELECT query.publication_number AS query_publication_number,
  query.title AS query_title,
  base.publication_number AS base_publication_number,
  base.title AS base_title,
  distance
FROM
  VECTOR_SEARCH(
    TABLE vector_search.patents,
    'embedding_v1',
    TABLE vector_search.patents2,
    top_k => 5,
    distance_type => 'COSINE',
    options => '{"use_brute_force":true}');

结果类似于以下内容:

+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| query_publication_number |                         query_title                         | base_publication_number |                                                        base_title                                                        |      distance       |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-106599080-B          | A kind of rapid generation for keeping away big vast transfer figure based on GIS                                        |  0.1447195634759062 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-114118544-A          | Urban waterlogging detection method and device                                                                           |  0.1747210893117136 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-20200048143-A        | Method and system for mornitoring dry stream using unmanned aerial vehicle                                               | 0.17561990745619782 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | KR-101721695-B1         | Urban Climate Impact Assessment method of Reflecting Urban Planning Scenarios and Analysis System using the same         | 0.17696129365559843 |
| KR-20180122872-A         | Rainwater management system based on rainwater keeping unit | CN-109000731-B          | The experimental rig and method that research inlet for stom water chocking-up degree influences water discharged amount | 0.17902723269642928 |
+--------------------------+-------------------------------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------+---------------------+

评估召回率

使用索引执行向量搜索时,它会返回近似结果,代价是召回率降低。您可以通过将向量搜索和索引返回的结果与向量搜索和暴力破解返回的结果进行比较来计算召回率。在此数据集中,publication_number 值唯一标识专利,因此用于比较。

WITH approx_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"fraction_lists_to_search": 0.005}')
),
  exact_results AS (
  SELECT query.publication_number AS query_publication_number,
    base.publication_number AS base_publication_number
  FROM
    VECTOR_SEARCH(
      TABLE vector_search.patents,
      'embedding_v1',
      TABLE vector_search.patents2,
      top_k => 5,
      distance_type => 'COSINE',
      options => '{"use_brute_force":true}')
)

SELECT
  a.query_publication_number,
  SUM(CASE WHEN a.base_publication_number = e.base_publication_number THEN 1 ELSE 0 END) / 5 AS recall
FROM exact_results e LEFT JOIN approx_results a
  ON e.query_publication_number = a.query_publication_number
GROUP BY a.query_publication_number

如果召回率低于您的预期,您可以提高 fraction_lists_to_search 值,但缺点是延迟时间可能增加且资源使用率可能变高。如需调整向量搜索,您可以尝试使用不同的参数值多次运行 VECTOR_SEARCH,将结果保存到表中,然后比较结果。

清理

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.