Measure vector query recall


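This page describes how to measure vector query recall in AlloyDB Omni. In the context of vector search, recall is the percentage of the true nearest-neighbor vectors that the index returns. For example, if a query for the 20 nearest neighbors returns 19 of the ground truth nearest neighbors, then the recall is 19/20 x 100 = 95%.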

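Recall matters in vector queries because it measures the percentage of relevant results that a search retrieves. Recall helps you evaluate how accurate approximate nearest neighbor (ANN) search results are compared to k-nearest neighbor (KNN) search results.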

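ANN is an algorithm that finds data points similar to a given query point, and it gains speed by finding approximate neighbors instead of the actual neighbors. When you use ANN, you trade off speed against recall.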

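KNN is an algorithm that finds the k vectors in a dataset that are most similar to a given query vector, based on a similarity metric. k is the number of neighbors that you want the query to return.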
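The recall arithmetic can be sketched in plain PostgreSQL. The two id lists below are made up for illustration: `ground_truth` stands in for an exact (KNN) top-20 result and `ann_result` for an ANN result that found 19 of those ids plus one id that is not a true neighbor. This only shows the definition at work; it is not a description of how AlloyDB Omni computes recall internally.

```sql
-- Hypothetical id lists: the exact (KNN) top 20, and an ANN result that
-- contains 19 of them plus one id (42) that is not a true neighbor.
WITH ground_truth(id) AS (
  SELECT generate_series(1, 20)
),
ann_result(id) AS (
  SELECT generate_series(1, 19)
  UNION ALL
  SELECT 42
)
-- count(a.id) counts only the ground truth ids that the ANN result also returned.
SELECT round(100.0 * count(a.id) / count(*), 1) AS recall_pct
FROM ground_truth AS g
LEFT JOIN ann_result AS a ON a.id = g.id;
```

With these lists, 19 of the 20 ground truth ids are matched, so `recall_pct` is 95.0, which matches the 19/20 example.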

You can measure the recall of a vector search query on different vector indexes, including the following:

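  • Scalable Nearest Neighbors (ScaNN): an algorithm for efficient vector similarity search.
  • Hierarchical Navigable Small World (HNSW): a graph-based algorithm for efficient approximate nearest neighbor search in vector databases.
  • Inverted File with Flat Compression (IVFFLAT) and Inverted File Flat (IVF): vector index types used for ANN search, particularly in databases that use extensions such as the PostgreSQL pgvector extension.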

This page assumes that you're familiar with PostgreSQL, AlloyDB Omni, and vector search.

Before you begin

  1. Install or update the pgvector extension.

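    1. If the pgvector extension isn't installed, install the vector extension version 0.8.0.google-3 or later to store generated embeddings as vector values. The vector extension includes the pgvector functions and operators. Google extends this version of pgvector with optimizations for AlloyDB Omni.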

      CREATE EXTENSION IF NOT EXISTS vector WITH VERSION '0.8.0.google-3';
      

      For more information, see Store, index, and query vectors.

    2. If the pgvector extension is already installed, upgrade the vector extension to version 0.8.0.google-3 or later to get the recall evaluator capabilities.

      ALTER EXTENSION vector UPDATE TO '0.8.0.google-3';
      
  2. To create ScaNN indexes, install the alloydb_scann extension.

    CREATE EXTENSION IF NOT EXISTS alloydb_scann;
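
To confirm that the steps above took effect, one option is to read the installed versions from the standard PostgreSQL `pg_extension` catalog. This is a small sketch; the exact version strings that you see depend on your AlloyDB Omni release.

```sql
-- List the installed versions of the two extensions used on this page.
-- A missing row means that the extension isn't installed in this database.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('vector', 'alloydb_scann');
```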
    

Evaluate recall for vector queries on a vector index

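You can use the evaluate_query_recall function to find the recall of a vector query on a vector index for a given configuration. This function lets you tune your parameters to reach the vector query recall that you want. Recall is a search quality metric, defined as the percentage of the returned results that are objectively closest to the query vector. The evaluate_query_recall function is turned on by default.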
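
As a minimal sketch, the simplest call passes only the query text. The table `t1` and vector column `val` are assumed names borrowed from the example output on this page, and the call relies on the two optional arguments falling back to their documented defaults (a NULL configuration and the ScaNN index method).

```sql
-- Assumes a table t1 with a vector column val and a ScaNN index on val.
-- Only the query text is passed; the session's current settings apply.
SELECT *
FROM evaluate_query_recall(
  $$ SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10 $$
);
```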

Find the recall of a vector query

  1. Open a SQL editor in AlloyDB Studio, or open a psql client.
  2. Create a ScaNN, HNSW, or IVFFLAT vector index.

  3. Make sure that the enable_indexscan flag is on. If the flag is off, no index scan is chosen, and the recall for every index is 1.

  4. Run the evaluate_query_recall function, which takes the query as a parameter and returns its recall:

    SELECT * FROM evaluate_query_recall( QUERY_TEXT, QUERY_TIME_CONFIGURATIONS, INDEX_METHODS )
    

    Before you run the command, make the following replacements:

    • QUERY_TEXT: the SQL query, enclosed in $$.
    • QUERY_TIME_CONFIGURATIONS: Optional: the configuration that you can set for the ANN query. This must be in JSON format. The default value is NULL.
    • INDEX_METHODS: Optional: a text array that contains different vector index methods for which you want to calculate the recall. If you set an index method for which a corresponding index doesn't exist, the recall is 1. The input must be a subset of {scann, hnsw, ivf, ivfflat}. If no value is provided, the ScaNN method is used.

      To view differences between query recall and execution time, change the query time parameters for your index.

      The following table lists query time parameters for ScaNN, HNSW, and IVF/IVFFLAT index methods. The parameters are formatted as {"scann.num_leaves_to_search":1, "scann.pre_reordering_num_neighbors":10, "hnsw.ef_search": 1}.

      Index type   Parameters
      ScaNN        scann.num_leaves_to_search
                   scann.pre_reordering_num_neighbors
                   scann.pct_leaves_to_search
                   scann.num_search_threads
      HNSW         hnsw.ef_search
                   hnsw.iterative_scan
                   hnsw.max_scan_tuples
                   hnsw.scan_mem_multiplier
      IVF          ivf.probes
      IVFFLAT      ivfflat.probes
                   ivfflat.iterative_scan
                   ivfflat.max_probes

      For more information about ScaNN index methods, see AlloyDB Omni ScaNN Index reference. For more information about HNSW and IVF/IVFFLAT index methods, see pgvector.

  5. Optional: You can also add configurations from pg_settings to the QUERY_TIME_CONFIGURATIONS. For example, to run a query with columnar engine scan enabled, add the following config from pg_settings as {"google_columnar_engine.enable_columnar_scan" : on}.

    The configurations are set locally in the function. Adding these configurations doesn't impact the configurations that you set in your session. If you don't set any configurations, AlloyDB uses all of the configurations that you set in the session. You can also set only those configurations that are best suited for your use case.

  6. Optional: To view the default configuration settings, run the SHOW command or query the pg_settings view.

  7. Optional: If you have a ScaNN index for which you want to tune the recall, see the tuning parameters in ScaNN index reference.

    The following is an example output, where ann_execution_time is the time that it takes a vector query to execute using index scans. ground_truth_execution_time is the time that it takes the query to run using a sequential scan.

    ann_execution_time and ground_truth_execution_time are different from but directly dependent on Execution time in the query plan. Execution time is the total time to execute the query from the client.

    t=# SELECT * FROM evaluate_query_recall( $$ SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10 $$, '{"scann.num_leaves_to_search":1, "scann.pre_reordering_num_neighbors":10, "hnsw.ef_search": 1}', ARRAY['scann', 'hnsw']);
    NOTICE:  Recall is 1. This might mean that the vector index is not present on the table or index scan not chosen during query execution.
    id|               query                                               |                                         configurations                                         |  recall |ann_execution_time | ground_truth_execution_time |  index_type
    ----+-------------------------------------------------------------------+------------------------------------------------------------------------------------------------+--------+--------------------+-----------------------------+------------
    1 |  SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10  | {"scann.num_leaves_to_search":1, "scann.pre_reordering_num_neighbors":10, "hnsw.ef_search": 1} |    0.5 |               4.23 |                     118.211 | scann
    2 |  SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10  | {"scann.num_leaves_to_search":1, "scann.pre_reordering_num_neighbors":10, "hnsw.ef_search": 1} |      1 |            107.198 |                     118.211 | hnsw
    (2 rows)
    

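    If the result includes the notice Recall is 1, the recall of the query is 1. This can happen when the table has no vector index, or when the planner doesn't choose a vector index scan during query execution.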

    If your query is select id, name from table order by embedding <-> '[1,2,3]' LIMIT 10; and the expected value of the name column is NULL, change the query to one of the following queries:

    select id, COALESCE(name, 'NULL') as name from table order by embedding <-> '[1,2,3]' LIMIT 10;
    

    select id from table order by embedding <-> '[1,2,3]' LIMIT 10;
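
The procedure above can be sketched end to end as follows. Treat it as an illustration under assumptions rather than a reference: `t1` and `val` come from the example output, the index names and the `num_leaves` value are hypothetical, and the ScaNN `CREATE INDEX` form reflects my understanding of the AlloyDB syntax, so check it against the ScaNN index reference. The HNSW statement uses the standard pgvector form.

```sql
-- Step 2: create the indexes to compare (hypothetical names and options).
CREATE INDEX t1_val_scann_idx ON t1 USING scann (val cosine) WITH (num_leaves = 100);
CREATE INDEX t1_val_hnsw_idx ON t1 USING hnsw (val vector_cosine_ops);

-- Step 3: confirm that index scans are allowed in this session.
SHOW enable_indexscan;
SET enable_indexscan = on;

-- Step 6: inspect the current values of the query-time parameters.
-- Extension parameters might appear only after the extension is loaded.
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'scann.%' OR name LIKE 'hnsw.%';

-- Step 4: measure recall with a narrow search (1 leaf, ef_search of 1)...
SELECT index_type, recall, ann_execution_time, ground_truth_execution_time
FROM evaluate_query_recall(
  $$ SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10 $$,
  '{"scann.num_leaves_to_search": 1, "hnsw.ef_search": 1}',
  ARRAY['scann', 'hnsw']
);

-- ...and again with a wider search, to compare recall against execution time.
SELECT index_type, recall, ann_execution_time, ground_truth_execution_time
FROM evaluate_query_recall(
  $$ SELECT id FROM t1 ORDER BY val <=> '[1000,1000,49000]' LIMIT 10 $$,
  '{"scann.num_leaves_to_search": 10, "hnsw.ef_search": 40}',
  ARRAY['scann', 'hnsw']
);
```

The specific values (1 versus 10 leaves, 1 versus 40 for ef_search) are arbitrary starting points. Because the configurations are applied locally inside the function, running both calls in one session doesn't change the session's own settings.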
    

What's next