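Filter generic search for structured or unstructured data

If your search app uses structured data, or unstructured data with metadata, you can use that metadata to filter your search queries. This page explains how to use metadata fields to restrict a search to a specific set of documents.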

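Before you begin

Make sure that you have created an app and ingested structured data or unstructured data with metadata. For more information, see Create a search app.

Example metadata

Review the following example metadata for four PDF files (document_1.pdf, document_2.pdf, document_3.pdf, and document_4.pdf). This metadata is stored in a JSON file in a Cloud Storage bucket, alongside the PDF files. You can refer back to this example as you read this page.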

{"id": "1", "structData": {"title": "Policy on accepting corrected claims", "category": ["persona_A"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_1.pdf"}}
{"id": "2", "structData": {"title": "Claims documentation and reporting guidelines for commercial members", "category": ["persona_A", "persona_B"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_2.pdf"}}
{"id": "3", "structData": {"title": "Claims guidelines for bundled services and supplies for commercial members", "category": ["persona_B", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_3.pdf"}}
{"id": "4", "structData": {"title": "Advantage claims submission guidelines", "category": ["persona_A", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_4.pdf"}}
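The records above follow one shape per line: an id, a structData object that holds the filterable fields, and a content object that points at the PDF. As a minimal sketch, a metadata line of this shape could be produced as follows. The helper name metadata_line and its parameters are hypothetical and are not part of any Google library; the bucket name is the placeholder from the example.

```python
import json


def metadata_line(doc_id, title, categories, bucket="bucketname_87654321"):
    """Return one JSON record shaped like the example metadata (hypothetical helper)."""
    record = {
        "id": str(doc_id),
        # structData carries the fields that can later be used in a filter.
        "structData": {"title": title, "category": list(categories)},
        "content": {
            "mimeType": "application/pdf",
            "uri": f"gs://{bucket}/data/document_{doc_id}.pdf",
        },
    }
    return json.dumps(record)


# One line per document; write the lines to a file and upload it with the PDFs.
print(metadata_line(1, "Policy on accepting corrected claims", ["persona_A"]))
```

Keeping category as a list, even for a single value, matches the example records, where a document can belong to several personas.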

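Filter expression syntax

Make sure that you understand the filter expression syntax that you will use to define your search filter. The syntax can be summarized in the following extended Backus-Naur form (EBNF):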

  # A single expression or multiple expressions that are joined by "AND" or "OR".
  filter = expression, { ( " AND " | " OR " ), expression };
  # Expressions can be prefixed with "-" or "NOT" to express a negation.
  expression = [ "-" | "NOT " ],
    # A parenthetical expression.
    ( "(", expression, ")"
    # A simple expression applying to a text field.
    # Function "ANY" returns true if the field exactly matches any of the literals.
    | text_field, ":", "ANY", "(", literal, { ",", literal }, ")"
    # A simple expression applying to a numerical field. Function "IN" returns true
    # if a field value is within the range. By default, lower_bound is inclusive and
    # upper_bound is exclusive.
    | numerical_field, ":", "IN", "(", lower_bound, ",", upper_bound, ")"
    # A simple expression that applies to a numerical field and compares with a double value.
    | numerical_field, comparison, double
    # An expression that applies to a geolocation field with text/street/postal address.
    | geolocation_field, ":", "GEO_DISTANCE(", literal, ",", distance_in_meters, ")"
    # An expression that applies to a geolocation field with latitude and longitude.
    | geolocation_field, ":", "GEO_DISTANCE(", latitude_double, ",", longitude_double, ",", distance_in_meters, ")"
    # Datetime field
    | datetime_field, comparison, literal_iso_8601_datetime_format);
  # A lower_bound is either a double or "*", which represents negative infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  lower_bound = ( double, [ "e" | "i" ] ) | "*";
  # An upper_bound is either a double or "*", which represents infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  upper_bound = ( double, [ "e" | "i" ] ) | "*";
  # Supported comparison operators.
  comparison = "<=" | "<" | ">=" | ">" | "=";
  # A literal is any double quoted string. You must escape backslash (\) and
  # quote (") characters.
  literal = double quoted string;
  text_field = text field - for example, category;
  numerical_field = numerical field - for example, score;
  geolocation_field = field of geolocation data type - for example home_address, location;
  datetime_field = field of datetime data type - for example creation_date, expires_on;
  literal_iso_8601_datetime_format = either a double quoted string representing ISO 8601 datetime or a numerical field representing microseconds from unix epoch.
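Because a literal must escape backslash and double-quote characters, it can be convenient to build filter strings programmatically rather than by hand. The following is a minimal sketch of the ANY and IN productions from the grammar above. The helper names (quote_literal, any_filter, range_filter, and_filters) are hypothetical, and the space after each comma is an assumption based on the examples on this page.

```python
def quote_literal(value):
    """Wrap a string in double quotes, escaping backslash and quote characters."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def any_filter(field, *values):
    """Build `field: ANY("a", "b")` for a text field."""
    if not values:
        raise ValueError("ANY needs at least one literal")
    literals = ", ".join(quote_literal(v) for v in values)
    return f"{field}: ANY({literals})"


def range_filter(field, lower=None, upper=None, lower_mode="", upper_mode=""):
    """Build `field: IN(lower, upper)`; None stands for the unbounded "*"."""

    def bound(value, mode):
        if value is None:
            return "*"
        if mode not in ("", "i", "e"):
            raise ValueError("mode must be '', 'i' (inclusive), or 'e' (exclusive)")
        return f"{float(value)}{mode}"

    return f"{field}: IN({bound(lower, lower_mode)}, {bound(upper, upper_mode)})"


def and_filters(*expressions):
    """Join expressions with " AND "."""
    return " AND ".join(expressions)


print(any_filter("category", "persona_A"))
print(range_filter("score", upper=100.0, upper_mode="e"))
```

When the resulting string is placed in a JSON request body, the JSON encoder adds its own escaping of the double quotes, which is why the filters on this page appear with \" inside JSON.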

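Search using a metadata filter

To search using a metadata filter, follow these steps:

  1. Determine which metadata field to use to filter your search queries. For example, for the metadata in the Example metadata section, you can use the category field as a search filter. Users can filter by persona_A, persona_B, or persona_C, so that their search results are restricted to documents associated with the persona they are interested in.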

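  2. Make the metadata field indexable:

    1. In the Google Cloud console, go to the Agent Builder page, and then click Apps in the navigation menu.

      Go to the Apps page

    2. Click your search app.

    3. In the navigation menu, click Data.

    4. Click the Schema tab. This tab shows the current field settings.

    5. Click Edit.

    6. Select the Indexable checkbox for the field that you want to make indexable.

    7. Click Save. For more information, see Configure field settings.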

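  3. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Agent Builder page, and then click Data Stores in the navigation menu.

      Go to the Data Stores page

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  4. Get search results.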

    curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/servingConfigs/default_search:search" \
    -d '{
    "query": "QUERY",
    "filter": "FILTER"
    }'
    
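    • PROJECT_ID: the ID of your project.
    • DATA_STORE_ID: the ID of your data store.
    • QUERY: the query text to search for.
    • FILTER: Optional. A text field that lets you filter on a specified set of fields, using the filter expression syntax. The default value is an empty string, which means that no filter is applied.

    For example, suppose that you have imported the four PDF files with the metadata shown in the Example metadata section. You want to search for documents that contain the word "claims", and you want to query only documents whose category value is persona_A. To do this, include the following statements in your call: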

    "query": "claims",
    "filter": "category: ANY(\"persona_A\")"
    

    For more information, see the REST tab in Get search results for an app with structured or unstructured data.

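    If you run a search like the one above, you should receive a response similar to the following. Note that the response contains the three documents whose category values include persona_A.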

    {
    "results": [
    {
      "id": "2",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/2",
        "id": "2",
        "structData": {
          "title": "Claims documentation and reporting guidelines for commercial members",
          "category": [
            "persona_A",
            "persona_B"
          ]
        },
        "derivedStructData": {
          "link": "gs://bucketname_87654321/data/document_2.pdf",
          "extractive_answers": [
            {
              "pageNumber": "1",
              "content": "lorem ipsum"
            }
          ]
        }
      }
    },
    {
      "id": "1",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/1",
        "id": "1",
        "structData": {
          "title": "Policy on accepting corrected claims",
          "category": [
            "persona_A"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "2",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_1.pdf"
        }
      }
    },
    {
      "id": "4",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/4",
        "id": "4",
        "structData": {
          "title": "Advantage claims submission guidelines",
          "category": [
            "persona_A",
            "persona_C"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "47",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_4.pdf"
        }
      }
    }
    ],
    "totalSize": 330,
    "attributionToken": "UvBRCgsI26PxpQYQs7vQZRIkNjRiYWY1MTItMDAwMC0yZWIwLTg3MTAtMTQyMjNiYzYzMWEyIgdHRU5FUklDKhSOvp0VpovvF8XL8xfC8J4V1LKdFQ",
    "guidedSearchResult": {},
    "summary": {}
    }
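The curl call in step 4 can also be assembled in a script. The sketch below builds the same POST request with Python's standard library and pulls the document IDs out of a response shaped like the example above. The function names build_search_request and result_ids are hypothetical, ACCESS_TOKEN is a placeholder for a token such as the one printed by gcloud auth print-access-token, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(project_id, data_store_id, query, filter_expr="",
                         access_token="ACCESS_TOKEN"):
    """Build (but do not send) the search request used in step 4."""
    url = (
        "https://discoveryengine.googleapis.com/v1beta/"
        f"projects/{project_id}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/servingConfigs/default_search:search"
    )
    body = {"query": query}
    if filter_expr:
        # json.dumps escapes the double quotes inside the filter expression.
        body["filter"] = filter_expr
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def result_ids(response):
    """Return the IDs of the documents in a parsed search response."""
    return [result["id"] for result in response.get("results", [])]


request = build_search_request(
    "PROJECT_ID", "DATA_STORE_ID", "claims", 'category: ANY("persona_A")'
)
print(request.full_url)
print(request.data.decode("utf-8"))

# Sending the request needs a real project, data store, and access token:
# with urllib.request.urlopen(request) as resp:
#     print(result_ids(json.load(resp)))
```

Letting the JSON encoder handle the quoting avoids hand-writing \" sequences, which is an easy place to introduce a malformed filter.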
    

Example filter expressions

The following table provides examples of filter expressions.

| Filter | Returns results only for documents where: |
|---|---|
| category: ANY(\"persona_A\") | The text field category is persona_A |
| score: IN(*, 100.0e) | The numerical field score is greater than negative infinity and less than 100.0 |
| non-smoking = \"true\" | The boolean non-smoking is true |
| pet-friendly = \"false\" | The boolean pet-friendly is false |
| manufactured_date = \"2023\" | manufactured_date is any time in 2023 |
| manufactured_date >= \"2024-04-16\" | manufactured_date is on or after April 16, 2024 |
| manufactured_date < \"2024-04-16T12:00:00-07:00\" | manufactured_date is before noon Pacific Daylight Time on April 16, 2024 |
| office.location:GEO_DISTANCE(\"1600 Amphitheater Pkwy, Mountain View, CA, 94043\", 500) | The geolocation field office.location is within 500 meters of 1600 Amphitheatre Pkwy |
| NOT office.location:GEO_DISTANCE(\"Palo Alto, CA\", 1000) | The geolocation field office.location is not within a 1 km radius of Palo Alto, CA |
| office.location:GEO_DISTANCE(34.1829, -121.293, 500) | The geolocation field office.location is within a 500 m radius of latitude 34.1829 and longitude -121.293 |
| category: ANY(\"persona_A\") AND score: IN(*, 100.0e) | category is persona_A and score is less than 100 |
| office.location:GEO_DISTANCE(\"Mountain View, CA\", 500) OR office.location:GEO_DISTANCE(\"Palo Alto, CA\", 500) | office.location is within 500 meters of either Mountain View or Palo Alto |
| (price<175 AND pet-friendly = \"true\") OR (price<125 AND pet-friendly = \"false\") | price is less than 175 and pets are allowed, or price is less than 125 and pets are not allowed |
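To make the ANY and IN rows concrete, the following sketch mimics their semantics locally against the category values from the example metadata. It only illustrates the rules stated in the grammar comments (exact match for ANY; inclusive lower bound and exclusive upper bound by default for IN). It is not how the service evaluates filters, and the helper names matches_any and value_in_range are hypothetical.

```python
# Category values copied from the example metadata, keyed by document ID.
CATEGORIES = {
    "1": ["persona_A"],
    "2": ["persona_A", "persona_B"],
    "3": ["persona_B", "persona_C"],
    "4": ["persona_A", "persona_C"],
}


def matches_any(field_values, *literals):
    """ANY: true if a field value exactly matches any of the literals."""
    return any(value in literals for value in field_values)


def value_in_range(value, lower=None, upper=None,
                   lower_inclusive=True, upper_inclusive=False):
    """IN: None stands for "*"; lower is inclusive and upper exclusive by default."""
    if lower is not None:
        if value < lower or (value == lower and not lower_inclusive):
            return False
    if upper is not None:
        if value > upper or (value == upper and not upper_inclusive):
            return False
    return True


# category: ANY("persona_A")
selected = [doc_id for doc_id, cats in CATEGORIES.items()
            if matches_any(cats, "persona_A")]
print(selected)

# NOT category: ANY("persona_A")
excluded = [doc_id for doc_id, cats in CATEGORIES.items()
            if not matches_any(cats, "persona_A")]
print(excluded)

# score: IN(*, 100.0e) for a hypothetical score of 99.5
print(value_in_range(99.5, upper=100.0))
```

The first list contains documents 1, 2, and 4, which are the same three documents returned in the example response.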

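What's next

  • To understand how filters affect search quality, evaluate the quality of your search. For more information, see Evaluate search quality.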