Filter generic search for structured or unstructured data

If you have a search app that uses structured data, or unstructured data with metadata, you can use the metadata to filter your search queries. This page explains how to use metadata fields to restrict your search to a specific set of documents.

Before you begin

Make sure you have created an app and ingested structured data, or unstructured data with metadata. For more information, see Create a search app.

Metadata example

Review this example of metadata for four PDF files (document_1.pdf, document_2.pdf, document_3.pdf, and document_4.pdf). This metadata would be in a JSON file in a Cloud Storage bucket, along with the PDF files. You can refer back to this example as you read through this page.

{"id": "1", "structData": {"title": "Policy on accepting corrected claims", "category": ["persona_A"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_1.pdf"}}
{"id": "2", "structData": {"title": "Claims documentation and reporting guidelines for commercial members", "category": ["persona_A", "persona_B"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_2.pdf"}}
{"id": "3", "structData": {"title": "Claims guidelines for bundled services and supplies for commercial members", "category": ["persona_B", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_3.pdf"}}
{"id": "4", "structData": {"title": "Advantage claims submission guidelines", "category": ["persona_A", "persona_C"]}, "content": {"mimeType": "application/pdf", "uri": "gs://bucketname_87654321/data/document_4.pdf"}}

Filter expression syntax

Make sure you understand the syntax of the filter expression that you'll use to define your search filter. The filter expression syntax can be summarized by the following Extended Backus–Naur form:

  # A single expression or multiple expressions that are joined by "AND" or "OR".
  filter = expression, { " AND " | " OR ", expression };
  # Expressions can be prefixed with "-" or "NOT" to express a negation.
  expression = [ "-" | "NOT " ],
    # A parenthetical expression.
    | "(", expression, ")"
    # A simple expression applying to a text field.
    # Function "ANY" returns true if the field exactly matches any of the literals.
    ( text_field, ":", "ANY", "(", literal, { ",", literal }, ")"
    # A simple expression applying to a numerical field. Function "IN" returns true
    # if a field value is within the range. By default, lower_bound is inclusive and
    # upper_bound is exclusive.
    | numerical_field, ":", "IN", "(", lower_bound, ",", upper_bound, ")"
    # A simple expression that applies to a numerical field and compares with a double value.
    | numerical_field, comparison, double
    # An expression that applies to a geolocation field with text/street/postal address.
    | geolocation_field, ":", "GEO_DISTANCE(", literal, ",", distance_in_meters, ")"
    # An expression that applies to a geolocation field with latitude and longitude.
    | geolocation_field, ":", "GEO_DISTANCE(", latitude_double, ",", longitude_double, ",", distance_in_meters, ")"
    # A simple expression that applies to a datetime field and compares with a datetime value.
    | datetime_field, comparison, literal_iso_8601_datetime_format);
  # A lower_bound is either a double or "*", which represents negative infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  lower_bound = ( double, [ "e" | "i" ] ) | "*";
  # An upper_bound is either a double or "*", which represents infinity.
  # Explicitly specify inclusive bound with the character 'i' or exclusive bound
  # with the character 'e'.
  upper_bound = ( double, [ "e" | "i" ] ) | "*";
  # Supported comparison operators.
  comparison = "<=" | "<" | ">=" | ">" | "=";
  # A literal is any double quoted string. You must escape backslash (\) and
  # quote (") characters.
  literal = double quoted string;
  text_field = text field - for example, category;
  numerical_field = numerical field - for example, score;
  geolocation_field = field of geolocation data type - for example home_address, location;
  datetime_field = field of datetime data type - for example creation_date, expires_on;
  literal_iso_8601_datetime_format = either a double quoted string representing ISO 8601 datetime or a numerical field representing microseconds from unix epoch.
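
As an illustration of the grammar above, the following Python sketch assembles filter strings from their parts. All helper names (quote_literal, any_of, in_range, all_of, negate) are hypothetical and not part of any Google library; the sketch covers only the ANY, IN, AND, and NOT forms and applies the escaping rule for literals.

```python
def quote_literal(text):
    """Wrap text in double quotes, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def any_of(field, values):
    """Build a text-field expression, for example: category: ANY("a", "b")."""
    if not values:
        raise ValueError("ANY requires at least one literal")
    literals = ", ".join(quote_literal(value) for value in values)
    return f"{field}: ANY({literals})"


def in_range(field, lower="*", upper="*"):
    """Build a numerical range expression, for example: score: IN(*, 100.0e).

    Bounds are passed as strings such as "*", "10", "10i", or "10e".
    """
    return f"{field}: IN({lower}, {upper})"


def all_of(*expressions):
    """Join expressions with AND."""
    return " AND ".join(expressions)


def negate(expression):
    """Prefix an expression with NOT."""
    return f"NOT {expression}"


print(all_of(any_of("category", ["persona_A"]), in_range("score", upper="100.0e")))
# category: ANY("persona_A") AND score: IN(*, 100.0e)
```

The printed string is the unescaped filter. When you embed it in a JSON request body, the double quotes must additionally be escaped as \", which a JSON serializer does for you.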

Search using a metadata filter

To search using a metadata filter, follow these steps:

  1. Determine the metadata field to use for filtering your search queries. For example, for the metadata in Metadata example, you could use the category field as a search filter. Your users could then filter on persona_A, persona_B, or persona_C to restrict their search to the documents associated with the persona that they're interested in.

  2. Make the metadata field indexable:

    1. In the Google Cloud console, go to the Agentspace page and in the navigation menu, click Apps.

    2. Click your search app.

    3. In the navigation menu, click Data.

    4. Click the Schema tab. This tab shows current field settings.

    5. Click Edit.

    6. Select the Indexable checkbox for the field that you want to make indexable.

    7. Click Save. For more information, see Configure field settings.

  3. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the Agentspace page and in the navigation menu, click Data Stores.

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  4. Preview search results.

    curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/servingConfigs/default_search:search" \
    -d '{
    "query": "QUERY",
    "filter": "FILTER"
    }'
    
    • PROJECT_ID: The ID of your project.
    • DATA_STORE_ID: The ID of your data store.
    • QUERY: The query text to search.
    • FILTER: Optional. A text field that lets you filter on a specified set of fields, using filter expression syntax. The default value is an empty string, which means no filter is applied.

    For example, say that you imported the four PDF files with metadata from Metadata example. You want to search for documents that contain the word "claims", and only query documents with a category value of persona_A. To do that, include the following statements in your call:

    "query": "claims",
    "filter": "category: ANY(\"persona_A\")"
    

    For more information, see the REST tab at Preview search results for an app with structured or unstructured data.

    If you perform a search like the one in the preceding procedure, you can expect to get a response similar to the following. Notice that the response includes the three documents that have a category value of persona_A.

    {
    "results": [
    {
      "id": "2",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/2",
        "id": "2",
        "structData": {
          "title": "Claims documentation and reporting guidelines for commercial members",
          "category": [
            "persona_A",
            "persona_B"
          ]
        },
        "derivedStructData": {
          "link": "gs://bucketname_87654321/data/document_2.pdf",
          "extractive_answers": [
            {
              "pageNumber": "1",
              "content": "lorem ipsum"
            }
          ]
        }
      }
    },
    {
      "id": "1",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/1",
        "id": "1",
        "structData": {
          "title": "Policy on accepting corrected claims",
          "category": [
            "persona_A"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "2",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_1.pdf"
        }
      }
    },
    {
      "id": "4",
      "document": {
        "name": "projects/abcdefg/locations/global/collections/default_collection/dataStores/search_store_id/branches/0/documents/4",
        "id": "4",
        "structData": {
          "title": "Advantage claims submission guidelines",
          "category": [
            "persona_A",
            "persona_C"
          ]
        },
        "derivedStructData": {
          "extractive_answers": [
            {
              "pageNumber": "47",
              "content": "lorem ipsum"
            }
          ],
          "link": "gs://bucketname_87654321/data/document_4.pdf"
        }
      }
    }
    ],
    "totalSize": 330,
    "attributionToken": "UvBRCgsI26PxpQYQs7vQZRIkNjRiYWY1MTItMDAwMC0yZWIwLTg3MTAtMTQyMjNiYzYzMWEyIgdHRU5FUklDKhSOvp0VpovvF8XL8xfC8J4V1LKdFQ",
    "guidedSearchResult": {},
    "summary": {}
    }
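
The curl call in the preceding procedure can also be prepared from Python. The following is a minimal sketch using only the standard library; build_search_request and result_titles are hypothetical helper names, and PROJECT_ID, DATA_STORE_ID, and ACCESS_TOKEN are placeholders that you replace with your own values (for example, a token from gcloud auth print-access-token). The sketch builds the request without sending it, and it assumes a response shaped like the example above.

```python
import json
import urllib.request


def build_search_request(project_id, data_store_id, query, filter_expr, access_token):
    """Build (but do not send) a POST request for the search endpoint."""
    url = (
        "https://discoveryengine.googleapis.com/v1beta/"
        f"projects/{project_id}/locations/global/collections/default_collection/"
        f"dataStores/{data_store_id}/servingConfigs/default_search:search"
    )
    # json.dumps escapes the double quotes inside the filter expression.
    body = json.dumps({"query": query, "filter": filter_expr}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def result_titles(response):
    """Return (id, title) pairs from a parsed search response."""
    return [
        (result["id"], result["document"]["structData"]["title"])
        for result in response.get("results", [])
    ]


request = build_search_request(
    "PROJECT_ID", "DATA_STORE_ID", "claims", 'category: ANY("persona_A")', "ACCESS_TOKEN"
)
print(request.data.decode("utf-8"))
# {"query": "claims", "filter": "category: ANY(\"persona_A\")"}
```

To send the request, you could pass it to urllib.request.urlopen and parse the returned body with json.load before handing it to result_titles.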
    

Examples of filter expressions

The following table provides examples of filter expressions.

Filter | Only returns results for documents where:
category: ANY(\"persona_A\") | the text field category is persona_A
score: IN(*, 100.0e) | the numerical field score is greater than negative infinity and less than 100.0
non-smoking = \"true\" | the boolean non-smoking is true
pet-friendly = \"false\" | the boolean pet-friendly is false
manufactured_date = \"2023\" | the manufactured_date is any time in 2023
manufactured_date >= \"2024-04-16\" | the manufactured_date is on or after April 16, 2024
manufactured_date < \"2024-04-16T12:00:00-07:00\" | the manufactured_date is before noon Pacific Daylight Time on April 16, 2024
office.location:GEO_DISTANCE(\"1600 Amphitheater Pkwy, Mountain View, CA, 94043\", 500) | the geolocation field office.location is within a 500 m distance of 1600 Amphitheater Pkwy
NOT office.location:GEO_DISTANCE(\"Palo Alto, CA\", 1000) | the geolocation field office.location is not within a 1 km radius of Palo Alto, California
office.location:GEO_DISTANCE(34.1829, -121.293, 500) | the geolocation field office.location is within a 500 m radius of latitude 34.1829 and longitude -121.293
category: ANY(\"persona_A\") AND score: IN(*, 100.0e) | category is persona_A and score is less than 100
office.location:GEO_DISTANCE(\"Mountain View, CA\", 500) OR office.location:GEO_DISTANCE(\"Palo Alto, CA\", 500) | office.location is within a 500 m distance of either Mountain View or Palo Alto
(price<175 AND pet-friendly = \"true\") OR (price<125 AND pet-friendly = \"false\") | price is less than 175 and pets are allowed, or price is less than 125 and pets are not allowed
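
The bound rules for IN can be easy to misread, so here is a small Python sketch that reproduces them locally. It is illustrative only: parse_bound and matches_range are hypothetical names, and the code models the documented defaults (inclusive lower bound, exclusive upper bound, with "i" or "e" as explicit overrides and "*" as unbounded) rather than calling the service.

```python
def parse_bound(bound, default_inclusive):
    """Parse "*", "100.0", "100.0e", or "100.0i" into (value, inclusive).

    A value of None means the bound is unlimited ("*").
    """
    if bound == "*":
        return None, False
    if bound[-1] in ("e", "i"):
        # Explicit suffix: "i" is inclusive, "e" is exclusive.
        return float(bound[:-1]), bound[-1] == "i"
    return float(bound), default_inclusive


def matches_range(value, lower, upper):
    """Return True if value falls inside IN(lower, upper)."""
    low, low_inclusive = parse_bound(lower, default_inclusive=True)
    high, high_inclusive = parse_bound(upper, default_inclusive=False)
    if low is not None and (value < low or (value == low and not low_inclusive)):
        return False
    if high is not None and (value > high or (value == high and not high_inclusive)):
        return False
    return True


print(matches_range(99.9, "*", "100.0e"))   # True
print(matches_range(100.0, "*", "100.0e"))  # False
print(matches_range(100.0, "*", "100.0i"))  # True
```

With the filter score: IN(*, 100.0e) from the table, a score of exactly 100.0 is excluded; changing the upper bound to 100.0i would include it.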

What's next

  • To understand how filters affect search quality, evaluate the search quality. For more information, see Evaluate search quality.