Search documents

Before you start

To ingest sample documents into the Document AI Warehouse, see the Quickstart Guide.

Define your data for search

When defining your document schemas and creating your documents, it's important to consider what properties you want to define and how they're going to be used with search, if at all.

Mark a property filterable if you want to use that property to include or exclude a portion of documents for a search. For example, you might make a property that represents a "Vendor" filterable because your users want to search for invoices from a specific vendor.

If you want to construct a histogram (see the example later in this topic) on a property, then the property needs to be filterable.

Mark a property searchable if it has data that your users will want to query during a keyword search.

Full text search

Full text search retrieves all documents whose searchable text matches the search keywords. The user provides a list of keywords (words separated by spaces), typically typed into a search field in the UI. Document AI Warehouse processes the keywords and converts them into a query. This processing strips stopwords (such as "the," "in," and "an") and stems the remaining words. Stemming reduces each word to a common root so that variations of the word match. For example, "work," "working," and "worked" all match one another.
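The keyword processing described above can be pictured with a toy sketch. This is an illustration only: the stopword list, the suffix rules, and the function names are assumptions made for the example, not the processing that Document AI Warehouse actually runs.

```python
# Toy illustration only: the stopword list and suffix rules below are
# assumptions for this example, not the service's real ones.
STOPWORDS = {"the", "in", "an", "a", "of"}
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_keywords(text: str) -> list:
    """Lowercase, split on whitespace, drop stopwords, stem the rest."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOPWORDS]


print(normalize_keywords("the engineer worked in an office"))
# All three variations reduce to the same stem, so they match each other.
print({naive_stem(w) for w in ("work", "working", "worked")})
```

A real stemmer handles many more cases (irregular forms, doubled consonants, and so on); the point here is only that word variations collapse to one form before matching.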

What data gets searched?

The document's plain_text.
If you are importing a Document AI object, use the embedded cloud_ai_document.text.
The document's display_name.
All searchable properties.

The query partially supports Google AIP style syntax. Specifically, the query supports literals, logical operators, negation operators, comparison operators, and functions.

Literals: A bare literal value (examples: "42", "Hugo") is a value to be matched against. It searches over the full text of the document and the searchable properties.
Logical operators: "AND", "and", "OR", and "or" are binary logical operators (example: "engineer OR developer").
Negation operators: "NOT" and "!" are negation operators (example: "NOT software").
Comparison operators: the binary comparison operators =, !=, <, >, <=, and >= are supported for string, numeric, enum, and boolean values. The like operator ~~ is also supported for strings. It provides semantic search functionality by parsing, stemming, and expanding synonyms in the input query.

To specify a property in the query, the left-hand side of the comparison must be the property ID including the parent, and the right-hand side must be a literal. For example, \"projects/123/locations/us\".property_a < 1 matches results whose property_a is less than 1 in project 123 and location us. Literals and comparison expressions can be combined in a single query (example: software engineer \"projects/123/locations/us\".salary > 100).
Functions: supported functions are LOWER([property_name]) to perform a case insensitive match and EMPTY([property_name]) to filter on the existence of a key.
Nested expressions: expressions can be nested using parentheses and logical operators. If there is no operator between expressions, the default logical operator is AND.
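To make the query syntax concrete, the following sketch assembles the salary example from this section as a string. The helper names property_ref and comparison are hypothetical, and the sketch assumes that the backslashes shown in the examples above are JSON string escaping, which json.dumps reproduces.

```python
import json


def property_ref(project: str, location: str, property_id: str) -> str:
    """Return the property ID prefixed with its quoted parent."""
    return f'"projects/{project}/locations/{location}".{property_id}'


def comparison(prop: str, operator: str, literal) -> str:
    """Build '<property> <operator> <literal>' for the listed operators only."""
    allowed = {"=", "!=", "<", ">", "<=", ">=", "~~"}
    if operator not in allowed:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{prop} {operator} {literal}"


salary = property_ref("123", "us", "salary")
# Bare literals and a comparison side by side; AND is implied between them.
query = f"software engineer {comparison(salary, '>', 100)}"
print(query)

# Serializing the query into a JSON body escapes the inner double quotes.
print(json.dumps({"query": query}))
```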

The query can be used with other filters, for example time_filters and folder_name_filter. The service combines them with the query using the AND operator.

Search queries can be filtered by additional parameters such as by property, time, schema, folder, and creator.

Call to a search request

To call the search service, you must use a search request, which is defined as follows:

{
  "requestMetadata": {
    object (RequestMetadata)
  },
  "documentQuery": {
    object (DocumentQuery)
  },
  "offset": integer,
  "pageSize": integer,
  "pageToken": string,
  "orderBy": string,
  "histogramQueries": [
    {
      object (HistogramQuery)
    }
  ],
  "requireTotalSize": boolean,
  "totalResultSize": enum (TotalResultSize),
  "qaSizeLimit": integer
}

The parent field must be filled in with the format:

projects/PROJECT_ID/locations/LOCATION
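As a rough illustration, the request body can be assembled as a plain dictionary before it is sent. This is a minimal sketch, not an official client: build_search_request is a hypothetical helper, the userInfo.id field is inferred from the Python sample later on this page, and all values are placeholders.

```python
import json


def build_search_request(user_id: str, query: str, page_size: int = 10) -> dict:
    """Assemble a search request body using the field names defined above."""
    return {
        # Assumption: RequestMetadata carries the caller as userInfo.id.
        "requestMetadata": {"userInfo": {"id": user_id}},
        "documentQuery": {"query": query},
        "pageSize": page_size,
        "orderBy": "relevance desc",
        "requireTotalSize": True,
    }


body = build_search_request("user:someone@example.com", "invoice", page_size=25)
print(json.dumps(body, indent=2))
```

Fields that are left out, such as histogramQueries or pageToken, are simply omitted from the dictionary.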

Response to a search request

The search response is defined as follows:

{
  "matchingDocuments": [
    {
      object (MatchingDocument)
    }
  ],
  "nextPageToken": string,
  "totalSize": integer,
  "metadata": {
    object (ResponseMetadata)
  },
  "histogramQueryResults": [
    {
      object (HistogramQueryResult)
    }
  ]
}

Document Query

The document_query field is defined as follows:

{
  "query": string,
  "isNlQuery": boolean,
  "customPropertyFilter": string,
  "timeFilters": [
    {
      object (TimeFilter)
    }
  ],
  "documentSchemaNames": [
    string
  ],
  "propertyFilter": [
    {
      object (PropertyFilter)
    }
  ],
  "fileTypeFilter": {
    object (FileTypeFilter)
  },
  "folderNameFilter": string,
  "queryContext": [
    string
  ],
  "documentCreatorFilter": [
    string
  ],
  "customWeightsMetadata": {
    object (CustomWeightsMetadata)
  }
}

The query field is for the requesting user's search query words. Typically, these come from the search field in the UI.

Filters

Document AI Warehouse offers a variety of filters.

Document time filter

The create and update time filter finds documents that match the keywords and that were created or updated within a specified time period.

A TimeFilter object is used to specify the time range and it is defined as follows:

{
  "timeRange": {
    object (Interval)
  },
  "timeField": enum (TimeField)
}

The time_field field specifies whether the range in time_range applies to the document's creation time or to its last update time.

The time_range field specifies the time range as an Interval. An Interval is defined as:

{
  "startTime": string,
  "endTime": string
}
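The following sketch builds a TimeFilter for a 30-day window. It assumes that startTime and endTime are RFC 3339 timestamp strings and that the TimeField enum uses the names CREATE_TIME and UPDATE_TIME; the helper names are hypothetical, so check the API reference before relying on them.

```python
from datetime import datetime, timedelta, timezone


def rfc3339(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def time_filter(start: datetime, end: datetime,
                time_field: str = "CREATE_TIME") -> dict:
    """Build a TimeFilter dictionary. Enum names here are assumptions."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "timeRange": {"startTime": rfc3339(start), "endTime": rfc3339(end)},
        "timeField": time_field,
    }


end = datetime(2024, 3, 1, tzinfo=timezone.utc)
start = end - timedelta(days=30)
# Documents last updated in the 30 days before 1 March 2024 (UTC).
updated_recently = time_filter(start, end, "UPDATE_TIME")
print(updated_recently)
```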

Creator filter

To search for documents that were created by a specific user or users, use the creator filter. For example:

  {
    "documentQuery": {
      "query": "videogames director",
      "documentCreatorFilter": [
        "diane@some_company.com",
        "frank@some_company.com"
      ]
    }
  }

Property filter

The property filter lets you specify filters on any of the properties that you have specified in a schema, as long as that property has been configured to be filterable.

For example, using property filters in the legal industry might filter on a property called COURT to search only documents from a particular court.

Property filters use a PropertyFilter object. You can have more than one property filter. When you use multiple property filters, they are combined using the OR operator. A property filter is defined as follows:

  {
    "documentSchemaName": string,
    "condition": string
  }

Properties are defined in schemas. Thus, the documentSchemaName field is where you specify the schema for the property that you use for filtering. In the condition field, you specify the desired logic. For examples of using the documentSchemaName and condition fields, see the preceding examples on this page.
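As a rough illustration of the two fields, the following sketch builds a property filter for the COURT example. Several details are assumptions: the helper name, the schema ID legal-schema, the court value, the documentSchemas resource name format, and the condition syntax (a property name, an operator, and a double-quoted string with backslashes and quotes escaped). Confirm the exact condition syntax in the API reference.

```python
def property_filter(schema_name: str, prop: str, value: str) -> dict:
    """Build a PropertyFilter that matches one string value exactly."""
    # Assumption: backslashes and double quotes inside the value are escaped.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return {
        "documentSchemaName": schema_name,
        "condition": f'{prop} = "{escaped}"',
    }


# Hypothetical schema resource name and property value.
schema = "projects/123/locations/us/documentSchemas/legal-schema"
court_filter = property_filter(schema, "COURT", "Ninth Circuit")
print(court_filter)

# Filters go into the propertyFilter list of the document query.
document_query = {"query": "patent dispute", "propertyFilter": [court_filter]}
```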

Matching document

A matching document contains a Document and a snippet (discussed later). The returned document in MatchingDocument is not a fully filled-in document. It contains minimal data for displaying a search results list to the requesting user. If the full document is desired (for example, if the user clicked on a search result), then the full document should be retrieved via the GetDocument API.

The following Document fields are filled in: Project number, Document id, Document schema id, Create time, Update time, Display name, Raw document file type, Reference id, and Filterable properties.

A matching document would look like this:

{
  "document": {
    object (Document)
  },
  "searchTextSnippet": string,
  "qaResult": {
    object (QAResult)
  }
}

Ranking/sort

The search request lets you specify how you want the results sorted. To sort, use the order_by field in the search request. The possible values for this field include:

relevance desc - relevance descending, that is, the best matches are on top.
upload_date desc - the date the document was created in descending order (newest on top).
upload_date - the date the document was created in ascending order (oldest on top).
update_date desc - the date the document was last updated in descending order (newest on top).
update_date - the date the document was last updated in ascending order (oldest on top).

If you don't specify a sort, but you supply search keywords, then the sort is by relevance descending (the best matches on top). If neither the sort nor keywords are provided, then the default sort is by update time descending (the latest documents on top).

Pagination

Pagination is useful for displaying a page worth of data to the end user. Here you can specify the size of the page and get a total count of the result size to display back to the user (for example, "Showing 50 documents of 300").

Set the page_size field to the desired number of results that you want to receive with the search request. This might correspond to the requirements of the UI search result display size.

There are two mechanisms: offset and page token.

An offset is the index into the list of returnable documents that you want returned. For example, an offset of 5 means you want the sixth document onward. To get the next page of results, increment the offset by the page size.

Alternatively, you can use a page token and not have to worry about calculating the next offset. After making your first search request, you get a search response that contains the next_page_token field. If this field is empty, then there are no more results. If the field is not empty, use this token in your next search request by setting the page_token field.

Some UIs display the count of documents found by the search. For example, you are viewing 10 documents of 120. To get a document count returned, set the request's require_total_size boolean field to True. Tip: require_total_size=True carries a performance penalty. Set this on the first page query, then set this to false on all subsequent requests, keeping the total count in a local variable.
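The page token flow and the total size tip can be sketched as a loop. The code below is self-contained for illustration: fake_search is a hypothetical stand-in for the real search call, and iterate_pages is a hypothetical helper, so only the field names come from the request and response definitions on this page.

```python
from typing import Callable, Iterator


def iterate_pages(search: Callable[[dict], dict], request: dict) -> Iterator[dict]:
    """Yield each response page, following nextPageToken until it is empty."""
    # Ask for the total count on the first page only.
    request = dict(request, requireTotalSize=True)
    while True:
        response = search(request)
        yield response
        token = response.get("nextPageToken", "")
        if not token:
            return
        request = dict(request, pageToken=token, requireTotalSize=False)


def fake_search(request: dict) -> dict:
    """Hypothetical stand-in that serves 5 documents, pageSize at a time."""
    docs = [f"doc-{i}" for i in range(5)]
    start = int(request.get("pageToken") or 0)
    end = start + request["pageSize"]
    response = {"matchingDocuments": docs[start:end]}
    if request.get("requireTotalSize"):
        response["totalSize"] = len(docs)
    if end < len(docs):
        response["nextPageToken"] = str(end)
    return response


pages = list(iterate_pages(fake_search, {"pageSize": 2}))
total = pages[0]["totalSize"]  # kept locally after the first page
shown = sum(len(page["matchingDocuments"]) for page in pages)
print(f"Showing {shown} documents of {total}")
```

With offsets instead of tokens, the same loop would set offset to the page index multiplied by pageSize on each request.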

Code Samples

Python

For more information, see the Document AI Warehouse Python API reference documentation.

To authenticate to Document AI Warehouse, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from google.cloud import contentwarehouse

# TODO(developer): Uncomment these variables before running the sample.
# project_number = 'YOUR_PROJECT_NUMBER'
# location = 'YOUR_PROJECT_LOCATION' # Format is 'us' or 'eu'
# document_query_text = 'YOUR_DOCUMENT_QUERY'
# user_id = 'user:YOUR_SERVICE_ACCOUNT_ID' # Format is "user:xxxx@example.com"


def search_documents_sample(
    project_number: str, location: str, document_query_text: str, user_id: str
) -> None:
    # Create a client
    client = contentwarehouse.DocumentServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = client.common_location_path(project=project_number, location=location)

    # File Type Filter
    # Options: DOCUMENT, FOLDER
    file_type_filter = contentwarehouse.FileTypeFilter(
        file_type=contentwarehouse.FileTypeFilter.FileType.DOCUMENT
    )

    # Document Text Query
    document_query = contentwarehouse.DocumentQuery(
        query=document_query_text,
        file_type_filter=file_type_filter,
    )

    # Histogram Query
    histogram_query = contentwarehouse.HistogramQuery(
        histogram_query='count("DocumentSchemaId")'
    )

    request_metadata = contentwarehouse.RequestMetadata(
        user_info=contentwarehouse.UserInfo(id=user_id)
    )

    # Define request
    request = contentwarehouse.SearchDocumentsRequest(
        parent=parent,
        request_metadata=request_metadata,
        document_query=document_query,
        histogram_queries=[histogram_query],
    )

    # Make the request
    response = client.search_documents(request=request)

    # Print search results
    for matching_document in response.matching_documents:
        document = matching_document.document
        # Display name - schema display name.
        # Name.
        # Create date.
        # Snippet - keywords are highlighted with <b> & </b>.
        print(
            f"{document.display_name} - {document.document_schema_name}\n"
            f"{document.name}\n"
            f"{document.create_time}\n"
            f"{matching_document.search_text_snippet}\n"
        )

    # Print histogram
    for histogram_query_result in response.histogram_query_results:
        print(
            f"Histogram Query: {histogram_query_result.histogram_query}\n"
            f"| {'Schema':<70} | {'Count':<15} |"
        )
        for key, value in histogram_query_result.histogram.items():
            print(f"| {key:<70} | {value:<15} |")

Java

For more information, see the Document AI Warehouse Java API reference documentation.