Perform efficient top-k retrieval

Many applications query a database to populate a single page in their applications. In such applications, the application doesn't need all of the matches, but only the top-k matches based on index sort order. Search indexes can implement this type of search very efficiently. This page describes how to create and search an index that has top-k matching.

Create search indexes for top-k matches

To configure a search index for top-k matching, use ORDER BY to order the search index by a specific column. Queries need to have an ORDER BY clause that exactly matches the search index sort order (including ascending versus descending direction) and a LIMIT clause that requests the query to stop after finding k-matching rows.

You can also implement pagination using these clauses. For more information, see Paginate search queries.

For some use cases, it might make sense to maintain multiple search indexes sorted by different columns. Like partitioning, it's a trade-off between storage and write cost versus query latency.

For example, consider a table that uses the following schema:

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  RecordTimestamp INT64 NOT NULL,
  ReleaseTimestamp INT64 NOT NULL,
  ListenTimestamp INT64 NOT NULL,
  AlbumTitle STRING(MAX),
  AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
) PRIMARY KEY(AlbumId);

CREATE SEARCH INDEX AlbumsRecordTimestampIndex
ON Albums(AlbumTitle_Tokens, SingerId_Tokens)
ORDER BY RecordTimestamp DESC
STORING ListenTimestamp

CREATE SEARCH INDEX AlbumsReleaseTimestampIndex
ON Albums(AlbumTitle_Tokens)
ORDER BY ReleaseTimestamp DESC
STORING ListenTimestamp

Query search indexes for top-k matches

As stated previously, queries need to have an ORDER BY clause that exactly matches the search index sort order (including ascending versus descending direction) and a LIMIT clause that requests the query to stop after finding k-matching rows.

The following are some common queries:

  • The following query is very efficient. It selects the AlbumsRecordTimestampIndex index. Even if there are many albums with the word "happy", the query only scans a small number of rows:

    SELECT AlbumId
    FROM Albums
    WHERE SEARCH(AlbumTitle_Tokens, 'happy')
    ORDER BY RecordTimestamp DESC
    LIMIT 10
    
  • The same query, requesting sort order by ReleaseTimestamp in descending order, uses the AlbumsReleaseTimestampIndex index and is equally efficient:

    SELECT AlbumId
    FROM Albums
    WHERE SEARCH(AlbumTitle_Tokens, 'happy')
    ORDER BY ReleaseTimestamp DESC
    LIMIT 10
    
  • At the same time, a query requesting sort order by ListenTimestamp doesn't execute a top-k query efficiently. It has to fetch all matching albums, sort them by ListenTimestamp, and return the top 10. Such a query uses more resources if there's a large number of documents that contain the term "happy".

    SELECT AlbumId
    FROM Albums
    WHERE SEARCH(AlbumTitle_Tokens, 'happy')
    ORDER BY ListenTimestamp DESC
    LIMIT 10`
    
  • Similarly, a query doesn't run efficiently if it requests that results are ordered using the RecordTimestamp column in ascending order. It scans all rows with the word "happy", despite having a LIMIT.

    SELECT AlbumId
    FROM Albums
    WHERE SEARCH(AlbumTitle_Tokens, 'happy')
    ORDER BY RecordTimestamp ASC
    LIMIT 10
    

What's next