Choose a vector index

Select an index type for your vector search that balances query latency against the accuracy of your results. The index you choose directly affects query speed and the ability to retrieve truly relevant matches, which is measured by recall.

Key terms

Before you decide on an index strategy, make sure that you understand the following terms:

  • Recall: Measures how effectively a search retrieves all relevant items for a given query. For example, if you have 100 images, 10 of which are cats (the true matches), and your search for "cat" returns 6 of those cat images, your recall is 6 out of 10, or 60%. In approximate nearest neighbors (ANN) search for vector indexes, recall is measured by comparing the results obtained with the ANN index to the true matches obtained through a brute-force search.
  • Query latency: Measures how quickly search results are returned. In the previous example, latency is the time that elapses between submitting the query and receiving the 6 cat images.
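Both terms can be made concrete with a few lines of Python. The following is a minimal sketch, not part of any product API: the `recall` and `timed` helpers and the `fake_search` function are hypothetical names introduced only for illustration, and the IDs stand in for the cat images from the example.

```python
import time

def recall(returned_ids, true_ids):
    """Fraction of the true matches that appear in the returned results."""
    true_set = set(true_ids)
    if not true_set:
        raise ValueError("true_ids must not be empty")
    return len(true_set & set(returned_ids)) / len(true_set)

def timed(search_fn, *args):
    """Run a search function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = search_fn(*args)
    return result, time.perf_counter() - start

# Hypothetical data: images 0-9 are the 10 true cat images.
true_cats = list(range(10))

def fake_search(query):
    # Stand-in for a real search; it finds only 6 of the 10 cats.
    return [0, 1, 2, 3, 4, 5]

results, latency_s = timed(fake_search, "cat")
print(f"recall = {recall(results, true_cats):.0%}")  # recall = 60%
print(f"latency = {latency_s * 1000:.3f} ms")
```

When you evaluate an ANN index, the true matches come from a brute-force search over the same data, so `true_ids` would be the brute-force result and `returned_ids` the result from the index.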

Search strategy

Select one of the following search strategies. To understand how the two algorithms differ, consider an example table of customer data that includes details such as purchase history, preferences, and demographics:

  • K-Nearest Neighbors (KNN) is an algorithm that finds the k data points nearest to a given query data point. In the customer table example, to run a query that recommends products to a new customer based on the preferences of similar customers, KNN uses a distance metric to calculate the distance between the new customer's data point and the data point of every customer in the table. KNN then identifies the k closest customers and recommends products that those customers like to the new customer.

    A KNN query with a distance metric, such as Euclidean distance or cosine similarity, is recommended for smaller datasets. If your use case requires 100% accuracy and you can tolerate higher query latency, then we recommend using KNN. For more information about performing a KNN search, see Perform similarity search.
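The KNN approach can be sketched in plain Python as an exhaustive scan. This is an illustrative sketch only: the customer names, the three-dimensional preference vectors, and the `knn` helper are assumptions made up for the example, and Euclidean distance is used as the metric.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, rows, k):
    """Compare the query against every row and return the k nearest row IDs."""
    return sorted(rows, key=lambda row_id: euclidean(query, rows[row_id]))[:k]

# Hypothetical preference vectors for existing customers.
customers = {
    "alice": [0.90, 0.10, 0.30],
    "bob":   [0.20, 0.80, 0.50],
    "carol": [0.85, 0.15, 0.35],
    "dave":  [0.10, 0.90, 0.40],
}
new_customer = [0.88, 0.12, 0.30]

nearest = knn(new_customer, customers, k=2)
print(nearest)  # ['alice', 'carol']
```

Because every row is compared with the query, the result is exact, but the work grows linearly with the size of the table. That cost is why this strategy suits smaller datasets.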

  • Approximate Nearest Neighbors (ANN) algorithm-based indexes help queries run faster, but might compromise query accuracy, or recall. In the customer table example, to run a query that recommends products to a new customer based on the preferences of similar customers, ANN divides the existing customer data points into small groups based on similarity. For a new customer data point, ANN first identifies which of these groups the data point belongs to, and then calculates distances, using a distance metric, only within that group. This filtering reduces the number of calculations required to produce recommendations.

    Google recommends creating a vector index that implements the ANN index when you have a large dataset, because it finds similarities faster and is less resource-intensive, although it compromises accuracy. For more information about how the ANN index is used for similarity searches, see Create indexes and query vectors using ScaNN. To achieve KNN accuracy with a vector index, set the search parameter that controls how many partitions are searched equal to the number of partitions in the index. This strategy helps you achieve KNN accuracy while balancing query latency.
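The trade-off between the two strategies can be illustrated with a small partition-based sketch. This is a simplified, assumption-laden model and not how ScaNN is implemented: the centroids are fixed by hand instead of being learned, and `num_probes` is a hypothetical name for the setting that controls how many partitions are searched.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_partitions(rows, centroids):
    """Assign each row ID to the partition of its nearest centroid."""
    parts = {i: [] for i in range(len(centroids))}
    for row_id, vec in rows.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        parts[nearest].append(row_id)
    return parts

def ann_search(query, rows, centroids, parts, k, num_probes):
    """Search only the num_probes partitions whose centroids are closest."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: dist(query, centroids[i]))[:num_probes]
    candidates = [row_id for i in probe for row_id in parts[i]]
    return sorted(candidates, key=lambda row_id: dist(query, rows[row_id]))[:k]

def knn_search(query, rows, k):
    """Brute-force baseline that compares the query with every row."""
    return sorted(rows, key=lambda row_id: dist(query, rows[row_id]))[:k]

# Hypothetical 2-D data, chosen so that a true neighbor sits just across a
# partition boundary from the query.
rows = {"a": [1.0, 0.0], "b": [4.9, 0.0], "c": [5.2, 0.0], "d": [9.0, 0.0]}
centroids = [[0.0, 0.0], [10.0, 0.0]]
parts = build_partitions(rows, centroids)  # {0: ['a', 'b'], 1: ['c', 'd']}
query = [4.8, 0.0]

exact = knn_search(query, rows, k=2)
one_probe = ann_search(query, rows, centroids, parts, k=2, num_probes=1)
all_probes = ann_search(query, rows, centroids, parts, k=2,
                        num_probes=len(centroids))

print(exact)       # ['b', 'c']
print(one_probe)   # ['b', 'a']  (misses 'c', so recall is 50%)
print(all_probes)  # ['b', 'c']  (matches the brute-force result)
```

In this sketch, searching a single partition does less work but misses one of the two true neighbors, while searching every partition reproduces the brute-force result and gives up the reduction in distance calculations.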

Use the following flowchart as guidance to decide which index type is best suited to your dataset and requirements.

[Flowchart: Choose an index strategy]

What's next