Matching Engine ANN service overview

Vertex AI Matching Engine's Approximate Nearest Neighbor (ANN) service finds, in a matter of milliseconds, the vectors most similar to a given query vector within a large corpus. The service can serve such similarity-matching queries at very high queries per second (QPS), requiring only a few milliseconds to search corpora of over a billion vectors.

Matching Engine's ANN service uses a new type of vector quantization developed by Google Research, described in the paper Accelerating Large-Scale Inference with Anisotropic Vector Quantization. For more information about how this technique works, see Announcing ScaNN: Efficient Vector Similarity Search.

Matching Engine can create and serve indices built from user-generated embedding vectors, including vectors with thousands of dimensions. Moreover, users can request hundreds of nearest neighbors in a single query.

Notable benefits of Google's ANN service

Improved scale and recall, at lower cost

Matching Engine delivers similarity search at scale, with high QPS, high recall, and cost efficiency:

  • Scales to billions of embedding vectors.
  • Serves results with 50th percentile latencies as low as 5ms, even when the QPS is in the hundreds of thousands, and hundreds of nearest neighbors are requested per query.
  • Delivers industry-leading recall. Recall measures the percentage of true nearest neighbors returned for each vector search call.
  • Uses less CPU and memory than most other known alternatives.
  • Supports embedding vectors with thousands of dimensions.

Valuable capabilities that simplify real-world architectures

  • Autoscaling.
  • Query-time Boolean predicates for filtering results.
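As a rough sketch of what query-time Boolean filtering does conceptually, the snippet below post-filters candidates by tag sets. This is illustrative only: the field names (`tags`, allow/deny lists) are assumptions for the example, and the real service applies such restricts inside the index rather than as a post-filter.

```python
# Conceptual sketch of query-time Boolean filtering ("restricts").
# Each candidate carries a set of tags; a query keeps only candidates
# whose tags satisfy an allow list and a deny list. Field names are illustrative.
candidates = [
    {"id": "a", "tags": {"color:red", "size:small"}},
    {"id": "b", "tags": {"color:blue", "size:small"}},
    {"id": "c", "tags": {"color:red", "size:large"}},
]

def passes(candidate, allow, deny):
    """True if the candidate matches at least one allow tag and no deny tag."""
    tags = candidate["tags"]
    return (not allow or bool(tags & allow)) and not (tags & deny)

allow = {"color:red"}   # at least one of these must match
deny = {"size:large"}   # none of these may match
print([c["id"] for c in candidates if passes(c, allow, deny)])  # ['a']
```

In the real service, filtering happens during index traversal, so the system does not need to over-fetch candidates and discard them afterward.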

Why does ANN perform approximate matches, instead of exact matches?

A brute-force search cannot scan a very large collection of embedding vectors at high queries per second while serving results with low latency. Searching at that scale requires sophisticated approximation algorithms that trade a small amount of accuracy for massive gains in speed and scale. The ScaNN paper referenced above describes the underlying techniques in more detail. Fundamentally, the system tries to find most of the nearest neighbors at very low latency. Recall is the metric used to measure the percentage of true nearest neighbors returned by the system. Based on empirical statistics from teams within Google, for many real-world applications Vertex AI Matching Engine can achieve recall of 95% to 98% while serving results with 90th percentile latency of less than 10 ms (Google Cloud internal research, May 2021).
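To see why exact search does not scale, here is a minimal brute-force nearest-neighbor sketch using NumPy. Every query must touch every corpus vector, so the cost grows linearly with corpus size; the corpus size and dimensionality below are arbitrary toy values.

```python
import numpy as np

# Exact (brute-force) nearest-neighbor search: compare the query against
# every vector in the corpus. Per-query cost is O(N * d), which is why
# billion-vector corpora need approximate methods instead.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 128)).astype(np.float32)  # N=10k, d=128
query = rng.standard_normal(128).astype(np.float32)

dists = np.linalg.norm(corpus - query, axis=1)  # one distance per corpus vector
k = 10
nearest = np.argsort(dists)[:k]                 # IDs of the k closest vectors
print(nearest.shape)  # (10,)
```

An ANN index avoids this full scan by pruning most of the corpus per query, returning most (but not necessarily all) of the true nearest neighbors.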

The ANN service also provides a brute-force, exact-match option. However, these exact-match indices are intended for testing and recall measurement; they are not recommended for low-latency production deployments.
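The workflow of using an exact-match index as ground truth for measuring recall can be sketched as follows. The "approximate" search here is a deliberately crude stand-in (it scans only a random half of the corpus); real ANN indexes use far smarter data structures, and all sizes below are toy values.

```python
import numpy as np

# Measuring ANN recall against brute-force ground truth.
rng = np.random.default_rng(1)
corpus = rng.standard_normal((5_000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
k = 20

dists = np.linalg.norm(corpus - query, axis=1)
truth = set(np.argsort(dists)[:k])          # exact (brute-force) top-k: ground truth

# Crude stand-in for an ANN index: search only a random half of the corpus.
subset = rng.choice(len(corpus), size=len(corpus) // 2, replace=False)
approx = set(subset[np.argsort(dists[subset])[:k]])  # top-k within the subset

recall = len(truth & approx) / k            # fraction of true neighbors recovered
print(f"recall@{k}: {recall:.0%}")
```

In practice you would run the same query set against both the exact-match index and the ANN index, then average recall across queries to tune the ANN index's accuracy/latency trade-off.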

Key user journeys

  • Create and deploy an index from a user-provided set of embedding vectors
  • Update a live index with a user-provided set of embedding vectors
  • Low latency online querying to get the nearest neighbors of a query embedding vector

Useful terminology

Index: A collection of vectors deployed together for similarity search. Vectors can be added to an index or removed from an index. Similarity search queries are issued to a specific index and will search over the vectors in that index.

Recall: The percentage of true nearest neighbors returned by the index. For example, if a nearest neighbor query for 20 nearest neighbors returned 19 of the "ground truth" nearest neighbors, the recall is 19/20 × 100 = 95%.

Restricts: Functionality to "restrict" searches to a subset of the index by using Boolean rules.
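The recall arithmetic in the 19-of-20 example above can be expressed as a small helper. The function name `recall_at_k` is illustrative only, not part of any Matching Engine API.

```python
def recall_at_k(approx_neighbors, true_neighbors):
    """Fraction of ground-truth neighbors that the approximate search returned."""
    return len(set(approx_neighbors) & set(true_neighbors)) / len(true_neighbors)

# The index returned 19 of the 20 ground-truth neighbor IDs.
approx = list(range(19)) + [99]    # IDs returned by the ANN index
truth = list(range(20))            # IDs from an exact (brute-force) search
print(recall_at_k(approx, truth))  # 0.95
```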

What's next