Matching Engine ANN service overview

In a matter of milliseconds, Vertex AI Matching Engine's Approximate Nearest Neighbor (ANN) service helps you find the most similar vectors from over a billion vectors. The service can serve similarity-matching queries at high queries per second (QPS).

Matching Engine's ANN service uses a new type of vector quantization developed by Google Research and described in the paper Accelerating Large-Scale Inference with Anisotropic Vector Quantization. For more information about how this works, see the blog post Announcing ScaNN: Efficient Vector Similarity Search.

Matching Engine can create and serve indices from user-generated embedding vectors, including vectors with thousands of dimensions. Users can also request hundreds of nearest neighbors in a single query.

Improved scale and recall, at lower cost

Matching Engine delivers similarity search at scale, with high QPS, high recall, and cost efficiency. Matching Engine:

  • Scales to billions of embedding vectors.
  • Serves results with 50th percentile latencies as low as 5 ms, even when the QPS is in the hundreds of thousands and hundreds of nearest neighbors are requested per query.
  • Delivers industry-leading recall. Recall measures the percentage of actual neighbors returned for each vector search call.
  • Uses less CPU and memory than most other known alternatives.
  • Supports embedding vectors with thousands of dimensions.

Valuable capabilities that simplify real-world architectures

  • Autoscaling
  • Query-time Boolean predicates for filtering results

Why does ANN perform approximate matches, instead of exact matches?

It is impractical for a brute-force search to cover a very large collection of embedding vectors at high QPS while still serving results with low latency. Searching at that scale requires approximation algorithms that trade some accuracy for large gains in speed and scale. Fundamentally, the system tries to find most of the nearest neighbors at very low latency. Recall is the metric that measures the percentage of true nearest neighbors returned by the system. Based on empirical statistics from teams within Google, we know that for many real-world applications, Vertex AI Matching Engine can achieve recall of 95% to 98% while serving results with 90th percentile latency of less than 10 ms (Google Cloud internal research, May 2021).

The ANN service also provides a brute-force exact-match option. However, exact-match indices are provided to enable testing and recall measurement. They are not recommended for low-latency deployments in production.
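To make the trade-off concrete, the following minimal sketch shows what an exact, brute-force search does: it compares the query against every stored vector. It is a toy illustration in plain Python, not the service's implementation. The function name `brute_force_neighbors`, the use of Euclidean distance, and the sample data are all assumptions made for the example.

```python
import heapq
import math


def brute_force_neighbors(query, vectors, k):
    """Return the ids of the k vectors closest to `query`.

    Hypothetical helper for illustration only. Every stored vector is
    scored against the query, so the cost grows linearly with index size.
    Euclidean distance is an assumed metric for this sketch.
    """
    scored = ((math.dist(query, vec), vec_id) for vec_id, vec in vectors.items())
    # nsmallest keeps the k smallest (distance, id) pairs, closest first.
    return [vec_id for _, vec_id in heapq.nsmallest(k, scored)]


# Made-up two-dimensional embeddings keyed by id.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}

nearest = brute_force_neighbors([0.9, 0.1], vectors, k=2)
print(nearest)
```

Because each query touches every vector, this approach gives exact "ground truth" results for recall measurement but does not scale to large indices at high QPS.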

Key user journeys

  • Create and deploy an index from a user-provided set of embedding vectors
  • Update a live index with a user-provided set of embedding vectors
  • Query a deployed index online, at low latency, to get the nearest neighbors of a query embedding vector

To learn more, see the official notebook. This notebook demonstrates how to use the Matching Engine ANN service, including the following tasks:

  • Create an ANN index and brute force index
  • Create an IndexEndpoint with VPC network
  • Deploy an ANN index and brute force index
  • Perform online query
  • Compute recall

Useful terminology

Index: A collection of vectors deployed together for similarity search. Vectors can be added to an index or removed from an index. Similarity search queries are issued to a specific index and will search over the vectors in that index.

Recall: The percentage of true nearest neighbors returned by the index. For example, if a nearest neighbor query for 20 nearest neighbors returned 19 of the "ground truth" nearest neighbors, the recall is 19/20 × 100 = 95%.
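The recall arithmetic can be sketched in a few lines of Python. The helper name `recall_percent` and the sample ids are hypothetical; in practice the ground-truth ids would typically come from a brute-force index and the returned ids from the ANN index.

```python
def recall_percent(returned_ids, ground_truth_ids):
    """Percentage of ground-truth neighbors that appear in the returned ids.

    Hypothetical helper for illustration; not part of any Google Cloud SDK.
    """
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("ground_truth_ids must not be empty")
    hits = len(truth.intersection(returned_ids))
    return 100.0 * hits / len(truth)


# 20 true neighbors; the approximate search found 19 of them plus one miss.
ground_truth = list(range(20))
returned = list(range(19)) + [99]

recall = recall_percent(returned, ground_truth)
print(recall)
```

With 19 of the 20 true neighbors returned, the result matches the 95% figure in the definition.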

Restricts: Functionality to "restrict" searches to a subset of the index by using Boolean rules.
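As a rough illustration of the idea behind restricts, the sketch below filters candidates with simple Boolean rules before ranking them by distance. It is a conceptual toy, not the service's API: the function `restricted_neighbors`, the `allow` and `deny` tag sets, the tag names, and the Euclidean metric are all assumptions made for this example.

```python
import math


def restricted_neighbors(query, items, k, allow=None, deny=None):
    """Return ids of the k nearest items that pass the Boolean tag rules.

    Hypothetical helper for illustration only. An item qualifies if it has
    at least one tag in `allow` (when `allow` is given) and no tag in `deny`.
    """
    allow = set(allow or [])
    deny = set(deny or [])
    candidates = []
    for item_id, (vector, tags) in items.items():
        tags = set(tags)
        if allow and not (tags & allow):
            continue  # none of the allowed tags is present
        if tags & deny:
            continue  # a denied tag is present
        candidates.append((math.dist(query, vector), item_id))
    # Sort the surviving candidates by distance and keep the closest k.
    return [item_id for _, item_id in sorted(candidates)[:k]]


# Made-up items: id -> (embedding, tags).
items = {
    "shoe-1": ([0.1, 0.2], {"red", "in_stock"}),
    "shoe-2": ([0.1, 0.3], {"blue", "in_stock"}),
    "shoe-3": ([0.9, 0.9], {"red"}),
    "shoe-4": ([0.2, 0.2], {"red", "discontinued"}),
}

matches = restricted_neighbors(
    [0.1, 0.2], items, k=2, allow={"red"}, deny={"discontinued"}
)
print(matches)
```

Here "shoe-2" is excluded because it lacks an allowed tag and "shoe-4" because it carries a denied one, so the search ranks only the remaining subset of the index.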