Matching Engine ANN service overview

Vertex AI Matching Engine is a vector database that can find the most similar vectors in a collection of over a billion vectors. Its approximate nearest neighbor (ANN) service can serve similarity-matching queries at high queries per second (QPS).

Matching Engine's ANN service uses a new type of vector quantization developed by Google Research and described in the paper "Accelerating Large-Scale Inference with Anisotropic Vector Quantization". For more information about how this works, see the blog post "Announcing ScaNN: Efficient Vector Similarity Search".

Improved scale and recall, at lower cost

Matching Engine delivers similarity search at scale, with high QPS, high recall, and cost efficiency. Matching Engine:

  • Scales to billions of embedding vectors.
  • Serves results with 50th percentile latencies as low as 5 ms, even when the QPS is in the hundreds of thousands and hundreds of nearest neighbors are requested per query.
  • Delivers industry-leading recall. Recall measures the percentage of true nearest neighbors returned for each vector search call.
  • Uses less CPU and memory than most other known alternatives.
  • Supports embedding vectors with thousands of dimensions.

Valuable capabilities that simplify real-world architectures

  • Autoscaling configuration that saves costs during periods of low demand and adds capacity to support peak query loads.
  • Query-time Boolean predicates for filtering results.

Why does ANN perform approximate matches, instead of exact matches?

A brute-force search cannot practically scan a huge collection of embedding vectors at high QPS while still serving results with low latency. Instead, the ANN service tries to find most of the nearest neighbors at very low latency. Recall is the metric that measures the percentage of true nearest neighbors the system returns. Based on empirical statistics from teams within Google, for many real-world applications Vertex AI Matching Engine can achieve recall of 95% to 98% while serving results with 90th percentile latency below 10 ms (Google Cloud internal research, May 2021).

The ANN service also provides a brute-force exact-match option. However, exact-match indexes are intended for testing and recall measurement, and are not recommended for low-latency production deployments.
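To make the trade-off concrete, here is a minimal sketch of what an exact, brute-force nearest neighbor search does conceptually. It is not Matching Engine's implementation. The function name `brute_force_neighbors`, the choice of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import heapq
import math


def brute_force_neighbors(query, vectors, k):
    """Return the ids of the k vectors closest to `query` by Euclidean distance.

    `vectors` maps an id to a list of floats. Every vector is scored,
    so the work per query grows linearly with the size of the collection.
    """
    def distance(item):
        _, vec = item
        return math.dist(query, vec)

    nearest = heapq.nsmallest(k, vectors.items(), key=distance)
    return [vec_id for vec_id, _ in nearest]


# Hypothetical toy data: four 2-dimensional embeddings.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [5.0, 5.0],
}
print(brute_force_neighbors([0.9, 0.1], vectors, k=2))  # ['b', 'a']
```

Because each query compares against every stored vector, the cost scales with both the collection size and the query rate. That is why an exact search of this kind is better suited to producing ground truth for recall measurement than to serving production traffic.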

You can use ANN to do the following:

  • Create and deploy an index from a user-provided set of embedding vectors
  • Update a live index with a user-provided set of embedding vectors
  • Run a low-latency online query to get the nearest neighbors of a query vector

To learn more, see the official notebook. This notebook demonstrates how to use Vertex AI Matching Engine to complete the following tasks:

  • Create an ANN index and a brute-force index
  • Create an index endpoint with a VPC network
  • Deploy an ANN index and brute-force index to an endpoint
  • Perform an online query
  • Compute recall

Useful terminology

Index: A collection of vectors deployed together for similarity search. Vectors can be added to an index or removed from an index. Similarity search queries are issued to a specific index and will search over the vectors in that index.

Recall: The percentage of true nearest neighbors returned by the index. For example, if a nearest neighbor query for 20 nearest neighbors returned 19 of the "ground truth" nearest neighbors, the recall is 19/20 × 100 = 95%.
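The recall definition above can be sketched as a small helper. The function name `recall` and the sample ids are hypothetical. In practice the ground-truth ids would come from an exact (brute-force) search and the returned ids from the ANN index.

```python
def recall(returned_ids, ground_truth_ids):
    """Fraction of ground-truth neighbors that appear in the returned results."""
    truth = set(ground_truth_ids)
    if not truth:
        raise ValueError("ground_truth_ids must not be empty")
    hits = len(truth & set(returned_ids))
    return hits / len(truth)


# Hypothetical ids: the exact search found neighbors 0..19, and the
# approximate search missed one of them (19) and returned 99 instead.
ground_truth = list(range(20))
approximate = list(range(19)) + [99]
print(f"{recall(approximate, ground_truth):.0%}")  # 95%
```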

Restricts: Functionality to "restrict" searches to a subset of the index by using Boolean rules.
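As a rough illustration of the idea behind restricts, the sketch below filters a set of candidate vectors with simple allow and deny rules before any similarity ranking would take place. The helper `matches_restricts`, the token sets, and the rule semantics shown here are assumptions for illustration only. They are not the service's actual restrict syntax or behavior.

```python
def matches_restricts(tokens, allow=(), deny=()):
    """Return True if a vector's tokens satisfy the allow and deny rules.

    A vector passes when it carries at least one allowed token (if any
    are given) and none of the denied tokens.
    """
    tokens = set(tokens)
    if allow and not tokens & set(allow):
        return False
    return not tokens & set(deny)


# Hypothetical index entries: id -> tokens attached to that vector.
entries = {
    "v1": {"red", "shoe"},
    "v2": {"blue", "shoe"},
    "v3": {"red", "hat"},
}

# Keep only vectors tagged "shoe" that are not tagged "blue".
eligible = sorted(
    vec_id for vec_id, tokens in entries.items()
    if matches_restricts(tokens, allow={"shoe"}, deny={"blue"})
)
print(eligible)  # ['v1']
```

Only the vectors that pass the predicate would then be considered as nearest neighbor candidates for the query.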

What's next