Perform similarity vector search in Spanner by finding the K-nearest neighbors

This page describes how to perform similarity vector search in Spanner by using the cosine distance, Euclidean distance, and dot product vector functions to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:

  • Euclidean distance: measures the shortest distance between two vectors.
  • Cosine distance: measures the cosine of the angle between two vectors.
  • Dot product: calculates the cosine of the angle multiplied by the product of corresponding vector magnitudes. If you know that all the vector embeddings in your dataset are normalized, then you can use DOT_PRODUCT() as a distance function.
  • K-nearest neighbors (KNN): a supervised machine learning algorithm used to solve classification or regression problems.

You can use vector distance functions to perform K-nearest neighbors (KNN) vector search for use cases like similarity search or retrieval-augmented generation. Spanner supports the COSINE_DISTANCE(), EUCLIDEAN_DISTANCE(), and DOT_PRODUCT() functions, which operate on vector embeddings, allowing you to find the KNN of the input embedding.

For example, after you generate and save your operational Spanner data as vector embeddings, you can then provide these vector embeddings as an input parameter in your query to find the nearest vectors in N-dimensional space to search for semantically similar or related items.

All three distance functions take the arguments vector1 and vector2, which are of the type array<>, and must consist of the same dimensions and have the same length. For more details about these functions, see:

Examples

The following examples show KNN search, KNN search over partitioned data, and using a secondary index with KNN.

The examples all use EUCLIDEAN_DISTANCE(). You can also use COSINE_DISTANCE(). In addition, if all the vector embeddings in your dataset are normalized, you can use DOT_PRODUCT() as a distance function.

Consider a Documents table that has a column (DocEmbedding) of precomputed text embeddings from the DocContents bytes column.

GoogleSQL

CREATE TABLE Documents (
UserId       INT64 NOT NULL,
DocId        INT64 NOT NULL,
Author       STRING(1024),
DocContents  BYTES,
DocEmbedding ARRAY<FLOAT32>
) PRIMARY KEY (UserId, DocId);

PostgreSQL

CREATE TABLE Documents (
UserId       bigint primary key,
DocId        bigint primary key,
Author       varchar(1024),
DocContents  bytea,
DocEmbedding float4[]
);

Assuming that an input embedding for "baseball, but not professional baseball" is the array [0.3, 0.3, 0.7, 0.7], you can find the top five nearest documents that match, with the following query:

GoogleSQL

SELECT DocId, DocEmbedding FROM Documents
ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,
ARRAY<FLOAT32>[0.3, 0.3, 0.7, 0.8])
LIMIT 5;

PostgreSQL

SELECT DocId, DocEmbedding FROM Documents
ORDER BY spanner.euclidean_distance(DocEmbedding,
'{0.3, 0.3, 0.7, 0.8}'::float4[])
LIMIT 5;

The expected results of this example:

Documents
+---------------------------+-----------------+
| DocId                     | DocEmbedding    |
+---------------------------+-----------------+
| 24                        | [8, ...]        |
+---------------------------+-----------------+
| 25                        | [6, ...]        |
+---------------------------+-----------------+
| 26                        | [3.2, ...]      |
+---------------------------+-----------------+
| 27                        | [38, ...]       |
+---------------------------+-----------------+
| 14229                     | [1.6, ...]      |
+---------------------------+-----------------+

Example 2: KNN search over partitioned data

The query in the previous example can be modified by adding conditions to the WHERE clause to limit the vector search to a subset of your data. One common application of this is to search over partitioned data, such as rows that belong to a specific UserId.

GoogleSQL

SELECT UserId, DocId, DocEmbedding FROM Documents
WHERE UserId=18
ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,
ARRAY<FLOAT32>[0.3, 0.3, 0.7, 0.8])
LIMIT 5;

PostgreSQL

SELECT UserId, DocId, DocEmbedding FROM Documents
WHERE UserId=18
ORDER BY spanner.euclidean_distance(DocEmbedding,
'{0.3, 0.3, 0.7, 0.8}'::float4[])
LIMIT 5;

The expected results of this example:

Documents
+-----------+-----------------+-----------------+
| UserId    | DocId           | DocEmbedding    |
+-----------+-----------------+-----------------+
| 18        | 234             | [12, ...]       |
+-----------+-----------------+-----------------+
| 18        | 12              | [1.6, ...]      |
+-----------+-----------------+-----------------+
| 18        | 321             | [22, ...]       |
+-----------+-----------------+-----------------+
| 18        | 432             | [3, ...]        |
+-----------+-----------------+-----------------+

Example 3: KNN search over secondary index ranges

If the WHERE clause filter you're using isn't part of the table's primary key, then you can create a secondary index to accelerate the operation with an index-only scan.

GoogleSQL

CREATE INDEX DocsByAuthor
ON Documents(Author)
STORING (DocEmbedding);

SELECT Author, DocId, DocEmbedding FROM Documents
WHERE Author="Mark Twain"
ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,
   <embeddings for "book about the time traveling American">)
LIMIT 5;

PostgreSQL

CREATE INDEX DocsByAuthor
ON Documents(Author)
INCLUDE (DocEmbedding);

SELECT Author, DocId, DocEmbedding FROM Documents
WHERE Author="Mark Twain"
ORDER BY spanner.euclidean_distance(DocEmbedding,
   <embeddings for "that book about the time traveling American">)
LIMIT 5;

The expected results of this example:

Documents
+------------+-----------------+-----------------+
| Author     | DocId           | DocEmbedding    |
+------------+-----------------+-----------------+
| Mark Twain | 234             | [12, ...]       |
+------------+-----------------+-----------------+
| Mark Twain | 12              | [1.6, ...]      |
+------------+-----------------+-----------------+
| Mark Twain | 321             | [22, ...]       |
+------------+-----------------+-----------------+
| Mark Twain | 432             | [3, ...]        |
+------------+-----------------+-----------------+
| Mark Twain | 375             | [9, ...]        |
+------------+-----------------+-----------------+

What's next