Perform similarity vector search in Spanner by finding the K-nearest neighbors

This page describes how to perform similarity vector search in Spanner by using the cosine distance and Euclidean distance vector functions to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:

You can use vector distance functions to perform K-nearest neighbors (KNN) vector search for use cases like similarity search or retrieval-augmented generation. Spanner supports COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings, allowing you to find the KNN of the input embedding.

For example, after you generate and save your operational Spanner data as vector embeddings, you can then provide these vector embeddings as an input parameter in your query to find the nearest vectors in N-dimensional space to search for semantically similar or related items.

Depending on how the embeddings were generated and what type of search is desired, either the cosine distance or the Euclidean distance function can produce search results that are subjectively better than the other function. Both distance functions take the arguments vector1 and vector2, which are of the type array<>, and must consist of the same dimensions and have the same length. For more details about these functions, see:

Examples

The following examples provide use cases of when to use COSINE_DISTANCE() or EUCLIDEAN_DISTANCE().

Consider a Documents table that has a column (DocEmbedding) of precomputed text embeddings from the DocContents bytes column.

GoogleSQL

CREATE TABLE Documents (
UserId       INT64 NOT NULL,
DocId        INT64 NOT NULL,
Author       STRING(1024),
DocContents  BYTES,
DocEmbedding ARRAY<FLOAT32>
) PRIMARY KEY (UserId, DocId);

PostgreSQL

CREATE TABLE Documents (
UserId       bigint primary key,
DocId        bigint primary key,
Author       varchar(1024),
DocContents  bytea,
DocEmbedding float4[]
);

Assuming that an input embedding for "baseball, but not professional baseball" is the array [0.3, 0.3, 0.7, 0.7], you can find the top five nearest documents that match, with the following query:

GoogleSQL

SELECT DocId, DocEmbedding FROM Documents
ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,
ARRAY<FLOAT32>[0.3, 0.3, 0.7, 0.8])
LIMIT 5;

PostgreSQL

SELECT DocId, DocEmbedding FROM Documents
ORDER BY spanner.euclidean_distance(DocEmbedding,
'{0.3, 0.3, 0.7, 0.8}'::float4[])
LIMIT 5;

The expected results of this example:

Documents
+---------------------------+-----------------+
| DocId                     | DocEmbedding    |
+---------------------------+-----------------+
| 24                        | [8, ...]        |
+---------------------------+-----------------+
| 25                        | [6, ...]        |
+---------------------------+-----------------+
| 26                        | [3.2, ...]      |
+---------------------------+-----------------+
| 27                        | [38, ...]       |
+---------------------------+-----------------+
| 14229                     | [1.6, ...]      |
+---------------------------+-----------------+

Example 2: KNN search over partitioned data

The query in the previous example can be modified by adding conditions to the WHERE clause to limit the vector search to a subset of your data. One common application of this is to search over partitioned data, such as rows that belong to a specific UserId.

GoogleSQL

SELECT UserId, DocId, DocEmbedding FROM Documents
WHERE UserId=18
ORDER BY COSINE_DISTANCE(DocEmbedding,
ARRAY<FLOAT32>[0.3, 0.3, 0.7, 0.8])
LIMIT 5;

PostgreSQL

SELECT UserId, DocId, DocEmbedding FROM Documents
WHERE UserId=18
ORDER BY spanner.cosine_distance(DocEmbedding,
'{0.3, 0.3, 0.7, 0.8}'::float4[])
LIMIT 5;

The expected results of this example:

Documents
+-----------+-----------------+-----------------+
| UserId    | DocId           | DocEmbedding    |
+-----------+-----------------+-----------------+
| 18        | 234             | [12, ...]       |
+-----------+-----------------+-----------------+
| 18        | 12              | [1.6, ...]      |
+-----------+-----------------+-----------------+
| 18        | 321             | [22, ...]       |
+-----------+-----------------+-----------------+
| 18        | 432             | [3, ...]        |
+-----------+-----------------+-----------------+

Example 3: KNN search over secondary index ranges

If the WHERE clause filter you're using isn't part of the table's primary key, then you can create a secondary index to accelerate the operation with an index-only scan.

GoogleSQL

CREATE INDEX DocsByAuthor
ON Documents(Author)
STORING (DocEmbedding);

SELECT Author, DocId, DocEmbedding FROM Documents
WHERE Author="Mark Twain"
ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,
   <embeddings for "book about the time traveling American">)
LIMIT 5;

PostgreSQL

CREATE INDEX DocsByAuthor
ON Documents(Author)
INCLUDE (DocEmbedding);

SELECT Author, DocId, DocEmbedding FROM Documents
WHERE Author="Mark Twain"
ORDER BY spanner.euclidean_distance(DocEmbedding,
   <embeddings for "that book about the time traveling American">)
LIMIT 5;

The expected results of this example:

Documents
+------------+-----------------+-----------------+
| Author     | DocId           | DocEmbedding    |
+------------+-----------------+-----------------+
| Mark Twain | 234             | [12, ...]       |
+------------+-----------------+-----------------+
| Mark Twain | 12              | [1.6, ...]      |
+------------+-----------------+-----------------+
| Mark Twain | 321             | [22, ...]       |
+------------+-----------------+-----------------+
| Mark Twain | 432             | [3, ...]        |
+------------+-----------------+-----------------+
| Mark Twain | 375             | [9, ...]        |
+------------+-----------------+-----------------+

What's next