Vector search

This page describes how vector searches are implemented on Cloud SQL for MySQL instances. Cloud SQL lets you store vector embeddings, create vector indexes, and perform vector searches in conjunction with your other stored data.

Vector embedding storage

You store vector embeddings in a table that's compliant with atomicity, consistency, isolation, and durability (ACID) properties. Like other relational data in the table, you can access the vector embeddings in the table with existing transactional semantics.

To establish mapping between table rows and vector representations, you need to create a column in your table to store your vector embeddings. The column must use the Cloud SQL VECTOR data type and must indicate the number of dimensions that the embedding requires. The vector embedding column can only store vector embeddings that use exactly the same dimensions that you specify when you define the column.

A table can only have one vector embedding column. There are no restrictions for the number of rows in the table.

To distinguish the vector embedding column from other columns, Cloud SQL adds a special COMMENT and CONSTRAINT to the column. The constraint is required for input validation, and the vector embedding column annotation is visible as a COMMENT. You can't modify or delete the comment or constraint.

If you have enough storage and memory available on your Cloud SQL instance, you can have multiple tables with their own vector embedding columns.

Data replication works the same way for the vector embedding column as it does for other MySQL InnoDB columns.

For a list of limitations and restrictions for vector embedding tables, columns, and DML statements, see Limitations.

Vector indexes

You must use a vector index to perform ANN similarity searches on your vector embeddings. Cloud SQL creates vector indexes using the Scalable Nearest Neighbors (ScANN) algorithm.

Vector indexes have the following requirements:

  • You can only create one vector index per table.
  • If you have multiple tables with vector embeddings on your instance, you can create vector indexes for each of them.
  • If you're creating a vector index, you can't add a constraint to the primary key of the indexed table.

For better search quality, create a vector index only after loading the bulk of your data in the base table. If you have less than 1000 embeddings in the base table, the index creation fails.

When deciding whether to create a vector index, if you have a small number of rows, consider whether you can perform a KNN search instead. The decision to use a KNN versus an ANN search also depends on the number of dimensions on the vector embedding. A larger number of embeddings might require a vector index.

For a list of limitations and restrictions for vector indexes, see Limitations. For information on creating a vector index, see Create and manage vector indexes.

Vector index updates

Cloud SQL updates vector indexes in real time. Any transaction that performs data manipulation language (DML) operations on the base table also propagates changes to the associated vector indexes. Vector indexes behave in the same way as any other secondary index on the table. The vector indexes are fully transactionally consistent and ACID compliant. If you roll back a transaction, then the corresponding rollback changes also occur in the vector index.

Replication of vector indexes

Cloud SQL replicates vector indexes to all read replicas, including for cascading replication. When you create a new read replica from a primary instance that has vector embedding, the read replica inherits the vector embedding settings from the primary instance. For existing read replicas, you must enable vector embedding support on each one.

In terms of impact to replication lag, creating and maintaining vector indexes operate in the same way as regular MySQL indexes.

Persistence, shutdown, and impact on maintenance

Vector indexes are persisted the same way as base tables, with full ACID support. Vector indexes are always in-sync with their base table data, and have the same visibility, isolation, and crash safety. There is no impact to the vector index when the instance is shut down or receives maintenance.

Index maintenance

After extensive DML operations are performed on the base table, the vector index that you trained on the initial data (at the time of index creation) might not reflect the new state. This can impact search quality.

There are two parts to the index:

  • The index tree. This is built by training on existing data. It stays unchanged during the lifetime of the index.
  • The index leaves. These contain all the rows of data. The index leaves never go out of sync.

The index tree might become less efficient after a large number of DML statements are run because rows move from one leaf to another. To refresh the index tree, you need to rebuild the index.

Unsupported DDL operations on tables with vector indexes

  • Alter table operations requiring copy algorithm.
  • Alter table operations that requires the table to be rebuilt.
  • Drop or change the primary key.
  • Move the table to a general tablespace.

Vector search

Cloud SQL provides vector distance functions that you use to perform approximate nearest neighbor (ANN) and K-nearest neighbors (KNN) vector similarity searches on your instance. When you run a query, the query vector is compared to vectors in your dataset. Distance functions calculate the distance between the vectors using a similarity metric such as cosine. The vectors with the shortest distance between them are the most similar and are returned in search results.

Cloud SQL uses the following functions to measure distance between vectors in vector searches when you perform ANN and KNN vector searches:

  • Cosine: measures the cosine of the angle between two vectors. A smaller value indicates greater similarity between the vectors.
  • Dot product: calculates the cosine of the angle multiplied by the product of corresponding vector magnitudes.
  • L2 squared distance: measures the Euclidean distance between two vectors by adding the squared distance on each dimension.

A KNN vector search is the preferred search method when you need exact results or want to add selective filtering. KNN search performs a distance computation of the query vector with every embedding in the dataset to find the nearest neighbor. KNN searches in Cloud SQL provide perfect recall. KNN searches don't use a vector index so they're a good option when working with smaller datasets.

To perform a KNN search, you use the vector_distance function that takes two vectors as input: the query vector (what you're searching for) and a candidate vector from your dataset. It calculates the distance between these two vectors. You use vector_distance in a SELECT statement. For more information, See Search K-nearest neighbors (KNN).

If you find that KNN isn't performing well, you can build a vector index later and continue to use approx_distance in your application for ANN searches.

An ANN vector search is the preferred search type when query efficiency is a concern. It speeds up similarity searches by calculating the distance between your query vector and only a portion of the vectors in your dataset. To do this, Cloud SQL organizes the data into clusters or partitions and then focuses the search on the clusters closest to the query. ANN searches require vector indexes. These indexes prioritize search speed over perfect recall. In Cloud SQL, the TREE_SQ index type is used for ANN searches.

To perform an ANN search, you use the approx_distance function with a distance measurement option. You use approx_distance in an ORDER BY or SELECT list and a LIMIT clause is permitted to limit search results. You can also add a WHERE clause to perform post-filtering of your search results. For more information, see Search approximate nearest neighbors (ANN).

There are some cases when an ANN search falls back to a KNN search. For more information, see Check the fallback status for ANN searches.

Requirements

Cloud SQL requires you to enable vector embeddings on the instance using the cloudsql_vector flag before you add vector embeddings. For more information, see Enable and disable vector embeddings on your instance.

Limitations

The following are limitations on tables that have a vector embedding column:

  • There can only be one vector embedding column per table.
  • There can only be one vector index per table.
  • A vector embedding is restricted to 16,000 dimensions.
  • The vector embedding column can't be a generated column.
  • Table-level partitioning on tables with vector embedding columns isn't supported.
  • Primary keys that use the BIT, BINARY, VARBINARY, JSON, BLOB, TEXT data types, or spatial data aren't supported for vector indexes. Composite primary keys also can't include any of these types.
  • If there's a vector index, you can't add a constraint to the primary key of the base table.
  • When a vector index is present on a table, there are some DDL operations you can't perform. For more information, see Unsupported DDL operations on tables with vector indexes.

The following are restrictions for vector search queries:

  • The approx_distance function can only be used in an ORDER BY or SELECT list.
  • Predicates involving the base table can be used in the WHERE condition in combination with approx_distance expressions in the ORDER BY or SELECT list. The WHERE condition predicates are evaluated after the approx_distance vector functions are evaluated.

Best practices for working with vector indexes

This section provides best practices for working with vector indexes. Every workload is different, and you might need to adjust accordingly.

  • After major DML operations, it's a good practice to rebuild the index.
  • Generally, it's acceptable to let Cloud SQL compute the number of leaves to use. If you have a use case where you want to specify the number of leaves, it's recommended to have at least 100 vectors per leaf for the best recall.

What's next