Cloud SQL for PostgreSQL: Faster similarity search performance with pgvector indexes
Join us at Google Cloud Next
Coming to Las Vegas, April 9–11.Register
Update February 8, 2024: Cloud SQL for PostgreSQL now supports pgvector 0.5.1. This version introduces new performance improvements around the build time for HNSW indexes.
In the fast evolving landscape of approximate nearest neighbor (ANN) search, one of the most important recent launches is pgvector extension version 0.5.0, which adds support for Hierarchical Navigable Small Worlds (HNSW) indexes. With the constant quest for efficiency, speed and performance, HNSW indexes significantly reduce ANN query latency for the desired recall. We are excited to announce support for pgvector 0.5.0 in Cloud SQL for PostgreSQL.
In this post, we explain pgvector indexes, clarify different configurations, and give hands-on coding examples for improving the performance and viability of a pgvector-based application using HNSW indexes.
The pgvector extension in Cloud SQL for PostgreSQL
The pgvector extension can be installed within an existing instance of Cloud SQL for PostgreSQL using the CREATE EXTENSION command as shown below. If you do not have an existing instance, create one today. Cloud SQL for PostgreSQL now supports pgvector 0.5.0, which adds HNSW indexes in pgvector, in addition to Inverted File Flat (IVFFlat) indexes.
If you have already installed an older version of pgvector in your Cloud SQL for PostgreSQL instance, it can now be upgraded to pgvector 0.5.0 using the following command.
The extension registers a new data type called “vector” in PostgreSQL and defines several new operators on it:
- Element-wise addition (+)
- Element-wise subtraction (-)
- Element-wise multiplication (*)
- Euclidean distance (<->)
- Cosine distance (<=>)
The commands below demonstrate how to insert vectors to a PostgreSQL table and compute row-wise cosine distances.
Indexes in pgvector
In the same way that database indexes allow for faster data retrieval, vector indexes can significantly speed up similarity search operation and avoid the brute-force exact nearest neighbor search that is used by default. Although a brute-force search across a large database provides a perfect recall, it also comes at a significant cost to performance.
While the pgvector extension already supported IVFFLAT indexes, version 0.5.0 adds support for HNSW indexes. Unlike indexes that are used for exact retrieval, such as b-tree indexes, HNSW indexes can lead to different results for the same queries.
The HNSW index uses the hierarchical navigable small worlds (HNSW) algorithm to build optimized graphs for performing approximate nearest neighbor search. It is one of the top-performing vector indexing algorithms used in vector databases. It has better query performance than many other algorithms, including IVFFlat, albeit at the cost of slower build times and more memory usage.
Another advantage of HNSW is the fact that there’s no training step when building the index. In other words, you can create an index without any data in the table, and it will incrementally build up as data is added. This is different from IVFFlat, where the index needs to be rebuilt periodically as the indexed data changes over time to get better performance.
The following code snippet demonstrates how you can build HNSW indexes in pgvector with different distance functions and index parameters.
This snippet creates an HNSW index on the
embedding column of the
product_embeddings table using the cosine distance function, with HNSW parameters m and
ef_construction set to
Before trying it out yourself, you should understand the following HNSW tuning parameters:
- m: The maximum number of connections with neighboring data points in the graph. Higher values would make the graph denser, providing faster lookups at the cost of an increase in build times and memory usage. According to the original paper on using HNSW, this is the most important parameter. A reasonable range for m is 5 to 48, with 16 being the default in pgvector. A larger value would provide better performance at good recall, especially for high-dimensional data.
- ef_construction: The size of the list which holds the closest candidates during the graph traversal when building the index. Higher values would lead the algorithm to consider more candidates, potentially allowing the creation of a better index. However, after a point, increasing this parameter provides diminishing returns.
- ef_search: Unlike the other parameters, which configure building the index, this parameter configures query execution and limits the number of nearest neighbors maintained in the list. Higher values lead to better recall at the cost of query performance since the algorithm considers and returns more nodes. This parameter can be set at the instance, session, or even transaction level.
You can also use the IVFFlat index to index vector columns by dividing the vector into lists and identifying the subset of lists closest to the input vector. IVFFlat indexes have faster build times and use less memory than HNSW indexes. However, HNSW indexes have better query performance.
The following code snippets demonstrate how to build IVFFlat indexes in pgvector:
Enriching the sample application with indexes
We created an interactive Google Colab application using pgvector and LLMs in our previous blog. As part of that, we loaded a dataset sampled from a larger public retail dataset, which is available on Kaggle. The dataset used in the application has around 800 toy products, with descriptions available for each product. This was loaded into a PostgreSQL table called
These long descriptions were processed using LangChain and embeddings were generated for each of the chunks using the Vertex AI Text Embedding model. These embeddings were then stored as a
vector type in a table called
product_embeddings. Finally, similarity search operators were used to find the nearest neighbor of embeddings generated from search terms.
In this blog, we will extend the same sample application to leverage pgvector indexes to make our application scalable with any reasonable number of vectors. Please note that we will be using the same setup as the previous blog and build upon the interactive Google Colab notebook.
Follow the instructions in the Colab notebook to set up your environment. If an instance with the required name does not exist, the notebook creates a Cloud SQL PostgreSQL instance for you. Running the notebook may incur Google Cloud charges. You may be eligible for a free trial that gets you credits for these costs.
Adding HNSW indexes to optimize vector similarity search
Following the steps of the previous blog, we’ve made an entire product dataset searchable using natural language!
All you have to do is generate vector embeddings for the incoming input query and then use the pgvector cosine similarity search operator to find related products. Now let’s try to optimize it!
If we analyze the query plan for this by prepending
EXPLAIN ANALYZE, the query iterates over more than 800 rows and takes over 12ms.
Instead, let’s try creating an HNSW index. Since the dataset has only 800 rows, we can consider smaller values for the index parameters.
Running the same query again, the response is the same as the full scan. However, if we
EXPLAIN ANALYZE the same query again, PostgreSQL uses the
product_embeddings_embedding_idx index, resulting in a much cheaper query.
Adding the index results in a massive improvement in the query performance, which is much cheaper to call since it uses the index. In this example, the execution time dropped to around 0.5ms, demonstrating the power of pgvector indexes.
This performance gain is for a dataset with only 800 rows, but it becomes even more apparent with bigger datasets. In some cases, it is the only realistic way to scale pgvector to real datasets with hundreds of thousands to even millions of rows.
For example, let’s use the 30,000 rows from the original dataset on Kaggle. At this scale, using an HNSW index is already much faster than a sequential scan. With even more rows and dimensions, the sequential scan becomes completely infeasible.
With sequential scan:
With HNSW index scan:
The ANN-benchmarks package on GitHub provides tools to benchmark several of these approximate nearest neighbor (ANN) algorithms. Using this package, you can clearly find the performance characteristics and behavior of pgvector with HNSW indexes, IVFFlat indexes and full scan on varied real-world datasets.
Leveraging these indexes, you can do similarity search at scale with an appreciable queries per second on real-world data, even with millions of vector embeddings.
We hope this post demonstrates the power of indexes in pgvector and how adding indexes like HNSW makes vector similarity search fast and scalable with production data in PostgreSQL. Our hands-on application and dataset should make it easier for you to build your own application with step-by-step instructions.
We’ve updated the existing Google Colab notebook to include the benefits of pgvector indexes described in this blog. Get started today!