Create indexes

This page describes how to use stored embeddings to generate indexes and query embeddings using ScaNN, IVF, IVFFlat, and HNSW indexes with AlloyDB for PostgreSQL. For more information about storing embedding, see Store vector embeddings.

Before you begin

Before you can start creating indexes, you must complete the following prerequisites.

Embedding vectors are added to a table in your AlloyDB database.
The vector extension version 0.5.0 or later that is based on pgvector, extended by Google for AlloyDB is installed.
```
CREATE EXTENSION IF NOT EXISTS vector;
```
To generate ScaNN indexes, install the alloydb_scann extension in addition to the vector extension.
```
CREATE EXTENSION IF NOT EXISTS alloydb_scann;
```

Create an index

You can create one of the following index types for tables in your database.

Create a `ScaNN` index

AlloyDB alloydb_scann, a PostgreSQL extension developed by Google that implements a highly efficient nearest-neighbor index powered by the ScaNN algorithm.

The ScaNN index is a tree-based quantization index for approximate nearest neighbor search. It provides lower index building time and smaller memory footprint as compared to HNSW. In addition, it provides faster QPS in comparison to HNSW based on the workload.

Two-level tree `ScaNN` index

To apply a two-level tree index using the ScaNN algorithm to a column containing stored vector embeddings, run the following DDL query:

CREATE INDEX INDEX_NAME ON TABLE
  USING scann (EMBEDDING_COLUMN DISTANCE_FUNCTION)
  WITH (num_leaves=NUM_LEAVES_VALUE);

Replace the following:

INDEX_NAME: the name of the index you want to create—for example, my-scann-index. The index names are shared across your database. Ensure that each index name is unique to each table in your database.
TABLE: the table to add the index to.
EMBEDDING_COLUMN: a column that stores vector data.
DISTANCE_FUNCTION: the distance function to use with this index. Choose one of the following:
- L2 distance: l2
- Dot product: dot_product
- Cosine distance: cosine
NUM_LEAVES_VALUE: the number of partitions to apply to this index. Set to any value between 1 to 1048576. For more information about how to decide this value, see Tune a ScaNN index.

Three-level tree `ScaNN` index

To create a three-level tree index using the ScaNN algorithm to a column containing stored vector embeddings, run the following DDL query:

CREATE INDEX INDEX_NAME ON TABLE
  USING scann (EMBEDDING_COLUMN DISTANCE_FUNCTION)
  WITH (num_leaves=NUM_LEAVES_VALUE, max_num_levels = MAX_NUM_LEVELS);

Replace the following:

MAX_NUM_LEVELS: the maximum number of levels of the K-means clustering tree. Set to 1(default) for two-level tree-based quantization and to 2 for three-level tree-based quantization.

After you create the index, you can run nearest-neighbor search queries that make use of the index by following the instructions in Make a nearest-neighbor query with given text.

The index parameters must be set to strike a right balance between QPS and recall. For more information about tuning the ScaNN index, see Tune a ScaNN index.

To create this index on an embedding column that uses the real[] data type instead of vector, cast the column into the vector data type:

CREATE INDEX INDEX_NAME ON TABLE
  USING scann (CAST(EMBEDDING_COLUMN AS vector(DIMENSIONS)) DISTANCE_FUNCTION)
  WITH (num_leaves=NUM_LEAVES_VALUE, max_num_levels = MAX_NUM_LEVELS);

Replace DIMENSIONS with the dimensional width of the embedding column. For more information about how to find the dimensions, see the vector_dims function in Vector functions.

To achieve a consistent search experience, enable auto-maintenance when you create a ScaNN index. For more information, see Maintain vector indexes. This feature is available in Preview.

To view the indexing progress, use the pg_stat_progress_create_index view:

SELECT * FROM pg_stat_progress_create_index;

The phase column shows the current state of your index creation, and the building index: tree training phase disappears after the index is created.

To tune your index for a target recall and QPS balance, see Tune a ScaNN index.

Analyze your indexed table

After you create the ScaNN index, you must run the ANALYZE command to update statistics about your data.

ANALYZE TABLE;

Build indexes in parallel

To build your index faster, AlloyDB might automatically spawn multiple parallel workers, depending on your dataset and the type of index you choose.

The parallel index build is often triggered if you're creating a 3-level ScaNN index or if your dataset exceeds 100M rows.

Though AlloyDB automatically optimizes the number of parallel workers, you can tune the parallel workers using the max_parallel_maintenance_workers, max_parallel_workers, and the min_parallel_table_scan_size PostgreSQL query planning parameters.

Run a query using pgvector

After you store and index the embeddings in your database, you can start querying using the pgvector query functionality. You cannot run bulk search queries using the alloydb_scann extension.

To find the nearest semantic neighbors for an embedding vector, you can run the following example query, where you set the same distance function that you used during the index creation.

  SELECT * FROM TABLE
    ORDER BY EMBEDDING_COLUMN DISTANCE_FUNCTION_QUERY ['EMBEDDING']
    LIMIT ROW_COUNT

Replace the following:

TABLE: the table containing the embedding to compare the text to.
INDEX_NAME: the name of the index you want to use—for example, my-scann-index.
EMBEDDING_COLUMN: the column containing the stored embeddings.
DISTANCE_FUNCTION_QUERY: the distance function to use with this query. Choose one of the following based on the distance function used while creating the index:
- L2 distance: <->
- Inner product: <#>
- Cosine distance: <=>
EMBEDDING: the embedding vector you want to find the nearest stored semantic neighbors of.
ROW_COUNT: the number of rows to return.

Specify 1 if you want only the single best match.

For more information about other query examples, see Querying.

You can use also use the embedding() function to translate the text into a vector. You apply the vector to one of the pgvector nearest-neighbor operator, <-> for L2 distance, to find the database rows with the most semantically similar embeddings.

Because embedding() returns a real array, you must explicitly cast the embedding() call to vector in order to use these values with pgvector operators.

Create indexes Stay organized with collections Save and categorize content based on your preferences.

Before you begin

Create an index

Create a ScaNN index

Two-level tree ScaNN index

Three-level tree ScaNN index

Analyze your indexed table

Build indexes in parallel

Run a query using pgvector

What's next

Create indexes

Create a `ScaNN` index

Two-level tree `ScaNN` index

Three-level tree `ScaNN` index