Perform similarity vector search in Bigtable by finding the K-nearest neighbors
Similarity vector search can help you identify similar concepts and contextual meaning in your Bigtable data, which means it can yield more relevant results when filtering for data stored within a specified key range. Example use cases might include the following:
- Inbox search, where you want to perform semantic matching of messages for a particular user
- Anomaly detection within a range of sensors
- Retrieving the most relevant documents within a set of known keys for retrieval augmented generation (RAG)
This page describes how to perform similarity vector search in Bigtable by using the cosine distance and Euclidean distance vector functions in GoogleSQL for Bigtable to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:
- Euclidean distance: measures the shortest distance between two vectors.
- Cosine distance: one minus the cosine of the angle between two vectors, so vectors that point in the same direction have a distance of 0.
- K-nearest neighbors (KNN): a supervised machine learning algorithm used to solve classification or regression problems.
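To make these definitions concrete, the following is a minimal pure-Python sketch rather than Bigtable code. The helper names (euclidean_distance, cosine_distance, k_nearest) are hypothetical, and cosine distance is taken here as one minus the cosine similarity. KNN is used as a brute-force search: rank every candidate by its distance to the query vector and keep the first K.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """One minus the cosine similarity of two non-zero vectors (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest(
    query: Vector,
    candidates: Dict[str, Vector],
    k: int,
    distance: Callable[[Vector, Vector], float] = cosine_distance,
) -> List[Tuple[str, float]]:
    """Brute-force KNN: score every candidate, sort by distance, keep the first k."""
    scored = [(key, distance(query, vec)) for key, vec in candidates.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Hypothetical two-dimensional "embeddings", keyed like Bigtable row keys.
docs = {"doc_0": [1.0, 0.0], "doc_1": [0.0, 1.0], "doc_2": [3.0, 4.0]}
nearest = k_nearest([1.0, 0.1], docs, k=2)
print(nearest)
```

In this toy data set, doc_2 ranks ahead of doc_1 under cosine distance even though its magnitude is much larger, because cosine distance depends only on direction.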
Bigtable supports the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings, letting you find the KNN of the input embedding.
You can use the Vertex AI text embeddings APIs to generate vector embeddings for your data and store them in Bigtable. You can then provide a vector embedding as an input parameter in your query to find the nearest stored vectors in N-dimensional space, which correspond to semantically similar or related items.
Both distance functions take the arguments vector1 and vector2, which are of the type array<> and must have the same number of dimensions, that is, the same length. For more details about these functions, see the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() entries in the GoogleSQL for Bigtable function reference.
The code on this page demonstrates how to create embeddings, store them in Bigtable, and then perform a KNN search.
The example on this page uses COSINE_DISTANCE() and the Bigtable client library for Python. However, you can also use EUCLIDEAN_DISTANCE() and any client library that supports GoogleSQL for Bigtable, such as the Bigtable client library for Java.
Before you begin
Complete the following before you try the code samples.
Required roles
To get the permissions that you need to read and write to Bigtable, ask your administrator to grant you the following IAM role:
- Bigtable User (roles/bigtable.user) on the Bigtable instance that you want to send requests to
Set up your environment
Download and install the Bigtable client library for Python. To use GoogleSQL for Bigtable functions, you must use python-bigtable version 2.26.0 or later. Instructions, including how to set up authentication, are at Python hello world.
If you don't have a Bigtable instance, follow the steps at Create an instance.
Identify your resource IDs. When you run the code, replace the following placeholders with the IDs of your Google Cloud project, Bigtable instance, and table:
PROJECT_ID
INSTANCE_ID
TABLE_ID
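One way to supply these values without editing each sample is to read them from environment variables. This is a minimal sketch: the variable names BIGTABLE_PROJECT_ID, BIGTABLE_INSTANCE_ID, and BIGTABLE_TABLE_ID and the fallback values are hypothetical, and the client library does not read them on its own.

```python
import os

# Hypothetical environment variable names; the fallbacks are placeholders only.
PROJECT_ID = os.environ.get("BIGTABLE_PROJECT_ID", "my-project")
INSTANCE_ID = os.environ.get("BIGTABLE_INSTANCE_ID", "my-instance")
TABLE_ID = os.environ.get("BIGTABLE_TABLE_ID", "knn_intro")

print(f"Using project={PROJECT_ID}, instance={INSTANCE_ID}, table={TABLE_ID}")
```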
Create a table to store the text, embeddings, and search phrase
Create a table with two column families.
Python
from google.cloud import bigtable
from google.cloud.bigtable import column_family
client = bigtable.Client(project=PROJECT_ID, admin=True)
instance = client.instance(INSTANCE_ID)
table = instance.table(TABLE_ID)
column_families = {
    "docs": column_family.MaxVersionsGCRule(2),
    "search_phrase": column_family.MaxVersionsGCRule(2),
}
if not table.exists():
    table.create(column_families=column_families)
else:
    print("Table already exists")
Embed texts with a pre-trained, foundational model from Vertex AI
Generate the text and embeddings to store in Bigtable along with the associated keys. For additional documentation, see Get text embeddings or Get multimodal embeddings.
Python
from typing import List, Optional
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
from vertexai.generative_models import GenerativeModel
# Define which LLM to use to generate the text
model = GenerativeModel("gemini-1.5-pro-001")
# First, use generative AI to create a list of 10 chunks of text.
# This can be replaced with a static list of text items or your own data.
chunks = []
for i in range(10):
    response = model.generate_content(
        "Generate a paragraph between 10 and 20 words that is about either "
        "Bigtable or Generative AI"
    )
    chunks.append(response.text)
    print(response.text)
# Create embeddings for the chunks of text
def embed_text(
    texts: List[str] = chunks,
    task: str = "RETRIEVAL_DOCUMENT",
    model_name: str = "text-embedding-004",
    dimensionality: Optional[int] = 128,
) -> List[List[float]]:
    """Embeds texts with a pre-trained, foundational model."""
    model = TextEmbeddingModel.from_pretrained(model_name)
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
    embeddings = model.get_embeddings(inputs, **kwargs)
    return [embedding.values for embedding in embeddings]
embeddings = embed_text()
print("embeddings created for text phrases")
Define a function that converts embeddings into byte objects
Bigtable is optimized for key-value pairs and generally stores data as byte objects. For more information about designing your data model for Bigtable, see Schema design best practices.
Vertex AI returns each embedding as a list of floating-point numbers in Python, which you need to convert into a byte object before you store it. To do so, convert each element to the 4-byte big-endian IEEE 754 floating-point format and then concatenate the results. The following function achieves this.
Python
import struct
def floats_to_bytes(float_list):
    """
    Convert a list of floats to a bytes object, where each float is represented
    by 4 big-endian bytes.
    Parameters:
        float_list (list of float): The list of floats to be converted.
    Returns:
        bytes: The resulting bytes object with concatenated 4-byte big-endian
        representations of the floats.
    """
    byte_array = bytearray()
    for value in float_list:
        packed_value = struct.pack('>f', value)
        byte_array.extend(packed_value)
    # Convert bytearray to bytes
    return bytes(byte_array)
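As a quick sanity check on this encoding, a hypothetical inverse helper, bytes_to_floats, can decode the bytes back into Python floats. This sketch repeats a compact version of the encoder so that it runs on its own; neither helper is part of the Bigtable client library. Because each value is stored as a 4-byte float32, decoded values can differ slightly from the original 64-bit Python floats.

```python
import struct
from typing import List


def floats_to_bytes(float_list: List[float]) -> bytes:
    """Pack floats as concatenated 4-byte big-endian IEEE 754 values."""
    return struct.pack(f">{len(float_list)}f", *float_list)


def bytes_to_floats(data: bytes) -> List[float]:
    """Hypothetical inverse: unpack 4-byte big-endian floats from a bytes object."""
    if len(data) % 4 != 0:
        raise ValueError("byte length must be a multiple of 4")
    count = len(data) // 4
    return list(struct.unpack(f">{count}f", data))


# These values are exactly representable as float32, so the round trip is exact.
original = [0.5, -1.25, 3.0]
encoded = floats_to_bytes(original)
decoded = bytes_to_floats(encoded)
print(len(encoded), decoded)
```

A 128-dimensional embedding, the size requested by the embedding sample on this page, therefore occupies 512 bytes in its cell.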
Write the embeddings to Bigtable
Convert the embeddings to byte objects, create a mutation, and then write the data to Bigtable.
Python
mutations = []
embeddings = embed_text()
for i, embedding in enumerate(embeddings):
    print(embedding)
    # Convert each embedding into a byte object
    vector = floats_to_bytes(embedding)
    # Set the row key, which is used to pull the range of documents (for example, doc type or user ID)
    row_key = f"doc_{i}"
    row = table.direct_row(row_key)
    # Set the column for the embedding, using the byte object format of the embedding
    row.set_cell("docs", "embedding", vector)
    # Store the text associated with the vector in the same row
    row.set_cell("docs", "text", chunks[i])
    mutations.append(row)
# Write the rows to Bigtable
table.mutate_rows(mutations)
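Because the search is scoped by row key, the key layout decides which documents a query can select as a range. The following sketch assumes a hypothetical owner#document layout and hypothetical helper names; it is one possible design, not a scheme that Bigtable prescribes. With keys built this way, a filter such as _key LIKE 'user123#%' selects only one owner's rows.

```python
def make_row_key(owner_id: str, doc_id: str, separator: str = "#") -> str:
    """Build a row key that groups all of an owner's documents under one prefix."""
    # Reject characters that would make the prefix ambiguous: the separator
    # itself, and the LIKE wildcards % and _ (this sketch does no escaping).
    if any(ch in owner_id for ch in (separator, "%", "_")):
        raise ValueError("owner_id contains a reserved character")
    return f"{owner_id}{separator}{doc_id}"


def owner_prefix_pattern(owner_id: str, separator: str = "#") -> str:
    """LIKE pattern that matches every row key belonging to one owner."""
    if any(ch in owner_id for ch in (separator, "%", "_")):
        raise ValueError("owner_id contains a reserved character")
    return f"{owner_id}{separator}%"


# Hypothetical owner and document IDs.
keys = [make_row_key("user123", f"doc_{i}") for i in range(3)]
pattern = owner_prefix_pattern("user123")
print(keys, pattern)
```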
Perform a KNN search using GoogleSQL for Bigtable
The vectors are stored as binary-encoded data that can be read from Bigtable using a conversion function from the BYTES type to ARRAY<FLOAT32>.
Here is the SQL query:
SELECT _key, TO_VECTOR32(data['embedding']) AS embedding
FROM table WHERE _key LIKE 'store123%';
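Combining that conversion with a distance function and a key filter gives a KNN query over a key range. The sketch below only builds the SQL text and does not contact Bigtable. The function build_knn_query and its parameters are hypothetical names, and the docs column family with its embedding and text columns is assumed to match the table created on this page. It formats the search vector as an array literal, which is the same approach the Python sample on this page takes.

```python
from typing import Sequence


def build_knn_query(
    table_id: str,
    key_prefix: str,
    search_vector: Sequence[float],
    k: int = 5,
) -> str:
    """Return GoogleSQL text that ranks rows under key_prefix by cosine distance."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if "'" in key_prefix:
        # Keep the prefix from terminating the SQL string literal early.
        raise ValueError("key_prefix must not contain a single quote")
    # Render the vector as an array literal such as [0.1, 0.2].
    vector_literal = "[" + ", ".join(repr(float(v)) for v in search_vector) + "]"
    # Assumptions: table_id is inserted unquoted, as in the sample query, and
    # any % or _ in key_prefix acts as a LIKE wildcard (no escaping here).
    return (
        "SELECT _key, docs['text'] AS description\n"
        f"FROM {table_id}\n"
        f"WHERE _key LIKE '{key_prefix}%'\n"
        f"ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {vector_literal})\n"
        f"LIMIT {k};"
    )


sql = build_knn_query("knn_intro", "doc_", [0.1, 0.2], k=3)
print(sql)
```

Raising k returns more neighbors in order of increasing distance, and narrowing key_prefix restricts the search to a smaller key range.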
In Python, you can use the GoogleSQL COSINE_DISTANCE function to find the similarity between your text embeddings and the search phrases that you give it. Because this computation can take time to process, use the Python client library's asynchronous data client to execute the SQL query.
Python
from google.cloud.bigtable.data import BigtableDataClientAsync
# First, embed the search phrase
search_embedding = embed_text(texts=["Apache HBase"])
# knn_intro is the table name used in this example; use your own table ID
query = """
SELECT _key, docs['text'] AS description
FROM knn_intro
ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {search_embedding})
LIMIT 1;
"""
async def execute_query():
    async with BigtableDataClientAsync(project=PROJECT_ID) as client:
        formatted_query = query.format(search_embedding=search_embedding[0])
        async for row in await client.execute_query(formatted_query, INSTANCE_ID):
            # LIMIT 1 yields a single row, so return the first result
            return (row["_key"], row["description"])
# Top-level await works in a notebook; in a script, use asyncio.run(execute_query())
await execute_query()
The response contains the row key and text of the stored chunk that is nearest to the search phrase, which in this example is a generated text description of Bigtable.