The ML.GENERATE_TEXT_EMBEDDING function
This document describes the ML.GENERATE_TEXT_EMBEDDING function, which lets you
embed text that's stored in BigQuery tables.
A text embedding is a dense vector representation of a piece of text. If two
pieces of text are semantically similar, then their respective embeddings are
located near each other in the embedding vector space. Text embeddings help you
perform the following tasks:
- Semantic search: search text ranked by semantic similarity.
- Recommendation: return items with text attributes similar to the given text.
- Classification: return the class of items whose text attributes are similar to the given text.
- Clustering: cluster items whose text attributes are similar to the given text.
- Outlier detection: return items where text attributes are least related to the given text.
The function works by sending requests to a BigQuery ML remote model that
represents one of the Vertex AI textembedding-gecko* text embedding foundation
models (LLMs), and then returning the LLM's response.
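As an illustration of the semantic search task listed above, the following sketch embeds a set of documents and a search phrase, then ranks the documents by cosine distance to the phrase. The table `mydataset.articles`, its `id` and `body` columns, and the model `mydataset.embedding_model` are hypothetical placeholders. The query assumes that the BigQuery ML ML.DISTANCE function is available in your project and that rows that failed to embed come back with an empty or NULL embedding.

```sql
-- Hypothetical names: `mydataset.articles` (columns id, body) and
-- `mydataset.embedding_model` are placeholders for illustration only.
WITH doc_embeddings AS (
  SELECT id, content, text_embedding
  FROM ML.GENERATE_TEXT_EMBEDDING(
    MODEL `mydataset.embedding_model`,
    -- Alias the text column to `content`, which the function requires.
    (SELECT id, body AS content FROM `mydataset.articles`),
    STRUCT(TRUE AS flatten_json_output)
  )
  -- Assumption: rows that could not be embedded have no embedding values.
  WHERE ARRAY_LENGTH(text_embedding) > 0
),
query_embedding AS (
  SELECT text_embedding
  FROM ML.GENERATE_TEXT_EMBEDDING(
    MODEL `mydataset.embedding_model`,
    (SELECT "how do I reset my password" AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
  WHERE ARRAY_LENGTH(text_embedding) > 0
)
SELECT
  d.id,
  d.content,
  -- Smaller cosine distance means the texts are semantically closer.
  ML.DISTANCE(d.text_embedding, q.text_embedding, 'COSINE') AS cosine_distance
FROM doc_embeddings AS d
CROSS JOIN query_embedding AS q
ORDER BY cosine_distance
LIMIT 5;
```

Re-embedding every document on each search is wasteful for anything beyond a small table. In practice you would likely store the document embeddings in a table once and embed only the search phrase at query time.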
Syntax
ML.GENERATE_TEXT_EMBEDDING(
  MODEL `project_id.dataset.model_name`,
  { TABLE table_name | (query_statement) },
  STRUCT([flatten_json_output AS flatten_json_output])
)
Arguments
ML.GENERATE_TEXT_EMBEDDING takes the following arguments:
- project_id: a STRING value that specifies your project ID.
- dataset: a STRING value that specifies the BigQuery dataset that contains the model.
- model_name: a STRING value that specifies the name of a remote model that uses one of the textembedding-gecko* Vertex AI LLMs. For more information about how to create this type of remote model, see ENDPOINT.
- table_name: a STRING value that specifies the name of the BigQuery table that contains a STRING column to embed. The text in the column that's named content is sent to the model. If your table doesn't have a content column, use a SELECT statement for this argument to provide an alias for an existing table column. An error occurs if no content column exists.
- query_statement: a STRING value that specifies a query statement whose result contains a STRING column that's named content.
- flatten_json_output: a BOOL value that determines whether the JSON content returned by the function is parsed into separate columns. The default is TRUE.
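When the source table has no column named content, the query_statement form can supply one through an alias. The following is a minimal sketch. The table `mydataset.product_reviews`, its `review_id` and `review_text` columns, and the model `mydataset.embedding_model` are hypothetical names chosen for illustration.

```sql
-- Hypothetical table and model names; replace them with your own.
SELECT *
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `mydataset.embedding_model`,
  (
    -- `review_text` is renamed to `content` so the function can find it.
    -- `review_id` is passed through so each embedding can be traced to its row.
    SELECT review_id, review_text AS content
    FROM `mydataset.product_reviews`
  ),
  STRUCT(TRUE AS flatten_json_output)
);
```

Selecting a key column alongside content is optional, but it makes it easier to join the embeddings back to the original rows later.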
Output
ML.GENERATE_TEXT_EMBEDDING returns the input table and the following columns:
- ml_embed_text_result: a JSON value that contains the text embedding and related statistics. This column is returned when flatten_json_output is FALSE.
- text_embedding: an ARRAY<FLOAT> value that contains a 768-dimensional text embedding. This column is returned when flatten_json_output is TRUE.
- statistics: a JSON value that contains a token_count field with the number of tokens in the content, and a truncated field that indicates whether the content was truncated. This column is returned when flatten_json_output is TRUE.
- ml_embed_text_status: a STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
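To see how these columns fit together, the following sketch reads the flattened output for a single piece of text. It reports the embedding length, pulls the token_count and truncated fields out of the statistics column, and shows the per-row status. The model name is the one used in the example later on this page. Using JSON_VALUE to read the fields as strings is one possible approach, not the only one.

```sql
SELECT
  content,
  -- Expected to be 768 for a successfully embedded row.
  ARRAY_LENGTH(text_embedding) AS embedding_dimensions,
  -- JSON_VALUE returns each scalar field as a STRING.
  JSON_VALUE(statistics, '$.token_count') AS token_count,
  JSON_VALUE(statistics, '$.truncated') AS truncated,
  -- Empty when the request for this row succeeded.
  ml_embed_text_status
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `bqml_tutorial.embedding_model`,
  (SELECT "Example text to embed" AS content),
  STRUCT(TRUE AS flatten_json_output)
);
```

Checking ml_embed_text_status and the truncated field before using the embeddings downstream helps catch rows that failed or were only partially embedded.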
Locations
ML.GENERATE_TEXT_EMBEDDING must run in the same region as the remote model that
the function references. You can only create remote models based on
textembedding-gecko* models in the following locations:
- us
- us-central1
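Because the function must run where the remote model lives, the location is effectively chosen when you create the model. The following is a rough sketch of creating a remote model in the us multi-region. The connection name `us.my_vertex_connection` is a hypothetical placeholder, and the exact ENDPOINT value is an assumption. Check the CREATE MODEL documentation for the endpoint names and versions that are currently supported.

```sql
-- Assumptions: the dataset `bqml_tutorial` is located in `us`, and a Cloud
-- resource connection named `my_vertex_connection` (hypothetical) already
-- exists in `us` with permission to call Vertex AI.
CREATE OR REPLACE MODEL `bqml_tutorial.embedding_model`
  REMOTE WITH CONNECTION `us.my_vertex_connection`
  -- Assumed endpoint value for the textembedding-gecko* model family.
  OPTIONS (ENDPOINT = 'textembedding-gecko');
```

Queries that call ML.GENERATE_TEXT_EMBEDDING with this model would then need to run in the same location as the model.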
Quotas
See Cloud AI service functions quotas and limits.
Example
This example shows how to generate an embedding of a single piece of sample text:
SELECT *
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL `bqml_tutorial.embedding_model`,
  (SELECT "Example text to embed" AS content),
  STRUCT(TRUE AS flatten_json_output)
);
What's next
- Try a tutorial on embedding text using an LLM.
- Learn more about AI functions you can use to analyze BigQuery data.