The ML.GENERATE_EMBEDDING function
This document describes the ML.GENERATE_EMBEDDING
function, which
lets you create embeddings that describe an entity—for example, a piece
of text or an image.
Embeddings are high-dimensional numerical vectors that represent a given entity. Machine learning (ML) models use embeddings to encode semantics about entities to make it easier to reason about and compare them. If two entities are semantically similar, then their respective embeddings are located near each other in the embedding vector space.
Embeddings help you perform the following tasks:
- Semantic search: search entities ranked by semantic similarity.
- Recommendation: return entities with attributes similar to a given entity.
- Classification: return the class of entities whose attributes are similar to the given entity.
- Clustering: cluster entities whose attributes are similar to a given entity.
- Outlier detection: return entities whose attributes are least related to the given entity.
- Matrix factorization (in preview): return entities that represent the underlying weights that a model uses during prediction.
- Principal component analysis (PCA) (in preview): return entities (principal components) that represent the input data in such a way that it is easier to identify patterns, clusters, and outliers.
- Autoencoding (in preview): return the latent space representations of the input data.
Depending on the task, the ML.GENERATE_EMBEDDING
function works in one of the
following ways:
- To generate embeddings from text or visual content,
ML.GENERATE_EMBEDDING
sends the request to a BigQuery ML remote model that represents one of the Vertex AI embedding foundation models (LLMs), and then returns the LLM's response. - For PCA and autoencoding,
ML.GENERATE_EMBEDDING
processes the request using a BigQuery ML PCA or autoencoder model and theML.PREDICT
function.ML.GENERATE_EMBEDDING
gathers theML.PREDICT
output for the model into an array and outputs it as theml_generate_embedding_result
column. Having all of the embeddings in a single column lets you directly use theVECTOR_SEARCH
function on theML.GENERATE_EMBEDDING
output. - For matrix factorization,
ML.GENERATE_EMBEDDING
processes the request using a BigQuery ML matrix factorization model and theML.WEIGHTS
function.ML.GENERATE_EMBEDDING
gathers thefactor_weights.weight
andintercept
values from theML.WEIGHTS
output for the model into an array and outputs it as theml_generate_embedding_result
column. Having all of the embeddings in a single column lets you directly use theVECTOR_SEARCH
function on theML.GENERATE_EMBEDDING
output.
The ML.GENERATE_EMBEDDING
function works with the Vertex AI LLM to
perform embedding tasks supported by that model. For more information on the
types of tasks these LLMs can perform, see the following documentation:
Typically, you want to use text embedding models for text-only use cases, and use multimodal models for cross-modal search use cases, where embeddings for text and visual content are generated in the same semantic space.
Syntax
ML.GENERATE_EMBEDDING
syntax differs depending on the
BigQuery ML model you choose. If you use a remote model, it also
differs depending on the Vertex AI LLM that your remote models
targets. Choose the option appropriate for your use case.
multimodalembedding
# Syntax for visual content
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) },
STRUCT(
[flatten_json_output AS flatten_json_output])
)
# Syntax for text content
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) },
STRUCT(
[flatten_json_output AS flatten_json_output])
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of a remote model over a Vertex AImultimodalembedding@001
model.You can confirm what LLM is used by the remote model by opening the Google Cloud console and looking at the Remote endpoint field in the model details page.
table_name
: one of the following:- If you are creating embeddings for visual content, the name of a BigQuery object table that contains the visual content to embed.
- If you are creating text embeddings, the name of a
BigQuery table that contains a
STRING
column to embed. The text in the column that's namedcontent
is sent to the model. If your table doesn't have acontent
column, use aSELECT
statement for this argument to provide an alias for an existing table column. An error occurs if nocontent
column exists.
query_statement
: If you are creating text embeddings, a query whose result contains aSTRING
column that's namedcontent
. For information about the supported SQL syntax of thequery_statement
clause, see GoogleSQL query syntax. To create embeddings for visual content from an object table, you can also specifyquery_statement
. The subquery of an object table can only supportWHERE
,ORDER BY
, andLIMIT
clause.flatten_json_output
: aBOOL
value that determines whether theJSON
content returned by the function is parsed into separate columns. The default isTRUE
.
Details
The model and input table must be in the same region.
textembedding-gecko
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) },
STRUCT(
[flatten_json_output AS flatten_json_output]
[, task_type AS task_type])
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of a remote model over a Vertex AI Vertex AI PaLM API embedding LLM.You can confirm what LLM is used by the remote model by opening the Google Cloud console and looking at the Remote endpoint field in the model details page.
table_name
: the name of the BigQuery table that contains aSTRING
column to embed. The text in the column that's namedcontent
is sent to the model. If your table doesn't have acontent
column, use aSELECT
statement for this argument to provide an alias for an existing table column. An error occurs if nocontent
column exists.query_statement
: a query whose result contains aSTRING
column that's namedcontent
. For information about the supported SQL syntax of thequery_statement
clause, see GoogleSQL query syntax.flatten_json_output
: aBOOL
value that determines whether theJSON
content returned by the function is parsed into separate columns. The default isTRUE
.task_type
: aSTRING
literal that specifies the intended downstream application to help the model produce better quality embeddings. Thetask_type
argument accepts the following values:RETRIEVAL_QUERY
: specifies that the given text is a query in a search or retrieval setting.RETRIEVAL_DOCUMENT
: specifies that the given text is a document in a search or retrieval setting.When using this task type, it is helpful to include the document title in the query statement in order to improve embedding quality. You can use the
title
option to specify the name of the column that contains the document title, otherwise the document title must be in a column either namedtitle
or aliased astitle
, for example:SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.embedding_model`, (SELECT abstract as content, header as title, publication_number FROM `mydataset.publications`), STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' as task_type) );
SEMANTIC_SIMILARITY
: specifies that the given text will be used for Semantic Textual Similarity (STS).CLASSIFICATION
: specifies that the embeddings will be used for classification.CLUSTERING
: specifies that the embeddings will be used for clustering.
Details
The model and input table must be in the same region.
textembedding-gecko-multilingual
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) },
STRUCT(
[flatten_json_output AS flatten_json_output]
[, task_type AS task_type])
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of a remote model over a Vertex AI Vertex AI PaLM API embedding LLM.You can confirm what LLM is used by the remote model by opening the Google Cloud console and looking at the Remote endpoint field in the model details page.
table_name
: the name of the BigQuery table that contains aSTRING
column to embed. The text in the column that's namedcontent
is sent to the model. If your table doesn't have acontent
column, use aSELECT
statement for this argument to provide an alias for an existing table column. An error occurs if nocontent
column exists.query_statement
: a query whose result contains aSTRING
column that's namedcontent
. For information about the supported SQL syntax of thequery_statement
clause, see GoogleSQL query syntax.flatten_json_output
: aBOOL
value that determines whether theJSON
content returned by the function is parsed into separate columns. The default isTRUE
.task_type
: aSTRING
literal that specifies the intended downstream application to help the model produce better quality embeddings. Thetask_type
argument accepts the following values:RETRIEVAL_QUERY
: specifies that the given text is a query in a search or retrieval setting.RETRIEVAL_DOCUMENT
: specifies that the given text is a document in a search or retrieval setting.When using this task type, it is helpful to include the document title in the query statement in order to improve embedding quality. You can use the
title
option to specify the name of the column that contains the document title, otherwise the document title must be in a column either namedtitle
or aliased astitle
, for example:SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.embedding_model`, (SELECT abstract as content, header as title, publication_number FROM `mydataset.publications`), STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' as task_type) );
SEMANTIC_SIMILARITY
: specifies that the given text will be used for Semantic Textual Similarity (STS).CLASSIFICATION
: specifies that the embeddings will be used for classification.CLUSTERING
: specifies that the embeddings will be used for clustering.
Details
The model and input table must be in the same region.
PCA
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) }
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of a PCA model.You can confirm the type of model by opening the Google Cloud console and looking at the Model type field in the model details page.
table_name
: the name of the BigQuery table that contains the input data for the PCA model.query_statement
: a query whose result contains the input data for the PCA model.
Details
The model and input table must be in the same region.
Autoencoder
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
{ TABLE table_name | (query_statement) },
STRUCT([trial_id AS trial_id])
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of an autoencoder model.You can confirm the type of model by opening the Google Cloud console and looking at the Model type field in the model details page.
table_name
: the name of the BigQuery table that contains the input data for the autoencoder model.query_statement
: a query whose result contains the input data for the autoencoder model.trial_id
: anINT64
value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
Details
The model and input table must be in the same region.
Matrix factorization
ML.GENERATE_EMBEDDING(
MODEL project_id.dataset.model_name
,
STRUCT([trial_id AS trial_id])
)
Arguments
ML.GENERATE_EMBEDDING
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the model.model_name
: the name of a matrix factorization model.You can confirm the type of model by opening the Google Cloud console and looking at the Model type field in the model details page.
trial_id
: anINT64
value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
Output
multimodalembedding
ML.GENERATE_EMBEDDING
returns the input table and the following columns:
ml_generate_embedding_result
:- If
flatten_json_output
isFALSE
, this is the JSON response from theprojects.locations.endpoints.predict
call to the model. The generated embeddings are in thetextEmbedding
orimageEmbedding
element, depending on the type of input data you used. - If
flatten_json_output
isTRUE
, this is anARRAY<FLOAT64>
value that contains the generated embeddings.
- If
ml_generate_embedding_status
: aSTRING
value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
textembedding-gecko
ML.GENERATE_EMBEDDING
returns the input table and the following columns:
ml_generate_embedding_result
:- If
flatten_json_output
isFALSE
, this is the JSON response from theprojects.locations.endpoints.predict
call to the model. The generated embeddings are in thevalues
element. - If
flatten_json_output
isTRUE
, this is anARRAY<FLOAT64>
value that contains the generated embeddings.
- If
ml_generate_embedding_statistics
: aJSON
value that contains atoken_count
field with the number of tokens in the content, and atruncated
field that indicates whether the content was truncated. This column is returned whenflatten_json_output
isTRUE
.ml_generate_embedding_status
: aSTRING
value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
textembedding-gecko-multilingual
ML.GENERATE_EMBEDDING
returns the input table and the following columns:
ml_generate_embedding_result
:- If
flatten_json_output
isFALSE
, this is the JSON response from theprojects.locations.endpoints.predict
call to the model. The generated embeddings are in thevalues
element. - If
flatten_json_output
isTRUE
, this is anARRAY<FLOAT64>
value that contains the generated embeddings.
- If
ml_generate_embedding_statistics
: aJSON
value that contains atoken_count
field with the number of tokens in the content, and atruncated
field that indicates whether the content was truncated. This column is returned whenflatten_json_output
isTRUE
.ml_generate_embedding_status
: aSTRING
value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
PCA
ML.GENERATE_EMBEDDING
returns the input table and the following column:
ml_generate_embedding_result
: this is anARRAY<FLOAT>
value that contains the principal components for the input data. The number of array dimensions is equal to the PCA model'sNUM_PRINCIPAL_COMPONENTS
option value if that option is used when the model is created. If thePCA_EXPLAINED_VARIANCE_RATIO
option is used instead, the array dimensions vary depending on the input table and the option ratio determined by BigQuery ML.
Autoencoder
ML.GENERATE_EMBEDDING
returns the input table and the following column:
trial_id
: anINT64
value that identifies the hyperparameter tuning trial used by the function. This column is only returned if you ran hyperparameter tuning when creating the model.ml_generate_embedding_result
: this is anARRAY<FLOAT>
value that contains the latent space dimensions for the input data. The number of array dimensions is equal to the number in the middle of the autoencoder model'sHIDDEN_UNITS
option array value.
Matrix factorization
ML.GENERATE_EMBEDDING
returns the following columns:
trial_id
: anINT64
value that identifies the hyperparameter tuning trial used by the function. This column is only returned if you ran hyperparameter tuning when creating the model.ml_generate_embedding_result
: this is anARRAY<FLOAT>
value that contains the weights of the feature, and also the intercept or bias term for the feature. The intercept value is the last value in the array. The number of array dimensions is equal to the matrix factorization model'sNUM_FACTORS
option value.processed_input
: aSTRING
value that contains the name of the user or item column. The value of this column matches the name of the user or item column provided in thequery_statement
clause that was used when the matrix factorization model was trained.feature
: aSTRING
value that contains the names of the specific users or items used during training.
Supported visual content
You can use the ML.GENERATE_EMBEDDING
function to generate embeddings for
images that meet the requirements described in
API limits.
Known issues
Sometimes after a query job that uses this function finishes successfully, some returned rows contain the following error message:
A retryable error occurred: RESOURCE EXHAUSTED error from <remote endpoint>
This issue occurs because BigQuery query jobs finish successfully
even if the function fails for some of the rows. The function fails when the
volume of API calls to the remote endpoint exceeds the quota limits for that
service. This issue occurs most often when you are running multiple parallel
batch queries. BigQuery retries these calls, but if the retries
fail, the resource exhausted
error message is returned.
Examples
multimodalembedding
This example shows how to generate embeddings from visual content by using a
remote model that references a multimodalembedding
LLM.
Create the remote model:
CREATE OR REPLACE MODEL `mydataset.multimodalembedding` REMOTE WITH CONNECTION `us.test_connection` OPTIONS(ENDPOINT = 'multimodalembedding@001')
Generate embeddings from visual content in an object table:
SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.multimodalembedding`, TABLE `mydataset.my_object_table`);
textembedding-gecko
This example shows how to generate an embedding of a single piece of sample text
by using a remote model that references a textembedding-gecko
LLM.
Create the remote model:
CREATE OR REPLACE MODEL `mydataset.textembedding` REMOTE WITH CONNECTION `us.test_connection` OPTIONS(ENDPOINT = 'textembedding-gecko@003')
Generate the embedding:
SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.textembedding`, (SELECT "Example text to embed" AS content), STRUCT(TRUE AS flatten_json_output) );
textembedding-gecko-multilingual
This example shows how to generate embeddings from a table and specify
a task type by using a remote model that references a
textembedding-gecko-multilingual
LLM.
Create the remote model:
CREATE OR REPLACE MODEL `mydataset.textembedding_multi` REMOTE WITH CONNECTION `us.test_connection` OPTIONS(ENDPOINT = 'textembedding-gecko-multilingual')
Generate the embeddings:
SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.textembedding_multi`, TABLE `mydataset.customer_feedback`, STRUCT(TRUE AS flatten_json_output, 'SEMANTIC_SIMILARITY' as task_type) );
PCA
This example shows how to generate embeddings that represent the principal components of a PCA model.
Create the PCA model:
CREATE OR REPLACE MODEL `mydataset.pca_nyc_trees` OPTIONS ( MODEL_TYPE = 'PCA', PCA_EXPLAINED_VARIANCE_RATIO = 0.9) AS ( SELECT tree_id, block_id, tree_dbh, stump_diam, curb_loc, status, health, spc_latin FROM `bigquery-public-data.new_york_trees.tree_census_2015` );
Generate embeddings that represent principal components:
SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.pca_nyc_trees`, ( SELECT tree_id, block_id, tree_dbh, stump_diam, curb_loc, status, health, spc_latin FROM `bigquery-public-data.new_york_trees.tree_census_2015` ));
Autoencoder
This example shows how to generate embeddings that represent the latent space dimensions of an autoencoder model.
Create the autoencoder model:
CREATE OR REPLACE MODEL `mydataset.my_autoencoder_model` OPTIONS ( model_type = 'autoencoder', activation_fn = 'relu', batch_size = 8, dropout = 0.2, hidden_units = [ 32, 16, 4, 16, 32], learn_rate = 0.001, l1_reg_activation = 0.0001, max_iterations = 10, optimizer = 'adam') AS SELECT * EXCEPT ( Time, Class) FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`;
Generate embeddings that represent latent space dimensions:
SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mydataset.my_autoencoder_model`, TABLE `bigquery-public-data.ml_datasets.ulb_fraud_detection`);
Matrix factorization
This example shows how to generate embeddings that represent the underlying weights that the matrix factorization model uses during prediction.
Create the matrix factorization model:
CREATE OR REPLACE MODEL `mydataset.my_mf_model` OPTIONS ( model_type='matrix_factorization', user_col='user_id', item_col='item_id', l2_reg=9.83, num_factors=34) AS SELECT user_id, item_id, AVG(rating) as rating FROM movielens.movielens_1m GROUP BY user_id, item_id;
Generate embeddings that represent model weights and intercepts:
SELECT * FROM ML.GENERATE_EMBEDDING(MODEL `mydataset.my_mf_model`)
Locations
The ML.GENERATE_EMBEDDING
function must run in the same
region or multi-region as the model that the
function references.
Quotas
Quotas apply when you use the ML.GENERATE_EMBEDDING
function with remote
models. For more information, see Vertex AI and Cloud AI service
functions quotas and limits.
What's next
- Learn more about choosing a text embedding model.
- Try creating embeddings by using an LLM.
- For more information about using Vertex AI models to generate text and embeddings, see Generative AI overview.
- For more information about using Cloud AI APIs to perform AI tasks, see AI application overview.