The ML.GENERATE_TEXT_EMBEDDING function

This document describes the ML.GENERATE_TEXT_EMBEDDING function, which lets you embed text that's stored in BigQuery tables. A text embedding is a dense vector representation of a piece of text. If two pieces of text are semantically similar, then their respective embeddings are located near each other in the embedding vector space. Text embeddings help you perform the following tasks:

  • Semantic search: search text ranked by semantic similarity.
  • Recommendation: return items with text attributes similar to the given text.
  • Classification: return the class of items whose text attributes are similar to the given text.
  • Clustering: cluster items whose text attributes are similar to the given text.
  • Outlier detection: return items where text attributes are least related to the given text.
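
As an illustration of the semantic search task, you can rank stored embeddings by their distance to the embedding of a query string. The following is a minimal sketch, assuming a hypothetical table `my_dataset.docs_embedded` that already contains content and text_embedding columns, and a hypothetical remote model `my_dataset.embedding_model`:

SELECT
  docs.content,
  -- Smaller cosine distance means greater semantic similarity.
  ML.DISTANCE(docs.text_embedding, query.text_embedding, 'COSINE') AS distance
FROM
  `my_dataset.docs_embedded` AS docs,
  ML.GENERATE_TEXT_EMBEDDING(
    MODEL `my_dataset.embedding_model`,
    (SELECT 'How do I reset my password?' AS content),
    STRUCT(TRUE AS flatten_json_output)
  ) AS query
ORDER BY distance
LIMIT 10;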

The function works by sending requests to a BigQuery ML remote model that represents one of the Vertex AI textembedding-gecko* text embedding foundation models (LLMs), and then returning the LLM's response.

Syntax

ML.GENERATE_TEXT_EMBEDDING(
  MODEL `project_id.dataset.model_name`,
  { TABLE table_name | (query_statement) },
  STRUCT([flatten_json_output AS flatten_json_output])
)

Arguments

ML.GENERATE_TEXT_EMBEDDING takes the following arguments:

  • project_id: a STRING value that specifies your project ID.

  • dataset: a STRING value that specifies the BigQuery dataset that contains the model.

  • model_name: a STRING value that specifies the name of a remote model that uses one of the textembedding-gecko* Vertex AI LLMs. For more information about how to create this type of remote model, see ENDPOINT.

  • table_name: a STRING value that specifies the name of the BigQuery table that contains a STRING column to embed. The text in the column that's named content is sent to the model. If your table doesn't have a content column, use a SELECT statement for this argument to provide an alias for an existing table column. An error occurs if no content column exists.

  • query_statement: a STRING value that specifies a query statement whose result contains a STRING column that's named content.

  • flatten_json_output: a BOOL value that determines whether the JSON content returned by the function is parsed into separate columns. The default is TRUE.
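
If your text is stored in a column that isn't named content, use the query_statement form to alias it. The following sketch assumes a hypothetical table `my_dataset.reviews` with a review_text column and a hypothetical remote model `my_dataset.embedding_model`:

SELECT *
FROM
  ML.GENERATE_TEXT_EMBEDDING(
    MODEL `my_dataset.embedding_model`,
    -- Alias the existing column so the function finds a content column.
    (SELECT review_text AS content FROM `my_dataset.reviews`),
    STRUCT(TRUE AS flatten_json_output)
  );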

Output

ML.GENERATE_TEXT_EMBEDDING returns all of the columns from the input table, plus the following columns:

  • ml_embed_text_result: a JSON value that contains the text embedding and related statistics. This column is returned when flatten_json_output is FALSE.
  • text_embedding: an ARRAY<FLOAT64> value that contains a 768-dimensional text embedding. This column is returned when flatten_json_output is TRUE.
  • statistics: a JSON value that contains a token_count field with the number of tokens in the content, and a truncated field that indicates whether the content was truncated. This column is returned when flatten_json_output is TRUE.
  • ml_embed_text_status: a STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
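
The statistics column can be unpacked with standard JSON functions. A sketch of inspecting the flattened output, again assuming a hypothetical remote model `my_dataset.embedding_model`:

SELECT
  content,
  ARRAY_LENGTH(text_embedding) AS embedding_dims,
  JSON_VALUE(statistics, '$.token_count') AS token_count,
  JSON_VALUE(statistics, '$.truncated') AS truncated,
  -- Empty on success; otherwise contains the API error for this row.
  ml_embed_text_status
FROM
  ML.GENERATE_TEXT_EMBEDDING(
    MODEL `my_dataset.embedding_model`,
    (SELECT 'Sample text' AS content),
    STRUCT(TRUE AS flatten_json_output)
  );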

Locations

ML.GENERATE_TEXT_EMBEDDING must run in the same region as the remote model that the function references. You can only create remote models based on textembedding-gecko* models in the following locations:

  • us
  • us-central1

Quotas

See Cloud AI service functions quotas and limits.

Example

This example shows how to generate an embedding of a single piece of sample text:

SELECT *
FROM
  ML.GENERATE_TEXT_EMBEDDING(
    MODEL `bqml_tutorial.embedding_model`,
    (SELECT "Example text to embed" AS content),
    STRUCT(TRUE AS flatten_json_output)
  );

What's next