Choose an embeddings task type

This guide shows you how to choose the optimal task type when generating embeddings with Vertex AI to improve their quality for your specific use case. It covers the following topics:

  • Benefits of using task types: Learn how task types improve embedding quality for use cases like Retrieval Augmented Generation (RAG).
  • Supported task types: Get a comparative overview of all available task types to help you choose the right one.
  • Use case deep dive: Explore detailed explanations and examples for each task type, including classification, clustering, retrieval, and semantic similarity.

With Vertex AI embeddings models, you can generate embeddings optimized for various tasks, such as document retrieval, question answering, and fact verification. A task type is a parameter that you specify to optimize the embeddings that the model generates for your intended use case.

Supported models

Task types are supported by the following models:

  • text-embedding-005
  • text-multilingual-embedding-002
  • gemini-embedding-001

The following limitations apply when using these models:

  • Don't use these preview models on mission-critical or production systems.
  • These models are available in us-central1 only.
  • Batch predictions are not supported.
  • Customization is not supported.

Benefits of task types

Using task types can improve the quality of the embeddings that a model generates.

Figure 1. Questions and their answers aren't semantically similar, which means that embeddings won't automatically show their relation.

For example, a common design for Retrieval Augmented Generation (RAG) systems is to use text embeddings and Vector Search to perform a similarity search. In some cases, this approach results in lower search quality because questions and their answers are not semantically similar. A question like "Why is the sky blue?" and its answer "The scattering of sunlight causes the blue color" have distinctly different meanings as statements. This means that a RAG system might not automatically recognize their relation, as shown in Figure 1.

Without task types, a RAG developer would need to either train their own model to learn the relationship between queries and answers, which requires advanced data science skills, or use LLM-based query expansion or Hypothetical Document Embeddings (HyDE), which can add latency and cost.

Figure 2. Task types optimize embeddings for specific tasks. In this case, questions and answers are brought closer together in the embeddings space.

Task types let you generate embeddings that are optimized for a specific task, which saves you the time and cost of developing your own task-specific embeddings. With the appropriate task types, the embeddings for the query "Why is the sky blue?" and its answer "The scattering of sunlight causes the blue color" are placed close together in a shared embedding space that captures the relationship between them, as shown in Figure 2. In this RAG example, the optimized embeddings lead to better similarity search results.

In addition to the query and answer use case, task types also provide an optimized embedding space for tasks such as classification, clustering, and fact verification.

Supported task types

The following table describes the supported task types. The best task type for your project depends on your use case. To explore all task types, see the model reference.

| Task Type Category | Task Type(s) | Description | Common Use Cases |
|---|---|---|---|
| Retrieval (Asymmetric) | For queries: RETRIEVAL_QUERY, QUESTION_ANSWERING, FACT_VERIFICATION, CODE_RETRIEVAL_QUERY. For corpus: RETRIEVAL_DOCUMENT | Generates embeddings for a short query to search against a large corpus of documents. The query and document embeddings are optimized to work together. | Document search, RAG systems, question answering, fact verification, code retrieval. |
| Classification (Symmetric) | CLASSIFICATION | Generates an embedding for a single text input, optimized for classification models. | Sentiment analysis, topic classification, spam detection. |
| Clustering (Symmetric) | CLUSTERING | Generates an embedding for a single text input, optimized for clustering algorithms. | Topic modeling, customer segmentation, identifying duplicate content. |
| Semantic Similarity (Symmetric) | SEMANTIC_SIMILARITY | Generates embeddings for comparing two pieces of text to determine their semantic similarity. Not intended for retrieval. | Calculating similarity scores for recommendations or paraphrasing detection. |

Determine your embeddings use case

Embeddings use cases typically fall into one of four categories: assessing text similarity, classifying texts, clustering texts, or retrieving information from texts. If your use case doesn't align with a documented use case, use the RETRIEVAL_QUERY task type by default.

Task types are either symmetric or asymmetric. Which kind you need depends on whether the texts you compare play the same role (symmetric) or different roles, such as a query and a document (asymmetric).
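The selection rule above can be sketched as a small lookup. This is an illustrative helper, not part of any Vertex AI library: the function name `choose_task_type` and the dictionary keys are assumptions made for this example, while the task type strings and the `RETRIEVAL_QUERY` fallback come from this guide.

```python
# Hypothetical lookup table. The keys are illustrative labels chosen for this
# sketch; only the values are task type names documented in this guide.
USE_CASE_TO_TASK_TYPE = {
    "classification": "CLASSIFICATION",
    "clustering": "CLUSTERING",
    "semantic_similarity": "SEMANTIC_SIMILARITY",
    "retrieval_document": "RETRIEVAL_DOCUMENT",
    "retrieval_query": "RETRIEVAL_QUERY",
    "question_answering": "QUESTION_ANSWERING",
    "fact_verification": "FACT_VERIFICATION",
    "code_retrieval_query": "CODE_RETRIEVAL_QUERY",
}

# The guide recommends this task type when no documented use case matches.
DEFAULT_TASK_TYPE = "RETRIEVAL_QUERY"


def choose_task_type(use_case: str) -> str:
    """Returns the task type for a use case label, or the default."""
    key = use_case.strip().lower().replace(" ", "_")
    return USE_CASE_TO_TASK_TYPE.get(key, DEFAULT_TASK_TYPE)


print(choose_task_type("Semantic similarity"))
print(choose_task_type("something undocumented"))
```

Keeping the mapping in one place makes it harder to accidentally embed a corpus and its queries with mismatched task types.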

Symmetric use cases

In symmetric use cases, the texts being compared are similar in length and content, such as two sentences that you compare for similarity.

Classification

To classify texts according to preset labels, use the CLASSIFICATION task type. This task type generates embeddings that are optimized for classification models. For example, if you generate an embedding for the social media post "I don't like traveling on airplanes," a classification model could use the embedding to classify the sentiment as negative.
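As a rough illustration of how a downstream classifier might consume such embeddings, the following sketch assigns a label by nearest class centroid. The three-dimensional vectors are made-up stand-ins for real `CLASSIFICATION` embeddings, and nearest-centroid is only one simple choice of classifier, not a method this guide prescribes.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(embedding: list[float], labeled: dict[str, list[list[float]]]) -> str:
    """Returns the label whose centroid is most similar to the embedding."""
    centroids = {label: centroid(vecs) for label, vecs in labeled.items()}
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Toy vectors standing in for embeddings of labeled training posts.
labeled_examples = {
    "negative": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "positive": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
# Toy vector standing in for the embedding of a new post.
post_embedding = [0.85, 0.15, 0.05]
predicted_label = classify(post_embedding, labeled_examples)
print(predicted_label)
```

In practice you would replace the toy vectors with embeddings generated using the CLASSIFICATION task type and would likely train a proper classification model on them.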

Clustering

To cluster texts based on their similarities, use the CLUSTERING task type. This task type generates embeddings that are optimized for clustering algorithms. For example, after generating and clustering embeddings for news articles, you can suggest additional sports-related articles to users who read a lot about sports.

Additional use cases for clustering include the following:

  • Customer segmentation: Group customers with similar embeddings generated from their profiles or activities for targeted marketing and personalized experiences.
  • Product segmentation: Cluster product embeddings based on their product title and description, product images, or customer reviews to help businesses do segment analysis on their products.
  • Market research: Cluster consumer survey responses or social media data embeddings to reveal hidden patterns and trends in consumer opinions, preferences, and behaviors.
  • Healthcare: Cluster patient embeddings derived from medical data to help identify groups with similar conditions or treatment responses, leading to more personalized healthcare plans.
  • Customer feedback trends: Cluster customer feedback from various channels (surveys, social media, support tickets) to help identify common pain points, feature requests, and areas for product improvement.
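The clustering step itself can be sketched with a minimal k-means loop. This is an assumption-laden toy: the two-dimensional vectors stand in for real `CLUSTERING` embeddings, the initial centroids are picked by hand, and a production system would more likely rely on an established clustering library.

```python
def squared_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(
    vectors: list[list[float]],
    initial_centroids: list[list[float]],
    iterations: int = 10,
) -> list[int]:
    """Returns the cluster index assigned to each vector."""
    centroids = [list(c) for c in initial_centroids]
    assignments: list[int] = []
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = mean(members)
    return assignments


# Toy vectors standing in for embeddings of four news articles.
article_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
cluster_ids = kmeans(
    article_embeddings,
    initial_centroids=[article_embeddings[0], article_embeddings[2]],
)
print(cluster_ids)
```

Articles that share a cluster index can then be treated as one group, for example to suggest other articles from the same cluster.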

Semantic similarity

To assess text similarity, use the SEMANTIC_SIMILARITY task type. This task type generates embeddings that are optimized for comparing the semantic similarity between two pieces of text. For example, when comparing embeddings for "The cat is sleeping" and "The feline is napping," the similarity score would be high because both texts have nearly the same meaning.

A real-world scenario for assessing input similarity is a recommendation system that identifies items (for example, products, articles, movies) that are semantically similar to a user's preferred items to provide personalized recommendations.
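A recommendation step of this kind can be sketched as ranking catalog items by cosine similarity to a preferred item. The item names and vectors below are invented for illustration and stand in for `SEMANTIC_SIMILARITY` embeddings. The `recommend` helper is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(
    preferred: list[float], catalog: dict[str, list[float]], top_k: int = 2
) -> list[str]:
    """Returns the names of the top_k catalog items most similar to preferred."""
    scored = [(name, cosine_similarity(preferred, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:top_k]]


# Toy vectors standing in for item embeddings.
catalog = {
    "hiking boots": [0.9, 0.1, 0.1],
    "trail backpack": [0.8, 0.3, 0.1],
    "espresso machine": [0.1, 0.1, 0.9],
}
# Toy vector standing in for the embedding of an item the user liked.
preferred_item = [1.0, 0.2, 0.1]
recommendations = recommend(preferred_item, catalog, top_k=2)
print(recommendations)
```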

Asymmetric use cases

In asymmetric use cases, you compare a short query text against a large body of longer documents.

Information retrieval

When you build a search or retrieval system, you work with two types of text:

  • Corpus: The collection of documents that you want to search over.
  • Query: The text that a user provides to search for information within the corpus.

To get the best performance, use different task types to generate embeddings for your corpus and your queries.

First, generate embeddings for your entire collection of documents using the RETRIEVAL_DOCUMENT task type. You typically perform this step once to index your entire corpus and then store the resulting embeddings in a vector database.

Next, when a user submits a search, generate an embedding for their query text in real time using a task type that matches the user's intent. Your system then uses this query embedding to find the most similar document embeddings in your vector database.
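The two-step flow above can be sketched as follows. To keep the example runnable without credentials, the embedding call is injected as a function. `toy_embed` is a keyword-counting stand-in that ignores the task type, and the in-memory list stands in for a vector database. `build_index` and `search` are hypothetical helper names. In a real system, the injected function would call an embeddings model with the given task type.

```python
import math
from typing import Callable

# Assumed signature for this sketch: (text, task_type) -> embedding vector.
EmbedFn = Callable[[str, str], list[float]]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def build_index(documents: list[str], embed: EmbedFn) -> list[tuple[str, list[float]]]:
    """Step 1: embed the corpus once with RETRIEVAL_DOCUMENT."""
    return [(doc, embed(doc, "RETRIEVAL_DOCUMENT")) for doc in documents]


def search(
    query: str,
    index: list[tuple[str, list[float]]],
    embed: EmbedFn,
    query_task_type: str = "RETRIEVAL_QUERY",
    top_k: int = 1,
) -> list[str]:
    """Step 2: embed the query at request time and rank documents."""
    query_vector = embed(query, query_task_type)
    ranked = sorted(
        index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True
    )
    return [doc for doc, _ in ranked[:top_k]]


def toy_embed(text: str, task_type: str) -> list[float]:
    """Stand-in embedder that counts a few keywords and ignores task_type."""
    words = [w.strip("?.,!") for w in text.lower().split()]
    return [float(words.count(term)) for term in ("sky", "blue", "ocean", "salt")]


corpus = [
    "The sky looks blue because sunlight scatters.",
    "The ocean contains dissolved salt.",
]
index = build_index(corpus, toy_embed)
results = search("Why is the sky blue?", index, toy_embed, "QUESTION_ANSWERING")
print(results)
```

Passing the task type through `embed` keeps the corpus side fixed at RETRIEVAL_DOCUMENT while letting each query pick the task type that matches its intent.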

The following task types are used for queries:

  • RETRIEVAL_QUERY: Use this for a standard search query where you want to find relevant documents.
  • QUESTION_ANSWERING: Use this when queries are expected to be proper questions, such as "Why is the sky blue?".
  • FACT_VERIFICATION: Use this when you want to retrieve a document from your corpus that proves or disproves a statement.
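As a sketch of how the query task type might be passed to the model, the following helper builds a JSON request body for one of the three query task types listed above. The body shape (`instances`, `task_type`, `content`) mirrors the REST example in this guide. The helper name `build_query_request` and its validation logic are assumptions for illustration.

```python
import json

# The query-side task types listed in this section.
QUERY_TASK_TYPES = {"RETRIEVAL_QUERY", "QUESTION_ANSWERING", "FACT_VERIFICATION"}


def build_query_request(text: str, task_type: str = "RETRIEVAL_QUERY") -> str:
    """Returns a JSON request body that embeds one query with the given task type."""
    if task_type not in QUERY_TASK_TYPES:
        raise ValueError(f"{task_type!r} is not a query task type: {sorted(QUERY_TASK_TYPES)}")
    body = {"instances": [{"task_type": task_type, "content": text}]}
    return json.dumps(body)


request_body = build_query_request("Why is the sky blue?", "QUESTION_ANSWERING")
print(request_body)
```

Rejecting non-query task types early helps catch the mistake of embedding a user query with a corpus-side or symmetric task type.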

Code retrieval

text-embedding-005 supports the CODE_RETRIEVAL_QUERY task type, which you can use to retrieve relevant code blocks using plain text queries. To use this feature, embed code blocks using the RETRIEVAL_DOCUMENT task type, and embed text queries using CODE_RETRIEVAL_QUERY.

Here is an example:

REST

PROJECT_ID=PROJECT_ID  # Replace with your project ID.

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/text-embedding-005:predict" \
  -d $'{
  "instances": [
    {
      "task_type": "CODE_RETRIEVAL_QUERY",
      "content": "Function to add two numbers"
    }
  ]
}'

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

MODEL_NAME = "gemini-embedding-001"
DIMENSIONALITY = 3072


def embed_text(
    texts: list[str] = ["Retrieve a function that adds two numbers"],
    task: str = "CODE_RETRIEVAL_QUERY",
    model_name: str = "gemini-embedding-001",
    dimensionality: int | None = 3072,
) -> list[list[float]]:
    """Embeds texts with a pre-trained, foundational model."""
    model = TextEmbeddingModel.from_pretrained(model_name)
    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}

    embeddings = []
    # gemini-embedding-001 takes one input at a time
    for text in texts:
        text_input = TextEmbeddingInput(text, task)
        embedding = model.get_embeddings([text_input], **kwargs)
        print(embedding)
        # Example response:
        # [[0.006135190837085247, -0.01462465338408947, 0.004978656303137541, ...]]
        embeddings.append(embedding[0].values)

    return embeddings


if __name__ == "__main__":
    # Embed the plain-text query with the CODE_RETRIEVAL_QUERY task type.
    texts = ["Retrieve a function that adds two numbers"]
    task = "CODE_RETRIEVAL_QUERY"
    code_query_embeddings = embed_text(
        texts=texts, task=task, model_name=MODEL_NAME, dimensionality=DIMENSIONALITY
    )

    # Embed the code blocks (the corpus) with the RETRIEVAL_DOCUMENT task type.
    texts = [
        "def func(a, b): return a + b",
        "def func(a, b): return a - b",
        "def func(a, b): return (a ** 2 + b ** 2) ** 0.5",
    ]
    task = "RETRIEVAL_DOCUMENT"
    code_block_embeddings = embed_text(
        texts=texts, task=task, model_name=MODEL_NAME, dimensionality=DIMENSIONALITY
    )

What's next