Gen AI evaluation service overview

Vertex AI provides model evaluation metrics for both predictive AI and generative AI models. This page provides an overview of the Gen AI evaluation service, which you can use to evaluate generative models and applications against your own criteria. To evaluate a predictive AI model, see Model evaluation in Vertex AI.

The following diagram summarizes the overall workflow for evaluating a generative model:

You can use the Gen AI evaluation service in Vertex AI to evaluate any generative model or application against your own criteria. While public leaderboards offer general insights, the evaluation service helps you understand how a model performs on your specific tasks and data.

Evaluation is a critical step throughout the generative AI development lifecycle, including model selection, prompt engineering, and model customization. The service is integrated within Vertex AI to help you launch and reuse evaluations as needed.

Gen AI evaluation service capabilities

The Gen AI evaluation service can help you with the following tasks:

  • Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.
  • Generation settings: Adjust model parameters, like temperature, to optimize output for your needs.
  • Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.
  • Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.
  • RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.
  • Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.
  • Translation (preview): Assess the quality of your model's translations.
  • Evaluate agents: Evaluate the performance of your agents using the Gen AI evaluation service.
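
Several of these tasks (model selection, generation settings, prompt engineering) come down to scoring a set of candidates on your own data and keeping the best one. The following is a minimal local sketch of that loop, not the evaluation service's API. The dataset, the two templates, `stub_model`, and `exact_match` are all hypothetical stand-ins chosen so the example runs offline.

```python
from statistics import mean

# Hypothetical evaluation dataset: each row pairs an input with a reference answer.
DATASET = [
    {"question": "2 + 2", "reference": "4"},
    {"question": "3 * 3", "reference": "9"},
    {"question": "10 - 7", "reference": "3"},
]

# Candidate prompt templates to compare (illustrative only).
TEMPLATES = {
    "bare": "{question}",
    "instructed": "Answer with a number only: {question}",
}

_ANSWERS = {row["question"]: row["reference"] for row in DATASET}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    question = prompt.split(": ")[-1]
    answer = _ANSWERS[question]
    # The stub is verbose unless the prompt asks for a bare number.
    if prompt.startswith("Answer with a number only"):
        return answer
    return f"The answer is {answer}."


def exact_match(response: str, reference: str) -> float:
    """Return 1.0 when the response equals the reference, else 0.0."""
    return float(response.strip() == reference.strip())


def score_template(template: str) -> float:
    """Mean exact-match score of one template over the whole dataset."""
    return mean(
        exact_match(stub_model(template.format(question=row["question"])), row["reference"])
        for row in DATASET
    )


scores = {name: score_template(template) for name, template in TEMPLATES.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The same loop applies to other candidates, such as different models or temperature values: swap what varies between runs and keep the dataset and metric fixed so the scores stay comparable.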

Evaluation process

To evaluate a generative AI model or application using the Gen AI evaluation service, follow these steps:

  1. Define evaluation metrics:
    • Tailor model-based metrics to your business criteria.
    • Evaluate a single model (pointwise) or determine the winner when comparing two models (pairwise).
    • Include computation-based metrics for additional insights.
  2. Prepare your evaluation dataset:
    • Provide a dataset that reflects your specific use case.
  3. Run an evaluation:
    • Start from scratch, use a template, or adapt existing examples.
    • Define candidate models and create an EvalTask (a reusable object in the Vertex AI evaluation service that encapsulates your evaluation logic, including models, metrics, and dataset) to reuse your evaluation logic through Vertex AI.
  4. View and interpret your evaluation results.
  5. (Optional) Evaluate and improve the quality of the judge model.
  6. (Optional) Evaluate generative AI agents.
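
To make the difference between pointwise and pairwise evaluation concrete, the following sketch runs both locally with a simple computation-based metric. It is an illustration under stated assumptions, not the Vertex AI SDK: `SimpleEvalTask`, `exact_match`, `model_a`, and `model_b` are hypothetical names, and a string-comparison metric stands in for the judge model that the service's model-based metrics rely on.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

Model = Callable[[str], str]  # prompt in, response out


def exact_match(response: str, reference: str) -> float:
    """Computation-based metric: 1.0 when the normalized strings are equal."""
    return float(response.strip().lower() == reference.strip().lower())


@dataclass
class SimpleEvalTask:
    """Hypothetical local stand-in for a reusable evaluation definition."""

    dataset: list[dict[str, str]]
    metric: Callable[[str, str], float]

    def pointwise(self, model: Model) -> float:
        """Score one model on its own: mean metric value over the dataset."""
        return mean(
            self.metric(model(row["prompt"]), row["reference"]) for row in self.dataset
        )

    def pairwise(self, candidate: Model, baseline: Model) -> float:
        """Compare two models: fraction of rows where the candidate scores higher."""
        wins = 0
        for row in self.dataset:
            candidate_score = self.metric(candidate(row["prompt"]), row["reference"])
            baseline_score = self.metric(baseline(row["prompt"]), row["reference"])
            wins += candidate_score > baseline_score
        return wins / len(self.dataset)


dataset = [
    {"prompt": "Capital of France?", "reference": "Paris"},
    {"prompt": "Capital of Japan?", "reference": "Tokyo"},
    {"prompt": "Capital of Kenya?", "reference": "Nairobi"},
    {"prompt": "Capital of Peru?", "reference": "Lima"},
]
_known = {row["prompt"]: row["reference"] for row in dataset}


def model_a(prompt: str) -> str:
    """Stub model that always answers correctly."""
    return _known[prompt]


def model_b(prompt: str) -> str:
    """Stub model that gives the same answer to every prompt."""
    return "paris"


task = SimpleEvalTask(dataset=dataset, metric=exact_match)
print(task.pointwise(model_a), task.pointwise(model_b), task.pairwise(model_a, model_b))
```

Because the dataset and metric live in one object, the same task can be rerun against any new candidate, which is the kind of reuse an EvalTask is meant to provide.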

Notebooks for evaluation use cases

The Vertex AI SDK for Python notebooks demonstrate various generative AI evaluation use cases, grouped into the following categories:

  • Evaluate models
  • Evaluate prompt templates
  • Evaluate Gen AI applications
  • Evaluate Gen AI agents
  • Metric customization
  • Other topics

Supported models

The Vertex AI Gen AI evaluation service supports Google's foundation models, third-party models, and open models. You can provide pre-generated predictions directly, or automatically generate candidate model responses. The following table helps you choose the right integration method for your model.

| Model source | Description | Use case |
| Google's foundation models | Directly use Google's models like Gemini 2.0 Flash for response generation. | When you want to leverage Google's latest models without managing infrastructure. |
| Vertex AI Model Registry | Use any model (custom-trained, imported) that is deployed as an endpoint in the Vertex AI Model Registry. | For evaluating your fine-tuned models or other models managed within Vertex AI. |
| Third-party and open models (via SDK) | Integrate with external model APIs using their respective SDKs. | When your model is hosted outside of Google Cloud and provides an SDK for access. |
| Wrapped model endpoints | Create a wrapper around external model endpoints using the Vertex AI SDK. | For models that are accessible via an API endpoint but don't have a direct SDK integration. |
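
The two ways of supplying responses (generating them through a wrapper, or providing pre-generated predictions) can be sketched as follows. This is a minimal illustration that assumes a wrapper shaped as a function from a prompt string to a response string; check the SDK reference for the exact interface it expects. The JSON field names `prompt` and `text`, and the helpers `wrap_endpoint` and `fake_send`, are hypothetical, and `fake_send` replaces the network call so the example runs without credentials.

```python
import json
from typing import Callable

# A transport takes a JSON request body and returns a JSON response body.
Transport = Callable[[str], str]


def wrap_endpoint(send: Transport) -> Callable[[str], str]:
    """Adapt a JSON-style endpoint to a plain prompt -> response function."""

    def generate(prompt: str) -> str:
        body = json.dumps({"prompt": prompt})  # hypothetical request schema
        reply = json.loads(send(body))
        return reply["text"]  # hypothetical response schema

    return generate


def fake_send(body: str) -> str:
    """Offline stand-in for the real HTTP call to an external endpoint."""
    prompt = json.loads(body)["prompt"]
    return json.dumps({"text": prompt.upper()})


model_fn = wrap_endpoint(fake_send)

# Option 1: generate candidate responses through the wrapper.
prompts = ["hello", "good morning"]
generated = [{"prompt": p, "response": model_fn(p)} for p in prompts]

# Option 2: supply pre-generated predictions, so no model call is needed.
pregenerated = [
    {"prompt": "hello", "response": "HELLO"},
    {"prompt": "good morning", "response": "GOOD MORNING"},
]

print(generated == pregenerated)
```

Either way, the evaluation sees the same kind of rows: a prompt paired with the response to be scored.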

Supported languages

This section describes the language support for model-based and translation metrics.

For model-based metrics

For Gemini model-based metrics, the Gen AI evaluation service supports all input languages that are supported by Gemini 2.0 Flash. However, the quality of evaluations for non-English inputs might not be as high as the quality for English inputs.

For translation metrics

For translation tasks, you can use the following model-based metrics, which support the languages listed in this section:

| Metric | Description |
| MetricX | A family of model-based metrics for evaluating text generation tasks, including translation, by comparing model output to a reference. |
| COMET | A neural framework for training multilingual machine translation evaluation models which has been shown to have high correlation with human judgments of translation quality. |

  • MetricX

    Supported languages for MetricX:

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Azerbaijani
    • Basque
    • Belarusian
    • Bengali
    • Bulgarian
    • Burmese
    • Catalan
    • Cebuano
    • Chichewa
    • Chinese
    • Corsican
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Filipino
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Haitian Creole
    • Hausa
    • Hawaiian
    • Hebrew
    • Hindi
    • Hmong
    • Hungarian
    • Icelandic
    • Igbo
    • Indonesian
    • Irish
    • Italian
    • Japanese
    • Javanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kurdish
    • Kyrgyz
    • Lao
    • Latin
    • Latvian
    • Lithuanian
    • Luxembourgish
    • Macedonian
    • Malagasy
    • Malay
    • Malayalam
    • Maltese
    • Maori
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian
    • Pashto
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Samoan
    • Scottish Gaelic
    • Serbian
    • Shona
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Somali
    • Sotho
    • Spanish
    • Sundanese
    • Swahili
    • Swedish
    • Tajik
    • Tamil
    • Telugu
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Uzbek
    • Vietnamese
    • Welsh
    • West Frisian
    • Xhosa
    • Yiddish
    • Yoruba
    • Zulu
  • COMET

    Supported languages for COMET:

    • Afrikaans
    • Albanian
    • Amharic
    • Arabic
    • Armenian
    • Assamese
    • Azerbaijani
    • Basque
    • Belarusian
    • Bengali
    • Bengali Romanized
    • Bosnian
    • Breton
    • Bulgarian
    • Burmese
    • Catalan
    • Chinese (Simplified)
    • Chinese (Traditional)
    • Croatian
    • Czech
    • Danish
    • Dutch
    • English
    • Esperanto
    • Estonian
    • Filipino
    • Finnish
    • French
    • Galician
    • Georgian
    • German
    • Greek
    • Gujarati
    • Hausa
    • Hebrew
    • Hindi
    • Hindi Romanized
    • Hungarian
    • Icelandic
    • Indonesian
    • Irish
    • Italian
    • Japanese
    • Javanese
    • Kannada
    • Kazakh
    • Khmer
    • Korean
    • Kurdish (Kurmanji)
    • Kyrgyz
    • Lao
    • Latin
    • Latvian
    • Lithuanian
    • Macedonian
    • Malagasy
    • Malay
    • Malayalam
    • Marathi
    • Mongolian
    • Nepali
    • Norwegian
    • Oriya
    • Oromo
    • Pashto
    • Persian
    • Polish
    • Portuguese
    • Punjabi
    • Romanian
    • Russian
    • Sanskrit
    • Scottish Gaelic
    • Serbian
    • Sindhi
    • Sinhala
    • Slovak
    • Slovenian
    • Somali
    • Spanish
    • Sundanese
    • Swahili
    • Swedish
    • Tamil
    • Tamil Romanized
    • Telugu
    • Telugu Romanized
    • Thai
    • Turkish
    • Ukrainian
    • Urdu
    • Urdu Romanized
    • Uyghur
    • Uzbek
    • Vietnamese
    • Welsh
    • Western Frisian
    • Xhosa
    • Yiddish
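
Because the two lists differ, it can help to check which translation metric covers a given language pair before you run an evaluation. The following sketch shows one way to do that locally. The sets are short excerpts of the lists above, not the full lists, and `metrics_supporting` is a hypothetical helper rather than part of any SDK.

```python
# Excerpts only; extend these sets from the full lists in this section.
METRICX_LANGUAGES = {"English", "French", "Hawaiian", "Japanese", "Zulu"}
COMET_LANGUAGES = {"English", "French", "Assamese", "Japanese", "Uyghur"}


def metrics_supporting(*languages: str) -> list[str]:
    """Return the translation metrics whose list covers every given language."""
    supported = []
    if all(language in METRICX_LANGUAGES for language in languages):
        supported.append("MetricX")
    if all(language in COMET_LANGUAGES for language in languages):
        supported.append("COMET")
    return supported


for pair in [("English", "French"), ("English", "Hawaiian"), ("English", "Assamese")]:
    print(pair, metrics_supporting(*pair))
```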

What's next