The Gen AI evaluation service in Vertex AI lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment, using your own evaluation criteria.
While leaderboards and reports offer insights into overall model performance, they don't reveal how a model handles your specific needs. The Gen AI evaluation service helps you define your own evaluation criteria, ensuring a clear understanding of how well generative AI models and applications align with your unique use case.
Evaluation is important at every step of your Gen AI development process, including model selection, prompt engineering, and model customization. Gen AI evaluation is integrated into Vertex AI to help you launch and reuse evaluations as needed.
Gen AI evaluation service capabilities
The Gen AI evaluation service can help you with the following tasks:
Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data (see the pairwise comparison sketch after this list).
Generation settings: Tweak model parameters (like temperature) to optimize output for your needs.
Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.
Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.
RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.
Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.
Translation (preview): Assess the quality of your model's translations.
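For the model selection and migration use cases, a pairwise comparison determines which of two models performs better on your data. The following is a minimal sketch, assuming the GA `vertexai.evaluation` module of the Vertex AI SDK for Python; the project ID, model versions, metric choice, experiment name, and dataset contents are illustrative placeholders, not prescribed values.

```python
# Minimal pairwise-comparison sketch (model selection / migration), assuming the
# GA `vertexai.evaluation` module; model names and dataset contents are illustrative.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PairwiseMetric
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

baseline_model = GenerativeModel("gemini-1.0-pro")   # model you use today
candidate_model = GenerativeModel("gemini-1.5-pro")  # model you may migrate to

# A small dataset that reflects your use case.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the following article: ...",
            "Summarize this email thread: ...",
        ]
    }
)

# A pairwise metric scores the candidate response against the baseline response.
pairwise_quality = PairwiseMetric(
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_summarization_quality"
    ),
    baseline_model=baseline_model,
)

result = EvalTask(
    dataset=eval_dataset,
    metrics=[pairwise_quality],
    experiment="model-migration-eval",
).evaluate(model=candidate_model)

print(result.summary_metrics)  # includes candidate-vs-baseline win rates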
Evaluation process
The Gen AI evaluation service lets you evaluate any Gen AI model or application against your evaluation criteria by following these steps:

1. Define your evaluation metrics:
   - Learn how to tailor model-based metrics to your business criteria.
   - Evaluate a single model (pointwise) or determine the winner when comparing two models (pairwise).
   - Include computation-based metrics for additional insights.
2. Prepare your evaluation dataset:
   - Provide a dataset that reflects your specific use case.
3. Run an evaluation:
   - Start from scratch, use a template, or adapt existing examples.
   - Define candidate models and create an EvalTask to reuse your evaluation logic through Vertex AI, as shown in the sketch after this list.
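The following is a minimal sketch of these three steps with the Vertex AI SDK for Python. It assumes the GA `vertexai.evaluation` module; the project ID, experiment name, metric choices, and dataset contents are illustrative placeholders.

```python
# Minimal sketch of the three steps above (metrics, dataset, EvalTask),
# assuming the GA `vertexai.evaluation` module of the Vertex AI SDK for Python.
# Project ID, experiment name, metrics, and dataset contents are illustrative.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

# Step 2: an evaluation dataset that reflects your use case.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the following support ticket: ...",
            "Summarize the following release notes: ...",
        ],
        "reference": ["Expected summary 1.", "Expected summary 2."],
    }
)

# Step 1: a model-based metric plus a computation-based metric.
metrics = [
    MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
    "rouge_l_sum",
]

# Step 3: an EvalTask bundles the dataset and metrics so the evaluation logic
# can be reused across models and tracked as a Vertex AI experiment.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment="my-eval-experiment",
)

# Pointwise run: generate candidate responses with a Gemini model and score them.
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))
print(result.summary_metrics)
```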
Notebooks for evaluation use cases
The following table lists Vertex AI SDK for Python notebooks for various generative AI evaluation use cases:
| Use case | Description | Links to notebooks |
| --- | --- | --- |
| Evaluate models | Quickstart: Introduction to Gen AI evaluation service SDK. | Getting Started with Gen AI evaluation service SDK |
| | Evaluate and select first-party (1P) foundation models for your task. | Evaluate and select first-party (1P) foundation models for your task |
| | Evaluate and select Gen AI model settings: Adjust temperature, output token limit, safety settings, and other generation configurations of Gemini models on a summarization task, and compare the evaluation results from different model settings on several metrics. | Compare different model parameter settings for Gemini |
| | Evaluate third-party (3P) models on Vertex AI Model Garden. This notebook provides a comprehensive guide to evaluating both Google's Gemini models and 3P language models using the Gen AI evaluation service SDK. Learn how to assess and compare models from different sources, including open and closed models, model endpoints, and 3P client libraries, using various evaluation metrics and techniques. Gain practical experience in conducting controlled experiments and analyzing model performance across a range of tasks. | Use Gen AI evaluation service SDK to Evaluate Models in Vertex AI Studio, Model Garden, and Model Registry |
| | Migrate from PaLM to a Gemini model with the Gen AI evaluation service SDK. This notebook guides you through evaluating PaLM and Gemini foundation models using multiple evaluation metrics to support decisions around migrating from one model to another. Visualizing these metrics gives insight into the strengths and weaknesses of each model, helping you make an informed decision about which one best aligns with the requirements of your use case. | Compare and migrate from PaLM to Gemini model |
| | Evaluate translation models. This notebook shows you how to use the Vertex AI SDK for the Gen AI evaluation service to measure the translation quality of your large language model (LLM) responses using BLEU, MetricX, and COMET. | Evaluate a translation model |
| Evaluate prompt templates | Prompt engineering and prompt evaluation with the Gen AI evaluation service SDK. | Evaluate and Optimize Prompt Template Design for Better Results |
| Evaluate Gen AI applications | Evaluate Gemini model tool use and function calling capabilities. | Evaluate Gemini Model Tool Use |
| | Evaluate generated answers from Retrieval-Augmented Generation (RAG) for a question-answering task with the Gen AI evaluation service SDK. | Evaluate Generated Answers from Retrieval-Augmented Generation (RAG) |
| | Evaluate LangChain chatbots with the Vertex AI Gen AI evaluation service. This notebook demonstrates how to evaluate a LangChain conversational chatbot using the Vertex AI Gen AI evaluation service SDK. It covers data preparation, LangChain chain setup, creating custom evaluation metrics, and analyzing results. The tutorial uses a recipe-suggestion chatbot as an example and shows how to improve its performance by iterating on the prompt design. | Evaluate LangChain |
| Metric customization | Customize model-based metrics and evaluate a generative AI model according to your specific criteria (see the sketch after this table). | Customize Model-based Metrics to evaluate a Gen AI model |
| | Evaluate generative AI models with your locally defined custom metric, and bring your own judge model to perform model-based metric evaluation. | Bring-Your-Own-Autorater using Custom Metric |
| | Define your own computation-based custom metric functions, and use them for evaluation with the Gen AI evaluation service SDK. | Bring your own computation-based Custom Metric |
| Other topics | Gen AI evaluation service SDK Preview-to-GA Migration Guide. This tutorial guides you through migrating from the Preview version to the latest GA version of the Vertex AI SDK for Python for the Gen AI evaluation service. The guide also shows how to use the GA version of the SDK to evaluate Retrieval-Augmented Generation (RAG) and compare two models using pairwise evaluation. | Gen AI evaluation service SDK Preview-to-GA Migration Guide |
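As a companion to the metric-customization rows above, here is a minimal sketch of a templated model-based metric, assuming the GA `vertexai.evaluation` module of the Vertex AI SDK for Python; the metric name, criteria text, and rating rubric are illustrative, not the notebook's exact definitions.

```python
# Minimal sketch of a customized model-based metric, assuming the GA
# `vertexai.evaluation` module; metric name, criteria, and rubric are illustrative.
from vertexai.evaluation import EvalTask, PointwiseMetric, PointwiseMetricPromptTemplate

custom_helpfulness = PointwiseMetric(
    metric="custom_helpfulness",  # hypothetical metric name
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "helpfulness": (
                "The response directly addresses the user's request and "
                "contains no irrelevant or unsafe content."
            ),
        },
        rating_rubric={
            "1": "The response is helpful.",
            "0": "The response is partially helpful.",
            "-1": "The response is not helpful.",
        },
    ),
)

# Use the custom metric like any built-in metric, for example:
# EvalTask(dataset=eval_dataset, metrics=[custom_helpfulness]).evaluate(model=my_model)
```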
Supported models and languages
The Vertex AI Gen AI evaluation service supports Google's foundation models, third-party models, and open models. You can provide pre-generated predictions directly, or automatically generate candidate model responses in the following ways (see the sketch after this list):

- Automatically generate responses for Google's foundation models (such as Gemini 1.5 Pro) and any model deployed in Vertex AI Model Registry.
- Integrate with the text generation SDK APIs of other third-party and open models.
- Wrap model endpoints from other providers using the Vertex AI SDK.
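The following is a minimal sketch of the bring-your-own-predictions flow, assuming the GA `vertexai.evaluation` module and its default "prompt"/"response" column names; the dataset contents are illustrative.

```python
# Minimal sketch: evaluate pre-generated predictions from any model or application,
# assuming the GA `vertexai.evaluation` module; dataset contents are illustrative.
import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

byor_dataset = pd.DataFrame(
    {
        "prompt": ["Translate 'good morning' to French."],
        # Responses generated earlier by a third-party, open, or self-hosted model.
        "response": ["Bonjour."],
    }
)

eval_task = EvalTask(
    dataset=byor_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)

# No model argument: the provided responses are scored directly.
result = eval_task.evaluate()
print(result.summary_metrics)
```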
For Gemini model-based metrics, the Gen AI evaluation service supports all input languages that are supported by Gemini 1.5 Pro. However, the quality of evaluations for non-English inputs may not be as high as the quality for English inputs.
The Gen AI evaluation service supports the following languages for model-based translation metrics:
MetricX
Supported languages for MetricX: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
COMET
Supported languages for COMET: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
What's next
Try the evaluation quickstart.
Learn how to tune a foundation model.