The first step in evaluating your generative models or applications is to identify your evaluation goal and define your evaluation metrics. This page provides an overview of concepts related to defining evaluation metrics for your use case.
Overview
Generative AI models can be used to create applications for a wide range of tasks, such as summarizing news articles, responding to customer inquiries, or assisting with code writing. The Gen AI evaluation service in Vertex AI lets you evaluate any model with explainable metrics.
For example, you might be developing an application to summarize articles. To evaluate your application's performance on that specific task, consider the criteria you would like to measure and the metrics that you would use to score them:
- Criteria: One or more dimensions that you want to evaluate against, such as conciseness, relevance, correctness, or appropriate choice of words.
- Metrics: A single score that measures the model output against the criteria.
The Gen AI evaluation service provides two major types of metrics:
Model-based metrics: These metrics use a judge model to assess your candidate model's output. For most use cases the judge model is Gemini, but you can also use models such as MetricX or COMET for translation use cases.
You can measure model-based metrics pairwise or pointwise:
Pointwise metrics: Let the judge model assess the candidate model's output based on the evaluation criteria. For example, the score could range from 0 to 5, where 0 means the response doesn't fit the criteria and 5 means the response fits the criteria well.
Pairwise metrics: Let the judge model compare the responses of two models and pick the better one. This is often used when comparing a candidate model with the baseline model. Pairwise metrics are only supported with Gemini as a judge model.
Computation-based metrics: These metrics are computed using mathematical formulas to compare the model's output against a ground truth or reference. Commonly used computation-based metrics include ROUGE and BLEU.
You can use computation-based metrics standalone, or together with model-based metrics. Use the following table to decide when to use model-based or computation-based metrics:
Metric type | Evaluation approach | Data | Cost and speed |
---|---|---|---|
Model-based metrics | Use a judge model to assess performance based on descriptive evaluation criteria | Ground truth is optional | Slightly more expensive and slower |
Computation-based metrics | Use mathematical formulas to assess performance | Ground truth is usually required | Low cost and fast |
To get started, see Prepare your dataset and Run evaluation.
Define your model-based metrics
Model-based evaluation involves using a machine learning model as a judge model to evaluate the outputs of the candidate model.
Proprietary Google judge models, such as Gemini, are calibrated with human raters to ensure their quality. They are managed and available out of the box. The process of model-based evaluation varies based on the evaluation metrics you provide.
Model-based evaluation follows this process:
- Data preparation: You provide evaluation data in the form of input prompts (see the dataset sketch after this list). The candidate models receive the prompts and generate corresponding responses.
- Evaluation: The evaluation metrics and the generated responses are sent to the judge model, which evaluates each response individually and provides a row-based assessment.
- Aggregation and explanation: The Gen AI evaluation service aggregates these individual assessments into an overall score. The output also includes chain-of-thought explanations for each judgment, outlining the rationale behind each assessment.
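For example, data preparation for a pointwise run can start from a small pandas DataFrame of prompts. The sketch below is illustrative: the prompt column name follows the Gen AI evaluation service's dataset conventions, and the prompts themselves are placeholders.

```python
import pandas as pd

# A minimal evaluation dataset: one row per input prompt. During evaluation,
# the candidate model generates a response for each prompt, and the judge
# model then scores each generated response.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the following article in two sentences: ...",
            "Summarize the key decisions from the meeting notes: ...",
        ],
    }
)
```

If you already have model responses that were generated elsewhere, you can also include a response column so that only the evaluation and aggregation steps run.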
Gen AI evaluation service offers the following options to set up your model-based metrics with the Vertex AI SDK:
Option | Description | Best for |
---|---|---|
Use an existing example | Use a prebuilt metric prompt template to get started. | Common use cases, time-saving |
Define metrics with our templated interface | Get guided assistance in defining your metrics. Our templated interface provides structure and suggestions. | Customization with support |
Define metrics from scratch | Have complete control over your metric definitions. | Ideal for highly specific use cases. Requires more technical expertise and time investment. |
As an example, you might want to develop a generative AI application that returns fluent and entertaining responses. For this application, you can define two criteria for evaluation using the templated interface:
Fluency: Sentences flow smoothly, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.
Entertainment: Short, amusing text that incorporates emoji, exclamations, and questions to convey quick and spontaneous communication and diversion.
To turn those two criteria into a metric, you want an overall score ranging from -1 to 1, called custom_text_quality. You can define the metric like this:
```python
from vertexai.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Define a pointwise metric with two criteria: Fluency and Entertaining.
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria.",
            "-1": "The response falls short on both criteria.",
        },
    ),
)
```
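As a sketch of how this metric could then be used, the following assumes a DataFrame of prompts (such as eval_dataset from the earlier sketch), the Vertex AI SDK's EvalTask interface, and an illustrative Gemini model name and experiment name; adapt these to your own project.

```python
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

# Assumes vertexai.init(project=..., location=...) has already been called.
candidate_model = GenerativeModel("gemini-1.5-flash")  # illustrative model name

eval_task = EvalTask(
    dataset=eval_dataset,                   # prompts prepared earlier
    metrics=[custom_text_quality],          # the pointwise metric defined above
    experiment="custom-text-quality-eval",  # illustrative experiment name
)
eval_result = eval_task.evaluate(model=candidate_model)

# Aggregated scores for the metric, plus per-row scores and explanations.
print(eval_result.summary_metrics)
print(eval_result.metrics_table.head())
```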
For a complete list of metric prompt templates, see Metric prompt templates for evaluation.
Evaluate translation models
The Gen AI evaluation service offers the following translation task evaluation metrics:
MetricX and COMET are pointwise model-based metrics that have been trained for translation tasks. You can evaluate the quality and accuracy of translation model results for your content, whether they are outputs of NMT, TranslationLLM, or Gemini models.
You can also use Gemini as a judge model to evaluate your model for fluency, coherence, verbosity, and text quality in combination with MetricX, COMET, or BLEU.
MetricX is an error-based metric developed by Google that predicts a floating-point score between 0 and 25 representing the quality of a translation. MetricX is available both as a reference-based and a reference-free (quality estimation, or QE) method. With MetricX, a lower score is better, because it means there are fewer errors.
COMET employs a reference-based regression approach that provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
BLEU (Bilingual Evaluation Understudy) is a computation-based metric. The BLEU score indicates how similar the candidate text is to the reference text. A BLEU score value that is closer to one indicates that a translation is closer to the reference text.
Note that BLEU scores are not recommended for comparing across different corpora and languages. For example, an English to German BLEU score of 50 is not comparable to a Japanese to English BLEU score of 50. Many translation experts have shifted to model-based metric approaches, which have higher correlation with human ratings and are more granular in identifying error scenarios.
To learn how to run evaluations for translation models, see Evaluate a translation model.
Choose between pointwise or pairwise evaluation
Use the following table to decide when you want to use pointwise or pairwise evaluation:
Evaluation approach | Definition | When to use |
---|---|---|
Pointwise evaluation | Evaluate one model and generate scores based on the criteria. | Scoring a single candidate model's responses against your evaluation criteria. |
Pairwise evaluation | Compare the responses of two models against each other and generate a preference based on the criteria. | Comparing a candidate model with a baseline model. |
Computation-based metrics
Computation-based metrics compare whether the LLM-generated results are consistent with a ground-truth dataset of input and output pairs. The commonly used metrics can be categorized into the following groups:
- Lexicon-based metrics: Use math to calculate the string similarities between LLM-generated results and the ground truth, such as Exact Match and ROUGE.
- Count-based metrics: Aggregate the number of rows that hit or miss certain ground-truth labels, such as F1-score, Accuracy, and Tool Name Match.
- Embedding-based metrics: Calculate the distance between the LLM-generated results and the ground truth in the embedding space, reflecting their level of similarity.
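As an illustration of the embedding-based category, the following self-contained sketch computes cosine similarity between a response and a reference in an embedding space. The toy_embed() helper is a deliberately simple stand-in for a real embedding model, which you would call instead in practice.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: a hashed bag-of-words vector.
    In practice, call a text-embedding model to get the vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values closer to 1 mean more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

response_vec = toy_embed("The meeting was moved to Friday at 10 AM.")
reference_vec = toy_embed("The meeting has been rescheduled to Friday, 10 AM.")
print(round(cosine_similarity(response_vec, reference_vec), 3))
```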
General text generation
The following metrics help you evaluate the model's ability to generate responses that are useful, safe, and effective for your users.
Exact match
The exact_match
metric computes whether a model response
matches a reference exactly.
- Token limit: None
Evaluation criteria
Not applicable.
Metric input parameters
Input parameter | Description |
---|---|
response | The LLM response. |
reference | The golden LLM response for reference. |
Output scores
Value | Description |
---|---|
0 | Not matched |
1 | Matched |
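As a sketch of how exact_match (or other computation-based metrics) might be run through the Vertex AI SDK when you bring your own responses, the following assumes response and reference columns in the dataset; the example strings are illustrative.

```python
import pandas as pd
from vertexai.evaluation import EvalTask

# Bring-your-own-response evaluation: responses were generated elsewhere, so
# only metric computation runs. Assumes vertexai.init(...) has been called.
exact_match_dataset = pd.DataFrame(
    {
        "response": ["Paris is the capital of France.", "The answer is 42."],
        "reference": ["Paris is the capital of France.", "The answer is 41."],
    }
)

eval_result = EvalTask(
    dataset=exact_match_dataset,
    metrics=["exact_match"],  # computation-based metrics referenced by name
).evaluate()

print(eval_result.summary_metrics)  # for example, the mean exact_match score
```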
BLEU
The bleu
(BiLingual Evaluation Understudy) metric holds the
result of an algorithm for evaluating the quality of text that has
been translated from one natural language to another. The
quality of the response is measured by the correspondence between the
response
parameter and its reference
parameter.
- Token limit: None
Evaluation criteria
Not applicable.
Metric input parameters
Input parameter | Description |
---|---|
response | The LLM response. |
reference | The golden LLM response for the reference. |
Output scores
Value | Description |
---|---|
A float in the range of [0,1] | Higher scores indicate better translations. A score of 1 represents a perfect match to the reference. |
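To make the mechanics concrete, here is a deliberately simplified, self-contained sketch of a BLEU-style score using unigram and bigram precision with a brevity penalty; production implementations, including the one used by the service, add details such as higher-order n-grams and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(response: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a
    brevity penalty. Real BLEU typically uses up to 4-grams plus smoothing."""
    resp, ref = response.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        resp_counts, ref_counts = Counter(ngrams(resp, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in resp_counts.items())
        precisions.append(overlap / max(sum(resp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(resp), 1)))
    return brevity * geo_mean

print(round(simple_bleu("the cat sat on the mat", "the cat is on the mat"), 3))
```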
ROUGE
The ROUGE
metric is used to compare the provided
response
parameter against a reference
parameter.
All rouge
metrics return the F1 score. rouge-l-sum
is calculated by default,
but you can specify the rouge
variant
that you want to use.
- Token limit: None
Evaluation criteria
Not applicable
Metric input parameters
Input parameter | Description |
---|---|
response | The LLM response. |
reference | The golden LLM response for the reference. |
Output scores
Value | Description |
---|---|
A float in the range of [0,1] | A score closer to 0 means poor similarity between response and reference. A score closer to 1 means strong similarity between response and reference. |
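The F1 framing is easy to see in a toy ROUGE-1 computation, sketched below over unigram overlap; the service's rouge variants such as rouge-l-sum work on longest common subsequences rather than unigrams, but return an F1 score in the same way.

```python
from collections import Counter

def rouge1_f1(response: str, reference: str) -> float:
    """Toy ROUGE-1: F1 over unigram overlap between response and reference."""
    resp_counts = Counter(response.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, ref_counts[t]) for t, c in resp_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat is on the mat"), 3))
```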
Tool use and function calling
The following metrics help you to evaluate the model's ability to predict a valid tool (function) call.
Call valid
The tool_call_valid
metric describes the model's ability to
predict a valid tool call. Only the first tool call is
inspected.
- Token limit: None
Evaluation criteria
Evaluation criterion | Description |
---|---|
Validity | The model's output contains a valid tool call. |
Formatting | A JSON dictionary contains the name and arguments fields. |
Metric input parameters
Input parameter | Description |
---|---|
prediction | The candidate model output, which is a JSON serialized string that contains the content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example: {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]} |
reference | The ground-truth reference prediction, which follows the same format as prediction. |
Output scores
Value | Description |
---|---|
0 | Invalid tool call |
1 | Valid tool call |
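The following sketch shows the shape of the prediction string and a simplified validity check in the spirit of this metric (parse the JSON, look at the first tool call, and confirm that it has name and arguments fields); the service's actual implementation may differ in details.

```python
import json

# Construct a prediction string with the structure shown in the example above.
prediction = json.dumps({
    "content": "",
    "tool_calls": [{
        "name": "book_tickets",
        "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1",
                      "theater": "Regal Edwards 14", "num_tix": "2"},
    }],
})

def first_tool_call_is_valid(prediction_str: str) -> int:
    """Simplified check: the first tool call must be a JSON dictionary with
    "name" and "arguments" fields. Returns 1 for valid, 0 for invalid."""
    try:
        tool_calls = json.loads(prediction_str).get("tool_calls", [])
        first = tool_calls[0]
        return int("name" in first and "arguments" in first)
    except (json.JSONDecodeError, IndexError, TypeError, AttributeError):
        return 0

print(first_tool_call_is_valid(prediction))  # 1
```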
Name match
The tool_name_match
metric describes the model's ability to predict
a tool call with the correct tool name. Only the first tool call is inspected.
- Token limit: None
Evaluation criteria
Evaluation criterion | Description |
---|---|
Name matching | The model-predicted tool call matches the reference tool call's name. |
Metric input parameters
Input parameter | Description |
---|---|
prediction | The candidate model output, which is a JSON serialized string that contains the content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example: {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]} |
reference | The ground-truth reference prediction, which follows the same format as prediction. |
Output scores
Value | Description |
---|---|
0 | Tool call name doesn't match the reference. |
1 | Tool call name matches the reference. |
Parameter key match
The tool_parameter_key_match
metric describes the model's ability to
predict a tool call with the correct parameter names.
- Token limit: None
Evaluation criteria
Evaluation criterion | Description |
---|---|
Parameter matching ratio | The ratio between the number of predicted parameters that match the parameter names of the reference tool call and the total number of parameters. |
Metric input parameters
Input parameter | Description |
---|---|
prediction | The candidate model output, which is a JSON serialized string that contains the content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example: {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]} |
reference | The ground-truth reference model prediction, which follows the same format as prediction. |
Output scores
Value | Description |
---|---|
A float in the range of [0,1] | A higher score means more of the predicted parameters match the reference parameters' names; a score of 1 means that all parameter names match. |
Parameter KV match
The tool_parameter_kv_match
metric describes the model's ability to
predict a tool call with the correct parameter names and key values.
- Token limit: None
Evaluation criteria
Evaluation criterion | Description |
---|---|
Parameter matching ratio | The ratio between the number of the predicted parameters that match both the parameter names and values of the reference tool call and the total number of parameters. |
Metric input parameters
Input parameter | Description |
---|---|
prediction | The candidate model output, which is a JSON serialized string that contains the content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example: {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]} |
reference | The ground-truth reference prediction, which follows the same format as prediction. |
Output scores
Value | Description |
---|---|
A float in the range of [0,1] | A higher score means more of the predicted parameters match the reference parameters' names and values; a score of 1 means that all parameters match. |
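As a simplified illustration of the two ratios above, the following compares a predicted tool call's arguments against a reference tool call's arguments, first by parameter name only and then by name and value. The descriptions above don't spell out which side the denominator counts, so this sketch normalizes by the reference parameters; the actual implementation may handle edge cases differently.

```python
predicted_args = {"movie": "Mission Impossible Dead Reckoning Part 1",
                  "theater": "Regal Edwards 14", "showtime": "7:00"}
reference_args = {"movie": "Mission Impossible Dead Reckoning Part 1",
                  "theater": "Regal Edwards 14", "showtime": "7:30"}

# Key match: fraction of reference parameter names that the prediction also uses.
key_match = sum(k in predicted_args for k in reference_args) / len(reference_args)

# Key-value match: fraction of reference parameters whose name and value both match.
kv_match = sum(predicted_args.get(k) == v
               for k, v in reference_args.items()) / len(reference_args)

print(key_match)  # 1.0  (all three parameter names match)
print(kv_match)   # ~0.67 (the showtime value differs)
```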
In the Gen AI evaluation service, you can use computation-based metrics through the Vertex AI SDK for Python.
Baseline evaluation quality for generative tasks
When evaluating the output of generative AI models, note that the evaluation process is inherently subjective, and the quality of evaluation can vary depending on the specific task and evaluation criteria. This subjectivity also applies to human evaluators. For more information about the challenges of achieving consistent evaluation for generative AI models, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena and Learning to summarize from human feedback.
What's next
Find a model-based metrics template.
Try an evaluation example notebook.