This guide shows you how to define evaluation metrics for your generative models. Before you can evaluate your generative models or applications, you need to identify your evaluation goals and define your metrics.
Generative AI models are used to build applications for a wide range of tasks, such as summarizing news articles, responding to customer inquiries, or assisting with code writing. With the Gen AI evaluation service in Vertex AI, you can evaluate any model using explainable metrics.
For example, if you are developing an application to summarize articles, you need to define the criteria and metrics to evaluate its performance:
- Criteria: The dimensions you want to evaluate, such as conciseness, relevance, correctness, or appropriate choice of words.
- Metrics: A score that measures the model output against the defined criteria.
The Gen AI evaluation service provides two main types of metrics:
- Model-based metrics: These metrics use a powerful "judge" model to assess your model's performance. For most use cases, the judge model is Gemini, but you can also use models like MetricX or COMET for translation tasks. You can measure model-based metrics in two ways:
  - Pointwise: The judge model assesses your model's output based on your evaluation criteria.
  - Pairwise: The judge model compares the responses of two models and selects the better one. This method is often used to compare a candidate model to a baseline model and is only supported with Gemini as the judge model.
- Computation-based metrics: These metrics use mathematical formulas to compare a model's output to a ground truth or reference answer. Common examples include ROUGE and BLEU.
You can use computation-based metrics alone or in combination with model-based metrics. Use the following table to decide which metric type is right for your use case:
Metric Type | Description | Pros | Cons | Use Case |
---|---|---|---|---|
Model-based metrics | Uses a powerful "judge" model (like Gemini) to assess performance based on descriptive, human-like criteria (e.g., fluency, safety). | Captures nuanced, subjective qualities; doesn't require ground-truth data; provides explanations for its judgments. | Slower and more costly than computed scores; results depend on the judge model and metric prompt. | Evaluating subjective qualities like creativity, style, or safety. When ground truth data is unavailable or difficult to create. |
Computation-based metrics | Uses mathematical algorithms (e.g., ROUGE, BLEU) to compute scores by comparing model output to a reference "ground truth" answer. | Fast, inexpensive, deterministic, and reproducible. | Requires ground-truth data; measures surface overlap rather than semantic quality. | Tasks with clear right/wrong answers, like summarization (ROUGE), translation (BLEU), or tool use validation. |
Define your model-based metrics
Model-based evaluation uses a machine learning model as a "judge" to evaluate the outputs of your model. Google's proprietary judge models, like Gemini, are calibrated with human raters for quality and are available out-of-the-box.
Model-based evaluation follows these steps:
- Data preparation: You provide evaluation data as input prompts. Your models receive these prompts and generate responses.
- Evaluation: The evaluation metrics and generated responses are sent to the judge model, which evaluates each response individually.
- Aggregation and explanation: Gen AI evaluation service aggregates these individual assessments into an overall score. The output also includes chain-of-thought explanations for each judgment.
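For example, the evaluation data in the first step can be as simple as a table of prompts, optionally paired with responses you already generated. The following is a minimal sketch, assuming the common `prompt`/`response` column convention; check the SDK reference for the columns your metrics expect.

```python
# A sketch of evaluation data for step 1. Column names are the common
# "prompt"/"response" convention, not a guaranteed schema.
import pandas as pd

eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize this article in two sentences: ...",
            "Write a friendly reply to this customer complaint: ...",
        ],
        "response": [
            "The article argues that ...",
            "Thanks for reaching out! ...",
        ],
    }
)
```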
Gen AI evaluation service offers the following options to set up your model-based metrics with the Vertex AI SDK:
Option | Description | Best for |
---|---|---|
Use an existing example | Use a prebuilt metric prompt template to get started. | Common use cases, time-saving |
Define metrics with our templated interface | Get guided assistance to define your metrics. The templated interface provides structure and suggestions. | Customization with support |
Define metrics from scratch | Gain complete control over your metric definitions. | Ideal for highly specific use cases. Requires more technical expertise and time investment. |
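If you start from an existing example, a prebuilt metric prompt template can be attached to a pointwise metric in a couple of lines. The following is a sketch only; it assumes the SDK's `MetricPromptTemplateExamples` helper and a "fluency" template name, so check the SDK reference for the templates available in your version.

```python
# A sketch of the "use an existing example" option. MetricPromptTemplateExamples
# and the "fluency" template name are assumptions; verify them against the SDK
# reference for your version.
from vertexai.evaluation import MetricPromptTemplateExamples, PointwiseMetric

fluency_metric = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template("fluency"),
)
```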
For example, to evaluate a generative AI application for fluent and entertaining responses, you can define two criteria using the templated interface:
- Fluency: Sentences flow smoothly, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.
- Entertainment: Short, amusing text that incorporates emoji, exclamations, and questions to convey quick and spontaneous communication and diversion.
To turn these criteria into a single metric called `custom_text_quality` with a score from -1 to 1, you can define it as follows:
```python
# Requires the Vertex AI SDK for Python (google-cloud-aiplatform). In older
# SDK versions these classes may live in vertexai.preview.evaluation.
from vertexai.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Define a pointwise metric with two criteria: Fluency and Entertaining.
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria.",
            "-1": "The response falls short on both criteria.",
        },
    ),
)
```
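Once defined, you can run the metric against your evaluation data. The following is a minimal sketch, assuming the SDK's `EvalTask` class and a dataset that already contains generated responses; the project, location, and experiment name are placeholders.

```python
# A minimal sketch of running the custom metric ("bring your own response").
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

eval_dataset = pd.DataFrame(
    {
        "prompt": ["Tell me a joke about a cat."],
        "response": [
            "Why did the cat sit on the computer? To keep an eye on the mouse! 😹"
        ],
    }
)

eval_result = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_text_quality],
    experiment="custom-text-quality-eval",  # hypothetical experiment name
).evaluate()

# Per-row scores and judge explanations are in the metrics table.
print(eval_result.metrics_table)
```

Because the dataset already contains a `response` column, the judge model scores those responses directly; the SDK also supports generating responses as part of the evaluation (see the SDK reference).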
For a complete list of metric prompt templates, see Metric prompt templates for evaluation.
Evaluate translation models
The Gen AI evaluation service offers the following metrics for evaluating translation tasks: MetricX, COMET, and BLEU.
MetricX and COMET are pointwise model-based metrics that have been trained for translation tasks. You can evaluate the quality and accuracy of translation model results for your content, whether they are outputs of NMT, TranslationLLM, or Gemini models.
You can also use Gemini as a judge model to evaluate your model for fluency, coherence, verbosity and text quality in combination with MetricX, COMET or BLEU.
MetricX is an error-based metric developed by Google that predicts a floating point score between 0 and 25 representing the quality of a translation. MetricX is available both as a reference-based and reference-free (QE) method. When you use this metric, a lower score is better because it means there are fewer errors.
COMET employs a reference-based regression approach that provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
BLEU (Bilingual Evaluation Understudy) is a computation-based metric. The BLEU score indicates how similar the candidate text is to the reference text. A BLEU score value that is closer to one indicates that a translation is closer to the reference text.
Note that BLEU scores are not recommended for comparing across different corpora and languages. For example, an English to German BLEU score of 50 is not comparable to a Japanese to English BLEU score of 50. Many translation experts have shifted to model-based metric approaches, which have higher correlation with human ratings and are more granular in identifying error scenarios.
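As a rough sketch, a translation evaluation dataset is typically a table of source texts, candidate translations, and reference translations. The column names below are assumptions, not the service's required schema; see Evaluate a translation model for the columns that MetricX, COMET, and BLEU expect in your SDK version.

```python
# Illustrative only: column names are assumptions.
import pandas as pd

translation_dataset = pd.DataFrame(
    {
        "source": ["Guten Morgen, wie geht es dir?"],       # original text
        "response": ["Good morning, how are you?"],          # candidate translation
        "reference": ["Good morning, how are you doing?"],   # human reference
    }
)
```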
To learn how to run evaluations for translation models, see Evaluate a translation model.
Choose between pointwise or pairwise evaluation
Use the following table to decide when to use pointwise or pairwise evaluation:
 | Definition | When to use | Example use cases |
---|---|---|---|
Pointwise evaluation | Evaluate one model and generate scores based on the criteria. | You need an absolute quality score for a single model or application, or there is no baseline to compare against. | Scoring the fluency and safety of a summarization application; tracking quality across releases. |
Pairwise evaluation | Compare two models against each other, generating a preference based on the criteria. | You need to decide which of two models (for example, a candidate and a baseline) or two prompt versions performs better. | Choosing between two Gemini versions; A/B testing a prompt change against the current production prompt. |
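As an illustration of the pairwise setup, the following sketch assumes the SDK's `PairwiseMetric` and `PairwiseMetricPromptTemplate` classes; the criteria, rubric labels, and baseline model name are illustrative, so check the SDK reference for the exact values your version expects.

```python
# A sketch of a pairwise metric that compares a candidate model's response
# against a baseline model's response. Class names, rubric keys, and the
# baseline model ID are assumptions; verify them against the SDK reference.
from vertexai.evaluation import PairwiseMetric, PairwiseMetricPromptTemplate
from vertexai.generative_models import GenerativeModel

pairwise_text_quality = PairwiseMetric(
    metric="pairwise_text_quality",
    metric_prompt_template=PairwiseMetricPromptTemplate(
        criteria={
            "fluency": "Sentences flow smoothly and are easy to read.",
        },
        rating_rubric={
            "A": "Response A is better than Response B.",
            "SAME": "Both responses are of comparable quality.",
            "B": "Response B is better than Response A.",
        },
    ),
    baseline_model=GenerativeModel("gemini-2.0-flash"),  # placeholder baseline
)
```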
Computation-based metrics
Computation-based metrics compare LLM-generated results against a ground-truth dataset of input and output pairs. These metrics fall into the following categories:
- Lexicon-based metrics: Use mathematical formulas to calculate string similarities between LLM-generated results and ground truth, such as Exact Match and ROUGE.
- Count-based metrics: Aggregate the number of rows that match or miss certain ground-truth labels, such as F1 score, Accuracy, and Tool Name Match.
- Embedding-based metrics: Calculate the distance between the LLM-generated results and ground truth in the embedding space to reflect their similarity.
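To make the first two categories concrete, the following standalone sketch (not the service's implementation) computes a lexicon-based exact match and a count-based accuracy by hand:

```python
# Conceptual illustration only; the Gen AI evaluation service computes these
# metrics for you.
def exact_match(response: str, reference: str) -> int:
    """Lexicon-based: 1 if the two strings are identical, else 0."""
    return int(response.strip() == reference.strip())


def accuracy(predicted: list[str], reference: list[str]) -> float:
    """Count-based: fraction of rows whose predicted label matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


print(exact_match("Paris", "Paris"))                            # 1
print(accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"]))   # 0.666...
```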
General text generation
The following metrics help you evaluate a model's ability to generate useful, safe, and effective responses.
Metric | Description | Measures | Use Case |
---|---|---|---|
`exact_match` | Checks if the model response is identical to the reference. | Strict correctness. | When the output must be a precise string (e.g., code generation, specific answers). |
`bleu` | (BiLingual Evaluation Understudy) Compares n-gram overlap between response and reference. | Translation quality, fluency. | Evaluating machine translation tasks. |
`ROUGE` | (Recall-Oriented Understudy for Gisting Evaluation) Compares n-gram overlap, focusing on recall. | Summarization quality, content overlap. | Evaluating text summarization and content similarity. |
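A minimal sketch of running these computation-based metrics together, assuming the SDK accepts these string metric names (the ROUGE variant name may differ in your version) and a dataset with `response` and `reference` columns:

```python
# Sketch only: metric name strings are assumptions; check the SDK reference.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

eval_dataset = pd.DataFrame(
    {
        "response": ["The quick brown fox jumps over the lazy dog."],
        "reference": ["A quick brown fox jumped over the lazy dog."],
    }
)

eval_result = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_l_sum"],
).evaluate()

print(eval_result.summary_metrics)
```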
exact_match

The `exact_match` metric computes whether a model response matches a reference exactly.

- Token limit: None

Evaluation criteria

Not applicable.

Metric input parameters

Input parameter | Description |
---|---|
`response` | The LLM response. |
`reference` | The golden LLM response for reference. |

Output scores

Value | Description |
---|---|
0 | Not matched |
1 | Matched |

bleu

The `bleu` (BiLingual Evaluation Understudy) metric evaluates the quality of a machine-translated response by comparing it to a reference translation.

- Token limit: None

Evaluation criteria

Not applicable.

Metric input parameters

Input parameter | Description |
---|---|
`response` | The LLM response. |
`reference` | The golden LLM response for the reference. |

Output scores

Value | Description |
---|---|
A float in the range of [0,1] | Higher scores indicate better translations. A score of `1` represents a perfect match to the `reference`. |

ROUGE

The `ROUGE` metric compares the `response` parameter against a `reference` parameter. All `rouge` metrics return the F1 score. `rouge-l-sum` is calculated by default, but you can specify the `rouge` variant you want to use.

- Token limit: None

Evaluation criteria

Not applicable.

Metric input parameters

Input parameter | Description |
---|---|
`response` | The LLM response. |
`reference` | The golden LLM response for the reference. |

Output scores

Value | Description |
---|---|
A float in the range of [0,1] | A score closer to `0` means poor similarity between `response` and `reference`. A score closer to `1` means strong similarity between `response` and `reference`. |
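To build intuition for ROUGE scores outside the evaluation service, you can compute rouge-l-sum locally with the open-source rouge-score package (`pip install rouge-score`). This is illustrative only; the service's configuration may differ slightly.

```python
# Local, illustrative ROUGE-L-sum computation with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)
scores = scorer.score(
    target="The cat sat on the mat.",             # reference
    prediction="A cat was sitting on the mat.",   # model response
)
print(scores["rougeLsum"].fmeasure)  # F1 score in [0, 1]
```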
Tool use and function calling
The following metrics help you evaluate the model's ability to predict a valid tool (function) call.
Metric | Description | Measures | Use Case |
---|---|---|---|
`tool_call_valid` | Checks if the output is a syntactically valid tool call. | Formatting correctness. | Ensuring the model generates valid function calls that can be parsed. |
`tool_name_match` | Checks if the predicted tool name matches the reference. | Correct tool selection. | Verifying the model chose the right function to call. |
`tool_parameter_key_match` | Checks if the predicted parameter names (keys) match the reference. | Correct argument structure. | Verifying the model is trying to pass the correct set of parameters. |
`tool_parameter_kv_match` | Checks if both parameter names (keys) and their values match the reference. | Full correctness of arguments. | The strictest check to ensure the model called the function with the exact correct arguments. |
tool_call_valid

The `tool_call_valid` metric describes the model's ability to predict a valid tool call. Only the first tool call is inspected.

- Token limit: None

Evaluation criteria

Evaluation criterion | Description |
---|---|
Validity | The model's output contains a valid tool call. |
Formatting | A JSON dictionary contains the `name` and `arguments` fields. |

Metric input parameters

Input parameter | Description |
---|---|
`prediction` | The candidate model output, which is a JSON serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
`reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |

Output scores

Value | Description |
---|---|
0 | Invalid tool call |
1 | Valid tool call |

tool_name_match

The `tool_name_match` metric describes the model's ability to predict a tool call with the correct tool name. Only the first tool call is inspected.

- Token limit: None

Evaluation criteria

Evaluation criterion | Description |
---|---|
Name matching | The model-predicted tool call matches the reference tool call's name. |

Metric input parameters

Input parameter | Description |
---|---|
`prediction` | The candidate model output, which is a JSON serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
`reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |

Output scores

Value | Description |
---|---|
0 | Tool call name doesn't match the reference. |
1 | Tool call name matches the reference. |

tool_parameter_key_match

The `tool_parameter_key_match` metric describes the model's ability to predict a tool call with the correct parameter names.

- Token limit: None

Evaluation criteria

Evaluation criterion | Description |
---|---|
Parameter matching ratio | The ratio between the number of predicted parameters that match the parameter names of the reference tool call and the total number of parameters. |

Metric input parameters

Input parameter | Description |
---|---|
`prediction` | The candidate model output, which is a JSON serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
`reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |

Output scores

Value | Description |
---|---|
A float in the range of [0,1] | A higher score means that more of the predicted parameters match the names of the `reference` parameters. A score of `1` means all parameter names match. |

tool_parameter_kv_match

The `tool_parameter_kv_match` metric describes the model's ability to predict a tool call with the correct parameter names and values.

- Token limit: None

Evaluation criteria

Evaluation criterion | Description |
---|---|
Parameter matching ratio | The ratio between the number of predicted parameters that match both the parameter names and values of the reference tool call and the total number of parameters. |

Metric input parameters

Input parameter | Description |
---|---|
`prediction` | The candidate model output, which is a JSON serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON serialized string of a list of tool calls. Here is an example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
`reference` | The ground-truth reference prediction, which follows the same format as `prediction`. |

Output scores

Value | Description |
---|---|
A float in the range of [0,1] | A higher score means that more of the predicted parameters match both the names and values of the `reference` parameters. A score of `1` means all parameter names and values match. |
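To make the prediction format concrete, the following standalone sketch (not the service's implementation) parses the JSON serialized strings, inspects the first tool call, and computes simplified versions of the name, key, and key-value matches:

```python
# Conceptual illustration of how the tool metrics read predictions.
import json

prediction = '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "2"}}]}'
reference = '{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "3"}}]}'

# Only the first tool call is inspected.
pred_call = json.loads(prediction)["tool_calls"][0]
ref_call = json.loads(reference)["tool_calls"][0]

tool_name_match = int(pred_call["name"] == ref_call["name"])
key_match = len(set(pred_call["arguments"]) & set(ref_call["arguments"])) / len(
    ref_call["arguments"]
)
kv_match = sum(
    pred_call["arguments"].get(key) == value
    for key, value in ref_call["arguments"].items()
) / len(ref_call["arguments"])

print(tool_name_match, key_match, kv_match)  # 1 1.0 0.5
```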
Baseline evaluation quality for generative tasks
When you evaluate the output of generative AI models, the process is inherently subjective. The quality of the evaluation can vary depending on the specific task and evaluation criteria. This subjectivity also applies to human evaluators. For more information about the challenges of achieving consistent evaluation for generative AI models, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena and Learning to summarize from human feedback.
What's next
Find a model-based metrics template.
Try an evaluation example notebook.