Define your evaluation metrics

This guide shows you how to define evaluation metrics for your generative models. Before you can evaluate your generative models or applications, you need to identify your evaluation goals and define your metrics.

Generative AI models can be used to build applications for a wide range of tasks, such as summarizing news articles, responding to customer inquiries, or assisting with writing code. The Gen AI evaluation service in Vertex AI lets you evaluate any model or application against explainable metrics.

For example, if you are developing an application to summarize articles, you need to define the criteria and metrics to evaluate its performance:

  • Criteria: The dimensions you want to evaluate, such as conciseness, relevance, correctness, or appropriate choice of words.
  • Metrics: A score that measures the model output against the defined criteria.

The Gen AI evaluation service provides two main types of metrics:

  • Model-based metrics: These metrics use a powerful "judge" model to assess your model's performance. For most use cases, the judge model is Gemini, but you can also use models like MetricX or COMET for translation tasks. You can measure model-based metrics in two ways:

    • Pointwise: The judge model assesses your model's output based on your evaluation criteria.
    • Pairwise: The judge model compares the responses of two models and selects the better one. This method is often used to compare a candidate model to a baseline model and is only supported with Gemini as the judge model.
  • Computation-based metrics: These metrics use mathematical formulas to compare a model's output to a ground truth or reference answer. Common examples include ROUGE and BLEU.

You can use computation-based metrics alone or in combination with model-based metrics. Use the following comparison to decide which metric type is right for your use case:

  • Model-based metrics
    • Description: Uses a powerful "judge" model (such as Gemini) to assess performance against descriptive, human-like criteria (e.g., fluency, safety).
    • Pros:
      • Can evaluate nuanced qualities without a ground truth dataset.
      • Provides human-readable explanations for scores.
      • Highly customizable.
    • Cons:
      • Can be slower and more expensive.
      • Evaluation quality depends on the judge model.
    • Use case: Evaluating subjective qualities like creativity, style, or safety, or when ground truth data is unavailable or difficult to create.
  • Computation-based metrics
    • Description: Uses mathematical algorithms (e.g., ROUGE, BLEU) to compute scores by comparing model output to a reference "ground truth" answer.
    • Pros:
      • Fast, low-cost, and objective.
      • Results are easily reproducible.
    • Cons:
      • Requires a ground truth dataset.
      • Cannot measure subjective qualities like style or creativity.
      • Metrics can be rigid (e.g., exact_match is all or nothing).
    • Use case: Tasks with clear right or wrong answers, such as summarization (ROUGE), translation (BLEU), or tool use validation.

Define your model-based metrics

Model-based evaluation uses a machine learning model as a "judge" to evaluate the outputs of your model. Google's proprietary judge models, like Gemini, are calibrated with human raters for quality and are available out-of-the-box.

Model-based evaluation follows these steps:

  1. Data preparation: You provide evaluation data as input prompts. Your models receive these prompts and generate responses.
  2. Evaluation: The evaluation metrics and generated responses are sent to the judge model, which evaluates each response individually.
  3. Aggregation and explanation: Gen AI evaluation service aggregates these individual assessments into an overall score. The output also includes chain-of-thought explanations for each judgment.
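
To make these steps concrete, the following sketch uses the Vertex AI SDK to prepare a small prompt dataset, generate responses with a candidate Gemini model, and read back the aggregated scores and per-response explanations. The project ID and model name are placeholders, the module path is vertexai.evaluation (vertexai.preview.evaluation on older SDK versions), and the prebuilt summarization-quality metric is assumed to be available through MetricPromptTemplateExamples; adapt these names to your environment.

import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

# Placeholder project, location, and model name -- replace with your own.
vertexai.init(project="your-project-id", location="us-central1")
candidate_model = GenerativeModel("gemini-1.5-pro")

# Step 1: Data preparation -- evaluation data is provided as input prompts.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the following article in two sentences: ...",
            "Summarize the following support ticket in one sentence: ...",
        ],
    }
)

# Step 2: Evaluation -- the candidate model generates responses, and the judge
# model scores each response against the metric prompt.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
)
eval_result = eval_task.evaluate(model=candidate_model)

# Step 3: Aggregation and explanation -- overall scores plus per-response
# explanations (typically in <metric>/score and <metric>/explanation columns).
print(eval_result.summary_metrics)
print(eval_result.metrics_table)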

Gen AI evaluation service offers the following options to set up your model-based metrics with the Vertex AI SDK:

  • Use an existing example: Use a prebuilt metric prompt template to get started. Best for common use cases and saving time.
  • Define metrics with our templated interface: Get guided assistance to define your metrics; the templated interface provides structure and suggestions. Best for customization with support.
  • Define metrics from scratch: Gain complete control over your metric definitions. Best for highly specific use cases, but requires more technical expertise and time investment.
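
For the first and third options, the following sketch shows a prebuilt metric taken from the bundled examples and a metric defined from scratch with a free-form prompt template. The MetricPromptTemplateExamples class and its FLUENCY example name are assumptions based on the Vertex AI SDK's bundled templates, and the conciseness metric is a purely illustrative, hypothetical definition. The second option, the templated interface, is shown in the example that follows.

# Classes from vertexai.evaluation (vertexai.preview.evaluation on older SDKs).
from vertexai.evaluation import MetricPromptTemplateExamples, PointwiseMetric

# Option 1: use an existing example -- a prebuilt metric prompt template.
fluency_metric = MetricPromptTemplateExamples.Pointwise.FLUENCY

# Option 3: define a metric from scratch with a free-form prompt template.
# You control the entire judge prompt, including criteria, rubric, and the
# {prompt} and {response} placeholders that are filled in at evaluation time.
conciseness_metric = PointwiseMetric(
    metric="conciseness",
    metric_prompt_template=(
        "Evaluate whether the AI response conveys the key information in as few"
        " words as possible. Rate the response from 0 to 1, where 1 means the"
        " response is concise and 0 means it is verbose or repetitive.\n\n"
        "Prompt: {prompt}\n"
        "Response: {response}"
    ),
)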

For example, to evaluate a generative AI application for fluent and entertaining responses, you can define two criteria using the templated interface:

  • Fluency: Sentences flow smoothly, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.
  • Entertainment: Short, amusing text that incorporates emoji, exclamations, and questions to convey quick and spontaneous communication and diversion.

To turn these criteria into a single metric called custom_text_quality with a score from -1 to 1, you can define it as follows:

from vertexai.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

# Define a pointwise metric with two criteria: Fluency and Entertaining.
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": (
                "Sentences flow smoothly and are easy to read, avoiding awkward"
                " phrasing or run-on sentences. Ideas and sentences connect"
                " logically, using transitions effectively where needed."
            ),
            "entertaining": (
                "Short, amusing text that incorporates emojis, exclamations and"
                " questions to convey quick and spontaneous communication and"
                " diversion."
            ),
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria",
            "-1": "The response falls short on both criteria",
        },
    ),
)
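
After you define custom_text_quality, you can pass it to an evaluation run like any other metric. The following is a minimal usage sketch, where eval_dataset and model are placeholders for your own prompt DataFrame and candidate model:

from vertexai.evaluation import EvalTask  # vertexai.preview.evaluation on older SDKs

# eval_dataset (a DataFrame with a "prompt" column) and model (a GenerativeModel
# instance) are placeholders -- define them as in the earlier flow sketch.
eval_result = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_text_quality],
).evaluate(model=model)

# Aggregated score for the custom metric, plus per-response explanations.
print(eval_result.summary_metrics)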

For a complete list of metric prompt templates, see Metric prompt templates for evaluation.

Evaluate translation models

The Gen AI evaluation service offers the following metrics for translation tasks: MetricX, COMET, and BLEU.

MetricX and COMET are pointwise model-based metrics that have been trained for translation tasks. You can use them to evaluate the quality and accuracy of translation results for your content, whether the translations are produced by NMT, TranslationLLM, or Gemini models.

You can also use Gemini as a judge model to evaluate your model for fluency, coherence, verbosity, and text quality, in combination with MetricX, COMET, or BLEU.

  • MetricX is an error-based metric developed by Google that predicts a floating point score between 0 and 25 representing the quality of a translation. MetricX is available as both a reference-based and a reference-free (QE) method. When you use this metric, a lower score is better, because it indicates fewer errors.

  • COMET employs a reference-based regression approach that provides scores ranging from 0 to 1, where 1 signifies a perfect translation.

  • BLEU (Bilingual Evaluation Understudy) is a computation-based metric. The BLEU score indicates how similar the candidate text is to the reference text. A BLEU score value that is closer to one indicates that a translation is closer to the reference text.

Note that BLEU scores are not recommended for comparing across different corpora and languages. For example, an English to German BLEU score of 50 is not comparable to a Japanese to English BLEU score of 50. Many translation experts have shifted to model-based metric approaches, which have higher correlation with human ratings and are more granular in identifying error scenarios.
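
As a minimal sketch, you can score pre-generated translations with BLEU by providing response and reference columns; configuring MetricX and COMET is covered in Evaluate a translation model, so the example below sticks to the computation-based metric and assumes the SDK's standard column names.

import pandas as pd
from vertexai.evaluation import EvalTask  # vertexai.preview.evaluation on older SDKs

# Candidate translations produced by your model and their reference translations.
translation_dataset = pd.DataFrame(
    {
        "response": ["Der schnelle braune Fuchs springt über den faulen Hund."],
        "reference": ["Der flinke braune Fuchs springt über den faulen Hund."],
    }
)

# BLEU compares the candidate to the reference; scores closer to 1 are better.
eval_result = EvalTask(dataset=translation_dataset, metrics=["bleu"]).evaluate()
print(eval_result.summary_metrics)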

To learn how to run evaluations for translation models, see Evaluate a translation model.

Choose between pointwise or pairwise evaluation

Use the following guidance to decide when to use pointwise or pairwise evaluation:

  • Pointwise evaluation: Evaluate one model and generate scores based on the criteria.
    • When to use:
      • When you need a score for each model you're evaluating.
      • When it's not difficult to define the rubric for each score.
    • Example use cases:
      • Understanding how your model behaves in production.
      • Exploring the strengths and weaknesses of a single model.
      • Identifying which behaviors to focus on when tuning.
      • Getting the baseline performance of a model.
  • Pairwise evaluation: Compare two models against each other, generating a preference based on the criteria.
    • When to use:
      • When you want to compare two models and a score isn't necessary.
      • When the score rubric for pointwise evaluation is difficult to define. For example, it might be difficult to define a 1-5 rubric for text quality, but it's easier to compare two models and state a preference.
    • Example use cases:
      • Determining which model to put into production.
      • Choosing between model types (e.g., Gemini Pro versus Claude 3).
      • Choosing between different prompts.
      • Determining whether tuning improved a baseline model.
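
To set up a pairwise evaluation with the Vertex AI SDK, you define a PairwiseMetric that names the baseline model for the judge to compare against. The following is a minimal sketch; the bundled pairwise text-quality template, the Gemini model names, and eval_dataset are assumptions or placeholders, so check your SDK version for the exact example names.

from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PairwiseMetric
from vertexai.generative_models import GenerativeModel

# The judge compares the candidate model's response with the baseline model's
# response for each prompt and states a preference.
pairwise_text_quality = PairwiseMetric(
    metric="pairwise_text_quality",
    metric_prompt_template=MetricPromptTemplateExamples.Pairwise.TEXT_QUALITY,
    baseline_model=GenerativeModel("gemini-1.5-flash"),  # placeholder baseline
)

eval_result = EvalTask(
    dataset=eval_dataset,  # placeholder DataFrame with a "prompt" column
    metrics=[pairwise_text_quality],
).evaluate(model=GenerativeModel("gemini-1.5-pro"))  # placeholder candidate

# Summary metrics include the candidate's win rate against the baseline.
print(eval_result.summary_metrics)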

Computation-based metrics

Computation-based metrics compare LLM-generated results against a ground-truth dataset of input and output pairs. These metrics fall into the following categories:

  • Lexicon-based metrics: Use mathematical formulas to calculate string similarities between LLM-generated results and ground truth, such as Exact Match and ROUGE.
  • Count-based metrics: Aggregate the number of rows that match or miss certain ground-truth labels, such as F1 score, Accuracy, and Tool Name Match.
  • Embedding-based metrics: Calculate the distance between the LLM-generated results and ground truth in the embedding space to reflect their similarity.
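
Because computation-based metrics only need a response and a reference, you can score pre-generated responses directly without calling a model. The following is a minimal sketch, assuming the SDK's bundled metric names and standard column names:

import pandas as pd
from vertexai.evaluation import EvalTask  # vertexai.preview.evaluation on older SDKs

# Ground-truth dataset of input/output pairs: pre-generated responses and
# their reference answers.
dataset = pd.DataFrame(
    {
        "response": ["The cat sat on the mat.", "Paris is the capital of France."],
        "reference": ["The cat sat on the mat.", "The capital of France is Paris."],
    }
)

# Lexicon-based metrics compare the strings directly against the reference.
eval_result = EvalTask(
    dataset=dataset,
    metrics=["exact_match", "rouge_l_sum", "bleu"],
).evaluate()
print(eval_result.summary_metrics)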

General text generation

The following metrics help you evaluate a model's ability to generate useful, safe, and effective responses.

exact_match: Checks whether the model response is identical to the reference. Measures: strict correctness. Use case: when the output must be a precise string (e.g., code generation, specific answers).
bleu (BiLingual Evaluation Understudy): Compares n-gram overlap between the response and the reference. Measures: translation quality and fluency. Use case: evaluating machine translation tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares n-gram overlap with a focus on recall. Measures: summarization quality and content overlap. Use case: evaluating text summarization and content similarity.

  • exact_match

    The exact_match metric computes whether a model response matches a reference exactly.

    • Token limit: None

    Evaluation criteria

    Not applicable.

    Metric input parameters

    Input parameter Description
    response The LLM response.
    reference The golden LLM response for reference.

    Output scores

    Value Description
    0 Not matched
    1 Matched
  • bleu

    The bleu (BiLingual Evaluation Understudy) metric evaluates the quality of a machine-translated response by comparing it to a reference translation.

    • Token limit: None

    Evaluation criteria

    Not applicable.

    Metric input parameters

    Input parameter Description
    response The LLM response.
    reference The golden LLM response for the reference.

    Output scores

    Value Description
    A float in the range of [0,1] Higher scores indicate better translations. A score of 1 represents a perfect match to the reference.
  • ROUGE

    The ROUGE metric compares the response parameter against a reference parameter. All rouge metrics return the F1 score. rouge-l-sum is calculated by default, but you can specify the rouge variant you want to use.

    • Token limit: None

    Evaluation criteria

    Not applicable.

    Metric input parameters

    Input parameter Description
    response The LLM response.
    reference The golden LLM response for the reference.

    Output scores

    Value Description
    A float in the range of [0,1] A score closer to 0 means poor similarity between response and reference. A score closer to 1 means strong similarity between response and reference.

Tool use and function calling

The following metrics help you evaluate the model's ability to predict a valid tool (function) call. A sketch at the end of this section shows how to run them together.

tool_call_valid: Checks whether the output is a syntactically valid tool call. Measures: formatting correctness. Use case: ensuring that the model generates valid function calls that can be parsed.
tool_name_match: Checks whether the predicted tool name matches the reference. Measures: correct tool selection. Use case: verifying that the model chose the right function to call.
tool_parameter_key_match: Checks whether the predicted parameter names (keys) match the reference. Measures: correct argument structure. Use case: verifying that the model is trying to pass the correct set of parameters.
tool_parameter_kv_match: Checks whether both the parameter names (keys) and their values match the reference. Measures: full correctness of arguments. Use case: the strictest check, ensuring that the model called the function with the exact correct arguments.

  • tool_call_valid

    The tool_call_valid metric describes the model's ability to predict a valid tool call. Only the first tool call is inspected.

    • Token limit: None

    Evaluation criteria

    Evaluation criterion Description
    Validity The model's output contains a valid tool call.
    Formatting A JSON dictionary contains the name and arguments fields.

    Metric input parameters

    Input parameter Description
    prediction The candidate model output, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example:

    {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater":"Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30","num_tix": "2"}}]}
    reference The ground-truth reference prediction, which follows the same format as prediction.

    Output scores

    Value Description
    0 Invalid tool call
    1 Valid tool call
  • tool_name_match

    The tool_name_match metric describes the model's ability to predict a tool call with the correct tool name. Only the first tool call is inspected.

    • Token limit: None

    Evaluation criteria

    Evaluation criterion Description
    Name matching The model-predicted tool call matches the reference tool call's name.

    Metric input parameters

    Input parameter Description
    prediction The candidate model output, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example:

    {"content": "","tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater":"Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30","num_tix": "2"}}]}
    reference The ground-truth reference prediction, which follows the same format as the prediction.

    Output scores

    Value Description
    0 Tool call name doesn't match the reference.
    1 Tool call name matches the reference.
  • tool_parameter_key_match

    The tool_parameter_key_match metric describes the model's ability to predict a tool call with the correct parameter names.

    • Token limit: None

    Evaluation criteria

    Evaluation criterion Description
    Parameter matching ratio The ratio between the number of predicted parameters that match the parameter names of the reference tool call and the total number of parameters.

    Metric input parameters

    Input parameter Description
    prediction The candidate model output, which is a JSON serialized string that contains the content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example:

    {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater":"Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30","num_tix": "2"}}]}
    reference The ground-truth reference model prediction, which follows the same format as prediction.

    Output scores

    Value Description
    A float in the range of [0,1] A higher score means that more of the predicted parameter names match the reference parameter names. A score of 1 means that all parameter names match.
  • tool_parameter_kv_match

    The tool_parameter_kv_match metric describes the model's ability to predict a tool call with the correct parameter names and values.

    • Token limit: None

    Evaluation criteria

    Evaluation criterion Description
    Parameter matching ratio The ratio between the number of the predicted parameters that match both the parameter names and values of the reference tool call and the total number of parameters.

    Metric input parameters

    Input parameter Description
    prediction The candidate model output, which is a JSON serialized string that contains content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. Here is an example:

    {"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater":"Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30","num_tix": "2"}}]}
    reference The ground-truth reference prediction, which follows the same format as prediction.

    Output scores

    Value Description
    A float in the range of [0,1] A higher score means that more of the predicted parameters match the reference parameter names and values. A score of 1 means that all parameters match.
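
The following sketch runs the four tool-use metrics together over a single example, reusing the book_tickets call from the tables above. The metric name strings and the response/reference column convention are assumptions based on the SDK's bundled computation-based metrics; adjust them to match your SDK version.

import json
import pandas as pd
from vertexai.evaluation import EvalTask  # vertexai.preview.evaluation on older SDKs

# Candidate and reference tool calls, serialized in the same JSON format shown
# in the parameter tables above.
tool_call = {
    "content": "",
    "tool_calls": [
        {
            "name": "book_tickets",
            "arguments": {
                "movie": "Mission Impossible Dead Reckoning Part 1",
                "theater": "Regal Edwards 14",
                "location": "Mountain View CA",
                "showtime": "7:30",
                "date": "2024-03-30",
                "num_tix": "2",
            },
        }
    ],
}
dataset = pd.DataFrame(
    {
        "response": [json.dumps(tool_call)],
        "reference": [json.dumps(tool_call)],
    }
)

# Each metric returns 0/1 or a [0,1] ratio per row; identical candidate and
# reference calls score 1 across the board.
eval_result = EvalTask(
    dataset=dataset,
    metrics=[
        "tool_call_valid",
        "tool_name_match",
        "tool_parameter_key_match",
        "tool_parameter_kv_match",
    ],
).evaluate()
print(eval_result.summary_metrics)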

Baseline evaluation quality for generative tasks

When you evaluate the output of generative AI models, the process is inherently subjective. The quality of the evaluation can vary depending on the specific task and evaluation criteria. This subjectivity also applies to human evaluators. For more information about the challenges of achieving consistent evaluation for generative AI models, see Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena and Learning to summarize from human feedback.

What's next