This guide shows you how to evaluate a judge model by comparing its performance against human ratings.
This page covers the following topics:
Prepare the dataset: Learn how to structure your dataset with human ratings to serve as the ground truth for evaluation.
Available metrics: Understand the metrics used to measure the agreement between the judge model and human ratings.
Evaluate the model-based metric: See a code example of how to run an evaluation job and get quality scores for your judge model.
For model-based metrics, the Gen AI evaluation service uses a foundation model, such as Gemini, as a judge model to evaluate your models. To learn more about configuring and evaluating the judge model, see the other pages in the Advanced judge model customization series.
For the basic evaluation workflow, see the Gen AI evaluation service quickstart.
Using human judges to evaluate large language models (LLMs) can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. By default, the Gen AI evaluation service uses Gemini 2.0 Flash as the judge model, with customizable prompts to evaluate your model for various use cases.
The following sections show you how to evaluate a customized judge model for your use case.
Metric types
The Gen AI evaluation service uses two types of model-based metrics to evaluate judge models.
PointwiseMetric: Assigns a numerical score to a single model's output based on a specific criterion (for example, fluency or safety). Use it when you need to rate a single model response on a scale (for example, rating helpfulness from 1 to 5).
PairwiseMetric: Compares the outputs from two models (a candidate and a baseline) and chooses the preferred one. Use it when you need to determine which of two model responses is better for a given prompt.
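For example, the following sketch shows how each metric type might be defined with the Gen AI evaluation SDK. The metric names and prompt templates are illustrative placeholders, not prescribed values:
from vertexai.preview.evaluation import PointwiseMetric, PairwiseMetric

# Minimal sketch; the metric names and prompt templates below are
# illustrative placeholders.

# Scores a single response on a 1-5 fluency scale.
fluency = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=(
        "Rate the fluency of the response from 1 to 5.\n"
        "Prompt: {prompt}\nResponse: {response}"
    ),
)

# Compares a candidate response against a baseline response.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template=(
        "Which response is more fluent?\n"
        "Prompt: {prompt}\n"
        "Response A: {baseline_model_response}\n"
        "Response B: {response}"
    ),
)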
Prepare the dataset
To evaluate model-based metrics, you need to prepare an evaluation dataset that includes human ratings to serve as the ground truth. The goal is to compare the scores from the model-based metrics with the human ratings to determine whether the model-based metrics are of sufficient quality for your use case.
Your dataset must include a column for human ratings that corresponds to the model-based metric you are evaluating. The following table shows the required human rating column for each metric type:
PointwiseMetric: {metric_name}/human_rating
PairwiseMetric: {metric_name}/human_pairwise_choice
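For example, a dataset for a PointwiseMetric named fluency might look like the following sketch. The prompts, responses, and 1-5 scores are illustrative placeholders; the key requirement is the fluency/human_rating column:
import pandas as pd

# Minimal sketch of a pointwise evaluation dataset. The prompts, responses,
# and 1-5 human scores are illustrative placeholders; the required column is
# "fluency/human_rating" because the metric is named "fluency".
human_rated_dataset = pd.DataFrame({
    "prompt": ["Summarize the article.", "Translate the sentence."],
    "response": ["The article says...", "La frase traducida..."],
    "fluency/human_rating": [4, 2],
})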
Available metrics
The Gen AI evaluation service provides different metrics depending on the number of possible outcomes.
Metrics for binary outcomes
For a PointwiseMetric that returns only two scores (such as 0 and 1) and a PairwiseMetric that has only two preference types (such as Model A or Model B), use the confusion_matrix and confusion_matrix_labels fields to calculate metrics such as the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).
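As a sketch, assuming confusion_matrix is a 2x2 matrix whose rows are the human (ground truth) labels and whose columns are the judge model's labels (confirm the ordering against confusion_matrix_labels in your results), the rates can be derived as follows:
import numpy as np

# Minimal sketch: deriving binary rates from a 2x2 confusion matrix.
# The layout below (rows = human labels, columns = judge-model labels,
# ordered to match confusion_matrix_labels) is an assumption; verify it
# against the labels returned by your evaluation.
confusion_matrix = np.array([
    [40, 10],  # human "1": 40 judged "1", 10 judged "0"
    [5, 45],   # human "0": 5 judged "1", 45 judged "0"
])
confusion_matrix_labels = ["1", "0"]

tp, fn = confusion_matrix[0, 0], confusion_matrix[0, 1]
fp, tn = confusion_matrix[1, 0], confusion_matrix[1, 1]

tpr = tp / (tp + fn)  # true positive rate
tnr = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)  # false positive rate
fnr = fn / (fn + tp)  # false negative rate
print(f"TPR={tpr:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")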
Metrics for multiclass outcomes
For a PointwiseMetric that returns more than two scores (such as 1 through 5) and a PairwiseMetric that has more than two preference types (such as Model A, Model B, or Tie), the available metrics are defined in terms of the following quantities:
\( cnt_i \): the number of examples of \( class_i \) in the ground truth data
\( sum \): the total number of examples in the ground truth data
To calculate other metrics, you can use open-source libraries.
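For example, the following sketch uses scikit-learn to compute Cohen's kappa and a macro F1 score as agreement measures between human ratings and judge choices. The label values are illustrative placeholders; in practice you would pass the human rating column and the judge model's choice column from your results table:
from sklearn.metrics import cohen_kappa_score, f1_score

# Minimal sketch: additional agreement metrics with scikit-learn.
# The label values here are illustrative; in practice, pass the human
# rating column and the judge model's score or choice column.
human_ratings = ["model_A", "model_B", "model_A", "model_A"]
judge_choices = ["model_A", "model_B", "model_B", "model_A"]

print("Cohen's kappa:", cohen_kappa_score(human_ratings, judge_choices))
print("Macro F1:", f1_score(human_ratings, judge_choices, average="macro"))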
Evaluate the model-based metric
The following example defines a pairwise model-based metric with a custom definition of fluency, and then evaluates the quality of that metric against human ratings.
import pandas as pd

from vertexai.preview.evaluation import (
    EvalTask,
    PairwiseMetric,
)
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

# Step 1: Prepare the evaluation dataset with the human rating data column.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency...",
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Evaluate the model-based metric results against the human preferences.
# eval_result contains both the metric results and the human ratings from
# human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[pairwise_fluency],
)
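As a quick follow-up check, you can also compute the raw agreement rate between the judge model and the human raters directly from the results table. This sketch continues from the example above; the column names follow the {metric_name}/... convention and are assumptions to confirm against eval_result.metrics_table.columns:
# Minimal sketch of a sanity check on the same results table.
# The column names are assumptions based on the {metric_name}/... convention;
# confirm them with eval_result.metrics_table.columns before relying on them.
metrics_table = eval_result.metrics_table
agreement = (
    metrics_table["pairwise_fluency/pairwise_choice"]
    == metrics_table["pairwise_fluency/human_pairwise_choice"]
).mean()
print(f"Judge vs. human agreement rate: {agreement:.0%}")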
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-21 UTC."],[],[]]