Model-based metrics let you customize how you generate evaluation metrics based on your criteria and use cases. This guide shows you how to configure a judge model and covers the following topics:

- System instructions
- Response flipping
- Multi-sampling
- Tuned judge model

For the basic evaluation workflow, see the Gen AI evaluation service quickstart.

You have several options to configure your judge model for improved quality. The following table provides a high-level comparison of each approach.
Choose a configuration option

| Option | Description | Use case |
| --- | --- | --- |
| System instructions | Provides high-level, persistent instructions to the judge model that influence its behavior for all subsequent evaluation prompts. | When you need to define a consistent role, persona, or output format for the judge model across the entire evaluation task. |
| Response flipping | Swaps the position of the baseline and candidate model responses for half of the evaluation calls. | To reduce potential positional bias in pairwise evaluations where the judge model might favor the response in the first or second position. |
| Multi-sampling | Calls the judge model multiple times for the same input and aggregates the results. | To improve the consistency and reliability of evaluation scores by mitigating the effects of randomness in the judge model's responses. |
| Tuned judge model | Uses a fine-tuned LLM as the judge model for evaluation. | For specialized evaluation tasks that require nuanced understanding or domain-specific knowledge that a general-purpose model lacks. |
System instructions

Gemini models can take in system instructions, which are a set of instructions that affect how the model processes prompts. You can use system instructions when you initialize or generate content from a model to specify product-level behavior such as roles, personas, contextual information, and explanation style and tone. The judge model typically gives more weight to system instructions than to input prompts. For a list of models that support system instructions, see Supported models.

The following example uses the Vertex AI SDK to add `system_instruction` at the metric level for `PointwiseMetric`:

```python
system_instruction = "You are an expert evaluator."

linguistic_acceptability = PointwiseMetric(
    metric="linguistic_acceptability",
    metric_prompt_template=linguistic_acceptability_metric_prompt_template,
    system_instruction=system_instruction,
)

eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[linguistic_acceptability],
).evaluate()
```

You can use the same approach with `PairwiseMetric`.
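As a rough mental model only: each judge call pairs the persistent system instruction with one filled metric prompt template, and the system instruction travels as a separate request field rather than as prompt text. The helper below is hypothetical, not part of the SDK:

```python
# Hypothetical sketch of how a system instruction accompanies each
# rating prompt; this is NOT the SDK's internal representation.
system_instruction = "You are an expert evaluator."

metric_prompt_template = (
    "Rate the linguistic acceptability of the following response "
    "on a scale of 1 to 5.\n\nResponse: {response}"
)

def build_judge_request(system_instruction, template, **fields):
    """Pair the persistent system message with one filled rating prompt."""
    return {
        "system_instruction": system_instruction,
        "prompt": template.format(**fields),
    }

request = build_judge_request(
    system_instruction,
    metric_prompt_template,
    response="The cat sat on the mat.",
)
print(request["prompt"])
```

Because the system instruction is sent with every call, it shapes all rating prompts in the evaluation task without being repeated inside each metric prompt template.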
Response flipping

For `PairwiseMetric`s, the Gen AI evaluation service uses responses from both a baseline model and a candidate model. The judge model evaluates which response better aligns with the criteria in the `metric_prompt_template`. However, the judge model might be biased toward the baseline or candidate model in certain settings.

To reduce bias in the evaluation results, you can enable response flipping. This technique swaps the baseline and candidate model responses for half of the calls to the judge model. The following example shows how to enable response flipping using the Vertex AI SDK:

```python
from vertexai.preview.evaluation import AutoraterConfig

pairwise_relevance_prompt_template = """
# Instruction
…
### Response A
{baseline_model_response}

### Response B
{candidate_model_response}
"""

my_pairwise_metric = PairwiseMetric(
    metric="my_pairwise_metric",
    metric_prompt_template=pairwise_relevance_prompt_template,
    candidate_response_field_name="candidate_model_response",
    baseline_response_field_name="baseline_model_response",
)

# Define an AutoraterConfig with flip_enabled.
my_autorater_config = AutoraterConfig(flip_enabled=True)

# Define an EvalTask with the autorater_config.
flip_enabled_eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[my_pairwise_metric],
    autorater_config=my_autorater_config,
).evaluate()
```
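To see why flipping cancels positional bias in aggregate, consider a toy simulation (not the SDK): a judge that favors whichever response appears first, judging two responses of equal quality. Without flipping, the measured win rate reflects the bias; with flipping on half of the calls, the bias in the two orderings cancels out:

```python
import random

random.seed(0)

def biased_judge(first, second):
    """A toy judge with positional bias: favors the first position 70% of the time."""
    return first if random.random() < 0.7 else second

def win_rate(candidate, baseline, flip_enabled, trials=10_000):
    """Measured candidate win rate, optionally flipping order on half of the calls."""
    wins = 0
    for i in range(trials):
        if flip_enabled and i % 2 == 1:
            winner = biased_judge(candidate, baseline)  # flipped ordering
        else:
            winner = biased_judge(baseline, candidate)  # default ordering
        wins += winner == candidate
    return wins / trials

# With equal-quality responses, the unbiased win rate should be 50%.
print(win_rate("candidate", "baseline", flip_enabled=False))  # biased, well below 50%
print(win_rate("candidate", "baseline", flip_enabled=True))   # close to 50%
```

The simulation only illustrates positional bias; a real judge model's bias can also depend on content, which is why flipping is applied to half of the calls rather than assumed away.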
Multi-sampling

The judge model can exhibit randomness in its responses during an evaluation. To mitigate the effects of this randomness and produce more consistent results, you can use additional sampling, a technique also known as multi-sampling. However, increasing sampling also increases the latency to complete the request.

You can update the sampling count by setting `sampling_count` in `AutoraterConfig` to an integer between 1 and 32. We recommend using the default `sampling_count` value of 4 to balance randomness and latency. Using the Vertex AI SDK, you can specify the number of samples to execute for each request:

```python
from vertexai.preview.evaluation import AutoraterConfig

# Define a customized sampling count in AutoraterConfig.
autorater_config = AutoraterConfig(sampling_count=6)

# Run the evaluation with the sampling count.
eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[METRICS],
    autorater_config=autorater_config,
).evaluate()
```
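The consistency gain from multi-sampling can be illustrated with a toy model (not the SDK, and not the service's actual aggregation logic): a noisy judge that returns the "true" score only part of the time. Aggregating several samples, here with the median, produces a tighter score distribution than any single call, at the cost of extra calls:

```python
import random
import statistics

random.seed(1)

def noisy_judge_score():
    """A toy judge: returns the true score of 4 only 60% of the time."""
    if random.random() < 0.6:
        return 4
    return random.choice([2, 3, 5])

def sampled_score(sampling_count):
    """Call the judge sampling_count times and aggregate with the median."""
    return statistics.median(noisy_judge_score() for _ in range(sampling_count))

single = [sampled_score(1) for _ in range(1000)]
multi = [sampled_score(4) for _ in range(1000)]

# The aggregated scores vary less from run to run than single-call scores.
print(statistics.pstdev(single), statistics.pstdev(multi))
```

This is the trade-off the recommendation above describes: more samples shrink the variance of each metric score, but every extra sample is another judge call that adds latency.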
Tuned judge model

If you have good tuning data for your evaluation use case, you can use the Vertex AI SDK to tune a Gemini model as the judge model and use the tuned model for evaluation. You can specify a tuned model as the judge model through `AutoraterConfig`:

```python
from vertexai.preview.evaluation import (
    AutoraterConfig,
    PairwiseMetric,
    tune_autorater,
    evaluate_autorater,
)

# Tune a model to be the judge model. The tune_autorater helper function
# returns an AutoraterConfig with the judge model set to the tuned model.
autorater_config: AutoraterConfig = tune_autorater(
    base_model="gemini-2.0-flash",
    train_dataset=f"{BUCKET_URI}/train/sft_train_samples.jsonl",
    validation_dataset=f"{BUCKET_URI}/val/sft_val_samples.jsonl",
    tuned_model_display_name=tuned_model_display_name,
)

# Alternatively, you can set up the judge model with an existing tuned model endpoint.
autorater_config = AutoraterConfig(autorater_model=TUNED_MODEL)

# Use the tuned judge model.
eval_result = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[METRICS],
    autorater_config=autorater_config,
).evaluate()
```
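One common way to validate a tuned judge before relying on it is to measure how often its verdicts agree with human verdicts on a held-out set. The sketch below is a conceptual illustration only, not the SDK's `evaluate_autorater` function, and the labels are made up:

```python
# Hypothetical held-out verdicts: which response ("A" or "B") the human
# rater and the tuned judge each preferred for six examples. Made-up data.
human_labels = ["A", "B", "A", "A", "B", "A"]
judge_labels = ["A", "B", "B", "A", "B", "A"]

def agreement_rate(human, judge):
    """Fraction of examples where the judge matches the human verdict."""
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

print(agreement_rate(human_labels, judge_labels))  # 5 of the 6 verdicts match
```

If the tuned judge's agreement with human raters is no better than the base model's, the tuning data or task framing likely needs revisiting before using the tuned judge in production evaluations.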
What's next
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-21 UTC.