PairwiseMetric(
*,
metric: typing.Literal["summarization_quality", "question_answering_quality"],
baseline_model: typing.Optional[
typing.Union[
vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
]
] = None,
use_reference: bool = False,
version: typing.Optional[int] = None
)
The Side-by-side (SxS) Pairwise Metric.
A model-based evaluation metric that compares two generative models side by side, letting users A/B test their generative models to determine which one performs better on a given evaluation task.
For more details on when to use pairwise metrics, see Evaluation methods and metrics.
Result Details:
* In `EvalResult.summary_metrics`, win rates for both the baseline and
candidate model are computed, showing how often each model performs
better on the given task. The candidate win rate is the number of times
the candidate model performs better than the baseline model divided by the
total number of examples. Each win rate is a number between 0 and 1.
* In `EvalResult.metrics_table`, a pairwise metric produces three
evaluation results for each row in the dataset:
* `pairwise_choice`: an enumeration that indicates whether the
candidate or baseline model performs better.
* `explanation`: The AutoRater's rationale behind each verdict
using chain-of-thought reasoning. These explanations help users
scrutinize the AutoRater's judgment and build appropriate trust in
its decisions.
* `confidence`: A score between 0 and 1, which signifies how confident
the AutoRater was with its verdict. A score closer to 1 means higher
confidence.
See the [documentation page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results.
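The win-rate arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration of the definition, not the SDK's implementation: the label strings `"CANDIDATE"`, `"BASELINE"`, and `"TIE"` and the helper name `win_rates` are assumptions made for the example, and the actual `pairwise_choice` values may differ.

```python
from collections import Counter


def win_rates(choices):
    """Return (candidate_win_rate, baseline_win_rate) for a list of verdicts.

    The labels "CANDIDATE" and "BASELINE" are hypothetical stand-ins for the
    values found in the `pairwise_choice` column.
    """
    if not choices:
        raise ValueError("choices must be non-empty")
    counts = Counter(choices)
    total = len(choices)  # every example counts, including ties
    return counts["CANDIDATE"] / total, counts["BASELINE"] / total


# Four examples: the candidate wins twice, the baseline once, one tie.
verdicts = ["CANDIDATE", "BASELINE", "CANDIDATE", "TIE"]
candidate_rate, baseline_rate = win_rates(verdicts)
print(candidate_rate, baseline_rate)  # 0.5 0.25
```

Because the denominator is the total number of examples, the two rates need not sum to 1 when some rows favor neither model.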
Usage:
```
import pandas as pd

from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, PairwiseMetric

# The baseline is the reference model; the candidate is the model under test.
baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_summarization_quality = PairwiseMetric(
    metric="summarization_quality",
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=pd.DataFrame({
        "instruction": [...],
        "context": [...],
    }),
    metrics=[pairwise_summarization_quality],
)

# Column names in the dataset fill the placeholders in the prompt template.
pairwise_results = eval_task.evaluate(
    prompt_template="instruction: {instruction}. context: {context}",
    model=candidate_model,
)
```
Methods
PairwiseMetric
PairwiseMetric(
*,
metric: typing.Literal["summarization_quality", "question_answering_quality"],
baseline_model: typing.Optional[
typing.Union[
vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
]
] = None,
use_reference: bool = False,
version: typing.Optional[int] = None
)
Initializes the Side-by-side (SxS) Pairwise evaluation metric.
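The signature also allows `baseline_model` to be a `typing.Callable[[str], str]` instead of a `GenerativeModel`. A minimal sketch of such a callable follows, assuming it receives the fully rendered prompt and returns the baseline response text; that behavior is not stated on this page. The helper `make_canned_baseline` and the sample prompt are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict


def make_canned_baseline(
    responses: Dict[str, str], default: str = ""
) -> Callable[[str], str]:
    """Build a str -> str function that replays precomputed baseline responses."""

    def baseline(prompt: str) -> str:
        # Assumption: the prompt arrives already rendered from the template.
        return responses.get(prompt, default)

    return baseline


# Hypothetical precomputed response keyed by the rendered prompt.
canned_baseline = make_canned_baseline(
    {"instruction: Summarize. context: The sky is blue.": "The sky is blue."}
)
print(canned_baseline("instruction: Summarize. context: The sky is blue."))
```

A function like this could then be passed as `baseline_model=canned_baseline`, which may be useful when baseline responses were generated ahead of time.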