# Evaluate a judge model

| **Preview**
|
| This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section
| of the [Service Specific Terms](/terms/service-terms#1).
|
| Pre-GA products and features are available "as is" and might have limited support.
|
| For more information, see the
| [launch stage descriptions](/products#product-launch-stages).

For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that has been configured and prompted to act as a judge model. To learn more about the judge model, the *Advanced judge model customization series* describes additional tools you can use to evaluate and configure it.

For the basic evaluation workflow, see the [Gen AI evaluation service quickstart](/vertex-ai/generative-ai/docs/models/evaluation-quickstart). The *Advanced judge model customization series* includes the following pages:

1. Evaluate a judge model (current page)
2. [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)
3. [Configure a judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)

Overview
--------

Using human judges to evaluate large language models (LLMs) can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs. By default, the Gen AI evaluation service uses a configured Gemini 2.0 Flash model as the judge model, with customizable prompts to evaluate your model for various use cases.

The following sections show how to evaluate a customized judge model for your use case.

Prepare the dataset
-------------------

To evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. The goal is to compare the scores from model-based metrics against the human ratings and verify that the model-based metrics have the quality you need for your use case.

- For `PointwiseMetric`, prepare the `{metric_name}/human_rating` column in the dataset as the ground truth for the `{metric_name}/score` result generated by model-based metrics.

- For `PairwiseMetric`, prepare the `{metric_name}/human_pairwise_choice` column in the dataset as the ground truth for the `{metric_name}/pairwise_choice` result generated by model-based metrics.

Use the following dataset schema:

| Column | Description |
|---|---|
| `prompt` | The input prompt. |
| `response` | The response from the model being evaluated. |
| `baseline_model_response` | The response from the baseline model (`PairwiseMetric` only). |
| `{metric_name}/human_rating` | The human rating of `response` (`PointwiseMetric` only). |
| `{metric_name}/human_pairwise_choice` | The human preference between `response` and `baseline_model_response` (`PairwiseMetric` only). |
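For example, a pointwise evaluation dataset for a fluency metric might look like the following minimal sketch. The prompts, responses, and ratings are placeholder values, and `fluency` stands in for your metric name:

    import pandas as pd

    # A hypothetical pointwise dataset. The "fluency/human_rating" column is
    # the ground truth for the "fluency/score" column that the judge model
    # produces during evaluation.
    human_rated_dataset = pd.DataFrame({
        "prompt": ["Explain LLMs to a child.", "Summarize the attached article."],
        "response": ["An LLM is like a very well-read helper...", "The article argues that..."],
        "fluency/human_rating": [5, 2],  # human scores on a 1-5 scale
    })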
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Evaluate a judge model\n\n| **Preview**\n|\n|\n| This product or feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA products and features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nFor model-based metrics, the Gen AI evaluation service evaluates your models with a foundational model, such as Gemini, that has been configured and prompted as a judge model. If you want to learn more about the judge model, the *Advanced judge model customization series* describes additional tools you can use to evaluate and configure the judge model.\n\nFor the basic evaluation workflow, see the [Gen AI evaluation service quickstart](/vertex-ai/generative-ai/docs/models/evaluation-quickstart). The *Advanced judge model customization series* includes the following pages:\n\n1. Evaluate a judge model (current page)\n2. [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)\n3. [Configure a judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)\n\nOverview\n--------\n\nUsing human judges to evaluate large language models (LLMs) can be expensive and\ntime consuming. Using a judge model is a more scalable way to evaluate LLMs. The\nGen AI evaluation service uses a configured Gemini 2.0 Flash\nmodel by default as the judge model, with customizable prompts to evaluate your\nmodel for various use cases.\n\nThe following sections show how to evaluate a customized judge model for your ideal use case.\n\nPrepare the dataset\n-------------------\n\nTo evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. 
Evaluate the model-based metric
-------------------------------

The following example updates the model-based metric with a custom definition of fluency, then evaluates the quality of the metric.

    import pandas as pd

    from vertexai.preview.evaluation import EvalTask, PairwiseMetric
    from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

    # Step 1: Prepare the evaluation dataset with the human rating data column.
    human_rated_dataset = pd.DataFrame({
        "prompt": [PROMPT_1, PROMPT_2],
        "response": [RESPONSE_1, RESPONSE_2],
        "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
        "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
    })

    # Step 2: Get the results from the model-based metric.
    pairwise_fluency = PairwiseMetric(
        metric="pairwise_fluency",
        metric_prompt_template="please evaluate pairwise fluency...",
    )

    eval_result = EvalTask(
        dataset=human_rated_dataset,
        metrics=[pairwise_fluency],
    ).evaluate()

    # Step 3: Compare the model-based metric results with the human preferences.
    # eval_result.metrics_table contains both the judge model results and the
    # human ratings from human_rated_dataset.
    evaluate_autorater_result = evaluate_autorater(
        evaluate_autorater_input=eval_result.metrics_table,
        eval_metrics=[pairwise_fluency],
    )

What's next
-----------

- [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)
- [Configure your judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)