For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that has been configured and prompted as a judge model. To learn more about the judge model, see the Advanced judge model customization series, which describes additional tools that you can use to evaluate and configure the judge model.
Using human raters to evaluate large language models (LLMs) can be expensive and time consuming. Using a judge model is a more scalable way to evaluate LLMs. By default, the Gen AI evaluation service uses a configured Gemini 2.0 Flash model as the judge model, with customizable prompts that let you evaluate your model for various use cases.
The following sections show how to evaluate a customized judge model for your ideal use case.
Prepare the dataset
To evaluate model-based metrics, prepare an evaluation dataset that uses human ratings as the ground truth. The goal is to compare the scores from the model-based metrics with the human ratings and confirm that the model-based metrics have the right quality for your use case.
- For `PointwiseMetric`, prepare the `{metric_name}/human_rating` column in the dataset as the ground truth for the `{metric_name}/score` result generated by the model-based metric.
- For `PairwiseMetric`, prepare the `{metric_name}/human_pairwise_choice` column in the dataset as the ground truth for the `{metric_name}/pairwise_choice` result generated by the model-based metric.
Use the following dataset schema (a minimal pointwise example follows the table):
| Model-based metric | Human rating column |
| --- | --- |
| `PointwiseMetric` | `{metric_name}/human_rating` |
| `PairwiseMetric` | `{metric_name}/human_pairwise_choice` |
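For example, a minimal pointwise dataset might look like the following sketch. The metric name `fluency` and the row values are assumptions for illustration; only the `{metric_name}/human_rating` naming convention comes from the schema above.

```python
import pandas as pd

# Hypothetical pointwise example: human ratings are the ground truth for a
# PointwiseMetric named "fluency" (the metric name and values are made up).
human_rated_dataset = pd.DataFrame({
    "prompt": ["Summarize the article in one sentence.", "Write a short product description."],
    "response": ["The article argues that remote work boosts productivity.", "Jacket warm good buy now."],
    # Ground-truth column compared against the judge model's "fluency/score" output.
    "fluency/human_rating": [1, 0],
})
```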
Available metrics
For a `PointwiseMetric` that returns only 2 scores (such as 0 and 1), and a `PairwiseMetric` that has only 2 preference types (Model A or Model B), the following metrics are available:

For a `PointwiseMetric` that returns more than 2 scores (such as 1 through 5), and a `PairwiseMetric` that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:

Where:

- \( f1 = 2 \cdot precision \cdot recall / (precision + recall) \)
  - \( precision = \text{True Positives} / (\text{True Positives} + \text{False Positives}) \)
  - \( recall = \text{True Positives} / (\text{True Positives} + \text{False Negatives}) \)
- \( n \): the number of classes
- \( cnt_i \): the number of \( class_i \) elements in the ground-truth data
- \( sum \): the number of elements in the ground-truth data

To calculate other metrics, you can use open-source libraries.
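As one way to use an open-source library for this, the sketch below computes precision, recall, and F1 between human ratings and judge-model scores with scikit-learn (an assumption; any comparable library works). The column semantics follow the pointwise convention above; the values and the metric name are hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and judge-model outputs for a two-score
# PointwiseMetric (0 = not fluent, 1 = fluent).
human_ratings = [1, 0, 1, 1, 0]      # {metric_name}/human_rating
autorater_scores = [1, 0, 0, 1, 0]   # {metric_name}/score

precision = precision_score(human_ratings, autorater_scores)
recall = recall_score(human_ratings, autorater_scores)
f1 = f1_score(human_ratings, autorater_scores)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```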
Evaluate the model-based metric

The following example updates the model-based metric with a custom definition of fluency, and then evaluates the quality of the metric.
```python
import pandas as pd

from vertexai.preview.evaluation import AutoraterConfig, EvalTask, PairwiseMetric
from vertexai.preview.evaluation.autorater_utils import evaluate_autorater

# Step 1: Prepare the evaluation dataset with the human rating data column.
human_rated_dataset = pd.DataFrame({
    "prompt": [PROMPT_1, PROMPT_2],
    "response": [RESPONSE_1, RESPONSE_2],
    "baseline_model_response": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],
    "pairwise_fluency/human_pairwise_choice": ["model_A", "model_B"],
})

# Step 2: Get the results from the model-based metric.
pairwise_fluency = PairwiseMetric(
    metric="pairwise_fluency",
    metric_prompt_template="please evaluate pairwise fluency...",
)

eval_result = EvalTask(
    dataset=human_rated_dataset,
    metrics=[pairwise_fluency],
).evaluate()

# Step 3: Calibrate the model-based metric results against the human preferences.
# eval_result contains the human ratings from human_rated_dataset.
evaluate_autorater_result = evaluate_autorater(
    evaluate_autorater_input=eval_result.metrics_table,
    eval_metrics=[pairwise_fluency],
)
```
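The exact contents of `evaluate_autorater_result` depend on the SDK version, so the following is only a minimal sketch of a manual sanity check. It assumes `eval_result.metrics_table` is a pandas DataFrame that exposes both the judge model's choice and the human choice under the column names described earlier.

```python
# Hypothetical manual check: fraction of rows where the judge model agrees
# with the human pairwise preference.
metrics_table = eval_result.metrics_table
agreement = (
    metrics_table["pairwise_fluency/pairwise_choice"]
    == metrics_table["pairwise_fluency/human_pairwise_choice"]
).mean()
print(f"Judge model vs. human agreement: {agreement:.0%}")
```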
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[],[],null,["# Evaluate a judge model\n\n| **Preview**\n|\n|\n| This product or feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA products and features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nFor model-based metrics, the Gen AI evaluation service evaluates your models with a foundational model, such as Gemini, that has been configured and prompted as a judge model. If you want to learn more about the judge model, the *Advanced judge model customization series* describes additional tools you can use to evaluate and configure the judge model.\n\nFor the basic evaluation workflow, see the [Gen AI evaluation service quickstart](/vertex-ai/generative-ai/docs/models/evaluation-quickstart). The *Advanced judge model customization series* includes the following pages:\n\n1. Evaluate a judge model (current page)\n2. [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)\n3. [Configure a judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)\n\nOverview\n--------\n\nUsing human judges to evaluate large language models (LLMs) can be expensive and\ntime consuming. Using a judge model is a more scalable way to evaluate LLMs. The\nGen AI evaluation service uses a configured Gemini 2.0 Flash\nmodel by default as the judge model, with customizable prompts to evaluate your\nmodel for various use cases.\n\nThe following sections show how to evaluate a customized judge model for your ideal use case.\n\nPrepare the dataset\n-------------------\n\nTo evaluate model-based metrics, prepare an evaluation dataset with human ratings as the ground truth. 
The goal is to compare the scores from model-based metrics with human ratings and see if model-based metrics have the ideal quality for your use case.\n\n- For `PointwiseMetric`, prepare the `{metric_name}/human_rating` column in the dataset as the ground truth for the `{metric_name}/score` result generated by model-based metrics.\n\n- For `PairwiseMetric`, prepare the `{metric_name}/human_pairwise_choice` column in the dataset as the ground truth for the `{metric_name}/pairwise_choice` result generated by model-based metrics.\n\nUse the following dataset schema:\n\nAvailable metrics\n-----------------\n\nFor a `PointwiseMetric` that returns only 2 scores (such as 0 and 1), and a `PairwiseMetric` that only has 2 preference types (Model A or Model B), the following metrics are available:\n\nFor a `PointwiseMetric` that returns more than 2 scores (such as 1 through 5), and a `PairwiseMetric` that has more than 2 preference types (Model A, Model B, or Tie), the following metrics are available:\n\nWhere:\n\n- \\\\( f1 = 2 \\* precision \\* recall / (precision + recall) \\\\)\n\n - \\\\( precision = True Positives / (True Positives + False Positives) \\\\)\n\n - \\\\( recall = True Positives / (True Positives + False Negatives) \\\\)\n\n- \\\\( n \\\\) : number of classes\n\n- \\\\( cnt_i \\\\) : number of \\\\( class_i \\\\) in ground truth data\n\n- \\\\( sum \\\\): number of elements in ground truth data\n\nTo calculate other metrics, you can use open-source libraries.\n\nEvaluate the model-based metric\n-------------------------------\n\nThe following example updates the model-based metric with a custom definition of fluency, then evaluates the quality of the metric. \n\n from vertexai.preview.evaluation import {\n AutoraterConfig,\n PairwiseMetric,\n }\n from vertexai.preview.evaluation.autorater_utils import evaluate_autorater\n\n\n # Step 1: Prepare the evaluation dataset with the human rating data column.\n human_rated_dataset = pd.DataFrame({\n \"prompt\": [PROMPT_1, PROMPT_2],\n \"response\": [RESPONSE_1, RESPONSE_2],\n \"baseline_model_response\": [BASELINE_MODEL_RESPONSE_1, BASELINE_MODEL_RESPONSE_2],\n \"pairwise_fluency/human_pairwise_choice\": [\"model_A\", \"model_B\"]\n })\n\n # Step 2: Get the results from model-based metric\n pairwise_fluency = PairwiseMetric(\n metric=\"pairwise_fluency\",\n metric_prompt_template=\"please evaluate pairwise fluency...\"\n )\n\n eval_result = EvalTask(\n dataset=human_rated_dataset,\n metrics=[pairwise_fluency],\n ).evaluate()\n\n # Step 3: Calibrate model-based metric result and human preferences.\n # eval_result contains human evaluation result from human_rated_dataset.\n evaluate_autorater_result = evaluate_autorater(\n evaluate_autorater_input=eval_result.metrics_table,\n eval_metrics=[pairwise_fluency]\n )\n\nWhat's next\n-----------\n\n- [Prompting for judge model customization](/vertex-ai/generative-ai/docs/models/prompt-judge-model)\n- [Configure your judge model](/vertex-ai/generative-ai/docs/models/configure-judge-model)"]]