For model-based metrics, the Gen AI evaluation service evaluates your models with a foundation model, such as Gemini, that is configured as a judge model. This page describes how you can improve the quality of that judge model and customize it to your needs by using prompt engineering techniques.
Overview
Evaluating large language models (LLMs) with human raters can be expensive and time-consuming. Using a judge model is a more scalable way to evaluate LLMs.
The Gen AI evaluation service uses Gemini 1.5 Pro as the judge model by default, with customizable prompts that let you evaluate models for a variety of use cases. Many basic use cases are covered by the model-based metric templates, but you can use the following process to further customize the judge model beyond those basic use cases:
1. Create a dataset with prompts that are representative of your use case (see the sketch after these steps). The recommended dataset size is between 100 and 1,000 prompts.
2. Use the prompts to modify the judge model with prompt engineering techniques.
3. Run an evaluation with the judge model.
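The following is a minimal sketch of step 1, assuming you use the Gen AI evaluation service's Python SDK with a pandas DataFrame as the evaluation dataset. The column names (`prompt`, `response`, `baseline_model_response`) are assumptions for a pairwise comparison setup and must match the placeholders in your metric prompt template.

```python
import pandas as pd

# Step 1: assemble a dataset of prompts that represent your use case.
# Column names are assumptions for a pairwise setup; adjust them to match
# the placeholders used in your metric prompt template.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the following support ticket in two sentences: ...",
            "Write a Python function that reverses a linked list.",
            # ...aim for roughly 100 to 1,000 representative prompts.
        ],
        "response": [
            "<response 1 from candidate model A>",
            "<response 2 from candidate model A>",
        ],
        "baseline_model_response": [
            "<response 1 from candidate model B>",
            "<response 2 from candidate model B>",
        ],
    }
)
```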
Prompt engineering techniques
This section describes prompt engineering techniques that you can use to modify the judge model. The examples use zero-shot prompting, but you can also add few-shot examples to the prompt to improve model quality.
Start with a prompt that applies to the entire evaluation dataset. The prompt should include high-level evaluation criteria and a rating rubric, and it should ask the judge model for a final verdict. For examples of evaluation criteria and rubrics across different use cases, see Metric prompt templates.
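As a starting point, you can print a built-in model-based metric template and adapt its criteria and rubric. This sketch assumes the `MetricPromptTemplateExamples` helper in the `vertexai.evaluation` module; the metric name shown is an assumption, so pick one of the names returned by `list_example_metric_names()`.

```python
from vertexai.evaluation import MetricPromptTemplateExamples

# List the metric names that ship with built-in prompt templates.
print(MetricPromptTemplateExamples.list_example_metric_names())

# Print one built-in template to use as a starting point for your own
# evaluation criteria, rating rubric, and final-verdict instruction.
print(
    MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_instruction_following"
    )
)
```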
Use chain-of-thought (CoT) prompting
Prompt the judge model to evaluate the candidate models using a logically coherent sequence of actions or steps.
For example, consider the following step-by-step instructions:
"Please first list down the instructions in the user query."
"Please highlight such specific keywords."
"After listing down instructions, you should rank the instructions in the order of importance."
"After that, INDEPENDENTLY check if response A and response B for meeting each of the instructions."
"Writing quality/style should NOT be used to judge the response quality unless it was requested by the user."
"When evaluating the final response quality, please value Instruction Following a more important rubrics than Truthfulness."
The following example prompt asks the judge model to evaluate a text task using chain-of-thought prompting (a sketch that wires this rubric into a pairwise metric follows the prompt):
# Rubrics
Your mission is to judge responses from two AI models, Model A and Model B, and decide which is better. You will be given the previous conversations between the user and the model, a prompt, and responses from both models.
Please use the following rubric criteria to judge the responses:
<START OF RUBRICS>
Your task is to first analyze each response based on the two rubric criteria: instruction_following, and truthfulness (factual correctness). Start your analysis with "Analysis".
(1) Instruction Listing
Please first list down the instructions in the user query. In general, an instruction is VERY important if it is specifically asked for in the prompt and deviates from the norm. Please highlight such specific keywords.
You should also derive the task type from the prompt and include the task specific implied instructions.
Sometimes, no instruction is available in the prompt. It is your job to infer whether the instruction is to auto-complete the prompt or to ask the LLM for follow-ups.
After listing down instructions, you should rank the instructions in the order of importance.
After that, INDEPENDENTLY check whether response A and response B meet each of the instructions. For each instruction, you should itemize whether the response meets, partially meets, or does not meet the requirement, using reasoning. You should start reasoning first before reaching a conclusion about whether the response satisfies the requirement. Citing examples while reasoning is preferred.
(2) Truthfulness
Compare response A and response B for factual correctness. The one with less hallucinated issues is better.
If the response is in sentences and not too long, you should check every sentence separately.
For longer responses, to check factual correctness, focus specifically on places where response A and B differ. Find the correct information in the text to decide if one is more truthful than the other or they are about the same.
If you cannot determine the validity of claims made in the response, or the response is a punt ("I am not able to answer that type of question"), the response has no truthfulness issues.
The truthfulness check is not applicable in the majority of creative writing cases ("write me a story about a unicorn on a parade").
Writing quality/style should NOT be used to judge the response quality unless it was requested by the user.
In the end, express your final verdict in one of the following choices:
1. Response A is better: [[A>B]]
2. Tie, relatively the same: [[A=B]]
3. Response B is better: [[B>A]]
Example of final verdict: "My final verdict is tie, relatively the same: [[A=B]]".
When evaluating the final response quality, please treat Instruction Following as a more important rubric than Truthfulness.
When both responses fully meet the instruction following and truthfulness criteria, it is a tie.
<END OF RUBRICS>
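To use a custom rubric like the one above as the judge prompt, you can wrap it in a pairwise model-based metric and run the evaluation (steps 2 and 3 of the process). This is a minimal sketch assuming the `PairwiseMetric` and `EvalTask` classes from `vertexai.evaluation`; the project values, metric name, experiment name, and the `{prompt}`, `{response}`, and `{baseline_model_response}` placeholders (filled from dataset columns of the same names) are assumptions.

```python
import vertexai
from vertexai.evaluation import EvalTask, PairwiseMetric

# Hypothetical project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# Paste the chain-of-thought rubric shown above into this string.
cot_rubric = """..."""

custom_pairwise_metric = PairwiseMetric(
    metric="cot_instruction_following",  # hypothetical metric name
    metric_prompt_template=(
        cot_rubric
        + "\n\n# User query\n{prompt}\n"
        + "\n# Response A\n{response}\n"
        + "\n# Response B\n{baseline_model_response}\n"
    ),
)

# Steps 2 and 3: evaluate the dataset from step 1 with the modified judge.
eval_task = EvalTask(
    dataset=eval_dataset,  # the DataFrame built in step 1
    metrics=[custom_pairwise_metric],
    experiment="custom-judge-prompt",  # hypothetical experiment name
)
result = eval_task.evaluate()
print(result.summary_metrics)
```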
Guide model reasoning with rating guidelines
Use rating guidelines to help the judge model evaluate model reasoning. Rating guidelines are different from rating criteria.
For example, the following prompt uses rating criteria. These criteria instruct the judge model to rate the instruction following task on a rubric of major issues, minor issues, and no issues:
Your task is to first analyze each response based on the four rubric criteria: verbosity, instruction_following, truthfulness (code correctness), and (coding) executability. Please note that the model responses should follow the "response system instruction" (if provided). Format your judgment in the following way:
Response A - verbosity:too short|too verbose|just right
Response A - instruction_following:major issues|minor issues|no issues
Response A - truthfulness:major issues|minor issues|no issues
Response A - executability:no|no code present|yes-fully|yes-partially
Then do the same for response B.
After the rubric judgements, you should also give a brief rationale to summarize your evaluation, considering each individual criterion as well as the overall quality, in a new paragraph starting with "Reason: ".
In the last line, express your final judgment in the format of: "Which response is better: [[verdict]]" where "verdict" is one of {Response A is much better, Response A is better, Response A is slightly better, About the same, Response B is slightly better, Response B is better, Response B is much better}. Do not use markdown format or output anything else.
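Because this prompt pins the judge model's last line to a fixed format, you can extract the verdict from the raw judge output with a small amount of post-processing. The helper below is a hypothetical sketch, not part of the Gen AI evaluation service API.

```python
import re

# Verdict labels mirror the set listed in the prompt above.
_VERDICTS = {
    "Response A is much better",
    "Response A is better",
    "Response A is slightly better",
    "About the same",
    "Response B is slightly better",
    "Response B is better",
    "Response B is much better",
}

def parse_verdict(judge_output: str) -> str | None:
    """Return the [[verdict]] from the judge model's final line, or None."""
    match = re.search(r"Which response is better:\s*\[\[(.+?)\]\]", judge_output)
    if match and match.group(1).strip() in _VERDICTS:
        return match.group(1).strip()
    return None

# Example usage with a hypothetical judge output.
print(parse_verdict("Reason: ...\nWhich response is better: [[Response B is slightly better]]"))
```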
The following prompt uses rating guidelines to help the judge model evaluate the instruction following task:
You are a judge for coding related tasks for LLMs. You will be provided with a coding prompt, and two responses (Response A and Response B) attempting to answer the prompt. Your task is to evaluate each response based on the following criteria:
Correctness: Does the code produce the correct output and solve the problem as stated?
Executability: Does the code run without errors?
Instruction Following: Does the code adhere to the given instructions and constraints?
Please think about the three criteria, and provide a side-by-side comparison rating to indicate which one is better.
Calibrate the judge model with reference answers
You can calibrate the judge model with reference answers for some or all of the prompts. A sketch showing how to supply reference answers in the evaluation dataset follows the last example in this section.
The following prompt guides the judge model on how to use the reference answers:
"Note that you can compare the responses with the reference answer to make your judgment, but the reference answer may not be the only correct answer to the query."
The following example uses reasoning, chain-of-thought prompting, and rating guidelines to walk through the evaluation procedure for the instruction following task:
# Rubrics
Your mission is to judge responses from two AI models, Model A and Model B, and decide which is better. You will be given a user query, source summaries, and responses from both models. A reference answer
may also be provided - note that you can compare the responses with the reference answer to make your judgment, but the reference answer may not be the only correct answer to the query.
Please use the following rubric criteria to judge the responses:
<START OF RUBRICS>
Your task is to first analyze each response based on the three rubric criteria: grounding, completeness, and instruction_following. Start your analysis with "Analysis".
(1) Grounding
Please first read through all the given sources in the source summaries carefully and make sure you understand the key points in each one.
After that, INDEPENDENTLY check if response A and response B use ONLY the given sources in the source summaries to answer the user query. It is VERY important to check that all
statements in the response MUST be traceable back to the source summaries and ACCURATELY cited.
(2) Completeness
Please first list down the aspects in the user query. After that, INDEPENDENTLY check whether response A and response B cover each of the aspects by using ALL RELEVANT information from the sources.
(3) Instruction Following
Please read through the following instruction following rubrics carefully. After that, INDEPENDENTLY check whether response A and response B successfully follow each of the instruction following rubrics.
* Does the response provide a final answer based on summaries of 3 potential answers to a user query?
* Does the response only use the technical sources provided that are relevant to the query?
* Does the response use only information from sources provided?
* Does the response select all the sources that provide helpful details to answer the question in the Technical Document?
* If the sources have significant overlapping or duplicate details, does the response select sources which are most detailed and comprehensive?
* For each selected source, does the response prepend source citations?
* Does the response use the format: "Source X", where X represents the order in which the technical source appeared in the input?
* Does the response use the original source(s) directly in its response, presenting each source in its entirety, word-for-word, without omitting or altering any details?
* Does the response create a coherent technical final answer from selected Sources without inter-mixing text from any of the Sources?
Writing quality/style can be considered, but should NOT be used as critical rubric criteria to judge the response quality.
In the end, express your final verdict in one of the following choices:
1. Response A is better: [[A>B]]
2. Tie, relatively the same: [[A=B]]
3. Response B is better: [[B>A]]
Example of final verdict: "My final verdict is tie, relatively the same: [[A=B]]".
When both responses fully meet the grounding, completeness, and instruction following criteria, it is a tie.
<END OF RUBRICS>
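To supply the reference answers mentioned above, you can add a reference column to the evaluation dataset and a matching placeholder in the metric prompt template. The sketch below assumes the dataset column and placeholder are both named `reference`; the metric name and the rubric variable are placeholders for this sketch.

```python
import pandas as pd
from vertexai.evaluation import EvalTask, PairwiseMetric

# Assumes vertexai.init(...) has been called as in the earlier sketch.
# Paste the grounding rubric shown above into this string.
grounding_rubric = """..."""

# A reference answer can be provided for some or all prompts; leave the
# field empty for rows without one. Column names are assumptions and must
# match the placeholders in the metric prompt template.
calibrated_dataset = pd.DataFrame(
    {
        "prompt": ["<user query>"],
        "response": ["<response from model A>"],
        "baseline_model_response": ["<response from model B>"],
        "reference": ["<reference answer for this prompt, if available>"],
    }
)

grounded_metric = PairwiseMetric(
    metric="grounded_instruction_following",  # hypothetical metric name
    metric_prompt_template=(
        grounding_rubric
        + "\n\n# Reference answer\n{reference}\n"
        + "\n# User query\n{prompt}\n"
        + "\n# Response A\n{response}\n"
        + "\n# Response B\n{baseline_model_response}\n"
    ),
)

result = EvalTask(dataset=calibrated_dataset, metrics=[grounded_metric]).evaluate()
print(result.summary_metrics)
```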
What's next
- Run an evaluation with your modified judge model.