Prepare your evaluation dataset

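For the Gen AI Evaluation Service, an evaluation dataset typically consists of the model response that you want to evaluate, the input data used to generate that response, and the ground truth response.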

Evaluation dataset schema

For typical model-based metric use cases, your dataset must provide the following information:

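Input type	Input field contents
prompt	User input for the generative AI model or application. Optional in some cases.
response	The LLM inference response to be evaluated.
baseline_model_response (required by pairwise metrics)	The baseline LLM inference response that the LLM response is compared against in a pairwise evaluation.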

If you use the Gen AI Evaluation module of the Vertex AI SDK for Python, the Gen AI Evaluation Service can automatically generate the response and baseline_model_response with the model that you specify.

For other evaluation use cases, you might need to provide additional information.

Multi-turn or chat

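Input type	Input field contents
history	The history of the conversation between the user and the model before the current turn.
prompt	User input for the generative AI model or application in the current turn.
response	The LLM inference response to be evaluated, which is based on the history and the current-turn prompt.
baseline_model_response (required by pairwise metrics)	The baseline LLM inference response that the LLM response is compared against in a pairwise evaluation, which is based on the history and the current-turn prompt.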
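
As a rough sketch of how a multi-turn dataset could be laid out in a Pandas DataFrame, the example below places a history column next to prompt and response. Storing the history as a plain-text transcript is an assumption made for illustration; this page does not specify the exact history format that the service expects, and the conversation content is invented.

```python
import pandas as pd

# Assumption: the prior conversation is stored as a plain-text transcript.
# The exact history format expected by the service is not specified here.
history = [
    "user: What is the capital of France?\n"
    "model: The capital of France is Paris.",
]
prompts = ["How many people live there?"]
responses = ["About two million people live in the city of Paris."]

# One row per evaluated turn: prior turns, current prompt, current response.
multi_turn_dataset = pd.DataFrame({
    "history": history,
    "prompt": prompts,
    "response": responses,
})

print(multi_turn_dataset.columns.tolist())
```

For pairwise metrics, a baseline_model_response column would be added in the same way.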

Computation-based metrics

Your dataset must provide both the response from the large language model and a reference to compare it against.

Input type	Input field contents
response	The LLM inference response to be evaluated.
reference	The ground truth to compare the LLM response against.

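Depending on your use case, you can also break the input user prompt down into granular parts, such as instruction and context, and provide a prompt template that assembles them for inference. If needed, you can also provide reference or ground truth information.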

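Input type	Input field contents
instruction	Part of the input user prompt. It is the inference instruction that is sent to the LLM. For example, "Please summarize the following text" is an instruction.
context	User input for the generative AI model or application in the current turn.
reference	The ground truth to compare the LLM response against.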
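
To make the template idea concrete, here is a minimal sketch that assembles a prompt column from instruction and context columns. It uses plain Python string formatting as a stand-in for a prompt template; the template text, the example rows, and the name PROMPT_TEMPLATE are hypothetical and are not part of the service's API.

```python
import pandas as pd

# Hypothetical template; the placeholder names match the dataset columns.
PROMPT_TEMPLATE = "{instruction}\n\nText: {context}"

dataset = pd.DataFrame({
    "instruction": ["Please summarize the following text in one sentence."],
    "context": ["A major city has revealed plans to overhaul its public transportation system."],
    "reference": ["A major city plans to overhaul its public transit."],
})

# Fill the template row by row to build the prompt that is sent for inference.
dataset["prompt"] = dataset.apply(
    lambda row: PROMPT_TEMPLATE.format(
        instruction=row["instruction"], context=row["context"]
    ),
    axis=1,
)

print(dataset["prompt"][0])
```

Keeping instruction and context in separate columns lets the same rows be reused with different templates.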

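The inputs required in the evaluation dataset must match the metrics that you use. For details about customizing metrics, see Define your evaluation metrics and Run an evaluation. For details about including reference data in model-based metrics, see Adapt a metric prompt template to your input data.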
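
One way to catch a mismatch early is to compare the DataFrame's columns against the fields listed in the tables above before running an evaluation. The sketch below is illustrative only: the use-case labels and the helper missing_columns are hypothetical, and the exact requirements depend on the metric and on whether the SDK generates responses for you.

```python
import pandas as pd

# Fields per use case, taken from the tables above.
# The dictionary keys are hypothetical labels, not service identifiers.
REQUIRED_COLUMNS = {
    "pointwise": ["response"],
    "pairwise": ["response", "baseline_model_response"],
    "computation": ["response", "reference"],
}


def missing_columns(dataset: pd.DataFrame, use_case: str) -> list[str]:
    """Return the required columns that the dataset does not contain."""
    return [c for c in REQUIRED_COLUMNS[use_case] if c not in dataset.columns]


dataset = pd.DataFrame({"prompt": ["Hi"], "response": ["Hello!"]})
print(missing_columns(dataset, "pointwise"))  # []
print(missing_columns(dataset, "pairwise"))   # ['baseline_model_response']
```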

Import your evaluation dataset

You can import your dataset in the following formats:

JSONL or CSV file stored in Cloud Storage
BigQuery table
Pandas DataFrame
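
For orientation, the sketch below shows what the same two records look like as JSONL and as CSV, and loads both into a Pandas DataFrame. The in-memory strings are stand-ins for file contents, and the records are invented; reading from Cloud Storage or BigQuery needs the corresponding client setup, which is not shown here.

```python
import io

import pandas as pd

# Stand-in for the contents of a JSONL file: one JSON object per line.
jsonl_text = (
    '{"prompt": "What is 2 + 2?", "response": "The answer is 4."}\n'
    '{"prompt": "Name a primary color.", "response": "Red is a primary color."}\n'
)

# Stand-in for the contents of a CSV file with a header row.
csv_text = (
    "prompt,response\n"
    "What is 2 + 2?,The answer is 4.\n"
    "Name a primary color.,Red is a primary color.\n"
)

# lines=True tells pandas to parse newline-delimited JSON.
jsonl_dataset = pd.read_json(io.StringIO(jsonl_text), lines=True)
csv_dataset = pd.read_csv(io.StringIO(csv_text))

print(jsonl_dataset.shape, csv_dataset.shape)
```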

Evaluation dataset examples

This section shows dataset examples in Pandas DataFrame format. Only a few data records are shown here as examples; evaluation datasets usually contain 100 or more data points. For best practices when preparing a dataset, see the Best practices section.

Pointwise model-based metrics

The following summarization case shows a sample dataset for pointwise model-based metrics:

import pandas as pd

prompts = [
    # Example 1
    (
        "Summarize the text in one sentence: As part of a comprehensive"
        " initiative to tackle urban congestion and foster sustainable urban"
        " living, a major city has revealed ambitious plans for an extensive"
        " overhaul of its public transportation system. The project aims not"
        " only to improve the efficiency and reliability of public transit but"
        " also to reduce the city's carbon footprint and promote eco-friendly"
        " commuting options. City officials anticipate that this strategic"
        " investment will enhance accessibility for residents and visitors"
        " alike, ushering in a new era of efficient, environmentally conscious"
        " urban transportation."
    ),
    # Example 2
    (
        "Summarize the text such that a five-year-old can understand: A team of"
        " archaeologists has unearthed ancient artifacts shedding light on a"
        " previously unknown civilization. The findings challenge existing"
        " historical narratives and provide valuable insights into human"
        " history."
    ),
]

responses = [
    # Example 1
    (
        "A major city is revamping its public transportation system to fight"
        " congestion, reduce emissions, and make getting around greener and"
        " easier."
    ),
    # Example 2
    (
        "Some people who dig for old things found some very special tools and"
        " objects that tell us about people who lived a long, long time ago!"
        " What they found is like a new puzzle piece that helps us understand"
        " how people used to live."
    ),
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
})

Pairwise model-based metrics

The following open-book question-answering case shows a sample dataset for pairwise model-based metrics:

import pandas as pd

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1
    "Steel is the hardest material.",
    # Example 2
    "John Smith.",
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
    "baseline_model_response": baseline_model_responses,
})

Computation-based metrics

Computation-based metrics often require a reference.

import pandas as pd

eval_dataset = pd.DataFrame({
    "response": ["The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."],
    "reference": ["The Roman Senate was filled with exuberance due to successes against Catiline."],
})

Tool-use (function calling) metrics

The following example shows input data for computation-based tool-use metrics:

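import pandas as pd

# Each entry is a JSON string describing the tool calls made by the model.
json_responses = ["""{
    "content": "",
    "tool_calls": [{
        "name": "get_movie_info",
        "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
}"""]

# Each entry is a JSON string describing the expected tool calls.
json_references = ["""{
    "content": "",
    "tool_calls": [{
        "name": "book_tickets",
        "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
}"""]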

eval_dataset = pd.DataFrame({
    "response": json_responses,
    "reference": json_references,
})
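
To see what such records contain once parsed, the sketch below decodes one response and one reference and compares the tool names. It is an illustration of the data format only, not the service's metric implementation; the helper tool_names is hypothetical.

```python
import json

response = (
    '{"content": "", "tool_calls": [{"name": "get_movie_info", '
    '"arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}}]}'
)
reference = (
    '{"content": "", "tool_calls": [{"name": "book_tickets", '
    '"arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}}]}'
)


def tool_names(payload: str) -> list[str]:
    """Return the tool names listed in a JSON-encoded tool-call record."""
    return [call["name"] for call in json.loads(payload)["tool_calls"]]


# The arguments are identical, but the tool names differ.
names_match = tool_names(response) == tool_names(reference)
print(names_match)  # False
```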

Best practices

Follow these best practices when you define your evaluation dataset:

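Provide examples that represent the types of inputs that your models process in production.
Your dataset must include at least one evaluation example. We recommend around 100 examples to get high-quality aggregated metrics and statistically significant results. A dataset of this size raises the confidence level in the aggregated evaluation results, minimizes the influence of outliers, and helps the performance metrics reflect the model's true capabilities across diverse scenarios. When more than 400 examples are provided, the rate of improvement in aggregated metric quality tends to decrease.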
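
The size guidance above can be turned into a quick pre-flight check. The following is a minimal sketch under the thresholds stated in this section (at least 1, about 100 recommended, diminishing returns past 400); the function name describe_dataset_size and the message wording are hypothetical.

```python
import pandas as pd


def describe_dataset_size(dataset: pd.DataFrame) -> str:
    """Classify a dataset by the size guidance in this section."""
    n = len(dataset)
    if n < 1:
        raise ValueError("The dataset must contain at least one example.")
    if n < 100:
        return f"{n} examples: valid, but below the recommended size of about 100."
    if n <= 400:
        return f"{n} examples: within the recommended range."
    return f"{n} examples: valid, but examples beyond 400 tend to add less."


small = pd.DataFrame({"response": ["Hello!"] * 5})
print(describe_dataset_size(small))
```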

What's next

Run an evaluation
Try an evaluation example notebook