Prepare your evaluation dataset

This page describes how to prepare a dataset for use with the Gen AI Evaluation Service.

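Overview

The Gen AI Evaluation Service automatically detects and handles several common data formats. As a result, you can often use your data as-is, without converting it manually.

The fields you need to provide in your dataset depend on your goal: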

Goal	Required data	SDK workflow
Generate new responses, then evaluate them	`prompt`	`run_inference()` → `evaluate()`
Evaluate existing responses	`prompt` and `response`	`evaluate()`

When you run client.evals.evaluate(), the Gen AI Evaluation Service automatically looks for the following common fields in your dataset:

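prompt: (Required) The input to the model that you want to evaluate. For best results, provide example prompts that represent the kinds of inputs your model handles in production.
response: (Required) The output generated by the model or application being evaluated.
reference: (Optional) The ground truth or "golden" answer to compare the model's response against. Computation-based metrics such as bleu and rouge often require this field.
conversation_history: (Optional) A list of the preceding turns in a multi-turn conversation. The Gen AI Evaluation Service automatically extracts this field from supported formats. For more information, see Handling multi-turn conversations.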
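The field requirements above can be checked locally before you call the service. The following is a minimal sketch, not part of the SDK: `REQUIRED_FIELDS`, `missing_fields`, and the goal names are hypothetical and exist only for illustration. It assumes each row is a plain dictionary and treats absent, `None`, or empty values as missing.

```python
# Hypothetical pre-flight check; none of these names come from the SDK.
REQUIRED_FIELDS = {
    "generate_then_evaluate": {"prompt"},          # run_inference() -> evaluate()
    "evaluate_existing": {"prompt", "response"},   # evaluate() only
}


def missing_fields(rows, goal):
    """Return {row_index: [missing field names]} for rows lacking required fields."""
    required = REQUIRED_FIELDS[goal]
    problems = {}
    for i, row in enumerate(rows):
        # Treat absent, None, or empty-string values as missing.
        missing = sorted(f for f in required if row.get(f) in (None, ""))
        if missing:
            problems[i] = missing
    return problems


rows = [
    {"prompt": "What is the capital of France?", "response": "Paris", "reference": "Paris"},
    {"prompt": "Who wrote 'Hamlet'?"},
]

print(missing_fields(rows, "generate_then_evaluate"))  # {}
print(missing_fields(rows, "evaluate_existing"))       # {1: ['response']}
```

The second row has no `response`, so it is suitable for the `run_inference()` workflow but not for evaluating existing responses.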

Supported data formats

The Gen AI Evaluation Service supports the following formats:

Pandas DataFrame (flattened format)
Gemini batch prediction format (JSONL)
OpenAI Chat Completion format (JSONL)

Pandas DataFrame

For simple evaluations, you can use a pandas.DataFrame. The Gen AI Evaluation Service looks for common column names such as prompt, response, and reference. This format is fully backward-compatible.

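import pandas as pd

# NOTE: `client` and `types` are not defined in this snippet; it assumes an initialized
# Vertex AI SDK client and the SDK's types module are already available.

# Example DataFrame with prompts and ground truth references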
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# You can use this DataFrame directly with run_inference or evaluate
eval_dataset = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()

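Gemini batch prediction format

You can use the output of a Vertex AI batch prediction job directly. The output is typically a JSONL file stored in Cloud Storage, in which each line contains a request object and a response object. The Gen AI Evaluation Service parses this structure automatically, which lets it integrate with other Vertex AI services.

The following is an example of a single line in a JSONL file: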

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate the pre-generated responses from the batch job directly:

# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the pre-generated responses directly
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()
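Before pointing evaluate() at a batch output file, it can help to inspect a few lines locally. The following is a minimal sketch that uses only the Python standard library and the request/response structure shown in the JSONL example above. The helpers `extract_text` and `summarize_batch_line` are hypothetical names, and this is not how the service itself parses the file.

```python
import json


def extract_text(content):
    """Join the text parts of a Gemini-style content object."""
    return "".join(part.get("text", "") for part in content.get("parts", []))


def summarize_batch_line(line):
    """Return (last request message text, first candidate text) for one JSONL line."""
    record = json.loads(line)
    contents = record["request"]["contents"]
    candidates = record["response"]["candidates"]
    return extract_text(contents[-1]), extract_text(candidates[0]["content"])


# A shortened version of the sample line, for illustration only.
line = (
    '{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, '
    '"response": {"candidates": [{"content": {"role": "model", '
    '"parts": [{"text": "Rayleigh scattering."}]}}]}}'
)

prompt_text, response_text = summarize_batch_line(line)
print(prompt_text)
print(response_text)
```

A line that raises `KeyError` or `json.JSONDecodeError` here likely does not follow the structure shown in the sample and is worth checking before you run an evaluation.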

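OpenAI Chat Completion format

To evaluate or compare third-party models, such as those from OpenAI and Anthropic, the Gen AI Evaluation Service supports the OpenAI Chat Completion format. You can provide a dataset in which each line is a JSON object structured like an OpenAI API request. The Gen AI Evaluation Service detects this format automatically.

The following is an example of a single line in this format: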

{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}

You can use this data to generate responses from a third-party model and then evaluate those responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'

openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",  # LiteLLM compatible model string
    src=openai_request_uri,
)

# The resulting openai_responses object can then be evaluated
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()
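If your prompts start out as plain strings, a request file in this shape can be assembled with the standard library. The following is a minimal sketch that mirrors the single-line example above; `to_openai_request_line` is a hypothetical helper, and writing the result to a file and uploading it to Cloud Storage is left out.

```python
import json


def to_openai_request_line(prompt, model="gpt-4o", system=None):
    """Build one JSONL line shaped like the OpenAI Chat Completion request example."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return json.dumps({"request": {"messages": messages, "model": model}})


prompts = [
    "What's the capital of France?",
    "Who wrote 'Hamlet'?",
]

lines = [
    to_openai_request_line(p, system="You are a helpful assistant.")
    for p in prompts
]

# One JSON object per line, which is what makes the file JSONL.
jsonl_text = "\n".join(lines) + "\n"
print(jsonl_text)
```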

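Handling multi-turn conversations

The Gen AI Evaluation Service automatically parses multi-turn conversation data in supported formats. When your input data includes a history of exchanges (for example, in the request.contents field in Gemini format or request.messages in OpenAI format), the Gen AI Evaluation Service identifies the preceding turns and treats them as conversation_history.

Because evaluation metrics can use the conversation history to understand the context of the model's response, you don't need to separate the current prompt from the preceding conversation manually.

Consider the following example of a multi-turn conversation in Gemini format: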

{
  "request": {
    "contents": [
      {"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]},
      {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]},
      {"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}
    ]
  },
  "response": {
    "candidates": [
      {"content": {"role": "model", "parts": [{"text": "For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}
    ]
  }
}

The multi-turn conversation is parsed automatically as follows:

prompt: The last user message is identified as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).
conversation_history: The preceding messages are extracted automatically and made available as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).
response: The model's response is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
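The split described above can be reproduced locally to see which messages end up in each field. The following is a minimal sketch of that behavior, not the service's actual implementation. `split_turns` is a hypothetical helper, and it assumes the record contains at least one user message and at least one candidate.

```python
def split_turns(record):
    """Split a Gemini-format record into prompt, conversation_history, and response."""
    contents = record["request"]["contents"]
    # Index of the last user message; raises ValueError if there is none.
    last_user = max(i for i, c in enumerate(contents) if c.get("role") == "user")
    return {
        "prompt": contents[last_user],
        "conversation_history": contents[:last_user],
        "response": record["response"]["candidates"][0]["content"],
    }


record = {
    "request": {
        "contents": [
            {"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]},
            {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]},
            {"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]},
        ]
    },
    "response": {
        "candidates": [
            {"content": {"role": "model", "parts": [{"text": "For spring in Paris, you should definitely visit the Eiffel Tower."}]}}
        ]
    },
}

parsed = split_turns(record)
print(parsed["prompt"]["parts"][0]["text"])
print(len(parsed["conversation_history"]))
print(parsed["response"]["role"])
```

For the sample conversation, the prompt is the third message, the history holds the first two messages, and the response comes from the first candidate.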

What's next

Run an evaluation