For straightforward evaluations, you can use a pandas.DataFrame. The Gen AI evaluation service looks for common column names such as prompt, response, and reference. This format is fully backward compatible.
    import pandas as pd

    # Example DataFrame with prompts and ground truth references
    prompts_df = pd.DataFrame({
        "prompt": [
            "What is the capital of France?",
            "Who wrote 'Hamlet'?",
        ],
        "reference": [
            "Paris",
            "William Shakespeare",
        ]
    })

    # You can use this DataFrame directly with run_inference or evaluate
    eval_dataset = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
    eval_result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
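If your prompts and references already live in a file, you can load them into a DataFrame before running inference. The following is a minimal sketch; the file name prompts.csv and its column layout are hypothetical, chosen only to match the column names the service looks for.

    import pandas as pd

    # Hypothetical CSV with one example per row and columns named to match
    # what the Gen AI evaluation service looks for: "prompt" and "reference"
    prompts_df = pd.read_csv("prompts.csv")

    # If the file already contains model outputs, a "response" column can be
    # included as well and the DataFrame passed straight to evaluate()
    print(prompts_df.columns.tolist())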
Gemini batch prediction format
You can directly use the output of a Vertex AI batch prediction job, which is typically a JSONL file stored in Cloud Storage where each line contains a request and a response object. The Gen AI evaluation service parses this structure automatically, which makes it easy to integrate with other Vertex AI services.
The following is an example of a single line in a JSONL file:
{"request":{"contents":[{"role":"user","parts":[{"text":"Why is the sky blue?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}
You can then evaluate the pre-generated responses from the batch job directly:
    # Cloud Storage path to your batch prediction output file
    batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

    # Evaluate the pre-generated responses directly
    eval_result = client.evals.evaluate(
        dataset=batch_job_output_uri,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
OpenAI Chat Completion format
To evaluate or compare third-party models such as those from OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset in which each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service detects this format automatically.
The following is an example of a single line in this format:
{"request":{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What's the capital of France?"}],"model":"gpt-4o"}}
You can use this data to generate responses from a third-party model and then evaluate them:
    # Ensure your third-party API key is set
    # e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'

    openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

    # Generate responses using a LiteLLM-supported model string
    openai_responses = client.evals.run_inference(
        model="gpt-4o",  # LiteLLM compatible model string
        src=openai_request_uri,
    )

    # The resulting openai_responses object can then be evaluated
    eval_result = client.evals.evaluate(
        dataset=openai_responses,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
Handling multi-turn conversations
The Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a conversation history (for example, in the request.contents field in the Gemini format, or the request.messages field in the OpenAI format), the service identifies the previous turns and processes them as conversation_history. Consider the following example of a multi-turn conversation in Gemini format:
{"request":{"contents":[{"role":"user","parts":[{"text":"I'm planning a trip to Paris."}]},{"role":"model","parts":[{"text":"That sounds wonderful! What time of year are you going?"}]},{"role":"user","parts":[{"text":"I'm thinking next spring. What are some must-see sights?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}]}}
The multi-turn conversation is automatically parsed as follows (a conceptual sketch of this mapping appears after the list):
prompt: The last user message is treated as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).

conversation_history: The preceding messages are automatically extracted and provided as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).

response: The model's reply is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
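As a rough illustration of that mapping (not the service's actual implementation), a record in the Gemini format could be split along the lines of the following sketch, assuming the final entry in request.contents is the user's current message, as in the example above.

    # Illustrative only: the Gen AI evaluation service performs this parsing
    # internally; this sketch merely mirrors the mapping described above.
    def split_multi_turn(record):
        contents = record["request"]["contents"]
        # The last entry (assumed to be a user message) becomes the current prompt
        prompt = contents[-1]
        # Everything before it becomes the conversation history
        conversation_history = contents[:-1]
        # The model's reply comes from the response field
        response = record["response"]["candidates"][0]["content"]
        return prompt, conversation_history, response

Calling split_multi_turn on the parsed example record returns the same three pieces described in the list above.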
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Prepare your evaluation dataset\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nThis page describes how to prepare your dataset for the Gen AI evaluation service.\n\nOverview\n--------\n\nThe Gen AI evaluation service automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions.\n\nThe fields you need to provide in your dataset depend on your goal:\n\nWhen running `client.evals.evaluate()`, the Gen AI evaluation service automatically looks for the following common fields in your dataset:\n\n- `prompt`: (Required) The input to the model that you want to evaluate. For best results, you should provide example prompts that represent the types of inputs that your models process in production.\n\n- `response`: (Required) The output generated by the model or application that is being evaluated.\n\n- `reference`: (Optional) The ground truth or \"golden\" answer that you can compare the model's response against. This field is often required for computation-based metrics like `bleu` and `rouge`.\n\n- `conversation_history`: (Optional) A list of preceding turns in a multi-turn conversation. The Gen AI evaluation service automatically extracts this field from supported formats. For more information, see [Handling multi-turn conversations](#handle-multi-turn).\n\nSupported data formats\n----------------------\n\nThe Gen AI evaluation service supports the following formats:\n\n- [Pandas DataFrame (flattened format)](#pandas-dataframe)\n\n- [Gemini batch prediction format (JSONL)](#gemini-batch-prediction-format)\n\n- [OpenAI chat completion format (JSONL)](#openai-chat-completion-format)\n\n### Pandas DataFrame\n\nFor straightforward evaluations, you can use a `pandas.DataFrame`. The Gen AI evaluation service looks for common column names like `prompt`, `response`, and `reference`. This format is fully backward-compatible. \n\n import pandas as pd\n\n # Example DataFrame with prompts and ground truth references\n prompts_df = pd.DataFrame({\n \"prompt\": [\n \"What is the capital of France?\",\n \"Who wrote 'Hamlet'?\",\n ],\n \"reference\": [\n \"Paris\",\n \"William Shakespeare\",\n ]\n })\n\n # You can use this DataFrame directly with run_inference or evaluate\n eval_dataset = client.evals.run_inference(model=\"gemini-2.5-flash\", src=prompts_df)\n eval_result = client.evals.evaluate(\n dataset=eval_dataset,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### Gemini batch prediction format\n\nYou can directly use the output of a Vertex AI batch prediction job, which are typically JSONL files stored in Cloud Storage, where each line contains a request and response object. 
The Gen AI evaluation service parses this structure automatically to provide integration with other Vertex AI services.\n\nThe following is an example of a single line in a JSONl file: \n\n {\"request\": {\"contents\": [{\"role\": \"user\", \"parts\": [{\"text\": \"Why is the sky blue?\"}]}]}, \"response\": {\"candidates\": [{\"content\": {\"role\": \"model\", \"parts\": [{\"text\": \"The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering.\"}]}}]}}\n\nYou can then evaluate pre-generated responses from a batch job directly: \n\n # Cloud Storage path to your batch prediction output file\n batch_job_output_uri = \"gs://path/to/your/batch_output.jsonl\"\n\n # Evaluate the pre-generated responses directly\n eval_result = client.evals.evaluate(\n dataset=batch_job_output_uri,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### OpenAI Chat Completion format\n\nFor evaluating or comparing with third-party models such as OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service automatically detects this format.\n\nThe following is an example of a single line in this format: \n\n {\"request\": {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What's the capital of France?\"}], \"model\": \"gpt-4o\"}}\n\nYou can use this data to generate responses from a third-party model and evaluate the responses: \n\n # Ensure your third-party API key is set\n # e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'\n\n openai_request_uri = \"gs://path/to/your/openai_requests.jsonl\"\n\n # Generate responses using a LiteLLM-supported model string\n openai_responses = client.evals.run_inference(\n model=\"gpt-4o\", # LiteLLM compatible model string\n src=openai_request_uri,\n )\n\n # The resulting openai_responses object can then be evaluated\n eval_result = client.evals.evaluate(\n dataset=openai_responses,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### Handling multi-turn conversations\n\nThe Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a history of exchanges (such as within the `request.contents` field in the Gemini format, or `request.messages` in the OpenAI format), the Gen AI evaluation service identifies the previous turns and processes them as `conversation_history`.\n\nThis means you don't need to manually separate the current prompt from the prior conversation, since the evaluation metrics can use the conversation history to understand the context of the model's response.\n\nConsider the following example of a multi-turn conversation in Gemini format: \n\n {\n \"request\": {\n \"contents\": [\n {\"role\": \"user\", \"parts\": [{\"text\": \"I'm planning a trip to Paris.\"}]},\n {\"role\": \"model\", \"parts\": [{\"text\": \"That sounds wonderful! What time of year are you going?\"}]},\n {\"role\": \"user\", \"parts\": [{\"text\": \"I'm thinking next spring. 
What are some must-see sights?\"}]}\n ]\n },\n \"response\": {\n \"candidates\": [\n {\"content\": {\"role\": \"model\", \"parts\": [{\"text\": \"For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre.\"}]}}\n ]\n }\n }\n\nThe multi-turn conversation is automatically parsed as follows:\n\n- `prompt`: The last user message is identified as the current prompt (`{\"role\": \"user\", \"parts\": [{\"text\": \"I'm thinking next spring. What are some must-see sights?\"}]}`).\n\n- `conversation_history`: The preceding messages are automatically extracted and made available as the conversation history (`[{\"role\": \"user\", \"parts\": [{\"text\": \"I'm planning a trip to Paris.\"}]}, {\"role\": \"model\", \"parts\": [{\"text\": \"That sounds wonderful! What time of year are you going?\"}]}]`).\n\n- `response`: The model's reply is taken from the `response` field (`{\"role\": \"model\", \"parts\": [{\"text\": \"For spring in Paris...\"}]}`).\n\nWhat's next\n-----------\n\n- [Run an evaluation](/vertex-ai/generative-ai/docs/models/run-evaluation)."]]