For straightforward evaluations, you can use a pandas.DataFrame. The Gen AI evaluation service looks for common column names such as prompt, response, and reference. This format is fully backward compatible.
    import pandas as pd

    # Example DataFrame with prompts and ground truth references
    prompts_df = pd.DataFrame({
        "prompt": [
            "What is the capital of France?",
            "Who wrote 'Hamlet'?",
        ],
        "reference": [
            "Paris",
            "William Shakespeare",
        ]
    })

    # You can use this DataFrame directly with run_inference or evaluate
    eval_dataset = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
    eval_result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
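If your prompts and references already live in a file, you can load them into a DataFrame before running inference. The following is a minimal sketch; the file name prompts.csv and its column layout are hypothetical, chosen only to match the column names the service looks for.

    import pandas as pd

    # Hypothetical CSV with one example per row and columns named to match
    # what the Gen AI evaluation service looks for: "prompt" and "reference"
    prompts_df = pd.read_csv("prompts.csv")

    # If the file already contains model outputs, a "response" column can be
    # included as well and the DataFrame passed straight to evaluate()
    print(prompts_df.columns.tolist())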
Gemini batch prediction format
You can directly use the output of a Vertex AI batch prediction job, which is typically a JSONL file stored in Cloud Storage where each line contains a request and a response object. The Gen AI evaluation service parses this structure automatically, which makes it easy to integrate with other Vertex AI services.
The following is an example of a single line in a JSONL file:
{"request":{"contents":[{"role":"user","parts":[{"text":"Why is the sky blue?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}
You can then evaluate the pre-generated responses from the batch job directly:
    # Cloud Storage path to your batch prediction output file
    batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

    # Evaluate the pre-generated responses directly
    eval_result = client.evals.evaluate(
        dataset=batch_job_output_uri,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
OpenAI Chat Completion format
To evaluate or compare third-party models such as those from OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset in which each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service detects this format automatically.
The following is an example of a single line in this format:
{"request":{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What's the capital of France?"}],"model":"gpt-4o"}}
You can use this data to generate responses from a third-party model and then evaluate them:
    # Ensure your third-party API key is set
    # e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'

    openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

    # Generate responses using a LiteLLM-supported model string
    openai_responses = client.evals.run_inference(
        model="gpt-4o",  # LiteLLM compatible model string
        src=openai_request_uri,
    )

    # The resulting openai_responses object can then be evaluated
    eval_result = client.evals.evaluate(
        dataset=openai_responses,
        metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
    )
    eval_result.show()
Handling multi-turn conversations
The Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a conversation history (for example, in the request.contents field in the Gemini format, or the request.messages field in the OpenAI format), the service identifies the previous turns and processes them as conversation_history. Consider the following example of a multi-turn conversation in Gemini format:
{"request":{"contents":[{"role":"user","parts":[{"text":"I'm planning a trip to Paris."}]},{"role":"model","parts":[{"text":"That sounds wonderful! What time of year are you going?"}]},{"role":"user","parts":[{"text":"I'm thinking next spring. What are some must-see sights?"}]}]},"response":{"candidates":[{"content":{"role":"model","parts":[{"text":"For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}]}}
The multi-turn conversation is automatically parsed as follows (a conceptual sketch of this mapping appears after the list):
prompt: The last user message is treated as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).

conversation_history: The preceding messages are automatically extracted and provided as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).

response: The model's reply is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
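As a rough illustration of that mapping (not the service's actual implementation), a record in the Gemini format could be split along the lines of the following sketch, assuming the final entry in request.contents is the user's current message, as in the example above.

    # Illustrative only: the Gen AI evaluation service performs this parsing
    # internally; this sketch merely mirrors the mapping described above.
    def split_multi_turn(record):
        contents = record["request"]["contents"]
        # The last entry (assumed to be a user message) becomes the current prompt
        prompt = contents[-1]
        # Everything before it becomes the conversation history
        conversation_history = contents[:-1]
        # The model's reply comes from the response field
        response = record["response"]["candidates"][0]["content"]
        return prompt, conversation_history, response

Calling split_multi_turn on the parsed example record returns the same three pieces described in the list above.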
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# Prepare your evaluation dataset\n\n| **Preview**\n|\n|\n| This feature is subject to the \"Pre-GA Offerings Terms\" in the General Service Terms section\n| of the [Service Specific Terms](/terms/service-terms#1).\n|\n| Pre-GA features are available \"as is\" and might have limited support.\n|\n| For more information, see the\n| [launch stage descriptions](/products#product-launch-stages).\n\nThis page describes how to prepare your dataset for the Gen AI evaluation service.\n\nOverview\n--------\n\nThe Gen AI evaluation service automatically detects and handles several common data formats. This means you can often use your data as-is without needing to perform manual conversions.\n\nThe fields you need to provide in your dataset depend on your goal:\n\nWhen running `client.evals.evaluate()`, the Gen AI evaluation service automatically looks for the following common fields in your dataset:\n\n- `prompt`: (Required) The input to the model that you want to evaluate. For best results, you should provide example prompts that represent the types of inputs that your models process in production.\n\n- `response`: (Required) The output generated by the model or application that is being evaluated.\n\n- `reference`: (Optional) The ground truth or \"golden\" answer that you can compare the model's response against. This field is often required for computation-based metrics like `bleu` and `rouge`.\n\n- `conversation_history`: (Optional) A list of preceding turns in a multi-turn conversation. The Gen AI evaluation service automatically extracts this field from supported formats. For more information, see [Handling multi-turn conversations](#handle-multi-turn).\n\nSupported data formats\n----------------------\n\nThe Gen AI evaluation service supports the following formats:\n\n- [Pandas DataFrame (flattened format)](#pandas-dataframe)\n\n- [Gemini batch prediction format (JSONL)](#gemini-batch-prediction-format)\n\n- [OpenAI chat completion format (JSONL)](#openai-chat-completion-format)\n\n### Pandas DataFrame\n\nFor straightforward evaluations, you can use a `pandas.DataFrame`. The Gen AI evaluation service looks for common column names like `prompt`, `response`, and `reference`. This format is fully backward-compatible. \n\n import pandas as pd\n\n # Example DataFrame with prompts and ground truth references\n prompts_df = pd.DataFrame({\n \"prompt\": [\n \"What is the capital of France?\",\n \"Who wrote 'Hamlet'?\",\n ],\n \"reference\": [\n \"Paris\",\n \"William Shakespeare\",\n ]\n })\n\n # You can use this DataFrame directly with run_inference or evaluate\n eval_dataset = client.evals.run_inference(model=\"gemini-2.5-flash\", src=prompts_df)\n eval_result = client.evals.evaluate(\n dataset=eval_dataset,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### Gemini batch prediction format\n\nYou can directly use the output of a Vertex AI batch prediction job, which are typically JSONL files stored in Cloud Storage, where each line contains a request and response object. 
The Gen AI evaluation service parses this structure automatically to provide integration with other Vertex AI services.\n\nThe following is an example of a single line in a JSONl file: \n\n {\"request\": {\"contents\": [{\"role\": \"user\", \"parts\": [{\"text\": \"Why is the sky blue?\"}]}]}, \"response\": {\"candidates\": [{\"content\": {\"role\": \"model\", \"parts\": [{\"text\": \"The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering.\"}]}}]}}\n\nYou can then evaluate pre-generated responses from a batch job directly: \n\n # Cloud Storage path to your batch prediction output file\n batch_job_output_uri = \"gs://path/to/your/batch_output.jsonl\"\n\n # Evaluate the pre-generated responses directly\n eval_result = client.evals.evaluate(\n dataset=batch_job_output_uri,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### OpenAI Chat Completion format\n\nFor evaluating or comparing with third-party models such as OpenAI and Anthropic, the Gen AI evaluation service supports the OpenAI Chat Completion format. You can supply a dataset where each row is a JSON object structured like an OpenAI API request. The Gen AI evaluation service automatically detects this format.\n\nThe following is an example of a single line in this format: \n\n {\"request\": {\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What's the capital of France?\"}], \"model\": \"gpt-4o\"}}\n\nYou can use this data to generate responses from a third-party model and evaluate the responses: \n\n # Ensure your third-party API key is set\n # e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'\n\n openai_request_uri = \"gs://path/to/your/openai_requests.jsonl\"\n\n # Generate responses using a LiteLLM-supported model string\n openai_responses = client.evals.run_inference(\n model=\"gpt-4o\", # LiteLLM compatible model string\n src=openai_request_uri,\n )\n\n # The resulting openai_responses object can then be evaluated\n eval_result = client.evals.evaluate(\n dataset=openai_responses,\n metrics=[types.PrebuiltMetric.GENERAL_QUALITY]\n )\n eval_result.show()\n\n### Handling multi-turn conversations\n\nThe Gen AI evaluation service automatically parses multi-turn conversation data from supported formats. When your input data includes a history of exchanges (such as within the `request.contents` field in the Gemini format, or `request.messages` in the OpenAI format), the Gen AI evaluation service identifies the previous turns and processes them as `conversation_history`.\n\nThis means you don't need to manually separate the current prompt from the prior conversation, since the evaluation metrics can use the conversation history to understand the context of the model's response.\n\nConsider the following example of a multi-turn conversation in Gemini format: \n\n {\n \"request\": {\n \"contents\": [\n {\"role\": \"user\", \"parts\": [{\"text\": \"I'm planning a trip to Paris.\"}]},\n {\"role\": \"model\", \"parts\": [{\"text\": \"That sounds wonderful! What time of year are you going?\"}]},\n {\"role\": \"user\", \"parts\": [{\"text\": \"I'm thinking next spring. 
What are some must-see sights?\"}]}\n ]\n },\n \"response\": {\n \"candidates\": [\n {\"content\": {\"role\": \"model\", \"parts\": [{\"text\": \"For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre.\"}]}}\n ]\n }\n }\n\nThe multi-turn conversation is automatically parsed as follows:\n\n- `prompt`: The last user message is identified as the current prompt (`{\"role\": \"user\", \"parts\": [{\"text\": \"I'm thinking next spring. What are some must-see sights?\"}]}`).\n\n- `conversation_history`: The preceding messages are automatically extracted and made available as the conversation history (`[{\"role\": \"user\", \"parts\": [{\"text\": \"I'm planning a trip to Paris.\"}]}, {\"role\": \"model\", \"parts\": [{\"text\": \"That sounds wonderful! What time of year are you going?\"}]}]`).\n\n- `response`: The model's reply is taken from the `response` field (`{\"role\": \"model\", \"parts\": [{\"text\": \"For spring in Paris...\"}]}`).\n\nWhat's next\n-----------\n\n- [Run an evaluation](/vertex-ai/generative-ai/docs/models/run-evaluation)."]]