The following examples use the evaluation client (`client.evals`) to generate rubrics, run inference, and run evaluations, and then call `.show()` on each result to display a report. They assume that `client` is an initialized client with access to the evaluation service and that `prompts_df` is a `pandas.DataFrame` of prompts.

    # Example: Generate rubrics using a predefined method
    data_with_rubrics = client.evals.generate_rubrics(
        src=prompts_df,
        rubric_group_name="general_quality_rubrics",
        predefined_spec_name=types.RubricMetric.GENERAL_QUALITY,
    )

    # Display the dataset with the generated rubrics
    data_with_rubrics.show()
    # First, run inference to get an EvaluationDataset
    gpt_response = client.evals.run_inference(
        model='gpt-4o',
        src=prompts_df,
    )

    # Now, visualize the inference results
    gpt_response.show()
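The `EvaluationDataset` returned by `run_inference` can be passed directly as the `dataset` argument of `client.evals.evaluate()`, as the next examples show.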
    # First, run an evaluation on a single candidate.
    # eval_dataset is an EvaluationDataset, for example the output of run_inference.
    eval_result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[
            types.RubricMetric.TEXT_QUALITY,
            types.RubricMetric.FLUENCY,
            types.Metric(name='rouge_1'),
        ],
    )

    # Visualize the detailed evaluation report
    eval_result.show()
In all of these reports, you can expand the **View Raw JSON** section to inspect the data in any structured format, such as the Gemini or OpenAI Chat Completion API format.
    # Example of comparing two models
    inference_result_1 = client.evals.run_inference(
        model="gemini-2.0-flash",
        src=prompts_df,
    )
    inference_result_2 = client.evals.run_inference(
        model="gemini-2.5-flash",
        src=prompts_df,
    )

    comparison_result = client.evals.evaluate(
        dataset=[inference_result_1, inference_result_2],
        metrics=[types.PrebuiltMetric.TEXT_QUALITY],
    )
    comparison_result.show()
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[],[],null,["# View and interpret evaluation results\n\nThis page describes how to view and interpret your model evaluation results after running your model evaluation.\n\nView evaluation results\n-----------------------\n\nAfter you define your evaluation task, run the task to get\nevaluation results, as follows: \n\n from vertexai.evaluation import EvalTask\n\n eval_result = EvalTask(\n dataset=DATASET,\n metrics=[METRIC_1, METRIC_2, METRIC_3],\n experiment=EXPERIMENT_NAME,\n ).evaluate(\n model=MODEL,\n experiment_run=EXPERIMENT_RUN_NAME,\n )\n\nThe `EvalResult` class represents the result of an evaluation run with the following attributes:\n\n- **`summary_metrics`**: A dictionary of aggregated evaluation metrics for an evaluation run.\n- **`metrics_table`** : A `pandas.DataFrame` table containing evaluation dataset inputs, responses, explanations, and metric results per row.\n- **`metadata`**: the experiment name and experiment run name for the evaluation run.\n\nThe `EvalResult` class is defined as follows: \n\n @dataclasses.dataclass\n class EvalResult:\n \"\"\"Evaluation result.\n\n Attributes:\n summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.\n metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,\n responses, explanations, and metric results per row.\n metadata: the experiment name and experiment run name for the evaluation run.\n \"\"\"\n\n summary_metrics: Dict[str, float]\n metrics_table: Optional[\"pd.DataFrame\"] = None\n metadata: Optional[Dict[str, str]] = None\n\nWith the use of helper functions, the evaluation results can be displayed in the\n[Colab notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/intro_to_gen_ai_evaluation_service_sdk.ipynb) as follows:\n\nVisualize evaluation results\n----------------------------\n\nYou can plot summary metrics in a radar or bar chart for visualization and\ncomparison between results from different evaluation runs. This visualization\ncan be helpful for evaluating different models and different prompt templates.\n\nIn the following example, we visualize four metrics (coherence, fluency, instruction following and overall text quality) for responses generated using four different prompt templates. From the radar and bar plot, we can infer that prompt template #2 consistently outperforms the other templates across all four metrics. This is particularly evident in its significantly higher scores for instruction following and text quality. 
Understand metric results
-------------------------

The following tables list the components of the instance-level and aggregate results that appear in `metrics_table` and `summary_metrics`, respectively, for `PointwiseMetric`, `PairwiseMetric`, and computation-based metrics.

### `PointwiseMetric`

#### Instance-level results

| Component | Description |
|---|---|
| `score` | The score the judge model assigns to the response, based on the metric's criteria and rating rubric. |
| `explanation` | The judge model's reasoning for the score. |

> **Note:** Results for translation metrics only include `score`.

#### Aggregate results

| Component | Description |
|---|---|
| `mean` | The average of the instance-level scores. |
| `std` | The standard deviation of the instance-level scores. |

### `PairwiseMetric`

#### Instance-level results

| Component | Description |
|---|---|
| `pairwise_choice` | Which response the judge model prefers: the `CANDIDATE` response or the `BASELINE` response. |
| `explanation` | The judge model's reasoning for the choice. |

#### Aggregate results

| Component | Description |
|---|---|
| `candidate_model_win_rate` | The fraction of instances where the candidate response is preferred. |
| `baseline_model_win_rate` | The fraction of instances where the baseline response is preferred. |

### Computation-based metrics

#### Instance-level results

| Component | Description |
|---|---|
| `score` | The computed metric value for the instance (for example, a ROUGE score). |

#### Aggregate results

| Component | Description |
|---|---|
| `mean` | The average of the instance-level scores. |
| `std` | The standard deviation of the instance-level scores. |

Examples
--------

The examples in this section demonstrate how to read and understand the evaluation results.

### Example 1

The first example evaluates a single pointwise instance for `TEXT_QUALITY`. The pointwise `TEXT_QUALITY` score is 4 on a scale of 1 to 5, which means the response is good. The explanation in the evaluation result also shows why the judge model gave the response a score of 4, rather than a higher or lower score.

#### Dataset

- `prompt`: "Summarize the following text in a way that a five-year-old can understand: Social Media Platform Faces Backlash Over Content Moderation Policies\nA prominent social media platform finds itself embroiled in controversy as users and content creators express discontent over its content moderation policies. Allegations of biased censorship, inconsistent enforcement, and suppression of certain viewpoints have sparked outrage among users who claim that the platform is stifling free speech. On the other hand, the platform asserts that its policies are designed to maintain a safe and inclusive online environment. This controversy raises broader questions about the role of social media in shaping public discourse and the responsibilities of platforms in curating online content."

- `response`: "People are upset with a website because they think it's not being fair with what people can post. Some people say the website is stopping them from saying what they want. But the website says it's trying to keep everyone safe. This makes people wonder if websites should control what people can say online."

#### Result

- `score`: 4
- `explanation`: The summary in response follows the instruction to summarize the context in a way that a five-year old can understand. It is grounded in the context and provides important details in its summarization. However, the language used in the response is a bit verbose.
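Results like those in Example 1 can also be read programmatically from the `EvalResult` object instead of the rendered report. The sketch below assumes that the metric columns in `metrics_table` and the keys in `summary_metrics` follow a `<metric name>/<component>` pattern such as `text_quality/score`; inspect `metrics_table.columns` on your own results for the exact names.

    # Aggregate results for the whole evaluation run.
    print(eval_result.summary_metrics)

    # Instance-level results: one row per prompt/response pair.
    # Column names such as "text_quality/score" are illustrative assumptions.
    row = eval_result.metrics_table.iloc[0]
    print("Prompt:     ", row["prompt"])
    print("Response:   ", row["response"])
    print("Score:      ", row["text_quality/score"])
    print("Explanation:", row["text_quality/explanation"])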
### Example 2

The second example is a pairwise comparison evaluated with `PAIRWISE_QUESTION_ANSWERING_QUALITY`. The `pairwise_choice` result shows that the judge model prefers the candidate response, "France is a country located in Western Europe.", over the baseline response, "France is a country.", as an answer to the question in the prompt. As with pointwise results, an explanation is also provided for why the candidate response is better than the baseline response (in this case, the candidate response is more helpful).

#### Dataset

- `prompt`: "Can you answer where France is based on the following paragraph? France is a country located in Western Europe. It's bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France's coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."

- `response`: "France is a country located in Western Europe."

- `baseline_model_response`: "France is a country."

#### Result

- `pairwise_choice`: CANDIDATE
- `explanation`: BASELINE response is grounded but does not fully answer the question. CANDIDATE response, however, is correct and provides helpful details on the location of France.

What's next
-----------

- Try an [evaluation example notebook](/vertex-ai/generative-ai/docs/models/evaluation-overview#use_cases).

- Learn about [generative AI evaluation](/vertex-ai/generative-ai/docs/models/evaluation-overview).