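Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.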

Gen AI Evaluation Service API

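This guide shows how to use the Gen AI Evaluation Service API to evaluate large language models (LLMs). It covers the following topics:

Metric types: Understand the categories of evaluation metrics that are available.
Example syntax: See example curl and Python requests to the evaluation API.
Parameter details: Learn about the parameters specific to each evaluation metric.
Examples: View complete code samples for common evaluation tasks.

With the Gen AI evaluation service, you can evaluate your LLMs across several metrics using your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and metrics computed in memory, such as rouge, bleu, and the tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes model predictions directly as input, you can run inference and evaluation on any model that Vertex AI supports.

For more information about evaluating a model, see the Gen AI evaluation service overview.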

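Limitations

The evaluation service has the following limitations:

The evaluation service might have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute them.
Some model-based metrics, such as MetricX and COMET, use different machine learning models and don't consume gemini-2.0-flash quota.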
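
Because the first call might hit a propagation delay, a client can retry a failed request a few times before giving up. The following is a minimal sketch rather than an official pattern: `call_with_retry`, its parameters, and the decision to retry on any exception are assumptions made for illustration. In real code, narrow the exception type to the transient errors that your client actually raises.

```python
import time


def call_with_retry(send_request, max_attempts=4, initial_delay_s=1.0, sleep=time.sleep):
    """Calls send_request(), retrying with exponential backoff on failure.

    send_request is a hypothetical zero-argument callable that performs one
    evaluation request and raises an exception when the call fails.
    """
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            sleep(delay)
            delay *= 2  # Double the wait before the next attempt.
```

The `sleep` parameter exists only so that the wait can be replaced in tests; by default the function really pauses between attempts.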

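Metric types

The Gen AI Evaluation Service API provides several categories of metrics that evaluate different aspects of model performance. The following table gives a high-level overview to help you choose the right metric for your use case.

Metric category	Description	Use case
Lexical metrics (for example, `bleu`, `rouge`, `exact_match`)	These metrics compute a score from the textual overlap between the model's prediction and a reference (ground truth) text. They are fast and objective.	Suited to tasks that have a clear "correct" answer, such as translation or fact-based question answering, where quality can be measured by similarity to a reference.
Model-based pointwise metrics (for example, `fluency`, `safety`, `groundedness`, `summarization_quality`)	These metrics use a judge model to assess the quality of a single model's response against specific criteria, such as fluency or safety, without needing a reference answer.	Best for evaluating subjective qualities of generated text, such as the creativity, coherence, or safety of a response, where there is no single right answer.
Model-based pairwise metrics (for example, `pairwise_summarization_quality`)	These metrics use a judge model to compare the responses of two models (for example, a baseline model and a candidate model) and determine which response is better.	Useful for A/B testing, to directly compare two different models or two versions of the same model on the same task.
Tool use metrics (for example, `tool_call_valid`, `tool_name_match`)	These metrics evaluate the model's ability to use tools (function calls) correctly by checking for valid syntax, correct tool names, and accurate parameters.	Essential when you evaluate models that are designed to interact with external APIs or systems through tool calls.
Custom metrics (`pointwise_metric`, `pairwise_metric`)	These metrics provide a flexible framework for defining your own evaluation criteria with a prompt template. The service then uses a judge model to evaluate responses against your custom instructions.	For specialized evaluation tasks where the predefined metrics are insufficient and you need to assess performance against unique, domain-specific requirements.
Specialized metrics (`comet`, `metricx`)	Highly specialized metrics designed for specific tasks, primarily evaluating machine translation quality.	For machine translation tasks that need a nuanced evaluation beyond simple lexical matching.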

Example syntax

The following examples show the syntax for sending an evaluation request.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input": {
    "metric_spec": {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

# Set these to your own project ID and region before running.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# Python f-strings substitute {NAME}; the shell-style ${NAME} form would leave a stray "$" in the URL.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

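Parameter details

This section describes the request and response objects for each evaluation metric.

Request body

The top-level request body contains one of the following metric input objects.

Parameters
`exact_match_input`	Optional: `ExactMatchInput`. Evaluates whether the prediction exactly matches the reference.
`bleu_input`	Optional: `BleuInput`. Computes the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput`. Computes ROUGE scores by comparing the prediction against the reference. The `rouge_type` parameter lets you specify different ROUGE types.
`fluency_input`	Optional: `FluencyInput`. Assesses the language fluency of a single response.
`coherence_input`	Optional: `CoherenceInput`. Assesses the coherence of a single response.
`safety_input`	Optional: `SafetyInput`. Assesses the safety level of a single response.
`groundedness_input`	Optional: `GroundednessInput`. Assesses whether a response is grounded in the provided context.
`fulfillment_input`	Optional: `FulfillmentInput`. Assesses how well a response fulfills the given instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput`. Assesses the overall summarization quality of a response.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput`. Compares the summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput`. Assesses whether a summary is helpful and contains the necessary details from the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput`. Assesses the verbosity of a summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput`. Assesses the overall quality of an answer to a question, based on the provided context.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput`. Compares the quality of two answers to a question, based on the provided context.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput`. Assesses the relevance of an answer to a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput`. Assesses the helpfulness of an answer by checking for key details.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput`. Assesses the correctness of an answer to a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput`. Input for a custom pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput`. Input for a custom pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput`. Assesses whether the response predicts a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput`. Assesses whether the response predicts the correct tool name in a tool call.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput`. Assesses whether the response predicts the correct parameter names in a tool call.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput`. Assesses whether the response predicts the correct parameter names and values in a tool call.
`comet_input`	Optional: `CometInput`. Input for an evaluation that uses COMET.
`metricx_input`	Optional: `MetricxInput`. Input for an evaluation that uses MetricX.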
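
Since each request carries one metric input object, it can help to check a request body on the client before sending it. The sketch below is illustrative only: `METRIC_INPUT_FIELDS` simply copies the field names from the table above, and `validate_request_body` is a hypothetical helper, not part of any Google library.

```python
# Field names copied from the request body table above.
METRIC_INPUT_FIELDS = frozenset({
    "exact_match_input", "bleu_input", "rouge_input",
    "fluency_input", "coherence_input", "safety_input",
    "groundedness_input", "fulfillment_input",
    "summarization_quality_input", "pairwise_summarization_quality_input",
    "summarization_helpfulness_input", "summarization_verbosity_input",
    "question_answering_quality_input",
    "pairwise_question_answering_quality_input",
    "question_answering_relevance_input",
    "question_answering_helpfulness_input",
    "question_answering_correctness_input",
    "pointwise_metric_input", "pairwise_metric_input",
    "tool_call_valid_input", "tool_name_match_input",
    "tool_parameter_key_match_input", "tool_parameter_kv_match_input",
    "comet_input", "metricx_input",
})


def validate_request_body(body):
    """Returns the name of the single metric input in body, or raises ValueError."""
    unknown = sorted(key for key in body if key not in METRIC_INPUT_FIELDS)
    if unknown:
        raise ValueError(f"Unknown top-level fields: {unknown}")
    if len(body) != 1:
        raise ValueError(f"Expected exactly one metric input, found {len(body)}")
    return next(iter(body))
```

For example, `validate_request_body({"bleu_input": {...}})` returns `"bleu_input"`, while a body with both `bleu_input` and `rouge_input` raises `ValueError`.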

Exact match (`exact_match_input`)

Input (ExactMatchInput)

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

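Parameters
`metric_spec`	Optional: `ExactMatchSpec`. Specifies the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]`. One or more evaluation instances, each containing an LLM response and a reference.
`instances.prediction`	Optional: `string`. The LLM response.
`instances.reference`	Optional: `string`. The ground truth or reference response.

Output (ExactMatchResults)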

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]`. An array of evaluation results, one for each input instance.
`exact_match_metric_values.score`	`float`. One of the following: `0`: The instance was not an exact match. `1`: The instance was an exact match.
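
As an illustration of the shapes above, the following sketch builds an `exact_match_input` body from (prediction, reference) pairs and reads the scores back from a response. Both helper names are hypothetical. The parser assumes that the response uses the snake_case field names shown above and treats a missing `score` as `0`; a live REST response might use camelCase names instead, so adjust the keys to what you actually receive.

```python
def build_exact_match_request(pairs):
    """Builds a request body from an iterable of (prediction, reference) pairs."""
    return {
        "exact_match_input": {
            "metric_spec": {},
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


def exact_match_scores(response):
    """Returns one score per instance, in request order."""
    values = response["exact_match_results"]["exact_match_metric_values"]
    # Assumption: a result without an explicit score means 0 (no exact match).
    return [value.get("score", 0.0) for value in values]
```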

BLEU (`bleu_input`)

Input (BleuInput)

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

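Parameters
`metric_spec`	Optional: `BleuSpec`. Specifies the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool`. Specifies whether to take into account n-gram orders that have no match.
`instances`	Optional: `BleuInstance[]`. One or more evaluation instances, each containing an LLM response and a reference.
`instances.prediction`	Optional: `string`. The LLM response.
`instances.reference`	Optional: `string`. The ground truth or reference response.

Output (BleuResults)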

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]`. An array of evaluation results, one for each input instance.
`bleu_metric_values.score`	`float`. A value in the range `[0, 1]`. A higher score means that the prediction is more similar to the reference.

ROUGE (`rouge_input`)

Input (RougeInput)

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

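Parameters
`metric_spec`	Optional: `RougeSpec`. Specifies the metric's behavior.
`metric_spec.rouge_type`	Optional: `string`. Supported values: `rougen[1-9]`: Computes ROUGE scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: Computes ROUGE scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: Splits the prediction and the reference into sentences, and then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool`. Specifies whether to use the Porter stemmer to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool`. Specifies whether to add newlines between sentences for `rougeLsum`.
`instances`	Optional: `RougeInstance[]`. One or more evaluation instances, each containing an LLM response and a reference.
`instances.prediction`	Optional: `string`. The LLM response.
`instances.reference`	Optional: `string`. The ground truth or reference response.

Output (RougeResults)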

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]`. An array of evaluation results, one for each input instance.
`rouge_metric_values.score`	`float`. A value in the range `[0, 1]`. A higher score means that the prediction is more similar to the reference.
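
To make the longest common subsequence (LCS) idea behind `rougeL` concrete, here is a small local sketch. It is not the service's implementation: lowercasing, whitespace tokenization, and reporting an F1 value are all simplifying assumptions, so its numbers can differ from what the API returns.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L style F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With the prediction "the cat" and the reference "the cat sat on", the LCS has length 2, so precision is 1, recall is 0.5, and the F1 value is about 0.67.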

Fluency (`fluency_input`)

Input (FluencyInput)

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec`. Specifies the metric's behavior.
`instance`	Optional: `FluencyInstance`. The evaluation input, which consists of the LLM response.
`instance.prediction`	Optional: `string`. The LLM response.

Output (FluencyResult)

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Inarticulate. `2`: Somewhat inarticulate. `3`: Neutral. `4`: Somewhat fluent. `5`: Fluent.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.
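
The sketch below shows one way to build a fluency request and turn the result into a readable label. The helper names and the `min_confidence` threshold of 0.5 are assumptions for illustration; the label text follows the 1 to 5 scale described above.

```python
FLUENCY_LABELS = {
    1: "Inarticulate",
    2: "Somewhat inarticulate",
    3: "Neutral",
    4: "Somewhat fluent",
    5: "Fluent",
}


def build_fluency_request(prediction):
    """Builds a request body that scores a single response for fluency."""
    return {
        "fluency_input": {
            "metric_spec": {},
            "instance": {"prediction": prediction},
        }
    }


def describe_fluency(result, min_confidence=0.5):
    """Maps a fluency_result object to a label, flagging low-confidence results."""
    label = FLUENCY_LABELS[int(result["score"])]
    if result.get("confidence", 0.0) < min_confidence:
        label += " (low confidence)"
    return label
```

The coherence and safety metrics take the same single-field `instance`, so the same pattern applies with a different top-level key and label table.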

Coherence (`coherence_input`)

Input (CoherenceInput)

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec`. Specifies the metric's behavior.
`instance`	Optional: `CoherenceInstance`. The evaluation input, which consists of the LLM response.
`instance.prediction`	Optional: `string`. The LLM response.

Output (CoherenceResult)

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Incoherent. `2`: Somewhat incoherent. `3`: Neutral. `4`: Somewhat coherent. `5`: Coherent.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Safety (`safety_input`)

Input (SafetyInput)

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec`. Specifies the metric's behavior.
`instance`	Optional: `SafetyInstance`. The evaluation input, which consists of the LLM response.
`instance.prediction`	Optional: `string`. The LLM response.

Output (SafetyResult)

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Unsafe. `1`: Safe.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Groundedness (`groundedness_input`)

Input (GroundednessInput)

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

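Parameter	Description
`metric_spec`	Optional: `GroundednessSpec`. Specifies the metric's behavior.
`instance`	Optional: `GroundednessInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (GroundednessResult)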

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Ungrounded. `1`: Grounded.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Fulfillment (`fulfillment_input`)

Input (FulfillmentInput)

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

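Parameters
`metric_spec`	Optional: `FulfillmentSpec`. Specifies the metric's behavior.
`instance`	Optional: `FulfillmentInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.

Output (FulfillmentResult)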

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: No fulfillment. `2`: Poor fulfillment. `3`: Some fulfillment. `4`: Good fulfillment. `5`: Complete fulfillment.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Summarization quality (`summarization_quality_input`)

Input (SummarizationQualityInput)

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

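Parameters
`metric_spec`	Optional: `SummarizationQualitySpec`. Specifies the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (SummarizationQualityResult)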

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad. `2`: Bad. `3`: OK. `4`: Good. `5`: Very good.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Pairwise summarization quality (`pairwise_summarization_quality_input`)

Input (PairwiseSummarizationQualityInput)

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

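Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec`. Specifies the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance`. The evaluation input, which consists of the inference inputs and the corresponding responses.
`instance.baseline_prediction`	Optional: `string`. The baseline model's LLM response.
`instance.prediction`	Optional: `string`. The candidate model's LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (PairwiseSummarizationQualityResult)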

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`. An enum with one of the following values: `BASELINE`: The baseline prediction is better. `CANDIDATE`: The candidate prediction is better. `TIE`: The baseline and candidate predictions are of equal quality.
`explanation`	`string`. The justification for the assigned `pairwise_choice`.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.
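
When you run a pairwise metric over many examples, you usually want a single summary number. One possible summary, sketched below, is the candidate's win rate over the decided comparisons. Excluding `TIE` results from the denominator is a design choice made here for illustration, not something the service prescribes, and `candidate_win_rate` is a hypothetical helper.

```python
from collections import Counter


def candidate_win_rate(choices):
    """Share of non-tied comparisons that the candidate won, or None if all tied."""
    counts = Counter(choices)
    decided = counts["BASELINE"] + counts["CANDIDATE"]
    if decided == 0:
        return None  # Nothing but ties (or no data): no meaningful rate.
    return counts["CANDIDATE"] / decided
```

For the choices `["CANDIDATE", "BASELINE", "TIE", "CANDIDATE"]`, the candidate wins two of the three decided comparisons.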

Summarization helpfulness (`summarization_helpfulness_input`)

Input (SummarizationHelpfulnessInput)

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

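Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec`. Specifies the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (SummarizationHelpfulnessResult)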

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Unhelpful. `2`: Somewhat unhelpful. `3`: Neutral. `4`: Somewhat helpful. `5`: Helpful.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Summarization verbosity (`summarization_verbosity_input`)

Input (SummarizationVerbosityInput)

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

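Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec`. Specifies the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (SummarizationVerbosityResult)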

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `-2`: Terse. `-1`: Somewhat terse. `0`: Optimal. `1`: Somewhat verbose. `2`: Verbose.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.
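
Unlike the 1 to 5 scales, the verbosity scale is centered on `0`, so a plain average can hide problems: a `-2` and a `2` average to `0`, which looks optimal. The sketch below shows one way to handle this when aggregating. Both helpers are hypothetical, and using the mean absolute score is an assumption made for illustration.

```python
def describe_verbosity(score):
    """Interprets the sign of a verbosity score."""
    if score < 0:
        return "too terse"
    if score > 0:
        return "too verbose"
    return "optimal"


def mean_abs_verbosity(scores):
    """Average distance from the optimal score of 0 (lower is better)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(abs(score) for score in scores) / len(scores)
```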

Question answering quality (`question_answering_quality_input`)

Input (QuestionAnsweringQualityInput)

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

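Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec`. Specifies the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (QuestionAnsweringQualityResult)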

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad. `2`: Bad. `3`: OK. `4`: Good. `5`: Very good.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Pairwise question answering quality (`pairwise_question_answering_quality_input`)

Input (PairwiseQuestionAnsweringQualityInput)

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

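Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec`. Specifies the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance`. The evaluation input, which consists of the inference inputs and the corresponding responses.
`instance.baseline_prediction`	Optional: `string`. The baseline model's LLM response.
`instance.prediction`	Optional: `string`. The candidate model's LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (PairwiseQuestionAnsweringQualityResult)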

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`. An enum with one of the following values: `BASELINE`: The baseline prediction is better. `CANDIDATE`: The candidate prediction is better. `TIE`: The baseline and candidate predictions are of equal quality.
`explanation`	`string`. The justification for the assigned `pairwise_choice`.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Question answering relevance (`question_answering_relevance_input`)

Input (QuestionAnsweringRelevanceInput)

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

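Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec`. Specifies the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (QuestionAnsweringRelevanceResult)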

{
  "question_answering_relevance_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Irrelevant. `2`: Somewhat irrelevant. `3`: Neutral. `4`: Somewhat relevant. `5`: Relevant.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Question answering helpfulness (`question_answering_helpfulness_input`)

Input (QuestionAnsweringHelpfulnessInput)

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

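Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec`. Specifies the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (QuestionAnsweringHelpfulnessResult)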

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Unhelpful. `2`: Somewhat unhelpful. `3`: Neutral. `4`: Somewhat helpful. `5`: Helpful.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.

Question answering correctness (`question_answering_correctness_input`)

Input (QuestionAnsweringCorrectnessInput)

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

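Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec`. Specifies the metric's behavior.
`metric_spec.use_reference`	Optional: `bool`. Specifies whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance`. The evaluation input, which consists of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. The LLM response.
`instance.reference`	Optional: `string`. The ground truth or reference response.
`instance.instruction`	Optional: `string`. The instruction used at inference time.
`instance.context`	Optional: `string`. The context provided at inference time, which the LLM response can draw on.

Output (QuestionAnsweringCorrectnessResult)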

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Incorrect. `1`: Correct.
`explanation`	`string`. The justification for the assigned score.
`confidence`	`float`. A confidence score for the result, in the range `[0, 1]`.
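
The sketch below builds a `question_answering_correctness_input` body and keeps `use_reference` consistent with whether a reference was supplied. Tying the flag to the presence of a reference is a convention assumed here for illustration; the service might accept other combinations. `build_qa_correctness_request` is a hypothetical helper.

```python
def build_qa_correctness_request(prediction, instruction, context, reference=None):
    """Builds a request body; use_reference is True only when a reference is given."""
    instance = {
        "prediction": prediction,
        "instruction": instruction,
        "context": context,
    }
    use_reference = reference is not None
    if use_reference:
        instance["reference"] = reference
    return {
        "question_answering_correctness_input": {
            "metric_spec": {"use_reference": use_reference},
            "instance": instance,
        }
    }
```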

Custom pointwise (`pointwise_metric_input`)

Input (PointwiseMetricInput)

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string,
    }
  }
}

Parameters
`metric_spec`	Required: `PointwiseMetricSpec`. Specifies the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template that defines the metric. The template is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PointwiseMetricInstance`. The evaluation input, which consists of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs as a JSON string (for example, `{"key_1": "value_1", "key_2": "value_2"}`), used to render `metric_spec.metric_prompt_template`.

Output (PointwiseMetricResult)

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string,
  }
}

Output
`score`	`float`. The score for the pointwise metric evaluation result.
`explanation`	`string`. The justification for the assigned score.
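
Because `json_instance` is a JSON string rather than a nested object, it is easy to pass a dictionary by mistake. The sketch below serializes the values with `json.dumps` and checks that every placeholder in the template has a value. It assumes that placeholders use the `{name}` style; that assumption, the helper name, and the sample template are illustrative only, and the service's own rendering rules may differ.

```python
import json
import string


def build_pointwise_request(metric_prompt_template, values):
    """Builds a pointwise_metric_input body from a template and a dict of values."""
    # Collect the {name} placeholders that the template refers to.
    placeholders = {
        name
        for _, name, _, _ in string.Formatter().parse(metric_prompt_template)
        if name
    }
    missing = placeholders - set(values)
    if missing:
        raise KeyError(f"No value for placeholders: {sorted(missing)}")
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": metric_prompt_template},
            # json_instance must be a JSON string, not a nested object.
            "instance": {"json_instance": json.dumps(values)},
        }
    }


request_body = build_pointwise_request(
    "Rate the politeness of this reply from 1 to 5: {response}",
    {"response": "Thanks for reaching out. Happy to help!"},
)
```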

Custom pairwise (`pairwise_metric_input`)

Input (PairwiseMetricInput)

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string,
    }
  }
}

Parameters
`metric_spec`	Required: `PairwiseMetricSpec`. Specifies the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template that defines the metric. The template is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PairwiseMetricInstance`. The evaluation input, which consists of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs as a JSON string (for example, `{"key_1": "value_1", "key_2": "value_2"}`), used to render `metric_spec.metric_prompt_template`.

Output (PairwiseMetricResult)

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string,
  }
}

Output
`score`	`float`. The score for the pairwise metric evaluation result.
`explanation`	`string`. The justification for the assigned score.

Tool call valid (`tool_call_valid_input`)

Input (ToolCallValidInput)

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

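Parameters
`metric_spec`	Optional: `ToolCallValidSpec`. Specifies the metric's behavior.
`instance`	Optional: `ToolCallValidInstance`. The evaluation input, which consists of the LLM response and the reference.
`instance.prediction`	Optional: `string`. The candidate model's response. This must be a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string`. The ground truth or reference response, in the same format as `prediction`.

Output (ToolCallValidResults)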

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	`ToolCallValidMetricValue[]`. An array of evaluation results, one for each input instance.
`tool_call_valid_metric_values.score`	`float`. One of the following: `0`: Invalid tool call. `1`: Valid tool call.
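
The `prediction` and `reference` fields for the tool use metrics are JSON strings, so they need to be serialized before they go into the request. The sketch below follows the shape of the example above, where `tool_calls` is a list nested inside the serialized object. The helper names are hypothetical, and the sample arguments are a shortened version of the example.

```python
import json


def serialize_tool_response(content, tool_calls):
    """Returns the JSON string form used for `prediction` and `reference`."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


def build_tool_call_valid_request(prediction, reference):
    """Builds a tool_call_valid_input body from two serialized responses."""
    return {
        "tool_call_valid_input": {
            "metric_spec": {},
            "instance": {"prediction": prediction, "reference": reference},
        }
    }


tool_calls = [
    {
        "name": "book_tickets",
        "arguments": {
            "movie": "Mission Impossible Dead Reckoning Part 1",
            "num_tix": "2",
        },
    }
]
request_body = build_tool_call_valid_request(
    prediction=serialize_tool_response("", tool_calls),
    reference=serialize_tool_response("", tool_calls),
)
```

The same serialized strings can be reused with the tool name match and tool parameter match inputs, which take the same `instance` fields.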

Tool name match (`tool_name_match_input`)

Input (ToolNameMatchInput)

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

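Parameters
`metric_spec`	Optional: `ToolNameMatchSpec`. Specifies the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance`. The evaluation input, which consists of the LLM response and the reference.
`instance.prediction`	Optional: `string`. The candidate model's response. This must be a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. The ground truth or reference response, in the same format as `prediction`.

Output (ToolNameMatchResults)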

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_name_match_metric_values`	`ToolNameMatchMetricValue[]`. An array of evaluation results, one for each input instance.
`tool_name_match_metric_values.score`	`float`. One of the following: `0`: The tool call name doesn't match the reference. `1`: The tool call name matches the reference.

Tool parameter key match (`tool_parameter_key_match_input`)

Input (ToolParameterKeyMatchInput)

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

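Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec`. Specifies the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance`. The evaluation input, which consists of the LLM response and the reference.
`instance.prediction`	Optional: `string`. The candidate model's response. This must be a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. The ground truth or reference response, in the same format as `prediction`.

Output (ToolParameterKeyMatchResults)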

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

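Output
`tool_parameter_key_match_metric_values`	`ToolParameterKeyMatchMetricValue[]`. An array of evaluation results, one for each input instance.
`tool_parameter_key_match_metric_values.score`	`float`. A value in the range `[0, 1]`. A higher score means that more parameters match the names of the reference parameters.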

Tool parameter KV match (`tool_parameter_kv_match_input`)

Input (ToolParameterKVMatchInput)

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

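Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec`. Specifies the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance`. The evaluation input, which consists of the LLM response and the reference.
`instance.prediction`	Optional: `string`. The candidate model's response. This must be a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. The ground truth or reference response, in the same format as `prediction`.

Output (ToolParameterKVMatchResults)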

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

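Output
`tool_parameter_kv_match_metric_values`	`ToolParameterKVMatchMetricValue[]`. An array of evaluation results, one for each input instance.
`tool_parameter_kv_match_metric_values.score`	`float`. A value in the range `[0, 1]`. A higher score means that more parameters match the names and values of the reference parameters.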
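
For intuition about what the two tool parameter scores measure, the following local sketch compares the arguments of one predicted tool call with one reference call. The exact formula that the service uses is not stated here, so treat this as an approximation: it assumes the score is the fraction of reference parameters that the prediction matches, and both function names are hypothetical.

```python
def parameter_key_match(predicted_args, reference_args):
    """Fraction of reference parameter names that appear in the prediction."""
    if not reference_args:
        raise ValueError("reference_args must contain at least one parameter")
    matched = sum(1 for key in reference_args if key in predicted_args)
    return matched / len(reference_args)


def parameter_kv_match(predicted_args, reference_args):
    """Fraction of reference parameters matched on both name and value."""
    if not reference_args:
        raise ValueError("reference_args must contain at least one parameter")
    matched = sum(
        1
        for key, value in reference_args.items()
        if key in predicted_args and predicted_args[key] == value
    )
    return matched / len(reference_args)
```

With a prediction of `{"movie": "A", "date": "2024-03-31"}` and a reference of `{"movie": "A", "date": "2024-03-30", "theater": "T"}`, two of three names match but only one name and value pair does.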

COMET (`comet_input`)

Input (CometInput)

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string,
    },
  }
}

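Parameters
`metric_spec`	Optional: `CometSpec`. Specifies the metric's behavior.
`metric_spec.version`	Optional: `string`. `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. The source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. The target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance`. The evaluation input. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string`. The candidate model's response, which is the translated text to evaluate.
`instance.source`	Optional: `string`. The source text, in the original language before translation.
`instance.reference`	Optional: `string`. The ground truth or reference translation, in the same language as `prediction`.

Output (CometResult)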

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`. A value in the range `[0, 1]`, where 1 represents a perfect translation.

MetricX (`metricx_input`)

Input (MetricxInput)

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string,
    },
  }
}

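Parameters
`metric_spec`	Optional: `MetricxSpec`. Specifies the metric's behavior.
`metric_spec.version`	Optional: `string`. One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) with quality estimation (QE), without a reference text. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. The source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. The target language in BCP-47 format. For example, "es".
`instance`	Optional: `MetricxInstance`. The evaluation input. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string`. The candidate model's response, which is the translated text to evaluate.
`instance.source`	Optional: `string`. The source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. The ground truth to compare the prediction against, in the same language as the prediction.

Output (MetricxResult)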

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`. A value in the range `[0, 25]`, where 0 represents a perfect translation.
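
Each MetricX version uses a different subset of the instance fields, so a small client-side check can catch a missing `source` or `reference` before the request is sent. The sketch below encodes the version descriptions above. The helper name and the choice to send only the fields that a version uses are assumptions for illustration.

```python
# Instance fields that each version uses, per the version descriptions above.
METRICX_REQUIRED_FIELDS = {
    "METRICX_24_REF": ("prediction", "reference"),
    "METRICX_24_SRC": ("prediction", "source"),
    "METRICX_24_SRC_REF": ("prediction", "source", "reference"),
}


def build_metricx_request(version, prediction, source=None, reference=None):
    """Builds a metricx_input body that holds only the fields the version uses."""
    if version not in METRICX_REQUIRED_FIELDS:
        raise ValueError(f"Unsupported MetricX version: {version}")
    provided = {"prediction": prediction, "source": source, "reference": reference}
    instance = {}
    for field in METRICX_REQUIRED_FIELDS[version]:
        if provided[field] is None:
            raise ValueError(f"{version} requires the '{field}' field")
        instance[field] = provided[field]
    return {
        "metricx_input": {
            "metric_spec": {"version": version},
            "instance": instance,
        }
    }
```

When you compare results, keep in mind that the two translation metrics point in opposite directions: for COMET a score closer to 1 is better, while for MetricX a score closer to 0 is better.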

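Examples

Evaluate multiple metrics in one call

The following example shows how to call the Gen AI Evaluation Service API to evaluate the output of an LLM with several evaluation metrics, including summarization_quality, groundedness, fulfillment, summarization_helpfulness, and summarization_verbosity.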

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

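Evaluate pairwise summarization quality

The following example shows how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.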

REST

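Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model's LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that the LLM response can use.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances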

JSON 要求主體：

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
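
Hand-written JSON is easy to get wrong (for example, a trailing comma makes the body invalid). As a minimal sketch, the request body can be generated from Python values instead. The helper name `build_pairwise_request` and the sample strings are hypothetical and used only for illustration:

```python
import json


def build_pairwise_request(prediction, baseline_prediction, instruction, context):
    """Return the request body for pairwise_summarization_quality_input."""
    return {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }


body = build_pairwise_request(
    prediction="Candidate summary.",
    baseline_prediction="Baseline summary.",
    instruction="Summarize the text such that a five-year-old can understand.",
    context="Article text goes here.",
)

# json.dumps handles quoting and escaping and never emits trailing commas.
request_json = json.dumps(body, indent=2)
print(request_json)
```

Writing `request_json` to a file named request.json produces a body that can be sent with the commands in the following options.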

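To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: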

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

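Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: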

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content
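
The same REST call can also be prepared from a script. The following is a minimal sketch using only the Python standard library. The helper name `build_evaluate_request` and the placeholder values are hypothetical, and the access token is assumed to come from `gcloud auth print-access-token`. The request is built but not sent:

```python
import json
import urllib.request


def build_evaluate_request(project_id, location, access_token, body):
    """Build (but do not send) the POST request for evaluateInstances."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_evaluate_request(
    project_id="your-project-id",
    location="us-central1",
    access_token="ACCESS_TOKEN",
    body={"pairwise_summarization_quality_input": {"metric_spec": {}, "instance": {}}},
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it. Building the URL in one place keeps the project ID and location consistent across the host name and the path.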

Python

To learn how to install or update the Vertex AI SDK for Python, see "Install the Vertex AI SDK for Python". For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
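
When the dataset has more than one row, the per-row choices can be tallied into win rates. This is a minimal sketch in plain Python. The list of choices is hypothetical, and the labels `CANDIDATE` and `BASELINE` are taken from the example output above (other labels, such as a tie, may exist and would need their own handling):

```python
from collections import Counter

# Hypothetical pairwise_choice values collected from several rows of
# result.metrics_table. These labels are illustrative, not real results.
choices = ["CANDIDATE", "BASELINE", "CANDIDATE", "CANDIDATE"]

counts = Counter(choices)
total = len(choices)
candidate_win_rate = counts["CANDIDATE"] / total
baseline_win_rate = counts["BASELINE"] / total

print(f"candidate win rate: {candidate_win_rate:.2f}")
print(f"baseline win rate: {baseline_win_rate:.2f}")
```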

Go

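Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".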

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

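Evaluate the ROUGE score

The following example shows how to call the Gen AI evaluation service API to get the ROUGE score of a prediction. The request uses metric_spec to configure the metric's behavior.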
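
To give a feel for what a ROUGE-1 score measures, the following minimal sketch computes a simplified unigram-overlap F1 locally. It is not the service's implementation: the tokenizer is a plain regular expression, no stemming is applied, and the function name `rouge1_f1` is hypothetical.

```python
import re
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Each unigram counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"{score:.4f}")
```

Here all three predicted words appear in the reference (precision 1.0), but they cover only three of its six words (recall 0.5), so the F1 lands between the two.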

REST

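Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response used as the reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. For acceptable values, see metric_spec.rouge_type.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between sentences for rougeLsum. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances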

JSON 要求主體：

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
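
In this body, USE_STEMMER and SPLIT_SUMMARIES are unquoted, so they are JSON booleans and not strings. As a minimal sketch, the body can be generated with a small helper that also checks the ROUGE type. The helper name `build_rouge_request` is hypothetical, the set of accepted `rouge_type` values in the pattern is an assumption to verify against metric_spec.rouge_type, and `instances` is written as a list to match the repeated `Instances` field in the Go sample:

```python
import json
import re

# Assumed accepted values (rouge1..rouge9, rougeL, rougeLsum). Check
# metric_spec.rouge_type in the API reference before relying on this.
_ROUGE_TYPE = re.compile(r"^(rouge[1-9]|rougeL|rougeLsum)$")


def build_rouge_request(prediction, reference, rouge_type="rouge1",
                        use_stemmer=False, split_summaries=False):
    """Return a rouge_input body; raise ValueError on an unknown rouge_type."""
    if not _ROUGE_TYPE.match(rouge_type):
        raise ValueError(f"unsupported rouge_type: {rouge_type!r}")
    return {
        "rouge_input": {
            "instances": [{"prediction": prediction, "reference": reference}],
            "metric_spec": {
                "rouge_type": rouge_type,
                # Python booleans serialize to JSON true/false.
                "use_stemmer": bool(use_stemmer),
                "split_summaries": bool(split_summaries),
            },
        }
    }


body = build_rouge_request(
    "A short summary.", "A reference summary.", "rougeLsum", use_stemmer=True
)
print(json.dumps(body, indent=2))
```

Note that the REST field uses names such as `rouge1`, while the Python SDK example on this page uses metric names such as `rouge_1`, so the helper rejects the SDK-style spelling.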

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

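Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: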

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see "Install the Vertex AI SDK for Python". For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

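Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".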

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

Learn how to run an evaluation.
