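Starting April 29, 2025, the Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have not previously used them, including new projects. For details, see "Model versions and lifecycle".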

Gen AI Evaluation Service API

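The Gen AI evaluation service lets you evaluate large language models (LLMs) against your own criteria across multiple metrics. You provide the inference-time inputs, the LLM responses, and any additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and metrics computed in memory, such as rouge, bleu, and the tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes prediction results directly from models as input, it can run both inference and subsequent evaluation on all models supported by Vertex AI.

For more information about evaluating a model, see the Gen AI evaluation service overview.

Limitations

The evaluation service has the following limitations:

The evaluation service may have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute them.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they do not consume gemini-2.0-flash quota.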

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

# Placeholders: replace with your own project ID and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

# Obtain application default credentials with the cloud-platform scope.
creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# Python f-strings interpolate with {NAME}, not the shell-style ${NAME}.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

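Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput`. Input to assess whether the prediction exactly matches the reference.
`bleu_input`	Optional: `BleuInput`. Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput`. Input to compute `rouge` scores by comparing the prediction against the reference. Different `rouge` scores are supported through `rouge_type`.
`fluency_input`	Optional: `FluencyInput`. Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput`. Input to assess whether a single response is coherent and easy to follow.
`safety_input`	Optional: `SafetyInput`. Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput`. Input to assess a single response's ability to provide or reference information included only in the input text.
`fulfillment_input`	Optional: `FulfillmentInput`. Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput`. Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput`. Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput`. Input to assess a single response's ability to provide a summary that contains the details necessary to substitute for the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput`. Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput`. Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput`. Input to compare the overall question-answering ability of two responses, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput`. Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput`. Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput`. Input to assess a single response's ability to correctly answer a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput`. Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput`. Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput`. Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput`. Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names and values.
`comet_input`	Optional: `CometInput`. Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput`. Input to evaluate using MetricX.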

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

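Parameters
`metric_spec`	Optional: `ExactMatchSpec`. Metric spec, defining the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]`. Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.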

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]`. Evaluation results per instance input.
`exact_match_metric_values.score`	`float`. One of the following: `0`: the instance was not an exact match; `1`: exact match.
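As a rough local illustration of the 0/1 scoring described above, the sketch below compares each prediction with its reference using plain string equality. The helper name `exact_match_scores` is hypothetical, and the service may normalize text differently, so treat this as an approximation rather than the service's implementation.

```python
from typing import Dict, List


def exact_match_scores(instances: List[Dict[str, str]]) -> List[float]:
    """Return 1.0 where prediction equals reference, else 0.0 (hypothetical helper)."""
    return [
        1.0 if item["prediction"] == item["reference"] else 0.0
        for item in instances
    ]


# Same shape as the "instances" array in exact_match_input.
instances = [
    {"prediction": "Paris", "reference": "Paris"},
    {"prediction": "paris", "reference": "Paris"},
]
print(exact_match_scores(instances))
```

Note that this sketch is case-sensitive, which is an assumption; check the service output before relying on that behavior.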

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

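Parameters
`metric_spec`	Optional: `BleuSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool`. Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]`. Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.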

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]`. Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is closer to the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

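Parameters
`metric_spec`	Optional: `RougeSpec`. Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	Optional: `string`. Acceptable values: `rougen[1-9]`: computes `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: computes `rouge` scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences, then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool`. Whether the Porter stemmer should be used to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool`. Whether to add newlines between sentences for rougeLsum.
`instances`	Optional: `RougeInstance[]`. Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.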

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]`. Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is closer to the reference.
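To make the LCS-based `rougeL` variant more concrete, here is a minimal local sketch that scores one prediction against one reference. It assumes lowercased whitespace tokenization and reports an F1 value; the function names are hypothetical, and the service's tokenization, stemming, and aggregation may differ, so scores are not guaranteed to match.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based F1 in [0, 1]; assumes whitespace tokenization (a simplification)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```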

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FluencyInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Inarticulate; `2`: Somewhat inarticulate; `3`: Neutral; `4`: Somewhat fluent; `5`: Fluent.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `CoherenceInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Incoherent; `2`: Somewhat incoherent; `3`: Neutral; `4`: Somewhat coherent; `5`: Coherent.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SafetyInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Unsafe; `1`: Safe.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

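Parameters	Description
`metric_spec`	Optional: `GroundednessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `GroundednessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.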

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Ungrounded; `1`: Grounded.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

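Parameters
`metric_spec`	Optional: `FulfillmentSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FulfillmentInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.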

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: No fulfillment; `2`: Poor fulfillment; `3`: Some fulfillment; `4`: Good fulfillment; `5`: Complete fulfillment.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

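Parameters
`metric_spec`	Optional: `SummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.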

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad; `2`: Bad; `3`: OK; `4`: Good; `5`: Very good.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

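Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.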

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better; `CANDIDATE`: the candidate prediction is better; `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the pairwise_choice assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.
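Because each pairwise call returns a single `pairwise_choice`, comparing two models over a dataset usually means tallying those values across many calls. The sketch below shows one way to do that on the client side; `summarize_pairwise_choices` is a hypothetical helper and not part of the API.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("BASELINE", "CANDIDATE", "TIE")


def summarize_pairwise_choices(choices: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of results per PairwiseChoice value (hypothetical helper)."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Values collected from the pairwise_choice field of several responses.
collected = ["CANDIDATE", "CANDIDATE", "BASELINE", "TIE"]
print(summarize_pairwise_choices(collected))
```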

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

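Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.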

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Unhelpful; `2`: Somewhat unhelpful; `3`: Neutral; `4`: Somewhat helpful; `5`: Helpful.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

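Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.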

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `-2`: Terse; `-1`: Somewhat terse; `0`: Optimal; `1`: Somewhat verbose; `2`: Verbose.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

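Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.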

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad; `2`: Bad; `3`: OK; `4`: Good; `5`: Very good.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

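Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.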

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better; `CANDIDATE`: the candidate prediction is better; `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

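Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.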

`QuestionAnsweringRelevancyResult`

{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Irrelevant; `2`: Somewhat irrelevant; `3`: Neutral; `4`: Somewhat relevant; `5`: Relevant.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

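Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.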

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Unhelpful; `2`: Somewhat unhelpful; `3`: Neutral; `4`: Somewhat helpful; `5`: Helpful.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

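Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	Optional: `bool`. Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.reference`	Optional: `string`. Golden LLM response for reference.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.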

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Incorrect; `1`: Correct.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

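Parameters
`metric_spec`	Required: `PointwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in instance.json_instance.
`instance`	Required: `PointwiseMetricInstance`. Evaluation input, consisting of json_instance.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.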
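One detail that is easy to miss is that `json_instance` is a string holding serialized JSON, not a nested object. The sketch below builds a `pointwise_metric_input` body accordingly. The helper name, the sample template, and the `{response}` placeholder syntax are illustrative assumptions and are not taken from this reference.

```python
import json
from typing import Any, Dict


def build_pointwise_request(template: str, fields: Dict[str, Any]) -> Dict[str, Any]:
    """Build a pointwise_metric_input body (hypothetical helper).

    json_instance must be a JSON-serialized string, so the fields are
    passed through json.dumps instead of being embedded as an object.
    """
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": template},
            "instance": {"json_instance": json.dumps(fields)},
        }
    }


# The template text and its {response} placeholder are assumptions.
fields = {"response": "The quick brown fox jumps over the lazy dog."}
body = build_pointwise_request(
    "Rate the fluency of this response from 1 to 5: {response}", fields
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as `data` in the Python syntax example earlier on this page.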

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pointwise metric evaluation result.
`explanation`	`string`. Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

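Parameters
`metric_spec`	Required: `PairwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in instance.json_instance.
`instance`	Required: `PairwiseMetricInstance`. Evaluation input, consisting of json_instance.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.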

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pairwise metric evaluation result.
`explanation`	`string`. Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

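Parameters
`metric_spec`	Optional: `ToolCallValidSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolCallValidInstance`. Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.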
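Since both `prediction` and `reference` are JSON-serialized strings, it can help to build them from ordinary Python objects. The following sketch does that for a shortened version of the example above; `serialize_tool_response` and `build_tool_call_valid_request` are hypothetical helper names, not part of the API.

```python
import json
from typing import Any, Dict, List


def serialize_tool_response(content: str, tool_calls: List[Dict[str, Any]]) -> str:
    """Serialize a model response with tool calls into the expected string form."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


def build_tool_call_valid_request(prediction: str, reference: str) -> Dict[str, Any]:
    """Wrap serialized prediction and reference in a tool_call_valid_input body."""
    return {
        "tool_call_valid_input": {
            "metric_spec": {},
            "instance": {"prediction": prediction, "reference": reference},
        }
    }


call = {
    "name": "book_tickets",
    "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "2"},
}
prediction = serialize_tool_response("", [call])
reference = serialize_tool_response("", [call])
request_body = build_tool_call_valid_request(prediction, reference)
print(json.dumps(request_body, indent=2))
```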

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`. Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`. One of the following: `0`: invalid tool call; `1`: valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

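Parameters
`metric_spec`	Optional: `ToolNameMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance`. Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.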

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`. Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`. One of the following: `0`: the tool call name doesn't match the reference; `1`: the tool call name matches the reference.

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

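Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance`. Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.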

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the reference parameters' names.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

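Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance`. Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.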

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

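Output
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the reference parameters' names and values.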

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

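Parameters
`metric_spec`	Optional: `CometSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance`. Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text. This is in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. This is in the same language as the prediction.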

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

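Parameters
`metric_spec`	Optional: `MetricxSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text input. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by quality estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `MetricxInstance`. Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. It is in the same language as the prediction.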

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.
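Because the fields used for evaluation depend on the MetricX version, a small client-side check can catch incomplete instances before a request is sent. The sketch below encodes one reading of the version descriptions above; the mapping and the helper name `missing_metricx_fields` are assumptions, and the service remains the authority on what it accepts.

```python
from typing import Dict, List

# Assumed mapping, derived from the version descriptions in this reference.
REQUIRED_FIELDS = {
    "METRICX_24_REF": ("prediction", "reference"),
    "METRICX_24_SRC": ("prediction", "source"),
    "METRICX_24_SRC_REF": ("prediction", "source", "reference"),
}


def missing_metricx_fields(version: str, instance: Dict[str, str]) -> List[str]:
    """Return instance fields that the given version appears to need but lacks."""
    if version not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown MetricX version: {version}")
    return [name for name in REQUIRED_FIELDS[version] if not instance.get(name)]


instance = {"prediction": "Hola mundo", "source": "Hello world"}
print(missing_metricx_fields("METRICX_24_SRC", instance))
print(missing_metricx_fields("METRICX_24_SRC_REF", instance))
```

Keep in mind that MetricX scores run from 0 (best) to 25, the opposite direction from COMET's 0 to 1 scale, so the two should not be averaged or compared directly.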

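Examples

Evaluate an output

The following example demonstrates how to call the Gen AI evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following:

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity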

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

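Evaluate an output: pairwise summarization quality

The following example calls the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.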

REST

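Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model's LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: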

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
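As a minimal sketch, the request URL and JSON body above can also be assembled programmatically. The function name build_pairwise_request and the sample values (including the project ID "my-project") are hypothetical and used only for illustration; the field names and URL pattern are taken from the REST example.

```python
import json


def build_pairwise_request(project_id, location, prediction,
                           baseline_prediction, instruction, context):
    """Return (url, body) for a pairwise summarization quality request."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )
    body = {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }
    return url, body


url, body = build_pairwise_request(
    project_id="my-project",  # hypothetical project ID
    location="us-central1",
    prediction="Candidate summary.",
    baseline_prediction="Baseline summary.",
    instruction="Summarize the text.",
    context="Text to summarize.",
)
print(url)
# json.dumps produces strict JSON, so the payload has no trailing commas.
print(json.dumps(body, indent=2))
```

The printed JSON can be saved as request.json and sent with the commands that follow.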

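To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: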

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

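Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: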

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...

Go

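Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.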

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}
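When the pairwise comparison is run over many examples, the per-example choices are usually aggregated into win rates. The following is a minimal sketch of that aggregation in plain Python. The function name win_rates and the sample list are hypothetical; the labels BASELINE and CANDIDATE appear in the example responses above, while TIE is an assumption about the remaining possible choice value.

```python
from collections import Counter


def win_rates(choices):
    """Return the fraction of examples won by each pairwise choice label."""
    total = len(choices)
    if total == 0:
        return {}
    counts = Counter(choices)
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical per-example results collected from several evaluation calls.
choices = ["CANDIDATE", "BASELINE", "CANDIDATE", "TIE"]
rates = win_rates(choices)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```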

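Get the ROUGE score

The following example calls the Gen AI evaluation service API to get the ROUGE score of a prediction generated from a number of inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.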
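As a rough illustration of what a ROUGE-1 score measures, the following minimal sketch computes a unigram-overlap F-measure in plain Python. It is a simplified stand-in and not the service's implementation: the function name rouge1_f1 and the tokenizer (lowercased alphanumeric runs) are assumptions, and no stemming is applied.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"{score:.4f}")  # 5 of 6 unigrams overlap on both sides: 0.8333
```

Scores returned by the service can differ from this sketch because tokenization, stemming (use_stemmer), and the chosen rouge_type all affect the result.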

REST

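Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response used as the reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. For acceptable values, see metric_spec.rouge_type.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: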

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
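Because the ROUGE input accepts several prediction and reference pairs in one call, it can be convenient to generate the request body from a list. The following is a minimal sketch under the assumption that instances is a list, as in the Go sample's []*aiplatformpb.RougeInstance. The function name build_rouge_request and the sample pairs are hypothetical.

```python
import json


def build_rouge_request(pairs, rouge_type="rouge1",
                        use_stemmer=False, split_summaries=False):
    """Build a rouge_input body from (prediction, reference) pairs."""
    instances = [
        {"prediction": prediction, "reference": reference}
        for prediction, reference in pairs
    ]
    return {
        "rouge_input": {
            "instances": instances,
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
        }
    }


body = build_rouge_request(
    [
        ("The reef is large.", "The reef is the largest in the world."),
        ("The reef faces threats.", "The reef faces serious threats."),
    ],
    use_stemmer=True,
)
# Python booleans are serialized as the JSON literals true and false.
payload = json.dumps(body, indent=2)
print(payload)
```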

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

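Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: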

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

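Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.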

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.
