Gen AI Evaluation Service API

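The Gen AI Evaluation Service lets you evaluate large language models (LLMs) across several metrics against your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI Evaluation Service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, the evaluation service can perform both inference and subsequent evaluation on all models supported by Vertex AI.

For more information on evaluating a model, see Gen AI Evaluation Service overview.

Limitations

The evaluation service has the following limitations:

The evaluation service might have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI Evaluation Service uses gemini-2.0-flash as the underlying judge model to compute these metrics.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.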
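
Because the first call can fail or stall while the service propagates, one way to cope is to wrap the call in a retry loop with exponential backoff. The following is a minimal sketch, not part of the service's API: `call_with_backoff`, its parameters, and the choice to retry on any `Exception` are assumptions made for illustration.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    seconds between attempts and re-raises the last error on failure.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up after the final attempt.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

In practice you would likely narrow the `except` clause to the transient errors you actually observe, so that permanent failures such as invalid credentials surface immediately.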

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# PROJECT_ID and LOCATION must be defined before this line, for example as module-level strings.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

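Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput`. Input to assess whether the prediction matches the reference exactly.
`bleu_input`	Optional: `BleuInput`. Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput`. Input to compute `rouge` scores by comparing the prediction against the reference. Different `rouge` scores are supported through `rouge_type`.
`fluency_input`	Optional: `FluencyInput`. Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput`. Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	Optional: `SafetyInput`. Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput`. Input to assess a single response's ability to provide or reference information included only in the input text.
`fulfillment_input`	Optional: `FulfillmentInput`. Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput`. Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput`. Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput`. Input to assess a single response's ability to provide a summary that contains the details necessary to substitute for the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput`. Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput`. Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput`. Input to compare the overall ability of two responses to answer questions, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput`. Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput`. Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput`. Input to assess a single response's ability to correctly answer a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput`. Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput`. Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput`. Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput`. Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput`. Input to assess a single response's ability to predict a tool call with the correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput`. Input to assess a single response's ability to predict a tool call with the correct parameter names and values.
`comet_input`	Optional: `CometInput`. Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput`. Input to evaluate using MetricX.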

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

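Parameters
`metric_spec`	Optional: `ExactMatchSpec`. Metric spec, defining the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.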

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]`. Evaluation results per instance input.
`exact_match_metric_values.score`	`float`. One of the following: `0`: the instance was not an exact match. `1`: exact match.
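
The scoring rule in the table can be mirrored locally for quick sanity checks. This is a sketch of the semantics only, assuming a plain string equality comparison; any normalization the service may apply is not documented here, and `exact_match_scores` is a hypothetical helper name.

```python
def exact_match_scores(instances):
    """Return one score per instance: 1.0 if prediction == reference, else 0.0.

    `instances` uses the same shape as `exact_match_input.instances`.
    Assumption: comparison is plain, case-sensitive string equality.
    """
    return [
        1.0 if item["prediction"] == item["reference"] else 0.0
        for item in instances
    ]


instances = [
    {"prediction": "Paris", "reference": "Paris"},
    {"prediction": "paris", "reference": "Paris"},
]
print(exact_match_scores(instances))  # [1.0, 0.0]
```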

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

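Parameters
`metric_spec`	Optional: `BleuSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool`. Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.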
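
Unlike the single-`instance` metrics later in this reference, `bleu_input` takes a list of `instances`. A small helper such as the following can assemble the request body from prediction and reference pairs; `build_bleu_input` is an illustrative name, not an SDK function.

```python
def build_bleu_input(pairs, use_effective_order=False):
    """Build a request body for `bleu_input` from (prediction, reference) pairs."""
    return {
        "bleu_input": {
            "metric_spec": {"use_effective_order": use_effective_order},
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


body = build_bleu_input([("the cat sat", "the cat sat down")])
```

The resulting dict can be passed as the JSON payload of the `evaluateInstances` call shown in the example syntax.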

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]`. Evaluation results per instance input.
`bleu_metric_values.score`	`float`. `[0, 1]`, where higher scores mean the prediction is more like the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

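Parameters
`metric_spec`	Optional: `RougeSpec`. Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	Optional: `string`. Acceptable values: `rougen[1-9]`: computes `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: computes `rouge` scores based on the Longest Common Subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences and then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool`. Whether the Porter stemmer should be used to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool`. Whether to add newlines between sentences for rougeLsum.
`instances`	Optional: `RougeInstance[]`. Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.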

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]`. Evaluation results per instance input.
`rouge_metric_values.score`	`float`. `[0, 1]`, where higher scores mean the prediction is more like the reference.
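
To make the `rougeL` definition concrete, the sketch below computes an LCS-based F1 score over whitespace tokens. It is a simplified illustration under stated assumptions (lowercased whitespace tokenization, no stemming, balanced F1), not the service's implementation, and its numbers may differ from what the API returns.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F1 between two strings (simplified rougeL sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing "the cat" with "the cat sat down" gives an LCS of 2 tokens, a precision of 1.0, a recall of 0.5, and therefore an F1 of about 0.67.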

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FluencyInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

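Output
`score`	`float`. One of the following: `1`: Inarticulate. `2`: Somewhat inarticulate. `3`: Neutral. `4`: Somewhat fluent. `5`: Fluent.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.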

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `CoherenceInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

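Output
`score`	`float`. One of the following: `1`: Incoherent. `2`: Somewhat incoherent. `3`: Neutral. `4`: Somewhat coherent. `5`: Coherent.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.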

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SafetyInstance`. Evaluation input, consisting of LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Unsafe. `1`: Safe.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

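Parameter	Description
`metric_spec`	Optional: `GroundednessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `GroundednessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.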

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Ungrounded. `1`: Grounded.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

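Parameters
`metric_spec`	Optional: `FulfillmentSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `FulfillmentInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.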

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

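Output
`score`	`float`. One of the following: `1`: No fulfillment. `2`: Poor fulfillment. `3`: Some fulfillment. `4`: Good fulfillment. `5`: Complete fulfillment.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.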

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

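Parameters
`metric_spec`	Optional: `SummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.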

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad. `2`: Bad. `3`: Ok. `4`: Good. `5`: Very good.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

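Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.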

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

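Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better. `CANDIDATE`: the candidate prediction is better. `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the `pairwise_choice` assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.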
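
When you run a pairwise comparison over many examples, a common follow-up is to tally the `pairwise_choice` values into a win rate. The sketch below assumes each response has already been parsed into a dict shaped like the `pairwise_summarization_quality_result` body above; `candidate_win_rate` is a hypothetical helper, and excluding ties from the denominator is one design choice among several.

```python
from collections import Counter


def candidate_win_rate(results):
    """Return (counts, win_rate) for a list of pairwise result dicts.

    win_rate = CANDIDATE wins / (CANDIDATE + BASELINE), ignoring TIE.
    win_rate is None when no comparison was decided either way.
    """
    counts = Counter(result["pairwise_choice"] for result in results)
    decided = counts["CANDIDATE"] + counts["BASELINE"]
    win_rate = counts["CANDIDATE"] / decided if decided else None
    return counts, win_rate


results = [
    {"pairwise_choice": "CANDIDATE", "explanation": "...", "confidence": 0.9},
    {"pairwise_choice": "BASELINE", "explanation": "...", "confidence": 0.7},
    {"pairwise_choice": "TIE", "explanation": "...", "confidence": 0.5},
    {"pairwise_choice": "CANDIDATE", "explanation": "...", "confidence": 0.8},
]
counts, win_rate = candidate_win_rate(results)
```

The sample values in `results` are made up for illustration and are not output from the service.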

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

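Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.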

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

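Output
`score`	`float`. One of the following: `1`: Unhelpful. `2`: Somewhat unhelpful. `3`: Neutral. `4`: Somewhat helpful. `5`: Helpful.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.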

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

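Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.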

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

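Output
`score`	`float`. One of the following: `-2`: Terse. `-1`: Somewhat terse. `0`: Optimal. `1`: Somewhat verbose. `2`: Verbose.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.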

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

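Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.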

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `1`: Very bad. `2`: Bad. `3`: Ok. `4`: Good. `5`: Very good.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

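Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.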

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

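Output
`pairwise_choice`	`PairwiseChoice`. Enum with the following possible values: `BASELINE`: the baseline prediction is better. `CANDIDATE`: the candidate prediction is better. `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`. Justification for the `pairwise_choice` assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.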

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

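Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.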

`QuestionAnsweringRelevanceResult`

{
  "question_answering_relevance_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

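Output
`score`	`float`. One of the following: `1`: Irrelevant. `2`: Somewhat irrelevant. `3`: Neutral. `4`: Somewhat relevant. `5`: Relevant.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.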

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

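Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.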

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

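Output
`score`	`float`. One of the following: `1`: Unhelpful. `2`: Somewhat unhelpful. `3`: Neutral. `4`: Somewhat helpful. `5`: Helpful.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.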

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

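Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec`. Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	Optional: `bool`. Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.reference`	Optional: `string`. Golden LLM response for reference.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all information, which can be used in the LLM response.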

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`. One of the following: `0`: Incorrect. `1`: Correct.
`explanation`	`string`. Justification for the score assignment.
`confidence`	`float`. `[0, 1]`. Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

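Parameters
`metric_spec`	Required: `PointwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PointwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render `metric_spec.metric_prompt_template`.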
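
Because `json_instance` is a string rather than a nested object, the key-value pairs have to be serialized before they go into the request. The sketch below builds such a body and, purely for local inspection, previews how the template might look once filled in. It assumes `{key}`-style placeholders (the style used by the Python SDK example later on this page); `build_pointwise_input` and `preview_prompt` are illustrative names, and the service's actual rendering may differ.

```python
import json


def build_pointwise_input(metric_prompt_template, values):
    """Build a `pointwise_metric_input` body; `values` is a plain dict."""
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": metric_prompt_template},
            # json_instance must be a JSON-encoded string, not a nested object.
            "instance": {"json_instance": json.dumps(values)},
        }
    }


def preview_prompt(body):
    """Locally fill the template with the values from json_instance."""
    metric_input = body["pointwise_metric_input"]
    values = json.loads(metric_input["instance"]["json_instance"])
    return metric_input["metric_spec"]["metric_prompt_template"].format(**values)


template = "Rate the fluency of this response from 1 to 5: {response}"
body = build_pointwise_input(template, {"response": "The sky is blue."})
print(preview_prompt(body))
```

The same pattern would apply to `pairwise_metric_input`, with the baseline and candidate responses passed as separate keys.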

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pointwise metric evaluation result.
`explanation`	`string`. Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

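Parameters
`metric_spec`	Required: `PairwiseMetricSpec`. Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template defining the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PairwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render `metric_spec.metric_prompt_template`.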

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`. A score for the pairwise metric evaluation result.
`explanation`	`string`. Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolCallValidSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolCallValidInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.
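
Since both `prediction` and `reference` are JSON-serialized strings, it is easy to pass a dict by mistake. The following sketch serializes tool calls into the expected shape; `serialize_tool_calls` and `build_tool_call_valid_input` are hypothetical helper names, and the booking values are taken from the example in the table.

```python
import json


def serialize_tool_calls(tool_calls, content=""):
    """Return the JSON string expected in `prediction` or `reference`."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


def build_tool_call_valid_input(prediction_calls, reference_calls):
    """Build a `tool_call_valid_input` body from lists of tool-call dicts."""
    return {
        "tool_call_valid_input": {
            "metric_spec": {},
            "instance": {
                "prediction": serialize_tool_calls(prediction_calls),
                "reference": serialize_tool_calls(reference_calls),
            },
        }
    }


call = {
    "name": "book_tickets",
    "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "2"},
}
body = build_tool_call_valid_input([call], [call])
```

The other tool-use metrics described below accept the same `prediction` and `reference` format, so `serialize_tool_calls` could be reused for them.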

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`. Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`. One of the following: `0`: invalid tool call. `1`: valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

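Parameters
`metric_spec`	Optional: `ToolNameMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.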

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

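Output
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`. Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`. One of the following: `0`: the tool call name doesn't match the reference. `1`: the tool call name matches the reference.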

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

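Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.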

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`. `[0, 1]`, where higher scores mean more parameters match the reference parameters' names.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

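Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec`. Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance`. Evaluation input, consisting of LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.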

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

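Output
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`. Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`. `[0, 1]`, where higher scores mean more parameters match the reference parameters' names and values.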
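
To build intuition for a score in `[0, 1]` that rises as more parameter names and values match, the sketch below computes the fraction of reference arguments reproduced in the prediction. This is an illustrative approximation only: the exact formula the service uses is not given here, `kv_match_score` is a hypothetical name, and comparing only the first tool call of each side is a simplifying assumption.

```python
import json


def kv_match_score(prediction, reference):
    """Fraction of reference arguments whose name and value appear in prediction.

    Assumptions: both inputs are JSON strings with a `tool_calls` list, and
    only the first tool call of each is compared.
    """
    pred_calls = json.loads(prediction).get("tool_calls", [])
    ref_calls = json.loads(reference).get("tool_calls", [])
    if not pred_calls or not ref_calls:
        return 0.0
    pred_args = pred_calls[0].get("arguments", {})
    ref_args = ref_calls[0].get("arguments", {})
    if not ref_args:
        return 0.0
    matched = sum(
        1
        for key, value in ref_args.items()
        if key in pred_args and pred_args[key] == value
    )
    return matched / len(ref_args)


reference = json.dumps({"content": "", "tool_calls": [
    {"name": "book_tickets", "arguments": {"date": "2024-03-30", "num_tix": "2"}}]})
prediction = json.dumps({"content": "", "tool_calls": [
    {"name": "book_tickets", "arguments": {"date": "2024-03-30", "num_tix": "4"}}]})
score = kv_match_score(prediction, reference)
```

Here one of the two reference arguments (`date`) matches in both name and value, so the sketch yields 0.5.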

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

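Parameters
`metric_spec`	Optional: `CometSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance`. Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. It is in the same language as the prediction.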

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`. `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

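Parameters
`metric_spec`	Optional: `MetricxSpec`. Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string`. One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text input. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by Quality Estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, "es".
`instance`	Optional: `MetricxInstance`. Evaluation input, consisting of LLM response and reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. It is in the same language as the prediction.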
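
Which instance fields matter depends on `metric_spec.version`. The sketch below encodes the requirements implied by the version descriptions above (reference for `METRICX_24_REF`, source for `METRICX_24_SRC`, both for `METRICX_24_SRC_REF`) and checks them before a request body is built. `build_metricx_input` is a hypothetical helper, and whether the service itself rejects requests with missing fields is not stated here.

```python
REQUIRED_FIELDS = {
    "METRICX_24_REF": ("prediction", "reference"),
    "METRICX_24_SRC": ("prediction", "source"),
    "METRICX_24_SRC_REF": ("prediction", "source", "reference"),
}


def build_metricx_input(version, **instance):
    """Build a `metricx_input` body after checking version-specific fields."""
    if version not in REQUIRED_FIELDS:
        raise ValueError(f"unknown MetricX version: {version}")
    missing = [name for name in REQUIRED_FIELDS[version] if not instance.get(name)]
    if missing:
        raise ValueError(f"{version} needs: {', '.join(missing)}")
    return {
        "metricx_input": {
            "metric_spec": {"version": version},
            "instance": instance,
        }
    }


body = build_metricx_input(
    "METRICX_24_SRC", prediction="Hola mundo", source="Hello world"
)
```

Failing early on the client side keeps a missing `reference` or `source` from turning into a harder-to-read error or a misleading score later.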

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`. `[0, 25]`, where 0 represents a perfect translation.

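Examples

Evaluate an output

The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following: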

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

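Evaluate an output: pairwise summarization quality

The following example shows how to call the Gen AI Evaluation Service API to evaluate an LLM's output with a pairwise summarization quality comparison.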

REST

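Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model's LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: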

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
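The URL and request body above can also be assembled programmatically, which avoids hand-editing JSON and mistakes such as trailing commas. The following is a minimal sketch that uses only the Python standard library. The helper names `build_evaluate_url` and `build_pairwise_request`, the project ID `my-project`, and the sample strings are hypothetical placeholders, not part of any SDK.

```python
import json


def build_evaluate_url(project_id: str, location: str) -> str:
    """Return the v1beta1 evaluateInstances URL for a project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )


def build_pairwise_request(prediction, baseline_prediction, instruction, context):
    """Return the pairwise summarization quality request body as a dict."""
    return {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }


# Hypothetical sample values; replace them with your own data.
body = build_pairwise_request(
    prediction="The city will make buses and trains better.",
    baseline_prediction="The city plans a transit overhaul.",
    instruction="Summarize the text such that a five-year-old can understand.",
    context="A major city has revealed plans to overhaul its public transportation system.",
)
url = build_evaluate_url("my-project", "us-central1")
request_json = json.dumps(body, indent=2)

print(url)
print(request_json)
```

Writing `request_json` to a file named request.json produces a body that can be sent with the commands shown in the following sections.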

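To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: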

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

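Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: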

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
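
The sample above evaluates a single prompt, so the result table has one row. When the dataset has several rows, the per-row choices can be tallied into win rates. The following is a minimal sketch in plain Python; the function name `win_rates` and the list of choices are hypothetical, standing in for the values of the `pairwise_summarization_quality/pairwise_choice` column.

```python
from collections import Counter


def win_rates(choices):
    """Return the fraction of rows won by each label in `choices`."""
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical values; with the SDK these would come from the
# "pairwise_summarization_quality/pairwise_choice" column of metrics_table.
choices = ["CANDIDATE", "BASELINE", "CANDIDATE", "CANDIDATE"]
rates = win_rates(choices)

for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```

A tally like this makes no assumption about which labels appear, so any other label the service returns is counted the same way.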

Go

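Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.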

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

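Get ROUGE score

The following example calls the Gen AI Evaluation Service API to get the ROUGE score of a prediction generated from a number of inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.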

REST

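Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response for reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. See metric_spec.rouge_type for acceptable values.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. See metric_spec.use_stemmer for acceptable values.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. See metric_spec.split_summaries for acceptable values.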
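
To make the role of the ROUGE type concrete, the following is a simplified sketch of a unigram-overlap (ROUGE-1 style) score in plain Python. It is an illustration only and not the service's implementation: the tokenizer is a hypothetical lowercase alphanumeric split, no stemming is applied, and the source does not state which of precision, recall, or F-measure the service reports.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(prediction, reference):
    """Return (precision, recall, f_measure) for unigram overlap."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: a token is matched at most as many times as it
    # occurs in both the prediction and the reference.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


precision, recall, f_measure = rouge1("the cat sat", "the cat sat on the mat")
print(f"precision={precision:.2f} recall={recall:.2f} f={f_measure:.2f}")
# prints: precision=1.00 recall=0.50 f=0.67
```

A stemmer would additionally map word variants such as "commuting" and "commute" to a common stem before counting, which is the effect that USE_STEMMER controls.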

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
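
USE_STEMMER and SPLIT_SUMMARIES appear unquoted in the body, so they are JSON booleans rather than strings. The following minimal sketch builds the body in Python and lets `json.dumps` handle the serialization. The helper name `build_rouge_request` and the sample sentences are hypothetical; `instances` is written as a list to match the repeated `Instances` field in the Go sample on this page, and `rouge1` is the type used in that sample.

```python
import json


def build_rouge_request(prediction, reference, rouge_type, use_stemmer, split_summaries):
    """Return the rouge_input request body as a dict."""
    return {
        "rouge_input": {
            "instances": [
                {"prediction": prediction, "reference": reference},
            ],
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
        }
    }


# Hypothetical sample values; replace them with your own data.
body = build_rouge_request(
    prediction="The reef faces threats from climate change.",
    reference="The Great Barrier Reef faces serious threats from climate change.",
    rouge_type="rouge1",
    use_stemmer=True,
    split_summaries=False,
)
request_json = json.dumps(body, indent=2)
print(request_json)
```

Python's `True` and `False` are emitted as `true` and `false`, so the resulting text is valid JSON that can be saved as request.json.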

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

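Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: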

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

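Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.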

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.