Gen AI Evaluation Service API

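The Gen AI Evaluation Service lets you evaluate your large language models (LLMs) across several metrics, using your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI Evaluation Service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes prediction results directly from models as input, it can perform both inference and the subsequent evaluation for all models supported by Vertex AI.

For more information about evaluating a model, see the Gen AI Evaluation Service overview.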

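Limitations

The evaluation service has the following limitations:

The evaluation service might have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI Evaluation Service uses gemini-2.0-flash as the underlying judge model to compute these metrics.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.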
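
One way to tolerate a first-call propagation delay is to retry the call with an increasing pause. The following is a minimal sketch, not part of the service: `call_with_retry` and its parameters are hypothetical names, and treating any exception as retryable is an assumption, since this page does not say how the delay surfaces to the caller.

```python
import time


def call_with_retry(send, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out.

    The pause doubles after each failure: base_delay, 2 * base_delay, ...
    Catching every exception is an assumption made for illustration; narrow
    it to the errors you actually observe.
    """
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)
```

In practice `send` would wrap the HTTP request, for example `lambda: session.post(uri, json=data)`.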

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

# Placeholder values: replace with your own project ID and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

# Obtain Application Default Credentials with the Cloud Platform scope.
creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# Build the regional endpoint URL for the evaluateInstances method.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput` Input to assess whether the prediction matches the reference exactly.
`bleu_input`	Optional: `BleuInput` Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput` Input to compute `rouge` scores by comparing the prediction against the reference. `rouge_type` supports different `rouge` scores.
`fluency_input`	Optional: `FluencyInput` Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput` Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	Optional: `SafetyInput` Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput` Input to assess a single response's ability to provide or reference only information included in the input text.
`fulfillment_input`	Optional: `FulfillmentInput` Input to assess a single response's ability to completely fulfill the instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput` Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput` Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput` Input to assess a single response's ability to provide a summary that contains the details needed to substitute for the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput` Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput` Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput` Input to compare the overall ability of two responses to answer questions, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput` Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput` Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput` Input to assess a single response's ability to answer a question correctly.
`pointwise_metric_input`	Optional: `PointwiseMetricInput` Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput` Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput` Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput` Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput` Input to assess a single response's ability to predict a tool call with the correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput` Input to assess a single response's ability to predict a tool call with the correct parameter names and values.
`comet_input`	Optional: `CometInput` Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput` Input to evaluate using MetricX.

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `ExactMatchSpec` Metric spec, defining the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]` Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]` Evaluation results per instance input.
`exact_match_metric_values.score`	`float` One of the following: `0`: the instance was not an exact match `1`: exact match
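
The request and response shapes above can be sketched as two small helpers. This is an illustrative sketch only: `build_exact_match_request` and `exact_match_scores` are hypothetical names, not part of any Google library, and the code only builds and reads plain dictionaries that follow the field layout documented in this section.

```python
def build_exact_match_request(pairs):
    """Build an exact_match_input body from (prediction, reference) pairs."""
    return {
        "exact_match_input": {
            "metric_spec": {},
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


def exact_match_scores(response):
    """Extract the per-instance scores from a parsed response body."""
    values = response["exact_match_results"]["exact_match_metric_values"]
    return [value["score"] for value in values]
```

The dictionary returned by `build_exact_match_request` can be passed as the JSON body of the evaluation call, and `exact_match_scores` applied to the decoded JSON response.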

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

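Parameters
`metric_spec`	Optional: `BleuSpec` Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool` Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]` Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.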

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]` Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`, where a higher score means the prediction is closer to the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

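Parameters
`metric_spec`	Optional: `RougeSpec` Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	Optional: `string` Acceptable values: `rouge[1-9]`: computes `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: computes `rouge` scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences, then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool` Whether to use the Porter stemmer to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool` Whether to add newlines between sentences for rougeLsum.
`instances`	Optional: `RougeInstance[]` Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.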
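
Because `rouge_type` accepts only a fixed set of values, it can help to check it on the client before sending a request. The following is a minimal sketch under that assumption: `build_rouge_request` is a hypothetical helper, its default argument values are arbitrary choices for illustration, and the pattern simply encodes the acceptable values listed above.

```python
import re

# Acceptable values from the parameter table: rouge1..rouge9, rougeL, rougeLsum.
ROUGE_TYPE_PATTERN = re.compile(r"rouge[1-9]|rougeL|rougeLsum")


def build_rouge_request(prediction, reference, rouge_type="rougeLsum",
                        use_stemmer=False, split_summaries=False):
    """Build a rouge_input body for a single prediction/reference pair."""
    if not ROUGE_TYPE_PATTERN.fullmatch(rouge_type):
        raise ValueError(f"unsupported rouge_type: {rouge_type!r}")
    return {
        "rouge_input": {
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
            "instances": [
                {"prediction": prediction, "reference": reference}
            ],
        }
    }
```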

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]` Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`, where a higher score means the prediction is closer to the reference.

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `FluencyInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Inarticulate `2`: Somewhat inarticulate `3`: Neutral `4`: Somewhat fluent `5`: Fluent
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `CoherenceInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Incoherent `2`: Somewhat incoherent `3`: Neutral `4`: Somewhat coherent `5`: Coherent
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SafetyInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Unsafe `1`: Safe
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

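Parameter	Description
`metric_spec`	Optional: `GroundednessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `GroundednessInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.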

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Ungrounded `1`: Grounded
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

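Parameters
`metric_spec`	Optional: `FulfillmentSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `FulfillmentInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.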

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: No fulfillment `2`: Poor fulfillment `3`: Some fulfillment `4`: Good fulfillment `5`: Complete fulfillment
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

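Parameters
`metric_spec`	Optional: `SummarizationQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.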

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Very bad `2`: Bad `3`: OK `4`: Good `5`: Very good
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

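Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string` Baseline model LLM response.
`instance.prediction`	Optional: `string` Candidate model LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.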

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

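Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: The baseline prediction is better. `CANDIDATE`: The candidate prediction is better. `TIE`: The baseline and candidate predictions are tied.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.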
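
A pairwise result covers one instance, so comparing two models over a dataset means aggregating many `pairwise_choice` values. The following is a minimal sketch of one way to do that. `win_rates` is a hypothetical helper, and reporting simple proportions is an illustrative choice rather than something the service prescribes.

```python
from collections import Counter

CHOICES = ("BASELINE", "CANDIDATE", "TIE")


def win_rates(choices):
    """Return the share of BASELINE, CANDIDATE, and TIE outcomes."""
    if not choices:
        raise ValueError("choices must not be empty")
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in CHOICES}
```

For example, `win_rates(["CANDIDATE", "TIE"])` reports half of the outcomes as candidate wins and half as ties.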

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

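Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.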

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Unhelpful `2`: Somewhat unhelpful `3`: Neutral `4`: Somewhat helpful `5`: Helpful
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

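Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.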

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `-2`: Terse `-1`: Somewhat terse `0`: Optimal `1`: Somewhat verbose `2`: Verbose
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

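Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.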

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Very bad `2`: Bad `3`: OK `4`: Good `5`: Very good
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

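Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string` Baseline model LLM response.
`instance.prediction`	Optional: `string` Candidate model LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.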

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

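Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: The baseline prediction is better. `CANDIDATE`: The candidate prediction is better. `TIE`: The baseline and candidate predictions are tied.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.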

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

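Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.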

`QuestionAnsweringRelevanceResult`

{
  "question_answering_relevance_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Irrelevant `2`: Somewhat irrelevant `3`: Neutral `4`: Somewhat relevant `5`: Relevant
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

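Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.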

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Unhelpful `2`: Somewhat unhelpful `3`: Neutral `4`: Somewhat helpful `5`: Helpful
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

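Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec` Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	Optional: `bool` Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance` Evaluation input, consisting of the inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.reference`	Optional: `string` Golden LLM response for reference.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that the LLM response can use.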

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Incorrect `1`: Correct
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PointwiseMetricSpec` Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string` A prompt template that defines the metric. It is rendered with the key-value pairs in instance.json_instance.
`instance`	Required: `PointwiseMetricInstance` Evaluation input, consisting of json_instance.
`instance.json_instance`	Optional: `string` The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.
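
Note that `json_instance` is a string, not a nested object, so the key-value pairs have to be serialized before they go into the request body. The following is a minimal sketch of that step. `build_pointwise_request` is a hypothetical helper, and the `{response}` placeholder style in the usage note is an assumption about the template syntax that this section does not spell out.

```python
import json


def build_pointwise_request(metric_prompt_template, values):
    """Build a pointwise_metric_input body.

    `values` is a plain dict; it is serialized to a JSON string because
    the json_instance field is typed as `string`.
    """
    return {
        "pointwise_metric_input": {
            "metric_spec": {
                "metric_prompt_template": metric_prompt_template,
            },
            "instance": {
                "json_instance": json.dumps(values),
            },
        }
    }
```

A call might look like `build_pointwise_request("Rate the fluency of this response from 1 to 5: {response}", {"response": "The sky is blue."})`.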

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pointwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PairwiseMetricSpec` Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string` A prompt template that defines the metric. It is rendered with the key-value pairs in instance.json_instance.
`instance`	Required: `PairwiseMetricInstance` Evaluation input, consisting of json_instance.
`instance.json_instance`	Optional: `string` The key-value pairs in JSON format. For example, {"key_1": "value_1", "key_2": "value_2"}. It is used to render metric_spec.metric_prompt_template.

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pairwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolCallValidSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolCallValidInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.
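
Both `prediction` and `reference` are JSON-serialized strings rather than nested objects, which is easy to get wrong when building a request by hand. The following is a minimal sketch of the serialization step. `serialize_tool_output` and `build_tool_call_valid_request` are hypothetical helper names, and the tool call in the usage note is shortened from the example above.

```python
import json


def serialize_tool_output(content, tool_calls):
    """Serialize model text and tool calls into the documented string form."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


def build_tool_call_valid_request(prediction, reference):
    """Build a tool_call_valid_input body from two serialized strings."""
    return {
        "tool_call_valid_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "reference": reference,
            },
        }
    }
```

For example, `serialize_tool_output("", [{"name": "book_tickets", "arguments": {"num_tix": "2"}}])` produces a string that can be passed as either argument of `build_tool_call_valid_request`.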

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`: Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`: One of the following: `0`: Invalid tool call `1`: Valid tool call

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

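Parameters
`metric_spec`	Optional: `ToolNameMatchSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.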

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`: Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`: One of the following: `0`: The tool call name doesn't match the reference. `1`: The tool call name matches the reference.

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

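Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.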

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`, where a higher score means that more parameters match the names of the reference parameters.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

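Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.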

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`, where a higher score means that more parameters match the names and values of the reference parameters.

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

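Parameters
`metric_spec`	Optional: `CometSpec` Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string` `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string` Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string` Target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance` Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string` Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string` Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string` Ground truth used to compare against the prediction. It is in the same language as the prediction.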

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

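Parameters
`metric_spec`	Optional: `MetricxSpec` Metric spec, defining the metric's behavior.
`metric_spec.version`	Optional: `string` One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by quality estimation (QE), without a reference text. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string` Source language in BCP-47 format. For example, "es".
`metric_spec.target_language`	Optional: `string` Target language in BCP-47 format. For example, "es".
`instance`	Optional: `MetricxInstance` Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string` Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string` Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string` Ground truth used to compare against the prediction. It is in the same language as the prediction.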
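
Since the fields used for evaluation depend on the MetricX version, a client can check the inputs before sending them. The following is a minimal sketch, not part of the service: `build_metricx_request` and `REQUIRED_FIELDS` are hypothetical names, and the mapping from version to fields is inferred from the version descriptions above rather than stated by the API.

```python
# Extra fields each version uses besides `prediction`.
# Inferred from the version descriptions; treat this mapping as an assumption.
REQUIRED_FIELDS = {
    "METRICX_24_REF": ("reference",),
    "METRICX_24_SRC": ("source",),
    "METRICX_24_SRC_REF": ("source", "reference"),
}


def build_metricx_request(version, prediction, source=None, reference=None):
    """Build a metricx_input body, sending only the fields the version uses."""
    if version not in REQUIRED_FIELDS:
        raise ValueError(f"unknown MetricX version: {version!r}")
    provided = {"source": source, "reference": reference}
    missing = [name for name in REQUIRED_FIELDS[version] if provided[name] is None]
    if missing:
        raise ValueError(f"{version} needs: {', '.join(missing)}")
    instance = {"prediction": prediction}
    for name in REQUIRED_FIELDS[version]:
        instance[name] = provided[name]
    return {
        "metricx_input": {
            "metric_spec": {"version": version},
            "instance": instance,
        }
    }
```

When reading results, keep in mind that the two translation metrics point in opposite directions: COMET scores are in `[0, 1]` with 1 as a perfect translation, while MetricX scores are in `[0, 25]` with 0 as a perfect translation.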

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.

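Examples

Evaluate an output

The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following: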

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

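Evaluate an output: pairwise summarization quality

The following example shows how to call the Gen AI Evaluation Service API to evaluate the output of an LLM using a pairwise summarization quality comparison.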

REST

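Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model's LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: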

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
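
The request body above can also be assembled programmatically before it is saved to request.json. The following is a minimal sketch; the helper name `build_pairwise_request` and the sample argument values are hypothetical, while the field names and the URL pattern follow the request shown in this section.

```python
import json


def build_pairwise_request(project_id, location, prediction,
                           baseline_prediction, instruction, context):
    """Return (url, body) for a pairwise summarization quality request."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )
    body = {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }
    return url, body


# Placeholder values for illustration only.
url, body = build_pairwise_request(
    project_id="my-project",
    location="us-central1",
    prediction="The city is improving its buses and trains.",
    baseline_prediction="The city has a transport plan.",
    instruction="Summarize the text such that a five-year-old can understand.",
    context="A major city has revealed plans to overhaul public transportation.",
)

# json.dumps emits strict JSON, so the output never has trailing commas.
request_json = json.dumps(body, indent=2)
print(url)
print(request_json)
```

Serializing with `json.dumps` instead of hand-editing the file avoids syntax slips such as trailing commas, which strict JSON parsers reject.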

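To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and then run the following command: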

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

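Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and then run the following command: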

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...

Go

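Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.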

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

Get a ROUGE score

The following example calls the Gen AI Evaluation Service API to get the ROUGE score of a prediction generated from a number of inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.

REST

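Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response used as the reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. For acceptable values, see metric_spec.rouge_type.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.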
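
To make the effect of these settings easier to picture, the following is a simplified sketch of a ROUGE-1 style score as a unigram-overlap F-measure. It is not the service's implementation: it assumes lowercase whitespace tokenization, applies no stemming, and the `rouge1_f1` helper is hypothetical. Whether the service reports the F-measure, and how it tokenizes text, is not stated in this section.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))  # 5 of 6 tokens overlap on each side -> 0.8333
```

A stemmer would additionally let word forms such as "threat" and "threats" match, which is the kind of matching improvement that USE_STEMMER refers to.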

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and then run the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Save the request body in a file named request.json, and then run the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.
