Gen AI Evaluation Service API

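The Gen AI Evaluation Service lets you evaluate your large language models (LLMs) across several metrics with your own criteria. You can provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI Evaluation Service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, the evaluation service can perform both inference and subsequent evaluation on all models supported by Vertex AI.

For more information on evaluating a model, see the Gen AI Evaluation Service overview.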

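Limitations

The following are limitations of the evaluation service:

The evaluation service may have a propagation delay on your first call.
Most model-based metrics consume gemini-2.0-flash quota, because the Gen AI Evaluation Service uses gemini-2.0-flash as the underlying judge model to compute these model-based metrics.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.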
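
One way to tolerate a delayed first call is to retry with a growing wait between attempts. The following is a minimal sketch under that assumption; the helper name `call_with_retry`, the attempt count, and the delays are hypothetical and are not values documented by the service.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Calls send() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            # Re-raise the last error once the retry budget is spent.
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

In practice, `send` would wrap the HTTP request, and you would likely catch only the error types that indicate a transient failure rather than every exception.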

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

# Replace with your own project ID and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# In an f-string, placeholders are written {NAME}, not the shell-style ${NAME}.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
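
The endpoint is regional: the location appears both in the host name and in the resource path. A small helper can keep the two in sync. This is a sketch; the function name `evaluate_instances_uri` and the sample project ID are hypothetical.

```python
def evaluate_instances_uri(project_id: str, location: str) -> str:
    """Builds the evaluateInstances endpoint URI for a project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )


print(evaluate_instances_uri("my-project", "us-central1"))
```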

Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput`. Input to assess whether the prediction matches the reference exactly.
`bleu_input`	Optional: `BleuInput`. Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput`. Input to compute `rouge` scores by comparing the prediction against the reference. Different `rouge` scores are supported through `rouge_type`.
`fluency_input`	Optional: `FluencyInput`. Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput`. Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	Optional: `SafetyInput`. Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput`. Input to assess a single response's ability to provide or reference information included only in the input text.
`fulfillment_input`	Optional: `FulfillmentInput`. Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput`. Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput`. Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput`. Input to assess a single response's ability to provide a summary that contains the details necessary to substitute for the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput`. Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput`. Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput`. Input to compare the overall ability of two responses to answer questions, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput`. Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput`. Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput`. Input to assess a single response's ability to correctly answer a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput`. Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput`. Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput`. Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput`. Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKVMatchInput`. Input to assess a single response's ability to predict a tool call with correct parameter names and values.
`comet_input`	Optional: `CometInput`. Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput`. Input to evaluate using MetricX.

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `ExactMatchSpec`. Metric spec that defines the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]`. Evaluation input, consisting of the LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.
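
As an illustration, the request body above can be assembled from a list of prediction and reference pairs. This is a sketch; the helper name `exact_match_request` and the sample strings are hypothetical.

```python
import json


def exact_match_request(pairs):
    """Builds an exact_match_input body from (prediction, reference) pairs."""
    return {
        "exact_match_input": {
            "metric_spec": {},
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


body = exact_match_request([("Paris", "Paris"), ("Lyon", "Paris")])
print(json.dumps(body, indent=2))
```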

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]`. Evaluation results per instance input.
`exact_match_metric_values.score`	`float`. One of the following: `0`: the instance was not an exact match. `1`: exact match.
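
Because each instance scores either 0 or 1, the mean of the scores is the fraction of exact matches. The sketch below assumes the snake_case field names shown above; a REST response may use camelCase names instead, in which case the keys need adjusting. The helper name `exact_match_accuracy` is hypothetical.

```python
def exact_match_accuracy(response: dict) -> float:
    """Returns the fraction of instances whose prediction matched exactly."""
    values = response["exact_match_results"]["exact_match_metric_values"]
    if not values:
        raise ValueError("response contains no metric values")
    scores = [value["score"] for value in values]
    return sum(scores) / len(scores)


sample = {
    "exact_match_results": {
        "exact_match_metric_values": [{"score": 1.0}, {"score": 0.0}]
    }
}
print(exact_match_accuracy(sample))
```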

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `BleuSpec`. Metric spec that defines the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool`. Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]`. Evaluation input, consisting of the LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

Output
`bleu_metric_values`	`BleuMetricValue[]`. Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`. Higher scores mean the prediction is more similar to the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

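Parameters
`metric_spec`	Optional: `RougeSpec`. Metric spec that defines the metric's behavior.
`metric_spec.rouge_type`	Optional: `string`. Acceptable values: `rougen[1-9]`: computes `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: computes `rouge` scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences, and then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool`. Whether the Porter stemmer should be used to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool`. Whether to add newlines between sentences for `rougeLsum`.
`instances`	Optional: `RougeInstance[]`. Evaluation input, consisting of the LLM response and reference.
`instances.prediction`	Optional: `string`. LLM response.
`instances.reference`	Optional: `string`. Golden LLM response for reference.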
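
To make the LCS idea behind `rougeL` concrete, the following sketch computes an LCS-based F1 score over lowercased, whitespace-separated tokens. It is illustrative only: the tokenization, the F1 weighting, and the function names are assumptions, and the service's own scoring may differ in its details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based F1 over whitespace tokens (illustrative only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A subsequence keeps token order but allows gaps, so a prediction that preserves the reference's wording in order scores well even when extra words are interleaved.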

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

Output
`rouge_metric_values`	`RougeValue[]`. Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`. Higher scores mean the prediction is more similar to the reference.

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

Parameters
`metric_spec`	Optional: `FluencySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `FluencyInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Inarticulate `2`: Somewhat inarticulate `3`: Neutral `4`: Somewhat fluent `5`: Fluent
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

Parameters
`metric_spec`	Optional: `CoherenceSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `CoherenceInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Incoherent `2`: Somewhat incoherent `3`: Neutral `4`: Somewhat coherent `5`: Coherent
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

Parameters
`metric_spec`	Optional: `SafetySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `SafetyInstance`. Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string`. LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `0`: Unsafe `1`: Safe
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

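Parameter	Description
`metric_spec`	Optional: `GroundednessSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `GroundednessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.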

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `0`: Ungrounded `1`: Grounded
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FulfillmentSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `FulfillmentInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: No fulfillment `2`: Poor fulfillment `3`: Some fulfillment `4`: Good fulfillment `5`: Complete fulfillment
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationQualitySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Very bad `2`: Bad `3`: Ok `4`: Good `5`: Very good
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

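Output

Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: the baseline prediction is better. `CANDIDATE`: the candidate prediction is better. `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.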
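
When you run a pairwise metric over many instances, the individual `pairwise_choice` values are often summarized as a win rate. The sketch below is one way to do that; excluding ties from the denominator is a design choice made here, and the helper name `candidate_win_rate` is hypothetical.

```python
from collections import Counter


def candidate_win_rate(results):
    """Share of non-tied comparisons in which the candidate was preferred."""
    counts = Counter(result["pairwise_choice"] for result in results)
    decided = counts["BASELINE"] + counts["CANDIDATE"]
    if decided == 0:
        # Every comparison was a tie (or there were none), so no rate is defined.
        return None
    return counts["CANDIDATE"] / decided


results = [
    {"pairwise_choice": "CANDIDATE"},
    {"pairwise_choice": "BASELINE"},
    {"pairwise_choice": "CANDIDATE"},
    {"pairwise_choice": "TIE"},
]
print(candidate_win_rate(results))
```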

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Unhelpful `2`: Somewhat unhelpful `3`: Neutral `4`: Somewhat helpful `5`: Helpful
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `-2`: Terse `-1`: Somewhat terse `0`: Optimal `1`: Somewhat verbose `2`: Verbose
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Very bad `2`: Bad `3`: Ok `4`: Good `5`: Very good
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string`. Baseline model LLM response.
`instance.prediction`	Optional: `string`. Candidate model LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

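Output

Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: the baseline prediction is better. `CANDIDATE`: the candidate prediction is better. `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.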

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`QuestionAnsweringRelevanceResult`

{
  "question_answering_relevance_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Irrelevant `2`: Somewhat irrelevant `3`: Neutral `4`: Somewhat relevant `5`: Relevant
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `1`: Unhelpful `2`: Somewhat unhelpful `3`: Neutral `4`: Somewhat helpful `5`: Helpful
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec`. Metric spec that defines the metric's behavior.
`metric_spec.use_reference`	Optional: `bool`. Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance`. Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string`. LLM response.
`instance.reference`	Optional: `string`. Golden LLM response for reference.
`instance.instruction`	Optional: `string`. Instruction used at inference time.
`instance.context`	Optional: `string`. Inference-time text containing all the information that the LLM response can use.

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

Output
`score`	`float`: One of the following: `0`: Incorrect `1`: Correct
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]`. Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PointwiseMetricSpec`. Metric spec that defines the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template that defines the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PointwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format, for example `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.
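
Note that `json_instance` is a string, not a nested object, so the key-value pairs are serialized before they are placed in the request. The sketch below builds such a request and previews the rendered prompt locally. The `{key}` placeholder syntax, the template text, and the sample values are assumptions made for illustration; the service performs the actual rendering.

```python
import json

# Hypothetical template and values, used only to illustrate the structure.
metric_prompt_template = (
    "Rate the fluency of the response from 1 to 5.\n"
    "Prompt: {prompt}\nResponse: {response}"
)
values = {"prompt": "Say hello.", "response": "Hello there!"}

body = {
    "pointwise_metric_input": {
        "metric_spec": {"metric_prompt_template": metric_prompt_template},
        # json_instance is a string, so the dict is serialized first.
        "instance": {"json_instance": json.dumps(values)},
    }
}

# Local preview only; it assumes {key} placeholders in the template.
print(metric_prompt_template.format(**values))
```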

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pointwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PairwiseMetricSpec`. Metric spec that defines the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string`. A prompt template that defines the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	Required: `PairwiseMetricInstance`. Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string`. The key-value pairs in JSON format, for example `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pairwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolCallValidSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolCallValidInstance`. Evaluation input, consisting of the LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.
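
Because `prediction` and `reference` are strings rather than nested objects, the `content` and `tool_calls` structure is serialized before it goes into the request. The following sketch shows one way to do that; the helper name `serialize_tool_response` is hypothetical, and the shortened argument list is for illustration only.

```python
import json


def serialize_tool_response(content: str, tool_calls: list) -> str:
    """Serializes a model turn into a JSON string with content and tool_calls."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


call = {
    "name": "book_tickets",
    "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "2"},
}
prediction = serialize_tool_response("", [call])
reference = serialize_tool_response("", [call])

body = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "reference": reference},
    }
}
print(json.dumps(body, indent=2))
```

The same serialized strings can be reused for the other tool metrics, which take `prediction` and `reference` in the same format.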

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

Output
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`: Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`: One of the following: `0`: invalid tool call. `1`: valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

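Parameters
`metric_spec`	Optional: `ToolNameMatchSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance`. Evaluation input, consisting of the LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.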

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

Output
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`: Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`: One of the following: `0`: the tool call name doesn't match the reference. `1`: the tool call name matches the reference.

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

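Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance`. Evaluation input, consisting of the LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.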

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`. Higher scores mean that more parameters match the reference parameters' names.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

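Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec`. Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance`. Evaluation input, consisting of the LLM response and reference.
`instance.prediction`	Optional: `string`. Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string`. Golden model output in the same format as the prediction.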

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`. Higher scores mean that more parameters match the reference parameters' names and values.

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

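Parameters
`metric_spec`	Optional: `CometSpec`. Metric spec that defines the metric's behavior.
`metric_spec.version`	Optional: `string`. `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, 'es'.
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, 'es'.
`instance`	Optional: `CometInstance`. Evaluation input, consisting of the LLM response and reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. It is in the same language as the prediction.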

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

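Parameters
`metric_spec`	Optional: `MetricxSpec`. Metric spec that defines the metric's behavior.
`metric_spec.version`	Optional: `string`. One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text input. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by quality estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string`. Source language in BCP-47 format. For example, 'es'.
`metric_spec.target_language`	Optional: `string`. Target language in BCP-47 format. For example, 'es'.
`instance`	Optional: `MetricxInstance`. Evaluation input, consisting of the LLM response and reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string`. Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string`. Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string`. Ground truth used to compare against the prediction. It is in the same language as the prediction.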
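
The three MetricX versions differ only in which inputs they use alongside the prediction, so the version can be chosen from the data you have. The sketch below encodes that choice; the helper name `pick_metricx_version` is hypothetical.

```python
def pick_metricx_version(has_source: bool, has_reference: bool) -> str:
    """Chooses a MetricX version from the inputs available for an instance."""
    if has_source and has_reference:
        return "METRICX_24_SRC_REF"
    if has_reference:
        return "METRICX_24_REF"
    if has_source:
        # No reference: quality estimation (QE) from the source text alone.
        return "METRICX_24_SRC"
    raise ValueError("MetricX needs a source, a reference, or both")


print(pick_metricx_version(has_source=True, has_reference=False))
```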

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.

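Examples

Evaluate an output

The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following: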

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...
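
The `/mean` and `/std` entries in the summary metrics aggregate the per-row scores from the metrics table. As a rough illustration, the sketch below shows that two hypothetical scores of 5 and 2 would produce the values in the sample output above, assuming the standard deviation is the sample standard deviation (the pandas default). The scores are invented for illustration, and the SDK's exact aggregation is not confirmed here.

```python
import statistics

# Hypothetical per-row scores for one metric (illustrative values only).
scores = [5.0, 2.0]

mean = statistics.mean(scores)
# Sample standard deviation: divides by (n - 1), like pandas' default std().
std = statistics.stdev(scores)

print(f"summarization_quality/mean: {mean}")
print(f"summarization_quality/std:  {std}")
```

With only two rows, a large standard deviation like this simply reflects that the two responses received very different scores.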

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

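Evaluate an output: pairwise summarization quality

The following example shows how to call the Gen AI Evaluation Service API to evaluate the output of an LLM using a pairwise summarization quality comparison.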

REST

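Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region that processes the request.
PREDICTION: The LLM response.
BASELINE_PREDICTION: The baseline model's LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: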

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
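
Hand-edited JSON is easy to break with a stray trailing comma or an unescaped quote. As one possible approach, the request body can be generated from Python values instead. This is a minimal sketch: the helper name `build_pairwise_request` and the sample strings are hypothetical, and only the field names come from the request body above.

```python
import json


def build_pairwise_request(prediction, baseline_prediction, instruction, context):
    """Return the request body for a pairwise summarization quality call."""
    return {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }


# Sample values are placeholders for illustration only.
body = build_pairwise_request(
    prediction="The city will improve its buses and trains.",
    baseline_prediction="A city plans a transit overhaul.",
    instruction="Summarize the text in one sentence.",
    context="A major city has revealed plans to overhaul its public transportation system.",
)

# json.dumps escapes quotes and newlines, so the output is always valid JSON.
request_json = json.dumps(body, indent=2)
print(request_json)
```

The printed text can be saved as request.json and sent with either of the commands that follow.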

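To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: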

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

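Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: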

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...
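
The sample above evaluates a single prompt, so it yields a single choice. With more rows, the per-row choices can be tallied into win rates. The following is a minimal sketch with invented choice values; the `TIE` label is an assumption about the possible outcomes, since only `BASELINE` and `CANDIDATE` appear in the samples on this page.

```python
from collections import Counter

# Hypothetical pairwise_choice values, one per evaluated row.
choices = ["CANDIDATE", "BASELINE", "CANDIDATE", "TIE", "CANDIDATE"]

counts = Counter(choices)
total = len(choices)

# Share of rows in which each model was preferred by the judge model.
candidate_win_rate = counts["CANDIDATE"] / total
baseline_win_rate = counts["BASELINE"] / total

print(f"Candidate win rate: {candidate_win_rate:.2f}")
print(f"Baseline win rate:  {baseline_win_rate:.2f}")
```

With a real result, the list could come from `result.metrics_table["pairwise_summarization_quality/pairwise_choice"].tolist()`.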

Go

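Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.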

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

Get a ROUGE score

The following example calls the Gen AI Evaluation Service API to get the ROUGE score of predictions generated from a number of inputs. The ROUGE input uses a metric_spec, which determines the metric's behavior.
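
To build intuition for what a ROUGE score measures, here is a simplified sketch of a ROUGE-1 F-measure based on unigram overlap. It assumes lowercase whitespace tokenization and no stemming, so it is not the service's implementation and its scores will not match the service's exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure from clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # shared tokens / predicted tokens
    recall = overlap / len(ref_tokens)      # shared tokens / reference tokens
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the reef faces serious threats",
    "the reef faces threats from climate change",
)
print(f"ROUGE-1 F-measure: {score:.4f}")
```

Here four of the five predicted tokens appear among the seven reference tokens, so precision is 4/5 and recall is 4/7. Higher-order variants apply the same idea to n-grams or longest common subsequences.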

REST

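Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region that processes the request.
PREDICTION: The LLM response.
REFERENCE: The golden LLM response used as the reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. For acceptable values, see metric_spec.rouge_type.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body: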

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
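
Because a single ROUGE request can score several predictions, it can help to generate the body from a list of pairs. The following is a minimal sketch, assuming `instances` accepts a list (as the slice of `RougeInstance` values in the Go sample on this page suggests). The helper name `build_rouge_request` and the sample strings are hypothetical.

```python
import json


def build_rouge_request(pairs, rouge_type="rouge1", use_stemmer=False, split_summaries=False):
    """Return a rouge_input request body for (prediction, reference) pairs."""
    return {
        "rouge_input": {
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
        }
    }


# Placeholder (prediction, reference) pairs for illustration only.
pairs = [
    ("The reef faces serious threats.", "The reef is threatened by climate change."),
    ("The reef is very large.", "The reef spans over 2,300 kilometers."),
]
body = build_rouge_request(pairs, use_stemmer=True)

# Python booleans serialize as JSON true/false.
request_json = json.dumps(body, indent=2)
print(request_json)
```

The printed text can be saved as request.json for the commands that follow.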

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %w", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.