Evaluates instances based on a given metric.
Endpoint
POST https://{service-endpoint}/v1beta1/{location}:evaluateInstances
Where {service-endpoint} is one of the supported service endpoints.
Path parameters
location
string
Required. The resource name of the Location to evaluate the instances. Format: projects/{project}/locations/{location}
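The endpoint template and the location format combine into one URL. A minimal sketch of that assembly follows; the helper name `build_evaluate_url`, the host `us-central1-aiplatform.googleapis.com`, and the project ID are placeholder assumptions, not values taken from this page.

```python
def build_evaluate_url(service_endpoint: str, project: str, location: str) -> str:
    """Assemble the evaluateInstances URL from the documented template."""
    # The {location} path parameter is the full resource name, not just the region.
    location_name = f"projects/{project}/locations/{location}"
    return f"https://{service_endpoint}/v1beta1/{location_name}:evaluateInstances"


# Placeholder host and project ID, used only for illustration.
url = build_evaluate_url(
    "us-central1-aiplatform.googleapis.com", "my-project", "us-central1"
)
print(url)
```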
Request body
The request body contains data with the following structure:
metric_inputs
Instances and specs for evaluation. metric_inputs can be only one of the following:
Auto metric instances.
Instances and metric spec for exact match metric.
Instances and metric spec for bleu metric.
Instances and metric spec for rouge metric.
LLM-based metric instance. General text generation metrics, applicable to other categories.
Input for fluency metric.
Input for coherence metric.
Input for safety metric.
Input for groundedness metric.
Input for fulfillment metric.
Input for summarization quality metric.
Input for pairwise summarization quality metric.
Input for summarization helpfulness metric.
Input for summarization verbosity metric.
Input for question answering quality metric.
Input for pairwise question answering quality metric.
Input for question answering relevance metric.
Input for question answering helpfulness metric.
Input for question answering correctness metric.
Input for pointwise metric.
Input for pairwise metric.
Tool call metric instances.
Input for tool call valid metric.
Input for tool name match metric.
Input for tool parameter key match metric.
Input for tool parameter key value match metric.
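Because metric_inputs is a union, a request body should carry exactly one of these inputs. The check below is a minimal client-side sketch. The camelCase keys in `KNOWN_INPUTS` are an assumption derived from the input type names documented on this page, and the set is a deliberately partial sample.

```python
# Assumed JSON keys, derived from type names such as ExactMatchInput.
# This is a partial, illustrative set, not the full list of inputs.
KNOWN_INPUTS = {
    "exactMatchInput",
    "bleuInput",
    "rougeInput",
    "fluencyInput",
    "coherenceInput",
    "safetyInput",
    "groundednessInput",
    "fulfillmentInput",
}


def selected_metric_input(body: dict, allowed: set) -> str:
    """Return the single metric input key set in body, or raise ValueError."""
    present = sorted(key for key in body if key in allowed)
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one metric input, found {len(present)}: {present}"
        )
    return present[0]


chosen = selected_metric_input({"rougeInput": {}}, KNOWN_INPUTS)
print(chosen)
```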
Example request
Python
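A minimal sketch of a request body for the exact match metric, built as plain JSON. The keys `exactMatchInput` and `instances` are assumptions inferred from the type name and the "Repeated exact match instances" description; `metricSpec`, `prediction`, and `reference` come from the JSON representations on this page. Authentication and the HTTP call itself are left out.

```python
import json

# "exactMatchInput" and "instances" are assumed key names (see lead-in).
request_body = {
    "exactMatchInput": {
        "metricSpec": {},  # ExactMatchSpec has no fields
        "instances": [
            {"prediction": "Paris", "reference": "Paris"},
            {"prediction": "Lyon", "reference": "Paris"},
        ],
    }
}

# Serialize the body as it would be sent in the POST request.
payload = json.dumps(request_body, indent=2)
print(payload)
```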
Response body
Response message for EvaluationService.EvaluateInstances.
If successful, the response body contains data with the following structure:
evaluation_results
Evaluation results are served in the same order as presented in EvaluationRequest.instances. evaluation_results can be only one of the following:
Auto metric evaluation results.
Results for exact match metric.
Results for bleu metric.
Results for rouge metric.
LLM-based metric evaluation result. General text generation metrics, applicable to other categories.
Result for fluency metric.
Result for coherence metric.
Result for safety metric.
Result for groundedness metric.
Result for fulfillment metric.
Summarization only metrics.
Result for summarization quality metric.
Result for pairwise summarization quality metric.
Result for summarization helpfulness metric.
Result for summarization verbosity metric.
Question answering only metrics.
Result for question answering quality metric.
Result for pairwise question answering quality metric.
Result for question answering relevance metric.
Result for question answering helpfulness metric.
Result for question answering correctness metric.
Generic metrics.
Result for pointwise metric.
Result for pairwise metric.
Tool call metrics.
Results for tool call valid metric.
Results for tool name match metric.
Results for tool parameter key match metric.
Results for tool parameter key value match metric.
JSON representation |
---|
{ // Union field |
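Since evaluation_results is a union, a response should carry a single result field. The sketch below unpacks that field without hard-coding its name. The helper `unpack_evaluation_result` and every key in the sample response are hypothetical and shown only to illustrate the shape.

```python
def unpack_evaluation_result(response: dict) -> tuple:
    """Return (field_name, value) for the single result field in a response."""
    if len(response) != 1:
        raise ValueError(
            f"expected exactly one result field, found {sorted(response)}"
        )
    # Destructure the only (key, value) pair in the dict.
    ((name, value),) = response.items()
    return name, value


# Hypothetical response body; the key names are illustrative only.
sample_response = {"exactMatchResults": {"exactMatchMetricValues": [{"score": 1.0}]}}
name, value = unpack_evaluation_result(sample_response)
print(name, value)
```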
ExactMatchInput
Input for exact match metric.
Required. Spec for exact match metric.
Required. Repeated exact match instances.
JSON representation |
---|
{ "metricSpec": { object ( |
ExactMatchSpec
This type has no fields.
Spec for exact match metric - returns 1 if the prediction and reference match exactly, otherwise 0.
ExactMatchInstance
Spec for exact match instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Required. Ground truth used to compare against the prediction.
JSON representation |
---|
{ "prediction": string, "reference": string } |
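The scoring rule described for ExactMatchSpec can be reproduced locally as a sanity check. This is a sketch of the stated rule only: it compares raw strings, and whether the service applies any normalization (case, whitespace) is not stated here.

```python
def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if prediction and reference are identical strings, else 0."""
    return 1 if prediction == reference else 0


instances = [
    {"prediction": "Paris", "reference": "Paris"},
    {"prediction": "paris", "reference": "Paris"},  # differs in case
]
scores = [exact_match(i["prediction"], i["reference"]) for i in instances]
print(scores)
```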
BleuInput
Input for bleu metric.
Required. Spec for bleu score metric.
Required. Repeated bleu instances.
JSON representation |
---|
{ "metricSpec": { object ( |
BleuSpec
Spec for bleu score metric - calculates the precision of n-grams in the prediction as compared to the reference - returns a score ranging between 0 and 1.
useEffectiveOrder
boolean
Optional. Whether to use effective order to compute bleu score.
JSON representation |
---|
{ "useEffectiveOrder": boolean } |
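To make "precision of n-grams in the prediction" concrete, the sketch below computes clipped n-gram precision for a single order n. It is not full BLEU (no brevity penalty, no averaging across orders) and is not the service's implementation; whitespace tokenization is an assumption.

```python
from collections import Counter


def clipped_ngram_precision(prediction: str, reference: str, n: int = 1) -> float:
    """Fraction of prediction n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the reference.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    pred_ngrams = Counter(
        tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return overlap / total


print(clipped_ngram_precision("the cat sat", "the cat sat on the mat"))
print(clipped_ngram_precision("the the the", "the cat"))
```

Clipping is what keeps a prediction that repeats one matching word from scoring perfectly.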
BleuInstance
Spec for bleu instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Required. Ground truth used to compare against the prediction.
JSON representation |
---|
{ "prediction": string, "reference": string } |
RougeInput
Input for rouge metric.
Required. Spec for rouge score metric.
Required. Repeated rouge instances.
JSON representation |
---|
{ "metricSpec": { object ( |
RougeSpec
Spec for rouge score metric - calculates the recall of n-grams in prediction as compared to reference - returns a score ranging between 0 and 1.
rougeType
string
Optional. Supported rouge types are rougen[1-9], rougeL, and rougeLsum.
useStemmer
boolean
Optional. Whether to use stemmer to compute rouge score.
splitSummaries
boolean
Optional. Whether to split summaries while using rougeLsum.
JSON representation |
---|
{ "rougeType": string, "useStemmer": boolean, "splitSummaries": boolean } |
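A small sketch of building a RougeSpec with client-side validation of rougeType. It assumes that "rougen[1-9]" denotes the literal strings rouge1 through rouge9; the helper name `make_rouge_spec` is hypothetical.

```python
import re

# Assumption: "rougen[1-9]" means rouge1 ... rouge9.
_ROUGE_TYPE = re.compile(r"rouge(?:[1-9]|L|Lsum)")


def make_rouge_spec(
    rouge_type: str, use_stemmer: bool = False, split_summaries: bool = False
) -> dict:
    """Build a RougeSpec dict, rejecting unsupported rougeType values."""
    if not _ROUGE_TYPE.fullmatch(rouge_type):
        raise ValueError(f"unsupported rougeType: {rouge_type!r}")
    return {
        "rougeType": rouge_type,
        "useStemmer": use_stemmer,
        "splitSummaries": split_summaries,
    }


spec = make_rouge_spec("rougeLsum", split_summaries=True)
print(spec)
```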
RougeInstance
Spec for rouge instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Required. Ground truth used to compare against the prediction.
JSON representation |
---|
{ "prediction": string, "reference": string } |
FluencyInput
Input for fluency metric.
Required. Spec for fluency score metric.
Required. Fluency instance.
JSON representation |
---|
{ "metricSpec": { object ( |
FluencySpec
Spec for fluency score metric.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "version": integer } |
FluencyInstance
Spec for fluency instance.
prediction
string
Required. Output of the evaluated model.
JSON representation |
---|
{ "prediction": string } |
CoherenceInput
Input for coherence metric.
Required. Spec for coherence score metric.
Required. Coherence instance.
JSON representation |
---|
{ "metricSpec": { object ( |
CoherenceSpec
Spec for coherence score metric.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "version": integer } |
CoherenceInstance
Spec for coherence instance.
prediction
string
Required. Output of the evaluated model.
JSON representation |
---|
{ "prediction": string } |
SafetyInput
Input for safety metric.
Required. Spec for safety metric.
Required. Safety instance.
JSON representation |
---|
{ "metricSpec": { object ( |
SafetySpec
Spec for safety metric.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "version": integer } |
SafetyInstance
Spec for safety instance.
prediction
string
Required. Output of the evaluated model.
JSON representation |
---|
{ "prediction": string } |
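The fluency, coherence, and safety inputs share one shape: a spec with an optional version and an instance holding only a prediction. A minimal sketch of a shared builder follows. The `instance` key is an assumption based on the singular "instance" wording in the descriptions, and the helper name is hypothetical.

```python
from typing import Optional


def prediction_only_input(prediction: str, version: Optional[int] = None) -> dict:
    """Build an input dict for metrics that need only a prediction."""
    # version is optional, so it is omitted unless explicitly provided.
    spec = {} if version is None else {"version": version}
    # "instance" is an assumed key name (see lead-in).
    return {"metricSpec": spec, "instance": {"prediction": prediction}}


fluency_input = prediction_only_input("The weather is pleasant today.", version=1)
print(fluency_input)
```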
GroundednessInput
Input for groundedness metric.
Required. Spec for groundedness metric.
Required. Groundedness instance.
JSON representation |
---|
{ "metricSpec": { object ( |
GroundednessSpec
Spec for groundedness metric.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "version": integer } |
GroundednessInstance
Spec for groundedness instance.
prediction
string
Required. Output of the evaluated model.
context
string
Required. Background information, provided as context, that the prediction is compared against.
JSON representation |
---|
{ "prediction": string, "context": string } |
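Both fields of a GroundednessInstance are required, so an empty value is worth catching before the request is sent. A minimal sketch with a hypothetical helper; the empty-string check is a client-side convenience, not documented service behavior.

```python
def groundedness_instance(prediction: str, context: str) -> dict:
    """Build a GroundednessInstance dict, requiring both fields to be non-empty."""
    if not prediction:
        raise ValueError("prediction is required")
    if not context:
        raise ValueError("context is required")
    return {"prediction": prediction, "context": context}


instance = groundedness_instance(
    prediction="The bridge opened in 1937.",
    context="The Golden Gate Bridge opened to traffic in 1937.",
)
print(instance)
```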
FulfillmentInput
Input for fulfillment metric.
Required. Spec for fulfillment score metric.
Required. Fulfillment instance.
JSON representation |
---|
{ "metricSpec": { object ( |
FulfillmentSpec
Spec for fulfillment metric.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "version": integer } |
FulfillmentInstance
Spec for fulfillment instance.
prediction
string
Required. Output of the evaluated model.
instruction
string
Required. Inference instruction prompt that the prediction is compared against.
JSON representation |
---|
{ "prediction": string, "instruction": string } |
SummarizationQualityInput
Input for summarization quality metric.
Required. Spec for summarization quality score metric.
Required. Summarization quality instance.
JSON representation |
---|
{ "metricSpec": { object ( |
SummarizationQualitySpec
Spec for summarization quality score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute summarization quality.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
SummarizationQualityInstance
Spec for summarization quality instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to be summarized.
instruction
string
Required. Summarization prompt for LLM.
JSON representation |
---|
{ "prediction": string, "reference": string, "context": string, "instruction": string } |
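The optional reference on the instance pairs with useReference on the spec. The sketch below sets useReference only when a reference is supplied. That coupling is a design choice of this example, not a rule stated here, and the `instance` key and helper name are assumptions.

```python
from typing import Optional


def summarization_quality_input(
    prediction: str,
    context: str,
    instruction: str,
    reference: Optional[str] = None,
    version: Optional[int] = None,
) -> dict:
    """Build a summarization quality input, including reference only if given."""
    instance = {
        "prediction": prediction,
        "context": context,
        "instruction": instruction,
    }
    # Design choice of this sketch: useReference mirrors whether a reference exists.
    spec = {"useReference": reference is not None}
    if reference is not None:
        instance["reference"] = reference
    if version is not None:
        spec["version"] = version
    # "instance" is an assumed key name (see lead-in).
    return {"metricSpec": spec, "instance": instance}


without_ref = summarization_quality_input("s", "long text", "Summarize.")
with_ref = summarization_quality_input("s", "long text", "Summarize.", reference="gold")
print(without_ref)
print(with_ref)
```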
PairwiseSummarizationQualityInput
Input for pairwise summarization quality metric.
Required. Spec for pairwise summarization quality score metric.
Required. Pairwise summarization quality instance.
JSON representation |
---|
{ "metricSpec": { object ( |
PairwiseSummarizationQualitySpec
Spec for pairwise summarization quality score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute pairwise summarization quality.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
PairwiseSummarizationQualityInstance
Spec for pairwise summarization quality instance.
prediction
string
Required. Output of the candidate model.
baselinePrediction
string
Required. Output of the baseline model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to be summarized.
instruction
string
Required. Summarization prompt for LLM.
JSON representation |
---|
{ "prediction": string, "baselinePrediction": string, "reference": string, "context": string, "instruction": string } |
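A pairwise instance carries two model outputs: the candidate's prediction and the baseline's baselinePrediction. A minimal sketch of a typed wrapper that serializes to the JSON shape above; the class name and snake_case attribute names are local conventions of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairwiseSummarizationInstance:
    prediction: str           # output of the candidate model
    baseline_prediction: str  # output of the baseline model
    context: str              # text to be summarized
    instruction: str          # summarization prompt
    reference: Optional[str] = None  # optional ground truth

    def to_json(self) -> dict:
        """Serialize to the documented JSON field names, omitting an unset reference."""
        body = {
            "prediction": self.prediction,
            "baselinePrediction": self.baseline_prediction,
            "context": self.context,
            "instruction": self.instruction,
        }
        if self.reference is not None:
            body["reference"] = self.reference
        return body


pair = PairwiseSummarizationInstance(
    prediction="Candidate summary.",
    baseline_prediction="Baseline summary.",
    context="Full article text.",
    instruction="Summarize the article in one sentence.",
)
print(pair.to_json())
```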
SummarizationHelpfulnessInput
Input for summarization helpfulness metric.
Required. Spec for summarization helpfulness score metric.
Required. Summarization helpfulness instance.
JSON representation |
---|
{ "metricSpec": { object ( |
SummarizationHelpfulnessSpec
Spec for summarization helpfulness score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute summarization helpfulness.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
SummarizationHelpfulnessInstance
Spec for summarization helpfulness instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to be summarized.
instruction
string
Optional. Summarization prompt for LLM.
JSON representation |
---|
{ "prediction": string, "reference": string, "context": string, "instruction": string } |
SummarizationVerbosityInput
Input for summarization verbosity metric.
Required. Spec for summarization verbosity score metric.
Required. Summarization verbosity instance.
JSON representation |
---|
{ "metricSpec": { object ( |
SummarizationVerbositySpec
Spec for summarization verbosity score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute summarization verbosity.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
SummarizationVerbosityInstance
Spec for summarization verbosity instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to be summarized.
instruction
string
Optional. Summarization prompt for LLM.
JSON representation |
---|
{ "prediction": string, "reference": string, "context": string, "instruction": string } |
QuestionAnsweringQualityInput
Input for question answering quality metric.
Required. Spec for question answering quality score metric.
Required. Question answering quality instance.
JSON representation |
---|
{ "metricSpec": { object ( |
QuestionAnsweringQualitySpec
Spec for question answering quality score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute question answering quality.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
QuestionAnsweringQualityInstance
Spec for question answering quality instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to answer the question.
instruction
string
Required. Question Answering prompt for LLM.
JSON representation |
---|
{ "prediction": string, "reference": string, "context": string, "instruction": string } |
PairwiseQuestionAnsweringQualityInput
Input for pairwise question answering quality metric.
Required. Spec for pairwise question answering quality score metric.
Required. Pairwise question answering quality instance.
JSON representation |
---|
{ "metricSpec": { object ( |
PairwiseQuestionAnsweringQualitySpec
Spec for pairwise question answering quality score metric.
useReference
boolean
Optional. Whether to use instance.reference to compute pairwise question answering quality.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
PairwiseQuestionAnsweringQualityInstance
Spec for pairwise question answering quality instance.
prediction
string
Required. Output of the candidate model.
baselinePrediction
string
Required. Output of the baseline model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Required. Text to answer the question.
instruction
string
Required. Question Answering prompt for LLM.
JSON representation |
---|
{ "prediction": string, "baselinePrediction": string, "reference": string, "context": string, "instruction": string } |
QuestionAnsweringRelevanceInput
Input for question answering relevance metric.
Required. Spec for question answering relevance score metric.
Required. Question answering relevance instance.
JSON representation |
---|
{ "metricSpec": { object ( |
QuestionAnsweringRelevanceSpec
Spec for question answering relevance metric.
useReference
boolean
Optional. Whether to use instance.reference to compute question answering relevance.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
QuestionAnsweringRelevanceInstance
Spec for question answering relevance instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Optional. Text provided as context to answer the question.
instruction
string
Required. The question asked and other instructions in the inference prompt.
JSON representation |
---|
{ "prediction": string, "reference": string, "context": string, "instruction": string } |
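For QuestionAnsweringRelevanceInstance, only prediction and instruction are required; reference and context are optional. The sketch below checks a dict against that split before sending. The helper name and message wording are hypothetical.

```python
REQUIRED_FIELDS = {"prediction", "instruction"}
OPTIONAL_FIELDS = {"reference", "context"}


def check_qa_relevance_instance(instance: dict) -> list:
    """Return a list of problems found in the instance; empty means it looks valid."""
    keys = set(instance)
    problems = [
        f"missing required field: {name}" for name in sorted(REQUIRED_FIELDS - keys)
    ]
    problems += [
        f"unknown field: {name}"
        for name in sorted(keys - REQUIRED_FIELDS - OPTIONAL_FIELDS)
    ]
    return problems


ok = check_qa_relevance_instance(
    {"prediction": "Canberra", "instruction": "What is the capital of Australia?"}
)
bad = check_qa_relevance_instance({"prediction": "Canberra", "question": "Capital?"})
print(ok)
print(bad)
```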
QuestionAnsweringHelpfulnessInput
Input for question answering helpfulness metric.
Required. Spec for question answering helpfulness score metric.
Required. Question answering helpfulness instance.
JSON representation |
---|
{ "metricSpec": { object ( |
QuestionAnsweringHelpfulnessSpec
Spec for question answering helpfulness metric.
useReference
boolean
Optional. Whether to use instance.reference to compute question answering helpfulness.
version
integer
Optional. Which version to use for evaluation.
JSON representation |
---|
{ "useReference": boolean, "version": integer } |
QuestionAnsweringHelpfulnessInstance
Spec for question answering helpfulness instance.
prediction
string
Required. Output of the evaluated model.
reference
string
Optional. Ground truth used to compare against the prediction.
context
string
Optional. Text provided as context to answer the question.
instruction
string
Required. The question asked and other instructions in the inference prompt.