Method: projects.locations.evaluateInstances

Evaluates instances based on a given metric.

Endpoint

POST https://{service-endpoint}/v1beta1/{location}:evaluateInstances

Where {service-endpoint} is one of the supported service endpoints.

Path parameters

location string

Required. The resource name of the Location to evaluate the instances. Format: projects/{project}/locations/{location}
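As a minimal sketch of how the endpoint and the location path parameter combine, the helper below assembles the request URL. The function name build_evaluate_instances_url is hypothetical, and the default host pattern {location}-aiplatform.googleapis.com is an assumption about the regional service endpoint; pass your own host if yours differs.

```python
def build_evaluate_instances_url(project, location, service_endpoint=None):
    """Assemble the evaluateInstances URL (hypothetical helper)."""
    # Assumption: regional endpoints follow "{location}-aiplatform.googleapis.com".
    host = service_endpoint or f"{location}-aiplatform.googleapis.com"
    # Path parameter format from this page: projects/{project}/locations/{location}
    parent = f"projects/{project}/locations/{location}"
    return f"https://{host}/v1beta1/{parent}:evaluateInstances"


url = build_evaluate_instances_url("my-project", "us-central1")
print(url)
```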

Request body

The request body contains data with the following structure:

Fields
Union field metric_inputs. Instances and specs for evaluation. metric_inputs can be only one of the following:
exactMatchInput object (ExactMatchInput)

Auto metric instances. Instances and metric spec for exact match metric.

bleuInput object (BleuInput)

Instances and metric spec for bleu metric.

rougeInput object (RougeInput)

Instances and metric spec for rouge metric.

fluencyInput object (FluencyInput)

LLM-based metric instance. General text generation metrics, applicable to other categories. Input for fluency metric.

coherenceInput object (CoherenceInput)

Input for coherence metric.

safetyInput object (SafetyInput)

Input for safety metric.

groundednessInput object (GroundednessInput)

Input for groundedness metric.

fulfillmentInput object (FulfillmentInput)

Input for fulfillment metric.

summarizationQualityInput object (SummarizationQualityInput)

Input for summarization quality metric.

pairwiseSummarizationQualityInput object (PairwiseSummarizationQualityInput)

Input for pairwise summarization quality metric.

summarizationHelpfulnessInput object (SummarizationHelpfulnessInput)

Input for summarization helpfulness metric.

summarizationVerbosityInput object (SummarizationVerbosityInput)

Input for summarization verbosity metric.

questionAnsweringQualityInput object (QuestionAnsweringQualityInput)

Input for question answering quality metric.

pairwiseQuestionAnsweringQualityInput object (PairwiseQuestionAnsweringQualityInput)

Input for pairwise question answering quality metric.

questionAnsweringRelevanceInput object (QuestionAnsweringRelevanceInput)

Input for question answering relevance metric.

questionAnsweringHelpfulnessInput object (QuestionAnsweringHelpfulnessInput)

Input for question answering helpfulness metric.

questionAnsweringCorrectnessInput object (QuestionAnsweringCorrectnessInput)

Input for question answering correctness metric.

pointwiseMetricInput object (PointwiseMetricInput)

Input for pointwise metric.

pairwiseMetricInput object (PairwiseMetricInput)

Input for pairwise metric.

toolCallValidInput object (ToolCallValidInput)

Tool call metric instances. Input for tool call valid metric.

toolNameMatchInput object (ToolNameMatchInput)

Input for tool name match metric.

toolParameterKeyMatchInput object (ToolParameterKeyMatchInput)

Input for tool parameter key match metric.

toolParameterKvMatchInput object (ToolParameterKVMatchInput)

Input for tool parameter key value match metric.
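Because metric_inputs is a union field, a request body should carry exactly one of the fields listed above. The following is a client-side sketch of that check; the helper name selected_metric_input is hypothetical, and the service performs its own validation regardless.

```python
# Field names copied from the union field list on this page.
METRIC_INPUT_FIELDS = frozenset({
    "exactMatchInput", "bleuInput", "rougeInput",
    "fluencyInput", "coherenceInput", "safetyInput",
    "groundednessInput", "fulfillmentInput",
    "summarizationQualityInput", "pairwiseSummarizationQualityInput",
    "summarizationHelpfulnessInput", "summarizationVerbosityInput",
    "questionAnsweringQualityInput", "pairwiseQuestionAnsweringQualityInput",
    "questionAnsweringRelevanceInput", "questionAnsweringHelpfulnessInput",
    "questionAnsweringCorrectnessInput",
    "pointwiseMetricInput", "pairwiseMetricInput",
    "toolCallValidInput", "toolNameMatchInput",
    "toolParameterKeyMatchInput", "toolParameterKvMatchInput",
})


def selected_metric_input(body):
    """Return the single metric input field set in body, or raise ValueError."""
    present = [key for key in body if key in METRIC_INPUT_FIELDS]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one metric input field, found {len(present)}: {present}"
        )
    return present[0]


body = {"fluencyInput": {"metricSpec": {}, "instance": {"prediction": "Hello."}}}
print(selected_metric_input(body))
```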

Example request

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update PROJECT_ID and location
PROJECT_ID = "your-project-id"  # placeholder; replace with your Google Cloud project ID
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city's carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)

Response body

Response message for EvaluationService.EvaluateInstances.

If successful, the response body contains data with the following structure:

Fields
Union field evaluation_results. Evaluation results will be served in the same order as presented in EvaluationRequest.instances. evaluation_results can be only one of the following:
exactMatchResults object (ExactMatchResults)

Auto metric evaluation results. Results for exact match metric.

bleuResults object (BleuResults)

Results for bleu metric.

rougeResults object (RougeResults)

Results for rouge metric.

fluencyResult object (FluencyResult)

LLM-based metric evaluation result. General text generation metrics, applicable to other categories. Result for fluency metric.

coherenceResult object (CoherenceResult)

Result for coherence metric.

safetyResult object (SafetyResult)

Result for safety metric.

groundednessResult object (GroundednessResult)

Result for groundedness metric.

fulfillmentResult object (FulfillmentResult)

Result for fulfillment metric.

summarizationQualityResult object (SummarizationQualityResult)

Summarization only metrics. Result for summarization quality metric.

pairwiseSummarizationQualityResult object (PairwiseSummarizationQualityResult)

Result for pairwise summarization quality metric.

summarizationHelpfulnessResult object (SummarizationHelpfulnessResult)

Result for summarization helpfulness metric.

summarizationVerbosityResult object (SummarizationVerbosityResult)

Result for summarization verbosity metric.

questionAnsweringQualityResult object (QuestionAnsweringQualityResult)

Question answering only metrics. Result for question answering quality metric.

pairwiseQuestionAnsweringQualityResult object (PairwiseQuestionAnsweringQualityResult)

Result for pairwise question answering quality metric.

questionAnsweringRelevanceResult object (QuestionAnsweringRelevanceResult)

Result for question answering relevance metric.

questionAnsweringHelpfulnessResult object (QuestionAnsweringHelpfulnessResult)

Result for question answering helpfulness metric.

questionAnsweringCorrectnessResult object (QuestionAnsweringCorrectnessResult)

Result for question answering correctness metric.

pointwiseMetricResult object (PointwiseMetricResult)

Generic metrics. Result for pointwise metric.

pairwiseMetricResult object (PairwiseMetricResult)

Result for pairwise metric.

toolCallValidResults object (ToolCallValidResults)

Tool call metrics. Results for tool call valid metric.

toolNameMatchResults object (ToolNameMatchResults)

Results for tool name match metric.

toolParameterKeyMatchResults object (ToolParameterKeyMatchResults)

Results for tool parameter key match metric.

toolParameterKvMatchResults object (ToolParameterKVMatchResults)

Results for tool parameter key value match metric.

JSON representation
{

  // Union field evaluation_results can be only one of the following:
  "exactMatchResults": {
    object (ExactMatchResults)
  },
  "bleuResults": {
    object (BleuResults)
  },
  "rougeResults": {
    object (RougeResults)
  },
  "fluencyResult": {
    object (FluencyResult)
  },
  "coherenceResult": {
    object (CoherenceResult)
  },
  "safetyResult": {
    object (SafetyResult)
  },
  "groundednessResult": {
    object (GroundednessResult)
  },
  "fulfillmentResult": {
    object (FulfillmentResult)
  },
  "summarizationQualityResult": {
    object (SummarizationQualityResult)
  },
  "pairwiseSummarizationQualityResult": {
    object (PairwiseSummarizationQualityResult)
  },
  "summarizationHelpfulnessResult": {
    object (SummarizationHelpfulnessResult)
  },
  "summarizationVerbosityResult": {
    object (SummarizationVerbosityResult)
  },
  "questionAnsweringQualityResult": {
    object (QuestionAnsweringQualityResult)
  },
  "pairwiseQuestionAnsweringQualityResult": {
    object (PairwiseQuestionAnsweringQualityResult)
  },
  "questionAnsweringRelevanceResult": {
    object (QuestionAnsweringRelevanceResult)
  },
  "questionAnsweringHelpfulnessResult": {
    object (QuestionAnsweringHelpfulnessResult)
  },
  "questionAnsweringCorrectnessResult": {
    object (QuestionAnsweringCorrectnessResult)
  },
  "pointwiseMetricResult": {
    object (PointwiseMetricResult)
  },
  "pairwiseMetricResult": {
    object (PairwiseMetricResult)
  },
  "toolCallValidResults": {
    object (ToolCallValidResults)
  },
  "toolNameMatchResults": {
    object (ToolNameMatchResults)
  },
  "toolParameterKeyMatchResults": {
    object (ToolParameterKeyMatchResults)
  },
  "toolParameterKvMatchResults": {
    object (ToolParameterKVMatchResults)
  }
  // End of list of possible types for union field evaluation_results.
}
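Since evaluation_results is a union field, a parsed response holds a single populated result field. The sketch below pulls it out of a response dictionary; unpack_evaluation_result is a hypothetical helper, matching on the "Result"/"Results" suffix is an assumption drawn from the field names above, and the payload shown is a placeholder rather than real service output.

```python
def unpack_evaluation_result(response):
    """Return (field_name, payload) for the one populated result field."""
    # Assumption: every union member name ends in "Result" or "Results".
    populated = [
        (name, payload)
        for name, payload in response.items()
        if name.endswith(("Result", "Results"))
    ]
    if len(populated) != 1:
        raise ValueError(f"expected one result field, found {len(populated)}")
    return populated[0]


# Placeholder payload: the fields of FluencyResult are not described in this section.
sample_response = {"fluencyResult": {"placeholder": "FluencyResult fields go here"}}
field_name, payload = unpack_evaluation_result(sample_response)
print(field_name)
```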

ExactMatchInput

Input for exact match metric.

Fields
metricSpec object (ExactMatchSpec)

Required. Spec for exact match metric.

instances[] object (ExactMatchInstance)

Required. Repeated exact match instances.

JSON representation
{
  "metricSpec": {
    object (ExactMatchSpec)
  },
  "instances": [
    {
      object (ExactMatchInstance)
    }
  ]
}

ExactMatchSpec

This type has no fields.

Spec for exact match metric - returns 1 if prediction and reference exactly match, otherwise 0.

ExactMatchInstance

Spec for exact match instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Required. Ground truth used to compare against the prediction.

JSON representation
{
  "prediction": string,
  "reference": string
}
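A short sketch of an exactMatchInput body built from prediction and reference pairs, alongside a local approximation of the score. The helper names are hypothetical, and local_exact_match uses plain string equality; whether the service normalizes text before comparing is not stated here, so treat the local score as illustrative only.

```python
def exact_match_input(pairs):
    """Build an exactMatchInput body from (prediction, reference) pairs."""
    return {
        "exactMatchInput": {
            "metricSpec": {},  # ExactMatchSpec has no fields
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
        }
    }


def local_exact_match(prediction, reference):
    """Illustrative only: 1 if the strings are identical, otherwise 0."""
    return 1 if prediction == reference else 0


pairs = [("Paris", "Paris"), ("paris", "Paris")]
body = exact_match_input(pairs)
scores = [local_exact_match(p, r) for p, r in pairs]
print(scores)
```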

BleuInput

Input for bleu metric.

Fields
metricSpec object (BleuSpec)

Required. Spec for bleu score metric.

instances[] object (BleuInstance)

Required. Repeated bleu instances.

JSON representation
{
  "metricSpec": {
    object (BleuSpec)
  },
  "instances": [
    {
      object (BleuInstance)
    }
  ]
}

BleuSpec

Spec for bleu score metric - calculates the precision of n-grams in the prediction as compared to reference - returns a score ranging between 0 and 1.

Fields
useEffectiveOrder boolean

Optional. Whether to use effective order to compute the bleu score.

JSON representation
{
  "useEffectiveOrder": boolean
}
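To make "precision of n-grams" concrete, here is a sketch of clipped n-gram precision for a single order n. It is a simplified illustration, not the service's implementation: it tokenizes on whitespace and omits the parts of a full bleu score that combine several n-gram orders and penalize short predictions. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count whitespace-delimited n-grams in text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(prediction, reference, n=1):
    """Fraction of prediction n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    pred = ngram_counts(prediction, n)
    ref = ngram_counts(reference, n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_precision = clipped_ngram_precision(prediction, reference, n=1)
bigram_precision = clipped_ngram_precision(prediction, reference, n=2)
print(unigram_precision, bigram_precision)
```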

BleuInstance

Spec for bleu instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Required. Ground truth used to compare against the prediction.

JSON representation
{
  "prediction": string,
  "reference": string
}

RougeInput

Input for rouge metric.

Fields
metricSpec object (RougeSpec)

Required. Spec for rouge score metric.

instances[] object (RougeInstance)

Required. Repeated rouge instances.

JSON representation
{
  "metricSpec": {
    object (RougeSpec)
  },
  "instances": [
    {
      object (RougeInstance)
    }
  ]
}

RougeSpec

Spec for rouge score metric - calculates the recall of n-grams in prediction as compared to reference - returns a score ranging between 0 and 1.

Fields
rougeType string

Optional. Supported rouge types are rougen[1-9], rougeL, and rougeLsum.

useStemmer boolean

Optional. Whether to use stemmer to compute rouge score.

splitSummaries boolean

Optional. Whether to split summaries while using rougeLsum.

JSON representation
{
  "rougeType": string,
  "useStemmer": boolean,
  "splitSummaries": boolean
}
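The recall side can be sketched the same way: the share of reference n-grams that the prediction recovers. This is a simplified, whitespace-tokenized illustration with a hypothetical function name, not the service's rouge implementation. The request body below it uses the rougeLsum type named above; the spec values are illustrative choices.

```python
from collections import Counter


def ngram_recall(prediction, reference, n=1):
    """Fraction of reference n-grams that also occur in the prediction."""
    def counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = counts(prediction), counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, pred[gram]) for gram, count in ref.items())
    return matched / total


prediction = "the cat sat"
reference = "the cat sat on the mat"
unigram_recall = ngram_recall(prediction, reference, n=1)

# Request body for the service; the spec values are illustrative choices.
rouge_body = {
    "rougeInput": {
        "metricSpec": {
            "rougeType": "rougeLsum",
            "useStemmer": True,
            "splitSummaries": True,
        },
        "instances": [{"prediction": prediction, "reference": reference}],
    }
}
print(unigram_recall)
```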

RougeInstance

Spec for rouge instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Required. Ground truth used to compare against the prediction.

JSON representation
{
  "prediction": string,
  "reference": string
}

FluencyInput

Input for fluency metric.

Fields
metricSpec object (FluencySpec)

Required. Spec for fluency score metric.

instance object (FluencyInstance)

Required. Fluency instance.

JSON representation
{
  "metricSpec": {
    object (FluencySpec)
  },
  "instance": {
    object (FluencyInstance)
  }
}

FluencySpec

Spec for fluency score metric.

Fields
version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "version": integer
}

FluencyInstance

Spec for fluency instance.

Fields
prediction string

Required. Output of the evaluated model.

JSON representation
{
  "prediction": string
}

CoherenceInput

Input for coherence metric.

Fields
metricSpec object (CoherenceSpec)

Required. Spec for coherence score metric.

instance object (CoherenceInstance)

Required. Coherence instance.

JSON representation
{
  "metricSpec": {
    object (CoherenceSpec)
  },
  "instance": {
    object (CoherenceInstance)
  }
}

CoherenceSpec

Spec for coherence score metric.

Fields
version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "version": integer
}

CoherenceInstance

Spec for coherence instance.

Fields
prediction string

Required. Output of the evaluated model.

JSON representation
{
  "prediction": string
}

SafetyInput

Input for safety metric.

Fields
metricSpec object (SafetySpec)

Required. Spec for safety metric.

instance object (SafetyInstance)

Required. Safety instance.

JSON representation
{
  "metricSpec": {
    object (SafetySpec)
  },
  "instance": {
    object (SafetyInstance)
  }
}

SafetySpec

Spec for safety metric.

Fields
version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "version": integer
}

SafetyInstance

Spec for safety instance.

Fields
prediction string

Required. Output of the evaluated model.

JSON representation
{
  "prediction": string
}

GroundednessInput

Input for groundedness metric.

Fields
metricSpec object (GroundednessSpec)

Required. Spec for groundedness metric.

instance object (GroundednessInstance)

Required. Groundedness instance.

JSON representation
{
  "metricSpec": {
    object (GroundednessSpec)
  },
  "instance": {
    object (GroundednessInstance)
  }
}

GroundednessSpec

Spec for groundedness metric.

Fields
version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "version": integer
}

GroundednessInstance

Spec for groundedness instance.

Fields
prediction string

Required. Output of the evaluated model.

context string

Required. Background information provided in context used to compare against the prediction.

JSON representation
{
  "prediction": string,
  "context": string
}

FulfillmentInput

Input for fulfillment metric.

Fields
metricSpec object (FulfillmentSpec)

Required. Spec for fulfillment score metric.

instance object (FulfillmentInstance)

Required. Fulfillment instance.

JSON representation
{
  "metricSpec": {
    object (FulfillmentSpec)
  },
  "instance": {
    object (FulfillmentInstance)
  }
}

FulfillmentSpec

Spec for fulfillment metric.

Fields
version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "version": integer
}

FulfillmentInstance

Spec for fulfillment instance.

Fields
prediction string

Required. Output of the evaluated model.

instruction string

Required. Inference instruction prompt to compare prediction with.

JSON representation
{
  "prediction": string,
  "instruction": string
}
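The fluency, coherence, safety, groundedness, and fulfillment inputs share one shape: a spec with an optional version and a single instance with a required prediction, plus context for groundedness and instruction for fulfillment. A sketch of one builder covering all five follows; pointwise_text_input and the EXTRA_REQUIRED table are hypothetical names, not part of any SDK.

```python
# Extra required instance fields per metric, as listed in the sections above.
EXTRA_REQUIRED = {
    "fluency": (),
    "coherence": (),
    "safety": (),
    "groundedness": ("context",),
    "fulfillment": ("instruction",),
}


def pointwise_text_input(metric, prediction, version=None, **extra):
    """Build the request body for one of the five metrics in EXTRA_REQUIRED."""
    required = EXTRA_REQUIRED[metric]
    missing = [name for name in required if name not in extra]
    unexpected = [name for name in extra if name not in required]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    # version is optional, so it is left out of the spec unless provided.
    spec = {} if version is None else {"version": version}
    instance = {"prediction": prediction, **extra}
    return {f"{metric}Input": {"metricSpec": spec, "instance": instance}}


grounded = pointwise_text_input(
    "groundedness",
    prediction="The city will expand its bus network.",
    context="City officials announced an expansion of the bus network.",
)
print(grounded)
```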

SummarizationQualityInput

Input for summarization quality metric.

Fields
metricSpec object (SummarizationQualitySpec)

Required. Spec for summarization quality score metric.

instance object (SummarizationQualityInstance)

Required. Summarization quality instance.

JSON representation
{
  "metricSpec": {
    object (SummarizationQualitySpec)
  },
  "instance": {
    object (SummarizationQualityInstance)
  }
}

SummarizationQualitySpec

Spec for summarization quality score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute summarization quality.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

SummarizationQualityInstance

Spec for summarization quality instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to be summarized.

instruction string

Required. Summarization prompt for LLM.

JSON representation
{
  "prediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}
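A sketch of building a summarizationQualityInput body in which reference is optional. Setting useReference to true only when a reference is supplied is a design choice of this example, not documented service behavior, and the function name is hypothetical.

```python
def summarization_quality_input(prediction, context, instruction,
                                reference=None, version=None):
    """Build a summarizationQualityInput body; reference and version are optional."""
    instance = {
        "prediction": prediction,
        "context": context,
        "instruction": instruction,
    }
    spec = {}
    if reference is not None:
        instance["reference"] = reference
        # Example-specific choice: ask for the reference to be used when present.
        spec["useReference"] = True
    if version is not None:
        spec["version"] = version
    return {"summarizationQualityInput": {"metricSpec": spec, "instance": instance}}


without_reference = summarization_quality_input(
    prediction="A city is overhauling its public transit.",
    context="A major city has revealed plans to overhaul public transportation.",
    instruction="Summarize the text in one sentence.",
)
with_reference = summarization_quality_input(
    prediction="A city is overhauling its public transit.",
    context="A major city has revealed plans to overhaul public transportation.",
    instruction="Summarize the text in one sentence.",
    reference="A major city plans a public transit overhaul.",
)
print(with_reference["summarizationQualityInput"]["metricSpec"])
```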

PairwiseSummarizationQualityInput

Input for pairwise summarization quality metric.

Fields
metricSpec object (PairwiseSummarizationQualitySpec)

Required. Spec for pairwise summarization quality score metric.

instance object (PairwiseSummarizationQualityInstance)

Required. Pairwise summarization quality instance.

JSON representation
{
  "metricSpec": {
    object (PairwiseSummarizationQualitySpec)
  },
  "instance": {
    object (PairwiseSummarizationQualityInstance)
  }
}

PairwiseSummarizationQualitySpec

Spec for pairwise summarization quality score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute pairwise summarization quality.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

PairwiseSummarizationQualityInstance

Spec for pairwise summarization quality instance.

Fields
prediction string

Required. Output of the candidate model.

baselinePrediction string

Required. Output of the baseline model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to be summarized.

instruction string

Required. Summarization prompt for LLM.

JSON representation
{
  "prediction": string,
  "baselinePrediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}
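Pairwise instances carry two model outputs: prediction from the candidate model and baselinePrediction from the baseline. The dataclass below is a sketch of keeping them apart in client code and emitting the JSON field names shown above; the class and its to_json method are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PairwiseSummarizationInstance:
    """Client-side holder for a PairwiseSummarizationQualityInstance."""
    prediction: str            # output of the candidate model
    baseline_prediction: str   # output of the baseline model
    context: str               # text to be summarized
    instruction: str           # summarization prompt
    reference: Optional[str] = None  # optional ground truth

    def to_json(self):
        """Return a dict keyed by the JSON field names, omitting unset optionals."""
        data = {
            "prediction": self.prediction,
            "baselinePrediction": self.baseline_prediction,
            "context": self.context,
            "instruction": self.instruction,
        }
        if self.reference is not None:
            data["reference"] = self.reference
        return data


instance = PairwiseSummarizationInstance(
    prediction="Archaeologists found artifacts from an unknown civilization.",
    baseline_prediction="Some artifacts were found.",
    context="A team of archaeologists has unearthed ancient artifacts.",
    instruction="Summarize the text in one sentence.",
)
pairwise_body = {
    "pairwiseSummarizationQualityInput": {
        "metricSpec": {},
        "instance": instance.to_json(),
    }
}
print(sorted(instance.to_json()))
```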

SummarizationHelpfulnessInput

Input for summarization helpfulness metric.

Fields
metricSpec object (SummarizationHelpfulnessSpec)

Required. Spec for summarization helpfulness score metric.

instance object (SummarizationHelpfulnessInstance)

Required. Summarization helpfulness instance.

JSON representation
{
  "metricSpec": {
    object (SummarizationHelpfulnessSpec)
  },
  "instance": {
    object (SummarizationHelpfulnessInstance)
  }
}

SummarizationHelpfulnessSpec

Spec for summarization helpfulness score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute summarization helpfulness.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

SummarizationHelpfulnessInstance

Spec for summarization helpfulness instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to be summarized.

instruction string

Optional. Summarization prompt for LLM.

JSON representation
{
  "prediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}

SummarizationVerbosityInput

Input for summarization verbosity metric.

Fields
metricSpec object (SummarizationVerbositySpec)

Required. Spec for summarization verbosity score metric.

instance object (SummarizationVerbosityInstance)

Required. Summarization verbosity instance.

JSON representation
{
  "metricSpec": {
    object (SummarizationVerbositySpec)
  },
  "instance": {
    object (SummarizationVerbosityInstance)
  }
}

SummarizationVerbositySpec

Spec for summarization verbosity score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute summarization verbosity.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

SummarizationVerbosityInstance

Spec for summarization verbosity instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to be summarized.

instruction string

Optional. Summarization prompt for LLM.

JSON representation
{
  "prediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}

QuestionAnsweringQualityInput

Input for question answering quality metric.

Fields
metricSpec object (QuestionAnsweringQualitySpec)

Required. Spec for question answering quality score metric.

instance object (QuestionAnsweringQualityInstance)

Required. Question answering quality instance.

JSON representation
{
  "metricSpec": {
    object (QuestionAnsweringQualitySpec)
  },
  "instance": {
    object (QuestionAnsweringQualityInstance)
  }
}

QuestionAnsweringQualitySpec

Spec for question answering quality score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute question answering quality.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

QuestionAnsweringQualityInstance

Spec for question answering quality instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to answer the question.

instruction string

Required. Question Answering prompt for LLM.

JSON representation
{
  "prediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}

PairwiseQuestionAnsweringQualityInput

Input for pairwise question answering quality metric.

Fields
metricSpec object (PairwiseQuestionAnsweringQualitySpec)

Required. Spec for pairwise question answering quality score metric.

instance object (PairwiseQuestionAnsweringQualityInstance)

Required. Pairwise question answering quality instance.

JSON representation
{
  "metricSpec": {
    object (PairwiseQuestionAnsweringQualitySpec)
  },
  "instance": {
    object (PairwiseQuestionAnsweringQualityInstance)
  }
}

PairwiseQuestionAnsweringQualitySpec

Spec for pairwise question answering quality score metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute pairwise question answering quality.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

PairwiseQuestionAnsweringQualityInstance

Spec for pairwise question answering quality instance.

Fields
prediction string

Required. Output of the candidate model.

baselinePrediction string

Required. Output of the baseline model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Required. Text to answer the question.

instruction string

Required. Question Answering prompt for LLM.

JSON representation
{
  "prediction": string,
  "baselinePrediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}

QuestionAnsweringRelevanceInput

Input for question answering relevance metric.

Fields
metricSpec object (QuestionAnsweringRelevanceSpec)

Required. Spec for question answering relevance score metric.

instance object (QuestionAnsweringRelevanceInstance)

Required. Question answering relevance instance.

JSON representation
{
  "metricSpec": {
    object (QuestionAnsweringRelevanceSpec)
  },
  "instance": {
    object (QuestionAnsweringRelevanceInstance)
  }
}

QuestionAnsweringRelevanceSpec

Spec for question answering relevance metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute question answering relevance.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

QuestionAnsweringRelevanceInstance

Spec for question answering relevance instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Optional. Text provided as context to answer the question.

instruction string

Required. The question asked and other instructions in the inference prompt.

JSON representation
{
  "prediction": string,
  "reference": string,
  "context": string,
  "instruction": string
}
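The question answering instance types differ in which fields are required: the quality variants need context, while the relevance variant treats it as optional. A table-driven sketch of a client-side check follows; REQUIRED_FIELDS and missing_required are hypothetical names, and the table only restates the Required markers from the sections above.

```python
# Required fields per instance type, taken from the field descriptions above.
REQUIRED_FIELDS = {
    "QuestionAnsweringQualityInstance": {"prediction", "context", "instruction"},
    "PairwiseQuestionAnsweringQualityInstance": {
        "prediction", "baselinePrediction", "context", "instruction",
    },
    "QuestionAnsweringRelevanceInstance": {"prediction", "instruction"},
}


def missing_required(instance_type, instance):
    """Return the sorted names of required fields absent from instance."""
    return sorted(REQUIRED_FIELDS[instance_type] - set(instance))


instance = {
    "prediction": "Paris is the capital of France.",
    "instruction": "What is the capital of France?",
}
# Valid as a relevance instance, but incomplete as a quality instance.
relevance_gaps = missing_required("QuestionAnsweringRelevanceInstance", instance)
quality_gaps = missing_required("QuestionAnsweringQualityInstance", instance)
print(relevance_gaps, quality_gaps)
```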

QuestionAnsweringHelpfulnessInput

Input for question answering helpfulness metric.

Fields
metricSpec object (QuestionAnsweringHelpfulnessSpec)

Required. Spec for question answering helpfulness score metric.

instance object (QuestionAnsweringHelpfulnessInstance)

Required. Question answering helpfulness instance.

JSON representation
{
  "metricSpec": {
    object (QuestionAnsweringHelpfulnessSpec)
  },
  "instance": {
    object (QuestionAnsweringHelpfulnessInstance)
  }
}

QuestionAnsweringHelpfulnessSpec

Spec for question answering helpfulness metric.

Fields
useReference boolean

Optional. Whether to use instance.reference to compute question answering helpfulness.

version integer

Optional. Which version to use for evaluation.

JSON representation
{
  "useReference": boolean,
  "version": integer
}

QuestionAnsweringHelpfulnessInstance

Spec for question answering helpfulness instance.

Fields
prediction string

Required. Output of the evaluated model.

reference string

Optional. Ground truth used to compare against the prediction.

context string

Optional. Text provided as context to answer the question.

instruction string

Required. The question asked and other instructions in the inference prompt.