Gen AI evaluation service API

This guide shows you how to use the Gen AI evaluation service API to evaluate your large language models (LLMs). It covers the following topics:

  • Metric types: Understand the different categories of evaluation metrics available.
  • Example syntax: See example curl and Python requests to the evaluation API.
  • Parameter details: Learn about the specific parameters for each evaluation metric.
  • Examples: View complete code samples for common evaluation tasks.

You can use the Gen AI evaluation service to evaluate your large language models (LLMs) across several metrics with your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. The service accepts prediction results directly from models, which lets you perform both inference and evaluation on any model supported by Vertex AI.

For more information about evaluating a model, see Gen AI evaluation service overview.

Limitations

The evaluation service has the following limitations:

  • The evaluation service might have a propagation delay on your first call.
  • Most model-based metrics consume gemini-2.0-flash quota because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute them.
  • Some model-based metrics, such as MetricX and COMET, use different machine learning models and don't consume gemini-2.0-flash quota.

Metric types

The Gen AI evaluation service API provides several categories of metrics to evaluate different aspects of your model's performance. The following list provides a high-level overview to help you choose the right metrics for your use case.

  • Lexical Metrics (e.g., bleu, rouge, exact_match): These metrics compute scores based on the textual overlap between the model's prediction and a reference (ground truth) text. They are fast and objective. Use case: Ideal for tasks with a clear "correct" answer, like translation or fact-based question answering, where similarity to a reference is a good proxy for quality.
  • Model-Based Pointwise Metrics (e.g., fluency, safety, groundedness, summarization_quality): These metrics use a judge model to assess the quality of a single model response based on specific criteria (like fluency or safety) without needing a reference answer. Use case: Best for evaluating subjective qualities of generative text where there isn't a single correct answer, such as the creativity, coherence, or safety of a response.
  • Model-Based Pairwise Metrics (e.g., pairwise_summarization_quality): These metrics use a judge model to compare two model responses (for example, from a baseline and a candidate model) and determine which one is better. Use case: Useful for A/B testing and directly comparing the performance of two different models or two versions of the same model on the same task.
  • Tool Use Metrics (e.g., tool_call_valid, tool_name_match): These metrics evaluate the model's ability to correctly use tools (function calls) by checking for valid syntax, correct tool names, and accurate parameters. Use case: Essential for evaluating models that are designed to interact with external APIs or systems through tool calling.
  • Custom Metrics (pointwise_metric, pairwise_metric): These metrics provide a flexible framework to define your own evaluation criteria using a prompt template. The service then uses a judge model to evaluate responses based on your custom instructions. Use case: For specialized evaluation tasks where predefined metrics are insufficient and you need to assess performance against unique, domain-specific requirements.
  • Specialized Metrics (comet, metricx): These are highly specialized metrics designed for specific tasks, primarily machine translation quality. Use case: Use for nuanced evaluation of machine translation tasks that goes beyond simple lexical matching.

Example syntax

The following examples show the syntax for sending an evaluation request.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
    "pointwise_metric_input": {
      "metric_spec": {
        ...
      },
      "instance": {
        ...
      }
    }
  }'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

# TODO(developer): Replace these values with your project ID and region.
PROJECT_ID = 'your-project-id'
LOCATION = 'us-central1'

data = {
  ...
}

uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
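
The data placeholder in the previous request holds exactly one metric input object. The following sketch shows one way to fill it with an exact_match_input payload; the prediction and reference strings are illustrative only.

# One way to fill the `data` placeholder above with an exact_match_input
# payload. The example strings are illustrative.
data = {
    "exact_match_input": {
        "metric_spec": {},
        "instances": [
            {
                "prediction": "The cat sat on the mat.",
                "reference": "The cat sat on the mat.",
            },
            {
                "prediction": "The cat sat on the rug.",
                "reference": "The cat sat on the mat.",
            },
        ],
    }
}
# The response contains one exact_match_metric_values entry per instance,
# each with a score of 0 or 1.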

Parameter details

This section details the request and response objects for each evaluation metric.

Request body

The top-level request body contains one of the following metric input objects.

Parameters

exact_match_input

Optional: ExactMatchInput

Evaluates if the prediction exactly matches the reference.

bleu_input

Optional: BleuInput

Computes the BLEU score by comparing the prediction to the reference.

rouge_input

Optional: RougeInput

Computes ROUGE scores by comparing the prediction to the reference. The rouge_type parameter lets you specify different ROUGE types.

fluency_input

Optional: FluencyInput

Assesses the language fluency of a single response.

coherence_input

Optional: CoherenceInput

Assesses the coherence of a single response.

safety_input

Optional: SafetyInput

Assesses the safety level of a single response.

groundedness_input

Optional: GroundednessInput

Assesses if a response is grounded in the provided context.

fulfillment_input

Optional: FulfillmentInput

Assesses how well a response fulfills the given instructions.

summarization_quality_input

Optional: SummarizationQualityInput

Assesses the overall summarization quality of a response.

pairwise_summarization_quality_input

Optional: PairwiseSummarizationQualityInput

Compares the summarization quality of two responses.

summarization_helpfulness_input

Optional: SummarizationHelpfulnessInput

Assesses if a summary is helpful and contains necessary details from the original text.

summarization_verbosity_input

Optional: SummarizationVerbosityInput

Assesses the verbosity of a summary.

question_answering_quality_input

Optional: QuestionAnsweringQualityInput

Assesses the overall quality of an answer to a question, based on a provided context.

pairwise_question_answering_quality_input

Optional: PairwiseQuestionAnsweringQualityInput

Compares the quality of two answers to a question, based on a provided context.

question_answering_relevance_input

Optional: QuestionAnsweringRelevanceInput

Assesses the relevance of an answer to a question.

question_answering_helpfulness_input

Optional: QuestionAnsweringHelpfulnessInput

Assesses the helpfulness of an answer by checking for key details.

question_answering_correctness_input

Optional: QuestionAnsweringCorrectnessInput

Assesses the correctness of an answer to a question.

pointwise_metric_input

Optional: PointwiseMetricInput

Input for a custom pointwise evaluation.

pairwise_metric_input

Optional: PairwiseMetricInput

Input for a custom pairwise evaluation.

tool_call_valid_input

Optional: ToolCallValidInput

Assesses if the response predicts a valid tool call.

tool_name_match_input

Optional: ToolNameMatchInput

Assesses if the response predicts the correct tool name in a tool call.

tool_parameter_key_match_input

Optional: ToolParameterKeyMatchInput

Assesses if the response predicts the correct parameter names in a tool call.

tool_parameter_kv_match_input

Optional: ToolParameterKvMatchInput

Assesses if the response predicts the correct parameter names and values in a tool call.

comet_input

Optional: CometInput

Input to evaluate using COMET.

metricx_input

Optional: MetricxInput

Input to evaluate using MetricX.

Exact Match (exact_match_input)

Input (ExactMatchInput)

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters

metric_spec

Optional: ExactMatchSpec.

Specifies the metric's behavior.

instances

Optional: ExactMatchInstance[]

One or more evaluation instances, each containing an LLM response and a reference.

instances.prediction

Optional: string

The LLM response.

instances.reference

Optional: string

The ground truth or reference response.

Output (ExactMatchResults)

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

exact_match_metric_values

ExactMatchMetricValue[]

An array of evaluation results, one for each input instance.

exact_match_metric_values.score

float

One of the following:

  • 0: The instance was not an exact match.
  • 1: The instance was an exact match.

BLEU (bleu_input)

Input (BleuInput)

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters

metric_spec

Optional: BleuSpec

Specifies the metric's behavior.

metric_spec.use_effective_order

Optional: bool

Specifies whether to consider n-gram orders that have no match.

instances

Optional: BleuInstance[]

One or more evaluation instances, each containing an LLM response and a reference.

instances.prediction

Optional: string

The LLM response.

instances.reference

Optional: string

The ground truth or reference response.

Output (BleuResults)

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

bleu_metric_values

BleuMetricValue[]

An array of evaluation results, one for each input instance.

bleu_metric_values.score

float: A value in the range [0, 1]. Higher scores mean the prediction is more similar to the reference.

ROUGE (rouge_input)

Input (RougeInput)

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters

metric_spec

Optional: RougeSpec

Specifies the metric's behavior.

metric_spec.rouge_type

Optional: string

Supported values:

  • rougen[1-9]: Computes ROUGE scores based on the overlap of n-grams between the prediction and the reference.
  • rougeL: Computes ROUGE scores based on the Longest Common Subsequence (LCS) between the prediction and the reference.
  • rougeLsum: Splits the prediction and the reference into sentences and then computes the LCS for each tuple. The final rougeLsum score is the average of these individual LCS scores.

metric_spec.use_stemmer

Optional: bool

Specifies whether to use the Porter stemmer to strip word suffixes for better matching.

metric_spec.split_summaries

Optional: bool

Specifies whether to add newlines between sentences for rougeLsum.

instances

Optional: RougeInstance[]

One or more evaluation instances, each containing an LLM response and a reference.

instances.prediction

Optional: string

The LLM response.

instances.reference

Optional: string

The ground truth or reference response.

Output (RougeResults)

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

rouge_metric_values

RougeValue[]

An array of evaluation results, one for each input instance.

rouge_metric_values.score

float: A value in the range [0, 1]. Higher scores mean the prediction is more similar to the reference.

Fluency (fluency_input)

Input (FluencyInput)

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

metric_spec

Optional: FluencySpec

Specifies the metric's behavior.

instance

Optional: FluencyInstance

The evaluation input, which consists of an LLM response.

instance.prediction

Optional: string

The LLM response.

Output (FluencyResult)

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Inarticulate
  • 2: Somewhat Inarticulate
  • 3: Neutral
  • 4: Somewhat fluent
  • 5: Fluent

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].
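
The following sketch sends a fluency evaluation request with the authorized-session pattern from the Example syntax section and reads the returned score, explanation, and confidence. The PROJECT_ID, LOCATION, and prediction values are placeholders.

# Minimal sketch of a fluency evaluation request. The prediction text and
# the PROJECT_ID and LOCATION values are illustrative placeholders.
from google import auth
from google.auth.transport import requests as google_auth_requests

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

data = {
    "fluency_input": {
        "metric_spec": {},
        "instance": {
            "prediction": "The new transit plan aims to cut congestion and emissions."
        },
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
resp = google_auth_requests.AuthorizedSession(creds).post(uri, json=data).json()

# fluency_result carries the judge model's score (1-5), explanation, and confidence.
result = resp["fluency_result"]
print(result["score"], result["confidence"])
print(result["explanation"])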

Coherence (coherence_input)

Input (CoherenceInput)

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

metric_spec

Optional: CoherenceSpec

Specifies the metric's behavior.

instance

Optional: CoherenceInstance

The evaluation input, which consists of an LLM response.

instance.prediction

Optional: string

The LLM response.

Output (CoherenceResult)

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Incoherent
  • 2: Somewhat incoherent
  • 3: Neutral
  • 4: Somewhat coherent
  • 5: Coherent

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Safety (safety_input)

Input (SafetyInput)

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters

metric_spec

Optional: SafetySpec

Specifies the metric's behavior.

instance

Optional: SafetyInstance

The evaluation input, which consists of an LLM response.

instance.prediction

Optional: string

The LLM response.

Output (SafetyResult)

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 0: Unsafe
  • 1: Safe

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Groundedness (groundedness_input)

Input (GroundednessInput)

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

Parameter Description

metric_spec

Optional: GroundednessSpec

Specifies the metric's behavior.

instance

Optional: GroundednessInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (GroundednessResult)

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 0: Ungrounded
  • 1: Grounded

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Fulfillment (fulfillment_input)

Input (FulfillmentInput)

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

Parameters

metric_spec

Optional: FulfillmentSpec

Specifies the metric's behavior.

instance

Optional: FulfillmentInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

Output (FulfillmentResult)

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: No fulfillment
  • 2: Poor fulfillment
  • 3: Some fulfillment
  • 4: Good fulfillment
  • 5: Complete fulfillment

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Summarization Quality (summarization_quality_input)

Input (SummarizationQualityInput)

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

Parameters

metric_spec

Optional: SummarizationQualitySpec

Specifies the metric's behavior.

instance

Optional: SummarizationQualityInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (SummarizationQualityResult)

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Very bad
  • 2: Bad
  • 3: Ok
  • 4: Good
  • 5: Very good

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Pairwise Summarization Quality (pairwise_summarization_quality_input)

Input (PairwiseSummarizationQualityInput)

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

Parameters

metric_spec

Optional: PairwiseSummarizationQualitySpec

Specifies the metric's behavior.

instance

Optional: PairwiseSummarizationQualityInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.baseline_prediction

Optional: string

The baseline model's LLM response.

instance.prediction

Optional: string

The candidate model's LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (PairwiseSummarizationQualityResult)

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output

pairwise_choice

PairwiseChoice: An enum with one of the following values:

  • BASELINE: The baseline prediction is better.
  • CANDIDATE: The candidate prediction is better.
  • TIE: The baseline and candidate predictions are of equal quality.

explanation

string: The reasoning for the assigned pairwise_choice.

confidence

float: A confidence score for the result in the range [0, 1].

Summarization Helpfulness (summarization_helpfulness_input)

Input (SummarizationHelpfulnessInput)

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

Parameters

metric_spec

Optional: SummarizationHelpfulnessSpec

Specifies the metric's behavior.

instance

Optional: SummarizationHelpfulnessInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (SummarizationHelpfulnessResult)

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Unhelpful
  • 2: Somewhat unhelpful
  • 3: Neutral
  • 4: Somewhat helpful
  • 5: Helpful

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Summarization Verbosity (summarization_verbosity_input)

Input (SummarizationVerbosityInput)

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

Parameters

metric_spec

Optional: SummarizationVerbositySpec

Specifies the metric's behavior.

instance

Optional: SummarizationVerbosityInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (SummarizationVerbosityResult)

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float. One of the following:

  • -2: Terse
  • -1: Somewhat terse
  • 0: Optimal
  • 1: Somewhat verbose
  • 2: Verbose

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Question Answering Quality (question_answering_quality_input)

Input (QuestionAnsweringQualityInput)

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string,
    }
  }
}

Parameters

metric_spec

Optional: QuestionAnsweringQualitySpec

Specifies the metric's behavior.

instance

Optional: QuestionAnsweringQualityInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (QuestionAnsweringQualityResult)

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Very bad
  • 2: Bad
  • 3: Ok
  • 4: Good
  • 5: Very good

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Pairwise Question Answering Quality (pairwise_question_answering_quality_input)

Input (PairwiseQuestionAnsweringQualityInput)

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters

metric_spec

Optional: PairwiseQuestionAnsweringQualitySpec

Specifies the metric's behavior.

instance

Optional: PairwiseQuestionAnsweringQualityInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.baseline_prediction

Optional: string

The baseline model's LLM response.

instance.prediction

Optional: string

The candidate model's LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (PairwiseQuestionAnsweringQualityResult)

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output

pairwise_choice

PairwiseChoice: An enum with one of the following values:

  • BASELINE: The baseline prediction is better.
  • CANDIDATE: The candidate prediction is better.
  • TIE: The baseline and candidate predictions are of equal quality.

explanation

string: The reasoning for the assigned pairwise_choice.

confidence

float: A confidence score for the result in the range [0, 1].

Question Answering Relevance (question_answering_relevance_input)

Input (QuestionAnsweringRelevanceInput)

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters

metric_spec

Optional: QuestionAnsweringRelevanceSpec

Specifies the metric's behavior.

instance

Optional: QuestionAnsweringRelevanceInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (QuestionAnsweringRelevanceResult)

{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Irrelevant
  • 2: Somewhat irrelevant
  • 3: Neutral
  • 4: Somewhat relevant
  • 5: Relevant

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Question Answering Helpfulness (question_answering_helpfulness_input)

Input (QuestionAnsweringHelpfulnessInput)

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters

metric_spec

Optional: QuestionAnsweringHelpfulnessSpec

Specifies the metric's behavior.

instance

Optional: QuestionAnsweringHelpfulnessInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (QuestionAnsweringHelpfulnessResult)

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 1: Unhelpful
  • 2: Somewhat unhelpful
  • 3: Neutral
  • 4: Somewhat helpful
  • 5: Helpful

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].

Question Answering Correctness (question_answering_correctness_input)

Input (QuestionAnsweringCorrectnessInput)

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters

metric_spec

Optional: QuestionAnsweringCorrectnessSpec

Specifies the metric's behavior.

metric_spec.use_reference

Optional: bool

Specifies if the reference is used in the evaluation.

instance

Optional: QuestionAnsweringCorrectnessInstance

The evaluation input, which consists of inference inputs and the corresponding response.

instance.prediction

Optional: string

The LLM response.

instance.reference

Optional: string

The ground truth or reference response.

instance.instruction

Optional: string

The instruction provided at inference time.

instance.context

Optional: string

The context provided at inference time that the LLM response can use.

Output (QuestionAnsweringCorrectnessResult)

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output

score

float: One of the following:

  • 0: Incorrect
  • 1: Correct

explanation

string: The reasoning for the assigned score.

confidence

float: A confidence score for the result in the range [0, 1].
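
As a sketch of how metric_spec.use_reference and instance.reference work together, the following request body asks the judge model to check an answer against a reference. All field values are illustrative placeholders.

# Sketch of a question_answering_correctness_input payload with a reference.
# Setting use_reference to true tells the judge model to compare the
# prediction against instance.reference; the field values are illustrative.
data = {
    "question_answering_correctness_input": {
        "metric_spec": {"use_reference": True},
        "instance": {
            "prediction": "The Great Barrier Reef is off the coast of Queensland, Australia.",
            "reference": "The Great Barrier Reef is located off the coast of Queensland, Australia.",
            "instruction": "Where is the Great Barrier Reef located?",
            "context": "The Great Barrier Reef, the world's largest coral reef system, lies off the coast of Queensland, Australia.",
        },
    }
}
# question_answering_correctness_result.score is 0 (incorrect) or 1 (correct).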

Custom Pointwise (pointwise_metric_input)

Input (PointwiseMetricInput)

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string,
    }
  }
}

Parameters

metric_spec

Required: PointwiseMetricSpec

Specifies the metric's behavior.

metric_spec.metric_prompt_template

Required: string

A prompt template that defines the metric. The template is rendered using the key-value pairs in instance.json_instance.

instance

Required: PointwiseMetricInstance

The evaluation input, which consists of a json_instance.

instance.json_instance

Optional: string

A JSON string of key-value pairs (for example, {"key_1": "value_1", "key_2": "value_2"}) used to render the metric_spec.metric_prompt_template.

Output (PointwiseMetricResult)

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string,
  }
}

Output

score

float: A score for the pointwise metric evaluation result.

explanation

string: The reasoning for the assigned score.
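
The following sketch shows how metric_spec.metric_prompt_template and instance.json_instance fit together: the placeholders in the template are rendered from the keys in json_instance. The template wording and key names are assumptions for illustration, not templates shipped with the service.

import json

# Sketch of a custom pointwise metric. The template references keys that
# must appear in json_instance; both are illustrative assumptions.
metric_prompt_template = (
    "Rate the politeness of the following customer-support reply on a scale "
    "of 1 to 5, and explain your rating.\n\n"
    "Customer message: {customer_message}\n"
    "Agent reply: {agent_reply}"
)

json_instance = json.dumps({
    "customer_message": "My order arrived damaged.",
    "agent_reply": "I'm sorry about that. I've issued a replacement right away.",
})

data = {
    "pointwise_metric_input": {
        "metric_spec": {"metric_prompt_template": metric_prompt_template},
        "instance": {"json_instance": json_instance},
    }
}
# pointwise_metric_result returns a score and an explanation produced by the
# judge model according to your template.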

Custom Pairwise (pairwise_metric_input)

Input (PairwiseMetricInput)

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string,
    }
  }
}

Parameters

metric_spec

Required: PairwiseMetricSpec

Specifies the metric's behavior.

metric_spec.metric_prompt_template

Required: string

A prompt template that defines the metric. The template is rendered using the key-value pairs in instance.json_instance.

instance

Required: PairwiseMetricInstance

The evaluation input, which consists of a json_instance.

instance.json_instance

Optional: string

A JSON string of key-value pairs (for example, {"key_1": "value_1", "key_2": "value_2"}) used to render the metric_spec.metric_prompt_template.

Output (PairwiseMetricResult)

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string,
  }
}

Output

score

float: A score for the pairwise metric evaluation result.

explanation

string: The reasoning for the assigned score.
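
A pairwise custom metric follows the same pattern, except that json_instance carries both responses to compare. The key names baseline_response and candidate_response below are illustrative assumptions; use whatever placeholders your own template defines.

import json

# Sketch of a custom pairwise metric. The template compares two responses;
# the key names and template wording are illustrative assumptions.
metric_prompt_template = (
    "You are comparing two summaries of the same article. "
    "Decide which summary is better and explain why.\n\n"
    "Article: {article}\n"
    "Summary A (baseline): {baseline_response}\n"
    "Summary B (candidate): {candidate_response}"
)

json_instance = json.dumps({
    "article": "A major city announced an overhaul of its public transportation system...",
    "baseline_response": "The city will change its transit system.",
    "candidate_response": "The city plans a transit overhaul to cut congestion and emissions.",
})

data = {
    "pairwise_metric_input": {
        "metric_spec": {"metric_prompt_template": metric_prompt_template},
        "instance": {"json_instance": json_instance},
    }
}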

Tool Call Valid (tool_call_valid_input)

Input (ToolCallValidInput)

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters

metric_spec

Optional: ToolCallValidSpec

Specifies the metric's behavior.

instance

Optional: ToolCallValidInstance

The evaluation input, which consists of an LLM response and a reference.

instance.prediction

Optional: string

The candidate model's response. This must be a JSON serialized string containing content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls. For example:

{
  "content": "",
  "tool_calls": [
    {
      "name": "book_tickets",
      "arguments": {
        "movie": "Mission Impossible Dead Reckoning Part 1",
        "theater": "Regal Edwards 14",
        "location": "Mountain View CA",
        "showtime": "7:30",
        "date": "2024-03-30",
        "num_tix": "2"
      }
    }
  ]
}

instance.reference

Optional: string

The ground truth or reference response, in the same format as prediction.

Output (ToolCallValidResults)

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

tool_call_valid_metric_values

ToolCallValidMetricValue[]: An array of evaluation results, one for each input instance.

tool_call_valid_metric_values.score

float: One of the following:

  • 0: Invalid tool call
  • 1: Valid tool call
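
Because instance.prediction and instance.reference must be JSON serialized strings, it is often easiest to build them with json.dumps. The following sketch reuses the book_tickets example above and follows the input schema shown earlier in this section.

import json

# Build the serialized prediction string in the format described above.
# The tool name and arguments reuse the book_tickets example.
tool_call_payload = {
    "content": "",
    "tool_calls": [
        {
            "name": "book_tickets",
            "arguments": {
                "movie": "Mission Impossible Dead Reckoning Part 1",
                "theater": "Regal Edwards 14",
                "location": "Mountain View CA",
                "showtime": "7:30",
                "date": "2024-03-30",
                "num_tix": "2",
            },
        }
    ],
}

prediction = json.dumps(tool_call_payload)  # the instance.prediction string
reference = json.dumps(tool_call_payload)   # the reference uses the same format

data = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "reference": reference},
    }
}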

Tool Name Match (tool_name_match_input)

Input (ToolNameMatchInput)

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters

metric_spec

Optional: ToolNameMatchSpec

Specifies the metric's behavior.

instance

Optional: ToolNameMatchInstance

The evaluation input, which consists of an LLM response and a reference.

instance.prediction

Optional: string

The candidate model's response. This must be a JSON serialized string containing content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.

instance.reference

Optional: string

The ground truth or reference response, in the same format as prediction.

Output (ToolNameMatchResults)

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

tool_name_match_metric_values

ToolNameMatchMetricValue[]: An array of evaluation results, one for each input instance.

tool_name_match_metric_values.score

float: One of the following:

  • 0: The tool call name doesn't match the reference.
  • 1: The tool call name matches the reference.

Tool Parameter Key Match (tool_parameter_key_match_input)

Input (ToolParameterKeyMatchInput)

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters

metric_spec

Optional: ToolParameterKeyMatchSpec

Specifies the metric's behavior.

instance

Optional: ToolParameterKeyMatchInstance

The evaluation input, which consists of an LLM response and a reference.

instance.prediction

Optional: string

The candidate model's response. This must be a JSON serialized string containing content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.

instance.reference

Optional: string

The ground truth or reference response, in the same format as prediction.

Output (ToolParameterKeyMatchResults)

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

tool_parameter_key_match_metric_values

ToolParameterKeyMatchMetricValue[]: An array of evaluation results, one for each input instance.

tool_parameter_key_match_metric_values.score

float: A value in the range [0, 1]. Higher scores mean more parameters match the names of the reference parameters.

Tool Parameter KV Match (tool_parameter_kv_match_input)

Input (ToolParameterKVMatchInput)

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters

metric_spec

Optional: ToolParameterKVMatchSpec

Specifies the metric's behavior.

instance

Optional: ToolParameterKVMatchInstance

The evaluation input, which consists of an LLM response and a reference.

instance.prediction

Optional: string

The candidate model's response. This must be a JSON serialized string containing content and tool_calls keys. The content value is the text output from the model. The tool_calls value is a JSON serialized string of a list of tool calls.

instance.reference

Optional: string

The ground truth or reference response, in the same format as prediction.

Output (ToolParameterKVMatchResults)

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output

tool_parameter_kv_match_metric_values

ToolParameterKVMatchMetricValue[]: An array of evaluation results, one for each input instance.

tool_parameter_kv_match_metric_values.score

float: A value in the range [0, 1]. Higher scores mean more parameters match the names and values of the reference parameters.

COMET (comet_input)

Input (CometInput)

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string,
    },
  }
}

Parameters

metric_spec

Optional: CometSpec

Specifies the metric's behavior.

metric_spec.version

Optional: string

Supported value:

  • COMET_22_SRC_REF: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.

metric_spec.source_language

Optional: string

The source language in BCP-47 format. For example, "es".

metric_spec.target_language

Optional: string

The target language in BCP-47 format. For example, "es".

instance

Optional: CometInstance

The evaluation input. The exact fields used for evaluation depend on the COMET version.

instance.prediction

Optional: string

The candidate model's response, which is the translated text to be evaluated.

instance.source

Optional: string

The source text in the original language, before translation.

instance.reference

Optional: string

The ground truth or reference translation, in the same language as the prediction.

Output (CometResult)

{
  "comet_result" : {
    "score": float
  }
}

Output

score

float: A value in the range [0, 1], where 1 represents a perfect translation.
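
The following request body sketch uses the COMET_22_SRC_REF version, which needs the prediction, the source text, and a reference translation. The Spanish source sentence and its English translations are illustrative only.

# Sketch of a comet_input request body using COMET_22_SRC_REF.
# All sentence values are illustrative.
data = {
    "comet_input": {
        "metric_spec": {
            "version": "COMET_22_SRC_REF",
            "source_language": "es",
            "target_language": "en",
        },
        "instance": {
            "prediction": "The weather is nice today.",
            "source": "Hace buen tiempo hoy.",
            "reference": "The weather is good today.",
        },
    }
}
# comet_result.score is between 0 and 1; higher scores indicate a better translation.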

MetricX (metricx_input)

Input (MetricxInput)

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string,
    },
  }
}

Parameters

metric_spec

Optional: MetricxSpec

Specifies the metric's behavior.

metric_spec.version

Optional:

string

One of the following:

  • METRICX_24_REF: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text input.
  • METRICX_24_SRC: MetricX 24 for translation and source. It evaluates the translation (prediction) by using Quality Estimation (QE), without a reference text input.
  • METRICX_24_SRC_REF: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.

metric_spec.source_language

Optional: string

The source language in BCP-47 format. For example, "es".

metric_spec.target_language

Optional: string

The target language in BCP-47 format. For example, "es".

instance

Optional: MetricxInstance

The evaluation input. The exact fields used for evaluation depend on the MetricX version.

instance.prediction

Optional: string

The candidate model's response, which is the translated text to be evaluated.

instance.source

Optional: string

The source text in the original language that the prediction was translated from.

instance.reference

Optional: string

The ground truth used to compare against the prediction. It is in the same language as the prediction.

Output (MetricxResult)

{
  "metricx_result" : {
    "score": float
  }
}

Output

score

float: A value in the range [0, 25], where 0 represents a perfect translation.
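
For comparison, the following sketch uses the METRICX_24_SRC version, which performs quality estimation from the source text alone and doesn't require a reference. The example sentences are illustrative only.

# Sketch of a metricx_input request body using quality estimation
# (METRICX_24_SRC): only the prediction and the source text are provided.
data = {
    "metricx_input": {
        "metric_spec": {
            "version": "METRICX_24_SRC",
            "source_language": "es",
            "target_language": "en",
        },
        "instance": {
            "prediction": "The weather is nice today.",
            "source": "Hace buen tiempo hoy.",
        },
    }
}
# Unlike COMET, lower MetricX scores are better: 0 is a perfect translation
# and 25 is the worst possible score.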

Examples

Evaluate multiple metrics in one call

The following example shows how to call the Gen AI evaluation service to evaluate the output of an LLM across several pointwise metrics at once, including summarization quality, groundedness, verbosity, and instruction following.

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and uncomment the line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

Evaluate pairwise summarization quality

The following example shows how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: The region to process the request.
  • PREDICTION: LLM response.
  • BASELINE_PREDICTION: Baseline model LLM response.
  • INSTRUCTION: The instruction used at inference time.
  • CONTEXT: The inference-time text containing all of the relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT",
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

Evaluate a ROUGE score

The following example shows how to call the Gen AI evaluation service API to get the ROUGE score for a prediction. The request uses metric_spec to configure the metric's behavior.

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: The region to process the request.
  • PREDICTION: LLM response.
  • REFERENCE: Golden LLM response for reference.
  • ROUGE_TYPE: The calculation used to determine the ROUGE score. See metric_spec.rouge_type for acceptable values.
  • USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
  • SPLIT_SUMMARIES: Determines whether newlines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next