Rapid evaluation API

The rapid evaluation service lets users evaluate their LLMs, both pointwise and pairwise, across several metrics. Users provide inference-time inputs, LLM responses, and additional parameters, and the service returns metrics specific to the evaluation task. Metrics include both model-based metrics (such as SummarizationQuality) and metrics computed in memory (such as ROUGE, BLEU, and tool/function-call metrics). Because the service takes model predictions directly as input, it can evaluate any model supported by Vertex AI.

Limitations

  • Model-based metrics consume text-bison quota. The rapid evaluation service uses text-bison as the underlying arbiter model to compute model-based metrics.
  • The service has a propagation delay: it may be unavailable for several minutes after the first call.

Syntax

  • PROJECT_ID = PROJECT_ID
  • REGION = REGION
  • MODEL_ID = MODEL_ID

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

uri = f'https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

Parameter list

The following is the full list of available metrics.

Parameters

exact_match_input

Optional: ExactMatchInput

Input to assess if the prediction matches the reference exactly.

bleu_input

Optional: BleuInput

Input to compute BLEU score by comparing the prediction against the reference.

rouge_input

Optional: RougeInput

Input to compute ROUGE scores by comparing the prediction against the reference. Different ROUGE scores are supported by rouge_type.

fluency_input

Optional: FluencyInput

Input to assess a single response's language mastery.

coherence_input

Optional: CoherenceInput

Input to assess a single response's ability to provide a coherent, easy-to-follow reply.

safety_input

Optional: SafetyInput

Input to assess a single response's level of safety.

groundedness_input

Optional: GroundednessInput

Input to assess a single response's ability to provide or reference information included only in the input text.

fulfillment_input

Optional: FulfillmentInput

Input to assess a single response's ability to completely fulfill instructions.

summarization_quality_input

Optional: SummarizationQualityInput

Input to assess a single response's overall ability to summarize text.

pairwise_summarization_quality_input

Optional: PairwiseSummarizationQualityInput

Input to compare two responses' overall summarization quality.

summarization_helpfulness_input

Optional: SummarizationHelpfulnessInput

Input to assess a single response's ability to provide a summary that contains the details necessary to substitute for the original text.

summarization_verbosity_input

Optional: SummarizationVerbosityInput

Input to assess a single response's ability to provide a succinct summarization.

question_answering_quality_input

Optional: QuestionAnsweringQualityInput

Input to assess a single response's overall ability to answer questions, given a body of text to reference.

pairwise_question_answering_quality_input

Optional: PairwiseQuestionAnsweringQualityInput

Input to compare two responses' overall ability to answer questions, given a body of text to reference.

question_answering_relevance_input

Optional: QuestionAnsweringRelevanceInput

Input to assess a single response's ability to respond with relevant information when asked a question.

question_answering_helpfulness_input

Optional: QuestionAnsweringHelpfulnessInput

Input to assess a single response's ability to provide key details when answering a question.

question_answering_correctness_input

Optional: QuestionAnsweringCorrectnessInput

Input to assess a single response's ability to correctly answer a question.

tool_call_valid_input

Optional: ToolCallValidInput

Input to assess a single response's ability to predict a valid tool call.

tool_name_match_input

Optional: ToolNameMatchInput

Input to assess a single response's ability to predict a tool call with the right tool name.

tool_parameter_key_match_input

Optional: ToolParameterKeyMatchInput

Input to assess a single response's ability to predict a tool call with correct parameter names.

tool_parameter_kv_match_input

Optional: ToolParameterKvMatchInput

Input to assess a single response's ability to predict a tool call with correct parameter names and values.

ExactMatchInput

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}
Parameters

metric_spec

Optional: ExactMatchSpec.

Metric spec, defining the metric's behavior.

instances

Optional: ExactMatchInstance[]

Evaluation input, consisting of LLM response and reference.

instances.prediction

Optional: string

LLM response.

instances.reference

Optional: string

Golden LLM response for reference.

ExactMatchResults

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

exact_match_metric_values

ExactMatchMetricValue[]

Evaluation results per instance input.

exact_match_metric_values.score

float

One of the following:

  • 0: Instance was not an exact match
  • 1: Exact match
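For intuition, the exact-match computation itself is simple and can be sketched locally (a toy sketch, not the service's implementation):

```python
def exact_match_score(prediction: str, reference: str) -> float:
    """Return 1.0 if the prediction matches the reference exactly, else 0.0."""
    return 1.0 if prediction == reference else 0.0

scores = [
    exact_match_score("Paris", "Paris"),  # 1.0
    exact_match_score("paris", "Paris"),  # 0.0: comparison is case-sensitive
]
```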

BleuInput

{
  "bleu_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}
Parameters

metric_spec

Optional: BleuSpec

Metric spec, defining the metric's behavior.

instances

Optional: BleuInstance[]

Evaluation input, consisting of LLM response and reference.

instances.prediction

Optional: string

LLM response.

instances.reference

Optional: string

Golden LLM response for reference.

BleuResults

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

bleu_metric_values

BleuMetricValue[]

Evaluation results per instance input.

bleu_metric_values.score

float: [0, 1], where higher scores mean the prediction is more like the reference.
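A minimal bleu_input request body, assembled with the field names from the schema above (the example strings are illustrative):

```python
import json

bleu_payload = {
    "bleu_input": {
        "metric_spec": {},
        "instances": [
            {
                "prediction": "The quick brown fox jumps over the dog.",
                "reference": "The quick brown fox jumps over the lazy dog.",
            }
        ],
    }
}

# The body is sent as JSON to the evaluateInstances endpoint.
body = json.dumps(bleu_payload)
```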

RougeInput

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}
Parameters

metric_spec

Optional: RougeSpec

Metric spec, defining the metric's behavior.

metric_spec.rouge_type

Optional: string

Acceptable values:

  • rouge[1-9] (for example, rouge1): compute ROUGE scores based on the overlap of n-grams between the prediction and the reference.
  • rougeL: compute ROUGE scores based on the Longest Common Subsequence (LCS) between the prediction and the reference.
  • rougeLsum: first splits the prediction and the reference into sentences and then computes the LCS for each tuple. The final rougeLsum score is the average of these individual LCS scores.

metric_spec.use_stemmer

Optional: bool

Whether Porter stemmer should be used to strip word suffixes to improve matching.

metric_spec.split_summaries

Optional: bool

Whether to add newlines between sentences for rougeLsum.

instances

Optional: RougeInstance[]

Evaluation input, consisting of LLM response and reference.

instances.prediction

Optional: string

LLM response.

instances.reference

Optional: string

Golden LLM response for reference.

RougeResults

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

rouge_metric_values

RougeValue[]

Evaluation results per instance input.

rouge_metric_values.score

float: [0, 1], where higher scores mean the prediction is more like the reference.
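To build intuition for the n-gram variants, here is a toy unigram-overlap (ROUGE-1-style) F-measure; the service's actual computation (stemming, rougeL, rougeLsum) is more involved:

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Toy ROUGE-1 F-measure: unigram overlap between prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```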

FluencyInput

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
Parameters

metric_spec

Optional: FluencySpec

Metric spec, defining the metric's behavior.

instance

Optional: FluencyInstance

Evaluation input, consisting of LLM response.

instance.prediction

Optional: string

LLM response.

FluencyResult

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Inarticulate
  • 2: Somewhat Inarticulate
  • 3: Neutral
  • 4: Somewhat fluent
  • 5: Fluent

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

CoherenceInput

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
Parameters

metric_spec

Optional: CoherenceSpec

Metric spec, defining the metric's behavior.

instance

Optional: CoherenceInstance

Evaluation input, consisting of LLM response.

instance.prediction

Optional: string

LLM response.

CoherenceResult

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Incoherent
  • 2: Somewhat incoherent
  • 3: Neutral
  • 4: Somewhat coherent
  • 5: Coherent

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

SafetyInput

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
Parameters

metric_spec

Optional: SafetySpec

Metric spec, defining the metric's behavior.

instance

Optional: SafetyInstance

Evaluation input, consisting of LLM response.

instance.prediction

Optional: string

LLM response.

SafetyResult

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 0: Unsafe
  • 1: Safe

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

GroundednessInput

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

Parameters

metric_spec

Optional: GroundednessSpec

Metric spec, defining the metric's behavior.

instance

Optional: GroundednessInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

GroundednessResult

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 0: Ungrounded
  • 1: Grounded

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

FulfillmentInput

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}
Parameters

metric_spec

Optional: FulfillmentSpec

Metric spec, defining the metric's behavior.

instance

Optional: FulfillmentInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

FulfillmentResult

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: No fulfillment
  • 2: Poor fulfillment
  • 3: Some fulfillment
  • 4: Good fulfillment
  • 5: Complete fulfillment

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

SummarizationQualityInput

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: SummarizationQualitySpec

Metric spec, defining the metric's behavior.

instance

Optional: SummarizationQualityInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

SummarizationQualityResult

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Very bad
  • 2: Bad
  • 3: Ok
  • 4: Good
  • 5: Very good

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

PairwiseSummarizationQualityInput

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: PairwiseSummarizationQualitySpec

Metric spec, defining the metric's behavior.

instance

Optional: PairwiseSummarizationQualityInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.baseline_prediction

Optional: string

Baseline model LLM response.

instance.prediction

Optional: string

Candidate model LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

PairwiseSummarizationQualityResult

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}
Output

pairwise_choice

PairwiseChoice: Enum with possible values as follows:

  • BASELINE: Baseline prediction is better
  • CANDIDATE: Candidate prediction is better
  • TIE: Tie between Baseline and Candidate predictions.

explanation

string: Justification for pairwise_choice assignment.

confidence

float: [0, 1] Confidence score of our result.
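Given a response shaped like the output above, picking the winner is a matter of reading pairwise_choice (the response dict below is illustrative, not real service output):

```python
# Hypothetical response, shaped like PairwiseSummarizationQualityResult above.
response = {
    "pairwise_summarization_quality_result": {
        "pairwise_choice": "CANDIDATE",
        "explanation": "The candidate summary covers more of the context.",
        "confidence": 0.9,
    }
}

result = response["pairwise_summarization_quality_result"]
# Map the enum value to a human-readable verdict.
winner = {
    "BASELINE": "baseline model",
    "CANDIDATE": "candidate model",
    "TIE": "tie",
}[result["pairwise_choice"]]
```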

SummarizationHelpfulnessInput

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: SummarizationHelpfulnessSpec

Metric spec, defining the metric's behavior.

instance

Optional: SummarizationHelpfulnessInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

SummarizationHelpfulnessResult

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Unhelpful
  • 2: Somewhat unhelpful
  • 3: Neutral
  • 4: Somewhat helpful
  • 5: Helpful

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

SummarizationVerbosityInput

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: SummarizationVerbositySpec

Metric spec, defining the metric's behavior.

instance

Optional: SummarizationVerbosityInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

SummarizationVerbosityResult

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float. One of the following:

  • -2: Terse
  • -1: Somewhat terse
  • 0: Optimal
  • 1: Somewhat verbose
  • 2: Verbose

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

QuestionAnsweringQualityInput

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: QuestionAnsweringQualitySpec

Metric spec, defining the metric's behavior.

instance

Optional: QuestionAnsweringQualityInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

QuestionAnsweringQualityResult

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Very bad
  • 2: Bad
  • 3: Ok
  • 4: Good
  • 5: Very good

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

PairwiseQuestionAnsweringQualityInput

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: PairwiseQuestionAnsweringQualitySpec

Metric spec, defining the metric's behavior.

instance

Optional: PairwiseQuestionAnsweringQualityInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.baseline_prediction

Optional: string

Baseline model LLM response.

instance.prediction

Optional: string

Candidate model LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

PairwiseQuestionAnsweringQualityResult

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}
Output

pairwise_choice

PairwiseChoice: Enum with possible values as follows:

  • BASELINE: Baseline prediction is better
  • CANDIDATE: Candidate prediction is better
  • TIE: Tie between Baseline and Candidate predictions.

explanation

string: Justification for pairwise_choice assignment.

confidence

float: [0, 1] Confidence score of our result.

QuestionAnsweringRelevanceInput

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: QuestionAnsweringRelevanceSpec

Metric spec, defining the metric's behavior.

instance

Optional: QuestionAnsweringRelevanceInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

QuestionAnsweringRelevanceResult

{
  "question_answering_relevance_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Irrelevant
  • 2: Somewhat irrelevant
  • 3: Neutral
  • 4: Somewhat relevant
  • 5: Relevant

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

QuestionAnsweringHelpfulnessInput

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: QuestionAnsweringHelpfulnessSpec

Metric spec, defining the metric's behavior.

instance

Optional: QuestionAnsweringHelpfulnessInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

QuestionAnsweringHelpfulnessResult

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 1: Unhelpful
  • 2: Somewhat unhelpful
  • 3: Neutral
  • 4: Somewhat helpful
  • 5: Helpful

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

QuestionAnsweringCorrectnessInput

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}
Parameters

metric_spec

Optional: QuestionAnsweringCorrectnessSpec

Metric spec, defining the metric's behavior.

metric_spec.use_reference

Optional: bool

Whether the reference is used in the evaluation.

instance

Optional: QuestionAnsweringCorrectnessInstance

Evaluation input, consisting of inference inputs and corresponding response.

instance.prediction

Optional: string

LLM response.

instance.reference

Optional: string

Golden LLM response for reference.

instance.instruction

Optional: string

Instruction used at inference time.

instance.context

Optional: string

Inference-time text containing all the information that can be used in the LLM response.

QuestionAnsweringCorrectnessResult

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
Output

score

float: One of the following:

  • 0: Incorrect
  • 1: Correct

explanation

string: Justification for score assignment.

confidence

float: [0, 1] Confidence score of our result.

ToolCallValidInput

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
Parameters

metric_spec

Optional: ToolCallValidSpec

Metric spec, defining the metric's behavior.

instance

Optional: ToolCallValidInstance

Evaluation input, consisting of LLM response and reference.

instance.prediction

Optional: string

Candidate model LLM response: a JSON-serialized string with content and tool_calls keys. The content value is the model's text output. The tool_calls value is a list of tool calls. An example is:

{
  "content": "",
  "tool_calls": [
    {
      "name": "book_tickets",
      "arguments": {
        "movie": "Mission Impossible Dead Reckoning Part 1",
        "theater": "Regal Edwards 14",
        "location": "Mountain View CA",
        "showtime": "7:30",
        "date": "2024-03-30",
        "num_tix": "2"
      }
    }
  ]
}

instance.reference

Optional: string

Golden model output in the same format as prediction.
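Because prediction is a JSON-serialized string rather than a nested object, it is typically built with json.dumps; a sketch using the booking example above:

```python
import json

# Serialize the tool-call structure into the string format expected by
# instance.prediction (and instance.reference).
prediction = json.dumps({
    "content": "",
    "tool_calls": [
        {
            "name": "book_tickets",
            "arguments": {
                "movie": "Mission Impossible Dead Reckoning Part 1",
                "theater": "Regal Edwards 14",
                "location": "Mountain View CA",
                "showtime": "7:30",
                "date": "2024-03-30",
                "num_tix": "2",
            },
        }
    ],
})
```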

ToolCallValidResults

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

tool_call_valid_metric_values

repeated ToolCallValidMetricValue: Evaluation results per instance input.

tool_call_valid_metric_values.score

float: One of the following:

  • 0: Invalid tool call
  • 1: Valid tool call

ToolNameMatchInput

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
Parameters

metric_spec

Optional: ToolNameMatchSpec

Metric spec, defining the metric's behavior.

instance

Optional: ToolNameMatchInstance

Evaluation input, consisting of LLM response and reference.

instance.prediction

Optional: string

Candidate model LLM response: a JSON-serialized string with content and tool_calls keys. The content value is the model's text output. The tool_calls value is a list of tool calls.

instance.reference

Optional: string

Golden model output in the same format as prediction.

ToolNameMatchResults

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

tool_name_match_metric_values

repeated ToolNameMatchMetricValue: Evaluation results per instance input.

tool_name_match_metric_values.score

float: One of the following:

  • 0: Tool call name doesn't match the reference.
  • 1: Tool call name matches the reference.

ToolParameterKeyMatchInput

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
Parameters

metric_spec

Optional: ToolParameterKeyMatchSpec

Metric spec, defining the metric's behavior.

instance

Optional: ToolParameterKeyMatchInstance

Evaluation input, consisting of LLM response and reference.

instance.prediction

Optional: string

Candidate model LLM response: a JSON-serialized string with content and tool_calls keys. The content value is the model's text output. The tool_calls value is a list of tool calls.

instance.reference

Optional: string

Golden model output in the same format as prediction.

ToolParameterKeyMatchResults

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

tool_parameter_key_match_metric_values

repeated ToolParameterKeyMatchMetricValue: Evaluation results per instance input.

tool_parameter_key_match_metric_values.score

float: [0, 1], where higher scores mean more parameters match the reference parameters' names.

ToolParameterKVMatchInput

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
Parameters

metric_spec

Optional: ToolParameterKVMatchSpec

Metric spec, defining the metric's behavior.

instance

Optional: ToolParameterKVMatchInstance

Evaluation input, consisting of LLM response and reference.

instance.prediction

Optional: string

Candidate model LLM response: a JSON-serialized string with content and tool_calls keys. The content value is the model's text output. The tool_calls value is a list of tool calls.

instance.reference

Optional: string

Golden model output in the same format as prediction.

ToolParameterKVMatchResults

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}
Output

tool_parameter_kv_match_metric_values

repeated ToolParameterKVMatchMetricValue: Evaluation results per instance input.

tool_parameter_kv_match_metric_values.score

float: [0, 1], where higher scores mean more parameters match the reference parameters' names and values.
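As a rough mental model (not the service's exact algorithm), the key-value match score can be thought of as the fraction of reference parameters whose name and value both appear in the predicted call:

```python
import json

def kv_match_fraction(prediction: str, reference: str) -> float:
    """Illustrative only: fraction of reference parameters matched by both
    name and value, comparing the first tool call on each side."""
    pred_args = json.loads(prediction)["tool_calls"][0]["arguments"]
    ref_args = json.loads(reference)["tool_calls"][0]["arguments"]
    if not ref_args:
        return 1.0
    matched = sum(1 for k, v in ref_args.items() if pred_args.get(k) == v)
    return matched / len(ref_args)
```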

Examples

  • PROJECT_ID = PROJECT_ID
  • REGION = REGION

Pairwise Summarization Quality

Here we demonstrate how to call the rapid evaluation API to evaluate the output of an LLM. In this case, we make a pairwise summarization quality comparison.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances \
  -d '{
    "pairwise_summarization_quality_input": {
      "metric_spec": {},
      "instance": {
        "prediction": "France is a country located in Western Europe.",
        "baseline_prediction": "France is a country.",
        "instruction": "Summarize the context.",
        "context": "France is a country located in Western Europe. It'\''s bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra.  France'\''s coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea.  Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."
      }
    }
  }'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "France is a country located in Western Europe.",
      "baseline_prediction": "France is a country.",
      "instruction": "Summarize the context.",
      "context": (
          "France is a country located in Western Europe. It's bordered by "
          "Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, "
          "and Andorra.  France's coastline stretches along the English "
          "Channel, the North Sea, the Atlantic Ocean, and the Mediterranean "
          "Sea.  Known for its rich history, iconic landmarks like the Eiffel "
          "Tower, and delicious cuisine, France is a major cultural and "
          "economic power in Europe and throughout the world."
      ),
    }
  }
}

# Replace PROJECT_ID and REGION with your values.
PROJECT_ID = 'PROJECT_ID'
REGION = 'REGION'

uri = (f'https://{REGION}-aiplatform.googleapis.com/v1beta1/'
       f'projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances')
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))

ROUGE

Next, we will call the API to get the ROUGE score of a prediction, given a reference.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances \
  -d '{
    "rouge_input": {
      "metric_spec": {
        "rouge_type": "rougeLsum",
        "use_stemmer": true,
        "split_summaries": true
      },
      "instances": [
        {
          "prediction": "A fast brown fox leaps over a lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog."
        },
        {
          "prediction": "A quick brown fox jumps over the lazy canine.",
          "reference": "The quick brown fox jumps over the lazy dog."
        },
        {
          "prediction": "The speedy brown fox jumps over the lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog."
        }
      ]
    }
  }'

Python

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  "rouge_input": {
    "metric_spec": {
        "rouge_type": "rougeLsum",
        "use_stemmer": True,
        "split_summaries": True
    },
    "instances": [
        {
          "prediction": "A fast brown fox leaps over a lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog.",
        }, {
          "prediction": "A quick brown fox jumps over the lazy canine.",
          "reference": "The quick brown fox jumps over the lazy dog.",
        }, {
          "prediction": "The speedy brown fox jumps over the lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog.",
        }
    ]
  }
}

# Replace PROJECT_ID and REGION with your values.
PROJECT_ID = 'PROJECT_ID'
REGION = 'REGION'

uri = (f'https://{REGION}-aiplatform.googleapis.com/v1beta1/'
       f'projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances')
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))