This guide shows you how to use the Gen AI evaluation service API to evaluate your large language models (LLMs). It covers the following topics:
- Metric types: Understand the different categories of evaluation metrics available.
- Example syntax: See example curl and Python requests to the evaluation API.
- Parameter details: Learn about the specific parameters for each evaluation metric.
- Examples: View complete code samples for common evaluation tasks.
You can use the Gen AI evaluation service to evaluate your large language models (LLMs) across several metrics with your own criteria. You provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.
Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. The service accepts prediction results directly from models, which lets you perform both inference and evaluation on any model supported by Vertex AI.
For more information about evaluating a model, see Gen AI evaluation service overview.
Limitations
The evaluation service has the following limitations:
- The evaluation service might have a propagation delay on your first call.
- Most model-based metrics consume gemini-2.0-flash quota because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute them.
- Some model-based metrics, such as MetricX and COMET, use different machine learning models and don't consume gemini-2.0-flash quota.
Metric types
The Gen AI evaluation service API provides several categories of metrics to evaluate different aspects of your model's performance. The following table provides a high-level overview to help you choose the right metrics for your use case.
Metric Category | Description | Use Case |
---|---|---|
Lexical metrics (for example, `bleu`, `rouge`, `exact_match`) | These metrics compute scores based on the textual overlap between the model's prediction and a reference (ground truth) text. They are fast and objective. | Ideal for tasks with a clear "correct" answer, such as translation or fact-based question answering, where similarity to a reference is a good proxy for quality. |
Model-based pointwise metrics (for example, `fluency`, `safety`, `groundedness`, `summarization_quality`) | These metrics use a judge model to assess the quality of a single model response based on specific criteria (such as fluency or safety) without needing a reference answer. | Best for evaluating subjective qualities of generative text where there isn't a single correct answer, such as the creativity, coherence, or safety of a response. |
Model-based pairwise metrics (for example, `pairwise_summarization_quality`) | These metrics use a judge model to compare two model responses (for example, from a baseline and a candidate model) and determine which one is better. | Useful for A/B testing and directly comparing the performance of two different models or two versions of the same model on the same task. |
Tool use metrics (for example, `tool_call_valid`, `tool_name_match`) | These metrics evaluate the model's ability to correctly use tools (function calls) by checking for valid syntax, correct tool names, and accurate parameters. | Essential for evaluating models that are designed to interact with external APIs or systems through tool calling. |
Custom metrics (`pointwise_metric`, `pairwise_metric`) | These metrics provide a flexible framework to define your own evaluation criteria using a prompt template. The service then uses a judge model to evaluate responses based on your custom instructions. | For specialized evaluation tasks where predefined metrics are insufficient and you need to assess performance against unique, domain-specific requirements. |
Specialized metrics (`comet`, `metricx`) | These are highly specialized metrics designed for specific tasks, primarily machine translation quality. | Use for nuanced evaluation of machine translation tasks that goes beyond simple lexical matching. |
Example syntax
The following examples show the syntax for sending an evaluation request.
curl
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
    "pointwise_metric_input": {
      "metric_spec": {
        ...
      },
      "instance": {
        ...
      }
    }
  }'
```
Python
```
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'

result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
```
Parameter details
This section details the request and response objects for each evaluation metric.
Request body
The top-level request body contains one of the following metric input objects.
Parameter | Description |
---|---|
`exact_match_input` | Optional: Evaluates if the prediction exactly matches the reference. |
`bleu_input` | Optional: Computes the BLEU score by comparing the prediction to the reference. |
`rouge_input` | Optional: Computes ROUGE scores by comparing the prediction to the reference. The `rouge_type` field in the metric spec selects the ROUGE variant. |
`fluency_input` | Optional: Assesses the language fluency of a single response. |
`coherence_input` | Optional: Assesses the coherence of a single response. |
`safety_input` | Optional: Assesses the safety level of a single response. |
`groundedness_input` | Optional: Assesses if a response is grounded in the provided context. |
`fulfillment_input` | Optional: Assesses how well a response fulfills the given instructions. |
`summarization_quality_input` | Optional: Assesses the overall summarization quality of a response. |
`pairwise_summarization_quality_input` | Optional: Compares the summarization quality of two responses. |
`summarization_helpfulness_input` | Optional: Assesses if a summary is helpful and contains necessary details from the original text. |
`summarization_verbosity_input` | Optional: Assesses the verbosity of a summary. |
`question_answering_quality_input` | Optional: Assesses the overall quality of an answer to a question, based on a provided context. |
`pairwise_question_answering_quality_input` | Optional: Compares the quality of two answers to a question, based on a provided context. |
`question_answering_relevance_input` | Optional: Assesses the relevance of an answer to a question. |
`question_answering_helpfulness_input` | Optional: Assesses the helpfulness of an answer by checking for key details. |
`question_answering_correctness_input` | Optional: Assesses the correctness of an answer to a question. |
`pointwise_metric_input` | Optional: Input for a custom pointwise evaluation. |
`pairwise_metric_input` | Optional: Input for a custom pairwise evaluation. |
`tool_call_valid_input` | Optional: Assesses if the response predicts a valid tool call. |
`tool_name_match_input` | Optional: Assesses if the response predicts the correct tool name in a tool call. |
`tool_parameter_key_match_input` | Optional: Assesses if the response predicts the correct parameter names in a tool call. |
`tool_parameter_kv_match_input` | Optional: Assesses if the response predicts the correct parameter names and values in a tool call. |
`comet_input` | Optional: Input to evaluate using COMET. |
`metricx_input` | Optional: Input to evaluate using MetricX. |
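Because the request body carries exactly one of these metric input objects per call, a small helper can wrap the authorized-session pattern from the Example syntax section. The following sketch is illustrative only: the `evaluate_instances` helper name, the assumption that `PROJECT_ID` and `LOCATION` are already set, and the sample strings are ours, not part of the API.

```
import json

from google import auth
from google.auth.transport import requests as google_auth_requests


def evaluate_instances(payload: dict) -> dict:
    """Posts one metric input object to evaluateInstances and returns the parsed response."""
    # Assumes PROJECT_ID and LOCATION are already defined, as in the example above.
    creds, _ = auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    uri = (f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/'
           f'{PROJECT_ID}/locations/{LOCATION}:evaluateInstances')
    session = google_auth_requests.AuthorizedSession(creds)
    response = session.post(uri, json=payload)
    response.raise_for_status()
    return response.json()


# Example: a request body carrying a single exact_match_input object.
result = evaluate_instances({
    "exact_match_input": {
        "metric_spec": {},
        "instances": [{"prediction": "Paris", "reference": "Paris"}],
    }
})
print(json.dumps(result, indent=2))
```

Later examples on this page reuse this hypothetical helper so that the request bodies stay in focus.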
Exact Match (exact_match_input)
Input (ExactMatchInput)

```
{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      { "prediction": string, "reference": string }
    ]
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instances` | Optional: One or more evaluation instances, each containing an LLM response and a reference. |
`instances.prediction` | Optional: The LLM response. |
`instances.reference` | Optional: The ground truth or reference response. |

Output (ExactMatchResults)

```
{
  "exact_match_results": {
    "exact_match_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`exact_match_metric_values` | An array of evaluation results, one for each input instance. |
`exact_match_metric_values.score` | One of the following: `0`: The prediction is not an exact match. `1`: The prediction is an exact match. |
BLEU (bleu_input)
Input (BleuInput)

```
{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      { "prediction": string, "reference": string }
    ]
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`metric_spec.use_effective_order` | Optional: Specifies whether to consider n-gram orders that have no match. |
`instances` | Optional: One or more evaluation instances, each containing an LLM response and a reference. |
`instances.prediction` | Optional: The LLM response. |
`instances.reference` | Optional: The ground truth or reference response. |

Output (BleuResults)

```
{
  "bleu_results": {
    "bleu_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`bleu_metric_values` | An array of evaluation results, one for each input instance. |
`bleu_metric_values.score` | The BLEU score, in the range `[0, 1]`, where higher values indicate greater similarity between the prediction and the reference. |
ROUGE (rouge_input)
Input (RougeInput)

```
{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      { "prediction": string, "reference": string }
    ]
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`metric_spec.rouge_type` | Optional: The ROUGE variant to compute. Supported values include `rouge1` through `rouge9` (n-gram overlap), `rougeL` (longest common subsequence), and `rougeLsum` (longest common subsequence computed per sentence). |
`metric_spec.use_stemmer` | Optional: Specifies whether to use the Porter stemmer to strip word suffixes for better matching. |
`metric_spec.split_summaries` | Optional: Specifies whether to add newlines between sentences for `rougeLsum`. |
`instances` | Optional: One or more evaluation instances, each containing an LLM response and a reference. |
`instances.prediction` | Optional: The LLM response. |
`instances.reference` | Optional: The ground truth or reference response. |

Output (RougeResults)

```
{
  "rouge_results": {
    "rouge_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`rouge_metric_values` | An array of evaluation results, one for each input instance. |
`rouge_metric_values.score` | The ROUGE score, in the range `[0, 1]`, where higher values indicate greater similarity between the prediction and the reference. |
Fluency (fluency_input)
Input (FluencyInput)

```
{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response. |
`instance.prediction` | Optional: The LLM response. |

Output (FluencyResult)

```
{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The fluency score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Coherence (coherence_input)
Input (CoherenceInput)

```
{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response. |
`instance.prediction` | Optional: The LLM response. |

Output (CoherenceResult)

```
{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The coherence score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Safety (safety_input)
Input (SafetyInput)

```
{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response. |
`instance.prediction` | Optional: The LLM response. |

Output (SafetyResult)

```
{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The safety score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Groundedness (groundedness_input)
Input (GroundednessInput)

```
{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: GroundednessSpec. Specifies the metric's behavior. |
`instance` | Optional: GroundednessInstance. The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (GroundednessResult)

```
{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The groundedness score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
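For example, a groundedness check can be issued with the hypothetical `evaluate_instances` helper sketched in the Request body section; the helper name and the sample strings are illustrative assumptions, not part of the API.

```
# Illustrative only: check whether a response is grounded in its context,
# using the hypothetical evaluate_instances helper sketched earlier.
result = evaluate_instances({
    "groundedness_input": {
        "metric_spec": {},
        "instance": {
            "context": "The Eiffel Tower is 330 metres tall and is located in Paris.",
            "prediction": "The Eiffel Tower is located in Paris.",
        },
    }
})
print(result["groundedness_result"]["score"],
      result["groundedness_result"]["explanation"])
```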
Fulfillment (fulfillment_input)
Input (FulfillmentInput)

```
{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |

Output (FulfillmentResult)

```
{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The fulfillment score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Summarization Quality (summarization_quality_input)
Input (SummarizationQualityInput)

```
{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (SummarizationQualityResult)

```
{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The summarization quality score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Pairwise Summarization Quality (pairwise_summarization_quality_input)
Input (PairwiseSummarizationQualityInput)

```
{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding responses. |
`instance.baseline_prediction` | Optional: The baseline model's LLM response. |
`instance.prediction` | Optional: The candidate model's LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (PairwiseSummarizationQualityResult)

```
{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`pairwise_choice` | The winning response: the candidate prediction, the baseline prediction, or a tie. |
`explanation` | The judge model's justification for the choice. |
`confidence` | The judge model's confidence in the choice. |
Summarization Helpfulness (summarization_helpfulness_input)
Input (SummarizationHelpfulnessInput)

```
{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (SummarizationHelpfulnessResult)

```
{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The summarization helpfulness score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Summarization Verbosity (summarization_verbosity_input)
Input (SummarizationVerbosityInput)

```
{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (SummarizationVerbosityResult)

```
{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The summarization verbosity score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Question Answering Quality (question_answering_quality_input)
Input (QuestionAnsweringQualityInput)

```
{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (QuestionAnsweringQualityResult)

```
{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The question answering quality score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Pairwise Question Answering Quality (pairwise_question_answering_quality_input)
Input (PairwiseQuestionAnsweringQualityInput)

```
{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding responses. |
`instance.baseline_prediction` | Optional: The baseline model's LLM response. |
`instance.prediction` | Optional: The candidate model's LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (PairwiseQuestionAnsweringQualityResult)

```
{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`pairwise_choice` | The winning response: the candidate prediction, the baseline prediction, or a tie. |
`explanation` | The judge model's justification for the choice. |
`confidence` | The judge model's confidence in the choice. |
Question Answering Relevance (question_answering_relevance_input)
Input (QuestionAnsweringRelevanceInput)

```
{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (QuestionAnsweringRelevanceResult)

```
{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The question answering relevance score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Question Answering Helpfulness (question_answering_helpfulness_input)
Input (QuestionAnsweringHelpfulnessInput)

```
{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (QuestionAnsweringHelpfulnessResult)

```
{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The question answering helpfulness score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Question Answering Correctness (question_answering_correctness_input)
Input (QuestionAnsweringCorrectnessInput)

```
{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`metric_spec.use_reference` | Optional: Specifies if the reference is used in the evaluation. |
`instance` | Optional: The evaluation input, which consists of inference inputs and the corresponding response. |
`instance.prediction` | Optional: The LLM response. |
`instance.reference` | Optional: The ground truth or reference response. |
`instance.instruction` | Optional: The instruction provided at inference time. |
`instance.context` | Optional: The context provided at inference time that the LLM response can use. |

Output (QuestionAnsweringCorrectnessResult)

```
{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}
```

Output | Description |
---|---|
`score` | The question answering correctness score. |
`explanation` | The judge model's justification for the score. |
`confidence` | The judge model's confidence in the score. |
Custom Pointwise (pointwise_metric_input)
Input (PointwiseMetricInput)

```
{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Required: Specifies the metric's behavior. |
`metric_spec.metric_prompt_template` | Required: A prompt template that defines the metric. The template is rendered using the key-value pairs in `instance.json_instance`. |
`instance` | Required: The evaluation input, which consists of a `json_instance`. |
`instance.json_instance` | Optional: A JSON string of key-value pairs (for example, `{"key_1": "value_1", "key_2": "value_2"}`) that is used to render `metric_prompt_template`. |

Output (PointwiseMetricResult)

```
{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}
```

Output | Description |
---|---|
`score` | The score for the custom pointwise metric. |
`explanation` | The judge model's justification for the score. |
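As an illustration, the following sketch defines a simple custom "conciseness" metric with a prompt template and evaluates one response through the hypothetical `evaluate_instances` helper from the Request body section. The template text and the `response` key inside `json_instance` are examples you choose yourself, not values required by the API.

```
import json

# Illustrative custom pointwise metric: the {response} placeholder is rendered
# from the json_instance key of the same name; both names are user-chosen.
metric_prompt_template = (
    "You are an evaluator. Rate the following response for conciseness "
    "on a scale of 1 to 5 and explain your rating.\n\nResponse: {response}"
)

result = evaluate_instances({
    "pointwise_metric_input": {
        "metric_spec": {"metric_prompt_template": metric_prompt_template},
        "instance": {
            "json_instance": json.dumps({
                "response": "Our store opens at 9 AM and closes at 9 PM, every day."
            })
        },
    }
})
print(result["pointwise_metric_result"])
```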
Custom Pairwise (pairwise_metric_input)
Input (PairwiseMetricInput)

```
{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Required: Specifies the metric's behavior. |
`metric_spec.metric_prompt_template` | Required: A prompt template that defines the metric. The template is rendered using the key-value pairs in `instance.json_instance`. |
`instance` | Required: The evaluation input, which consists of a `json_instance`. |
`instance.json_instance` | Optional: A JSON string of key-value pairs (for example, `{"key_1": "value_1", "key_2": "value_2"}`) that is used to render `metric_prompt_template`. |

Output (PairwiseMetricResult)

```
{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}
```

Output | Description |
---|---|
`score` | The score for the custom pairwise metric. |
`explanation` | The judge model's justification for the score. |
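A pairwise variant works the same way: the template references two user-chosen keys from `json_instance`, one for each response being compared. The sketch below reuses the hypothetical `evaluate_instances` helper; the key names `baseline_response` and `candidate_response` are illustrative assumptions.

```
import json

# Illustrative custom pairwise metric comparing two responses.
metric_prompt_template = (
    "Compare the two responses below and decide which one answers the "
    "user's question more helpfully. Explain your choice.\n\n"
    "Baseline response: {baseline_response}\n"
    "Candidate response: {candidate_response}"
)

result = evaluate_instances({
    "pairwise_metric_input": {
        "metric_spec": {"metric_prompt_template": metric_prompt_template},
        "instance": {
            "json_instance": json.dumps({
                "baseline_response": "The store is open.",
                "candidate_response": "The store is open from 9 AM to 9 PM daily.",
            })
        },
    }
})
print(result["pairwise_metric_result"])
```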
Tool Call Valid (tool_call_valid_input)
Input (ToolCallValidInput)

```
{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response and a reference. |
`instance.prediction` | Optional: The candidate model's response. This must be a JSON-serialized string that contains `content` and `tool_calls` keys, for example: `{"content": "", "tool_calls": [{"name": "book_tickets", "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2"}}]}` |
`instance.reference` | Optional: The ground truth or reference response, in the same format as `prediction`. |

Output (ToolCallValidResults)

```
{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`tool_call_valid_metric_values` | An array of evaluation results, one for each input instance. |
`tool_call_valid_metric_values.score` | One of the following: `0`: The tool call is invalid. `1`: The tool call is valid. |
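Because `prediction` and `reference` are JSON-serialized strings, they are typically built with `json.dumps`. The sketch below uses the hypothetical `evaluate_instances` helper from the Request body section; the tool call contents are illustrative.

```
import json

# Illustrative only: both prediction and reference are JSON-serialized
# strings containing "content" and "tool_calls" keys.
tool_call = {
    "content": "",
    "tool_calls": [{
        "name": "book_tickets",
        "arguments": {
            "movie": "Mission Impossible Dead Reckoning Part 1",
            "num_tix": "2",
        },
    }],
}

result = evaluate_instances({
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {
            "prediction": json.dumps(tool_call),
            "reference": json.dumps(tool_call),
        },
    }
})
print(result["tool_call_valid_results"])
```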
Tool Name Match (tool_name_match_input)
Input (ToolNameMatchInput)

```
{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response and a reference. |
`instance.prediction` | Optional: The candidate model's response. This must be a JSON-serialized string that contains `content` and `tool_calls` keys, in the same format shown for `tool_call_valid_input`. |
`instance.reference` | Optional: The ground truth or reference response, in the same format as `prediction`. |

Output (ToolNameMatchResults)

```
{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`tool_name_match_metric_values` | An array of evaluation results, one for each input instance. |
`tool_name_match_metric_values.score` | One of the following: `0`: The predicted tool call name doesn't match the reference. `1`: The predicted tool call name matches the reference. |
Tool Parameter Key Match (tool_parameter_key_match_input)
Input (ToolParameterKeyMatchInput)

```
{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response and a reference. |
`instance.prediction` | Optional: The candidate model's response. This must be a JSON-serialized string that contains `content` and `tool_calls` keys, in the same format shown for `tool_call_valid_input`. |
`instance.reference` | Optional: The ground truth or reference response, in the same format as `prediction`. |

Output (ToolParameterKeyMatchResults)

```
{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`tool_parameter_key_match_metric_values` | An array of evaluation results, one for each input instance. |
`tool_parameter_key_match_metric_values.score` | A score in the range `[0, 1]`, where higher values mean that more of the predicted parameter names match the reference parameter names. |
Tool Parameter KV Match (tool_parameter_kv_match_input)
Input (ToolParameterKVMatchInput)

```
{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`instance` | Optional: The evaluation input, which consists of an LLM response and a reference. |
`instance.prediction` | Optional: The candidate model's response. This must be a JSON-serialized string that contains `content` and `tool_calls` keys, in the same format shown for `tool_call_valid_input`. |
`instance.reference` | Optional: The ground truth or reference response, in the same format as `prediction`. |

Output (ToolParameterKVMatchResults)

```
{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      { "score": float }
    ]
  }
}
```

Output | Description |
---|---|
`tool_parameter_kv_match_metric_values` | An array of evaluation results, one for each input instance. |
`tool_parameter_kv_match_metric_values.score` | A score in the range `[0, 1]`, where higher values mean that more of the predicted parameter names and values match the reference. |
COMET (comet_input)
Input (CometInput)

```
{
  "comet_input": {
    "metric_spec": {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`metric_spec.version` | Optional: The COMET model version to use for scoring. |
`metric_spec.source_language` | Optional: The source language in BCP-47 format. For example, "es". |
`metric_spec.target_language` | Optional: The target language in BCP-47 format. For example, "es". |
`instance` | Optional: The evaluation input. The exact fields used for evaluation depend on the COMET version. |
`instance.prediction` | Optional: The candidate model's response, which is the translated text to be evaluated. |
`instance.source` | Optional: The source text in the original language, before translation. |
`instance.reference` | Optional: The ground truth or reference translation, in the same language as the prediction. |

Output (CometResult)

```
{
  "comet_result": {
    "score": float
  }
}
```

Output | Description |
---|---|
`score` | The COMET score, in the range `[0, 1]`, where higher values indicate better translation quality. |
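The following sketch scores a single translation with COMET through the hypothetical `evaluate_instances` helper from the Request body section. Leaving `metric_spec` empty so that the service's default COMET version applies is an assumption on our part, as are the sample strings.

```
# Illustrative only: score one Spanish-to-English translation with COMET.
result = evaluate_instances({
    "comet_input": {
        "metric_spec": {},  # assumed: omit "version" to use the service default
        "instance": {
            "source": "La tienda abre a las nueve de la mañana.",
            "prediction": "The store opens at nine in the morning.",
            "reference": "The shop opens at 9 AM.",
        },
    }
})
print(result["comet_result"]["score"])
```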
MetricX (metricx_input)
Input (MetricxInput)

```
{
  "metricx_input": {
    "metric_spec": {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}
```

Parameter | Description |
---|---|
`metric_spec` | Optional: Specifies the metric's behavior. |
`metric_spec.version` | Optional: The MetricX model version to use for scoring, one of the supported MetricX versions. |
`metric_spec.source_language` | Optional: The source language in BCP-47 format. For example, "es". |
`metric_spec.target_language` | Optional: The target language in BCP-47 format. For example, "es". |
`instance` | Optional: The evaluation input. The exact fields used for evaluation depend on the MetricX version. |
`instance.prediction` | Optional: The candidate model's response, which is the translated text to be evaluated. |
`instance.source` | Optional: The source text in the original language that the prediction was translated from. |
`instance.reference` | Optional: The ground truth used to compare against the prediction. It is in the same language as the prediction. |

Output (MetricxResult)

```
{
  "metricx_result": {
    "score": float
  }
}
```

Output | Description |
---|---|
`score` | The MetricX score, where lower values indicate better translation quality. |
Examples
Evaluate multiple metrics in one call
The following example shows how to call the Gen AI evaluation service API to evaluate the output of an LLM using a variety of evaluation metrics, including `summarization_quality`, `groundedness`, `fulfillment`, `summarization_helpfulness`, and `summarization_verbosity`.
Python
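A minimal sketch of this flow using the REST endpoint is shown below. Because each evaluateInstances request carries exactly one metric input object, the sketch issues one request per metric. The hypothetical `evaluate_instances` helper from the Request body section, and the sample strings, are illustrative assumptions rather than the official SDK sample.

```
# Illustrative only: evaluate one summary against several model-based metrics,
# sending one evaluateInstances request per metric.
instruction = "Summarize the text in one sentence."
context = (
    "To make a classic margherita pizza, spread tomato sauce on the dough, "
    "add fresh mozzarella and basil, and bake in a very hot oven until the "
    "crust is golden."
)
prediction = (
    "A margherita pizza is made by topping dough with tomato sauce, "
    "mozzarella, and basil, then baking it in a hot oven."
)

requests_by_metric = {
    "summarization_quality_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "instruction": instruction,
                     "context": context},
    },
    "groundedness_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "context": context},
    },
    "fulfillment_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "instruction": instruction},
    },
    "summarization_helpfulness_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "instruction": instruction,
                     "context": context},
    },
    "summarization_verbosity_input": {
        "metric_spec": {},
        "instance": {"prediction": prediction, "instruction": instruction,
                     "context": context},
    },
}

for metric_input, payload in requests_by_metric.items():
    result = evaluate_instances({metric_input: payload})
    print(metric_input, "->", result)
```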
Go
Evaluate pairwise summarization quality
The following example shows how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region to process the request.
- PREDICTION: The LLM response.
- BASELINE_PREDICTION: The baseline model's LLM response.
- INSTRUCTION: The instruction used at inference time.
- CONTEXT: Inference-time text containing all of the relevant information that can be used in the LLM response.
HTTP method and URL:

```
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
```

Request JSON body:

```
{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
```
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"
```

PowerShell
Save the request body in a file named request.json, and execute the following command:

```
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content
```
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
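A hedged sketch of the same pairwise comparison using the REST endpoint and the hypothetical `evaluate_instances` helper from the Request body section (the sample strings are illustrative, not the official SDK sample):

```
# Illustrative only: compare a candidate summary against a baseline summary.
result = evaluate_instances({
    "pairwise_summarization_quality_input": {
        "metric_spec": {},
        "instance": {
            "instruction": "Summarize the text in one sentence.",
            "context": "The quarterly report shows revenue grew 12% while "
                       "costs stayed flat, driven by strong subscription sales.",
            "baseline_prediction": "Revenue grew.",
            "prediction": "Revenue grew 12% on strong subscription sales "
                          "while costs stayed flat.",
        },
    }
})
print(result["pairwise_summarization_quality_result"]["pairwise_choice"])
```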
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Evaluate a ROUGE score
The following example shows how to call the Gen AI evaluation service API to get the ROUGE score for a prediction. The request uses `metric_spec` to configure the metric's behavior.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region to process the request.
- PREDICTION: The LLM response.
- REFERENCE: The golden LLM response to use as a reference.
- ROUGE_TYPE: The calculation used to determine the ROUGE score. See `metric_spec.rouge_type` for acceptable values.
- USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see `metric_spec.use_stemmer`.
- SPLIT_SUMMARIES: Determines whether new lines are added between `rougeLsum` sentences. For acceptable values, see `metric_spec.split_summaries`.
HTTP method and URL:

```
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
```

Request JSON body:

```
{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
```
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"
```

PowerShell
Save the request body in a file named request.json, and execute the following command:

```
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
  -Method POST `
  -Headers $headers `
  -ContentType: "application/json; charset=utf-8" `
  -InFile request.json `
  -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content
```
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
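A hedged sketch using the hypothetical `evaluate_instances` helper from the Request body section (the sample strings and metric spec values are illustrative, not the official SDK sample):

```
# Illustrative only: compute rougeLsum for one prediction/reference pair.
result = evaluate_instances({
    "rouge_input": {
        "metric_spec": {
            "rouge_type": "rougeLsum",
            "use_stemmer": True,
            "split_summaries": True,
        },
        "instances": [
            {
                "prediction": "The store opens at nine and closes at nine.",
                "reference": "The store is open from 9 AM to 9 PM.",
            }
        ],
    }
})
print(result["rouge_results"]["rouge_metric_values"][0]["score"])
```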
Go
Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
What's next
- Learn how to Run an evaluation.