The Gen AI evaluation service lets you evaluate your large language models (LLMs), both pointwise and pairwise, across several metrics and with your own criteria. You can provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, it can perform both inference and subsequent evaluation on all models supported by Vertex AI.
For more information on evaluating a model, see Gen AI evaluation service overview.
Limitations
The following are limitations of the evaluation service:
- Model-based metrics consume gemini-1.5-pro quota. The Gen AI evaluation service uses gemini-1.5-pro as the underlying judge model to compute model-based metrics.
- The evaluation service might have a propagation delay on your first call.
Example syntax
Syntax to send an evaluation call.
curl
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances \
  -d '{
    "contents": [{
      ...
    }],
    "tools": [{
      "function_declarations": [
        {
          ...
        }
      ]
    }]
  }'
Python
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

uri = f'https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
Parameter list
Parameters | |
---|---|
exact_match_input | Optional: ExactMatchInput. Input to assess if the prediction matches the reference exactly. |
bleu_input | Optional: BleuInput. Input to compute BLEU score by comparing the prediction against the reference. |
rouge_input | Optional: RougeInput. Input to compute ROUGE scores by comparing the prediction against the reference. |
fluency_input | Optional: FluencyInput. Input to assess a single response's language mastery. |
coherence_input | Optional: CoherenceInput. Input to assess a single response's ability to provide a coherent, easy-to-follow reply. |
safety_input | Optional: SafetyInput. Input to assess a single response's level of safety. |
groundedness_input | Optional: GroundednessInput. Input to assess a single response's ability to provide or reference information included only in the input text. |
fulfillment_input | Optional: FulfillmentInput. Input to assess a single response's ability to completely fulfill instructions. |
summarization_quality_input | Optional: SummarizationQualityInput. Input to assess a single response's overall ability to summarize text. |
pairwise_summarization_quality_input | Optional: PairwiseSummarizationQualityInput. Input to compare two responses' overall summarization quality. |
summarization_helpfulness_input | Optional: SummarizationHelpfulnessInput. Input to assess a single response's ability to provide a summarization that contains the details necessary to substitute for the original text. |
summarization_verbosity_input | Optional: SummarizationVerbosityInput. Input to assess a single response's ability to provide a succinct summarization. |
question_answering_quality_input | Optional: QuestionAnsweringQualityInput. Input to assess a single response's overall ability to answer questions, given a body of text to reference. |
pairwise_question_answering_quality_input | Optional: PairwiseQuestionAnsweringQualityInput. Input to compare two responses' overall ability to answer questions, given a body of text to reference. |
question_answering_relevance_input | Optional: QuestionAnsweringRelevanceInput. Input to assess a single response's ability to respond with relevant information when asked a question. |
question_answering_helpfulness_input | Optional: QuestionAnsweringHelpfulnessInput. Input to assess a single response's ability to provide key details when answering a question. |
question_answering_correctness_input | Optional: QuestionAnsweringCorrectnessInput. Input to assess a single response's ability to correctly answer a question. |
pointwise_metric_input | Optional: PointwiseMetricInput. Input for a generic pointwise evaluation. |
pairwise_metric_input | Optional: PairwiseMetricInput. Input for a generic pairwise evaluation. |
tool_call_valid_input | Optional: ToolCallValidInput. Input to assess a single response's ability to predict a valid tool call. |
tool_name_match_input | Optional: ToolNameMatchInput. Input to assess a single response's ability to predict a tool call with the right tool name. |
tool_parameter_key_match_input | Optional: ToolParameterKeyMatchInput. Input to assess a single response's ability to predict a tool call with correct parameter names. |
tool_parameter_kv_match_input | Optional: ToolParameterKVMatchInput. Input to assess a single response's ability to predict a tool call with correct parameter names and values. |
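Each evaluateInstances request carries one of the metric inputs above. As a minimal sketch (the prediction and reference strings are made-up values), a request body for the exact match metric can be built as a Python dict:

# Sketch of a request body carrying a single metric input (exact_match_input).
# The prediction/reference values are illustrative only.
request_body = {
    "exact_match_input": {
        "metric_spec": {},
        "instances": [
            {"prediction": "Paris", "reference": "Paris"},
            {"prediction": "Lyon", "reference": "Paris"},
        ],
    }
}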
ExactMatchInput
{ "exact_match_input": { "metric_spec": {}, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instances | Optional: Evaluation input, consisting of LLM response and reference. |
instances.prediction | Optional: LLM response. |
instances.reference | Optional: Golden LLM response for reference. |
ExactMatchResults
{ "exact_match_results": { "exact_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
exact_match_metric_values | Evaluation results per instance input. |
exact_match_metric_values.score | float: One of the following: 0, if the prediction does not match the reference exactly, or 1, if it does. |
BleuInput
{ "bleu_input": { "metric_spec": { "use_effective_order": bool }, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
metric_spec.use_effective_order | Optional: Whether to take into account n-gram orders without any match. |
instances | Optional: Evaluation input, consisting of LLM response and reference. |
instances.prediction | Optional: LLM response. |
instances.reference | Optional: Golden LLM response for reference. |
BleuResults
{ "bleu_results": { "bleu_metric_values": [ { "score": float } ] } }
Output | |
---|---|
bleu_metric_values | Evaluation results per instance input. |
bleu_metric_values.score | float: [0, 1], where higher scores mean the prediction is more similar to the reference. |
RougeInput
{ "rouge_input": { "metric_spec": { "rouge_type": string, "use_stemmer": bool, "split_summaries": bool }, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
metric_spec.rouge_type | Optional: The type of ROUGE score to compute, for example rougeLsum. |
metric_spec.use_stemmer | Optional: Whether the Porter stemmer should be used to strip word suffixes to improve matching. |
metric_spec.split_summaries | Optional: Whether to add newlines between sentences for rougeLsum. |
instances | Optional: Evaluation input, consisting of LLM response and reference. |
instances.prediction | Optional: LLM response. |
instances.reference | Optional: Golden LLM response for reference. |
RougeResults
{ "rouge_results": { "rouge_metric_values": [ { "score": float } ] } }
Output | |
---|---|
rouge_metric_values | Evaluation results per instance input. |
rouge_metric_values.score | float: [0, 1], where higher scores mean the prediction is more similar to the reference. |
FluencyInput
{ "fluency_input": { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response. |
instance.prediction | Optional: LLM response. |
FluencyResult
{ "fluency_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
CoherenceInput
{ "coherence_input": { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response. |
instance.prediction | Optional: LLM response. |
CoherenceResult
{ "coherence_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The coherence score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
SafetyInput
{ "safety_input": { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response. |
instance.prediction | Optional: LLM response. |
SafetyResult
{ "safety_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The safety score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
GroundednessInput
{ "groundedness_input": { "metric_spec": {}, "instance": { "prediction": string, "context": string } } }
Parameter | Description |
---|---|
metric_spec | Optional: GroundednessSpec. Metric spec, defining the metric's behavior. |
instance | Optional: GroundednessInstance. Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
GroundednessResult
{ "groundedness_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The groundedness score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
FulfillmentInput
{ "fulfillment_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
FulfillmentResult
{ "fulfillment_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The fulfillment score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
SummarizationQualityInput
{ "summarization_quality_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationQualityResult
{ "summarization_quality_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The summarization quality score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
PairwiseSummarizationQualityInput
{ "pairwise_summarization_quality_input": { "metric_spec": {}, "instance": { "baseline_prediction": string, "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.baseline_prediction | Optional: Baseline model LLM response. |
instance.prediction | Optional: Candidate model LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseSummarizationQualityResult
{ "pairwise_summarization_quality_result": { "pairwise_choice": PairwiseChoice, "explanation": string, "confidence": float } }
Output | |
---|---|
pairwise_choice | PairwiseChoice: The choice between the baseline and candidate predictions. |
explanation | string: The judge model's justification for the choice. |
confidence | float: The judge model's confidence in the choice. |
SummarizationHelpfulnessInput
{ "summarization_helpfulness_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationHelpfulnessResult
{ "summarization_helpfulness_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The summarization helpfulness score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
SummarizationVerbosityInput
{ "summarization_verbosity_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationVerbosityResult
{ "summarization_verbosity_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The summarization verbosity score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
QuestionAnsweringQualityInput
{ "question_answering_quality_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringQualityResult
{ "question_answering_quality_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The question answering quality score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
PairwiseQuestionAnsweringQualityInput
{ "question_answering_quality_input": { "metric_spec": {}, "instance": { "baseline_prediction": string, "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.baseline_prediction | Optional: Baseline model LLM response. |
instance.prediction | Optional: Candidate model LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseQuestionAnsweringQualityResult
{ "pairwise_question_answering_quality_result": { "pairwise_choice": PairwiseChoice, "explanation": string, "confidence": float } }
Output | |
---|---|
pairwise_choice | PairwiseChoice: The choice between the baseline and candidate predictions. |
explanation | string: The judge model's justification for the choice. |
confidence | float: The judge model's confidence in the choice. |
QuestionAnsweringRelevanceInput
{ "question_answering_quality_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringRelevancyResult
{ "question_answering_relevancy_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The question answering relevance score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
QuestionAnsweringHelpfulnessInput
{ "question_answering_helpfulness_input": { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringHelpfulnessResult
{ "question_answering_helpfulness_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The question answering helpfulness score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
QuestionAnsweringCorrectnessInput
{ "question_answering_correctness_input": { "metric_spec": { "use_reference": bool }, "instance": { "prediction": string, "reference": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
metric_spec.use_reference | Optional: Whether a reference is used in the evaluation. |
instance | Optional: Evaluation input, consisting of inference inputs and corresponding response. |
instance.prediction | Optional: LLM response. |
instance.reference | Optional: Golden LLM response for reference. |
instance.instruction | Optional: Instruction used at inference time. |
instance.context | Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringCorrectnessResult
{ "question_answering_correctness_result": { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
score | float: The question answering correctness score for the response. |
explanation | string: The judge model's justification for the score. |
confidence | float: The judge model's confidence in the score. |
PointwiseMetricInput
{ "pointwise_metric_input": { "metric_spec": { "metric_prompt_template": string }, "instance": { "json_instance": string, } } }
Parameters | |
---|---|
metric_spec | Required: Metric spec, defining the metric's behavior. |
metric_spec.metric_prompt_template | Required: A prompt template defining the metric. It is rendered by the key-value pairs in instance.json_instance. |
instance | Required: Evaluation input, consisting of json_instance. |
instance.json_instance | Optional: Key-value pairs in JSON format, for example {"key_1": "value_1", "key_2": "value_2"}. Used to render metric_spec.metric_prompt_template. |
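To illustrate how metric_spec.metric_prompt_template and instance.json_instance fit together, the following sketch builds a pointwise_metric_input request body for a made-up politeness metric. The template text and the response key are assumptions for this example; the {response} placeholder in the template is rendered from the matching key in json_instance.

import json

# Hypothetical custom pointwise metric; the template text and the "response"
# key are illustrative, not a predefined template.
pointwise_request = {
    "pointwise_metric_input": {
        "metric_spec": {
            "metric_prompt_template": (
                "Rate the following response for politeness on a scale of 1 to 5, "
                "then explain your rating.\n\nResponse: {response}"
            )
        },
        "instance": {
            # json_instance is a JSON-encoded string of the key-value pairs
            # used to render the template above.
            "json_instance": json.dumps(
                {"response": "Sure, happy to help with that!"}
            )
        },
    }
}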
PointwiseMetricResult
{ "pointwise_metric_result": { "score": float, "explanation": string, } }
Output | |
---|---|
score | float: The score for the pointwise metric. |
explanation | string: The judge model's justification for the score. |
PairwiseMetricInput
{ "pairwise_metric_input": { "metric_spec": { "metric_prompt_template": string }, "instance": { "json_instance": string, } } }
Parameters | |
---|---|
metric_spec | Required: Metric spec, defining the metric's behavior. |
metric_spec.metric_prompt_template | Required: A prompt template defining the metric. It is rendered by the key-value pairs in instance.json_instance. |
instance | Required: Evaluation input, consisting of json_instance. |
instance.json_instance | Optional: Key-value pairs in JSON format, for example {"key_1": "value_1", "key_2": "value_2"}. Used to render metric_spec.metric_prompt_template. |
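For the pairwise case, the template typically references both responses through keys you define in json_instance. The key names below (baseline_response and response) are chosen for this sketch only and must match the placeholders used in the template.

import json

# Hypothetical pairwise metric comparing a baseline and a candidate response.
pairwise_request = {
    "pairwise_metric_input": {
        "metric_spec": {
            "metric_prompt_template": (
                "Which answer is more helpful?\n"
                "Baseline: {baseline_response}\n"
                "Candidate: {response}"
            )
        },
        "instance": {
            "json_instance": json.dumps({
                "baseline_response": "The capital of France is Paris.",
                "response": "Paris is the capital of France.",
            })
        },
    }
}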
PairwiseMetricResult
{ "pairwise_metric_result": { "score": float, "explanation": string, } }
Output | |
---|---|
score | float: The score for the pairwise metric. |
explanation | string: The judge model's justification for the score. |
ToolCallValidInput
{ "tool_call_valid_input": { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response and reference. |
instance.prediction | Optional: Candidate model LLM response, a JSON serialized string that contains the content and tool_calls keys. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] } |
instance.reference | Optional: Golden model output in the same format as the prediction. |
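Because prediction and reference are JSON serialized strings rather than nested objects, it is usually easiest to build them with json.dumps. A minimal sketch follows; the get_weather tool and its arguments are made up for this example.

import json

# Both fields are JSON strings containing "content" and "tool_calls".
# The get_weather tool here is a made-up example.
tool_call = {
    "content": "",
    "tool_calls": [
        {"name": "get_weather", "arguments": {"location": "Boston, MA"}}
    ],
}

tool_call_valid_request = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {
            "prediction": json.dumps(tool_call),
            "reference": json.dumps(tool_call),
        },
    }
}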
ToolCallValidResults
{ "tool_call_valid_results": { "tool_call_valid_metric_values": [ { "score": float } ] } }
Output | |
---|---|
tool_call_valid_metric_values | repeated: Evaluation results per instance input. |
tool_call_valid_metric_values.score | float: The tool call validity score. |
ToolNameMatchInput
{ "tool_name_match_input": { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response and reference. |
instance.prediction | Optional: Candidate model LLM response, a JSON serialized string that contains the content and tool_calls keys. |
instance.reference | Optional: Golden model output in the same format as the prediction. |
ToolNameMatchResults
{ "tool_name_match_results": { "tool_name_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
tool_name_match_metric_values | repeated: Evaluation results per instance input. |
tool_name_match_metric_values.score | float: The tool name match score. |
ToolParameterKeyMatchInput
{ "tool_parameter_key_match_input": { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response and reference. |
instance.prediction | Optional: Candidate model LLM response, a JSON serialized string that contains the content and tool_calls keys. |
instance.reference | Optional: Golden model output in the same format as the prediction. |
ToolParameterKeyMatchResults
{ "tool_parameter_key_match_results": { "tool_parameter_key_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
tool_parameter_key_match_metric_values | repeated: Evaluation results per instance input. |
tool_parameter_key_match_metric_values.score | float: The tool parameter key match score. |
ToolParameterKVMatchInput
{ "tool_parameter_kv_match_input": { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
metric_spec | Optional: Metric spec, defining the metric's behavior. |
instance | Optional: Evaluation input, consisting of LLM response and reference. |
instance.prediction | Optional: Candidate model LLM response, a JSON serialized string that contains the content and tool_calls keys. |
instance.reference | Optional: Golden model output in the same format as the prediction. |
ToolParameterKVMatchResults
{ "tool_parameter_kv_match_results": { "tool_parameter_kv_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
tool_parameter_kv_match_metric_values | repeated: Evaluation results per instance input. |
tool_parameter_kv_match_metric_values.score | float: The tool parameter key-value match score. |
Examples
Evaluate an output
The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following:
- summarization_quality
- groundedness
- fulfillment
- summarization_helpfulness
- summarization_verbosity
Python
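The following is a minimal sketch that follows the REST pattern from the example syntax section. It requests only summarization_quality for a single instance; the project, location, and instance values are placeholders, and the other metrics listed above follow the same shape with their own input fields.

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

PROJECT_ID = "your-project-id"  # placeholder
LOCATION = "us-central1"        # placeholder

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

# Request the summarization_quality metric for one instance.
data = {
    "summarization_quality_input": {
        "metric_spec": {},
        "instance": {
            "prediction": "The report says revenue grew 10% last quarter.",
            "instruction": "Summarize the report in one sentence.",
            "context": "Full text of the quarterly report...",
        },
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
resp = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
print(json.dumps(resp.json(), indent=2))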
Evaluate an output: pairwise summarization quality
The following example demonstrates how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request.
- PREDICTION: LLM response.
- BASELINE_PREDICTION: Baseline model LLM response.
- INSTRUCTION: The instruction used at inference time.
- CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
Request JSON body:
{ "pairwise_summarization_quality_input": { "metric_spec": {}, "instance": { "prediction": "PREDICTION", "baseline_prediction": "BASELINE_PREDICTION", "instruction": "INSTRUCTION", "context": "CONTEXT", } } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/LOCATION:evaluateInstances \" | Select-Object -Expand Content
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
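The following is a minimal REST-based sketch of the same pairwise request, using the authorized-session pattern from the example syntax section; the project, location, and text values are placeholders.

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

PROJECT_ID = "your-project-id"  # placeholder
LOCATION = "us-central1"        # placeholder

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

# Compare a candidate summary against a baseline summary.
data = {
    "pairwise_summarization_quality_input": {
        "metric_spec": {},
        "instance": {
            "prediction": "Candidate summary text...",
            "baseline_prediction": "Baseline summary text...",
            "instruction": "Summarize the article in two sentences.",
            "context": "Full article text...",
        },
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
resp = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

# The response contains pairwise_summarization_quality_result with
# pairwise_choice, explanation, and confidence.
print(json.dumps(resp.json(), indent=2))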
Get ROUGE score
The following example calls the Gen AI evaluation service API to get the ROUGE score of a prediction, generated by a number of inputs. The ROUGE inputs use metric_spec, which determines the metric's behavior.
REST
Before using any of the request data, make the following replacements:
- PROJECT_ID: Your project ID.
- LOCATION: The region to process the request.
- PREDICTION: LLM response.
- REFERENCE: Golden LLM response for reference.
- ROUGE_TYPE: The calculation used to determine the ROUGE score. See metric_spec.rouge_type for acceptable values.
- USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
- SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances
Request JSON body:
{ "rouge_input": { "instances": { "prediction": "PREDICTION", "reference": "REFERENCE.", }, "metric_spec": { "rouge_type": "ROUGE_TYPE", "use_stemmer": USE_STEMMER, "split_summaries": SPLIT_SUMMARIES, } } }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID-/locations/REGION:evaluateInstances \" | Select-Object -Expand Content
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
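The following is a minimal REST-based sketch of the same ROUGE request, using the authorized-session pattern from the example syntax section; the project, location, and text values are placeholders, and rougeLsum is used as the ROUGE variant.

import json

from google import auth
from google.auth.transport import requests as google_auth_requests

PROJECT_ID = "your-project-id"  # placeholder
LOCATION = "us-central1"        # placeholder

creds, _ = auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

# ROUGE is computed in memory (no judge model); metric_spec selects the
# ROUGE variant and matching behavior.
data = {
    "rouge_input": {
        "metric_spec": {
            "rouge_type": "rougeLsum",
            "use_stemmer": True,
            "split_summaries": True,
        },
        "instances": [
            {
                "prediction": "A quick brown fox jumped over the lazy dog.",
                "reference": "The quick brown fox jumps over the lazy dog.",
            }
        ],
    }
}

uri = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances"
)
resp = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
print(json.dumps(resp.json(), indent=2))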
What's next
- For detailed documentation, see Run an evaluation.