The rapid evaluation service enables users to evaluate their LLM models, both pointwise and pairwise, across several metrics. Users provide inference-time inputs, LLM responses, and additional parameters, and the service returns metrics specific to the evaluation task. Metrics include both model-based metrics (for example, SummarizationQuality) and in-memory-computed metrics (for example, ROUGE, BLEU, and tool/function call metrics). Because the service takes prediction results directly from models as input, it can evaluate any model supported by Vertex.
Limitations
- Model-based metrics consume text-bison quota: the rapid evaluation service uses text-bison as the underlying arbiter model to compute model-based metrics.
- The service has a propagation delay: it may remain unavailable for several minutes after the first call to the service.
Syntax
- PROJECT_ID: your Google Cloud project ID.
- REGION: the region where the request is processed.
- MODEL_ID: the ID of the model being evaluated.
curl
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances
```
Python
```python
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

uri = f'https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances'

result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
```
Parameter list
The full list of available metrics follows; a small helper for posting any of these inputs to the `evaluateInstances` endpoint is sketched after the table.
Parameters | Description
---|---
`exact_match_input` | Optional: Input to assess if the prediction matches the reference exactly.
`bleu_input` | Optional: Input to compute BLEU score by comparing the prediction against the reference.
`rouge_input` | Optional: Input to compute ROUGE scores, supporting different ROUGE types.
`fluency_input` | Optional: Input to assess a single response's language mastery.
`coherence_input` | Optional: Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input` | Optional: Input to assess a single response's level of safety.
`groundedness_input` | Optional: Input to assess a single response's ability to provide or reference information included only in the input text.
`fulfillment_input` | Optional: Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input` | Optional: Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input` | Optional: Input to compare two responses' overall summarization quality.
`summarization_helpfulness_input` | Optional: Input to assess a single response's ability to provide a summarization that contains the details necessary to substitute for the original text.
`summarization_verbosity_input` | Optional: Input to assess a single response's ability to provide a succinct summarization.
`question_answering_quality_input` | Optional: Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input` | Optional: Input to compare two responses' overall ability to answer questions, given a body of text to reference.
`question_answering_relevance_input` | Optional: Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input` | Optional: Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input` | Optional: Input to assess a single response's ability to correctly answer a question.
`tool_call_valid_input` | Optional: Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input` | Optional: Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input` | Optional: Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input` | Optional: Input to assess a single response's ability to predict a tool call with correct parameter names and values.
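Each `evaluateInstances` request carries exactly one of the inputs above in its body. The per-metric payload sketches later in this document can be posted with a small helper like the following; this is a minimal sketch based on the Python sample in the Syntax section, the `evaluate_instances` name is not part of the API, and the `PROJECT_ID` and `REGION` values are placeholders you must set.

```python
import json

from google import auth
from google.auth.transport import requests as google_auth_requests

PROJECT_ID = 'PROJECT_ID'  # placeholder: your Google Cloud project ID
REGION = 'REGION'          # placeholder: for example 'us-central1'


def evaluate_instances(data: dict) -> dict:
    """Posts one metric input to the evaluateInstances endpoint and returns the parsed JSON."""
    creds, _ = auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    uri = (f'https://{REGION}-aiplatform.googleapis.com/v1beta1/'
           f'projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances')
    result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
    return result.json()
```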
ExactMatchInput
{ "exact_match_input: { "metric_spec": {}, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: LLM response. |
|
Optional: Golden LLM response for reference. |
ExactMatchResults
{ "exact_match_results: { "exact_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
Evaluation results per instance input. |
|
One of the following:
|
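For instance, a minimal exact-match request body looks like the following; this is a sketch with made-up sample sentences, and each returned `score` is expected to be `0` or `1`.

```python
# Exact match: compares each prediction to its reference verbatim.
data = {
    "exact_match_input": {
        "metric_spec": {},
        "instances": [
            {
                "prediction": "The quick brown fox jumps over the lazy dog.",
                "reference": "The quick brown fox jumps over the lazy dog.",
            },
            {
                "prediction": "A quick brown fox jumps over the lazy dog.",
                "reference": "The quick brown fox jumps over the lazy dog.",
            },
        ]
    }
}
# POST `data` to the evaluateInstances endpoint, for example with the
# evaluate_instances() helper sketched after the parameter list above.
```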
BleuInput
{ "bleu_input: { "metric_spec": {}, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: LLM response. |
|
Optional: Golden LLM response for reference. |
BleuResults
{ "bleu_results: { "bleu_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
Evaluation results per instance input. |
|
|
RougeInput
{ "rouge_input: { "metric_spec": { "rouge_type": string, "use_stemmer": bool, "split_summaries": bool }, "instances": [ { "prediction": string, "reference": string } ] } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Acceptable values:
|
|
Optional: Whether Porter stemmer should be used to strip word suffixes to improve matching. |
|
Optional: Whether to add newlines between sentences for rougeLsum. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: LLM response. |
|
Optional: Golden LLM response for reference. |
RougeResults
{ "rouge_results: { "rouge_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
Evaluation results per instance input. |
|
|
FluencyInput
{ "fluency_input: { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response. |
|
Optional: LLM response. |
FluencyResult
{ "fluency_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
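As an illustration, a model-based pointwise metric such as fluency wraps a single response in `instance` rather than an `instances` list; the sample text below is made up.

```python
# Fluency: model-based, pointwise; the result carries score, explanation, and confidence.
data = {
    "fluency_input": {
        "metric_spec": {},
        "instance": {
            "prediction": (
                "Paris, the capital of France, is known for its cafes, "
                "museums, and the Eiffel Tower."
            ),
        }
    }
}
# The response contains fluency_result.score, .explanation, and .confidence.
```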
CoherenceInput
{ "coherence_input: { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response. |
|
Optional: LLM response. |
CoherenceResult
{ "coherence_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
SafetyInput
{ "safety_input: { "metric_spec": {}, "instance": { "prediction": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response. |
|
Optional: LLM response. |
SafetyResult
{ "safety_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
GroundednessInput
{ "groundedness_input: { "metric_spec": {}, "instance": { "prediction": string, "context": string } } }
Parameter |
Description |
|
Optional: GroundednessSpec Metric spec, defining the metric's behavior. |
|
Optional: GroundednessInstance Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
GroundednessResult
{ "groundedness_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
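For example, a groundedness request supplies the inference-time `context` alongside the response so the arbiter model can check that the response uses only information from that context; the sample text below is made up.

```python
# Groundedness: checks that the response is supported by the provided context.
data = {
    "groundedness_input": {
        "metric_spec": {},
        "instance": {
            "prediction": "The library opens at 9 AM on weekdays.",
            "context": (
                "Community library hours: Monday through Friday, "
                "9 AM to 6 PM; closed on weekends."
            ),
        }
    }
}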
FulfillmentInput
{ "fulfillment_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
FulfillmentResult
{ "fulfillment_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
SummarizationQualityInput
{ "summarization_quality_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationQualityResult
{ "summarization_quality_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
PairwiseSummarizationQualityInput
{ "pairwise_summarization_quality_input: { "metric_spec": {}, "instance": { "baseline_prediction": string, "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: Baseline model LLM response. |
|
Optional: Candidate model LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseSummarizationQualityResult
{ "pairwise_summarization_quality_result: { "pairwise_choice": PairwiseChoice, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
SummarizationHelpfulnessInput
{ "summarization_helpfulness_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationHelpfulnessResult
{ "summarization_helpfulness_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
SummarizationVerbosityInput
{ "summarization_verbosity_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
SummarizationVerbosityResult
{ "summarization_verbosity_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
QuestionAnsweringQualityInput
{ "question_answering_quality_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string, } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringQualityResult
{ "question_answering_quality_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
PairwiseQuestionAnsweringQualityInput
{ "question_answering_quality_input: { "metric_spec": {}, "instance": { "baseline_prediction": string, "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: Baseline model LLM response. |
|
Optional: Candidate model LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
PairwiseQuestionAnsweringQualityResult
{ "pairwise_question_answering_quality_result: { "pairwise_choice": PairwiseChoice, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
QuestionAnsweringRelevanceInput
{ "question_answering_quality_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringRelevancyResult
{ "question_answering_relevancy_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
QuestionAnsweringHelpfulnessInput
{ "question_answering_helpfulness_input: { "metric_spec": {}, "instance": { "prediction": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringHelpfulnessResult
{ "question_answering_helpfulness_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
QuestionAnsweringCorrectnessInput
{ "question_answering_correctness_input: { "metric_spec": { "use_reference": bool }, "instance": { "prediction": string, "reference": string, "instruction": string, "context": string } } }
Parameters | |
---|---|
|
Optional: |
|
Optional: If reference is used or not in the evaluation. |
|
Optional: Evaluation input, consisting of inference inputs and corresponding response. |
|
Optional: LLM response. |
|
Optional: Golden LLM response for reference. |
|
Optional: Instruction used at inference time. |
|
Optional: Inference-time text containing all information, which can be used in the LLM response. |
QuestionAnsweringCorrectnessResult
{ "question_answering_correctness_result: { "score": float, "explanation": string, "confidence": float } }
Output | |
---|---|
|
|
|
|
|
|
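For example, when `use_reference` is set to `true`, the instance should also carry the golden `reference` answer; this is a sketch with made-up sample data.

```python
# Question answering correctness, judged against a golden reference answer.
data = {
    "question_answering_correctness_input": {
        "metric_spec": {
            "use_reference": True,
        },
        "instance": {
            "prediction": "The Eiffel Tower is in Paris.",
            "reference": "Paris.",
            "instruction": "Answer the question based on the context.",
            "context": "The Eiffel Tower is a landmark in Paris, France.",
        }
    }
}
```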
ToolCallValidInput
{ "tool_call_valid_input: { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: Candidate model LLM response, which is a JSON serialized string that contains { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] } |
|
Optional: Golden model output in the same format as prediction. |
ToolCallValidResults
{ "tool_call_valid_results: { "tool_call_valid_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
repeated |
|
|
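Because `prediction` and `reference` are JSON-serialized strings rather than JSON objects, it is easiest to build them with `json.dumps`. The following is a sketch reusing the `book_tickets` example from the parameter table above.

```python
import json

# Tool call metrics take the model output as a JSON-serialized *string*
# with "content" and "tool_calls" keys.
tool_call = {
    "content": "",
    "tool_calls": [
        {
            "name": "book_tickets",
            "arguments": {
                "movie": "Mission Impossible Dead Reckoning Part 1",
                "theater": "Regal Edwards 14",
                "location": "Mountain View CA",
                "showtime": "7:30",
                "date": "2024-03-30",
                "num_tix": "2",
            },
        }
    ],
}

data = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {
            "prediction": json.dumps(tool_call),
            "reference": json.dumps(tool_call),
        }
    }
}
```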
ToolNameMatchInput
{ "tool_name_match_input: { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: Candidate model LLM response, which is a JSON serialized string that contains |
|
Optional: Golden model output in the same format as prediction. |
ToolNameMatchResults
{ "tool_name_match_results: { "tool_name_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
repeated |
|
|
ToolParameterKeyMatchInput
{ "tool_parameter_key_match_input: { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: Candidate model LLM response, which is a JSON serialized string that contains |
|
Optional: Golden model output in the same format as prediction. |
ToolParameterKeyMatchResults
{ "tool_parameter_key_match_results: { "tool_parameter_key_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
repeated |
|
|
ToolParameterKVMatchInput
{ "tool_parameter_kv_match_input: { "metric_spec": {}, "instance": { "prediction": string, "reference": string } } }
Parameters | |
---|---|
|
Optional: Metric spec, defining the metric's behavior. |
|
Optional: Evaluation input, consisting of LLM response and reference. |
|
Optional: Candidate model LLM response, which is a JSON serialized string that contains |
|
Optional: Golden model output in the same format as prediction. |
ToolParameterKVMatchResults
{ "tool_parameter_kv_match_results: { "tool_parameter_kv_match_metric_values": [ { "score": float } ] } }
Output | |
---|---|
|
repeated |
|
|
Examples
- PROJECT_ID: your Google Cloud project ID.
- REGION: the region where the request is processed.
Pairwise Summarization Quality
Here we demonstrate how to call the rapid evaluation API to evaluate the output of an LLM. In this case, we make a pairwise summarization quality comparison.
curl
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances \
  -d '{
    "pairwise_summarization_quality_input": {
      "metric_spec": {},
      "instance": {
        "prediction": "France is a country located in Western Europe.",
        "baseline_prediction": "France is a country.",
        "instruction": "Summarize the context.",
        "context": "France is a country located in Western Europe. It'\''s bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France'\''s coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."
      }
    }
  }'
```
Python
```python
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "France is a country located in Western Europe.",
      "baseline_prediction": "France is a country.",
      "instruction": "Summarize the context.",
      "context": (
          "France is a country located in Western Europe. It's bordered by "
          "Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, "
          "and Andorra. France's coastline stretches along the English "
          "Channel, the North Sea, the Atlantic Ocean, and the Mediterranean "
          "Sea. Known for its rich history, iconic landmarks like the Eiffel "
          "Tower, and delicious cuisine, France is a major cultural and "
          "economic power in Europe and throughout the world."
      ),
    }
  }
}

uri = f'https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances'

result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
```
ROUGE
Next, we call the API to get the ROUGE score of each prediction, given its reference.
curl
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}:evaluateInstances \
  -d '{
    "rouge_input": {
      "metric_spec": {
        "rouge_type": "rougeLsum",
        "use_stemmer": true,
        "split_summaries": true
      },
      "instances": [
        {
          "prediction": "A fast brown fox leaps over a lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog."
        },
        {
          "prediction": "A quick brown fox jumps over the lazy canine.",
          "reference": "The quick brown fox jumps over the lazy dog."
        },
        {
          "prediction": "The speedy brown fox jumps over the lazy dog.",
          "reference": "The quick brown fox jumps over the lazy dog."
        }
      ]
    }
  }'
```
Python
```python
import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  "rouge_input": {
    "metric_spec": {
      "rouge_type": "rougeLsum",
      "use_stemmer": True,
      "split_summaries": True
    },
    "instances": [
      {
        "prediction": "A fast brown fox leaps over a lazy dog.",
        "reference": "The quick brown fox jumps over the lazy dog.",
      },
      {
        "prediction": "A quick brown fox jumps over the lazy canine.",
        "reference": "The quick brown fox jumps over the lazy dog.",
      },
      {
        "prediction": "The speedy brown fox jumps over the lazy dog.",
        "reference": "The quick brown fox jumps over the lazy dog.",
      }
    ]
  }
}

uri = f'https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}:evaluateInstances'

result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
```
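The response mirrors the `RougeResults` schema above, so the per-instance scores can be read back as follows; this sketch assumes `result` is the response object from the Python sample above.

```python
# Each entry in rouge_metric_values corresponds to one input instance, in order.
response = result.json()
for i, value in enumerate(response["rouge_results"]["rouge_metric_values"]):
    print(f"instance {i}: rougeLsum = {value['score']:.4f}")
```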