After you build a Generative AI model, you can use it to power an agent, such as a chatbot. The Gen AI evaluation service lets you measure your agent's ability to complete tasks and goals for your use case.
This guide shows you how to evaluate Generative AI agents with the Gen AI evaluation service and covers the following topics:
- Evaluation methods: Learn about the two main approaches for agent evaluation: final response and trajectory.
- Supported agents: See the types of agents you can evaluate, including those built with Agent Engine, LangChain, or custom functions.
- Evaluation metrics: Understand the metrics available for evaluating an agent's final response and its trajectory.
- Preparing your dataset: See how to structure your data for both final response and trajectory evaluation.
- Running an evaluation: Execute an evaluation task using the Vertex AI SDK and customize metrics for your specific needs.
- Viewing and interpreting results: Learn how to understand the instance-level and aggregate scores for your agent's performance.
- Agent2Agent (A2A) protocol: Get an overview of the A2A open standard for multi-agent communication.
Evaluation methods
You can evaluate your agent using the following methods. With the Gen AI evaluation service, you can trigger an agent execution and get metrics for both evaluation methods in one Vertex AI SDK query.
Evaluation Method | Description | Use Case |
---|---|---|
Final response evaluation | Evaluates only the final output of an agent to determine if it achieved its goal. | When the end result is the primary concern, and the intermediate steps are not important. |
Trajectory evaluation | Evaluates the entire path (sequence of tool calls) the agent took to reach the final response. | When the process, reasoning path, and tool usage are critical for debugging, optimization, or ensuring compliance. |
Supported agents
The Gen AI evaluation service supports the following categories of agents:
Supported agents | Description |
---|---|
Agent built with Agent Engine's template | Agent Engine (LangChain on Vertex AI) is a Google Cloud platform where you can deploy and manage agents. |
LangChain agents built using Agent Engine's customizable template | LangChain is an open-source framework for building applications powered by large language models. |
Custom agent function | A flexible function that takes in a prompt for the agent and returns a response and trajectory in a dictionary (see the sketch after this table). |
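For example, a custom agent function can be a plain Python callable, as in the following minimal sketch. The function name, the tool it records, and the returned keys (response and predicted_trajectory) are illustrative assumptions; align the keys with the columns that your evaluation dataset and metrics expect.
def custom_agent(prompt: str) -> dict:
    # Illustrative only: call your own model or agent framework with the prompt
    # here, and record each tool call as a dictionary with "tool_name" and
    # "tool_input".
    trajectory = [
        {
            "tool_name": "lookup_order",         # hypothetical tool
            "tool_input": {"order_id": "12345"},
        }
    ]
    response = "Order 12345 has shipped."        # hypothetical final answer
    return {
        "response": response,
        "predicted_trajectory": trajectory,
    }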
Evaluation metrics
You can define metrics for final response or trajectory evaluation.
Final response metrics
Final response evaluation follows the same process as model response evaluation. For more information, see Define your evaluation metrics.
Trajectory metrics
The following metrics evaluate the agent's ability to follow the expected trajectory.
Metric | What It Measures | When to Use |
---|---|---|
trajectory_exact_match | Whether the predicted tool call sequence is identical to the reference sequence. | For strict, non-flexible workflows where the exact sequence and parameters must be followed. |
trajectory_in_order_match | Whether all reference tool calls are present in the correct order, allowing for extra calls. | When the core sequence is important, but the agent can perform additional helpful steps. |
trajectory_any_order_match | Whether all reference tool calls are present, regardless of order or extra calls. | When a set of tasks must be completed, but the execution order is not important. |
trajectory_precision | The proportion of predicted tool calls that are relevant (that is, also in the reference). | To penalize agents that make many irrelevant or unnecessary tool calls. |
trajectory_recall | The proportion of required (reference) tool calls that the agent actually made. | To ensure the agent performs all necessary steps to complete the task. |
trajectory_single_tool_use | Whether a specific, single tool was used at least once in the trajectory. | To verify that a critical tool (for example, a final confirmation or safety check) was part of the process. |
All trajectory metrics, except trajectory_single_tool_use, require a predicted_trajectory and a reference_trajectory as input parameters.
Exact match
The trajectory_exact_match metric returns a score of 1 if the predicted trajectory is identical to the reference trajectory, with the same tool calls in the same order. Otherwise, it returns 0.
In-order match
The trajectory_in_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory in the same order. Extra tool calls are permitted. Otherwise, it returns 0.
Any-order match
The trajectory_any_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory, regardless of their order. Extra tool calls are permitted. Otherwise, it returns 0.
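To see how these three match metrics differ, consider the following illustrative comparison. The trajectories are toy data, and the comments only restate the scoring rules described above; this is not the service's implementation.
reference = [
    {"tool_name": "get_weather", "tool_input": {"city": "Paris"}},
    {"tool_name": "send_message", "tool_input": {"text": "Sunny"}},
]
predicted = [
    {"tool_name": "get_weather", "tool_input": {"city": "Paris"}},
    {"tool_name": "log_event", "tool_input": {"event": "weather_lookup"}},  # extra call
    {"tool_name": "send_message", "tool_input": {"text": "Sunny"}},
]
# trajectory_exact_match:     0 (the extra log_event call breaks the exact match)
# trajectory_in_order_match:  1 (both reference calls appear, in the same order)
# trajectory_any_order_match: 1 (both reference calls appear; order is ignored)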
Precision
The trajectory_precision metric measures how many of the tool calls in the predicted trajectory are relevant according to the reference trajectory. The score is a float in the range [0, 1].
Precision is calculated by dividing the number of actions in the predicted trajectory that also appear in the reference trajectory by the total number of actions in the predicted trajectory.
Recall
The trajectory_recall metric measures how many of the essential tool calls from the reference trajectory are present in the predicted trajectory. The score is a float in the range [0, 1].
Recall is calculated by dividing the number of actions in the reference trajectory that also appear in the predicted trajectory by the total number of actions in the reference trajectory.
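The following sketch works through the precision and recall arithmetic by hand on a toy example. For simplicity it compares tool names only; it illustrates the formulas above and is not the service's implementation.
def toy_precision_recall(predicted, reference):
    # Count how many predicted calls appear in the reference, and vice versa.
    matched_predicted = sum(1 for call in predicted if call in reference)
    matched_reference = sum(1 for call in reference if call in predicted)
    return matched_predicted / len(predicted), matched_reference / len(reference)

predicted = ["search_flights", "check_weather", "book_flight"]
reference = ["search_flights", "book_flight"]
precision, recall = toy_precision_recall(predicted, reference)
print(precision, recall)  # 0.666... (2 of 3 predicted calls are relevant), 1.0 (both reference calls were made)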
Single tool use
The trajectory_single_tool_use metric checks whether a specific tool, specified in the metric spec, is used in the predicted trajectory. It doesn't check the order of tool calls or how many times the tool is used. It returns 1 if the tool is present and 0 if it is absent.
Default performance metrics
The following performance metrics are added to the evaluation results by default. You don't need to specify them in EvalTask.
Metric | Description |
---|---|
latency | Time taken by the agent to return a response, calculated in seconds. |
failure | A boolean that indicates whether the agent invocation resulted in an error. A score of 1 indicates an error; a score of 0 indicates that a valid response was returned. |
Prepare the evaluation dataset
The data schema for final response evaluation is similar to that of model response evaluation.
For computation-based trajectory evaluation, your dataset needs to provide the following information:
- predicted_trajectory: The list of tool calls used by the agent to reach the final response.
- reference_trajectory: The expected tool use for the agent to satisfy the query. This is not required for the trajectory_single_tool_use metric.
Evaluation dataset examples
The following examples show datasets for trajectory evaluation. reference_trajectory is required for all metrics except trajectory_single_tool_use.
reference_trajectory = [
# example 1
[
{
"tool_name": "set_device_info",
"tool_input": {
"device_id": "device_2",
"updates": {
"status": "OFF"
}
}
}
],
# example 2
[
{
"tool_name": "get_user_preferences",
"tool_input": {
"user_id": "user_y"
}
},
{
"tool_name": "set_temperature",
"tool_input": {
"location": "Living Room",
"temperature": 23
}
},
]
]
predicted_trajectory = [
# example 1
[
{
"tool_name": "set_device_info",
"tool_input": {
"device_id": "device_3",
"updates": {
"status": "OFF"
}
}
}
],
# example 2
[
{
"tool_name": "get_user_preferences",
"tool_input": {
"user_id": "user_z"
}
},
{
"tool_name": "set_temperature",
"tool_input": {
"location": "Living Room",
"temperature": 23
}
},
]
]
import pandas as pd

eval_dataset = pd.DataFrame({
    "predicted_trajectory": predicted_trajectory,
    "reference_trajectory": reference_trajectory,
})
Import your evaluation dataset
You can import your dataset in the following formats:
- JSONL or CSV file stored in Cloud Storage
- BigQuery table
- Pandas DataFrame
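Any of these formats can be passed as the dataset argument when you construct an EvalTask. The following is a rough sketch: the import path shown is for the preview evaluation SDK and may differ in your SDK version, and the Cloud Storage and BigQuery paths are placeholders.
# Import path shown for the preview Gen AI evaluation SDK; adjust for your version.
from vertexai.preview.evaluation import EvalTask
import pandas as pd

metrics = ["trajectory_exact_match"]

# Pandas DataFrame built in memory, as in the earlier example.
eval_task = EvalTask(
    dataset=pd.DataFrame({
        "predicted_trajectory": predicted_trajectory,
        "reference_trajectory": reference_trajectory,
    }),
    metrics=metrics,
)

# JSONL or CSV file stored in Cloud Storage (placeholder path).
eval_task = EvalTask(dataset="gs://your-bucket/eval_dataset.jsonl", metrics=metrics)

# BigQuery table (placeholder path).
eval_task = EvalTask(dataset="bq://your-project.your_dataset.eval_table", metrics=metrics)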
The Gen AI evaluation service provides example public datasets that demonstrate how you can evaluate your agents. The following code shows how to import one of these datasets from a Cloud Storage bucket:
# dataset name to be imported
dataset = "on-device" # Alternatives: "customer-support", "content-creation"
# copy the tools and dataset file
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/tools.py .
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/eval_dataset.json .
# load the dataset examples
import json

with open('eval_dataset.json') as f:
    eval_dataset = json.load(f)
# run the tools file
%run -i tools.py
where dataset is one of the following public datasets:
- "on-device" for an On-Device Home Assistant, which controls home devices. The agent helps with queries such as "Schedule the air conditioning in the bedroom so that it is on between 11pm and 8am, and off the rest of the time."
- "customer-support" for a Customer Support Agent. The agent helps with queries such as "Can you cancel any pending orders and escalate any open support tickets?"
- "content-creation" for a Marketing Content Creation Agent. The agent helps with queries such as "Reschedule campaign X to be a one-time campaign on social media site Y with a 50% reduced budget, only on December 25, 2024."
Run an evaluation
For agent evaluation, you can mix response evaluation metrics and trajectory evaluation metrics in the same task.
single_tool_use_metric = TrajectorySingleToolUse(tool_name='tool_name')
eval_task = EvalTask(
dataset=EVAL_DATASET,
metrics=[
"rouge_l_sum",
"bleu",
custom_trajectory_eval_metric, # custom computation-based metric
"trajectory_exact_match",
"trajectory_precision",
single_tool_use_metric,
response_follows_trajectory_metric # llm-based metric
],
)
eval_result = eval_task.evaluate(
runnable=RUNNABLE,
)
Metric customization
You can customize a large language model-based metric for trajectory evaluation by using a templated interface or by defining the metric from scratch. For more details, see model-based metrics. The following is a templated example:
response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
criteria={
"Follows trajectory": (
"Evaluate whether the agent's response logically follows from the "
"sequence of actions it took. Consider these sub-points:\n"
" - Does the response reflect the information gathered during the trajectory?\n"
" - Is the response consistent with the goals and constraints of the task?\n"
" - Are there any unexpected or illogical jumps in reasoning?\n"
"Provide specific examples from the trajectory and response to support your evaluation."
)
},
rating_rubric={
"1": "Follows trajectory",
"0": "Does not follow trajectory",
},
input_variables=["prompt", "predicted_trajectory"],
)
response_follows_trajectory_metric = PointwiseMetric(
metric="response_follows_trajectory",
metric_prompt_template=response_follows_trajectory_prompt_template,
)
You can also define a custom computation-based metric for trajectory evaluation or response evaluation.
def essential_tools_present(instance, required_tools = ["tool1", "tool2"]):
trajectory = instance["predicted_trajectory"]
tools_present = [tool_used['tool_name'] for tool_used in trajectory]
if len(required_tools) == 0:
return {"essential_tools_present": 1}
score = 0
for tool in required_tools:
if tool in tools_present:
score += 1
return {
"essential_tools_present": score/len(required_tools),
}
custom_trajectory_eval_metric = CustomMetric(name="essential_tools_present", metric_function=essential_tools_present)
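Because the custom metric is a plain Python function, you can sanity-check it locally on a hand-built instance before adding it to an EvalTask. The tool names below reuse the placeholder names from the function's default required_tools list:
sample_instance = {
    "predicted_trajectory": [
        {"tool_name": "tool1", "tool_input": {}},
        {"tool_name": "other_tool", "tool_input": {}},
    ]
}
# Only "tool1" of the two required tools appears in the trajectory, so the score is 0.5.
print(essential_tools_present(sample_instance))  # {'essential_tools_present': 0.5}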
View and interpret the results
The evaluation results are displayed in tables for final response metrics and trajectory metrics.
The evaluation results contain the following information:
Final response metrics
Instance-level results
Column | Description |
---|---|
response | Final response generated by the agent. |
latency_in_seconds | Time taken to generate the response. |
failure | Indicates whether a valid response was generated. |
score | A score calculated for the response specified in the metric spec. |
explanation | The explanation for the score specified in the metric spec. |
Aggregate results
Column | Description |
---|---|
mean | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |
Trajectory metrics
Instance-level results
Column | Description |
---|---|
predicted_trajectory | Sequence of tool calls followed by the agent to reach the final response. |
reference_trajectory | Sequence of expected tool calls. |
score | A score calculated for the predicted trajectory and reference trajectory specified in the metric spec. |
latency_in_seconds | Time taken to generate the response. |
failure | Indicates whether a valid response was generated. |
Aggregate results
Column | Description |
---|---|
mean | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |
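To inspect these tables programmatically, you can read them from the object returned by evaluate(). The following sketch assumes the evaluation result exposes summary_metrics and metrics_table attributes, as in the Vertex AI SDK samples; confirm the attribute names for your SDK version.
# eval_result is the object returned by eval_task.evaluate() in the example above.
# Aggregate results: mean and standard deviation for each metric.
print(eval_result.summary_metrics)

# Instance-level results: one row per evaluated example, as a pandas DataFrame.
metrics_df = eval_result.metrics_table
print(metrics_df[["latency_in_seconds", "failure"]].head())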
Agent2Agent (A2A) protocol
If you are building a multi-agent system, consider reviewing the A2A protocol, an open standard that enables seamless communication and collaboration between AI agents, regardless of their underlying frameworks. Google Cloud donated the protocol to the Linux Foundation in June 2025. To use the A2A SDKs or try out the samples, see the GitHub repository.
What's next
Try the following agent evaluation notebooks:
- Evaluate a LangGraph agent
- Evaluate a CrewAI agent
- Evaluate a LangChain agent with Agent Engine
- Evaluate a LangGraph agent with Agent Engine
- Evaluate a CrewAI agent with Agent Engine