Evaluate Gen AI agents

After you build a Generative AI model, you can use it to power an agent, such as a chatbot. The Gen AI evaluation service lets you measure your agent's ability to complete tasks and goals for your use case.

This guide shows you how to evaluate Generative AI agents using the Gen AI evaluation service and covers the following topics:

  • Evaluation methods: Learn about the two main approaches for agent evaluation: final response and trajectory.
  • Supported agents: See the types of agents you can evaluate, including those built with Agent Engine, LangChain, or custom functions.
  • Evaluation metrics: Understand the metrics available for evaluating an agent's final response and its trajectory.
  • Preparing your dataset: See how to structure your data for both final response and trajectory evaluation.
  • Running an evaluation: Execute an evaluation task using the Vertex AI SDK and customize metrics for your specific needs.
  • Viewing and interpreting results: Learn how to understand the instance-level and aggregate scores for your agent's performance.
  • Agent2Agent (A2A) protocol: Get an overview of the A2A open standard for multi-agent communication.

Evaluation methods

You can evaluate your agent using the following methods. With the Gen AI evaluation service, you can trigger an agent execution and get metrics for both evaluation methods in a single Vertex AI SDK call.

  • Final response evaluation: Evaluates only the final output of an agent to determine whether it achieved its goal. Use this method when the end result is the primary concern and the intermediate steps are not important.
  • Trajectory evaluation: Evaluates the entire path (the sequence of tool calls) the agent took to reach the final response. Use this method when the process, reasoning path, and tool usage are critical for debugging, optimization, or ensuring compliance.

Supported agents

The Gen AI evaluation service supports the following categories of agents:

  • Agents built with Agent Engine's template: Agent Engine (LangChain on Vertex AI) is a Google Cloud platform where you can deploy and manage agents.
  • LangChain agents built using Agent Engine's customizable template: LangChain is an open source framework.
  • Custom agent function: A flexible function that takes in a prompt for the agent and returns the agent's response and trajectory in a dictionary (a minimal sketch follows this list).
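
The following is a minimal sketch of a custom agent function, assuming the dictionary keys match the column names used elsewhere in this guide ("response" and "predicted_trajectory"). The tool call and function name are hypothetical; adapt them to your agent and verify the exact contract your evaluation setup expects.

# Minimal sketch of a custom agent function (hypothetical tool and logic).
# The returned keys mirror the column names used in this guide; verify the
# exact contract expected by your evaluation setup.
def my_custom_agent(prompt: str) -> dict:
    # Replace this hard-coded tool call with your agent's real tool invocations.
    tool_call = {
        "tool_name": "get_user_preferences",
        "tool_input": {"user_id": "user_y"},
    }
    return {
        "response": f"Handled request: {prompt}",
        "predicted_trajectory": [tool_call],
    }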

Evaluation metrics

You can define metrics for final response or trajectory evaluation.

Final response metrics

Final response evaluation follows the same process as model response evaluation. For more information, see Define your evaluation metrics.

Trajectory metrics

The following metrics evaluate the agent's ability to follow the expected trajectory.

  • trajectory_exact_match: Whether the predicted tool call sequence is identical to the reference sequence. Use for strict workflows where the exact sequence and parameters must be followed.
  • trajectory_in_order_match: Whether all reference tool calls are present in the correct order, allowing for extra calls. Use when the core sequence is important but the agent can perform additional helpful steps.
  • trajectory_any_order_match: Whether all reference tool calls are present, regardless of order or extra calls. Use when a set of tasks must be completed but the execution order is not important.
  • trajectory_precision: The proportion of predicted tool calls that are relevant (that is, also in the reference). Use to penalize agents that make many irrelevant or unnecessary tool calls.
  • trajectory_recall: The proportion of required (reference) tool calls that the agent actually made. Use to ensure the agent performs all necessary steps to complete the task.
  • trajectory_single_tool_use: Whether a specific, single tool was used at least once in the trajectory. Use to verify that a critical tool (for example, a final confirmation or safety check) was part of the process.

All trajectory metrics, except trajectory_single_tool_use, require a predicted_trajectory and a reference_trajectory as input parameters.

Exact match

The trajectory_exact_match metric returns a score of 1 if the predicted trajectory is identical to the reference trajectory, with the same tool calls in the same order. Otherwise, it returns 0.

In-order match

The trajectory_in_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory in the same order. Extra tool calls are permitted. Otherwise, it returns 0.

Any-order match

The trajectory_any_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory, regardless of their order. Extra tool calls are permitted. Otherwise, it returns 0.
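
To make the difference between the three match metrics concrete, the following sketch re-implements their matching rules in plain Python on a small example. It is an illustration of the rules described above, not the service's implementation, and it assumes flat tool inputs.

# Plain-Python illustration of the three match rules (not the service's code).
def _as_tuple(call):
    # Represent a tool call as a hashable (name, parameters) pair.
    return (call["tool_name"], tuple(sorted(call["tool_input"].items())))

def exact_match(predicted, reference):
    # Same tool calls, same order, nothing extra.
    return int([_as_tuple(c) for c in predicted] == [_as_tuple(c) for c in reference])

def in_order_match(predicted, reference):
    # All reference calls appear in order; extra predicted calls are allowed.
    remaining = iter(_as_tuple(c) for c in predicted)
    return int(all(_as_tuple(c) in remaining for c in reference))

def any_order_match(predicted, reference):
    # All reference calls appear somewhere; order and extras are ignored.
    predicted_calls = {_as_tuple(c) for c in predicted}
    return int(all(_as_tuple(c) in predicted_calls for c in reference))

reference = [
    {"tool_name": "get_user_preferences", "tool_input": {"user_id": "user_y"}},
    {"tool_name": "set_temperature", "tool_input": {"location": "Living Room", "temperature": 23}},
]
predicted = list(reversed(reference))  # Same calls, opposite order.

print(exact_match(predicted, reference))      # 0: order differs
print(in_order_match(predicted, reference))   # 0: reference order not preserved
print(any_order_match(predicted, reference))  # 1: all reference calls are present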

Precision

The trajectory_precision metric measures how many of the tool calls in the predicted trajectory are relevant according to the reference trajectory. The score is a float in the range [0, 1].

Precision is calculated by dividing the number of actions in the predicted trajectory that also appear in the reference trajectory by the total number of actions in the predicted trajectory.

Recall

The trajectory_recall metric measures how many of the essential tool calls from the reference trajectory are present in the predicted trajectory. The score is a float in the range [0, 1].

Recall is calculated by dividing the number of actions in the reference trajectory that also appear in the predicted trajectory by the total number of actions in the reference trajectory.
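
The following is a rough sketch of the two calculations above. It compares tool calls by tool name only, which is a simplifying assumption on top of the definitions given here.

# Illustrative sketch of the precision and recall formulas described above.
# Tool calls are compared by name only here, which is a simplification.
def trajectory_precision_recall(predicted, reference):
    predicted_names = [call["tool_name"] for call in predicted]
    reference_names = [call["tool_name"] for call in reference]
    # Precision: fraction of predicted calls that also appear in the reference.
    precision = sum(name in reference_names for name in predicted_names) / len(predicted_names)
    # Recall: fraction of reference calls that also appear in the prediction.
    recall = sum(name in predicted_names for name in reference_names) / len(reference_names)
    return precision, recall

predicted = [{"tool_name": "get_user_preferences"}, {"tool_name": "set_temperature"}]
reference = [{"tool_name": "get_user_preferences"}, {"tool_name": "set_device_info"}]
print(trajectory_precision_recall(predicted, reference))  # (0.5, 0.5)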

Single tool use

The trajectory_single_tool_use metric checks if a specific tool, specified in the metric spec, is used in the predicted trajectory. It doesn't check the order of tool calls or how many times the tool is used. It returns 1 if the tool is present and 0 if it is absent.

Default performance metrics

The following performance metrics are added to the evaluation results by default. You don't need to specify them in EvalTask.

latency

The time the agent takes to return a response, in seconds.

failure

A boolean that indicates whether the agent invocation resulted in an error.

Output scores

  • 1: Error
  • 0: Valid response returned

Prepare the evaluation dataset

The data schema for final response evaluation is similar to that of model response evaluation.

For computation-based trajectory evaluation, your dataset needs to provide the following information:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query. This is not required for the trajectory_single_tool_use metric.

Evaluation dataset examples

The following examples show datasets for trajectory evaluation. reference_trajectory is required for all metrics except trajectory_single_tool_use.

import pandas as pd

reference_trajectory = [
    # example 1
    [
        {
            "tool_name": "set_device_info",
            "tool_input": {
                "device_id": "device_2",
                "updates": {
                    "status": "OFF"
                }
            }
        }
    ],
    # example 2
    [
        {
            "tool_name": "get_user_preferences",
            "tool_input": {
                "user_id": "user_y"
            }
        },
        {
            "tool_name": "set_temperature",
            "tool_input": {
                "location": "Living Room",
                "temperature": 23
            }
        },
    ],
]

predicted_trajectory = [
    # example 1
    [
        {
            "tool_name": "set_device_info",
            "tool_input": {
                "device_id": "device_3",
                "updates": {
                    "status": "OFF"
                }
            }
        }
    ],
    # example 2
    [
        {
            "tool_name": "get_user_preferences",
            "tool_input": {
                "user_id": "user_z"
            }
        },
        {
            "tool_name": "set_temperature",
            "tool_input": {
                "location": "Living Room",
                "temperature": 23
            }
        },
    ],
]

eval_dataset = pd.DataFrame({
    "predicted_trajectory": predicted_trajectory,
    "reference_trajectory": reference_trajectory,
})

Import your evaluation dataset

You can import your dataset in the following formats (see the sketch after this list):

  • JSONL or CSV file stored in Cloud Storage
  • BigQuery table
  • Pandas DataFrame
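
The following is a minimal sketch of how each source might be referenced when you create an EvalTask. The bucket, project, and table names are placeholders, and the import path assumes the Vertex AI SDK's preview evaluation module; verify both against your environment.

# Placeholder dataset sources; bucket, project, and table names are hypothetical.
from vertexai.preview.evaluation import EvalTask

eval_task = EvalTask(
    dataset="gs://your-bucket/agent_eval_dataset.jsonl",  # JSONL or CSV file in Cloud Storage
    metrics=["trajectory_exact_match"],
)

# Alternatively, reference a BigQuery table by URI:
#   dataset="bq://your-project.your_dataset.agent_eval_table"
# Or pass a Pandas DataFrame directly, as in the eval_dataset example above.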

The Gen AI evaluation service provides example public datasets that demonstrate how you can evaluate your agents. The following code shows how to import a public dataset from a Cloud Storage bucket:

# dataset name to be imported
dataset = "on-device" # Alternatives: "customer-support", "content-creation"

# copy the tools and dataset file
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/tools.py .
!gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/{dataset}/eval_dataset.json .

# load the dataset examples
import json

with open("eval_dataset.json") as f:
    eval_dataset = json.load(f)

# run the tools file
%run -i tools.py

where dataset is one of the following public datasets:

  • "on-device" for an On-Device Home Assistant, which controls home devices. The agent helps with queries such as "Schedule the air conditioning in the bedroom so that it is on between 11pm and 8am, and off the rest of the time."
  • "customer-support" for a Customer Support Agent. The agent helps with queries such as "Can you cancel any pending orders and escalate any open support tickets?"
  • "content-creation" for a Marketing Content Creation Agent. The agent helps with queries such as "Reschedule campaign X to be a one-time campaign on social media site Y with a 50% reduced budget, only on December 25, 2024."

Run an evaluation

For agent evaluation, you can mix response evaluation metrics and trajectory evaluation metrics in the same task.

# Imports assume the Vertex AI SDK's preview evaluation module; adjust to your SDK version.
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import TrajectorySingleToolUse

# Replace 'tool_name' with the name of the tool you want to check for.
single_tool_use_metric = TrajectorySingleToolUse(tool_name='tool_name')

eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[
        "rouge_l_sum",
        "bleu",
        custom_trajectory_eval_metric, # custom computation-based metric
        "trajectory_exact_match",
        "trajectory_precision",
        single_tool_use_metric,
        response_follows_trajectory_metric # llm-based metric
    ],
)
eval_result = eval_task.evaluate(
    runnable=RUNNABLE,
)

Metric customization

You can customize a model-based (LLM) metric for trajectory evaluation by using a templated interface or by defining one from scratch. For more details, see model-based metrics. The following is a templated example:

# Imports assume the Vertex AI SDK's preview evaluation module; adjust to your SDK version.
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
)

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria={
        "Follows trajectory": (
            "Evaluate whether the agent's response logically follows from the "
            "sequence of actions it took. Consider these sub-points:\n"
            "  - Does the response reflect the information gathered during the trajectory?\n"
            "  - Is the response consistent with the goals and constraints of the task?\n"
            "  - Are there any unexpected or illogical jumps in reasoning?\n"
            "Provide specific examples from the trajectory and response to support your evaluation."
        )
    },
    rating_rubric={
        "1": "Follows trajectory",
        "0": "Does not follow trajectory",
    },
    input_variables=["prompt", "predicted_trajectory"],
)

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)

You can also define a custom computation-based metric for trajectory evaluation or response evaluation.

def essential_tools_present(instance, required_tools=["tool1", "tool2"]):
    """Scores the fraction of required tools that appear in the predicted trajectory."""
    trajectory = instance["predicted_trajectory"]
    tools_present = [tool_used["tool_name"] for tool_used in trajectory]
    if len(required_tools) == 0:
        return {"essential_tools_present": 1}
    score = 0
    for tool in required_tools:
        if tool in tools_present:
            score += 1
    return {
        "essential_tools_present": score / len(required_tools),
    }

custom_trajectory_eval_metric = CustomMetric(
    name="essential_tools_present",
    metric_function=essential_tools_present,
)

View and interpret the results

The evaluation results are displayed in tables for final response metrics and trajectory metrics.
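
For example, after eval_task.evaluate() returns, you can inspect both the aggregate and the instance-level views programmatically. The attribute names below reflect the evaluation result object in the Vertex AI SDK; confirm them against your SDK version.

# Inspect the results of the evaluation run from the previous section.
# summary_metrics is a dict of aggregate scores; metrics_table is a Pandas
# DataFrame with one row per evaluated instance.
print(eval_result.summary_metrics)

metrics_df = eval_result.metrics_table
print(metrics_df.columns)  # Instance-level columns for response and trajectory metrics
print(metrics_df.head())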

Tables for agent evaluation metrics

The evaluation results contain the following information:

Final response metrics

Instance-level results

  • response: The final response generated by the agent.
  • latency_in_seconds: The time taken to generate the response, in seconds.
  • failure: Indicates whether a valid response was generated.
  • score: The score calculated for the response, as specified in the metric spec.
  • explanation: The explanation for the score, as specified in the metric spec.

Aggregate results

  • mean: The average score across all instances.
  • standard deviation: The standard deviation of all the scores.

Trajectory metrics

Instance-level results

  • predicted_trajectory: The sequence of tool calls the agent followed to reach the final response.
  • reference_trajectory: The expected sequence of tool calls.
  • score: The score calculated for the predicted and reference trajectories, as specified in the metric spec.
  • latency_in_seconds: The time taken to generate the response, in seconds.
  • failure: Indicates whether a valid response was generated.

Aggregate results

  • mean: The average score across all instances.
  • standard deviation: The standard deviation of all the scores.

Agent2Agent (A2A) protocol

If you are building a multi-agent system, consider reviewing the A2A protocol. The A2A protocol is an open standard that enables seamless communication and collaboration between AI agents, regardless of their underlying frameworks. Google Cloud donated the protocol to the Linux Foundation in June 2025. To use the A2A SDKs or try out the samples, see the GitHub repository.

What's next

Try the following agent evaluation notebooks: