Prepare your evaluation dataset

For the Gen AI evaluation service, the evaluation dataset typically consists of the model response that you want to evaluate, the input data used to generate your response, and possibly the ground truth response.

Evaluation dataset schema

For typical model-based metrics use cases, your dataset needs to provide the following information:

Input type | Input field contents
prompt | User input for the Gen AI model or application. It's optional in some cases.
response | Your LLM inference response to be evaluated.
baseline_model_response (required by pairwise metrics) | The baseline LLM inference response that your LLM response is compared against in the pairwise evaluation.

If you use the Gen AI Evaluation module of the Vertex AI SDK for Python, the Gen AI evaluation service can automatically generate the response and baseline_model_response using the model that you specify.

For other evaluation use cases, you may need to provide more information:

Multi-turn or chat

Input type | Input field contents
history | The history of the conversation between user and model before the current turn.
prompt | User input for the Gen AI model or application in the current turn.
response | Your LLM inference response to be evaluated, which is based on the history and current turn prompt.
baseline_model_response (required by pairwise metrics) | The baseline LLM inference response that your LLM response is compared against in the pairwise evaluation. It is also based on the history and current turn prompt.
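
As a rough illustration, a multi-turn record could be laid out as shown in the following sketch. The sample conversation is invented, and encoding history as a plain string is an assumption for illustration; check the metric you use for the exact history format it expects.

```python
# Hypothetical multi-turn records. Encoding "history" as a plain string
# is an assumption, not a documented requirement.
multi_turn_records = [
    {
        "history": (
            "user: I want to plan a trip to Rome.\n"
            "model: Great! How many days will you stay?"
        ),
        "prompt": "Three days. What should I see first?",
        "response": "Start with the Colosseum and the Roman Forum.",
    },
]

# Column-oriented view of the same data. If you use pandas, this dict
# can be passed directly to pd.DataFrame(columns).
columns = {
    key: [record[key] for record in multi_turn_records]
    for key in ("history", "prompt", "response")
}
```

For pairwise metrics, a baseline_model_response column would be added in the same way.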

Computation-based metrics

Your dataset needs to provide a response from the large language model and a reference to compare to.

Input type | Input field contents
response | Your LLM inference response to be evaluated.
reference | The ground truth to compare your LLM response to.

Translation metrics

Your dataset needs to provide a response from the model. Depending on your use case, you also need to provide a reference to compare to, an input in the source language, or a combination of both.

Input type | Input field contents
source | Source text, in the original language that the response was translated from.
response | Your LLM inference response to be evaluated.
reference | The ground truth to compare your LLM response to. This is in the same language as the response.

Depending on your use case, you can also break down the input user prompt into granular pieces, such as instruction and context, and assemble them for inference by providing a prompt template. You can also provide the reference or ground truth information if needed:

Input type | Input field contents
instruction | Part of the input user prompt. It refers to the inference instruction that is sent to your LLM. For example: "Please summarize the following text" is an instruction.
context | Part of the input user prompt. It refers to the content that the instruction operates on. For example, the text to be summarized.
reference | The ground truth to compare your LLM response to.
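
The assembly step can be sketched with plain string formatting. The template text and the sample row are hypothetical; only the instruction and context field names come from the table above, and the SDK's own prompt template mechanism may differ.

```python
# Hypothetical template. The placeholder names match the dataset fields
# "instruction" and "context".
prompt_template = "{instruction}\n\nText: {context}"

rows = [
    {
        "instruction": "Please summarize the following text",
        "context": "A team of archaeologists has unearthed ancient artifacts.",
    },
]

# Assemble one prompt per row by filling the template with that row's fields.
prompts = [prompt_template.format(**row) for row in rows]
print(prompts[0])
```

Keeping instruction and context in separate columns makes it possible to try different templates against the same data.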

The required inputs for the evaluation dataset should be consistent with your metrics. For more details regarding customizing your metrics, see Define your evaluation metrics and Run evaluation. For more details regarding how to include reference data in your model-based metrics, see Adapt a metric prompt template to your input data.
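
One way to check this consistency before running an evaluation is a small column check like the following sketch. The REQUIRED_COLUMNS mapping and the missing_columns helper are hypothetical and only summarize the schema tables in this section; they are not part of the SDK, and the service may generate some columns for you.

```python
# Hypothetical mapping from use case to required columns, summarizing
# the schema tables in this section. Not an SDK constant.
REQUIRED_COLUMNS = {
    "pointwise": {"response"},  # "prompt" is optional in some cases
    "pairwise": {"response", "baseline_model_response"},
    "computation": {"response", "reference"},
}


def missing_columns(dataset_columns, use_case):
    """Return the required columns that the dataset lacks, sorted by name."""
    return sorted(REQUIRED_COLUMNS[use_case] - set(dataset_columns))


# A dataset with only prompt and response is not enough for pairwise metrics.
print(missing_columns(["prompt", "response"], "pairwise"))
```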

Import your evaluation dataset

You can import your dataset in the following formats:

  • JSONL or CSV file stored in Cloud Storage

  • BigQuery table

  • Pandas DataFrame
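
For the file-based formats, the following minimal sketch writes the same records as JSONL and as CSV using only the Python standard library. The file names and sample records are hypothetical, and uploading the files to Cloud Storage is not shown.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample records using the prompt and response fields.
records = [
    {"prompt": "What is the hardest material?", "response": "Diamond."},
    {"prompt": "Who directed The Godfather?", "response": "Francis Ford Coppola."},
]

out_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(out_dir, "eval_dataset.jsonl")
csv_path = os.path.join(out_dir, "eval_dataset.csv")

# JSONL: one JSON object per line.
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# CSV: a header row with the column names, then one row per record.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "response"])
    writer.writeheader()
    writer.writerows(records)
```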

Evaluation dataset examples

This section shows dataset examples using the Pandas DataFrame format. Only a few records are shown here as examples; evaluation datasets usually have 100 or more data points. For best practices when preparing a dataset, see the Best practices section.

Pointwise model-based metrics

The following summarization example shows a sample dataset for pointwise model-based metrics:

import pandas as pd

prompts = [
    # Example 1
    (
        "Summarize the text in one sentence: As part of a comprehensive"
        " initiative to tackle urban congestion and foster sustainable urban"
        " living, a major city has revealed ambitious plans for an extensive"
        " overhaul of its public transportation system. The project aims not"
        " only to improve the efficiency and reliability of public transit but"
        " also to reduce the city's carbon footprint and promote eco-friendly"
        " commuting options. City officials anticipate that this strategic"
        " investment will enhance accessibility for residents and visitors"
        " alike, ushering in a new era of efficient, environmentally conscious"
        " urban transportation."
    ),
    # Example 2
    (
        "Summarize the text such that a five-year-old can understand: A team of"
        " archaeologists has unearthed ancient artifacts shedding light on a"
        " previously unknown civilization. The findings challenge existing"
        " historical narratives and provide valuable insights into human"
        " history."
    ),
]

responses = [
    # Example 1
    (
        "A major city is revamping its public transportation system to fight"
        " congestion, reduce emissions, and make getting around greener and"
        " easier."
    ),
    # Example 2
    (
        "Some people who dig for old things found some very special tools and"
        " objects that tell us about people who lived a long, long time ago!"
        " What they found is like a new puzzle piece that helps us understand"
        " how people used to live."
    ),
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
})

Pairwise model-based metrics

The following open-book question-answering example shows a sample dataset for pairwise model-based metrics:

import pandas as pd

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1
    "Steel is the hardest material.",
    # Example 2
    "John Smith.",
]

eval_dataset = pd.DataFrame(
  {
    "prompt":  prompts,
    "response":  responses,
    "baseline_model_response": baseline_model_responses,
  }
)

Computation-based metrics

For computation-based metrics, a reference is often required:

import pandas as pd

eval_dataset = pd.DataFrame({
  "response": ["The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."],
  "reference": ["The Roman Senate was filled with exuberance due to successes against Catiline."],
})
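
To show why the reference matters, here is a minimal sketch of an exact-match style comparison on the record above. The exact_match function is a simplified stand-in written for illustration, not the service's implementation; the whitespace stripping is an assumption.

```python
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 when the stripped strings are identical, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0


response = "The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."
reference = "The Roman Senate was filled with exuberance due to successes against Catiline."

score = exact_match(response, reference)  # 0.0, because the two strings differ
```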

Tool-use (function calling) metrics

The following example shows input data for computation-based tool-use metrics:

import pandas as pd

json_responses = ["""{
    "content": "",
    "tool_calls": [{
      "name": "get_movie_info",
      "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
  }"""]

json_references = ["""{
    "content": "",
    "tool_calls": [{
      "name": "book_tickets",
      "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
  }"""]

eval_dataset = pd.DataFrame({
    "response": json_responses,
    "reference": json_references,
})
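
The following sketch shows the kind of comparison that a tool-use metric could make on such a record: parse both JSON strings, then compare tool names and arguments. The tool_names and tool_arguments helpers are hypothetical and only illustrate the idea; the service's actual scoring logic may differ.

```python
import json

response = json.loads("""{
    "content": "",
    "tool_calls": [{
      "name": "get_movie_info",
      "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
}""")

reference = json.loads("""{
    "content": "",
    "tool_calls": [{
      "name": "book_tickets",
      "arguments": {"movie": "Mission Impossible", "time": "today 7:30PM"}
    }]
}""")


def tool_names(payload):
    """List the tool names in call order."""
    return [call["name"] for call in payload["tool_calls"]]


def tool_arguments(payload):
    """List the argument dicts in call order."""
    return [call["arguments"] for call in payload["tool_calls"]]


# The response calls a different tool than the reference, with the same arguments.
name_match = tool_names(response) == tool_names(reference)
args_match = tool_arguments(response) == tool_arguments(reference)
```

In this record, the names differ while the arguments agree, so a metric that checks tool names and a metric that checks arguments would score it differently.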

Translation use cases

The following example shows input data for translation metrics:

import pandas as pd

source = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

response = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

reference = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]

eval_dataset = pd.DataFrame({
    "source": source,
    "response": response,
    "reference": reference,
})

Best practices

Follow these best practices when defining your evaluation dataset:

  • Provide examples that represent the types of inputs that your models process in production.
  • Your dataset must include a minimum of one evaluation example. We recommend around 100 examples to ensure high-quality aggregated metrics and statistically significant results. This size helps establish a higher confidence level in the aggregated evaluation results, minimizing the influence of outliers and ensuring that the performance metrics reflect the model's true capabilities across diverse scenarios. The rate of aggregated metric quality improvements tends to decrease when more than 400 examples are provided.
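
The diminishing returns from larger datasets can be illustrated with the standard error of a mean score, which shrinks with the square root of the number of examples. This is a general statistical sketch that assumes independent per-example scores and a hypothetical standard deviation of 1.0; it does not describe how the service aggregates metrics.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean for n independent scores."""
    return std_dev / math.sqrt(n)


ASSUMED_STD_DEV = 1.0  # hypothetical spread of per-example scores

# Each 4x increase in dataset size only halves the standard error.
for n in (25, 100, 400, 1600):
    print(n, standard_error(ASSUMED_STD_DEV, n))
```

Under these assumptions, going from 100 to 400 examples halves the uncertainty, and halving it again would take 1,600 examples.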

What's next