Prepare your evaluation dataset

This guide shows you how to prepare an evaluation dataset for the Gen AI evaluation service, covering the following topics:

  • Evaluation dataset schema: Learn about the required data fields for different evaluation use cases, such as pointwise, pairwise, and computation-based metrics.
  • Import your evaluation dataset: Discover the supported formats for bringing your data into the evaluation service.
  • Evaluation dataset examples: View sample datasets in Pandas DataFrame format for various evaluation scenarios.
  • Best practices: Get recommendations for creating a high-quality and effective evaluation dataset.

For the Gen AI evaluation service, the evaluation dataset typically consists of the model response that you want to evaluate, the input data used to generate that response, and, optionally, the ground truth response (the reference).

Evaluation dataset schema

For typical model-based metrics use cases, your dataset needs to provide the following information:

Input type | Input field contents
---------- | --------------------
prompt | User input for the Gen AI model or application. This field is optional in some cases.
response | Your LLM inference response to be evaluated.
baseline_model_response (required by pairwise metrics) | The baseline LLM inference response for pairwise evaluation comparisons.

If you use the Gen AI Evaluation module of the Vertex AI SDK for Python, the Gen AI evaluation service can automatically generate the response and baseline_model_response with the model that you specify.
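
For example, the following minimal sketch assumes the vertexai.evaluation module (EvalTask, MetricPromptTemplateExamples) and a Gemini model ID; the project, location, and model name are illustrative, so verify the import paths and model ID against your SDK version:

import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

# Illustrative project and location values; replace with your own.
vertexai.init(project="your-project-id", location="us-central1")

# A prompt-only dataset: the evaluation service calls the model passed to
# evaluate() to fill in the response column before scoring.
prompt_only_dataset = pd.DataFrame({
    "prompt": [
        "Summarize the text in one sentence: The city announced a new transit plan.",
        "Summarize the text in one sentence: Archaeologists found ancient artifacts.",
    ],
})

eval_task = EvalTask(
    dataset=prompt_only_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
)
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))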

For other evaluation use cases, you might need to provide more information:

Multi-turn or chat

Input type | Input field contents
---------- | --------------------
history | The history of the conversation between the user and the model before the current turn.
prompt | User input for the Gen AI model or application in the current turn.
response | Your LLM inference response to be evaluated, which is based on the history and the current-turn prompt.
baseline_model_response (required by pairwise metrics) | The baseline LLM inference response for pairwise evaluation, based on the history and the current-turn prompt.
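
For instance, a single-row chat dataset might look like the following sketch. The transcript-style history string is only an illustration of the idea, not a required format:

# Hypothetical multi-turn example: each row carries the prior conversation
# as history, the current prompt, and the response to be evaluated.
chat_eval_dataset = pd.DataFrame({
    "history": [
        "user: What is the capital of France?\nmodel: The capital of France is Paris.",
    ],
    "prompt": ["And what is its population?"],
    "response": ["Paris has a population of roughly 2.1 million people."],
})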

Computation-based metrics

Your dataset needs to provide a response from the large language model and a reference to compare it to.

Input type | Input field contents
---------- | --------------------
response | Your LLM inference response to be evaluated.
reference | The ground truth to compare your LLM response to.

Translation metrics

Your dataset needs to provide a response from the model. Depending on your use case, you also need to provide a reference to compare to, an input in the source language, or a combination of both.

Input type | Input field contents
---------- | --------------------
source | Source text in the original language that the response was translated from.
response | Your LLM inference response to be evaluated.
reference | The ground truth to compare your LLM response to, in the same language as the response.

Depending on your use case, you can also break down the input user prompt into granular pieces, such as instruction and context, and assemble them for inference by providing a prompt template. You can also provide the reference or ground truth information if needed:

Input type | Input field contents
---------- | --------------------
instruction | The part of the user prompt that contains the inference instruction for your LLM. For example, "Please summarize the following text" is an instruction.
context | User input for the Gen AI model or application, such as the text that the instruction applies to.
reference | The ground truth to compare your LLM response to.
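
As a sketch, reusing the imports from the earlier example and assuming that evaluate() accepts a prompt_template argument whose placeholders match your dataset column names (verify the parameter name for your SDK version):

# Dataset split into instruction and context, plus a reference for
# reference-based metrics.
eval_dataset = pd.DataFrame({
    "instruction": ["Summarize the following text in one sentence."],
    "context": [
        "A major city announced a large-scale overhaul of its public"
        " transportation system to reduce congestion and emissions."
    ],
    "reference": ["The city plans a major public transit overhaul."],
})

eval_task = EvalTask(dataset=eval_dataset, metrics=["rouge_l_sum"])

# The prompt template assembles the columns into the final model prompt.
result = eval_task.evaluate(
    model=GenerativeModel("gemini-1.5-pro"),
    prompt_template="{instruction}\n{context}",
)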

The required inputs for the evaluation dataset must be consistent with your metrics. To learn more about customizing your metrics and running an evaluation, see Define your evaluation metrics and Run evaluation. For details about including reference data in your model-based metrics, see Adapt a metric prompt template to your input data.

Import your evaluation dataset

You can import your dataset in several formats. The following table provides a comparison to help you choose the best option for your use case.

Format | Description | Use case
------ | ----------- | --------
JSONL or CSV file in Cloud Storage | Line-delimited JSON or comma-separated values files stored in a Cloud Storage bucket. | Best for large, static datasets that are already stored in, or can be easily exported to, Cloud Storage.
BigQuery table | A table within Google's serverless data warehouse. | Ideal for very large datasets that already reside in BigQuery, allowing for powerful SQL-based preprocessing and selection.
Pandas DataFrame | An in-memory data structure from the Python pandas library. | Convenient for interactive development in notebooks, smaller datasets, and data that is generated or manipulated programmatically in a Python environment.
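
For example, assuming the EvalTask dataset parameter accepts Cloud Storage and BigQuery URIs in addition to DataFrames (the bucket, file, and table names below are placeholders):

# JSONL or CSV file in Cloud Storage.
gcs_eval_task = EvalTask(
    dataset="gs://your-bucket/eval_dataset.jsonl",
    metrics=["exact_match"],
)

# BigQuery table.
bq_eval_task = EvalTask(
    dataset="bq://your-project.your_dataset.eval_table",
    metrics=["exact_match"],
)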

Evaluation dataset examples

This section shows dataset examples using the Pandas DataFrame format. The following examples are brief, but evaluation datasets usually have 100 or more data points. For recommendations on preparing a dataset, see the Best practices section.

Pointwise model-based metrics

The following summarization example shows a sample dataset for pointwise model-based metrics:

import pandas as pd

prompts = [
    # Example 1
    (
        "Summarize the text in one sentence: As part of a comprehensive"
        " initiative to tackle urban congestion and foster sustainable urban"
        " living, a major city has revealed ambitious plans for an extensive"
        " overhaul of its public transportation system. The project aims not"
        " only to improve the efficiency and reliability of public transit but"
        " also to reduce the city's carbon footprint and promote eco-friendly"
        " commuting options. City officials anticipate that this strategic"
        " investment will enhance accessibility for residents and visitors"
        " alike, ushering in a new era of efficient, environmentally conscious"
        " urban transportation."
    ),
    # Example 2
    (
        "Summarize the text such that a five-year-old can understand: A team of"
        " archaeologists has unearthed ancient artifacts shedding light on a"
        " previously unknown civilization. The findings challenge existing"
        " historical narratives and provide valuable insights into human"
        " history."
    ),
]

responses = [
    # Example 1
    (
        "A major city is revamping its public transportation system to fight"
        " congestion, reduce emissions, and make getting around greener and"
        " easier."
    ),
    # Example 2
    (
        "Some people who dig for old things found some very special tools and"
        " objects that tell us about people who lived a long, long time ago!"
        " What they found is like a new puzzle piece that helps us understand"
        " how people used to live."
    ),
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
})
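
To run a pointwise evaluation over this dataset, a minimal sketch (reusing the imports from the earlier example and a prebuilt summarization quality metric) could look like this:

# The response column is already populated, so no model argument is needed.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
)
result = eval_task.evaluate()
print(result.summary_metrics)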

Pairwise model-based metrics

The following open-book question-answering example shows a sample dataset for pairwise model-based metrics:

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1
    "Steel is the hardest material.",
    # Example 2
    "John Smith.",
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
    "baseline_model_response": baseline_model_responses,
})
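
A pairwise run over this dataset might then look like the following sketch, assuming a prebuilt pairwise question-answering metric:

# Compares the response column against baseline_model_response row by row.
pairwise_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY],
)
pairwise_result = pairwise_task.evaluate()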

Computation-based metrics

For computation-based metrics, a reference is often required:

eval_dataset = pd.DataFrame({
  "response": ["The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."],
  "reference": ["The Roman Senate was filled with exuberance due to successes against Catiline."],
})
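
You could then score this pair with computation-based metrics; the string metric names below are assumptions based on the SDK's identifiers, so confirm them against your SDK version:

# Reference-based string metrics computed locally, without a judge model.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_l_sum"],
)
result = eval_task.evaluate()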

Tool-use (function calling) metrics

The following example shows input data for computation-based tool-use metrics:

json_responses = ["""{
    "content": "",
    "tool_calls":[{
      "name":"get_movie_info",
      "arguments": {"movie":"Mission Impossible", "time": "today 7:30PM"}
    }]
  }"""]

json_references = ["""{
    "content": "",
    "tool_calls":[{
      "name":"book_tickets",
      "arguments":{"movie":"Mission Impossible", "time": "today 7:30PM"}
      }]
  }"""]

eval_dataset = pd.DataFrame({
    "response": json_responses,
    "reference": json_references,
})
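
A run over this dataset might use the tool-use metric names below (shown as assumptions; check the supported metric list for your SDK version):

# Checks call validity, tool name match, and parameter key/value match
# between the response and reference tool calls.
tool_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["tool_call_valid", "tool_name_match", "tool_parameter_kv_match"],
)
tool_result = tool_eval_task.evaluate()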

Translation use cases

The following example shows input data for translation metrics:

source = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

response = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

reference = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]

eval_dataset = pd.DataFrame({
    "source": source,
    "response": response,
    "reference": reference,
})
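
An evaluation over this dataset could then request translation metrics; the metric names below (comet, metricx, bleu) are assumptions to verify against the service's supported metric list:

# COMET and MetricX use source/response/reference; BLEU uses response/reference.
translation_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["comet", "metricx", "bleu"],
)
translation_result = translation_eval_task.evaluate()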

Best practices

Follow these best practices when you define your evaluation dataset:

  • Provide representative examples: Your dataset should include examples that represent the types of inputs that your model processes in production.
  • Use a sufficient dataset size: Your dataset must include a minimum of one example. Google recommends using around 100 examples to get high-quality aggregated metrics and statistically significant results. A dataset of this size helps to:
    • Establish a higher confidence level in the aggregated evaluation results.
    • Minimize the influence of outliers.
    • Ensure that performance metrics reflect the model's true capabilities across diverse scenarios.
  • Recognize diminishing returns: The rate of aggregated metric quality improvements tends to decrease when more than 400 examples are provided.

What's next