Generative AI evaluation service overview

The generative AI evaluation service lets you evaluate model performance across specific use cases.

Leaderboards can provide a good overview of a model's general performance, but they can't tell you how well the model performs in your specific use case.

When developing AI-based solutions, it is essential to evaluate generative AI models on your own data with fixed criteria that reflect your business needs. Evaluations ensure that models can adapt to the domain-specific requirements of different users. By assessing models against user-defined benchmarks and objectives, you can apply prompt engineering and model tuning to better align with the operational context, cultural sensitivities, and strategic goals of the businesses you serve. These assessments guide the development and improvement of models while ensuring that the models are useful, safe, and effective for your users.

In this section, we offer use cases for generative AI evaluation throughout the development lifecycle. We also present key paradigms, an online evaluation experience (see Try the online evaluation quickstart), and corresponding pipeline components that let you move your evaluation logic to production. You can switch between interactive evaluation and production-ready pipelines, and evaluation results are autologged in Vertex AI Experiments. We offer prebuilt components to evaluate tasks, such as summarization or question answering, and to evaluate individual metrics, such as coherence or summarization verbosity. You can also define your own metrics and provide your own criteria for the autorater.

Example use cases

Generative AI evaluation is a comprehensive process that benchmarks model performance and guides the strategic development and refinement of generative AI models and applications, helping to ensure that generative AI models are aligned with your business needs. You can use generative AI evaluation during different phases of your development:

  • Pre-production: You can make better decisions with respect to model selection and customization preferences, such as tuning.
  • During production: You can monitor your model's performance to help ensure that the models are effective.

Generative AI evaluation can be applied to a range of use case scenarios, including the following:

  • Select pretrained models: You can choose a pretrained model for a specific task or application by evaluating the model's performance on the associated benchmark tasks.
  • Configure model generation settings: You can optimize model-generation parameters, such as the temperature, to improve task performance.
  • Prompt engineering using a template: You can design more effective prompts that lead to higher-quality output, which enhances user interaction with the model.
  • Improve and safeguard fine-tuning: You can verify that fine-tuning improves model performance while avoiding biases or undesirable behaviors.

For more examples, see Evaluation examples.

Evaluation services

Vertex AI offers two solutions for evaluating generative AI models: an online evaluation service (rapid evaluation) and evaluation pipeline templates built on Vertex AI Pipelines (AutoSxS and computation-based). Choose the service that best fits your use case:

  • Online evaluation (rapid evaluation): A few instances to evaluate; workflows that require quick iterations.
  • Pipeline evaluation (AutoSxS and computation-based): Many instances to evaluate; asynchronous workflows and MLOps.

Rapid evaluation

The rapid evaluation service provides low-latency, synchronous evaluations on small batches of data. You can perform evaluations on demand and integrate the online service with other Vertex AI services by using the Vertex AI SDK for Python. Using the SDK makes the online service adaptable to a variety of use cases.

The online service is most suitable for use cases involving small batches of data or when you must iterate and experiment quickly.
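For example, a minimal evaluation through the Vertex AI SDK for Python might look like the following sketch. The module path, metric names, dataset columns, and project and experiment IDs are illustrative assumptions; check the SDK reference for the exact API in your SDK version.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask  # module path may differ by SDK version

vertexai.init(project="my-project", location="us-central1")  # placeholder project and region

# Bring-your-own-response evaluation: pair generated responses with
# ground-truth references in a small DataFrame.
eval_dataset = pd.DataFrame(
    {
        "response": ["A fox jumps over a dog.", "Rain is forecast all week."],
        "reference": ["A fox jumps over a dog.", "A week of rain is expected."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],  # computation-based metrics
    experiment="rapid-eval-demo",            # results are autologged to Vertex AI Experiments
)

# Synchronous, on-demand evaluation of a small batch.
result = eval_task.evaluate()
print(result.summary_metrics)
```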

Pipeline services: AutoSxS and computation-based

Evaluation pipeline services provide end-to-end solutions for evaluating generative AI models. These solutions use Vertex AI Pipelines to orchestrate a series of evaluation-related steps, such as generating model responses, calling the online evaluation service, and computing metrics. These steps can also be called individually in custom pipelines.

Because Vertex AI Pipelines is serverless, there is a higher startup latency associated with using pipelines for evaluation. Therefore, this service is better suited for larger evaluation jobs, workflows where evaluations aren't immediately needed, and integration into MLOps pipelines.
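As a rough sketch, an evaluation pipeline run is submitted like any other Vertex AI pipeline job through the Vertex AI SDK for Python. The template path and parameter names below are illustrative assumptions rather than the exact AutoSxS interface; see the AutoSxS documentation for the current template URI and parameter schema.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project and region

# Submit an evaluation pipeline run. The template path and parameter names
# shown here are illustrative placeholders; consult the AutoSxS documentation
# for the actual template URI and parameter schema.
job = aiplatform.PipelineJob(
    display_name="autosxs-eval",
    template_path="https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default",
    parameter_values={
        "evaluation_dataset": "gs://my-bucket/eval_dataset.jsonl",
        "task": "summarization",
        "id_columns": ["prompt"],
        "response_column_a": "model_a_response",
        "response_column_b": "model_b_response",
    },
)

# Pipelines run asynchronously, which suits large or scheduled evaluation jobs.
job.submit()
```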

We offer two separate evaluation pipelines, as follows:

  • AutoSxS: A pairwise, model-based evaluation pipeline that uses an autorater to compare the responses of two models.
  • Computation-based: A pointwise evaluation pipeline that compares model responses against ground truth by using computation-based metrics.

Evaluation paradigms

The generative AI evaluation service supports two paradigms for evaluating models:

  • Pointwise: You can evaluate one model.
  • Pairwise: You can compare two models against each other.

Pointwise

Pointwise evaluation assesses the performance of a single model. It helps you to understand how well the model performs on a specific task, such as summarization, or on a specific dimension, such as instruction following. The evaluation process includes the following steps:

  1. The model produces predicted results based on the input prompt.
  2. Evaluation is performed based on the generated results.

Depending on the evaluation method, input and output pairs and ground truth might be required. When ground truth is available, the model's outputs are evaluated based on how well they align with the expected outcomes. For more information, see Perform computation-based evaluation. Without ground truth, the evaluation relies on the model's responses to input prompts, and a separate autorater model produces metrics customized to the nature of the task, such as coherence and relevance for text generation or accuracy for summarization. For more information, see Perform pairwise model-based evaluation.
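For example, a pointwise, model-based evaluation without ground truth might look like the following sketch, which reuses the hypothetical EvalTask setup from the rapid evaluation example; the metric names and dataset columns are assumptions.

```python
import pandas as pd
from vertexai.preview.evaluation import EvalTask  # module path may differ by SDK version

# No "reference" column: the autorater scores each response on its own merits.
pointwise_dataset = pd.DataFrame(
    {
        "prompt": ["Summarize the attached meeting notes in two sentences."],
        "response": ["The team agreed to ship v2 next week and to defer the UI refresh."],
    }
)

pointwise_task = EvalTask(
    dataset=pointwise_dataset,
    metrics=["coherence", "fluency"],  # model-based metrics produced by the autorater
    experiment="pointwise-eval-demo",
)
print(pointwise_task.evaluate().summary_metrics)
```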

This paradigm allows for an understanding of a model's capabilities in generating content, providing insights into the model's strengths and areas for improvement in a standalone context, without requiring a direct comparison with another model.

Pairwise

Pairwise evaluation compares the predictions from two models: a model A to be evaluated and a model B, the baseline or reference model. You provide input prompts that represent the input domain used to compare the models. Given the same input prompt, the side-by-side comparison determines which model's prediction is preferred based on your comparison criteria. The final evaluation results are captured by the win rate. This paradigm can also operate without ground truth data.
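To make the win rate concrete, the following is a minimal, illustrative sketch (not the AutoSxS implementation) that aggregates per-example preferences into win rates.

```python
from collections import Counter

def win_rate(preferences: list[str]) -> dict[str, float]:
    """Compute win rates from per-example pairwise judgments.

    Each judgment is "A", "B", or "TIE", for example as produced by an
    autorater comparing model A against baseline model B on the same prompts.
    """
    counts = Counter(preferences)
    total = len(preferences)
    return {
        "model_a_win_rate": counts["A"] / total,
        "model_b_win_rate": counts["B"] / total,
        "tie_rate": counts["TIE"] / total,
    }

# Example: model A is preferred on 3 of 5 prompts.
print(win_rate(["A", "A", "B", "TIE", "A"]))
# {'model_a_win_rate': 0.6, 'model_b_win_rate': 0.2, 'tie_rate': 0.2}
```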

Evaluation methods

Based on the evaluation method, metrics fall into two categories: computation-based metrics and model-based metrics.

Computation-based metrics

Computation-based metrics measure how consistent the LLM-generated results are with a ground-truth dataset of input and output pairs. The commonly used metrics can be categorized into the following groups, illustrated by a simplified sketch after the list:

  • Lexicon-based metrics: Use math to calculate the string similarities between LLM-generated results and ground truth, such as Exact Match and ROUGE.
  • Count-based metrics: Aggregate the number of rows that hit or miss certain ground-truth labels, such as F1-score, Accuracy, and Tool Name Match.
  • Embedding-based metrics: Calculate the distance between the LLM-generated results and ground truth in the embedding space, reflecting their level of similarity.
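The following simplified sketch illustrates the first two groups. These are toy implementations for intuition, not the metrics computed by the service.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Lexicon-based: 1.0 if the strings match exactly after light normalization."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Count-based: F1 over the set of tokens shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("A fox jumps over a dog.", "A fox jumps over a dog."))  # 1.0
print(token_f1("a fox jumps", "a quick fox jumps over a dog"))            # 0.6
```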

In the generative AI evaluation service, you can use computation-based metrics through the evaluation pipelines and the rapid evaluation Python SDK. Computation-based evaluation can be performed only in pointwise use cases. However, you can directly compare the metric scores of two models to make a pairwise comparison.

Model-based metrics

An autorater model is used to generate model-based evaluation metrics. The autorater performs complex and nuanced evaluations, which were traditionally performed by expensive human evaluators. Autoraters try to augment human evaluation, and we calibrate the autorater's quality offline with human raters. Much like human evaluators, the autorater determines the quality of responses through a numeric score output and gives the reasoning behind its judgments along with a confidence level. For more information, see View evaluation results.

Model-based evaluation is available on demand and evaluates language models with comparable performance to that of human raters. Some additional benefits of model-based evaluation include the following:

  • Evaluates natural language models without human preference data.
  • Achieves better scale, increases availability, and reduces cost compared to evaluating language models with human raters.
  • Achieves rating transparency by capturing preference explanations and confidence scores.
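For illustration only, you can think of a single autorater judgment as a record that carries the score, explanation, and confidence described above. The type below is a hypothetical sketch, not an actual API object.

```python
from dataclasses import dataclass

@dataclass
class AutoraterJudgment:
    """Illustrative shape of a single autorater judgment; not an actual API type."""
    score: float       # numeric quality score assigned to the response
    explanation: str   # the autorater's reasoning behind the score
    confidence: float  # the autorater's confidence in its own judgment (0.0 to 1.0)

judgment = AutoraterJudgment(
    score=4.0,
    explanation="The summary covers the key points and stays faithful to the source.",
    confidence=0.9,
)
print(judgment)
```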

Model-based evaluation can be performed in both pointwise and pairwise use cases. The following diagram illustrates how the autorater performs pairwise evaluation in the evaluation pipeline service, AutoSxS.

Overview of how AutoSxS works

What's next