Evaluation notebooks

We provide several examples of how you can use the rapid evaluation SDK to evaluate your generative AI models.

Evaluate your models in real time

The Vertex AI rapid evaluation service lets you evaluate your generative AI models in real time. To learn how to use rapid evaluation, see Run a rapid evaluation.
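
The following is a minimal sketch of a real-time evaluation run, assuming the preview EvalTask API in vertexai.preview.evaluation from the Vertex AI SDK for Python. The project ID, dataset contents, metric names, and experiment name are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

# Placeholder project and location.
vertexai.init(project="your-project-id", location="us-central1")

# A small evaluation dataset; column names must match the prompt template placeholders.
eval_dataset = pd.DataFrame({
    "instruction": ["Summarize the following text."] * 2,
    "context": [
        "The quarterly report shows revenue grew 12% year over year.",
        "The new policy takes effect on the first day of next month.",
    ],
})

# Define the evaluation task once; it can be reused across models and prompts.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency", "safety"],
    experiment="rapid-eval-demo",  # optional Vertex AI Experiments name
)

# Run the evaluation in real time against a Gemini model.
result = eval_task.evaluate(
    model=GenerativeModel("gemini-pro"),
    prompt_template="{instruction}\n{context}",
)

print(result.summary_metrics)  # aggregate scores
print(result.metrics_table)    # per-instance scores
```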

For an end-to-end example, see the Colab notebook for rapid evaluation with the Vertex AI SDK for Python.

Evaluate and optimize prompt template design

Use the rapid evaluation SDK to evaluate the effect of prompt engineering. Examine the statistics corresponding to each prompt template to understand how differences in prompting impact evaluation results.
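
A sketch of comparing prompt templates is shown below. It assumes the preview EvalTask API; the templates, metric names, and the experiment_run_name argument for naming runs are illustrative.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame({
    "context": [
        "The quarterly report shows revenue grew 12% year over year.",
        "The new policy takes effect on the first day of next month.",
    ],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency"],
    experiment="prompt-template-eval",
)

# Candidate templates that differ only in their instructions.
prompt_templates = {
    "terse": "Summarize: {context}",
    "guided": "Summarize the text below in no more than two sentences.\n{context}",
}

model = GenerativeModel("gemini-pro")
for name, template in prompt_templates.items():
    result = eval_task.evaluate(
        model=model,
        prompt_template=template,
        experiment_run_name=f"prompt-{name}",  # assumed parameter for naming runs
    )
    print(name, result.summary_metrics)
```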

For an end-to-end example, see the notebook Evaluate and Optimize Prompt Template Design for Better Results.

Evaluate and select LLM models using benchmark metrics

Use the rapid evaluation SDK to score the Gemini Pro and Text Bison models on a benchmark dataset for a given task.
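
The sketch below compares two candidate models on the same benchmark. It assumes the preview EvalTask API and that a plain callable mapping a prompt to generated text can be passed as the model; the dataset, model versions, and metric names are placeholders.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextGenerationModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# A tiny benchmark: instructions plus reference answers for computation-based metrics.
eval_dataset = pd.DataFrame({
    "instruction": ["What is the capital of France?", "What is 2 + 2?"],
    "reference": ["Paris", "4"],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],
    experiment="model-selection",
)

gemini = GenerativeModel("gemini-pro")
text_bison = TextGenerationModel.from_pretrained("text-bison@002")

candidates = {
    "gemini-pro": gemini,
    # Assumption: the preview SDK accepts a callable that takes a prompt and returns text.
    "text-bison": lambda prompt: text_bison.predict(prompt).text,
}

for name, model in candidates.items():
    result = eval_task.evaluate(model=model, prompt_template="{instruction}")
    print(name, result.summary_metrics)
```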

For an end-to-end example, see the notebook Score and Select LLM Models.

Evaluate and select model-generation settings

Use the rapid evaluation SDK to adjust the temperature of Gemini Pro on a summarization task and to evaluate quality, fluency, safety, and verbosity.
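
The following sketch sweeps the temperature setting on a summarization task, assuming the preview EvalTask API and the GenerationConfig class from the Vertex AI SDK for Python. The temperature values and metric names are illustrative.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame({
    "context": [
        "The quarterly report shows revenue grew 12% year over year, driven by new subscriptions.",
    ],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality", "fluency", "safety", "summarization_verbosity"],
    experiment="generation-settings",
)

# Sweep temperature; all other generation settings stay fixed.
for temperature in (0.0, 0.4, 0.8):
    model = GenerativeModel(
        "gemini-pro",
        generation_config=GenerationConfig(temperature=temperature),
    )
    result = eval_task.evaluate(
        model=model,
        prompt_template="Summarize the following text.\n{context}",
    )
    print(f"temperature={temperature}", result.summary_metrics)
```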

For an end-to-end example, see the notebook Evaluate and Select Model Generation Settings.

Define your metrics

Use the rapid evaluation SDK to evaluate multiple prompt templates with custom metrics that you define.
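
The sketch below defines one custom metric and mixes it with a built-in metric, assuming the CustomMetric class in vertexai.preview.evaluation. The exact metric-function signature (an instance dict in, a dict keyed by the metric name out) is an assumption; the metric itself is only illustrative.

```python
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import CustomMetric, EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Assumed signature: the metric function receives one evaluation instance as a dict
# and returns a dict keyed by the metric name.
def word_count(instance: dict) -> dict:
    return {"word_count": len(instance["response"].split())}

word_count_metric = CustomMetric(name="word_count", metric_function=word_count)

eval_dataset = pd.DataFrame({
    "context": ["The new policy takes effect on the first day of next month."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    # Custom metrics can be listed alongside built-in metric names.
    metrics=[word_count_metric, "fluency"],
    experiment="custom-metrics",
)

result = eval_task.evaluate(
    model=GenerativeModel("gemini-pro"),
    prompt_template="Summarize: {context}",
)
print(result.summary_metrics)
```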

For an end-to-end example, see the notebook Define Your Own Metrics.

Evaluate tool use

Use the rapid evaluation SDK to define an API function and a tool for the Gemini model. You can also use the SDK to evaluate tool use and function-calling quality for Gemini.
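
The sketch below scores a recorded tool call against a reference tool call using the tool-use metrics. The JSON record format (a content field plus a tool_calls list of name and arguments) and the get_weather function are assumptions for illustration; because the responses are already in the dataset, no model is passed to the evaluation.

```python
import json

import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Assumed record format: each response/reference is a JSON string describing the
# tool call the model made (or should have made).
reference_call = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
})
model_call = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "paris"}}],
})

eval_dataset = pd.DataFrame({
    "response": [model_call],
    "reference": [reference_call],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "tool_call_valid",
        "tool_name_match",
        "tool_parameter_key_match",
        "tool_parameter_kv_match",
    ],
    experiment="tool-use-eval",
)

# With a "response" column already present, no model argument is needed.
result = eval_task.evaluate()
print(result.summary_metrics)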

For an end-to-end example, see the notebook Evaluate Generative Model Tool Use and Function Calling.

Evaluate generated answers from RAG for question answering

Use the rapid evaluation SDK to evaluate answers generated by retrieval-augmented generation (RAG) for a question-answering task.
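
A minimal sketch of evaluating RAG-generated answers is shown below, assuming the preview EvalTask API. The column names, metric names, and sample data are placeholders; because the answers come from your own RAG pipeline, no model is passed to the evaluation.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Answers produced by your RAG pipeline, together with the retrieved context.
eval_dataset = pd.DataFrame({
    "instruction": ["When does the new policy take effect?"],
    "context": ["Policy memo: the new policy takes effect on the first day of next month."],
    "response": ["The new policy takes effect on the first day of next month."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["question_answering_quality", "groundedness"],
    experiment="rag-answer-eval",
)

# The responses are already in the dataset, so no model is passed.
result = eval_task.evaluate()
print(result.summary_metrics)
print(result.metrics_table)
```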

For an end-to-end example, see the notebook Evaluate Generated Answers from RAG for Question Answering.

What's next