AI & Machine Learning

Evaluate your gen media models with multimodal evaluation on Vertex AI

May 13, 2025
https://storage.googleapis.com/gweb-cloudblog-publish/original_images/blog_KnxMxBa.gif
Irina Sigler

Product Manager, Cloud AI

Anant Nawalgaria

Sr. Staff ML Engineer, Google


The world of generative AI is moving fast, with models like Lyria, Imagen, and Veo now capable of producing stunningly realistic and imaginative images and videos from simple text prompts. Evaluating these models, however, remains a significant challenge. Traditional human evaluation, while the gold standard, is slow and costly, hindering rapid development cycles.

To address this, we're thrilled to introduce Gecko, now available through Google Cloud’s Vertex AI Evaluation Service. Gecko is a rubric-based, interpretable autorater for generative AI models that gives developers a more nuanced, customizable, and transparent way to assess the performance of image and video generation models.

The challenge of evaluating generative models with auto-raters

Creating useful, performant auto-raters gets harder as the quality of generation dramatically improves. While specialised scoring models can be efficient, they lack the interpretability developers need to understand model behavior and pinpoint areas for improvement. For instance, when evaluating how accurately a generated image depicts a prompt, a single score doesn't reveal why a model succeeded or failed.

Introducing Gecko: Interpretable, customizable, and performant evaluation

Gecko offers a fine-grained, interpretable, and customizable auto-rater. This Google DeepMind research paper shows that such an auto-rater can reliably evaluate image and video generation across a range of skills, reducing the dependency on costly human judgment. Notably, beyond its interpretability, Gecko exhibits strong performance and has already been instrumental in benchmarking the progress of leading models like Imagen.

Gecko makes evaluation interpretable with its clear, step-by-step, rubric-based approach. Let’s take an example and use Gecko to evaluate generated media of a steaming cup of coffee and a croissant on a table.

https://storage.googleapis.com/gweb-cloudblog-publish/images/image3_SCDlW95.max-1100x1100.png

Figure 1: Prompt and image pair we will use as our running example

Step 1: Semantic prompt decomposition

Gecko leverages a Gemini model to first break down the input text prompt into key semantic elements that need to be verified in the generated media. This includes identifying entities, their attributes, and the relationships between them.

For the running example, the prompt is broken down into keywords: Steaming, cup of coffee, croissant, table.
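While this step happens inside Gecko, the idea can be sketched with a direct Gemini call; the model choice and prompt wording below are illustrative assumptions, not Gecko's actual implementation:

```python
# Illustrative sketch only: Gecko performs this decomposition
# internally; the prompt wording and model choice are assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

decomposer = GenerativeModel("gemini-1.5-pro")  # any capable Gemini model

prompt = "a steaming cup of coffee and a croissant on a table"
response = decomposer.generate_content(
    "List the entities, attributes, and relationships that must be "
    f"verifiable in media generated from this prompt: {prompt}"
)
print(response.text)  # e.g., steaming; cup of coffee; croissant; table
```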

Step 2: Question generation

Based on the decomposed prompt, the Gemini model then generates a series of question-answer pairs. These questions are specifically designed to probe the generated image or video for the presence and accuracy of the identified elements and relationships. Optionally, Gemini can provide justifications for why a particular answer is correct, further enhancing transparency.

Let’s take a look at the running example and generate question-answer pairs for each keyword. For the keyword Steaming, the generated question is ‘Is the cup of coffee steaming?’ with answer choices [‘yes’, ‘no’] and the ground-truth answer ‘yes’.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Figure_2_actsEXP.max-900x900.jpg

Figure 2: Visualisation of the outputs from the semantic prompt decomposition and question-answer generation steps.
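Conceptually, the output of these two steps is a small set of records like the following; only the Steaming pair comes from the example above, and the rest are illustrative:

```python
# Rubric for the running example as question-answer records. Only the
# "steaming" pair comes from the example above; the rest are
# illustrative extrapolations for the remaining keywords.
qa_pairs = [
    {"keyword": "steaming",
     "question": "Is the cup of coffee steaming?",
     "choices": ["yes", "no"],
     "answer": "yes"},
    {"keyword": "cup of coffee",
     "question": "Is there a cup of coffee?",
     "choices": ["yes", "no"],
     "answer": "yes"},
    {"keyword": "croissant",
     "question": "Is there a croissant?",
     "choices": ["yes", "no"],
     "answer": "yes"},
    {"keyword": "table",
     "question": "Are the coffee and croissant on a table?",
     "choices": ["yes", "no"],
     "answer": "yes"},
]
```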

Step 3: Scoring

Finally, the Gemini model scores the generated media against each question-answer pair. These individual scores are then aggregated to produce a final evaluation score.

For the running example, all questions were answered correctly, giving a perfect final score.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Figure_3_xSjcX8h.max-1000x1000.jpg

Figure 3: Visualisation of the outputs from the scoring step, giving scores for each question which are aggregated to give a final overall score.
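A minimal sketch of this aggregation, assuming equal weighting of questions (Gecko's actual aggregation may differ):

```python
# Minimal sketch: a question scores 1.0 when the autorater's answer
# matches the ground truth, 0.0 otherwise; the final score is the
# average (equal weighting is an assumption).
def aggregate(per_question_correct: list[bool]) -> float:
    return sum(per_question_correct) / len(per_question_correct)

# All four questions were answered correctly in the running example.
print(aggregate([True, True, True, True]))  # 1.0, a perfect final score
```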

Evaluate with Gecko on Vertex AI

Gecko is now available via the Gen AI Evaluation Service in Vertex AI, empowering you to evaluate image and video generation models. Here's how you can get started with Gecko evaluation for images and videos on Vertex AI:

First, you'll need to set up configurations for both rubric generation and rubric validation.

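A minimal sketch of this setup, assuming the preview Gen AI Evaluation SDK surface; the RubricBasedMetric, RubricGenerationConfig, and PointwiseMetric names and their parameters are assumptions, and the prompt templates are shortened placeholders, so treat the Colabs linked at the end of this post as the source of truth:

```python
import vertexai
# NOTE: class names and parameters below are assumptions based on the
# preview Gen AI Evaluation SDK; see the linked Colabs for exact code.
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    RubricBasedMetric,
    RubricGenerationConfig,
)

vertexai.init(project="your-project-id", location="us-central1")

# Template instructing Gemini to decompose each prompt into keywords
# and emit question-answer pairs (Steps 1-2 above); shortened here.
RUBRIC_GENERATION_PROMPT = """Decompose the prompt into keywords and
write yes/no questions with ground-truth answers for each keyword."""

# Template instructing Gemini to answer each rubric question against
# the generated media (Step 3 above); shortened here.
RUBRIC_VALIDATION_PROMPT = """Answer each rubric question for the
provided media and report whether it matches the ground truth."""

rubric_generation_config = RubricGenerationConfig(
    prompt_template=RUBRIC_GENERATION_PROMPT,
)

rubric_based_gecko = RubricBasedMetric(
    generation_config=rubric_generation_config,
    critique_metric=PointwiseMetric(
        metric="rubric_validation",
        metric_prompt_template=RUBRIC_VALIDATION_PROMPT,
    ),
)
```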

Next, prepare your dataset for evaluation. This involves creating a Pandas DataFrame with columns for your prompts and the corresponding generated images or videos.

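For example, along these lines; the column names are assumptions and should match what the metric expects:

```python
import pandas as pd

# One row per evaluation case: the text prompt plus a URI (or bytes)
# for the media generated from it. Column names are assumptions.
eval_dataset = pd.DataFrame({
    "prompt": [
        "a steaming cup of coffee and a croissant on a table",
        "a grey cat sleeping on a red sofa",
    ],
    "image": [
        "gs://your-bucket/generated/coffee_croissant.png",
        "gs://your-bucket/generated/grey_cat.png",
    ],
})
```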

Now, you can generate the rubrics based on your prompts using the configured rubric_based_gecko metric.

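Continuing the sketch, and assuming the metric exposes a generate_rubrics helper (an assumption about the preview API):

```python
# Run Steps 1-2 (decomposition and question generation) over every
# prompt; the returned dataset carries the rubrics for each row.
# generate_rubrics is an assumed preview-API helper.
dataset_with_rubrics = rubric_based_gecko.generate_rubrics(eval_dataset)
```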

Finally, run the evaluation using the generated rubrics and your dataset. The evaluate method of EvalTask will use the rubric validator to score the generated content.

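Continuing the sketch:

```python
from vertexai.preview.evaluation import EvalTask

eval_task = EvalTask(
    dataset=dataset_with_rubrics,
    metrics=[rubric_based_gecko],
    experiment="gecko-image-eval",  # optional experiment tracking name
)

# Step 3: the rubric validator answers each question against the
# generated media and aggregates the per-question results.
eval_result = eval_task.evaluate()
```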

After the evaluation runs, you can compute and analyze the final scores to understand how well your generated content aligns with the detailed criteria derived from your prompts.

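Continuing the sketch, using the evaluation result's summary and row-level tables:

```python
# One aggregate value per metric across the whole dataset.
print(eval_result.summary_metrics)

# Row-level results as a pandas DataFrame: per-prompt rubric
# questions, per-question verdicts, and the aggregated final score.
metrics_df = eval_result.metrics_table
print(metrics_df.head())
```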

The Vertex AI Gen AI evaluation service provides summary and metrics tables, offering detailed insight into evaluation performance. For Gecko, the results also include the category or concept each question falls under, along with the score the generated image or video achieved against that category. For example, “is the cat grey?” is a question that falls under the question category “color”.

Access to these granular evaluation results enables you to create meaningful visualizations of model performance across the various criteria, including bar and radar charts like the one below:

https://storage.googleapis.com/gweb-cloudblog-publish/images/image4_1ADqbJd.max-1900x1900.png

Figure 4: Visualisation of the aggregate performance of the generated media across various categories
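As a sketch of how such a chart could be built from the row-level table (the column names below are assumptions):

```python
import matplotlib.pyplot as plt

# Mean score per question category; "question_category" and "score"
# are assumed column names in the row-level results table above.
category_scores = metrics_df.groupby("question_category")["score"].mean()

category_scores.plot(kind="bar", ylim=(0, 1),
                     title="Gecko score by question category")
plt.ylabel("mean score")
plt.tight_layout()
plt.show()
```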

With Gecko on Vertex AI, you gain access to a robust framework for assessing your models’ capabilities in finer detail. You can refer to the text-to-image and text-to-video evaluation Colabs to get first-hand experience today.
