As of April 29, 2025, the Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have not used them before, including new projects. For more information, see Model versions and lifecycle.

Metric prompt templates for model-based evaluation

This page lists templates that you can use for model-based evaluation with the Gen AI evaluation service. For more information about model-based metrics, see Define your own metrics.

Overview

In model-based evaluation, a prompt is sent to the judge model so that it generates the metric score based on the criteria, rating rubrics, and other instructions you specify.

The following table summarizes the available example metric prompt templates:

	Text use case	Multi-turn conversation use case	Other key use cases
Pointwise	Fluency, Coherence, Groundedness, Safety, Instruction following, Verbosity, Text quality	Multi-turn chat quality, Multi-turn chat safety	Summarization quality, Question answering quality
Pairwise	Fluency, Coherence, Groundedness, Safety, Instruction following, Verbosity, Text quality	Multi-turn chat quality, Multi-turn chat safety	Summarization quality, Question answering quality

Structure a metric prompt template

A metric prompt template must include the following main sections:

Instruction
Evaluation
User inputs and AI-generated response

Each section can contain subsections.

Instruction

Component	Function	Type	Example
Instruction	Provides a persona for the judge model and a brief description of its task.	Default	`You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user input and AI-generated responses. You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below. You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.`

Evaluation

Component	Function	Type	Example
Metric definition	Specifies the name and definition of the metric.	Optional user input	`You will be assessing a metric called SummarizationQuality, which measures the overall ability to summarize text`
Criteria	Defines the criteria (and, optionally, the subcriteria) for the metric.	Required user input	`Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements. Groundedness: The response contains information included only in the context. The response does not reference any outside information.`
Rating rubric	Specifies the rating scale for the metric, with an explanation of what each score means.	Required user input	`5: (Very good). The summary follows instructions, is grounded, is concise, and fluent. 4: (Good). The summary follows instructions, is grounded, concise, and fluent. 3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent. 2: (Bad). The summary is grounded, but does not follow the instructions. 1: (Very bad). The summary is not grounded.`
Few-shot examples	Examples of the task.	Optional user input. Note: Few-shot examples can improve not only the judge model's performance but also the format of its response. We recommend that you start with 5 to 10 few-shot examples.	`RESPONSE: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs. EXPLANATION: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence. SCORE: 1`
Evaluation steps	Step-by-step instructions for carrying out the task.	Optional user input. Note: You can specify the ranking of the criteria in the evaluation steps.	`STEP 1: Assess the response in aspects of instruction following, groundedness, helpfulness, and verbosity according to the criteria. STEP 2: Score based on the rubrics.`
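
The components in the tables above can be put together programmatically. The following is a minimal sketch, not the service's implementation: `build_metric_prompt` and its parameters are hypothetical names introduced here for illustration, and the section headings mirror the example templates on this page.

```python
def build_metric_prompt(instruction, criteria, rating_rubric,
                        metric_definition=None, evaluation_steps=None,
                        input_variables=("prompt",)):
    """Assemble a metric prompt template string from its components.

    Hypothetical helper for illustration only; not part of any SDK.
    """
    lines = ["# Instruction", instruction, "", "# Evaluation"]
    if metric_definition:  # optional component
        lines += ["## Metric Definition", metric_definition, ""]
    # Criteria and the rating rubric are the required components.
    lines.append("## Criteria")
    lines += [f"{name}: {text}" for name, text in criteria.items()]
    lines += ["", "## Rating Rubric"]
    for score in sorted(rating_rubric, reverse=True):  # highest score first
        lines.append(f"{score}: {rating_rubric[score]}")
    if evaluation_steps:  # optional component
        lines += ["", "## Evaluation Steps"]
        lines += [f"STEP {i}: {step}" for i, step in enumerate(evaluation_steps, 1)]
    # Input variables stay as {placeholders}; they are filled per dataset row.
    lines += ["", "# User Inputs and AI-generated Response", "## User Inputs"]
    for name in input_variables:
        lines += [f"### {name.capitalize()}", "{" + name + "}", ""]
    lines += ["## AI-generated Response", "{response}"]
    return "\n".join(lines)


template = build_metric_prompt(
    instruction="You are an expert evaluator. Your task is to evaluate the "
                "quality of the responses generated by AI models.",
    metric_definition="You will be assessing summarization quality, which "
                      "measures the overall ability to summarize text.",
    criteria={
        "Groundedness": "The response contains information included only in the context.",
    },
    rating_rubric={
        2: "(Bad). The summary is grounded, but does not follow the instructions.",
        1: "(Very bad). The summary is not grounded.",
    },
    evaluation_steps=["Assess the response according to the criteria.",
                      "Score based on the rubric."],
)
print(template)
```

Keeping the components as separate arguments makes it easy to swap in a different rubric or add a criterion without retyping the whole template.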

User inputs

Component	Function	Type	Example
Input variables	The inputs that users must provide to complete the autorater prompt and obtain a response.	Required user input	`## User Inputs ### Prompt {prompt} ## AI-generated Response {response}`
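
To illustrate how input variables work, the sketch below fills the `{prompt}` and `{response}` placeholders from one dataset row with plain Python string formatting. This is an assumption about the mechanics for illustration only; the evaluation service may substitute variables differently, and the row contents are made up.

```python
# The user-input part of a metric prompt template, with two input variables.
USER_INPUTS_SECTION = """\
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}"""

# Hypothetical dataset row; keys must match the placeholder names.
row = {
    "prompt": "Summarize the following article in one sentence: ...",
    "response": "The article describes ...",
}

# Every placeholder is replaced by the value of the same-named key.
filled = USER_INPUTS_SECTION.format(**row)
print(filled)

# A missing required input surfaces as a KeyError naming the variable.
missing = None
try:
    USER_INPUTS_SECTION.format(prompt=row["prompt"])
except KeyError as err:
    missing = err.args[0]
print("missing input variable:", missing)
```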

In addition, if the column names in your data don't match the input variables and you don't want to rename the data columns, you can provide a mapping:

Component	Function	Type	Example
Metric column mapping	A mapping from the input variables in the prompt to the columns in the user's data.	Optional user input. Note: `prompt`, `response`, and `baseline_model_response` don't support mapping if `evaluate()` runs model inference.	`metric_column_mapping = {"reference":"ground_truth"}`
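
Conceptually, the mapping tells the evaluator which data column to read for each input variable. The sketch below shows that lookup in plain Python; `resolve_inputs` is a hypothetical helper, not the SDK's implementation, and the row values are invented.

```python
def resolve_inputs(row, variables, metric_column_mapping=None):
    """Return {input_variable: value}, reading mapped columns where given.

    Hypothetical helper: keys of the mapping are template input variables,
    values are column names in the user's data.
    """
    mapping = metric_column_mapping or {}
    # Fall back to the variable's own name when no mapping entry exists.
    return {var: row[mapping.get(var, var)] for var in variables}


# The data stores the reference summary under "ground_truth", while the
# template expects an input variable named "reference".
row = {
    "prompt": "Summarize: ...",
    "response": "A short summary.",
    "ground_truth": "The reference summary.",
}
metric_column_mapping = {"reference": "ground_truth"}

inputs = resolve_inputs(row, ["prompt", "response", "reference"],
                        metric_column_mapping)
print(inputs["reference"])
```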

Adapt a metric prompt template to your input data

To adapt a template to your specific data and evaluation criteria, follow these steps:

Identify missing criteria: determine which criteria the current template doesn't adequately address.
Add criteria: include the missing criteria in the prompt, clearly defining what you want the model to consider.
Adjust the user input fields: if your evaluation dataset has additional columns that you want to use for evaluation, add them to the user input fields and tell the judge model how to use each field.
Update the evaluation guidance: modify the evaluation guidance to reflect the new criteria and their relative importance.

For example, to evaluate a summarization model on how well the summary in the response aligns with a reference summary, you can add a new criterion called "Reference alignment" and add the reference data as part of User Inputs:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information and without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
Reference alignment: The response is consistent and aligned with the reference response.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, concise, fluent and aligned with reference summary.
4: (Good). The summary follows instructions, is grounded, concise, and fluent but not aligned with reference summary.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent and is not aligned with reference summary.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, fluency and reference alignment according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
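
After adding an input variable such as `{reference}`, it can help to confirm that the evaluation dataset provides a column for every placeholder in the template. The following sketch uses Python's standard `string.Formatter` to list the placeholders; `template_variables` and `missing_columns` are hypothetical helper names, and the check itself is a suggestion rather than something the service requires.

```python
from string import Formatter


def template_variables(template):
    """Return the set of {placeholder} names used in a template string."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, columns):
    """List the input variables that have no matching dataset column."""
    return sorted(template_variables(template) - set(columns))


USER_INPUTS = """\
# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}"""

# A dataset that has not yet been given a reference column.
print(missing_columns(USER_INPUTS, ["prompt", "response"]))
```

If the dataset stores the reference under a different column name, a column mapping is an alternative to renaming the column.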

Provide few-shot examples to improve quality

Few-shot examples can significantly improve the quality and consistency of evaluation responses, because they steer the model toward the output formats and styles you choose. We recommend that you start with 5 to 10 few-shot examples.

To incorporate few-shot examples, follow these steps:

Identify relevant examples: select examples that are similar to the type of input data you are going to evaluate.
Include the examples in the prompt: place the examples directly in the evaluation prompt, before the task or context.
Format the examples: make sure the examples follow the output format and style you chose.

For example, you can provide few-shot examples for the `coherence` metric and add the instruction to use the examples as follows:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps, as shown in the few-shot examples. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
...

## Criteria
...

## Rating Rubric
...

## Few-shot Examples
Response: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.
Explanation: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence.
Score: 1

Response: Learning a new language can be a rewarding experience for children, opening doors to different cultures and expanding their understanding of the world. There are many resources available to help children learn languages, from online courses and apps to language exchange programs and immersion schools.
Explanation: The response presents two related ideas: the benefits of learning a new language for children and the resources available to aid in that process. However, there is no clear transition or connection between these two distinct points. While both sentences are relevant to the topic of language acquisition in children, the relationship between them could be made more explicit.
Score: 3

Response: Although the internet has revolutionized communication and information sharing, it has also created echo chambers where individuals are only exposed to opinions and beliefs that align with their own. This polarization can lead to increased hostility and misunderstanding between different groups, making it difficult to find common ground on important issues. Consequently, fostering media literacy and critical thinking skills is essential for navigating the vast and often biased landscape of online information. By teaching individuals to evaluate sources, identify biases, and consider diverse perspectives, we can empower them to break free from echo chambers and engage in meaningful dialogue with those who hold differing views.
Explanation: The response exhibits a clear and logical flow of ideas. The transition words 'although' and 'consequently' effectively signal the relationship between the internet's advantages, its drawbacks (echo chambers), and the proposed solution (media literacy). The text maintains cohesion through consistent focus on the central theme of online polarization and its remedies.
Score: 5

## Evaluation Steps
...

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
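
When the few-shot examples live in a dataset rather than in hand-written text, the `## Few-shot Examples` section can be generated so that every example follows the same Response / Explanation / Score layout. This is a minimal sketch; `render_few_shot_examples` and the dictionary keys are hypothetical names chosen for illustration.

```python
def render_few_shot_examples(examples):
    """Render scored examples in the Response / Explanation / Score layout."""
    blocks = []
    for example in examples:
        blocks.append(
            f"Response: {example['response']}\n"
            f"Explanation: {example['explanation']}\n"
            f"Score: {example['score']}"
        )
    # One blank line between examples, matching the template above.
    return "## Few-shot Examples\n" + "\n\n".join(blocks)


examples = [
    {
        "response": "Purple monkeys jumped onto the submarine while Beethoven's "
                    "Fifth Symphony played loudly and the chef cooked spaghetti "
                    "with meatballs.",
        "explanation": "The provided response is a single sentence lacking any "
                       "discernible structure or connections between the ideas "
                       "presented.",
        "score": 1,
    },
    {
        "response": "Learning a new language can be a rewarding experience for "
                    "children, opening doors to different cultures and expanding "
                    "their understanding of the world.",
        "explanation": "While the content is relevant to the topic, the "
                       "relationship between the ideas could be made more explicit.",
        "score": 3,
    },
]

few_shot_section = render_few_shot_examples(examples)
print(few_shot_section)
```

Generating the section from data keeps the format identical across examples, which is the property the formatting step asks for.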