Run an evaluation

You can use the Gen AI Evaluation module of the Vertex AI SDK for Python to programmatically evaluate your generative language models and applications with the Gen AI evaluation service API. This page shows you how to run evaluations with the Vertex AI SDK.

Before you begin

Install the Vertex AI SDK

To install the Gen AI Evaluation module from the Vertex AI SDK for Python, run the following command in a notebook cell. If you run it in a terminal instead, omit the leading exclamation mark:

!pip install -q google-cloud-aiplatform[evaluation]

For more information, see Install the Vertex AI SDK for Python.

Authenticate the Vertex AI SDK

After you install the Vertex AI SDK for Python, you need to authenticate. The steps depend on whether you're working locally or in Colaboratory:

  • If you're developing locally, set up Application Default Credentials (ADC) in your local environment:

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
      
    2. Create local authentication credentials for your Google Account:

      gcloud auth application-default login
      

      A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information, see Set up ADC for a local development environment.

  • If you're working in Colaboratory, run the following command in a Colab cell to authenticate:

    from google.colab import auth
    auth.authenticate_user()
    

    This command opens a window where you can complete the authentication.
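
If you're developing locally, it can help to confirm that ADC can find a credential file before you run an evaluation. The following is a minimal sketch under stated assumptions: the helper name find_adc_file is hypothetical, it checks only the GOOGLE_APPLICATION_CREDENTIALS environment variable and the well-known gcloud credential file location, and it doesn't validate the credentials or cover other ADC sources such as an attached service account.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ADC_FILENAME = "application_default_credentials.json"


def find_adc_file(env: Mapping[str, str] = os.environ,
                  home: Optional[Path] = None) -> Optional[Path]:
    """Return the first local ADC credential file that exists, or None.

    Hypothetical helper for illustration; it is not part of the Vertex AI SDK.
    """
    candidates = []

    # 1. A credential file set explicitly through the environment.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        candidates.append(Path(explicit))

    # 2. The file written by `gcloud auth application-default login`.
    if os.name == "nt":
        appdata = env.get("APPDATA")
        if appdata:
            candidates.append(Path(appdata) / "gcloud" / ADC_FILENAME)
    else:
        base = home if home is not None else Path.home()
        candidates.append(base / ".config" / "gcloud" / ADC_FILENAME)

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_adc_file()
    print(found if found else "No local ADC file found; run the gcloud login step.")
```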

Understanding service accounts

The Gen AI evaluation service uses a service account to get predictions from the Gemini API in Vertex AI for model-based evaluation metrics. This service account is automatically provisioned on the first request to the Gen AI evaluation service.

Name: Vertex AI Rapid Eval Service Agent
Description: The service account used to get predictions for model-based evaluation.
Email address: service-PROJECT_NUMBER@gcp-sa-vertex-eval.iam.gserviceaccount.com
Role: roles/aiplatform.rapidevalServiceAgent

The permissions associated with the rapid evaluation service agent are:

Role: Vertex AI Rapid Eval Service Agent (roles/aiplatform.rapidevalServiceAgent)
Permissions: aiplatform.endpoints.predict
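
The email address of the service agent contains your project number, not your project ID. As a small illustration, the following sketch fills in the PROJECT_NUMBER placeholder from the address shown earlier. The function name is hypothetical, and the digits-only check is an assumption added to catch a project ID passed by mistake.

```python
def rapid_eval_service_agent_email(project_number) -> str:
    """Build the service agent email address for a numeric project number.

    Hypothetical helper for illustration; it is not part of the Vertex AI SDK.
    """
    number = str(project_number).strip()
    # Project numbers are numeric; a value such as "my-project" is a project ID.
    if not number.isdigit():
        raise ValueError(f"Expected a numeric project number, got {project_number!r}")
    return f"service-{number}@gcp-sa-vertex-eval.iam.gserviceaccount.com"


if __name__ == "__main__":
    print(rapid_eval_service_agent_email(123456789012))
```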

Run your evaluation

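Use the EvalTask class to run evaluations for the use cases covered in the following sections: model-based metrics, translation models, computation-based metrics, and additional metric customization.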

EvalTask class

The EvalTask class helps you evaluate models and applications based on specific tasks. To make fair comparisons between generative models, you typically need to repeatedly evaluate various models and prompt templates against a fixed evaluation dataset using specific metrics. It's also important to evaluate multiple metrics simultaneously within a single evaluation run.

EvalTask also integrates with Vertex AI Experiments to help you track configurations and results for each evaluation run. Vertex AI Experiments helps you manage and interpret evaluation results so that you can make informed decisions.

The following example demonstrates how to instantiate the EvalTask class and run an evaluation:

from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    MetricPromptTemplateExamples
)

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
)

eval_result = eval_task.evaluate(
    model=MODEL,
    prompt_template=PROMPT_TEMPLATE,
    experiment_run=EXPERIMENT_RUN,
)
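
In the preceding example, DATASET, MODEL, PROMPT_TEMPLATE, and the metric names are placeholders. Placeholders in curly braces inside the prompt template are filled from the dataset columns that have the same names. The following standard-library sketch illustrates that idea only: the column names instruction and context and the helper assemble_prompts are hypothetical, and this isn't the SDK's implementation.

```python
from typing import Dict, List

# Hypothetical template; each placeholder names a dataset column.
PROMPT_TEMPLATE = "{instruction}\n\nText: {context}\n\nSummary:"

# Hypothetical dataset rows, one dict per evaluation example.
dataset_rows: List[Dict[str, str]] = [
    {
        "instruction": "Summarize the text in one sentence.",
        "context": "The fire was brought under control overnight.",
    },
    {
        "instruction": "Summarize the text in five words or fewer.",
        "context": "Schools and kindergartens reopened on Monday.",
    },
]


def assemble_prompts(rows: List[Dict[str, str]], template: str) -> List[str]:
    """Fill the template once per row; a missing column raises KeyError."""
    return [template.format(**row) for row in rows]


if __name__ == "__main__":
    for prompt in assemble_prompts(dataset_rows, PROMPT_TEMPLATE):
        print(prompt)
        print("---")
```

A row that lacks a column named in the template fails fast with a KeyError, which is a convenient way to catch a mismatch between the dataset and the template before you send any requests.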

Run evaluation with model-based metrics

For model-based metrics, use the PointwiseMetric and PairwiseMetric classes to define metrics tailored to your specific criteria. Run evaluations using the following options:

Use model-based metric examples

You can use the built-in Metric Prompt Template Examples constants in the Vertex AI SDK directly. Alternatively, you can modify them and use them in the free-form metric definition interface.

For the full list of Metric Prompt Template Examples covering most key use cases, see Metric prompt templates.

The following Vertex AI SDK example shows how to use the MetricPromptTemplateExamples class to define your metrics:

# View all the available examples of model-based metrics
MetricPromptTemplateExamples.list_example_metric_names()

# Display the metric prompt template of a specific example metric
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))

# Use the pre-defined model-based metrics directly
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)

Use a model-based metric templated interface

Customize your metrics by populating fields like Criteria and Rating Rubrics using the PointwiseMetricPromptTemplate and PairwiseMetricPromptTemplate classes in the Vertex AI SDK. Certain fields, such as Instruction, are assigned a default value if you don't provide input.

Optionally, you can specify input_variables, which is a list of input fields used by the metric prompt template to generate model-based evaluation results. By default, the model's response column is included for pointwise metrics, and both the candidate model's response and baseline_model_response columns are included for pairwise metrics.

For additional information, refer to the "Structure a metric prompt template" section in Metric prompt templates.

# Define a pointwise metric with two custom criteria
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
          "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
          "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
        },
        rating_rubric={
          "1": "The response performs well on both criteria.",
          "0": "The response is somewhat aligned with both criteria",
          "-1": "The response falls short on both criteria",
        },
        input_variables=["prompt"],
    ),
)

# Display the serialized metric prompt template
print(custom_text_quality.metric_prompt_template)

# Run evaluation using the custom_text_quality metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
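
Because input_variables determines which dataset columns the metric prompt template reads, it can be useful to check your dataset columns before you run the evaluation. The following sketch applies the defaults described earlier: the response column for pointwise metrics, plus baseline_model_response for pairwise metrics. The helper names are hypothetical and the check is only an approximation of what the SDK expects.

```python
from typing import Iterable, List, Optional


def required_columns(input_variables: Optional[Iterable[str]] = None,
                     pairwise: bool = False) -> List[str]:
    """List the columns a templated metric reads, including the defaults."""
    columns = list(input_variables or [])
    defaults = ["response"]
    if pairwise:
        defaults.append("baseline_model_response")
    for name in defaults:
        if name not in columns:
            columns.append(name)
    return columns


def missing_columns(dataset_columns: Iterable[str],
                    input_variables: Optional[Iterable[str]] = None,
                    pairwise: bool = False) -> List[str]:
    """Return the required columns that the dataset doesn't provide."""
    available = set(dataset_columns)
    return [c for c in required_columns(input_variables, pairwise)
            if c not in available]


if __name__ == "__main__":
    print(required_columns(["prompt"]))
    print(missing_columns(["prompt"], ["prompt"], pairwise=True))
```

When you pass a model to evaluate(), as in the preceding example, a missing response column isn't necessarily a problem, because the responses are generated during the run.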

Use the model-based metric free-form SDK interface

For more flexibility in customizing the metric prompt template, you can define a metric directly using the free-form interface, which accepts a direct string input.

# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness.
 Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

freeform_multi_turn_chat_quality_metric = PointwiseMetric(
    metric="multi_turn_chat_quality_metric",
    metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

# Run evaluation using the freeform_multi_turn_chat_quality_metric metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[freeform_multi_turn_chat_quality_metric],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
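
A free-form template references dataset columns through placeholders such as {history}, {prompt}, and {response}. The following sketch uses the standard library string.Formatter to list the placeholders in a template so that you can compare them with your dataset columns. The helper name template_placeholders and the shortened template are assumptions for illustration.

```python
from string import Formatter
from typing import List

# Shortened version of the free-form template shown above.
TEMPLATE = """Rate the response on a 1-5 scale.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""


def template_placeholders(template: str) -> List[str]:
    """Return the placeholder names in a template, in order of first use."""
    names: List[str] = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        # field_name is None for trailing literal text without a placeholder.
        if field_name and field_name not in names:
            names.append(field_name)
    return names


if __name__ == "__main__":
    needed = template_placeholders(TEMPLATE)
    dataset_columns = {"prompt", "response"}  # hypothetical dataset columns
    print("Placeholders:", needed)
    print("Missing from dataset:", [n for n in needed if n not in dataset_columns])
```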

Evaluate a translation model

To evaluate your translation model, you can specify BLEU, MetricX, or COMET as evaluation metrics when using the Vertex AI SDK.

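# The import paths below are assumptions. In some SDK versions, these
# modules are under vertexai.preview.evaluation instead.
import pandas as pd
from vertexai import evaluation
from vertexai.evaluation.metrics import pointwise_metric

# Prepare the dataset for evaluation.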
sources = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

responses = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

references = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]

eval_dataset = pd.DataFrame({
    "source": sources,
    "response": responses,
    "reference": references,
})

# Set the metrics.

metrics = [
    "bleu",
    pointwise_metric.Comet(),
    pointwise_metric.MetricX(),
]

eval_task = evaluation.EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
)
eval_result = eval_task.evaluate()
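
To build intuition for what a reference-based score like BLEU measures, the following sketch computes clipped n-gram precision, which is one component of BLEU. It isn't the full BLEU metric, which also combines several n-gram orders and applies a brevity penalty, and the tokenization here is a simplifying assumption that differs from the service's scoring.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (simplified)."""
    return re.findall(r"\w+", text.lower())


def clipped_ngram_precision(response: str, reference: str, n: int = 1) -> float:
    """Share of response n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    resp = tokenize(response)
    ref = tokenize(reference)
    resp_ngrams = Counter(tuple(resp[i:i + n]) for i in range(len(resp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(resp_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[ngram])
                  for ngram, count in resp_ngrams.items())
    return overlap / total


if __name__ == "__main__":
    # Response and reference pairs from the dataset above.
    print(clipped_ngram_precision("The fire could be stopped",
                                  "They were able to control the fire."))  # 0.4
    print(clipped_ngram_precision("Schools and kindergartens were open",
                                  "Schools and kindergartens opened"))  # 0.6
```

The first pair scores low on word overlap even though the translation is acceptable, which is one reason to combine BLEU with metrics such as MetricX or COMET.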

Run evaluation with computation-based metrics

You can use computation-based metrics standalone, or together with model-based metrics.

# Combine computation-based metrics "ROUGE" and "BLEU" with model-based metrics
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=["rouge_l_sum", "bleu", custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
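
Computation-based metrics compare the response with a reference by using a fixed formula instead of a judge model. As an illustration, the following sketch computes a sentence-level ROUGE-L F-measure from the longest common subsequence of tokens. It's a simplified approximation: the rouge_l_sum metric is a summary-level variant, and the whitespace tokenization here is an assumption.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one row of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response: str, reference: str) -> float:
    """Sentence-level ROUGE-L F-measure with equal weight on precision and recall."""
    resp: List[str] = response.lower().split()
    ref: List[str] = reference.lower().split()
    if not resp or not ref:
        return 0.0
    lcs = lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(resp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```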

Additional metric customization

If you need to further customize your metrics, such as choosing a different judge model for model-based metrics or defining a new computation-based metric, you can use the CustomMetric class in the Vertex AI SDK. For more details, see the following notebooks:

Run model-based evaluation with increased rate limits and quota

A single evaluation request for a model-based metric results in multiple underlying requests to the Gemini API in Vertex AI and consumes quota for the judge model, gemini-1.5-pro. The model requests per minute (RPM) quota is calculated on a per-project basis. This means that both the requests to the judge model gemini-1.5-pro and the requests to the Gen AI evaluation service for model-based metrics count toward the project's judge model RPM quota for gemini-1.5-pro in a specific region.

You should increase the judge model RPM quota and set a higher evaluation service rate limit evaluation_service_qps in the following use cases:

  • Increased data volume: If you're processing significantly more data using the model-based metrics, you'll likely hit the default RPM quota. Increasing the quota lets you handle the larger volume without performance degradation or interruptions.

  • Faster evaluation: If your application requires quicker turnaround time for evaluations, you might need a higher RPM quota. This is especially important for time-sensitive applications or those with real-time interactions where delays in evaluation can impact the user experience.

  • Complex evaluation tasks: A higher RPM quota ensures you have enough capacity to handle resource-intensive evaluations for complex tasks or large amounts of text.

  • High user concurrency: If you anticipate a large number of users simultaneously requesting model-based evaluations and model inference within your project, a higher model RPM limit is crucial to prevent bottlenecks and maintain responsiveness.

To increase the model quota and use the Gen AI evaluation service SDK with increased rate limits, do the following:

  1. In the Google Cloud console, go to the IAM & Admin Quotas page.


  2. In the Filter field, specify the Dimension (model identifier) and the Metric (quota identifier for Gemini models): base_model:gemini-1.5-pro and Metric:aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model.

  3. For the quota that you want to increase, click the more actions menu button.

  4. In the drop-down menu, click Edit quota. The Quota changes panel opens.

  5. Under Edit quota, enter a new quota value.

  6. Click Submit request.

  7. Wait for the quota increase request to be confirmed by email. The request typically takes two business days to process.

  8. After your quota increase request is approved by email, set the evaluation_service_qps parameter as follows:

from vertexai.evaluation import EvalTask

# GEMINI_RPM is the requests per minute (RPM) quota for gemini-1.5-pro in your region
# Evaluation Service QPS limit is equal to (gemini-1.5-pro RPM / 60 sec / default number of samples)
CUSTOM_EVAL_SERVICE_QPS_LIMIT = GEMINI_RPM / 60 / 4

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
)

eval_result = eval_task.evaluate(
    evaluation_service_qps=CUSTOM_EVAL_SERVICE_QPS_LIMIT,
    # Specify a retry_timeout limit for a more responsive evaluation run.
    # The default value is 600 seconds (10 minutes).
    retry_timeout=RETRY_TIMEOUT,
)
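
As a worked example of the formula in the preceding comments, the following sketch converts a judge model RPM quota into an evaluation service QPS limit. The default of 4 samples per request is taken from the preceding snippet. The function name and the input validation are assumptions for illustration.

```python
def eval_service_qps(gemini_rpm: float, samples_per_request: int = 4) -> float:
    """Convert a judge model RPM quota into an evaluation service QPS limit.

    QPS = RPM / 60 seconds / number of judge model samples per evaluation request.
    """
    if gemini_rpm <= 0:
        raise ValueError("gemini_rpm must be positive")
    if samples_per_request <= 0:
        raise ValueError("samples_per_request must be positive")
    return gemini_rpm / 60 / samples_per_request


if __name__ == "__main__":
    # A quota of 1,200 RPM allows 20 judge model requests per second,
    # which supports 5 evaluation requests per second at 4 samples each.
    print(eval_service_qps(1200))  # 5.0
```

If other workloads in the same project and region also call the judge model, consider setting a value below this ceiling so that the evaluation run doesn't consume the whole quota.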

For more information about quotas and limits, see Gen AI evaluation service quotas and Gen AI evaluation service API.

What's next