Run an evaluation

You can use the Gen AI Evaluation module of the Vertex AI SDK for Python to programmatically evaluate your generative language models and applications with the Gen AI Evaluation Service API. This page shows you how to run evaluations with the Vertex AI SDK.

Before you begin

Install the Vertex AI SDK

To install the Gen AI Evaluation module from the Vertex AI SDK for Python, run the following command:

!pip install -q google-cloud-aiplatform[evaluation]

For more information, see Install the Vertex AI SDK for Python.

Authenticate the Vertex AI SDK

After you install the Vertex AI SDK for Python, you need to authenticate. The following topics explain how to authenticate with the Vertex AI SDK if you're working locally and if you're working in Colaboratory:

  • If you're developing locally, set up Application Default Credentials (ADC) in your local environment:

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
      
    2. Create local authentication credentials for your Google Account:

      gcloud auth application-default login
      

      A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information about working with ADC in a local environment, see Local development environment.

  • If you're working in Colaboratory, run the following command in a Colab cell to authenticate:

    from google.colab import auth
    auth.authenticate_user()
    

    This command opens a window where you can complete the authentication.
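For the local development path above, it can help to confirm where ADC will look for credentials before running an evaluation. The following is a minimal sketch using only the standard library. It assumes the default gcloud configuration directory (a `CLOUDSDK_CONFIG` override is not handled), and `adc_file_path` is a hypothetical helper, not part of the Vertex AI SDK.

```python
import os
from pathlib import Path


def adc_file_path() -> Path:
    """Return the file that ADC is expected to read first (assumed default locations)."""
    # ADC checks this environment variable before the well-known file.
    explicit = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return Path(explicit)
    # Well-known file written by `gcloud auth application-default login`.
    if os.name == "nt":
        base = Path(os.environ.get("APPDATA", "")) / "gcloud"
    else:
        base = Path.home() / ".config" / "gcloud"
    return base / "application_default_credentials.json"
```

Calling `adc_file_path().exists()` indicates whether a local credential file is present. It does not verify that the credentials are valid.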

Understanding service accounts

The service account is used by the Gen AI Evaluation Service to get predictions from the online prediction service for model-based evaluation metrics. This service account is automatically provisioned on the first request to the Gen AI Evaluation Service.

Name: Vertex AI Rapid Eval Service Agent
Description: The service account used to get predictions for model-based evaluation.
Email address: service-PROJECT_NUMBER@gcp-sa-vertex-eval.iam.gserviceaccount.com
Role: roles/aiplatform.rapidevalServiceAgent

The permissions associated with the rapid evaluation service agent are:

Role: Vertex AI Rapid Eval Service Agent (roles/aiplatform.rapidevalServiceAgent)
Permissions: aiplatform.endpoints.predict
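The service agent's email address follows the pattern given above, with PROJECT_NUMBER replaced by your project number. As a small sketch, the address can be built like this; `eval_service_agent_email` is a hypothetical helper and the project number shown in the usage note is made up.

```python
def eval_service_agent_email(project_number: str) -> str:
    """Build the service agent email from the pattern documented above."""
    return f"service-{project_number}@gcp-sa-vertex-eval.iam.gserviceaccount.com"
```

For example, `eval_service_agent_email("123456789012")` yields the address to look for in your project's IAM policy.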

Run your evaluation

Use the EvalTask class to run evaluations with model-based metrics, computation-based metrics, or a combination of both, as described in the following sections.

EvalTask class

The EvalTask class helps you evaluate models and applications based on specific tasks. To make fair comparisons between generative models, you typically need to repeatedly evaluate various models and prompt templates against a fixed evaluation dataset using specific metrics. It's also important to evaluate multiple metrics simultaneously within a single evaluation run.

EvalTask also integrates with Vertex AI Experiments to help you track configurations and results for each evaluation run. Vertex AI Experiments aids in managing and interpreting evaluation results, empowering you to make informed decisions.

The following example demonstrates how to instantiate the EvalTask class and run an evaluation:

from vertexai.evaluation import EvalTask

eval_task = EvalTask(
  dataset=DATASET,
  metrics=[METRIC1, METRIC2, METRIC3],
  experiment=EXPERIMENT_NAME,
)

eval_result = eval_task.evaluate(
  model=MODEL,
  prompt_template=PROMPT_TEMPLATE,
  experiment_run=EXPERIMENT_RUN, 
)
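To make the role of the prompt template more concrete, the sketch below imitates how a template can be combined with each dataset row to produce one prompt per row. It uses plain Python string formatting and assumes that the template refers to dataset columns with `{column_name}` placeholders. The column names `instruction` and `context` and the helper `assemble_prompts` are hypothetical, and this is an illustration rather than the SDK's implementation.

```python
def assemble_prompts(rows, prompt_template):
    """Fill the template's {placeholders} from each row's columns."""
    return [prompt_template.format(**row) for row in rows]


# Hypothetical evaluation rows: one dict per dataset row.
rows = [
    {"instruction": "Summarize the text.", "context": "Vertex AI is a machine learning platform."},
    {"instruction": "Translate to French.", "context": "Good morning."},
]
template = "{instruction}\n\nText: {context}"
prompts = assemble_prompts(rows, template)
```

Because the dataset stays fixed while the template changes, comparing two prompt templates only requires calling the evaluation again with a different template.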

Run evaluation with model-based metrics

For model-based metrics, use the PointwiseMetric and PairwiseMetric classes to define metrics tailored to your specific criteria. Run evaluations using the following options:

Use model-based metric examples

You can use the built-in Metric Prompt Template Examples constants in the Vertex AI SDK directly. Alternatively, you can modify them and use them in the free-form metric definition interface.

For the full list of Metric Prompt Template Examples covering most key use cases, see Metric prompt templates.

The following Vertex AI SDK example shows how to use the MetricPromptTemplateExamples class to define your metrics:

from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

# View all the available examples of model-based metrics
print(MetricPromptTemplateExamples.list_example_metric_names())

# Display the metric prompt template of a specific example metric
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))

# Use the pre-defined model-based metrics directly
eval_task = EvalTask(
  dataset=EVAL_DATASET, 
  metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY], 
)

eval_result = eval_task.evaluate( 
  model=MODEL,
)

Use a model-based metric templated interface

Customize your metrics by populating fields such as Criteria and Rating Rubrics using the PointwiseMetricPromptTemplate and PairwiseMetricPromptTemplate classes in the Vertex AI SDK. Certain fields, such as Instruction, are assigned a default value if you don't provide input.

Optionally, you can specify input_variables, which is a list of input fields used by the metric prompt template to generate model-based evaluation results. By default, the model's response column is included for pointwise metrics, and both the candidate model's response and baseline_model_response columns are included for pairwise metrics.

For additional information, refer to the "Structure a metric prompt template" section in Metric prompt templates.

from vertexai.evaluation import (
  EvalTask,
  PointwiseMetric,
  PointwiseMetricPromptTemplate,
)

# Define a pointwise metric with two custom criteria
pointwise_metric_prompt_template = PointwiseMetricPromptTemplate(
  criteria={
    "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
    "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
  },
  rating_rubric={
    "1": "The response performs well on both criteria.",
    "0": "The response is somewhat aligned with both criteria.",
    "-1": "The response falls short on both criteria.",
  },
  input_variables=["prompt"],
)

custom_text_quality = PointwiseMetric(
  metric="custom_text_quality",
  metric_prompt_template=pointwise_metric_prompt_template,
)

# Display the serialized metric prompt template  
print(custom_text_quality.metric_prompt_template)

# Run evaluation using the defined metric
eval_task = EvalTask(
  dataset=EVAL_DATASET, 
  metrics=[custom_text_quality],
)
eval_result = eval_task.evaluate( 
  model=MODEL, 
)

Use the model-based metric free-form SDK interface

For more flexibility in customizing the metric prompt template, you can define a metric directly using the free-form interface, which accepts a direct string input.

from vertexai.evaluation import EvalTask, PointwiseMetric

# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness.
 Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

freeform_multi_turn_chat_quality_metric = PointwiseMetric(
  metric="multi_turn_chat_quality_metric",
  metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

# Run evaluation using the defined metric
eval_task = EvalTask(
  dataset=EVAL_DATASET, 
  metrics=[freeform_multi_turn_chat_quality_metric], 
)
eval_result = eval_task.evaluate( 
  model=MODEL,
)
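A free-form template such as the one above refers to fields like `{history}`, `{prompt}`, and `{response}`, so it is worth checking that each field has a matching column before running the evaluation. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the short template and column names are stand-ins for your own.

```python
from string import Formatter


def missing_columns(metric_prompt_template, columns):
    """Return the template placeholders that have no matching dataset column."""
    needed = {
        field_name
        for _, field_name, _, _ in Formatter().parse(metric_prompt_template)
        if field_name
    }
    return sorted(needed - set(columns))


# Stand-in template using the same placeholders as the example above.
template = "History: {history}\nPrompt: {prompt}\nResponse: {response}"
missing = missing_columns(template, ["prompt", "response"])
```

Here `missing` lists `history`, which indicates that the dataset needs a `history` column for this metric.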

Run evaluation with computation-based metrics

You can use computation-based metrics on their own or together with model-based metrics.

# Combine computation-based metrics "ROUGE" and "BLEU" with model-based metrics.
# custom_text_quality is the PointwiseMetric defined in the templated interface example.
eval_task = EvalTask(
  dataset=EVAL_DATASET, 
  metrics=["rouge_l_sum", "bleu", custom_text_quality],
)
eval_result = eval_task.evaluate( 
  model=MODEL,
)
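Computation-based metrics such as ROUGE and BLEU score a response by comparing it with a reference text using a formula rather than a judge model. To illustrate the idea, the sketch below computes a simplified, sentence-level ROUGE-L F1 score from the longest common subsequence of whitespace-separated tokens. It is an assumption-laden teaching example: it does not reproduce the tokenization or the summary-level variant behind `rouge_l_sum`, so its scores will not necessarily match the service's output.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response, reference):
    """Simplified ROUGE-L F1 over lowercased, whitespace-split tokens."""
    resp, ref = response.lower().split(), reference.lower().split()
    lcs = lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(resp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A response identical to its reference scores 1.0, and a response that shares no tokens with the reference scores 0.0.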

Additional metric customization

If you need to further customize your metrics, such as choosing a different judge model for model-based metrics or defining a new computation-based metric, you can use the CustomMetric class in the Vertex AI SDK. For more details, see the following notebook:

What's next