You can use the Gen AI Evaluation module of the Vertex AI SDK for Python to programmatically evaluate your generative language models and applications with the Gen AI evaluation service API. This page shows you how to run evaluations with the Vertex AI SDK.
Before you begin
Install the Vertex AI SDK
To install the Gen AI Evaluation module from the Vertex AI SDK for Python, run the following command:
!pip install -q google-cloud-aiplatform[evaluation]
For more information, see Install the Vertex AI SDK for Python.
Authenticate the Vertex AI SDK
After you install the Vertex AI SDK for Python, you need to authenticate. The following topics explain how to authenticate with the Vertex AI SDK if you're working locally and if you're working in Colaboratory:
If you're developing locally, set up Application Default Credentials (ADC) in your local environment:
Install the Google Cloud CLI, then initialize it by running the following command:
gcloud init
Create local authentication credentials for your Google Account:
gcloud auth application-default login
A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information about working with ADC in a local environment, see Local development environment.
If you're working in Colaboratory, run the following command in a Colab cell to authenticate:
from google.colab import auth
auth.authenticate_user()
This command opens a window where you can complete the authentication.
Understanding service accounts
The service account is used by the Gen AI evaluation service to get predictions from the Gemini API in Vertex AI for model-based evaluation metrics. This service account is automatically provisioned on the first request to the Gen AI evaluation service.
Name | Description | Email address | Role |
---|---|---|---|
Vertex AI Rapid Eval Service Agent | The service account used to get predictions for model-based evaluation. | service-PROJECT_NUMBER@gcp-sa-vertex-eval.iam.gserviceaccount.com | roles/aiplatform.rapidevalServiceAgent |
The permissions associated with the Vertex AI Rapid Eval Service Agent role are:
Role | Permissions |
---|---|
Vertex AI Rapid Eval Service Agent (roles/aiplatform.rapidevalServiceAgent) | aiplatform.endpoints.predict |
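If you need the service agent's email address for your own project, for example to look it up on the IAM page, the pattern in the table above can be filled in with a small helper. This is a minimal sketch: the function name and the project number are made up for illustration, and only the address pattern comes from the table.

```python
def eval_service_agent_email(project_number: str) -> str:
    """Fill PROJECT_NUMBER into the service agent address pattern from the table."""
    return f"service-{project_number}@gcp-sa-vertex-eval.iam.gserviceaccount.com"


# 123456789012 is a hypothetical project number used only for illustration.
print(eval_service_agent_email("123456789012"))
```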
Run your evaluation
Use the EvalTask class to run evaluations with model-based metrics, computation-based metrics, or both.
EvalTask class
The EvalTask class helps you evaluate models and applications based on specific tasks. To make fair comparisons between generative models, you typically need to repeatedly evaluate various models and prompt templates against a fixed evaluation dataset using specific metrics. It's also important to evaluate multiple metrics simultaneously within a single evaluation run.
EvalTask also integrates with Vertex AI Experiments to help you track configurations and results for each evaluation run. Vertex AI Experiments helps you manage and interpret evaluation results so that you can make informed decisions.
The following example demonstrates how to instantiate the EvalTask class and run an evaluation:
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    MetricPromptTemplateExamples,
)

# DATASET, METRIC_1..3, EXPERIMENT_NAME, MODEL, PROMPT_TEMPLATE, and
# EXPERIMENT_RUN are placeholders for your own values.
eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
)

eval_result = eval_task.evaluate(
    model=MODEL,
    prompt_template=PROMPT_TEMPLATE,
    experiment_run=EXPERIMENT_RUN,
)
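To make the role of the prompt template more concrete, the following sketch shows how a template might be combined with each row of a fixed dataset before the prompts are sent to a model. It is plain Python and does not call the SDK. It assumes that the template uses curly-brace placeholders that name dataset columns; the column names, the template text, and the assemble_prompts helper are hypothetical.

```python
# Hypothetical evaluation rows; the column names are illustrative only.
rows = [
    {"instruction": "Summarize in one sentence.", "context": "The meeting moved to Friday."},
    {"instruction": "Summarize in one sentence.", "context": "The launch was delayed by rain."},
]

# Hypothetical prompt template; each placeholder is assumed to name a dataset column.
PROMPT_TEMPLATE = "{instruction}\n\nText: {context}"


def assemble_prompts(template, rows):
    """Fill the template once per row so every model sees the same prompts."""
    return [template.format(**row) for row in rows]


prompts = assemble_prompts(PROMPT_TEMPLATE, rows)
for prompt in prompts:
    print(prompt)
    print("---")
```

Keeping the dataset fixed and changing only the template or the model is what makes results from separate evaluation runs comparable.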
Run evaluation with model-based metrics
For model-based metrics, use the PointwiseMetric and PairwiseMetric classes to define metrics tailored to your specific criteria. Run evaluations using the following options:
Use model-based metric examples
You can use the built-in Metric Prompt Template Examples in the Vertex AI SDK directly. Alternatively, modify them and incorporate them in the free-form metric definition interface.
For the full list of Metric Prompt Template Examples covering most key use cases, see Metric prompt templates.
The following Vertex AI SDK example shows how to use the MetricPromptTemplateExamples class to define your metrics:
# View all the available examples of model-based metrics
print(MetricPromptTemplateExamples.list_example_metric_names())

# Display the metric prompt template of a specific example metric
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))

# Use the pre-defined model-based metrics directly
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
Use a model-based metric templated interface
Customize your metrics by populating fields like Criteria and Rating Rubrics using the PointwiseMetricPromptTemplate and PairwiseMetricPromptTemplate classes within the Vertex AI SDK. Certain fields, such as Instruction, are assigned a default value if you don't provide input.
Optionally, you can specify input_variables, which is a list of input fields used by the metric prompt template to generate model-based evaluation results. By default, the model's response column is included for pointwise metrics, and both the candidate model's response and baseline_model_response columns are included for pairwise metrics.
For additional information, refer to the "Structure a metric prompt template" section in Metric prompt templates.
# Define a pointwise metric with two custom criteria
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
            "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
        },
        rating_rubric={
            "1": "The response performs well on both criteria.",
            "0": "The response is somewhat aligned with both criteria.",
            "-1": "The response falls short on both criteria.",
        },
        input_variables=["prompt"],
    ),
)

# Display the serialized metric prompt template
print(custom_text_quality.metric_prompt_template)

# Run evaluation using the custom_text_quality metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
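Because a templated metric reads the default columns plus anything listed in input_variables, it can help to check your dataset's columns before you start a run. The following plain-Python sketch follows the defaults described earlier in this section (response for pointwise metrics, response and baseline_model_response for pairwise metrics). The helper names are hypothetical and are not part of the SDK.

```python
def required_columns(input_variables, pairwise=False):
    """Columns a templated metric is expected to read, per the defaults described above."""
    columns = ["response"]
    if pairwise:
        columns.append("baseline_model_response")
    for name in input_variables:
        if name not in columns:
            columns.append(name)
    return columns


def missing_columns(dataset_columns, input_variables, pairwise=False):
    """Return the required columns that the dataset does not provide."""
    needed = required_columns(input_variables, pairwise)
    return [name for name in needed if name not in dataset_columns]


# Hypothetical dataset with only "prompt" and "response" columns.
print(missing_columns(["prompt", "response"], ["prompt"]))
print(missing_columns(["prompt", "response"], ["prompt"], pairwise=True))
```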
Use the model-based metric free-form SDK interface
For more flexibility in customizing the metric prompt template, you can define a metric directly using the free-form interface, which accepts a direct string input.
# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness.
Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.
# Conversation History
{history}
# Current User Prompt
{prompt}
# AI-generated Response
{response}
"""
freeform_multi_turn_chat_quality_metric = PointwiseMetric(
    metric="multi_turn_chat_quality_metric",
    metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

# Run evaluation using the freeform_multi_turn_chat_quality_metric metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[freeform_multi_turn_chat_quality_metric],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
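A free-form template names its inputs through placeholders such as {history}, {prompt}, and {response}. As a quick sanity check before you run an evaluation, you can list those placeholders with Python's standard string.Formatter and compare them with your dataset's columns. This sketch assumes that each placeholder must match a dataset column; the helper name and the sample column set are hypothetical.

```python
from string import Formatter


def template_fields(template):
    """Return the placeholder names in a template, in order of first appearance."""
    fields = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in fields:
            fields.append(field_name)
    return fields


# A shortened version of the multi-turn chat template shown above.
template = (
    "# Conversation History\n{history}\n"
    "# Current User Prompt\n{prompt}\n"
    "# AI-generated Response\n{response}\n"
)

# Hypothetical dataset that lacks a "history" column.
dataset_columns = {"prompt", "response"}

fields = template_fields(template)
missing = [name for name in fields if name not in dataset_columns]
print("placeholders:", fields)
print("missing columns:", missing)
```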
Evaluate a translation model
To evaluate your translation model, you can specify BLEU, MetricX, or COMET as evaluation metrics when using the Vertex AI SDK.
import pandas as pd

# `evaluation` and `pointwise_metric` come from the Vertex AI SDK. Import them
# from the evaluation package of your installed SDK version before running this.

# Prepare the dataset for evaluation.
sources = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]
responses = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]
references = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]
eval_dataset = pd.DataFrame({
    "source": sources,
    "response": responses,
    "reference": references,
})

# Set the metrics.
metrics = [
    "bleu",
    pointwise_metric.Comet(),
    pointwise_metric.MetricX(),
]

eval_task = evaluation.EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
)
eval_result = eval_task.evaluate()
Run evaluation with computation-based metrics
You can use computation-based metrics standalone, or together with model-based metrics.
# Combine computation-based metrics "ROUGE" and "BLEU" with model-based metrics
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=["rouge_l_sum", "bleu", custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
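For intuition about what a computation-based metric measures, the following sketch computes a simplified, sentence-level ROUGE-L F-measure from the longest common subsequence (LCS) of whitespace tokens. It is an illustration only and is not the service's implementation: the rouge_l_sum metric is a summary-level variant, and its tokenization and scoring details may differ. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(response, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    resp = response.lower().split()
    ref = reference.lower().split()
    if not resp or not ref:
        return 0.0
    lcs = lcs_length(resp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(resp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

Unlike a model-based metric, this score depends only on token overlap with the reference, so it needs a reference column and no judge model.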
Additional metric customization
If you need to further customize your metrics, such as choosing a different judge model for model-based metrics or defining a new computation-based metric, you can use the CustomMetric class in the Vertex AI SDK. For more details, see the following notebooks:
Run model-based evaluation with increased rate limits and quota
A single evaluation request for a model-based metric results in multiple underlying requests to the Gemini API in Vertex AI and consumes quota for the judge model, gemini-1.5-pro. The model requests per minute (RPM) quota is calculated per project, so both direct requests to gemini-1.5-pro and the requests that the Gen AI evaluation service makes for model-based metrics count toward the project's gemini-1.5-pro RPM quota in a given region.
You should increase the judge model RPM quota and set a higher evaluation service rate limit evaluation_service_qps in the following use cases:
Increased data volume: If you're processing significantly more data using the model-based metrics, you'll likely hit the default RPM quota. Increasing the quota lets you handle the larger volume without performance degradation or interruptions.
Faster evaluation: If your application requires quicker turnaround time for evaluations, you might need a higher RPM quota. This is especially important for time-sensitive applications or those with real-time interactions where delays in evaluation can impact the user experience.
Complex evaluation tasks: A higher RPM quota ensures you have enough capacity to handle resource-intensive evaluations for complex tasks or large amounts of text.
High user concurrency: If you anticipate a large number of users simultaneously requesting model-based evaluations and model inference within your project, a higher model RPM limit is crucial to prevent bottlenecks and maintain responsiveness.
To increase the model quota and use the Gen AI evaluation service SDK with increased rate limits, do the following:
In the Google Cloud console, go to the IAM & Admin Quotas page.
In the Filter field, specify the Dimension (model identifier) and the Metric (quota identifier for Gemini models): base_model:gemini-1.5-pro and Metric:aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model.
For the quota that you want to increase, click the more actions menu button.
In the drop-down menu, click Edit quota. The Quota changes panel opens.
Under Edit quota, enter a new quota value.
Click Submit request.
The quota increase request is confirmed by email and typically takes two business days to process.
After your quota increase request is approved by email, you can set the evaluation_service_qps parameter as follows:
from vertexai.evaluation import EvalTask

# GEMINI_RPM is the requests per minute (RPM) quota for gemini-1.5-pro in your region
# Evaluation Service QPS limit is equal to (gemini-1.5-pro RPM / 60 sec / default number of samples)
CUSTOM_EVAL_SERVICE_QPS_LIMIT = GEMINI_RPM / 60 / 4

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
)

eval_result = eval_task.evaluate(
    evaluation_service_qps=CUSTOM_EVAL_SERVICE_QPS_LIMIT,
    # Specify a retry_timeout limit for a more responsive evaluation run
    # the default value is 600 (in seconds, or 10 minutes)
    retry_timeout=RETRY_TIMEOUT,
)
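As a worked example of the formula in the comment above, the following sketch converts an RPM quota into an evaluation service QPS limit. The divisor of 4 is the default number of samples used in the snippet above. The helper name and the quota value of 1,200 RPM are hypothetical and are chosen only to keep the arithmetic easy to follow.

```python
def eval_service_qps(gemini_rpm, samples_per_request=4):
    """Convert a judge model RPM quota into an evaluation service QPS limit.

    QPS = RPM / 60 seconds / number of judge model samples per evaluation request.
    """
    if gemini_rpm <= 0 or samples_per_request <= 0:
        raise ValueError("gemini_rpm and samples_per_request must be positive")
    return gemini_rpm / 60 / samples_per_request


# Hypothetical quota: 1,200 requests per minute.
print(eval_service_qps(1200))  # 1200 / 60 / 4 = 5.0
```

If other workloads in the same project also call the judge model, consider passing a value lower than the computed limit so that evaluations leave quota available for them.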
For more information about quotas and limits, see Gen AI evaluation service quotas and Gen AI evaluation service API.
What's next
Find a model-based metrics template.
Try an evaluation example notebook.