Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.

Run an evaluation

You can use the Gen AI Evaluation module of the Vertex AI SDK for Python to programmatically evaluate your generative language models and applications with the Gen AI evaluation service API. This page shows you how to run evaluations with the Vertex AI SDK. Note that evaluations at scale are only available using the REST API.

Before you begin

Install the Vertex AI SDK

To install the Gen AI Evaluation module from the Vertex AI SDK for Python, run the following command:

!pip install -q google-cloud-aiplatform[evaluation]

For more information, see Install the Vertex AI SDK for Python.

Authenticate the Vertex AI SDK

After you install the Vertex AI SDK for Python, you need to authenticate. The following topics explain how to authenticate with the Vertex AI SDK if you're working locally and if you're working in Colaboratory:

If you're developing locally, set up Application Default Credentials (ADC) in your local environment:
1. Install the Google Cloud CLI, then initialize it by running the following command:
```
gcloud init
```
2. Create local authentication credentials for your Google Account:
```
gcloud auth application-default login
```
  A login screen is displayed. After you sign in, your credentials are stored in the local credential file used by ADC. For more information, see Set up ADC for a local development environment.
If you're working in Colaboratory, run the following command in a Colab cell to authenticate:
```
from google.colab import auth
auth.authenticate_user()
```
This command opens a window where you can complete the authentication.

Understanding service accounts

The service account is used by the Gen AI evaluation service to get predictions from the Gemini API in Vertex AI for model-based evaluation metrics. This service account is automatically provisioned on the first request to the Gen AI evaluation service.

Name	Description	Email address	Role
Vertex AI Rapid Eval Service Agent	The service account used to get predictions for model-based evaluation.	`service-PROJECT_NUMBER@gcp-sa-vertex-eval.iam.gserviceaccount.com`	`roles/aiplatform.rapidevalServiceAgent`

The permissions associated with the rapid evaluation service agent are:

Role	Permissions
Vertex AI Rapid Eval Service Agent (roles/aiplatform.rapidevalServiceAgent)	`aiplatform.endpoints.predict`
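
Because the service agent's email address follows a fixed pattern, you can derive it from the project number, for example when checking IAM bindings. The following is a minimal sketch; the helper name `rapid_eval_service_agent_email` and the sample project number are hypothetical and not part of the Vertex AI SDK.

```python
def rapid_eval_service_agent_email(project_number: str) -> str:
    """Return the service agent email for a project (pattern from the table above)."""
    if not project_number.isdigit():
        raise ValueError(
            "project_number must be the numeric project number, not the project ID"
        )
    return f"service-{project_number}@gcp-sa-vertex-eval.iam.gserviceaccount.com"


# Hypothetical project number, for illustration only.
print(rapid_eval_service_agent_email("123456789012"))
```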

Run your evaluation

Use the EvalTask class to run evaluations for the following use cases:

Model-based metrics
Computation-based metrics
Run evaluations at scale (preview)
Additional metric customization
Increase rate limits and quota

`EvalTask` class

The EvalTask class helps you evaluate models and applications based on specific tasks. To make fair comparisons between generative models, you typically need to repeatedly evaluate various models and prompt templates against a fixed evaluation dataset using specific metrics. It's also important to evaluate multiple metrics simultaneously within a single evaluation run.

EvalTask also integrates with Vertex AI Experiments to help you track configurations and results for each evaluation run. Vertex AI Experiments aids in managing and interpreting evaluation results, empowering you to make informed decisions.

The following example demonstrates how to instantiate the EvalTask class and run an evaluation:

from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    PairwiseMetricPromptTemplate,
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    MetricPromptTemplateExamples,
)

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
)

eval_result = eval_task.evaluate(
    model=MODEL,
    prompt_template=PROMPT_TEMPLATE,
    experiment_run=EXPERIMENT_RUN,
)

Run evaluation with model-based metrics

For model-based metrics, use the PointwiseMetric and PairwiseMetric classes to define metrics tailored to your specific criteria. Run evaluations using the following options:

Use existing examples
Use a templated interface
Define metrics from scratch

Use model-based metric examples

You can use the built-in Metric Prompt Template Examples constants in the Vertex AI SDK directly. Alternatively, modify them and incorporate them into the free-form metric definition interface.

For the full list of Metric Prompt Template Examples covering most key use cases, see Metric prompt templates.

Console

When you're running evaluations in a Colab Enterprise notebook, you can access metric prompt templates from directly within the Google Cloud console.

Click the link for your preferred Gen AI evaluation service notebook.
The notebook opens in GitHub. Click Open in Colab Enterprise. If a dialog asks you to enable APIs, click Enable.
Click the Gen AI Evaluation icon in the sidebar. A Pre-built metric templates panel opens.
Select Pointwise or Pairwise metrics.
Click the metric you want to use, such as Fluency. The code sample for the metric appears.
Click Copy to copy the code sample. Optionally, click Customize to change pre-set fields for the metric.
Paste the code sample into your notebook.

Vertex AI SDK

The following Vertex AI SDK example shows how to use the MetricPromptTemplateExamples class to define your metrics:

# View all the available examples of model-based metrics
MetricPromptTemplateExamples.list_example_metric_names()

# Display the metric prompt template of a specific example metric
print(MetricPromptTemplateExamples.get_prompt_template('fluency'))

# Use the pre-defined model-based metrics directly
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)

eval_result = eval_task.evaluate(
    model=MODEL,
)

Use a model-based metric templated interface

Customize your metrics by populating fields like Criteria and Rating Rubrics using the PointwiseMetricPromptTemplate and PairwiseMetricPromptTemplate classes within Vertex AI SDK. Certain fields, such as Instruction, are assigned a default value if you don't provide input.

Optionally, you can specify input_variables, which is a list of input fields used by the metric prompt template to generate model-based evaluation results. By default, the model's response column is included for pointwise metrics, and both the candidate model's response and baseline_model_response columns are included for pairwise metrics.

For additional information, refer to the "Structure a metric prompt template" section in Metric prompt templates.

# Define a pointwise metric with two custom criteria
custom_text_quality = PointwiseMetric(
    metric="custom_text_quality",
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
          "fluency": "Sentences flow smoothly and are easy to read, avoiding awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.",
          "entertaining": "Short, amusing text that incorporates emojis, exclamations and questions to convey quick and spontaneous communication and diversion.",
        },
        rating_rubric={
          "1": "The response performs well on both criteria.",
          "0": "The response is somewhat aligned with both criteria",
          "-1": "The response falls short on both criteria",
        },
        input_variables=["prompt"],
    ),
)

# Display the serialized metric prompt template
print(custom_text_quality.metric_prompt_template)

# Run evaluation using the custom_text_quality metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)

Use the model-based metric free-form SDK interface

For more flexibility in customizing the metric prompt template, you can define a metric directly using the free-form interface, which accepts a direct string input.

# Define a pointwise multi-turn chat quality metric
pointwise_chat_quality_metric_prompt = """Evaluate the AI's contribution to a meaningful conversation, considering coherence, fluency, groundedness, and conciseness.
 Review the chat history for context. Rate the response on a 1-5 scale, with explanations for each criterion and its overall impact.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

freeform_multi_turn_chat_quality_metric = PointwiseMetric(
    metric="multi_turn_chat_quality_metric",
    metric_prompt_template=pointwise_chat_quality_metric_prompt,
)

# Run evaluation using the freeform_multi_turn_chat_quality_metric metric
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=[freeform_multi_turn_chat_quality_metric],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)
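
Each `{placeholder}` in a free-form template needs a matching field in the evaluation dataset. As a quick local sanity check before starting a run, you could list the placeholders and compare them with your columns. This sketch uses only the Python standard library; the helper `template_fields` and the sample row are hypothetical and not part of the Vertex AI SDK. It also assumes that the `response` column is produced by the model during the run, as in the example above where `model=MODEL` is passed.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the names of all {placeholders} found in a template string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


metric_prompt = """Evaluate the AI's contribution to a meaningful conversation.

# Conversation History
{history}

# Current User Prompt
{prompt}

# AI-generated Response
{response}
"""

# Hypothetical dataset row; in practice these would be columns of EVAL_DATASET.
rows = [
    {"history": "User: Hi\nAI: Hello!", "prompt": "What can you do?"},
]

required = template_fields(metric_prompt)
# Assumption: "response" is filled in by the model during the evaluation run.
missing = required - set(rows[0]) - {"response"}
print("Template fields:", sorted(required))
print("Missing from dataset:", sorted(missing))
```

If `missing` is not empty, add the corresponding columns to the dataset or remove the placeholders from the template before running the evaluation.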

Evaluate a translation model

To evaluate your translation model, you can specify BLEU, MetricX, or COMET as evaluation metrics when using the Vertex AI SDK.

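import pandas as pd
from vertexai import evaluation
from vertexai.evaluation.metrics import pointwise_metric
# Note: depending on your SDK version, these modules might instead be under vertexai.preview.evaluation.

# Prepare the dataset for evaluation.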
sources = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

responses = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

references = [
    "They were able to control the fire.",
    "Schools and kindergartens opened",
]

eval_dataset = pd.DataFrame({
    "source": sources,
    "response": responses,
    "reference": references,
})

# Set the metrics.

metrics = [
    "bleu",
    pointwise_metric.Comet(),
    pointwise_metric.MetricX(),
]

eval_task = evaluation.EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
)
eval_result = eval_task.evaluate()

Run evaluation with computation-based metrics

You can use computation-based metrics standalone, or together with model-based metrics.

# Combine computation-based metrics "ROUGE" and "BLEU" with model-based metrics
eval_task = EvalTask(
    dataset=EVAL_DATASET,
    metrics=["rouge_l_sum", "bleu", custom_text_quality],
)
eval_result = eval_task.evaluate(
    model=MODEL,
)

Run evaluations at scale

If you have large evaluation datasets or periodically run evaluations in a production environment, you can use the EvaluateDataset API in the Gen AI evaluation service to run evaluations at scale.

Before using any of the request data, make the following replacements:

PROJECT_NUMBER: Your project number.
DATASET_URI: The Cloud Storage path to a JSONL file that contains evaluation instances. Each line in the file should represent a single instance, with keys corresponding to user-defined input fields in the metric_prompt_template (for model-based metrics) or required input parameters (for computation-based metrics). You can only specify one JSONL file. The following example is a line for a pointwise evaluation instance:
```
{"response": "The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."}
```
METRIC_SPEC: One or more metric specs you are using for evaluation. You can use the following metric specs when running evaluations at scale: "pointwise_metric_spec", "pairwise_metric_spec", "exact_match_spec", "bleu_spec", and "rouge_spec".
METRIC_SPEC_FIELD_NAME: The required fields for your chosen metric spec. For example, "metric_prompt_template".
METRIC_SPEC_FIELD_CONTENT: The field content for your chosen metric spec. For example, you can use the following field content for a pointwise evaluation: "Evaluate the fluency of this sentence: {response}. Give score from 0 to 1. 0 - not fluent at all. 1 - very fluent."
OUTPUT_BUCKET: The name of the Cloud Storage bucket where you want to store evaluation results.
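
As a rough sketch of how the JSONL file for DATASET_URI could be produced, the following snippet writes one JSON object per line using only the Python standard library. The file name `eval_dataset.jsonl` and the second sample response are made up for illustration. Uploading the file to Cloud Storage is a separate step that is not shown here.

```python
import json
from pathlib import Path

# One dict per evaluation instance; keys match the template's input fields.
instances = [
    {"response": "The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."},
    {"response": "Hypothetical second response to evaluate."},
]

dataset_path = Path("eval_dataset.jsonl")  # hypothetical local file name
with dataset_path.open("w", encoding="utf-8") as f:
    for instance in instances:
        # JSONL: exactly one JSON object per line.
        f.write(json.dumps(instance, ensure_ascii=False) + "\n")

print(dataset_path.read_text(encoding="utf-8"))
```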

HTTP method and URL:

POST https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset

Request JSON body:

{
  "dataset": {
    "gcs_source": {
      "uris": "DATASET_URI"
    }
  },
  "metrics": [
    {
      METRIC_SPEC: {
        METRIC_SPEC_FIELD_NAME: METRIC_SPEC_FIELD_CONTENT
      }
    }
  ],
  "output_config": {
    "gcs_destination": {
      "output_uri_prefix": "OUTPUT_BUCKET"
    }
  }
}
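
If you prefer to generate `request.json` programmatically rather than by hand, a sketch like the following fills in the placeholders for a single pointwise metric. The dataset path is hypothetical, `OUTPUT_BUCKET` is left as a placeholder for you to replace, and the body simply mirrors the template above, so check the field shapes against the API reference before relying on it.

```python
import json

# Hypothetical Cloud Storage path to the JSONL dataset.
dataset_uri = "gs://my-eval-bucket/eval_dataset.jsonl"
# Placeholder: replace with your own output location.
output_bucket = "OUTPUT_BUCKET"

request_body = {
    "dataset": {"gcs_source": {"uris": dataset_uri}},
    "metrics": [
        {
            "pointwise_metric_spec": {
                "metric_prompt_template": (
                    "Evaluate the fluency of this sentence: {response}. "
                    "Give score from 0 to 1. 0 - not fluent at all. 1 - very fluent."
                )
            }
        }
    ],
    "output_config": {"gcs_destination": {"output_uri_prefix": output_bucket}},
}

# Write the body to the file name that the curl and PowerShell commands expect.
with open("request.json", "w", encoding="utf-8") as f:
    json.dump(request_body, f, indent=2)
```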

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/evaluateDataset" | Select-Object -Expand Content

You should receive a JSON response similar to the following.

Response

{
  "name": "projects/PROJECT_NUMBER/locations/us-central1/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.EvaluateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": CREATE_TIME,
      "updateTime": UPDATE_TIME
    }
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.EvaluateDatasetResponse",
    "outputInfo": {
      "gcsOutputDirectory": "gs://OUTPUT_BUCKET/evaluation_GENERATION_TIME"
    }
  }
}
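
Once you have the operation JSON, a small helper can tell you whether the run has finished and where the results were written. The sketch below only parses a dictionary shaped like the sample response above. The function name `output_directory` and the sample values are hypothetical, and the handling of an `error` field assumes the usual long-running operation format.

```python
import json
from typing import Optional


def output_directory(operation: dict) -> Optional[str]:
    """Return the Cloud Storage output directory, or None if still running."""
    if not operation.get("done", False):
        return None
    if "error" in operation:
        # Assumption: failures follow the standard long-running operation shape.
        raise RuntimeError(f"Evaluation failed: {operation['error']}")
    return operation["response"]["outputInfo"]["gcsOutputDirectory"]


# Sample payload modeled on the response above, with made-up values.
sample = json.loads("""
{
  "name": "projects/123/locations/us-central1/operations/456",
  "done": true,
  "response": {
    "outputInfo": {"gcsOutputDirectory": "gs://my-bucket/evaluation_2025-01-01"}
  }
}
""")

print(output_directory(sample))
# An operation without "done" is treated as still running.
print(output_directory({"name": "projects/123/locations/us-central1/operations/456"}))
```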

You can use the OPERATION_ID you receive in the response to request the status of the evaluation:

curl -X GET \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/PROJECT_NUMBER/locations/us-central1/operations/OPERATION_ID"

Additional metric customization

If you need to further customize your metrics, such as choosing a different judge model for model-based metrics or defining a new computation-based metric, you can use the CustomMetric class in the Vertex AI SDK. For more details, see the following notebooks:

Run model-based evaluation with increased rate limits and quota

A single evaluation request for a model-based metric results in multiple underlying requests to the Gemini API in Vertex AI and consumes quota for the judge model. You should set a higher evaluation service rate limit in the following use cases:

Increased data volume: If you're processing significantly more data using the model-based metrics, you might hit the default requests per minute (RPM) quota. Increasing the quota lets you handle the larger volume without performance degradation or interruptions.
Faster evaluation: If your application requires quicker turnaround time for evaluations, you might need a higher RPM quota. This is especially important for time-sensitive applications or those with real-time interactions where delays in evaluation can impact the user experience.
Complex evaluation tasks: A higher RPM quota ensures you have enough capacity to handle resource-intensive evaluations for complex tasks or large amounts of text.
High user concurrency: If you anticipate a large number of users simultaneously requesting model-based evaluations and model inference within your project, a higher model RPM limit is crucial to prevent bottlenecks and maintain responsiveness.

If you're using the default judge model of gemini-2.0-flash or newer models, we recommend that you use Provisioned Throughput to manage your quota.

For models older than gemini-2.0-flash, use the following instructions to increase the judge model RPM quota:

In the Google Cloud console, go to the IAM & Admin Quotas page.
In the Filter field, specify the Dimension (model identifier) and the Metric (quota identifier for Gemini models): base_model:gemini-2.0-flash and Metric:aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model.
For the quota that you want to increase, click the more actions menu button.
In the drop-down menu, click Edit quota. The Quota changes panel opens.
Under Edit quota, enter a new quota value.
Click Submit request.
A Quota Increase Request (QIR) is confirmed by email and typically takes two business days to process.

To run an evaluation using a new quota, set the evaluation_service_qps parameter as follows:

from vertexai.evaluation import EvalTask

# GEMINI_RPM is the requests per minute (RPM) quota for gemini-2.0-flash-001 in your region
# Evaluation Service QPS limit is equal to (gemini-2.0-flash-001 RPM / 60 sec / default number of samples)
CUSTOM_EVAL_SERVICE_QPS_LIMIT = GEMINI_RPM / 60 / 4

eval_task = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
)

eval_result = eval_task.evaluate(
    evaluation_service_qps=CUSTOM_EVAL_SERVICE_QPS_LIMIT,
    # Specify a retry_timeout for a more responsive evaluation run.
    # The default value is 600 seconds (10 minutes).
    retry_timeout=RETRY_TIMEOUT,
)
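
To make the arithmetic concrete, here is a small worked example of the formula in the comment above. The RPM value of 1,200 is a made-up figure, not a real quota, and the helper name `eval_service_qps` is hypothetical. The divisor of 4 is the default number of samples used in the snippet above.

```python
def eval_service_qps(gemini_rpm: float, samples_per_request: int = 4) -> float:
    """Convert a judge-model RPM quota into an evaluation service QPS limit."""
    if gemini_rpm <= 0 or samples_per_request <= 0:
        raise ValueError("quota and sample count must be positive")
    # RPM -> requests per second, then divide by the number of samples.
    return gemini_rpm / 60 / samples_per_request


# Hypothetical quota of 1,200 requests per minute.
print(eval_service_qps(1200))  # 5.0
```

With this hypothetical quota, you would pass `evaluation_service_qps=5.0` to `evaluate()`.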

For more information about quotas and limits, see Gen AI evaluation service quotas and Gen AI evaluation service API.