This guide shows you how to run a computation-based evaluation pipeline to evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The pipeline evaluates your model using a set of metrics against an evaluation dataset that you provide.

This page covers the following topics:

- How computation-based model evaluation works
- Supported models
- Prepare and upload the evaluation dataset
- Choose an evaluation method
- Perform model evaluation
- View evaluation results

The following diagram summarizes the overall workflow for running a computation-based evaluation:

For the latest computation-based evaluation features, see Define your metrics.
How computation-based model evaluation works

To evaluate a model's performance, you provide an evaluation dataset that contains prompt and ground truth pairs. For each pair, the prompt is the input that you want to evaluate, and the ground truth is the ideal response for that prompt. During the evaluation, the process passes the prompt from each pair to the model to generate an output. The process then uses the model's generated output and the corresponding ground truth to compute the evaluation metrics.

The type of metrics used for evaluation depends on the task that you are evaluating. The following table shows the supported tasks and the metrics used to evaluate each task:

| Task | Metric |
| --- | --- |
| Classification | Micro-F1, Macro-F1, Per class F1 |
| Summarization | ROUGE-L |
| Question answering | Exact Match |
| Text generation | BLEU, ROUGE-L |
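To make the computation concrete, the following sketch shows how two of these metrics can be derived from prediction and ground truth pairs. This is purely illustrative and is not the pipeline's implementation; the function names and example values are invented for this sketch.

```python
# Illustrative only: how computation-based metrics compare model output
# to ground truth. The evaluation pipeline computes these metrics for you.

def exact_match(predictions: list[str], ground_truths: list[str]) -> float:
    """Exact Match, as used for question answering: the fraction of
    predictions that are identical to the ground truth."""
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def micro_f1(predictions: list[str], ground_truths: list[str]) -> float:
    """Micro-averaged F1 for single-label classification. With exactly one
    predicted label per example, micro-F1 reduces to accuracy, because each
    wrong prediction counts as both a false positive and a false negative."""
    hits = sum(p == g for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: evaluate two question answering pairs.
print(exact_match(["Paris", "42"], ["Paris", "41"]))  # 0.5
```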
Supported models

You can evaluate the following models:

- text-bison: Base and tuned versions.

Prepare and upload the evaluation dataset

The evaluation dataset includes prompt and ground truth pairs that align with the task that you want to evaluate. Your dataset must include at least one prompt and ground truth pair, and we recommend at least 10 pairs for meaningful metrics. The more examples you provide, the more meaningful the results.

Dataset format
Your evaluation dataset must be in the JSON Lines (JSONL) format, where each line is a JSON object. Each object must contain an input_text field with the prompt that you want to evaluate and an output_text field with the ideal response for that prompt. The maximum token length for input_text is 8,192, and the maximum token length for output_text is 1,024.
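For example, a summarization dataset could contain lines like the following. The input_text and output_text field names come from this guide; the prompt and response values are illustrative:

```
{"input_text": "Summarize the following article: The city council voted on Tuesday to approve the new transit budget...", "output_text": "The city council approved the new transit budget."}
{"input_text": "Summarize the following article: Researchers reported early results from a trial of a grid-scale battery...", "output_text": "Researchers reported early results from a grid-scale battery trial."}
```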
Upload the dataset to Cloud Storage

You can create a new Cloud Storage bucket or use an existing one to store your dataset file. The bucket must be in the same region as the model. After your bucket is ready, upload your dataset file to the bucket.
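If you prefer to upload from code, the following is a minimal sketch that uses the google-cloud-storage client library. The bucket name and file paths are placeholders for this example:

```python
from google.cloud import storage

# Placeholders for this example: use your own bucket (in the same region
# as the model) and the local path of your JSONL dataset file.
BUCKET_NAME = "my-gcs-bucket-uri"
LOCAL_DATASET_PATH = "dataset.jsonl"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Uploads the file so that it's available at gs://my-gcs-bucket-uri/dataset.jsonl.
bucket.blob("dataset.jsonl").upload_from_filename(LOCAL_DATASET_PATH)
```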
Choose an evaluation method

You can run a computation-based evaluation job using the Google Cloud console, the REST API, or the Vertex AI SDK for Python. The following table can help you choose the best option for your use case.

| Method | Description |
| --- | --- |
| Google Cloud console | A graphical user interface (GUI) that provides a guided, step-by-step workflow for creating and monitoring evaluation jobs. |
| REST API | A programmatic interface for creating evaluation jobs by sending JSON requests to an endpoint. |
| Vertex AI SDK for Python | A high-level Python library that simplifies interactions with the Vertex AI API. |
Perform model evaluation

Use one of the following methods to perform a model evaluation job.

REST

To create a model evaluation job, send a POST request using the pipelineJobs method.

Before using any of the request data, make the following replacements:

- PIPELINEJOB_DISPLAYNAME: A display name for the pipeline job.
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region in which to run the pipeline job. Only us-central1 is supported.
- DATASET_URI: The Cloud Storage URI of your evaluation dataset file. Example: gs://my-gcs-bucket-uri/dataset.jsonl.
- OUTPUT_DIR: The Cloud Storage URI of the directory that stores the evaluation output. Example: gs://my-gcs-bucket-uri/output.
- MODEL_NAME: The model to evaluate. For a publisher model, specify publishers/google/models/MODEL@MODEL_VERSION. Example: publishers/google/models/text-bison@002. For a tuned model, specify projects/PROJECT_NUMBER/locations/LOCATION/models/ENDPOINT_ID. Example: projects/123456789012/locations/us-central1/models/1234567890123456789. The evaluation job doesn't impact any existing deployments of the model or their resources.
- EVALUATION_TASK: The task to evaluate the model on. One of summarization, question-answering, text-generation, or classification.
- INSTANCES_FORMAT: The format of your dataset. Only jsonl is supported. To learn more about this parameter, see InputConfig.
- PREDICTIONS_FORMAT: The format of the prediction output. Only jsonl is supported. To learn more about this parameter, see InputConfig.
- MACHINE_TYPE: (Optional) The machine type used to run the evaluation job, for example e2-highmem-16. For a list of supported machine types, see Machine types.
- SERVICE_ACCOUNT: (Optional) The service account used to run the evaluation job.
- NETWORK: (Optional) The Compute Engine network to peer the evaluation job with, in the format projects/PROJECT_NUMBER/global/networks/NETWORK_NAME. If you specify this field, you need to have a VPC Network Peering for Vertex AI. If left unspecified, the evaluation job is not peered with any network.
- KEY_NAME: (Optional) The Cloud KMS key name, in the format projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY. The key needs to be in the same region as the evaluation job.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs

Request JSON body:
{
"displayName": "PIPELINEJOB_DISPLAYNAME",
"runtimeConfig": {
"gcsOutputDirectory": "gs://OUTPUT_DIR",
"parameterValues": {
"project": "PROJECT_ID",
"location": "LOCATION",
"batch_predict_gcs_source_uris": ["gs://DATASET_URI"],
"batch_predict_gcs_destination_output_uri": "gs://OUTPUT_DIR",
"model_name": "MODEL_NAME",
"evaluation_task": "EVALUATION_TASK",
"batch_predict_instances_format": "INSTANCES_FORMAT",
"batch_predict_predictions_format: "PREDICTIONS_FORMAT",
"machine_type": "MACHINE_TYPE",
"service_account": "SERVICE_ACCOUNT",
"network": "NETWORK",
"encryption_spec_key_name": "KEY_NAME"
}
},
"templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}
To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs" | Select-Object -Expand ContentpipelineSpec
has been truncated to save space.
Example curl command
PROJECT_ID=myproject
REGION=us-central1
MODEL_NAME=publishers/google/models/text-bison@002
TEST_DATASET_URI=gs://my-gcs-bucket-uri/dataset.jsonl
OUTPUT_DIR=gs://my-gcs-bucket-uri/output
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/pipelineJobs" -d \
$'{
"displayName": "evaluation-llm-text-generation-pipeline",
"runtimeConfig": {
"gcsOutputDirectory": "'${OUTPUT_DIR}'",
"parameterValues": {
"project": "'${PROJECT_ID}'",
"location": "'${REGION}'",
"batch_predict_gcs_source_uris": ["'${TEST_DATASET_URI}'"],
"batch_predict_gcs_destination_output_uri": "'${OUTPUT_DIR}'",
"model_name": "'${MODEL_NAME}'",
}
},
"templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}'
Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
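The official SDK sample isn't reproduced here. As a minimal sketch, you can submit the same pipeline template that the REST example uses with aiplatform.PipelineJob; the project, bucket, model, and task values below are placeholders that mirror the REST request:

```python
from google.cloud import aiplatform

# Placeholders: replace with your own project, region, bucket, and model.
PROJECT_ID = "my-project"
LOCATION = "us-central1"  # Only us-central1 is supported.
OUTPUT_DIR = "gs://my-gcs-bucket-uri/output"
DATASET_URI = "gs://my-gcs-bucket-uri/dataset.jsonl"
MODEL_NAME = "publishers/google/models/text-bison@002"

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=OUTPUT_DIR)

# Submit the evaluation pipeline template used in the REST example.
job = aiplatform.PipelineJob(
    display_name="evaluation-llm-text-generation-pipeline",
    template_path=(
        "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/"
        "evaluation-llm-text-generation-pipeline/1.0.1"
    ),
    pipeline_root=OUTPUT_DIR,
    parameter_values={
        "project": PROJECT_ID,
        "location": LOCATION,
        "batch_predict_gcs_source_uris": [DATASET_URI],
        "batch_predict_gcs_destination_output_uri": OUTPUT_DIR,
        "model_name": MODEL_NAME,
        "evaluation_task": "text-generation",
    },
)

job.submit()  # Or job.run() to block until the pipeline finishes.
```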
Console

To create a model evaluation job using the Google Cloud console, follow these steps:

1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model that you want to evaluate.
3. In the Evaluate tab, create an evaluation, specifying the evaluation task, the ground_truth column, and your evaluation dataset. Only the .jsonl dataset format is supported.
View evaluation results
You can find the evaluation results in the Cloud Storage output directory that you specified when creating the evaluation job. The file is named evaluation_metrics.json.
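For example, assuming your output bucket is my-gcs-bucket-uri, you could read the metrics file with the google-cloud-storage client library. The exact object path depends on your evaluation run, so check the output directory first:

```python
import json

from google.cloud import storage

# Placeholders: your output bucket and the path of the metrics file within
# the output directory that you specified for the evaluation job.
BUCKET_NAME = "my-gcs-bucket-uri"
METRICS_PATH = "output/evaluation_metrics.json"  # Example path; check your output directory.

client = storage.Client()
metrics_text = client.bucket(BUCKET_NAME).blob(METRICS_PATH).download_as_text()
print(json.dumps(json.loads(metrics_text), indent=2))
```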
For tuned models, you can also view evaluation results in the Google Cloud console:
1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model to view its evaluation metrics.
3. In the Evaluate tab, click the name of the evaluation run that you want to view.
What's next
- Learn about generative AI evaluation.
- Learn about online evaluation with Gen AI Evaluation Service.
- Learn how to tune a foundation model.