This guide shows you how to run a computation-based evaluation pipeline to evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The pipeline evaluates your model using a set of metrics against an evaluation dataset that you provide.

This page covers the following topics:

- How computation-based model evaluation works
- Supported models
- Prepare and upload the evaluation dataset
- Choose an evaluation method
- Perform model evaluation
- View evaluation results

The following diagram summarizes the overall workflow for running a computation-based evaluation:

For the latest computation-based evaluation features, see Define your metrics.
How computation-based model evaluation works

To evaluate a model's performance, you provide an evaluation dataset that contains prompt and ground truth pairs. For each pair, the prompt is the input that you want to evaluate, and the ground truth is the ideal response for that prompt. During the evaluation, the process passes the prompt from each pair to the model to generate an output. The process then uses the model's generated output and the corresponding ground truth to compute the evaluation metrics.

The type of metrics used for evaluation depends on the task that you are evaluating. The following table shows the supported tasks and the metrics used to evaluate each task:

| Task | Metric |
| --- | --- |
| Classification | Micro-F1, Macro-F1, Per class F1 |
| Summarization | ROUGE-L |
| Question answering | Exact Match |
| Text generation | BLEU, ROUGE-L |
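To make the computation concrete, the following sketch shows how two of these metrics can be derived from prediction and ground truth pairs. This is purely illustrative and is not the pipeline's implementation; the function names and example values are invented for this sketch.

```python
# Illustrative only: how computation-based metrics compare model output
# to ground truth. The evaluation pipeline computes these metrics for you.

def exact_match(predictions: list[str], ground_truths: list[str]) -> float:
    """Exact Match, as used for question answering: the fraction of
    predictions that are identical to the ground truth."""
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def micro_f1(predictions: list[str], ground_truths: list[str]) -> float:
    """Micro-averaged F1 for single-label classification. With exactly one
    predicted label per example, micro-F1 reduces to accuracy, because each
    wrong prediction counts as both a false positive and a false negative."""
    hits = sum(p == g for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: evaluate two question answering pairs.
print(exact_match(["Paris", "42"], ["Paris", "41"]))  # 0.5
```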
Supported models

You can evaluate the following models:

- text-bison: Base and tuned versions.

Prepare and upload the evaluation dataset

The evaluation dataset includes prompt and ground truth pairs that align with the task that you want to evaluate. Your dataset must include at least one prompt and ground truth pair, and we recommend at least 10 pairs for meaningful metrics. The more examples you provide, the more meaningful the results.

Dataset format
Your evaluation dataset must be in the JSON Lines (JSONL) format, where each line is a JSON object. Each object must contain an input_text field with the prompt that you want to evaluate and an output_text field with the ideal response for that prompt. The maximum token length for input_text is 8,192, and the maximum token length for output_text is 1,024.
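For example, a summarization dataset could contain lines like the following. The input_text and output_text field names come from this guide; the prompt and response values are illustrative:

```
{"input_text": "Summarize the following article: The city council voted on Tuesday to approve the new transit budget...", "output_text": "The city council approved the new transit budget."}
{"input_text": "Summarize the following article: Researchers reported early results from a trial of a grid-scale battery...", "output_text": "Researchers reported early results from a grid-scale battery trial."}
```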
Upload the dataset to Cloud Storage

You can create a new Cloud Storage bucket or use an existing one to store your dataset file. The bucket must be in the same region as the model. After your bucket is ready, upload your dataset file to the bucket.
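If you prefer to upload from code, the following is a minimal sketch that uses the google-cloud-storage client library. The bucket name and file paths are placeholders for this example:

```python
from google.cloud import storage

# Placeholders for this example: use your own bucket (in the same region
# as the model) and the local path of your JSONL dataset file.
BUCKET_NAME = "my-gcs-bucket-uri"
LOCAL_DATASET_PATH = "dataset.jsonl"

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Uploads the file so that it's available at gs://my-gcs-bucket-uri/dataset.jsonl.
bucket.blob("dataset.jsonl").upload_from_filename(LOCAL_DATASET_PATH)
```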
Choose an evaluation method

You can run a computation-based evaluation job using the Google Cloud console, the REST API, or the Vertex AI SDK for Python. The following table can help you choose the best option for your use case.

| Method | Description |
| --- | --- |
| Google Cloud console | A graphical user interface (GUI) that provides a guided, step-by-step workflow for creating and monitoring evaluation jobs. |
| REST API | A programmatic interface for creating evaluation jobs by sending JSON requests to an endpoint. |
| Vertex AI SDK for Python | A high-level Python library that simplifies interactions with the Vertex AI API. |
Perform model evaluation

Use one of the following methods to perform a model evaluation job.

REST

To create a model evaluation job, send a POST request using the pipelineJobs method.

Before using any of the request data, make the following replacements:

- PIPELINEJOB_DISPLAYNAME: A display name for the pipeline job.
- PROJECT_ID: Your Google Cloud project ID.
- LOCATION: The region in which to run the pipeline job. Only us-central1 is supported.
- DATASET_URI: The Cloud Storage URI of your evaluation dataset file. Example: gs://my-gcs-bucket-uri/dataset.jsonl.
- OUTPUT_DIR: The Cloud Storage URI of the directory that stores the evaluation output. Example: gs://my-gcs-bucket-uri/output.
- MODEL_NAME: The model to evaluate. For a publisher model, specify publishers/google/models/MODEL@MODEL_VERSION. Example: publishers/google/models/text-bison@002. For a tuned model, specify projects/PROJECT_NUMBER/locations/LOCATION/models/ENDPOINT_ID. Example: projects/123456789012/locations/us-central1/models/1234567890123456789. The evaluation job doesn't impact any existing deployments of the model or their resources.
- EVALUATION_TASK: The task to evaluate the model on. One of summarization, question-answering, text-generation, or classification.
- INSTANCES_FORMAT: The format of your dataset. Only jsonl is supported. To learn more about this parameter, see InputConfig.
- PREDICTIONS_FORMAT: The format of the prediction output. Only jsonl is supported. To learn more about this parameter, see InputConfig.
- MACHINE_TYPE: (Optional) The machine type used to run the evaluation job, for example e2-highmem-16. For a list of supported machine types, see Machine types.
- SERVICE_ACCOUNT: (Optional) The service account used to run the evaluation job.
- NETWORK: (Optional) The Compute Engine network to peer the evaluation job with, in the format projects/PROJECT_NUMBER/global/networks/NETWORK_NAME. If you specify this field, you need to have a VPC Network Peering for Vertex AI. If left unspecified, the evaluation job is not peered with any network.
- KEY_NAME: (Optional) The Cloud KMS key name, in the format projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY. The key needs to be in the same region as the evaluation job.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs

Request JSON body:
{
"displayName": "PIPELINEJOB_DISPLAYNAME",
"runtimeConfig": {
"gcsOutputDirectory": "gs://OUTPUT_DIR",
"parameterValues": {
"project": "PROJECT_ID",
"location": "LOCATION",
"batch_predict_gcs_source_uris": ["gs://DATASET_URI"],
"batch_predict_gcs_destination_output_uri": "gs://OUTPUT_DIR",
"model_name": "MODEL_NAME",
"evaluation_task": "EVALUATION_TASK",
"batch_predict_instances_format": "INSTANCES_FORMAT",
"batch_predict_predictions_format: "PREDICTIONS_FORMAT",
"machine_type": "MACHINE_TYPE",
"service_account": "SERVICE_ACCOUNT",
"network": "NETWORK",
"encryption_spec_key_name": "KEY_NAME"
}
},
"templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}
To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs" | Select-Object -Expand ContentpipelineSpec
has been truncated to save space.
Example curl command
PROJECT_ID=myproject
REGION=us-central1
MODEL_NAME=publishers/google/models/text-bison@002
TEST_DATASET_URI=gs://my-gcs-bucket-uri/dataset.jsonl
OUTPUT_DIR=gs://my-gcs-bucket-uri/output
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/pipelineJobs" -d \
$'{
"displayName": "evaluation-llm-text-generation-pipeline",
"runtimeConfig": {
"gcsOutputDirectory": "'${OUTPUT_DIR}'",
"parameterValues": {
"project": "'${PROJECT_ID}'",
"location": "'${REGION}'",
"batch_predict_gcs_source_uris": ["'${TEST_DATASET_URI}'"],
"batch_predict_gcs_destination_output_uri": "'${OUTPUT_DIR}'",
"model_name": "'${MODEL_NAME}'",
}
},
"templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}'
Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
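The official SDK sample isn't reproduced here. As a minimal sketch, you can submit the same pipeline template that the REST example uses with aiplatform.PipelineJob; the project, bucket, model, and task values below are placeholders that mirror the REST request:

```python
from google.cloud import aiplatform

# Placeholders: replace with your own project, region, bucket, and model.
PROJECT_ID = "my-project"
LOCATION = "us-central1"  # Only us-central1 is supported.
OUTPUT_DIR = "gs://my-gcs-bucket-uri/output"
DATASET_URI = "gs://my-gcs-bucket-uri/dataset.jsonl"
MODEL_NAME = "publishers/google/models/text-bison@002"

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=OUTPUT_DIR)

# Submit the evaluation pipeline template used in the REST example.
job = aiplatform.PipelineJob(
    display_name="evaluation-llm-text-generation-pipeline",
    template_path=(
        "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/"
        "evaluation-llm-text-generation-pipeline/1.0.1"
    ),
    pipeline_root=OUTPUT_DIR,
    parameter_values={
        "project": PROJECT_ID,
        "location": LOCATION,
        "batch_predict_gcs_source_uris": [DATASET_URI],
        "batch_predict_gcs_destination_output_uri": OUTPUT_DIR,
        "model_name": MODEL_NAME,
        "evaluation_task": "text-generation",
    },
)

job.submit()  # Or job.run() to block until the pipeline finishes.
```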
Console

To create a model evaluation job using the Google Cloud console, follow these steps:

1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model that you want to evaluate.
3. In the Evaluate tab, create an evaluation, specifying the evaluation task, the ground_truth column, and your evaluation dataset. Only the .jsonl dataset format is supported.
View evaluation results
You can find the evaluation results in the Cloud Storage output directory that you specified when creating the evaluation job. The file is named evaluation_metrics.json.
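For example, assuming your output bucket is my-gcs-bucket-uri, you could read the metrics file with the google-cloud-storage client library. The exact object path depends on your evaluation run, so check the output directory first:

```python
import json

from google.cloud import storage

# Placeholders: your output bucket and the path of the metrics file within
# the output directory that you specified for the evaluation job.
BUCKET_NAME = "my-gcs-bucket-uri"
METRICS_PATH = "output/evaluation_metrics.json"  # Example path; check your output directory.

client = storage.Client()
metrics_text = client.bucket(BUCKET_NAME).blob(METRICS_PATH).download_as_text()
print(json.dumps(json.loads(metrics_text), indent=2))
```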
For tuned models, you can also view evaluation results in the Google Cloud console:
1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model to view its evaluation metrics.
3. In the Evaluate tab, click the name of the evaluation run that you want to view.
What's next
- Learn about generative AI evaluation.
- Learn about online evaluation with Gen AI Evaluation Service.
- Learn how to tune a foundation model.