You can evaluate the performance of foundation models and your tuned generative
AI models on Vertex AI. The models are evaluated using a set of metrics
against an evaluation dataset that you provide. This page explains how
computation-based model evaluation through the evaluation pipeline service
works, how to create and format the evaluation dataset, and how to perform the
evaluation using the Google Cloud console, Vertex AI API, or the
Vertex AI SDK for Python.
How computation-based model evaluation works

To evaluate the performance of a model, you first create an evaluation dataset
that contains prompt and ground truth pairs. For each pair, the prompt is the
input that you want to evaluate, and the ground truth is the ideal response for
that prompt. During evaluation, the prompt in each pair of the evaluation
dataset is passed to the model to produce an output. The output generated by
the model and the ground truth from the evaluation dataset are used to compute
the evaluation metrics.

The type of metrics used for evaluation depends on the task that you are
evaluating. The following table shows the supported tasks and the metrics used
to evaluate each task:
Task               | Metric
-------------------|----------------------------------
Classification     | Micro-F1, Macro-F1, Per class F1
Summarization      | ROUGE-L
Question answering | Exact Match
Text generation    | BLEU, ROUGE-L
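To make the role of these metrics concrete, the following sketch shows how a
computation-based metric such as Exact Match is derived from model outputs and
ground truth responses. This is an illustration only, not the evaluation
pipeline's implementation; the pipeline computes the metrics in the table for
you.

# Illustration only: compute Exact Match over (model output, ground truth) pairs.
def exact_match(outputs: list[str], ground_truths: list[str]) -> float:
    matches = sum(
        1
        for output, truth in zip(outputs, ground_truths)
        if output.strip() == truth.strip()
    )
    return matches / len(ground_truths)

# Example with made-up values: one of the two outputs matches its ground truth.
print(exact_match(["Paris", "4"], ["Paris", "5"]))  # 0.5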
Supported models

Model evaluation is supported for the following models:

- text-bison: Base and tuned versions.
- Gemini: All tasks except classification.

Prepare evaluation dataset

The evaluation dataset that's used for model evaluation includes prompt and
ground truth pairs that align with the task that you want to evaluate. Your
dataset must include a minimum of one prompt and ground truth pair, and you
need at least 10 pairs for meaningful metrics. The more examples you provide,
the more meaningful the results.
Dataset format

Your evaluation dataset must be in JSON Lines (JSONL) format, where each line
contains a single prompt and ground truth pair specified in the input_text and
output_text fields, respectively. The input_text field contains the prompt that
you want to evaluate, and the output_text field contains the ideal response for
the prompt.

The maximum token length for input_text is 8,192, and the maximum token length
for output_text is 1,024.
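For example, a small summarization dataset could be written as follows. This is
a minimal sketch in Python; the prompts and responses are placeholders, and
only the input_text and output_text field names come from the format described
above.

# Minimal sketch: write an evaluation dataset in the required JSONL format.
# The example pairs below are placeholders.
import json

examples = [
    {
        "input_text": "Summarize the following article: ...",
        "output_text": "A one-sentence reference summary of the article.",
    },
    {
        "input_text": "Summarize the following article: ...",
        "output_text": "Another one-sentence reference summary.",
    },
]

with open("dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")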
Upload evaluation dataset to Cloud Storage

You can either create a new Cloud Storage bucket or use an existing one to
store your dataset file. The bucket must be in the same region as the model.
After your bucket is ready, upload your dataset file to the bucket.
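For example, you can upload the file with the Cloud Storage client library for
Python (or with the gcloud storage or gsutil command-line tools). This is a
minimal sketch, assuming the google-cloud-storage package is installed; the
project, bucket, and object names are placeholders.

# Minimal sketch: upload the dataset file to an existing Cloud Storage bucket.
# Replace the placeholder project, bucket, and object names with your own.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-evaluation-bucket")   # must be in the model's region
blob = bucket.blob("datasets/dataset.jsonl")
blob.upload_from_filename("dataset.jsonl")

print(f"Uploaded to gs://{bucket.name}/{blob.name}")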
Perform model evaluation

You can evaluate models by using the REST API or the Google Cloud console.

REST

To create a model evaluation job, send a POST request by using the
pipelineJobs method.

Before using any of the request data, make the following replacements:
- PROJECT_ID: Your Google Cloud project ID.
- PIPELINEJOB_DISPLAYNAME: A display name for the pipeline job.
- LOCATION: The region in which to run the pipeline job. Only us-central1 is
  supported.
- DATASET_URI: The Cloud Storage path to your evaluation dataset file.
- OUTPUT_DIR: The Cloud Storage path of the directory in which to store the
  evaluation output.
- MODEL_NAME: Specify a publisher model or a tuned model resource name as
  follows:
  - Publisher model: publishers/google/models/MODEL@MODEL_VERSION
    Example: publishers/google/models/text-bison@002
  - Tuned model: projects/PROJECT_NUMBER/locations/LOCATION/models/ENDPOINT_ID
    Example: projects/123456789012/locations/us-central1/models/1234567890123456789
  The evaluation job doesn't impact any existing deployments of the model or
  their resources.
- EVALUATION_TASK: The task that you want to evaluate the model on. Specify one
  of summarization, question-answering, text-generation, or classification.
- INSTANCES_FORMAT: The format of your dataset. Only jsonl is supported. To
  learn more about this parameter, see InputConfig.
- PREDICTIONS_FORMAT: The format of the evaluation output. Only jsonl is
  supported. To learn more about this parameter, see InputConfig.
- MACHINE_TYPE: The machine type for running the evaluation job. The default is
  e2-highmem-16. For a list of supported machine types, see Machine types.
- SERVICE_ACCOUNT: The service account to use for running the evaluation job.
- NETWORK: The full name of the Compute Engine network to peer the evaluation
  job to, in the form projects/PROJECT_NUMBER/global/networks/NETWORK_NAME. If
  you specify this field, you need to have a VPC Network Peering for Vertex AI.
  If left unspecified, the evaluation job is not peered with any network.
- KEY_NAME: The Cloud KMS key to use for encryption, in the form
  projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING/cryptoKeys/KEY. The
  key needs to be in the same region as the evaluation job.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs

Request JSON body:
{
  "displayName": "PIPELINEJOB_DISPLAYNAME",
  "runtimeConfig": {
    "gcsOutputDirectory": "gs://OUTPUT_DIR",
    "parameterValues": {
      "project": "PROJECT_ID",
      "location": "LOCATION",
      "batch_predict_gcs_source_uris": ["gs://DATASET_URI"],
      "batch_predict_gcs_destination_output_uri": "gs://OUTPUT_DIR",
      "model_name": "MODEL_NAME",
      "evaluation_task": "EVALUATION_TASK",
      "batch_predict_instances_format": "INSTANCES_FORMAT",
      "batch_predict_predictions_format": "PREDICTIONS_FORMAT",
      "machine_type": "MACHINE_TYPE",
      "service_account": "SERVICE_ACCOUNT",
      "network": "NETWORK",
      "encryption_spec_key_name": "KEY_NAME"
    }
  },
  "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}
To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following
command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs"

PowerShell

Save the request body in a file named request.json, and execute the following
command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/pipelineJobs" | Select-Object -Expand Content

You should receive a JSON response similar to the following. Note that
pipelineSpec has been truncated to save space.
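The exact contents depend on your project and job; the following is only a
representative, abbreviated shape of a PipelineJob resource, with placeholder
values.

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/pipelineJobs/PIPELINE_JOB_ID",
  "displayName": "PIPELINEJOB_DISPLAYNAME",
  "createTime": "2024-01-01T00:00:00.000000Z",
  "state": "PIPELINE_STATE_PENDING",
  "pipelineSpec": { ... }
}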
Example curl command
PROJECT_ID=myproject
REGION=us-central1
MODEL_NAME=publishers/google/models/text-bison@002
TEST_DATASET_URI=gs://my-gcs-bucket-uri/dataset.jsonl
OUTPUT_DIR=gs://my-gcs-bucket-uri/output
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/pipelineJobs" -d \
$'{
  "displayName": "evaluation-llm-text-generation-pipeline",
  "runtimeConfig": {
    "gcsOutputDirectory": "'${OUTPUT_DIR}'",
    "parameterValues": {
      "project": "'${PROJECT_ID}'",
      "location": "'${REGION}'",
      "batch_predict_gcs_source_uris": ["'${TEST_DATASET_URI}'"],
      "batch_predict_gcs_destination_output_uri": "'${OUTPUT_DIR}'",
      "model_name": "'${MODEL_NAME}'"
    }
  },
  "templateUri": "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/evaluation-llm-text-generation-pipeline/1.0.1"
}'
Python

To learn how to install or update the Vertex AI SDK for Python, see Install the
Vertex AI SDK for Python. For more information, see the
Python API reference documentation.
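You can run the same evaluation pipeline template from Python with the SDK's
PipelineJob class. The following is a minimal sketch, assuming the
google-cloud-aiplatform package is installed; the project, bucket paths, and
model name are placeholders, and the parameter values mirror the REST request
body shown earlier.

# Minimal sketch: submit the evaluation pipeline with the Vertex AI SDK for Python.
# Replace the placeholder project, bucket, and model values with your own.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="evaluation-llm-text-generation-pipeline",
    template_path=(
        "https://us-kfp.pkg.dev/vertex-evaluation/pipeline-templates/"
        "evaluation-llm-text-generation-pipeline/1.0.1"
    ),
    pipeline_root="gs://my-evaluation-bucket/output",
    parameter_values={
        "project": "my-project",
        "location": "us-central1",
        "batch_predict_gcs_source_uris": [
            "gs://my-evaluation-bucket/datasets/dataset.jsonl"
        ],
        "batch_predict_gcs_destination_output_uri": "gs://my-evaluation-bucket/output",
        "model_name": "publishers/google/models/text-bison@002",
    },
)

job.submit()  # or job.run(sync=True) to block until the pipeline finishes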
Console

To create a model evaluation job by using the Google Cloud console, go to the
model's Evaluate tab in the Vertex AI Model Registry and create an evaluation.
When you configure the job, specify ground_truth as the ground truth column,
and note that only the .jsonl dataset format is supported.
View evaluation results
You can find the evaluation results in the Cloud Storage output directory
that you specified when creating the evaluation job. The file is named
evaluation_metrics.json.
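To inspect the metrics programmatically, you can read the file with the Cloud
Storage client library for Python. This is a minimal sketch, assuming the
google-cloud-storage package is installed; the project, bucket, and prefix are
placeholders, and because the pipeline may nest its output under run-specific
subdirectories, the sketch searches for the file by name.

# Minimal sketch: find and read evaluation_metrics.json under the output directory.
import json
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-evaluation-bucket")     # bucket from OUTPUT_DIR
for blob in bucket.list_blobs(prefix="output/"):   # prefix from OUTPUT_DIR
    if blob.name.endswith("evaluation_metrics.json"):
        metrics = json.loads(blob.download_as_text())
        print(blob.name, metrics)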
For tuned models, you can also view evaluation results in the Google Cloud console:
1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Model Registry page.
2. Click the name of the model to view its evaluation metrics.
3. In the Evaluate tab, click the name of the evaluation run that you want to view.
What's next
- Learn about generative AI evaluation.
- Learn about online evaluation with Gen AI Evaluation Service.
- Learn how to tune a foundation model.