Resource: EvaluationRun
EvaluationRun is a resource that represents a single evaluation run, which includes a set of prompts, model responses, evaluation configuration and the resulting metrics.
name
string
Identifier. The resource name of the EvaluationRun. This is a unique identifier. Format: projects/{project}/locations/{location}/evaluationRuns/{evaluationRun}
displayName
string
Required. The display name of the Evaluation Run.
metadata
value (Value format)
Optional. Metadata about the evaluation run; can be used by the caller to store additional tracking information about the evaluation run.
labels
map (key: string, value: string)
Optional. Labels for the evaluation run.
dataSource
object (DataSource)
Required. The data source for the evaluation run.
Optional. Map of candidate name to inference config for the evaluation run. The candidate name can be up to 128 characters long and can consist of any UTF-8 characters.
Required. The configuration used for the evaluation.
Output only. The state of the evaluation run.
Output only. Only populated when the evaluation run's state is FAILED or CANCELLED.
Output only. The results of the evaluation run. Only populated when the evaluation run's state is SUCCEEDED.
Output only. Time when the evaluation run was created.
Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
Output only. Time when the evaluation run was completed.
Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
evaluationSetSnapshot
string
Output only. The specific evaluation set of the evaluation run. For runs with an evaluation set input, this will be that same set. For runs with BigQuery input, it's the sampled BigQuery dataset.
JSON representation
{
  "name": string,
  "displayName": string,
  "metadata": value,
  "labels": {
    string: string,
    ...
  },
  "dataSource": {
    object (DataSource)
  },
  ...
}
DataSource
The data source for the evaluation run.
source
Union type
source
can be only one of the following:
evaluationSet
string
The EvaluationSet resource name. Format: projects/{project}/locations/{location}/evaluationSets/{evaluationSet}
bigqueryRequestSet
object (BigQueryRequestSet)
Evaluation data in BigQuery.
JSON representation
{
  // source
  "evaluationSet": string,
  "bigqueryRequestSet": {
    object (BigQueryRequestSet)
  }
  // Union type
}
BigQueryRequestSet
The request set for the evaluation run.
uri
string
Required. The URI of a BigQuery table. e.g. bq://projectId.bqDatasetId.bqTableId
promptColumn
string
Optional. The name of the column that contains the requests to evaluate. This will be in evaluationItem.EvalPrompt format.
rubricsColumn
string
Optional. The name of the column that contains the rubrics. This is in evaluation_rubric.RubricGroup format.
candidateResponseColumns
map (key: string, value: string)
Optional. Map of candidate name to candidate response column name. The column will be in evaluationItem.CandidateResponse format.
samplingConfig
object (SamplingConfig)
Optional. The sampling config for the BigQuery resource.
JSON representation
{
  "uri": string,
  "promptColumn": string,
  "rubricsColumn": string,
  "candidateResponseColumns": {
    string: string,
    ...
  },
  "samplingConfig": {
    object (SamplingConfig)
  }
}
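A sketch of a BigQueryRequestSet; the table URI and column names are hypothetical, and only the field names and the RANDOM sampling method come from this reference:

```json
{
  // Hypothetical table and column names, shown for illustration only.
  "uri": "bq://my-project.my_dataset.eval_requests",
  "promptColumn": "prompt",
  "rubricsColumn": "rubrics",
  "candidateResponseColumns": {
    "candidate_1": "model_a_response"
  },
  "samplingConfig": {
    "samplingCount": 100,
    "samplingMethod": "RANDOM"
  }
}
```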
SamplingConfig
The sampling config.
samplingCount
integer
Optional. The total number of logged data to import. If available data is less than the sampling count, all data will be imported. Default is 100.
samplingMethod
enum (SamplingMethod)
Optional. The sampling method to use.
Optional. How long to wait before sampling data from the BigQuery table. If not specified, defaults to 0.
A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
JSON representation
{
  "samplingCount": integer,
  "samplingMethod": enum (SamplingMethod),
  ...
}
SamplingMethod
The sampling method to use.
Enums | |
---|---|
SAMPLING_METHOD_UNSPECIFIED | Unspecified sampling method. |
RANDOM | Random sampling. |
InferenceConfig
An inference config used for model inference during the evaluation run.
model
string
Required. The fully qualified name of the publisher model or endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
model_config
Union type
model_config
can be only one of the following:
generationConfig
object (GenerationConfig)
Optional. Generation config.
JSON representation
{
  "model": string,
  // model_config
  "generationConfig": {
    object (GenerationConfig)
  }
  // Union type
}
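A minimal InferenceConfig sketch. The model path is a placeholder, and the fields inside generationConfig (temperature, maxOutputTokens) are assumed from the standard Vertex AI GenerationConfig type, which is not defined on this page:

```json
{
  "model": "projects/my-project/locations/us-central1/publishers/google/models/gemini-2.5-flash",
  "generationConfig": {
    // Assumed standard GenerationConfig fields; see the GenerationConfig reference.
    "temperature": 0.2,
    "maxOutputTokens": 1024
  }
}
```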
EvaluationConfig
The evaluation configuration used for the evaluation run.
metrics[]
object (EvaluationRunMetric)
Required. The metrics to be calculated in the evaluation run.
Optional. The rubric configs for the evaluation run. They are used to generate rubrics which can be used by rubric-based metrics. Multiple rubric configs can be specified for rubric generation but only one rubric config can be used for a rubric-based metric. If more than one rubric config is provided, the evaluation metric must specify a rubric group key. Note that if a generation spec is specified on both a rubric config and an evaluation metric, the rubrics generated for the metric will be used for evaluation.
Optional. The output config for the evaluation run.
Optional. The autorater config for the evaluation run.
The prompt template used for inference. The values for variables in the prompt template are defined in EvaluationItem.EvaluationPrompt.PromptTemplateData.values.
JSON representation
{
  "metrics": [
    {
      object (EvaluationRunMetric)
    }
  ],
  ...
}
EvaluationRunMetric
The metric used for evaluation runs.
metric
string
Required. The name of the metric.
metric_spec
Union type
metric_spec
can be only one of the following:
rubricBasedMetricSpec
object (RubricBasedMetricSpec)
Spec for rubric based metric.
Spec for a pre-defined metric.
Spec for an LLM based metric.
JSON representation
{
  "metric": string,
  // metric_spec
  "rubricBasedMetricSpec": {
    object (RubricBasedMetricSpec)
  },
  ...
}
RubricBasedMetricSpec
Specification for a metric that is based on rubrics.
metricPromptTemplate
string
Optional. Template for the prompt used by the judge model to evaluate against rubrics.
rubrics_source
Union type
rubrics_source
can be only one of the following:
inlineRubrics
object (RepeatedRubrics)
Use rubrics provided directly in the spec.
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input content. This refers to a key in the rubricGroups map of RubricEnhancedContents.
Dynamically generate rubrics for evaluation using this specification.
Optional. Configuration for the judge LLM (Autorater). The definition of AutoraterConfig needs to be provided.
JSON representation
{
  "metricPromptTemplate": string,
  // rubrics_source
  "inlineRubrics": {
    object (RepeatedRubrics)
  },
  ...
}
RepeatedRubrics
RubricGenerationSpec
Specification for how rubrics should be generated.
promptTemplate
string
Optional. Template for the prompt used to generate rubrics. The details should be updated based on the most-recent recipe requirements.
rubricContentType
enum (RubricContentType)
Optional. The type of rubric content to be generated.
rubricTypeOntology[]
string
Optional. An optional, pre-defined list of allowed types for generated rubrics. If this field is provided, it implies include_rubric_type should be true, and the generated rubric types should be chosen from this ontology.
Optional. Configuration for the model used in rubric generation. Configs including sampling count and base model can be specified here. Flipping is not supported for rubric generation.
JSON representation
{
  "promptTemplate": string,
  "rubricContentType": enum (RubricContentType),
  ...
}
AutoraterConfig
The autorater config used for the evaluation run.
autoraterModel
string
Optional. The fully qualified name of the publisher model or tuned autorater endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Tuned model endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
generationConfig
object (GenerationConfig)
Optional. Configuration options for model generation and outputs.
sampleCount
integer
Optional. Number of samples for each instance in the dataset. If not specified, the default is 4. Minimum value is 1, maximum value is 32.
JSON representation
{
  "autoraterModel": string,
  "generationConfig": {
    object (GenerationConfig)
  },
  ...
}
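An AutoraterConfig sketch using a placeholder publisher model path and the documented default of 4 samples per instance:

```json
{
  // Placeholder judge model path.
  "autoraterModel": "projects/my-project/locations/us-central1/publishers/google/models/gemini-2.5-pro",
  "sampleCount": 4
}
```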
RubricContentType
Specifies the type of rubric content to generate.
Enums | |
---|---|
RUBRIC_CONTENT_TYPE_UNSPECIFIED | The content type to generate is not specified. |
PROPERTY | Generate rubrics based on properties. |
NL_QUESTION_ANSWER | Generate rubrics in an NL question answer format. |
PYTHON_CODE_ASSERTION | Generate rubrics in a unit test format. |
PredefinedMetricSpec
Specification for a pre-defined metric.
metricSpecName
string
Required. The name of a pre-defined metric, such as "instruction_following_v1" or "text_quality_v1".
parameters
object
Optional. The parameters needed to run the pre-defined metric.
JSON representation
{
  "metricSpecName": string,
  "parameters": {
    object
  }
}
LLMBasedMetricSpec
Specification for an LLM based metric.
rubrics_source
Union type
rubrics_source
can be only one of the following:
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input. Refers to a key in the rubricGroups map of EvaluationInstance.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
Dynamically generate rubrics using a predefined spec.
metricPromptTemplate
string
Required. Template for the prompt sent to the judge model.
systemInstruction
string
Optional. System instructions for the judge model.
Optional. Configuration for the judge LLM (Autorater).
Optional. Additional configuration for the metric.
JSON representation
{
  // rubrics_source
  "rubricGroupKey": string,
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  ...
}
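An LLMBasedMetricSpec sketch using the rubricGroupKey variant; the group key, template text, and system instruction are hypothetical:

```json
{
  // Placeholder rubric group key and prompt text.
  "rubricGroupKey": "my_rubric_group",
  "metricPromptTemplate": "Evaluate the response against each rubric: {response}",
  "systemInstruction": "You are a strict, impartial evaluator."
}
```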
EvaluationRubricConfig
Configuration for a rubric group to be generated/saved for evaluation.
rubricGroupKey
string
Required. The key used to save the generated rubrics. If a generation spec is provided, this key will be used for the name of the generated rubric group. Otherwise, this key will be used to look up the existing rubric group on the evaluation item. Note that if a rubric group key is specified on both a rubric config and an evaluation metric, the key from the metric will be used to select the rubrics for evaluation.
generation_config
Union type
generation_config
can be only one of the following:
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
Dynamically generate rubrics using a predefined spec.
JSON representation
{
  "rubricGroupKey": string,
  // generation_config
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  ...
}
OutputConfig
The output config for the evaluation run.
bigqueryDestination
object (BigQueryDestination)
BigQuery destination for evaluation output.
Cloud Storage destination for evaluation output.
JSON representation
{
  "bigqueryDestination": {
    object (BigQueryDestination)
  },
  ...
}
BigQueryDestination
The BigQuery location for the output content.
outputUri
string
Required. BigQuery URI to a project or table, up to 2000 characters long.
When only the project is specified, the Dataset and Table are created. When the full table reference is specified, the Dataset must exist and the table must not exist.
Accepted forms:
- BigQuery path. For example: bq://projectId or bq://projectId.bqDatasetId or bq://projectId.bqDatasetId.bqTableId.
JSON representation
{
  "outputUri": string
}
PromptTemplate
Prompt template used for inference.
source
Union type
source
can be only one of the following:
promptTemplate
string
Inline prompt template. Template variables should be in the format "{var_name}". Example: "Translate the following from {source_lang} to {target_lang}: {text}"
gcsUri
string
Prompt template stored in Cloud Storage. Format: "gs://my-bucket/file-name.txt".
JSON representation
{
  // source
  "promptTemplate": string,
  "gcsUri": string
  // Union type
}
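A PromptTemplate sketch using the inline variant, reusing the example template from above; the variable values would be supplied per item via EvaluationItem.EvaluationPrompt.PromptTemplateData.values:

```json
{
  "promptTemplate": "Translate the following from {source_lang} to {target_lang}: {text}"
}
```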
State
The state of the evaluation run.
Enums | |
---|---|
STATE_UNSPECIFIED | Unspecified state. |
PENDING | The evaluation run is pending. |
RUNNING | The evaluation run is running. |
SUCCEEDED | The evaluation run has succeeded. |
FAILED | The evaluation run has failed. |
CANCELLED | The evaluation run has been cancelled. |
INFERENCE | The evaluation run is performing inference. |
GENERATING_RUBRICS | The evaluation run is performing rubric generation. |
EvaluationResults
The results of the evaluation run.
summaryMetrics
object (SummaryMetrics)
Optional. The summary metrics for the evaluation run.
evaluationSet
string
The evaluation set where item level results are stored.
JSON representation
{
  "summaryMetrics": {
    object (SummaryMetrics)
  },
  ...
}
SummaryMetrics
The summary metrics for the evaluation run.
metrics
map (key: string, value: value)
Optional. Map of metric name to metric value.
totalItems
integer
Optional. The total number of items that were evaluated.
failedItems
integer
Optional. The number of items that failed to be evaluated.
JSON representation
{
  "metrics": {
    string: value,
    ...
  },
  "totalItems": integer,
  "failedItems": integer
}
Methods | |
---|---|
cancel | Cancels an Evaluation Run. |
create | Creates an Evaluation Run. |
delete | Deletes an Evaluation Run. |
get | Gets an Evaluation Run. |
list | Lists Evaluation Runs. |