Resource: EvaluationRun
EvaluationRun is a resource that represents a single evaluation run, which includes a set of prompts, model responses, evaluation configuration and the resulting metrics.
name
string
Identifier. The resource name of the EvaluationRun. This is a unique identifier. Format: projects/{project}/locations/{location}/evaluationRuns/{evaluationRun}
displayName
string
Required. The display name of the Evaluation Run.
metadata
value (Value format)
Optional. Metadata about the evaluation run; can be used by the caller to store additional tracking information about the evaluation run.
labels
map (key: string, value: string)
Optional. Labels for the evaluation run.
dataSource
object (DataSource)
Required. The data source for the evaluation run.
Optional. Map of candidate name to inference config for the evaluation run. The candidate name can be up to 128 characters long and can consist of any UTF-8 characters.
Required. The configuration used for the evaluation.
Output only. The state of the evaluation run.
Output only. Only populated when the evaluation run's state is FAILED or CANCELLED.
Output only. The results of the evaluation run. Only populated when the evaluation run's state is SUCCEEDED.
Output only. Time when the evaluation run was created.
Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
Output only. Time when the evaluation run was completed.
Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: "2014-10-02T15:01:23Z", "2014-10-02T15:01:23.045123456Z" or "2014-10-02T15:01:23+05:30".
evaluationSetSnapshot
string
Output only. The specific evaluation set of the evaluation run. For runs with an evaluation set input, this will be that same set. For runs with BigQuery input, it's the sampled BigQuery dataset.
JSON representation
{
  "name": string,
  "displayName": string,
  "metadata": value,
  "labels": {
    string: string,
    ...
  },
  "dataSource": {
    object (DataSource)
  },
  ...
}
DataSource
The data source for the evaluation run.
source
Union type
source
can be only one of the following:
evaluationSet
string
The EvaluationSet resource name. Format: projects/{project}/locations/{location}/evaluationSets/{evaluationSet}
bigqueryRequestSet
object (BigQueryRequestSet)
Evaluation data in BigQuery.
JSON representation
{
  // source
  "evaluationSet": string,
  "bigqueryRequestSet": {
    object (BigQueryRequestSet)
  }
  // Union type
}
BigQueryRequestSet
The request set for the evaluation run.
uri
string
Required. The URI of a BigQuery table. e.g. bq://projectId.bqDatasetId.bqTableId
promptColumn
string
Optional. The name of the column that contains the requests to evaluate. This will be in evaluationItem.EvalPrompt format.
rubricsColumn
string
Optional. The name of the column that contains the rubrics. This is in evaluation_rubric.RubricGroup format.
candidateResponseColumns
map (key: string, value: string)
Optional. Map of candidate name to candidate response column name. The column will be in evaluationItem.CandidateResponse format.
samplingConfig
object (SamplingConfig)
Optional. The sampling config for the BigQuery resource.
JSON representation
{
  "uri": string,
  "promptColumn": string,
  "rubricsColumn": string,
  "candidateResponseColumns": {
    string: string,
    ...
  },
  "samplingConfig": {
    object (SamplingConfig)
  }
}
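A sketch of a BigQueryRequestSet; the table URI and column names are hypothetical, and only the field names and the RANDOM sampling method come from this reference:

```json
{
  // Hypothetical table and column names, shown for illustration only.
  "uri": "bq://my-project.my_dataset.eval_requests",
  "promptColumn": "prompt",
  "rubricsColumn": "rubrics",
  "candidateResponseColumns": {
    "candidate_1": "model_a_response"
  },
  "samplingConfig": {
    "samplingCount": 100,
    "samplingMethod": "RANDOM"
  }
}
```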
SamplingConfig
The sampling config.
samplingCount
integer
Optional. The total number of logged data to import. If available data is less than the sampling count, all data will be imported. Default is 100.
samplingMethod
enum (SamplingMethod)
Optional. The sampling method to use.
Optional. How long to wait before sampling data from the BigQuery table. If not specified, defaults to 0.
A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
JSON representation
{
  "samplingCount": integer,
  "samplingMethod": enum (SamplingMethod),
  ...
}
SamplingMethod
The sampling method to use.
Enums | |
---|---|
SAMPLING_METHOD_UNSPECIFIED | Unspecified sampling method. |
RANDOM | Random sampling. |
InferenceConfig
An inference config used for model inference during the evaluation run.
model
string
Required. The fully qualified name of the publisher model or endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
model_config
Union type
model_config
can be only one of the following:
generationConfig
object (GenerationConfig)
Optional. Generation config.
JSON representation
{
  "model": string,
  // model_config
  "generationConfig": {
    object (GenerationConfig)
  }
  // Union type
}
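A minimal InferenceConfig sketch. The model path is a placeholder, and the fields inside generationConfig (temperature, maxOutputTokens) are assumed from the standard Vertex AI GenerationConfig type, which is not defined on this page:

```json
{
  "model": "projects/my-project/locations/us-central1/publishers/google/models/gemini-2.5-flash",
  "generationConfig": {
    // Assumed standard GenerationConfig fields; see the GenerationConfig reference.
    "temperature": 0.2,
    "maxOutputTokens": 1024
  }
}
```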
EvaluationConfig
The evaluation configuration used for the evaluation run.
metrics[]
object (EvaluationRunMetric)
Required. The metrics to be calculated in the evaluation run.
Optional. The rubric configs for the evaluation run. They are used to generate rubrics which can be used by rubric-based metrics. Multiple rubric configs can be specified for rubric generation but only one rubric config can be used for a rubric-based metric. If more than one rubric config is provided, the evaluation metric must specify a rubric group key. Note that if a generation spec is specified on both a rubric config and an evaluation metric, the rubrics generated for the metric will be used for evaluation.
Optional. The output config for the evaluation run.
Optional. The autorater config for the evaluation run.
The prompt template used for inference. The values for variables in the prompt template are defined in EvaluationItem.EvaluationPrompt.PromptTemplateData.values.
JSON representation
{
  "metrics": [
    {
      object (EvaluationRunMetric)
    }
  ],
  ...
}
EvaluationRunMetric
The metric used for evaluation runs.
metric
string
Required. The name of the metric.
metric_spec
Union type
metric_spec
can be only one of the following:
rubricBasedMetricSpec
object (RubricBasedMetricSpec)
Spec for rubric based metric.
Spec for a pre-defined metric.
Spec for an LLM based metric.
JSON representation
{
  "metric": string,
  // metric_spec
  "rubricBasedMetricSpec": {
    object (RubricBasedMetricSpec)
  },
  ...
}
RubricBasedMetricSpec
Specification for a metric that is based on rubrics.
metricPromptTemplate
string
Optional. Template for the prompt used by the judge model to evaluate against rubrics.
rubrics_source
Union type
rubrics_source
can be only one of the following:
inlineRubrics
object (RepeatedRubrics)
Use rubrics provided directly in the spec.
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input content. This refers to a key in the rubricGroups map of RubricEnhancedContents.
Dynamically generate rubrics for evaluation using this specification.
Optional. Configuration for the judge LLM (Autorater). The definition of AutoraterConfig needs to be provided.
JSON representation
{
  "metricPromptTemplate": string,
  // rubrics_source
  "inlineRubrics": {
    object (RepeatedRubrics)
  },
  ...
}
RepeatedRubrics
RubricGenerationSpec
Specification for how rubrics should be generated.
promptTemplate
string
Optional. Template for the prompt used to generate rubrics. The details should be updated based on the most-recent recipe requirements.
rubricContentType
enum (RubricContentType)
Optional. The type of rubric content to be generated.
rubricTypeOntology[]
string
Optional. An optional, pre-defined list of allowed types for generated rubrics. If this field is provided, it implies include_rubric_type should be true, and the generated rubric types should be chosen from this ontology.
Optional. Configuration for the model used in rubric generation. Configs including sampling count and base model can be specified here. Flipping is not supported for rubric generation.
JSON representation
{
  "promptTemplate": string,
  "rubricContentType": enum (RubricContentType),
  ...
}
AutoraterConfig
The autorater config used for the evaluation run.
autoraterModel
string
Optional. The fully qualified name of the publisher model or tuned autorater endpoint to use.
Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*
Tuned model endpoint format: projects/{project}/locations/{location}/endpoints/{endpoint}
generationConfig
object (GenerationConfig)
Optional. Configuration options for model generation and outputs.
sampleCount
integer
Optional. Number of samples for each instance in the dataset. If not specified, the default is 4. Minimum value is 1, maximum value is 32.
JSON representation
{
  "autoraterModel": string,
  "generationConfig": {
    object (GenerationConfig)
  },
  ...
}
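An AutoraterConfig sketch using a placeholder publisher model path and the documented default of 4 samples per instance:

```json
{
  // Placeholder judge model path.
  "autoraterModel": "projects/my-project/locations/us-central1/publishers/google/models/gemini-2.5-pro",
  "sampleCount": 4
}
```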
RubricContentType
Specifies the type of rubric content to generate.
Enums | |
---|---|
RUBRIC_CONTENT_TYPE_UNSPECIFIED | The content type to generate is not specified. |
PROPERTY | Generate rubrics based on properties. |
NL_QUESTION_ANSWER | Generate rubrics in an NL question answer format. |
PYTHON_CODE_ASSERTION | Generate rubrics in a unit test format. |
PredefinedMetricSpec
Specification for a pre-defined metric.
metricSpecName
string
Required. The name of a pre-defined metric, such as "instruction_following_v1" or "text_quality_v1".
parameters
object
Optional. The parameters needed to run the pre-defined metric.
JSON representation
{
  "metricSpecName": string,
  "parameters": {
    object
  }
}
LLMBasedMetricSpec
Specification for an LLM based metric.
rubrics_source
Union type
rubrics_source
can be only one of the following:
rubricGroupKey
string
Use a pre-defined group of rubrics associated with the input. Refers to a key in the rubricGroups map of EvaluationInstance.
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
Dynamically generate rubrics using a predefined spec.
metricPromptTemplate
string
Required. Template for the prompt sent to the judge model.
systemInstruction
string
Optional. System instructions for the judge model.
Optional. Configuration for the judge LLM (Autorater).
Optional. Additional configuration for the metric.
JSON representation
{
  // rubrics_source
  "rubricGroupKey": string,
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  ...
}
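An LLMBasedMetricSpec sketch using the rubricGroupKey variant; the group key, template text, and system instruction are hypothetical:

```json
{
  // Placeholder rubric group key and prompt text.
  "rubricGroupKey": "my_rubric_group",
  "metricPromptTemplate": "Evaluate the response against each rubric: {response}",
  "systemInstruction": "You are a strict, impartial evaluator."
}
```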
EvaluationRubricConfig
Configuration for a rubric group to be generated/saved for evaluation.
rubricGroupKey
string
Required. The key used to save the generated rubrics. If a generation spec is provided, this key will be used for the name of the generated rubric group. Otherwise, this key will be used to look up the existing rubric group on the evaluation item. Note that if a rubric group key is specified on both a rubric config and an evaluation metric, the key from the metric will be used to select the rubrics for evaluation.
generation_config
Union type
generation_config
can be only one of the following:
rubricGenerationSpec
object (RubricGenerationSpec)
Dynamically generate rubrics using this specification.
Dynamically generate rubrics using a predefined spec.
JSON representation
{
  "rubricGroupKey": string,
  // generation_config
  "rubricGenerationSpec": {
    object (RubricGenerationSpec)
  },
  ...
}
OutputConfig
The output config for the evaluation run.
bigqueryDestination
object (BigQueryDestination)
BigQuery destination for evaluation output.
Cloud Storage destination for evaluation output.
JSON representation
{
  "bigqueryDestination": {
    object (BigQueryDestination)
  },
  ...
}
BigQueryDestination
The BigQuery location for the output content.
outputUri
string
Required. BigQuery URI to a project or table, up to 2000 characters long.
When only the project is specified, the Dataset and Table are created. When the full table reference is specified, the Dataset must exist and the table must not exist.
Accepted forms:
- BigQuery path. For example: bq://projectId or bq://projectId.bqDatasetId or bq://projectId.bqDatasetId.bqTableId.
JSON representation
{
  "outputUri": string
}
PromptTemplate
Prompt template used for inference.
source
Union type
source
can be only one of the following:
promptTemplate
string
Inline prompt template. Template variables should be in the format "{var_name}". Example: "Translate the following from {source_lang} to {target_lang}: {text}"
gcsUri
string
Prompt template stored in Cloud Storage. Format: "gs://my-bucket/file-name.txt".
JSON representation
{
  // source
  "promptTemplate": string,
  "gcsUri": string
  // Union type
}
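A PromptTemplate sketch using the inline variant, reusing the example template from above; the variable values would be supplied per item via EvaluationItem.EvaluationPrompt.PromptTemplateData.values:

```json
{
  "promptTemplate": "Translate the following from {source_lang} to {target_lang}: {text}"
}
```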
State
The state of the evaluation run.
Enums | |
---|---|
STATE_UNSPECIFIED | Unspecified state. |
PENDING | The evaluation run is pending. |
RUNNING | The evaluation run is running. |
SUCCEEDED | The evaluation run has succeeded. |
FAILED | The evaluation run has failed. |
CANCELLED | The evaluation run has been cancelled. |
INFERENCE | The evaluation run is performing inference. |
GENERATING_RUBRICS | The evaluation run is performing rubric generation. |
EvaluationResults
The results of the evaluation run.
summaryMetrics
object (SummaryMetrics)
Optional. The summary metrics for the evaluation run.
evaluationSet
string
The evaluation set where item level results are stored.
JSON representation
{
  "summaryMetrics": {
    object (SummaryMetrics)
  },
  ...
}
SummaryMetrics
The summary metrics for the evaluation run.
metrics
map (key: string, value: value)
Optional. Map of metric name to metric value.
totalItems
integer
Optional. The total number of items that were evaluated.
failedItems
integer
Optional. The number of items that failed to be evaluated.
JSON representation
{
  "metrics": {
    string: value,
    ...
  },
  "totalItems": integer,
  "failedItems": integer
}
Methods | |
---|---|
cancel | Cancels an Evaluation Run. |
create | Creates an Evaluation Run. |
delete | Deletes an Evaluation Run. |
get | Gets an Evaluation Run. |
list | Lists Evaluation Runs. |