Package evaluation (1.72.0)

API documentation for evaluation package.

Classes

CustomMetric

The custom evaluation metric.

A fully customized CustomMetric that can be used to evaluate a single model by defining a metric function for a computation-based metric. The CustomMetric is computed client-side in the SDK using the user-defined metric function, not by the Vertex Gen AI Evaluation Service.

Attributes:
    * name: The name of the metric.
    * metric_function: The user-defined evaluation function that computes a
      metric score. It must take a dataset row dictionary as input and
      return a per-instance metric result as a dictionary. The metric score
      must be mapped to the name of the CustomMetric as the key.
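
A minimal sketch of a computation-based custom metric, following the attribute description above. The `vertexai.evaluation` import path and the word-count logic are illustrative assumptions; adapt them to your dataset.

```
# A sketch of a computation-based custom metric (assuming the
# `vertexai.evaluation` import path); the word-count logic is illustrative.
from vertexai.evaluation import CustomMetric

def word_count(instance: dict) -> dict:
    # `instance` is a dataset row dictionary; the returned dictionary must
    # map the score to the CustomMetric's name as the key.
    return {"word_count": len(instance["response"].split())}

word_count_metric = CustomMetric(name="word_count", metric_function=word_count)
```

The resulting metric can then be included in the `metrics` list of an `EvalTask`, alongside the built-in metrics shown in the examples below.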

EvalResult

Evaluation result.

EvalTask

A class representing an EvalTask.

An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. Evaluation tasks must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
    * prompt_column_name: "prompt"
    * reference_column_name: "reference"
    * response_column_name: "response"
    * baseline_model_response_column_name: "baseline_model_response"

Requirement for different use cases:
  * Bring-your-own-response (BYOR): You already have the data that you
      want to evaluate stored in the dataset. Response column name can be
      customized by providing `response_column_name` parameter, or in the
      `metric_column_mapping`. For BYOR pairwise evaluation, the baseline
      model response column name can be customized by providing
      `baseline_model_response_column_name` parameter, or
      in the `metric_column_mapping`. If the `response` column or
      `baseline_model_response` column is present while the
      corresponding model is specified, an error will be raised.

  * Perform model inference without a prompt template: You have a dataset
      containing the input prompts to the model and want to perform model
      inference before evaluation. A column named `prompt` is required
      in the evaluation dataset and is used directly as input to the model.

  * Perform model inference with a prompt template: You have a dataset
      containing the input variables to the prompt template and want to
      assemble the prompts for model inference. Evaluation dataset
      must contain column names corresponding to the variable names in
      the prompt template. For example, if prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must
      contain `instruction` and `context` columns.

Metrics Details:

Descriptions of the supported metrics, their rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page:
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, provide the baseline model responses in the
`baseline_model_response` column.

  ```
  eval_dataset = pd.DataFrame({
          "prompt"  : [...],
          "reference": [...],
          "response" : [...],
          "baseline_model_response": [...],
  })
  eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
            "bleu",
            "rouge_l_sum",
            MetricPromptTemplateExamples.Pointwise.FLUENCY,
            MetricPromptTemplateExamples.Pairwise.SAFETY
    ],
    experiment="my-experiment",
  )
  eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
  ```

2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance.  The input column name to the
model is `prompt` and must be present in the dataset.

  ```
  eval_dataset = pd.DataFrame({
        "reference": [...],
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
      experiment="my-experiment",
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      experiment_run_name="gemini-eval-run"
  )
  ```

3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.
  ```
  eval_dataset = pd.DataFrame({
      "context"    : [...],
      "instruction": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      prompt_template="{instruction}. Article: {context}. Summary:",
  )
  ```

4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.

  ```
  from openai import OpenAI
  client = OpenAI()
  def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "user", "content": input}
      ]
    )
    return response.choices[0].message.content

  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
        "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
      experiment="my-experiment",
  ).evaluate(
      model=custom_model_fn,
      experiment_run_name="gpt-eval-run"
  )
  ```

5. To perform pairwise metric evaluation with a model inference step, specify
the `baseline_model` input to a `PairwiseMetric` instance and the candidate
`model` input to the `EvalTask.evaluate()` function. The input column name
to both models is `prompt` and must be present in the dataset.

  ```
  baseline_model = GenerativeModel("gemini-1.0-pro")
  candidate_model = GenerativeModel("gemini-1.5-pro")

  pairwise_groundedness = PairwiseMetric(
      metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
          "pairwise_groundedness"
      ),
      baseline_model=baseline_model,
  )
  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[pairwise_groundedness],
      experiment="my-pairwise-experiment",
  ).evaluate(
      model=candidate_model,
      experiment_run_name="gemini-pairwise-eval-run",
  )
  ```

MetricPromptTemplateExamples

Examples of metric prompt templates for model-based evaluation.
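
This class is used in two ways throughout the examples on this page: predefined metric constants under `Pointwise` and `Pairwise`, and raw prompt templates retrieved with `get_prompt_template`. A short sketch using only the names that appear in those examples:

```
# Retrieve a built-in metric prompt template by name (names as used in the
# EvalTask examples, e.g. "fluency", "pairwise_groundedness").
fluency_template = MetricPromptTemplateExamples.get_prompt_template("fluency")
print(fluency_template)

# Predefined metric constants can be passed directly in an EvalTask metrics list.
metrics = [
    MetricPromptTemplateExamples.Pointwise.FLUENCY,
    MetricPromptTemplateExamples.Pairwise.SAFETY,
]
```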

PairwiseMetric

A Model-based Pairwise Metric.

A model-based evaluation metric that compares two generative models' responses side by side, allowing users to A/B test their generative models to determine which model performs better.

For more details on when to use pairwise metrics, see [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Result Details:

* In `EvalResult.summary_metrics`, win rates are computed for both the
baseline and candidate models. The win rate is the proportion of one
model's wins out of the total number of comparisons, expressed as a
decimal value between 0 and 1.

* In `EvalResult.metrics_table`, a pairwise metric produces two
evaluation results per dataset row:
    * `pairwise_choice`: The choice shows whether the candidate model or
      the baseline model performs better, or if they are equally good.
    * `explanation`: The rationale behind each verdict using
      chain-of-thought reasoning. The explanation helps users scrutinize
      the judgment and builds appropriate trust in the decisions.

See [documentation
page](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#understand-results)
for more details on understanding the metric results.

Usage Examples:

```
baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")

pairwise_groundedness = PairwiseMetric(
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_groundedness"
    ),
    baseline_model=baseline_model,
)
eval_dataset = pd.DataFrame({
      "prompt"  : [...],
})
pairwise_task = EvalTask(
    dataset=eval_dataset,
    metrics=[pairwise_groundedness],
    experiment="my-pairwise-experiment",
)
pairwise_result = pairwise_task.evaluate(
    model=candidate_model,
    experiment_run_name="gemini-pairwise-eval-run",
)
```
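
Building on the example above, the fields described under Result Details can be inspected on the returned `EvalResult`. A sketch, assuming `metrics_table` is returned as a pandas DataFrame like the input datasets:

```
# Aggregate metrics, including the baseline and candidate win rates.
print(pairwise_result.summary_metrics)

# Per-row results; the pairwise choice and explanation described above appear
# as columns of this table (column names may be prefixed with the metric name).
print(pairwise_result.metrics_table.head())
```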

PairwiseMetricPromptTemplate

Pairwise metric prompt template for pairwise model-based metrics.

PointwiseMetric

A Model-based Pointwise Metric.

A model-based evaluation metric that evaluates a single generative model's response.

For more details on when to use model-based pointwise metrics, see [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage Examples:

```
candidate_model = GenerativeModel("gemini-1.5-pro")
eval_dataset = pd.DataFrame({
    "prompt"  : [...],
})
fluency_metric = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template('fluency'),
)
pointwise_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        fluency_metric,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
    ],
)
pointwise_result = pointwise_eval_task.evaluate(
    model=candidate_model,
)
```

PointwiseMetricPromptTemplate

Pointwise metric prompt template for pointwise model-based metrics.

PromptTemplate

A prompt template for creating prompts with variables.

The PromptTemplate class allows users to define a template string with variables represented in curly braces, for example `{variable}`. Variable names cannot contain spaces. These variables can be replaced with specific values using the `assemble` method, providing flexibility in generating dynamic prompts.

Usage:

```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
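# Output: Hello, John! Today is Monday. How are you?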
```

Rouge

The ROUGE Metric.

Calculates the recall of n-grams in the prediction as compared to the reference and returns a score ranging between 0 and 1. Supported ROUGE types are `rougen` with n in [1-9] (for example, `rouge1`), `rougeL`, and `rougeLsum`.
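
A minimal sketch of using the metric in a bring-your-own-response evaluation, assuming the constructor accepts a `rouge_type` argument matching the supported types listed above:

```
# Assumes Rouge(rouge_type=...) accepts one of the supported ROUGE types above.
rouge_metric = Rouge(rouge_type="rougeLsum")

eval_dataset = pd.DataFrame({
    "response": [...],   # model predictions
    "reference": [...],  # ground-truth references
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=[rouge_metric],
).evaluate()
```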