Class EvalTask (1.55.0)

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)

A class representing an EvalTask.

An evaluation task measures a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

Default dataset column names:
    * content_column_name: "content"
    * reference_column_name: "reference"
    * response_column_name: "response"
Requirements for different use cases:
  * Bring your own prediction: A `response` column is required. The response
      column name can be customized with the `response_column_name`
      parameter.
  * Without prompt template: A column representing the input prompt to the
      model is required. If `content_column_name` is not specified, the
      eval dataset requires a `content` column by default. If a response
      column is present, it is ignored; new responses are generated from
      the content column and used for evaluation.
  * With prompt template: The dataset must contain column names corresponding
      to the placeholder names in the prompt template. For example, if the
      prompt template is "Instruction: {instruction}, context: {context}", the
      dataset must contain `instruction` and `context` columns.
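
The column requirements above can be sketched as a small pre-flight check. This is a minimal illustration using only the standard library, not the SDK's own validation logic: the helper names `template_placeholders`, `required_columns`, and `missing_columns` are hypothetical, and columns that specific metrics may need (such as `reference`) are not covered.

```python
from string import Formatter
from typing import Iterable, List, Optional


def template_placeholders(prompt_template: str) -> List[str]:
    """Return placeholder names from a str.format-style template, in order."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(prompt_template):
        # field_name is None for literal text that has no placeholder.
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def required_columns(
    model_provided: bool,
    prompt_template: Optional[str] = None,
    content_column_name: str = "content",
    response_column_name: str = "response",
) -> List[str]:
    """Hypothetical helper: columns needed for each use case listed above."""
    if not model_provided:
        # Bring your own prediction: responses come from the dataset.
        return [response_column_name]
    if prompt_template is None:
        # Model inference without a template: prompts come from one column.
        return [content_column_name]
    # Model inference with a template: one column per placeholder.
    return template_placeholders(prompt_template)


def missing_columns(required: Iterable[str], columns: Iterable[str]) -> List[str]:
    """Return the required names that are absent from the dataset columns."""
    available = set(columns)
    return [name for name in required if name not in available]


template = "Instruction: {instruction}, context: {context}"
needed = required_columns(model_provided=True, prompt_template=template)
print(needed)  # ['instruction', 'context']
print(missing_columns(needed, ["instruction", "reference"]))  # ['context']
```

With a pandas dataset, the second argument to `missing_columns` would be the DataFrame's column names.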

Metrics Details:

The supported metrics, metric bundle descriptions, grading rubrics, and
the required input fields can be found on the Vertex AI public
documentation page [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).

Usage:

1. To perform bring-your-own-prediction (BYOP) evaluation, provide the model
responses in the response column of the dataset. The response column name
is "response" by default; set the `response_column_name` parameter to
customize it.

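  ```
  import pandas as pd
  from vertexai.preview.evaluation import EvalTask

  # Responses are supplied in the dataset, so no model is passed to evaluate().
  eval_dataset = pd.DataFrame({
      "reference": [...],
      "response": [...],
  })
  eval_task = EvalTask(
      dataset=eval_dataset,
      metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
      experiment="my-experiment",
  )
  eval_result = eval_task.evaluate(
      experiment_run_name="eval-experiment-run",
  )
  ```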

2. To perform evaluation with built-in Gemini model inference, pass a
GenerativeModel instance as the `model` parameter of `evaluate`. By default,
the `content` column is used as the input to the model.

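  ```
  import pandas as pd
  from vertexai.generative_models import GenerativeModel
  from vertexai.preview.evaluation import EvalTask

  # The "content" column holds the prompts sent to the model.
  eval_dataset = pd.DataFrame({
      "reference": [...],
      "content": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
      experiment="my-experiment",
  ).evaluate(
      model=GenerativeModel("gemini-pro"),
      experiment_run_name="gemini-pro-eval-run",
  )
  ```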

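3. If a `prompt_template` is specified, the `content` column is not required.
Prompts are assembled from the evaluation dataset, so every placeholder
name in the template must be present in the dataset columns.

  ```
  import pandas as pd
  from vertexai.generative_models import GenerativeModel
  from vertexai.preview.evaluation import EvalTask

  # Any GenerativeModel instance can be used; "gemini-pro" matches the
  # previous example.
  model = GenerativeModel("gemini-pro")

  # Column names match the placeholders in the prompt template.
  eval_dataset = pd.DataFrame({
      "context": [...],
      "instruction": [...],
      "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["summarization_quality"],
  ).evaluate(
      model=model,
      prompt_template="{instruction}. Article: {context}. Summary:",
  )
  ```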

4. To perform evaluation with custom model inference, pass a custom
prediction function as the `model` parameter of `evaluate`. The function
is called with values from the `content` column of the dataset, and the
predictions it returns are used for evaluation.

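  ```
  import pandas as pd
  from openai import OpenAI
  from vertexai.preview.evaluation import EvalTask

  # Assumes the OpenAI Python client (openai>=1.0), which reads the API key
  # from the OPENAI_API_KEY environment variable.
  client = OpenAI()

  def custom_model_fn(prompt: str) -> str:
      """Takes one prompt string and returns the model's response text."""
      response = client.chat.completions.create(
          model="gpt-3.5-turbo",
          messages=[
              {"role": "user", "content": prompt}
          ],
      )
      return response.choices[0].message.content

  eval_dataset = pd.DataFrame({
      "content": [...],
      "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["text_generation_similarity", "text_generation_quality"],
      experiment="my-experiment",
  ).evaluate(
      model=custom_model_fn,
      experiment_run_name="gpt-eval-run",
  )
  ```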
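
A custom prediction function only needs to map one prompt string to one response string (`Callable[[str], str]` in the `evaluate` signature). The sketch below shows a deterministic offline stand-in that may be useful for checking that contract before wiring in a real model. The names `stub_model_fn` and `generate_responses` are hypothetical, and the loop is only a conceptual picture of per-prompt inference, not the SDK's internal implementation.

```python
from typing import Callable, List


def stub_model_fn(prompt: str) -> str:
    """Hypothetical offline stand-in: returns a canned, deterministic reply."""
    return f"stub response to: {prompt}"


def generate_responses(
    model_fn: Callable[[str], str], prompts: List[str]
) -> List[str]:
    """Call the model function once per prompt, preserving order."""
    return [model_fn(prompt) for prompt in prompts]


responses = generate_responses(stub_model_fn, ["What is 2 + 2?", "Name a prime."])
print(responses[0])  # stub response to: What is 2 + 2?
```

Because the stub has the same signature as `custom_model_fn`, it can be swapped in wherever a custom function is accepted.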

Methods

EvalTask

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)

Initializes an EvalTask.

display_runs

display_runs()

Displays experiment runs associated with this EvalTask.

evaluate

evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None
) -> vertexai.preview.evaluation._base.EvalResult

Runs an evaluation for the EvalTask.