Class EvalTask (1.51.0)

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)

A class representing an EvalTask.

An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of a model's generated text.

Dataset Details:

Default dataset column names:

  • content_column_name: "content"
  • reference_column_name: "reference"
  • response_column_name: "response"

Requirements for different use cases:

  • Bring your own prediction: A response column is required. The response column name can be customized by providing the response_column_name parameter, as shown in the sketch after this list.
  • Without prompt template: A column representing the input prompt to the model is required. If content_column_name is not specified, the evaluation dataset requires a content column by default. Any existing response column is ignored; new responses are generated by the model from the content column and used for evaluation.
  • With prompt template: The dataset must contain column names corresponding to the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain instruction and context columns.
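
For example, if the dataset columns do not use the default names, they can be mapped when constructing the EvalTask. A minimal sketch using the same [...] placeholders as the examples below; the column and experiment names are illustrative:

    import pandas as pd
    from vertexai.preview.evaluation import EvalTask

    eval_dataset = pd.DataFrame({
        "my_prompt": [...],
        "reference": [...],
        "my_response": [...],
    })
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=["bleu", "rouge_l_sum"],
        content_column_name="my_prompt",
        response_column_name="my_response",
    )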

Metrics Details: The supported metrics, metric bundle descriptions, grading rubrics, and the required input fields can be found on the Vertex AI public documentation page Evaluation methods and metrics.
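
In addition to the built-in metric names, a CustomMetric instance can be included in the metrics list. The sketch below is illustrative only: it assumes CustomMetric can be imported from vertexai.preview.evaluation and that its metric function receives one dataset row as a dict and returns a dict keyed by the metric name; verify the exact contract against the CustomMetric reference.

    from vertexai.preview.evaluation import CustomMetric

    def word_count(instance: dict) -> dict:
        # Assumed contract: score one dataset row and return {metric_name: score}.
        return {"word_count": len(instance["response"].split())}

    word_count_metric = CustomMetric(name="word_count", metric_function=word_count)

    eval_dataset = pd.DataFrame({
        "response": [...],
    })
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=["fluency", word_count_metric],
    )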

Usage:

  1. To perform bring-your-own-prediction evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default; specify the response_column_name parameter to customize it.

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "response": [...],
    })
    eval_task = EvalTask(
        dataset=eval_dataset,
        metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
        experiment="my-experiment",
    )
    eval_result = eval_task.evaluate(
        experiment_run_name="eval-experiment-run"
    )
    
  2. To perform evaluation with built-in Gemini model inference, specify the model parameter with a GenerativeModel instance. By default, the content column is used as the query to the model.

    eval_dataset = pd.DataFrame({
        "reference": [...],
        "content": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
        experiment="my-experiment",
    ).evaluate(
        model=GenerativeModel("gemini-pro"),
        experiment_run_name="gemini-pro-eval-run",
    )
    
  3. If a prompt_template is specified, the content column is not required. Prompts are assembled from the evaluation dataset, and all placeholder names in the prompt template must be present as dataset columns.

    eval_dataset = pd.DataFrame({
        "context": [...],
        "instruction": [...],
        "reference": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["summarization_quality"],
    ).evaluate(
        model=model,
        prompt_template="{instruction}. Article: {context}. Summary:",
    )
    
  4. To perform evaluation with custom model inference, specify the model parameter with a custom prediction function. The content column of the dataset is passed to the custom model function to generate predictions for evaluation.

    def custom_model_fn(input: str) -> str:
        # `client` is assumed to be an initialized OpenAI client instance.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": input}
            ]
        )
        return response.choices[0].message.content

    eval_dataset = pd.DataFrame({
        "content": [...],
        "reference": [...],
    })
    result = EvalTask(
        dataset=eval_dataset,
        metrics=["text_generation_similarity", "text_generation_quality"],
        experiment="my-experiment",
    ).evaluate(
        model=custom_model_fn,
        experiment_run_name="gpt-eval-run"
    )
    

Methods

EvalTask

EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)

Initializes an EvalTask.

display_runs

display_runs()

Displays experiment runs associated with this EvalTask.
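
For example (a minimal sketch, assuming the EvalTask was constructed with an experiment name and at least one evaluation run has been logged to it):

    eval_task.display_runs()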

evaluate

evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: str = "response"
) -> vertexai.preview.evaluation._base.EvalResult

Runs an evaluation for the EvalTask.
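
For example, using the bring-your-own-prediction eval_task from the Usage section above, the result can be inspected after the run completes. This is a minimal sketch that assumes the returned EvalResult exposes a summary_metrics dict and a metrics_table DataFrame; verify the field names against the EvalResult reference.

    eval_result = eval_task.evaluate(
        experiment_run_name="my-eval-run"
    )
    print(eval_result.summary_metrics)  # aggregate score per metric (assumed field)
    eval_result.metrics_table.head()    # per-row inputs and scores (assumed field)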