EvalTask(
*,
dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
metrics: typing.List[
typing.Union[
typing.Literal[
"exact_match",
"bleu",
"rouge_1",
"rouge_2",
"rouge_l",
"rouge_l_sum",
"tool_call_valid",
"tool_name_match",
"tool_parameter_key_match",
"tool_parameter_kv_match",
],
vertexai.evaluation.CustomMetric,
vertexai.evaluation.metrics._base._AutomaticMetric,
vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
]
],
experiment: typing.Optional[str] = None,
metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
output_uri_prefix: typing.Optional[str] = ""
)
A class representing an EvalTask.
An Evaluation Tasks is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. Evaluation tasks must contain an evaluation dataset, and a list of metrics to evaluate. Evaluation tasks help developers compare propmpt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset Details:
Default dataset column names:
* prompt_column_name: "prompt"
* reference_column_name: "reference"
* response_column_name: "response"
* baseline_model_response_column_name: "baseline_model_response"
Requirement for different use cases:
* Bring-your-own-response: A `response` column is required. Response
column name can be customized by providing `response_column_name`
parameter. If a pairwise metric is used and a baseline model is
not provided, a `baseline_model_response` column is required.
Baseline model response column name can be customized by providing
`baseline_model_response_column_name` parameter. If the `response`
column or `baseline_model_response` column is present while the
corresponding model is specified, an error will be raised.
* Perform model inference without a prompt template: A `prompt` column
in the evaluation dataset representing the input prompt to the
model is required and is used directly as input to the model.
* Perform model inference with a prompt template: Evaluation dataset
must contain column names corresponding to the variable names in
the prompt template. For example, if prompt template is
"Instruction: {instruction}, context: {context}", the dataset must
contain `instruction` and `context` columns.
Metrics Details:
The supported metrics descriptions, rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page.
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
Usage Examples:
1. To perform bring-your-own-response(BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, provide the baseline model responses in the
`baseline_model_response` column.
```
eval_dataset = pd.DataFrame({
"prompt" : [...],
"reference": [...],
"response" : [...],
"baseline_model_response": [...],
})
eval_task = EvalTask(
dataset=eval_dataset,
metrics=[
"bleu",
"rouge_l_sum",
MetricPromptTemplateExamples.Pointwise.FLUENCY,
MetricPromptTemplateExamples.Pairwise.SAFETY
],
experiment="my-experiment",
)
eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
```
2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance. The input column name to the
model is `prompt` and must be present in the dataset.
```
eval_dataset = pd.DataFrame({
"reference": [...],
"prompt" : [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
experiment="my-experiment",
).evaluate(
model=GenerativeModel("gemini-1.5-pro"),
experiment_run_name="gemini-eval-run"
)
```
3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.
```
eval_dataset = pd.DataFrame({
"context" : [...],
"instruction": [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
).evaluate(
model=GenerativeModel("gemini-1.5-pro"),
prompt_template="{instruction}. Article: {context}. Summary:",
)
```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.
```
from openai import OpenAI
client = OpenAI()
def custom_model_fn(input: str) -> str:
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": input}
]
)
return response.choices[0].message.content
eval_dataset = pd.DataFrame({
"prompt" : [...],
"reference": [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
experiment="my-experiment",
).evaluate(
model=custom_model_fn,
experiment_run_name="gpt-eval-run"
)
```
5. To perform pairwise metric evaluation with model inference step, specify
the `baseline_model` input to a `PairwiseMetric` instance and the candidate
`model` input to the `EvalTask.evaluate()` function. The input column name
to both models is `prompt` and must be present in the dataset.
```
baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")
pairwise_groundedness = PairwiseMetric(
metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
"pairwise_groundedness"
),
baseline_model=baseline_model,
)
eval_dataset = pd.DataFrame({
"prompt" : [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[pairwise_groundedness],
experiment="my-pairwise-experiment",
).evaluate(
model=candidate_model,
experiment_run_name="gemini-pairwise-eval-run",
)
```
Properties
dataset
Returns evaluation dataset.
experiment
Returns experiment name.
metrics
Returns metrics.
Methods
EvalTask
EvalTask(
*,
dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
metrics: typing.List[
typing.Union[
typing.Literal[
"exact_match",
"bleu",
"rouge_1",
"rouge_2",
"rouge_l",
"rouge_l_sum",
"tool_call_valid",
"tool_name_match",
"tool_parameter_key_match",
"tool_parameter_kv_match",
],
vertexai.evaluation.CustomMetric,
vertexai.evaluation.metrics._base._AutomaticMetric,
vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
]
],
experiment: typing.Optional[str] = None,
metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
output_uri_prefix: typing.Optional[str] = ""
)
Initializes an EvalTask.
display_runs
display_runs()
Displays experiment runs associated with this EvalTask.
evaluate
evaluate(
*,
model: typing.Optional[
typing.Union[
vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
]
] = None,
prompt_template: typing.Optional[str] = None,
experiment_run_name: typing.Optional[str] = None,
response_column_name: typing.Optional[str] = None,
baseline_model_response_column_name: typing.Optional[str] = None,
evaluation_service_qps: typing.Optional[float] = None,
retry_timeout: float = 600.0,
output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult
Runs an evaluation for the EvalTask.