EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)
A class representing an EvalTask.
An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset Details:
Default dataset column names:
* content_column_name: "content"
* reference_column_name: "reference"
* response_column_name: "response"
Requirements for different use cases:
* Bring your own prediction: A `response` column is required. The response
column name can be customized with the `response_column_name`
parameter.
* Without prompt template: A column representing the input prompt to the
model is required. If `content_column_name` is not specified, the
eval dataset requires a `content` column by default. If a response
column is present, it is ignored: new responses are generated from the
content column and used for evaluation.
* With prompt template: The dataset must contain columns whose names match
the placeholder names in the prompt template. For example, if the prompt
template is "Instruction: {instruction}, context: {context}", the
dataset must contain `instruction` and `context` columns.
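The prompt-template requirement can be checked locally before running an evaluation. The following is a minimal sketch, not part of the SDK: `placeholder_names` and `missing_placeholders` are hypothetical helpers that use Python's standard `string.Formatter` to extract placeholder names from a template and compare them with the dataset's column names. It assumes the template uses plain `{name}` placeholders.

```python
from string import Formatter
from typing import Iterable, List


def placeholder_names(prompt_template: str) -> List[str]:
    """Return placeholder names in a str.format-style template, in order of first appearance."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(prompt_template):
        # field_name is None for trailing literal text and "" for a bare "{}".
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_placeholders(prompt_template: str, columns: Iterable[str]) -> List[str]:
    """Return the placeholder names that have no matching dataset column."""
    available = set(columns)
    return [name for name in placeholder_names(prompt_template) if name not in available]


template = "Instruction: {instruction}, context: {context}"
# This column list lacks "context", so the helper reports it as missing.
print(missing_placeholders(template, ["instruction", "reference"]))
```

With a pandas dataset, the column names could be passed as `eval_dataset.columns`; an empty result means every placeholder has a matching column.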
Metrics Details:
The supported metrics, metric bundle descriptions, grading rubrics, and
the required input fields can be found on the Vertex AI public
documentation page [Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
Usage:
1. To perform bring-your-own-prediction (BYOP) evaluation, provide the model
responses in the response column of the dataset. The response column name
is "response" by default; set the `response_column_name` parameter to
customize it.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

# BYOP: the dataset already contains the model responses to evaluate.
eval_dataset = pd.DataFrame({
    "reference": [...],
    "response" : [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```
2. To perform evaluation with built-in Gemini model inference, pass a
GenerativeModel instance as the `model` parameter of `evaluate`. By default,
the `content` column is used as the prompt sent to the model.
```
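import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

# Responses are generated from the "content" column during evaluation.
eval_dataset = pd.DataFrame({
    "reference": [...],
    "content" : [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_1", "rouge_2",
             "rouge_l_sum"],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run"
)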
```
3. If a `prompt_template` is specified, the `content` column is not required.
Prompts can be assembled from the evaluation dataset, and all placeholder
names must be present in the dataset columns.
```
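import pandas as pd
from vertexai.preview.evaluation import EvalTask

# `model` is assumed to be defined elsewhere: a GenerativeModel instance
# or a custom prediction function.
eval_dataset = pd.DataFrame({
    "context" : [...],
    "instruction": [...],
    "reference" : [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    # Each placeholder must match a column name in eval_dataset.
    prompt_template="{instruction}. Article: {context}. Summary:",
)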
```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom prediction function. The `content` column in the
dataset is used to generate predictions with the custom model function for
evaluation.
```
import pandas as pd
from vertexai.preview.evaluation import EvalTask

# Assumes `client` is an already-initialized OpenAI client object.
def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": input}
        ]
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content" : [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run"
)
```
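Any callable that takes a prompt string and returns a response string matches the `typing.Callable[[str], str]` type that `evaluate` accepts for `model`. As a minimal sketch, a deterministic stand-in can be useful for checking the function contract locally before calling a real model. The names `make_canned_model_fn` and `canned_model_fn` and the lookup table are illustrative assumptions, not part of the SDK.

```python
from typing import Callable, Dict


def make_canned_model_fn(answers: Dict[str, str], default: str = "") -> Callable[[str], str]:
    """Build a str -> str function that returns a fixed answer for each known prompt."""

    def model_fn(prompt: str) -> str:
        # Unknown prompts fall back to the default answer.
        return answers.get(prompt, default)

    return model_fn


canned_model_fn = make_canned_model_fn(
    {"What is 2 + 2?": "4"},
    default="I don't know.",
)
print(canned_model_fn("What is 2 + 2?"))
```

Under these assumptions, `canned_model_fn` could be passed as `model=canned_model_fn` in place of `custom_model_fn` above.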
Methods
EvalTask
EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "coherence",
                "fluency",
                "safety",
                "groundedness",
                "fulfillment",
                "summarization_quality",
                "summarization_helpfulness",
                "summarization_verbosity",
                "question_answering_quality",
                "question_answering_relevance",
                "question_answering_helpfulness",
                "question_answering_correctness",
                "text_generation_similarity",
                "text_generation_quality",
                "text_generation_instruction_following",
                "text_generation_safety",
                "text_generation_factuality",
                "summarization_pointwise_reference_free",
                "qa_pointwise_reference_free",
                "qa_pointwise_reference_based",
                "tool_call_quality",
            ],
            vertexai.preview.evaluation.metrics._base.CustomMetric,
            vertexai.preview.evaluation.metrics._base.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    content_column_name: str = "content",
    reference_column_name: str = "reference",
    response_column_name: str = "response"
)
Initializes an EvalTask.
display_runs
display_runs()
Displays experiment runs associated with this EvalTask.
evaluate
evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel, typing.Callable[[str], str]
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    retry_timeout: float = 600.0
) -> vertexai.preview.evaluation._base.EvalResult
Runs an evaluation for the EvalTask.