
Gen AI evaluation service API

The Gen AI evaluation service lets you evaluate your large language models (LLMs) across several metrics using your own criteria. You can provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as PointwiseMetric and PairwiseMetric, and in-memory computed metrics, such as rouge, bleu, and tool function-call metrics. PointwiseMetric and PairwiseMetric are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, the evaluation service can perform both inference and subsequent evaluation on all models supported by Vertex AI.

For more information on evaluating a model, see Gen AI evaluation service overview.

Limitations

The following limitations apply to the evaluation service:

The evaluation service may have a delay on your first call.
Most model-based metrics consume Gemini 2.0 Flash quota, because the Gen AI evaluation service uses gemini-2.0-flash as the underlying judge model to compute them.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume gemini-2.0-flash quota.

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

import json

from google import auth
from google.api_core import exceptions
from google.auth.transport import requests as google_auth_requests

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

data = {
  ...
}

# LOCATION and PROJECT_ID must be defined as strings before this line.
uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)

print(json.dumps(result.json(), indent=2))
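
The request above can also be wrapped in a small helper. The following is a minimal sketch, not part of any official client library: `build_evaluate_url` and `evaluate_instances` are hypothetical names, and `session` is assumed to be any object with a requests-style `post` method, such as the `AuthorizedSession` created in the preceding example.

```python
from typing import Any, Protocol


class PostSession(Protocol):
    """Anything with a requests-style post(), e.g. AuthorizedSession(creds)."""

    def post(self, url: str, json: dict[str, Any]) -> Any: ...


def build_evaluate_url(project_id: str, location: str) -> str:
    """Return the evaluateInstances endpoint for one project and region."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )


def evaluate_instances(session: PostSession, project_id: str, location: str,
                       payload: dict[str, Any]) -> dict[str, Any]:
    """POST one evaluation request and return the decoded JSON response."""
    response = session.post(build_evaluate_url(project_id, location), json=payload)
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()
```

Passing the session in as an argument keeps authentication separate from request construction, so the helper can be exercised with a stub session in tests.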

Parameter list

Parameters
`exact_match_input`	Optional: `ExactMatchInput` Input to assess whether the prediction matches the reference exactly.
`bleu_input`	Optional: `BleuInput` Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	Optional: `RougeInput` Input to compute `rouge` scores by comparing the prediction against the reference. `rouge_type` selects among the supported `rouge` scores.
`fluency_input`	Optional: `FluencyInput` Input to assess a single response's language mastery.
`coherence_input`	Optional: `CoherenceInput` Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	Optional: `SafetyInput` Input to assess a single response's level of safety.
`groundedness_input`	Optional: `GroundednessInput` Input to assess a single response's ability to provide or reference information included only in the input text.
`fulfillment_input`	Optional: `FulfillmentInput` Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	Optional: `SummarizationQualityInput` Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	Optional: `PairwiseSummarizationQualityInput` Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	Optional: `SummarizationHelpfulnessInput` Input to assess a single response's ability to provide a summary that contains the details necessary to substitute the original text.
`summarization_verbosity_input`	Optional: `SummarizationVerbosityInput` Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	Optional: `QuestionAnsweringQualityInput` Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	Optional: `PairwiseQuestionAnsweringQualityInput` Input to compare two responses' overall ability to answer questions, given a body of text to reference.
`question_answering_relevance_input`	Optional: `QuestionAnsweringRelevanceInput` Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	Optional: `QuestionAnsweringHelpfulnessInput` Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	Optional: `QuestionAnsweringCorrectnessInput` Input to assess a single response's ability to correctly answer a question.
`pointwise_metric_input`	Optional: `PointwiseMetricInput` Input for a generic pointwise evaluation.
`pairwise_metric_input`	Optional: `PairwiseMetricInput` Input for a generic pairwise evaluation.
`tool_call_valid_input`	Optional: `ToolCallValidInput` Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	Optional: `ToolNameMatchInput` Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	Optional: `ToolParameterKeyMatchInput` Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input`	Optional: `ToolParameterKvMatchInput` Input to assess a single response's ability to predict a tool call with correct parameter names and values.
`comet_input`	Optional: `CometInput` Input to evaluate using COMET.
`metricx_input`	Optional: `MetricxInput` Input to evaluate using MetricX.

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `ExactMatchSpec` Metric spec, defining the metric's behavior.
`instances`	Optional: `ExactMatchInstance[]` Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.
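
As an illustration of the `instances` array, the following minimal sketch pairs predictions with references and builds a request body in the shape shown above. `build_exact_match_request` is a hypothetical helper name, and the length check is a client-side assumption rather than documented service behavior.

```python
from typing import Any


def build_exact_match_request(predictions: list[str],
                              references: list[str]) -> dict[str, Any]:
    """Build an exact_match_input body with one instance per (prediction, reference)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    return {
        "exact_match_input": {
            "metric_spec": {},
            "instances": [
                {"prediction": p, "reference": r}
                for p, r in zip(predictions, references)
            ],
        }
    }


# Two instances: the first pair is identical, the second is not.
request = build_exact_match_request(["Paris", "4"], ["Paris", "5"])
```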

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`exact_match_metric_values`	`ExactMatchMetricValue[]` Evaluation results per instance input.
`exact_match_metric_values.score`	`float` One of the following: `0`: the instance was not an exact match, `1`: exact match.

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `BleuSpec` Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	Optional: `bool` Whether to take into account n-gram orders without any match.
`instances`	Optional: `BleuInstance[]` Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`bleu_metric_values`	`BleuMetricValue[]` Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is more like the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	Optional: `RougeSpec` Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	Optional: `string` Acceptable values: `rougen[1-9]`: compute `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: compute `rouge` scores based on the Longest Common Subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences and then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	Optional: `bool` Whether the Porter stemmer should be used to strip word suffixes to improve matching.
`metric_spec.split_summaries`	Optional: `bool` Whether to add newlines between sentences for `rougeLsum`.
`instances`	Optional: `RougeInstance[]` Evaluation input, consisting of LLM response and reference.
`instances.prediction`	Optional: `string` LLM response.
`instances.reference`	Optional: `string` Golden LLM response for reference.
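
The `metric_spec` options above can be checked on the client before sending a request. This is a minimal sketch under one stated assumption: the notation `rougen[1-9]` is read here as the values `rouge1` through `rouge9`. The helper name `build_rouge_request` is hypothetical.

```python
import re
from typing import Any

# Assumption: "rougen[1-9]" in the table means rouge1 ... rouge9.
_ROUGE_TYPE = re.compile(r"rouge[1-9]|rougeL|rougeLsum")


def build_rouge_request(prediction: str, reference: str,
                        rouge_type: str = "rougeLsum",
                        use_stemmer: bool = False,
                        split_summaries: bool = False) -> dict[str, Any]:
    """Build a rouge_input body, rejecting rouge_type values not listed above."""
    if not _ROUGE_TYPE.fullmatch(rouge_type):
        raise ValueError(f"unsupported rouge_type: {rouge_type!r}")
    return {
        "rouge_input": {
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
            "instances": [{"prediction": prediction, "reference": reference}],
        }
    }


request = build_rouge_request("The cat sat.", "A cat was sitting.", rouge_type="rouge2")
```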

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`rouge_metric_values`	`RougeValue[]` Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is more like the reference.

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FluencySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `FluencyInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Inarticulate, `2`: Somewhat inarticulate, `3`: Neutral, `4`: Somewhat fluent, `5`: Fluent.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.
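
Several results on this page share the `score`, `explanation`, and `confidence` shape, so one small parser can cover them. The following is a minimal sketch: `ScoredResult` and `parse_scored_result` are hypothetical names, the sample response is made up for illustration, and the fallback to a camelCase key is an assumption about how the REST layer may serialize field names.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ScoredResult:
    score: float
    explanation: str
    confidence: Optional[float]


def _camel(name: str) -> str:
    """Convert snake_case to lowerCamelCase, e.g. fluency_result -> fluencyResult."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def parse_scored_result(response: dict[str, Any], key: str) -> ScoredResult:
    """Extract a score-style result stored under `key` (snake_case or camelCase)."""
    body = response.get(key, response.get(_camel(key)))
    if body is None:
        raise KeyError(key)
    return ScoredResult(
        score=float(body["score"]),
        explanation=body.get("explanation", ""),
        confidence=body.get("confidence"),
    )


# Hypothetical response, made up for illustration only.
sample = {"fluency_result": {"score": 4, "explanation": "Reads naturally.", "confidence": 0.8}}
result = parse_scored_result(sample, "fluency_result")
```

The same function could be called with `"coherence_result"`, `"safety_result"`, and the other result keys that use this shape.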

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CoherenceSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `CoherenceInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Incoherent, `2`: Somewhat incoherent, `3`: Neutral, `4`: Somewhat coherent, `5`: Coherent.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SafetySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SafetyInstance` Evaluation input, consisting of the LLM response.
`instance.prediction`	Optional: `string` LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Unsafe, `1`: Safe.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

Parameter	Description
`metric_spec`	Optional: `GroundednessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `GroundednessInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Ungrounded, `1`: Grounded.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

Parameters
`metric_spec`	Optional: `FulfillmentSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `FulfillmentInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: No fulfillment, `2`: Poor fulfillment, `3`: Some fulfillment, `4`: Good fulfillment, `5`: Complete fulfillment.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationQualityInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Very bad, `2`: Bad, `3`: OK, `4`: Good, `5`: Very good.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `PairwiseSummarizationQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseSummarizationQualityInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string` Baseline model LLM response.
`instance.prediction`	Optional: `string` Candidate model LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: the baseline prediction is better, `CANDIDATE`: the candidate prediction is better, `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.
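
When many pairs are evaluated, the `pairwise_choice` values can be aggregated into a win rate. The following is a minimal sketch over result bodies in the shape shown above. The function names and the sample data are hypothetical, and excluding ties from the denominator is one possible convention, not something the service prescribes.

```python
from collections import Counter
from typing import Any, Iterable


def tally_pairwise_choices(results: Iterable[dict[str, Any]]) -> Counter:
    """Count how often each PairwiseChoice value occurs."""
    return Counter(r["pairwise_choice"] for r in results)


def candidate_win_rate(counts: Counter) -> float:
    """Share of decided comparisons (ties excluded) won by the candidate."""
    decided = counts["BASELINE"] + counts["CANDIDATE"]
    return counts["CANDIDATE"] / decided if decided else 0.0


# Hypothetical result bodies, made up for illustration only.
results = [
    {"pairwise_choice": "CANDIDATE"},
    {"pairwise_choice": "BASELINE"},
    {"pairwise_choice": "CANDIDATE"},
    {"pairwise_choice": "TIE"},
]
counts = tally_pairwise_choices(results)
rate = candidate_win_rate(counts)
```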

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationHelpfulnessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationHelpfulnessInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Unhelpful, `2`: Somewhat unhelpful, `3`: Neutral, `4`: Somewhat helpful, `5`: Helpful.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `SummarizationVerbositySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `SummarizationVerbosityInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `-2`: Terse, `-1`: Somewhat terse, `0`: Optimal, `1`: Somewhat verbose, `2`: Verbose.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringQualityInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Very bad, `2`: Bad, `3`: OK, `4`: Good, `5`: Very good.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `PairwiseQuestionAnsweringQualitySpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `PairwiseQuestionAnsweringQualityInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.baseline_prediction`	Optional: `string` Baseline model LLM response.
`instance.prediction`	Optional: `string` Candidate model LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Output
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: the baseline prediction is better, `CANDIDATE`: the candidate prediction is better, `TIE`: tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringRelevanceSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringRelevanceInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringRelevancyResult`

{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Irrelevant, `2`: Somewhat irrelevant, `3`: Neutral, `4`: Somewhat relevant, `5`: Relevant.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringHelpfulnessSpec` Metric spec, defining the metric's behavior.
`instance`	Optional: `QuestionAnsweringHelpfulnessInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `1`: Unhelpful, `2`: Somewhat unhelpful, `3`: Neutral, `4`: Somewhat helpful, `5`: Helpful.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	Optional: `QuestionAnsweringCorrectnessSpec` Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	Optional: `bool` Whether the reference is used in the evaluation.
`instance`	Optional: `QuestionAnsweringCorrectnessInstance` Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	Optional: `string` LLM response.
`instance.reference`	Optional: `string` Golden LLM response for reference.
`instance.instruction`	Optional: `string` Instruction used at inference time.
`instance.context`	Optional: `string` Inference-time text containing all the information that can be used in the LLM response.
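
To illustrate how `use_reference` relates to `instance.reference`, the following minimal sketch sets the flag only when a reference is supplied. That coupling is a client-side assumption made for this example, not documented service behavior, and `build_qa_correctness_request` is a hypothetical helper name.

```python
from typing import Any, Optional


def build_qa_correctness_request(prediction: str, instruction: str, context: str,
                                 reference: Optional[str] = None) -> dict[str, Any]:
    """Build a question_answering_correctness_input body.

    Assumption: use_reference is true exactly when a reference is provided.
    """
    instance: dict[str, Any] = {
        "prediction": prediction,
        "instruction": instruction,
        "context": context,
    }
    if reference is not None:
        instance["reference"] = reference
    return {
        "question_answering_correctness_input": {
            "metric_spec": {"use_reference": reference is not None},
            "instance": instance,
        }
    }


with_ref = build_qa_correctness_request(
    prediction="Paris",
    instruction="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris.",
    reference="Paris",
)
without_ref = build_qa_correctness_request(
    prediction="Paris",
    instruction="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris.",
)
```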

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Output
`score`	`float`: One of the following: `0`: Incorrect, `1`: Correct.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PointwiseMetricSpec` Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string` A prompt template defining the metric. It is rendered by the key-value pairs in `instance.json_instance`.
`instance`	Required: `PointwiseMetricInstance` Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string` The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.
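
Because `json_instance` is typed as a `string`, the key-value pairs have to be serialized before they are placed in the request. The following is a minimal sketch of that step. The helper name, the template text, and the `{response}` placeholder syntax are assumptions made for illustration.

```python
import json
from typing import Any

# Hypothetical template; the {response} placeholder syntax is an assumption.
TEMPLATE = (
    "Rate the fluency of the following response on a scale from 1 to 5.\n\n"
    "Response: {response}"
)


def build_pointwise_request(metric_prompt_template: str,
                            values: dict[str, str]) -> dict[str, Any]:
    """Build a pointwise_metric_input body with json_instance as a JSON string."""
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": metric_prompt_template},
            # json_instance is a string, so the dict is serialized here.
            "instance": {"json_instance": json.dumps(values)},
        }
    }


request = build_pointwise_request(TEMPLATE, {"response": "The weather is nice today."})
```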

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pointwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	Required: `PairwiseMetricSpec` Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	Required: `string` A prompt template defining the metric. It is rendered by the key-value pairs in `instance.json_instance`.
`instance`	Required: `PairwiseMetricInstance` Evaluation input, consisting of `json_instance`.
`instance.json_instance`	Optional: `string` The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Output
`score`	`float`: A score for the pairwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolCallValidSpec` Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolCallValidInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.
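
The `prediction` and `reference` fields carry a whole tool-call response as a single serialized string. As a minimal sketch, the Python below packs the example from the table into that shape and wraps it in a `tool_call_valid_input` body. The helper name `serialize_tool_response` is hypothetical, and using the same string for prediction and reference is only for demonstration.

```python
import json


def serialize_tool_response(content: str, tool_calls: list) -> str:
    """Serialize a model response into the string form used by the tool metrics."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


tool_calls = [
    {
        "name": "book_tickets",
        "arguments": {
            "movie": "Mission Impossible Dead Reckoning Part 1",
            "theater": "Regal Edwards 14",
            "location": "Mountain View CA",
            "showtime": "7:30",
            "date": "2024-03-30",
            "num_tix": "2",
        },
    }
]

prediction = serialize_tool_response("", tool_calls)
request_body = {
    "tool_call_valid_input": {
        "metric_spec": {},
        # Demonstration only: the reference normally comes from a golden dataset.
        "instance": {"prediction": prediction, "reference": prediction},
    }
}
print(json.dumps(request_body, indent=2))
```

The same serialized format applies to the other tool metrics on this page; only the top-level key changes.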

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`: Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`: One of the following: `0`: Invalid tool call. `1`: Valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolNameMatchSpec` Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolNameMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`: Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`: One of the following: `0`: The tool call name doesn't match the reference. `1`: The tool call name matches the reference.

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolParameterKeyMatchSpec` Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolParameterKeyMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean that more parameters match the names of the reference parameters.

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `ToolParameterKVMatchSpec` Metric spec that defines the metric's behavior.
`instance`	Optional: `ToolParameterKVMatchInstance` Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	Optional: `string` Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	Optional: `string` Golden model output in the same format as the prediction.

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Output
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean that more parameters match the names and values of the reference parameters.
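
To build intuition for how the two parameter-matching scores differ, the following Python sketch computes a local approximation of each. This page does not state the service's exact formula, so the definition used here (the fraction of reference parameters that are matched) is an assumption, and the function names are hypothetical.

```python
def key_match(pred_args: dict, ref_args: dict) -> float:
    """Fraction of reference parameter names present in the prediction (assumed formula)."""
    if not ref_args:
        return 1.0
    return sum(key in pred_args for key in ref_args) / len(ref_args)


def kv_match(pred_args: dict, ref_args: dict) -> float:
    """Fraction of reference parameters whose name and value both match (assumed formula)."""
    if not ref_args:
        return 1.0
    hits = sum(key in pred_args and pred_args[key] == ref_args[key] for key in ref_args)
    return hits / len(ref_args)


reference_args = {"movie": "Mission Impossible", "theater": "Regal Edwards 14", "num_tix": "2"}
predicted_args = {"movie": "Mission Impossible", "theater": "AMC", "date": "2024-03-30"}

print(key_match(predicted_args, reference_args))  # two of three names match
print(kv_match(predicted_args, reference_args))   # only one name-value pair matches
```

In this sketch a prediction that gets the parameter names right but the values wrong scores higher on key match than on key-value match, which is the distinction the two metrics are meant to expose.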

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `CometSpec` Metric spec that defines the metric's behavior.
`metric_spec.version`	Optional: `string` `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string` Source language in BCP-47 format. For example, "de".
`metric_spec.target_language`	Optional: `string` Target language in BCP-47 format. For example, "es".
`instance`	Optional: `CometInstance` Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	Optional: `string` Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string` Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string` Ground truth that the prediction is compared against, in the same language as the prediction.

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	Optional: `MetricxSpec` Metric spec that defines the metric's behavior.
`metric_spec.version`	Optional: `string` One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by Quality Estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	Optional: `string` Source language in BCP-47 format. For example, "de".
`metric_spec.target_language`	Optional: `string` Target language in BCP-47 format. For example, "de".
`instance`	Optional: `MetricxInstance` Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	Optional: `string` Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	Optional: `string` Source text, in the original language that the prediction was translated from.
`instance.reference`	Optional: `string` Ground truth that the prediction is compared against, in the same language as the prediction.
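
Which MetricX version applies depends on which of `source` and `reference` are available. The Python sketch below picks the version from the inputs provided and builds a matching `metricx_input` body. The helper name `build_metricx_request` and the example sentences are hypothetical, and the choice to omit unused fields is an assumption rather than documented behavior.

```python
from typing import Optional


def build_metricx_request(
    prediction: str,
    source: Optional[str] = None,
    reference: Optional[str] = None,
) -> dict:
    """Pick a MetricX version from the available inputs (hypothetical helper)."""
    if source and reference:
        version = "METRICX_24_SRC_REF"
    elif reference:
        version = "METRICX_24_REF"
    elif source:
        version = "METRICX_24_SRC"  # quality estimation, no reference needed
    else:
        raise ValueError("MetricX needs a source, a reference, or both")

    instance = {"prediction": prediction}
    if source:
        instance["source"] = source
    if reference:
        instance["reference"] = reference
    return {"metricx_input": {"metric_spec": {"version": version}, "instance": instance}}


request_body = build_metricx_request("Hola mundo", source="Hello world")
print(request_body["metricx_input"]["metric_spec"]["version"])
```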

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Output
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.
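
The two translation metrics point in opposite directions: for COMET a higher score is better, while for MetricX a lower score is better. When both appear in one report, it can help to put them on a common higher-is-better scale first. The Python sketch below does this with a simple linear rescaling. The helper `to_quality` is hypothetical, and the linear mapping is an illustrative choice: it does not make the two metrics directly comparable.

```python
def to_quality(metric: str, score: float) -> float:
    """Map a score to [0, 1] where higher is better (illustrative rescaling)."""
    if metric == "comet":
        return score  # already in [0, 1], with 1 as a perfect translation
    if metric == "metricx":
        return 1.0 - score / 25.0  # [0, 25] with 0 as perfect, so invert it
    raise ValueError(f"unknown metric: {metric}")


print(to_quality("comet", 0.8))
print(to_quality("metricx", 5.0))
```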

Examples

Evaluate an output

The following example demonstrates how to call the Gen AI evaluation service API to evaluate the output of an LLM using a variety of evaluation metrics, including the following:

summarization_quality
groundedness
verbosity
instruction_following

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

Evaluate an output: pairwise summarization quality

The following example demonstrates how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.

REST

Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: LLM response.
BASELINE_PREDICTION: Baseline model LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
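
Filling in the placeholders by hand is error-prone when the text contains quotes or line breaks. As a minimal sketch, the Python below builds the same URL and request body programmatically and leaves the escaping to `json.dumps`. The helper name `build_pairwise_request` and the sample values are hypothetical. The code only constructs the request and does not send it.

```python
import json


def build_pairwise_request(project_id, location, prediction,
                           baseline_prediction, instruction, context):
    """Return the endpoint URL and JSON body for a pairwise summarization request."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )
    body = {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }
    return url, json.dumps(body, indent=2)


url, payload = build_pairwise_request(
    "my-project",  # hypothetical project ID
    "us-central1",
    'The city calls it a "new era" for transit.',
    "The city will improve buses and trains.",
    "Summarize the text in one sentence.",
    "A major city has revealed plans to overhaul\nits public transportation system.",
)
print(url)
print(payload)
```

The resulting payload can be saved as request.json and sent with either of the commands that follow.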

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

Get ROUGE score

The following example calls the Gen AI evaluation service API to get the ROUGE score of a prediction that was generated from a set of inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.

REST

Before using any of the request data, make the following replacements:

PROJECT_ID: Your project ID.
LOCATION: The region to process the request.
PREDICTION: LLM response.
REFERENCE: Golden LLM response for reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. See metric_spec.rouge_type for acceptable values.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
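
For intuition about what a ROUGE-1 score measures, the following Python sketch computes a simplified unigram-overlap F-measure locally. It is not the service's implementation: it assumes plain whitespace tokenization, lowercasing, and no stemming, and it assumes the reported score is an F-measure. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F-measure over overlapping unigrams."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"{score:.4f}")  # 5 of 6 unigrams overlap in each direction
```

Settings such as `use_stemmer` change how tokens are matched, so scores from the service can differ from this simplified calculation.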

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.