The Gen AI evaluation service offers the following translation task evaluation metrics: MetricX, COMET, and BLEU.
MetricX and COMET are pointwise model-based metrics that have been trained for translation tasks. You can use them to evaluate the quality and accuracy of translations of your content, whether the translations are produced by NMT, TranslationLLM, or Gemini models.
You can also use Gemini as a judge model to evaluate your model's output for fluency, coherence, verbosity, and text quality, in combination with MetricX, COMET, or BLEU.
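For example, the following sketch shows how these metrics might be combined in a single evaluation run with the Vertex AI SDK for Python. It assumes an evaluation dataset with source, response, and reference columns; the metric class names, constructor defaults, and import paths (for example, the Comet and MetricX classes and the "bleu" metric string) are assumptions that can vary by SDK version, so treat this as an illustration rather than a definitive setup.

```python
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.preview.evaluation.metrics import pointwise_metric

# Assumed project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# A small illustrative dataset: source text, model translation, and human reference.
eval_dataset = pd.DataFrame(
    {
        "source": ["Der schnelle braune Fuchs springt über den faulen Hund."],
        "response": ["The quick brown fox jumps over the lazy dog."],
        "reference": ["The quick brown fox jumps over the lazy dog."],
    }
)

# Combine model-based translation metrics with a computation-based metric and a
# Gemini-judged pointwise metric (class names and defaults are assumptions;
# check the SDK reference for your installed version).
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        pointwise_metric.Comet(),    # reference-based, 0 to 1, higher is better
        pointwise_metric.MetricX(),  # error-based, 0 to 25, lower is better
        "bleu",                      # computation-based, 0 to 1, higher is better
        MetricPromptTemplateExamples.Pointwise.FLUENCY,  # Gemini as judge
    ],
)

result = eval_task.evaluate()
print(result.summary_metrics)
```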
MetricX is an error-based metric developed by Google that predicts a floating-point score between 0 and 25 representing the quality of a translation. MetricX is available both as a reference-based and a reference-free (QE) method. A lower score is better, because it means the translation contains fewer errors.
COMET employs a reference-based regression approach that provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
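Because MetricX and COMET use opposite score directions and different ranges, it can help to map them onto a common "higher is better" scale when comparing systems side by side. The following minimal sketch does this in plain Python; the normalization scheme and the example scores are illustrative choices, not part of the evaluation service.

```python
def normalize_metricx(score: float) -> float:
    """Map a MetricX score (0 to 25, lower is better) to a 0-1 scale where 1 is best."""
    return 1.0 - (score / 25.0)


def normalize_comet(score: float) -> float:
    """COMET already ranges from 0 to 1, where 1 signifies a perfect translation."""
    return score


# Hypothetical scores for two candidate systems evaluated on the same test set.
systems = {
    "system_a": {"metricx": 3.2, "comet": 0.86},
    "system_b": {"metricx": 7.9, "comet": 0.74},
}

for name, scores in systems.items():
    print(
        name,
        f"MetricX (normalized): {normalize_metricx(scores['metricx']):.2f}",
        f"COMET: {normalize_comet(scores['comet']):.2f}",
    )
```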
BLEU (Bilingual Evaluation Understudy) is a computation-based metric that indicates how similar the candidate text is to the reference text. A BLEU score closer to 1 means the translation is closer to the reference text.
Note that BLEU scores are not recommended for comparing across different corpora and languages. For example, an English to German BLEU score of 50 is not comparable to a Japanese to English BLEU score of 50. Many translation experts have shifted to model-based metric approaches, which have higher correlation with human ratings and are more granular in identifying error scenarios.
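As a point of reference, BLEU can also be computed locally with the open-source sacrebleu package, which is separate from the Gen AI evaluation service. Note that sacrebleu reports scores on a 0 to 100 scale, whereas the service's bleu metric uses 0 to 1; the sentences below are illustrative.

```python
import sacrebleu

# Candidate translations produced by a model, and their reference translations.
hypotheses = [
    "The quick brown fox jumps over the lazy dog.",
    "He went to the store yesterday.",
]
references = [
    [
        "The quick brown fox jumps over the lazy dog.",
        "He went to the shop yesterday.",
    ]
]

# corpus_bleu expects a list of hypotheses and a list of reference streams
# (one inner list per reference set, aligned with the hypotheses).
bleu = sacrebleu.corpus_bleu(hypotheses, references)

# sacrebleu reports BLEU on a 0-100 scale; divide by 100 for a 0-1 scale.
print(f"BLEU: {bleu.score:.2f} (0-100 scale), {bleu.score / 100:.4f} (0-1 scale)")
```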
To learn how to run evaluations for translation models, see Evaluate a translation model.