Evaluate models

After training a model, AutoML Translation uses your TEST set to evaluate the quality and accuracy of the new model. AutoML Translation expresses the model quality by using its BLEU (Bilingual Evaluation Understudy) score, which indicates how similar the candidate text is to the reference text. A higher BLEU score indicates a translation that is closer to the reference text.

Use this data to evaluate your model's readiness. To improve the quality of your model, consider adding more (and more diverse) training segment pairs. After you adjust your dataset, train a new model by using the improved dataset.

Get the model evaluation

  1. Go to the AutoML Translation console.

    Go to the Translation page

  2. From the navigation menu, click Models to view a list of your models.

  3. Click the model to evaluate.

  4. Click the Train tab to see the model's evaluation metrics such as its BLEU score.
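
If you prefer to retrieve the same metrics programmatically, a sketch along the following lines should work with the AutoML Python client library (google-cloud-automl). The project ID, model ID, and region are placeholders, and exact method and field names can vary between client library versions, so treat this as a starting point rather than an authoritative sample.

    from google.cloud import automl

    # Placeholder values; replace with your own project, region, and model ID.
    project_id = "YOUR_PROJECT_ID"
    model_id = "YOUR_MODEL_ID"

    client = automl.AutoMlClient()
    model_full_id = client.model_path(project_id, "us-central1", model_id)

    # Translation model evaluations include BLEU metrics.
    for evaluation in client.list_model_evaluations(parent=model_full_id, filter=""):
        print("Evaluation name:", evaluation.name)
        print("Translation metrics:", evaluation.translation_evaluation_metrics)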

Test model predictions

By using the Google Cloud console, you can compare translation results from your custom model against the default Google NMT model.

  1. Go to the AutoML Translation console.

    Go to the Translation page

  2. From the navigation menu, click Models to view a list of your models.

  3. Click the model to test.

  4. Click the Predict tab.

  5. Add input text in the source language text box.

  6. Click Translate.

    AutoML Translation shows the translation results for the custom model and NMT model.
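
You can also run the same comparison from a script. The following sketch sends a prediction request to your custom model by using the AutoML Python client library (google-cloud-automl); the project ID, model ID, region, and example sentence are placeholder assumptions, and field names can differ between client library versions.

    from google.cloud import automl

    # Placeholder values; replace with your own project, region, and model ID.
    project_id = "YOUR_PROJECT_ID"
    model_id = "YOUR_MODEL_ID"

    prediction_client = automl.PredictionServiceClient()
    model_full_id = automl.AutoMlClient.model_path(project_id, "us-central1", model_id)

    # Wrap the source-language text in the payload that the API expects.
    payload = automl.ExamplePayload(
        text_snippet=automl.TextSnippet(content="The quick brown fox jumps over the lazy dog.")
    )

    response = prediction_client.predict(name=model_full_id, payload=payload)
    print("Custom model translation:", response.payload[0].translation.translated_content.content)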

Evaluate and compare models by using a new test set

From the Google Cloud console, you can reevaluate existing models by using a new set of test data. In a single evaluation, you can include up to 5 different models and then compare their results.

Upload your test data to Cloud Storage as a tab-separated values (TSV) file or as a Translation Memory eXchange (TMX) file.

AutoML Translation evaluates your models against the test set and then produces evaluation scores. You can optionally save the results for each model as a TSV file in a Cloud Storage bucket, where each row has the following format:

Source segment [tab] Model candidate translation [tab] Reference translation

  1. Go to the AutoML Translation console.

    Go to the Translation page

  2. From the navigation menu, click Models to view a list of your models.

  3. Click the model to evaluate.

  4. Click the Evaluate tab.

  5. In the Evaluate tab, click New Evaluation.

  6. Select the models that you want to evaluate and compare, and then click Next.

    The current model must be selected. The Google NMT model is selected by default; you can deselect it.

  7. Specify a Test set name to help you distinguish this evaluation from other evaluations, and then select your new test set from Cloud Storage.

  8. Click Next.

  9. Optionally, to export predictions, specify a Cloud Storage destination folder.

  10. Click Start evaluation.

    AutoML Translation presents evaluation scores in a table format in the console after the evaluation is done. You can run only one evaluation at a time. If you specified a folder to store prediction results, AutoML Translation writes TSV files to that location that are named with the associated model ID, appended with the test set name.
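
If you exported prediction results, you can inspect them with a short script. The sketch below reads one exported results file with Python's standard csv module; the file name is hypothetical (actual files are named with the model ID and test set name, as described above).

    import csv

    # Hypothetical file name; adjust to match the files written to your destination folder.
    results_file = "model-id_test-set-name.tsv"

    # Each row: source segment <tab> model candidate translation <tab> reference translation.
    with open(results_file, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for source, candidate, reference in reader:
            print("source:   ", source)
            print("candidate:", candidate)
            print("reference:", reference)
            print()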

Understanding the BLEU Score

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high-quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translation (which indicates lower quality), while a value of 1 means there is perfect overlap with the reference translation (which indicates higher quality).

It has been shown that BLEU scores correlate well with human judgment of translation quality. Note that even human translators do not achieve a perfect score of 1.0.

AutoML Translation expresses BLEU scores as a percentage rather than a decimal between 0 and 1.

Interpretation

Trying to compare BLEU scores across different corpora and languages is strongly discouraged. Even comparing BLEU scores for the same corpus but with different numbers of reference translations can be highly misleading.

However, as a rough guideline, the following interpretation of BLEU scores (expressed as percentages rather than decimals) might be helpful.

BLEU Score Interpretation
< 10 Almost useless
10 - 19 Hard to get the gist
20 - 29 The gist is clear, but has significant grammatical errors
30 - 40 Understandable to good translations
40 - 50 High quality translations
50 - 60 Very high quality, adequate, and fluent translations
> 60 Quality often better than human

The following color gradient can be used as a general scale interpretation of the BLEU score:

[Image: color gradient labeled "General interpretability of scale"]

The mathematical details

Mathematically, the BLEU score is defined as:

$$ \text{BLEU} = \underbrace{\vphantom{\prod_i^4}\min\Big(1, \exp\big(1-\frac{\text{reference-length}} {\text{output-length}}\big)\Big)}_{\text{brevity penalty}} \underbrace{\Big(\prod_{i=1}^{4} precision_i\Big)^{1/4}}_{\text{n-gram overlap}} $$

with

\[ precision_i = \dfrac{\sum_{\text{snt}\in\text{Cand-Corpus}}\sum_{i\in\text{snt}}\min(m^i_{cand}, m^i_{ref})}{w_t^i} \]

where

  • \(m_{cand}^i\) is the count of i-grams in the candidate translation that match the reference translation
  • \(m_{ref}^i\) is the count of i-grams in the reference translation
  • \(w_t^i = \sum_{\text{snt'}\in\text{Cand-Corpus}}\sum_{i'\in\text{snt'}} m^{i'}_{cand}\) is the total number of i-grams in the candidate translation

The formula consists of two parts: the brevity penalty and the n-gram overlap.

  • Brevity Penalty
    The brevity penalty penalizes generated translations that are too short compared to the closest reference length, with an exponential decay. The brevity penalty compensates for the fact that the BLEU score has no recall term. A short numeric sketch after this list makes the decay concrete.

  • N-Gram Overlap
    The n-gram overlap counts how many unigrams, bigrams, trigrams, and four-grams (i=1,...,4) match their n-gram counterpart in the reference translations. This term acts as a precision metric. Unigrams account for adequacy while longer n-grams account for fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference (\(m_{ref}^n\)).
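
To make the brevity penalty concrete, the following minimal Python sketch evaluates the min(1, exp(1 - reference-length/output-length)) term for a few candidate lengths against a 13-token reference (the reference length used in the second example below).

    import math

    def brevity_penalty(reference_length, output_length):
        # min(1, exp(1 - reference-length / output-length)), as defined in the formula above
        return min(1.0, math.exp(1 - reference_length / output_length))

    print(brevity_penalty(13, 13))  # 1.0  -> candidate matches the reference length; no penalty
    print(brevity_penalty(13, 11))  # 0.83 -> slightly short candidate; mild penalty
    print(brevity_penalty(13, 7))   # 0.42 -> much shorter candidate; heavy penalty
    print(brevity_penalty(13, 20))  # 1.0  -> overly long candidates are not penalized by this term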

Examples

Calculating \(precision_1\)

Consider this reference sentence and candidate translation:

Reference: the cat is on the mat
Candidate: the the the cat mat

The first step is to count the occurrences of each unigram in the reference and the candidate. Note that the BLEU metric is case-sensitive.

Unigram \(m_{cand}^i\) \(m_{ref}^i\) \(\min(m^i_{cand}, m^i_{ref})\)
the 3 2 2
cat 1 1 1
is 0 1 0
on 0 1 0
mat 1 1 1

The total number of unigrams in the candidate (\(w_t^1\)) is 5, so \(precision_1\) = (2 + 1 + 1)/5 = 0.8.
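
The same clipped counting can be written in a few lines of Python. This is only an illustration of the arithmetic above, not a library call.

    from collections import Counter

    reference = "the cat is on the mat".split()
    candidate = "the the the cat mat".split()

    cand_counts = Counter(candidate)  # {'the': 3, 'cat': 1, 'mat': 1}
    ref_counts = Counter(reference)   # {'the': 2, 'cat': 1, 'is': 1, 'on': 1, 'mat': 1}

    # Clip each candidate count to its count in the reference, then sum the matches.
    matches = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    precision_1 = matches / len(candidate)
    print(precision_1)  # (2 + 1 + 1) / 5 = 0.8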

Calculating the BLEU score

Reference:     The NASA Opportunity rover is battling a massive dust storm on Mars .
Candidate 1: The Opportunity rover is combating a big sandstorm on Mars .
Candidate 2: A NASA rover is fighting a massive storm on Mars .

The above example consists of a single reference translation and two candidate translations. As depicted above, the sentences are tokenized prior to computing the BLEU score; for example, the final period is counted as a separate token.

To compute the BLEU score for each translation, we compute the following statistics.

  • N-Gram Precisions
    The following table lists the n-gram precisions for both candidates.
  • Brevity Penalty
    The brevity penalty is the same for candidate 1 and candidate 2 because both sentences consist of 11 tokens.
  • BLEU Score
    Note that at least one matching 4-gram is required to get a BLEU score greater than 0. Because candidate translation 1 has no matching 4-gram, it has a BLEU score of 0.

Metric Candidate 1 Candidate 2
\(precision_1\) (1-gram) 8/11 9/11
\(precision_2\) (2-gram) 4/10 5/10
\(precision_3\) (3-gram) 2/9 2/9
\(precision_4\) (4-gram) 0/8 1/8
Brevity penalty 0.83 0.83
BLEU score 0.0 0.27
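
The numbers in the table can be reproduced with a short, self-contained Python sketch that follows the formula above. It scores a single segment for illustration only; as the next section explains, BLEU is designed to be accumulated over an entire corpus.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=4):
        # Single-segment BLEU, for illustration only; real BLEU is a corpus-level metric.
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clipped matches: each candidate n-gram counts at most as often as it occurs in the reference.
            matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
            precisions.append(matches / max(len(cand) - n + 1, 1))
        brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
        return brevity_penalty * math.prod(precisions) ** (1 / max_n)

    reference = "The NASA Opportunity rover is battling a massive dust storm on Mars ."
    print(round(bleu("The Opportunity rover is combating a big sandstorm on Mars .", reference), 2))  # 0.0
    print(round(bleu("A NASA rover is fighting a massive storm on Mars .", reference), 2))  # 0.27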

Properties

  • BLEU is a Corpus-based Metric
    The BLEU metric performs badly when used to evaluate individual sentences. For example, both candidate sentences in the preceding example get very low BLEU scores even though they capture most of the meaning. Because n-gram statistics for individual sentences are less meaningful, BLEU is by design a corpus-based metric; that is, statistics are accumulated over an entire corpus when computing the score. Note that the BLEU metric defined above cannot be factorized for individual sentences.

  • No distinction between content and function words
    The BLEU metric does not distinguish between content and function words, that is, a dropped function word like "a" gets the same penalty as if the name "NASA" were erroneously replaced with "ESA".

  • Not good at capturing meaning and grammaticality of a sentence
    The drop of a single word like "not" can change the polarity of a sentence. Also, taking only n-grams into account with n≤4 ignores long-range dependencies and thus BLEU often imposes only a small penalty for ungrammatical sentences.

  • Normalization and Tokenization
    Prior to computing the BLEU score, both the reference and candidate translations are normalized and tokenized. The choice of normalization and tokenization steps significantly affects the final BLEU score.
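
As a small illustration of how much tokenization alone can matter, the following sketch computes unigram precision for the same (made-up) sentence pair under two tokenization choices: keeping the final period attached to the last word versus splitting it off as its own token.

    from collections import Counter

    def unigram_precision(candidate_tokens, reference_tokens):
        cand_counts, ref_counts = Counter(candidate_tokens), Counter(reference_tokens)
        matches = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
        return matches / len(candidate_tokens)

    reference = "the rover is on Mars."
    candidate = "the rover lands on Mars."

    def tokenize(text):
        # Split the final period off as a separate token.
        return text.replace(".", " .").split()

    # Naive whitespace split: "Mars." stays one token.
    print(unigram_precision(candidate.split(), reference.split()))      # 4/5 = 0.8

    # Period split off as its own token.
    print(unigram_precision(tokenize(candidate), tokenize(reference)))  # 5/6 ~= 0.83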