Prepare training data

Cloud Translation trains custom models by using matching pairs of segments in the source and target languages. It treats each segment pair as an independent training item, without assuming any correlation between separate pairs.

The segment pairs that are used to train your custom model must be in the tab-separated values (.tsv) or Translation Memory eXchange (.tmx) format. For more information, see Prepare example translations.
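
As a rough illustration of the tab-separated layout, the following Python sketch writes one segment pair per line, with the source segment and the target segment separated by a single tab. The helper names (`normalize`, `write_tsv`) and the whitespace clean-up are assumptions made for this example, not part of Cloud Translation; see Prepare example translations for the exact file requirements.

```python
import io


def normalize(segment: str) -> str:
    """Collapse tabs, newlines, and repeated spaces so a segment fits on one TSV line."""
    return " ".join(segment.split())


def write_tsv(pairs, handle) -> int:
    """Write (source, target) pairs as 'source<TAB>target' lines; return the number written."""
    written = 0
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        # Assumption for this sketch: skip pairs where either side is empty.
        if not source or not target:
            continue
        handle.write(f"{source}\t{target}\n")
        written += 1
    return written


# Hypothetical example data (English to Spanish).
buffer = io.StringIO()
written = write_tsv(
    [
        ("Hello world", "Hola mundo"),
        ("Good\tmorning", "Buenos días"),  # embedded tab is replaced by a space
        ("", "vacío"),                     # empty source, skipped
    ],
    buffer,
)
```

To produce a file instead of an in-memory buffer, pass a handle opened with `open(path, "w", encoding="utf-8")`.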

Segment pairs are always de-duplicated across all imported pairs. A segment pair is a duplicate of another pair when both have the same source segment, even if their target segments differ. Cloud Translation doesn't allow you to import files with the same content.
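
To preview how many pairs might survive de-duplication, you can apply the same rule locally before importing. This is a minimal sketch, assuming exact string comparison of source segments and assuming that the first occurrence is the one that is kept; the text above does not say which duplicate the service retains, and `dedupe_by_source` is a hypothetical helper.

```python
def dedupe_by_source(pairs):
    """Keep only the first pair seen for each distinct source segment."""
    seen_sources = set()
    unique = []
    for source, target in pairs:
        if source in seen_sources:
            # Same source segment as an earlier pair: treated as a duplicate,
            # regardless of what the target segment says.
            continue
        seen_sources.add(source)
        unique.append((source, target))
    return unique


# Hypothetical example data.
pairs = [
    ("Hello", "Hola"),
    ("Hello", "Buenos días"),  # duplicate source, different target
    ("Goodbye", "Adiós"),
]
unique_pairs = dedupe_by_source(pairs)
```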

Data split

AutoML Translation uses the segment pairs that you provide for different purposes while creating your custom model:

  • Train - Segment pairs to train the model. Allocate most of your data for this purpose.
  • Validation - Segment pairs to validate the results that the model returns during training.
  • Test - Segment pairs to generate the final evaluation metrics of your model. These metrics indicate how the model might perform in production.

You can control which segment pairs AutoML Translation uses for each purpose by uploading separate files for the training, validation, and testing sets. If you don't explicitly specify which files to use for these three purposes, AutoML Translation automatically and randomly divides your segment pairs into three sets: approximately 80% of your data for training, 10% for validation, and 10% for testing. The validation and testing sets can each contain a maximum of 10,000 segment pairs. Any segment pairs beyond that limit are assigned to the training set.
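
The automatic split can be approximated locally to estimate how large each set is likely to be. The following sketch is only an illustration of the 80/10/10 rule and the 10,000-pair cap; the service's actual rounding and random assignment are not described here, and `auto_split`, its `seed` parameter, and the integer-division rounding are assumptions.

```python
import random

MAX_HOLDOUT = 10_000  # documented cap for the validation set and for the testing set


def auto_split(pairs, seed=0):
    """Shuffle pairs, then return (train, validation, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)

    # About 10% each for validation and testing, capped at 10,000 pairs each.
    holdout = min(len(shuffled) // 10, MAX_HOLDOUT)

    validation = shuffled[:holdout]
    test = shuffled[holdout:2 * holdout]
    # Everything else, including any overflow past the cap, goes to training.
    train = shuffled[2 * holdout:]
    return train, validation, test


small = auto_split(range(1_000))
large = auto_split(range(200_000))
small_sizes = tuple(len(part) for part in small)  # (800, 100, 100)
large_sizes = tuple(len(part) for part in large)  # (180000, 10000, 10000)
```

With 200,000 pairs, a plain 10% share would be 20,000 pairs per held-out set, so the cap moves 10,000 pairs from each of those sets into training.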

If you do multiple data imports into the same dataset, you can manually specify the data split for one import and use the automatic split for another. Data is always re-balanced with respect to your manual division after each import and file deletion.

Data requirements

Your training data must conform to the following requirements:

  • If you let AutoML Translation automatically split your data, you must submit at least 1,000 segment pairs to train a custom model.
  • If you manually split your data, you must provide at least three segment pairs for the TRAIN set, and you must have at least 100 segment pairs each for the VALIDATION and TEST sets.
  • You cannot provide more than 10,000 segment pairs each for the VALIDATION and TEST sets.
  • Your dataset cannot exceed the maximum of 15 million segment pairs.
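
The requirements above can be checked before you upload anything. This is a minimal sketch that works on pair counts only; the function names and message strings are made up for this example, and the thresholds are copied from the list above.

```python
MIN_AUTO_SPLIT = 1_000      # minimum pairs when the split is automatic
MIN_TRAIN = 3               # minimum TRAIN pairs when the split is manual
MIN_HOLDOUT = 100           # minimum pairs each for VALIDATION and TEST
MAX_HOLDOUT = 10_000        # maximum pairs each for VALIDATION and TEST
MAX_DATASET = 15_000_000    # maximum pairs in the whole dataset


def check_auto_split(total):
    """Return a list of problems for a dataset that relies on the automatic split."""
    problems = []
    if total < MIN_AUTO_SPLIT:
        problems.append(f"need at least {MIN_AUTO_SPLIT} pairs, got {total}")
    if total > MAX_DATASET:
        problems.append(f"dataset exceeds {MAX_DATASET} pairs")
    return problems


def check_manual_split(train, validation, test):
    """Return a list of problems for manually split TRAIN/VALIDATION/TEST counts."""
    problems = []
    if train < MIN_TRAIN:
        problems.append(f"TRAIN needs at least {MIN_TRAIN} pairs, got {train}")
    for name, count in (("VALIDATION", validation), ("TEST", test)):
        if count < MIN_HOLDOUT:
            problems.append(f"{name} needs at least {MIN_HOLDOUT} pairs, got {count}")
        elif count > MAX_HOLDOUT:
            problems.append(f"{name} allows at most {MAX_HOLDOUT} pairs, got {count}")
    if train + validation + test > MAX_DATASET:
        problems.append(f"dataset exceeds {MAX_DATASET} pairs")
    return problems


ok = check_manual_split(5_000, 500, 500)      # no problems
bad = check_manual_split(2, 50, 20_000)       # three problems
too_small = check_auto_split(999)             # one problem
```

An empty list means the counts satisfy every requirement listed above; it says nothing about the quality of the data.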

Data recommendations

The following recommendations can help you increase the quality of your model:

  • Use at least 5,000 segment pairs for TRAIN, 500 segment pairs for VALIDATION, and 500 segment pairs for TEST. These are minimums; use more data if possible. Having more data for the TRAIN set helps the model learn patterns, and having more data for the VALIDATION and TEST sets helps verify that the model generalizes to a wider variety of scenarios in your domain.
  • Keep segments to roughly 200 words or fewer. AutoML Translation might drop segment pairs that are longer than that. For more information, see Import issues.
  • Fix common source data issues, as described in the "Clean up messy data" part in the data preparation section of the overview.
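
To find segment pairs that risk being dropped for length, you can flag long segments before importing. This sketch assumes that words are counted by splitting on whitespace and that the limit applies to both the source and the target segment; neither detail is stated above, and whitespace splitting is a poor fit for languages that don't separate words with spaces. `partition_by_length` is a hypothetical helper.

```python
MAX_WORDS = 200  # rough limit from the recommendation above


def word_count(segment: str) -> int:
    """Count whitespace-separated tokens (an approximation of 'words')."""
    return len(segment.split())


def partition_by_length(pairs, max_words=MAX_WORDS):
    """Split pairs into (kept, too_long) based on the longer side of each pair."""
    kept, too_long = [], []
    for source, target in pairs:
        if max(word_count(source), word_count(target)) <= max_words:
            kept.append((source, target))
        else:
            too_long.append((source, target))
    return kept, too_long


# Hypothetical example data: one short pair and one pair with a 201-word source.
long_source = " ".join(["word"] * 201)
kept, too_long = partition_by_length(
    [("Hello world", "Hola mundo"), (long_source, "texto largo")]
)
```

Pairs in `too_long` are candidates for splitting into shorter segments, rather than leaving them for the import step to reject.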

What's next