Preparing training data

AutoML Translation trains custom models using matching pairs of sentences in the source and target languages. It treats each sentence pair as an independent training item, without assuming any correlation between separate pairs.

The sentence pairs used to train your custom model must be in tab-separated values (.tsv) or Translation Memory eXchange (.tmx) format. You can batch multiple .tsv and .tmx files into a comma-separated values (.csv) file. You can import individual .tsv or .tmx files by using the Google Cloud console. If you use the AutoML API, you can only use .csv files.

AutoML Translation always de-duplicates sentence pairs across all imported data. Two sentence pairs are duplicates of each other when their source sentences match. In addition, AutoML Translation doesn't allow you to import files with the same content.
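To preview how source-based de-duplication might affect your data before you import it, you can apply a similar rule locally. The following is a minimal sketch, not the service's implementation. The function name dedupe_by_source is hypothetical, and two details are assumptions: that matching is an exact string comparison, and that the first pair seen for a given source sentence is the one kept.

```python
def dedupe_by_source(pairs):
    """Keep one pair per distinct source sentence.

    Assumptions (not confirmed by the documentation): sources are compared
    as exact strings, and the first occurrence is the one that survives.
    """
    seen_sources = set()
    unique_pairs = []
    for source, target in pairs:
        if source in seen_sources:
            continue  # same source sentence already kept
        seen_sources.add(source)
        unique_pairs.append((source, target))
    return unique_pairs


pairs = [
    ("Tomorrow it will rain.", "Morgen wird es regnen."),
    ("Tomorrow it will rain.", "Morgen regnet es."),
    ("It's a beautiful day.", "Es ist ein schöner Tag."),
]
print(len(dedupe_by_source(pairs)))
```

Comparing the input and output lengths gives a rough idea of how many pairs count toward the minimums after de-duplication.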

For a list of supported language pairs, see Language support for custom models.

Data split

AutoML Translation uses the sentence pairs that you provide to train, validate, and test your custom model.

  • TRAIN - Use the sentence pairs to train the model.
  • VALIDATION - Use the sentence pairs to validate the results that the model returns during training.
  • TEST - Use the sentence pairs to verify the model's results after the model has been trained.

You can control which sentence pairs AutoML Translation uses for each purpose by uploading separate files for the training, validation, and testing sets. If you don't explicitly specify which files to use for these three purposes, AutoML Translation randomly divides your sentence pairs into three sets: approximately 80% for training, 10% for validation, and 10% for testing.

The validation and testing sets can each hold a maximum of 10,000 sentence pairs. Any sentence pairs beyond that limit are moved to the training set.
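The arithmetic of the automatic split can be sketched as follows, to estimate set sizes for a dataset of a given size. This is an illustration only: the function name split_sizes is hypothetical, and the rounding behavior (integer division) is an assumption, since the documentation describes the percentages as approximate.

```python
MAX_HOLDOUT = 10_000  # cap for each of the validation and testing sets


def split_sizes(total_pairs):
    """Estimate (train, validation, test) sizes for an automatic split.

    Assumption: 10% is computed with integer division; the service's exact
    rounding is not documented.
    """
    validation = min(total_pairs // 10, MAX_HOLDOUT)
    test = min(total_pairs // 10, MAX_HOLDOUT)
    # Everything not held out, including overflow past the cap, trains the model.
    train = total_pairs - validation - test
    return train, validation, test


print(split_sizes(1_000))    # (800, 100, 100)
print(split_sizes(500_000))  # (480000, 10000, 10000)
```

In the second call, 10% of the data would be 50,000 pairs per held-out set, so the cap applies and the remaining pairs go to training.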

If you do multiple data imports into the same dataset, you can manually specify the data split for one import and use the automatic split for another. Data is always re-balanced with respect to your manual division after each import and file deletion.

Data requirements

Your training data must conform to the following requirements:

  • If you let AutoML Translation automatically split your data, you must submit at least 1,000 sentence pairs to train a custom model.
  • If you manually split your data, you must provide at least three sentence pairs for the TRAIN set, and you must have at least 100 sentence pairs each for the VALIDATION and TEST sets.
  • You cannot provide more than 10,000 sentence pairs each for the VALIDATION set or TEST set.
  • Your dataset cannot exceed the maximum of 15 million sentence pairs.
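The requirements above can be checked locally before you start an import. The sketch below encodes only the limits listed in this section; the function names check_auto_split and check_manual_split are hypothetical, and the counts are assumed to be taken after de-duplication.

```python
MAX_DATASET = 15_000_000  # maximum sentence pairs in a dataset
MIN_AUTO_TOTAL = 1_000    # minimum pairs when the split is automatic
MIN_TRAIN = 3             # minimum TRAIN pairs for a manual split
MIN_HOLDOUT = 100         # minimum pairs each for VALIDATION and TEST
MAX_HOLDOUT = 10_000      # maximum pairs each for VALIDATION and TEST


def check_auto_split(total_pairs):
    """Return a list of problems for an automatically split dataset."""
    problems = []
    if total_pairs < MIN_AUTO_TOTAL:
        problems.append(f"need at least {MIN_AUTO_TOTAL} sentence pairs")
    if total_pairs > MAX_DATASET:
        problems.append(f"dataset exceeds {MAX_DATASET} sentence pairs")
    return problems


def check_manual_split(train, validation, test):
    """Return a list of problems for a manually split dataset."""
    problems = []
    if train < MIN_TRAIN:
        problems.append(f"TRAIN needs at least {MIN_TRAIN} sentence pairs")
    for name, count in (("VALIDATION", validation), ("TEST", test)):
        if count < MIN_HOLDOUT:
            problems.append(f"{name} needs at least {MIN_HOLDOUT} sentence pairs")
        if count > MAX_HOLDOUT:
            problems.append(f"{name} exceeds {MAX_HOLDOUT} sentence pairs")
    if train + validation + test > MAX_DATASET:
        problems.append(f"dataset exceeds {MAX_DATASET} sentence pairs")
    return problems


print(check_manual_split(train=5_000, validation=50, test=20_000))
```

An empty list means the counts satisfy every requirement in this section.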

Data recommendations

The following recommendations can help you increase the quality of your training dataset:

  • Use at least 5,000 sentence pairs for TRAIN, 500 sentence pairs for VALIDATION, and 500 sentence pairs for TEST. That said, use more data if possible. Having more data for the TRAIN set helps the model learn patterns, and having more data for the VALIDATION and TEST sets helps verify that the model generalizes to a wider variety of scenarios in your domain.
  • Keep sentences to roughly 200 words or fewer. AutoML Translation might drop sentence pairs that are longer than that. For more information, see Import issues.
  • Fix common data issues. For more information, see the "Clean up messy data" section in the data preparation beginner's guide.

Tab-separated values (.tsv)

AutoML Translation supports tab-separated files, where each row has this format:

  • Source sentence tab Translated sentence

For example, where \t stands for a single tab character:

It's a beautiful day.\tEs ist ein schöner Tag.
Tomorrow it will rain.\tMorgen wird es regnen.

All text in a .tsv file must be plain text. If the text includes HTML tags or other markup, AutoML Translation treats the markup as plain text.

The tab-separated source data does not include language codes to identify the source and target languages. You specify the source and target language codes when you describe the model to be trained. AutoML Translation interprets the first segment in each row as the source language and the second segment as the target language. In the example above, the source is English and the target is German.
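The row format described in this section can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names tsv_rows and write_tsv are hypothetical, UTF-8 is assumed as the file encoding (this section does not state one for .tsv files), and sentences that contain a tab or line break are rejected because they would break the two-column layout.

```python
def tsv_rows(pairs):
    """Yield one 'source<TAB>target' line per sentence pair."""
    for source, target in pairs:
        for text in (source, target):
            # A tab or newline inside a sentence would corrupt the row.
            if "\t" in text or "\n" in text or "\r" in text:
                raise ValueError(f"sentence contains a tab or line break: {text!r}")
        yield f"{source}\t{target}\n"


def write_tsv(pairs, path):
    """Write sentence pairs to a .tsv file (UTF-8 is an assumption)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.writelines(tsv_rows(pairs))


rows = list(tsv_rows([("Tomorrow it will rain.", "Morgen wird es regnen.")]))
print(repr(rows[0]))
```

The source sentence always goes first in each pair, because the file itself carries no language codes.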

Translation Memory eXchange (.tmx)

Translation Memory eXchange (TMX) is a standard XML format for providing source and target translation sentences. AutoML Translation supports input files in a format based on TMX version 1.4. This example illustrates the required structure:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header segtype="sentence" o-tmf="UTF-8"
  adminlang="en" srclang="en" datatype="PlainText"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>It's a beautiful day.</seg>
      </tuv>
      <tuv xml:lang="de">
        <seg>Es ist ein schöner Tag.</seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg>Tomorrow it will rain.</seg>
      </tuv>
      <tuv xml:lang="de">
        <seg>Morgen wird es regnen.</seg>
      </tuv>
    </tu>
  </body>
</tmx>

The <header> element of a well-formed .tmx file must identify the source language using the srclang attribute, and every <tuv> element must identify the language of the contained text using the xml:lang attribute.

All <tu> elements must contain a pair of <tuv> elements with the same source and target languages. If a <tu> element contains more than two <tuv> elements, AutoML Translation processes only the first <tuv> matching the source language and the first matching the target language and ignores the rest. If a <tu> element does not have a matching pair of <tuv> elements, AutoML Translation skips over the invalid <tu> element.

AutoML Translation strips the markup tags from around a <seg> element before processing it. If a <tuv> element contains more than one <seg> element, AutoML Translation concatenates their text into a single segment, with a space between each.

If the file contains XML tags other than those shown above, AutoML Translation ignores them.

If the file does not conform to proper XML and TMX format – for example, if it is missing an end tag or a <tmx> element – AutoML Translation aborts processing it. AutoML Translation also aborts processing if it skips more than 1024 invalid <tu> elements.
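To check how many sentence pairs a .tmx file is likely to yield, the rules in this section can be approximated with Python's standard xml.etree.ElementTree module. This is a sketch of the described behavior, not the service's parser. The function name extract_pairs is hypothetical, language codes are assumed to match by exact string comparison, and the text of any inline elements inside a <seg> is assumed to be kept.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
MAX_SKIPPED = 1024  # limit on invalid <tu> elements stated in this section


def extract_pairs(tmx_text, target_lang):
    """Return (pairs, skipped) for a TMX document given as a string.

    Malformed XML raises ET.ParseError, which loosely mirrors the
    documented behavior of aborting on files that are not proper XML.
    """
    root = ET.fromstring(tmx_text)
    source_lang = root.find("header").get("srclang")
    pairs, skipped = [], 0
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            # Only the first <tuv> per language is used; later ones are ignored.
            if lang in (source_lang, target_lang) and lang not in texts:
                segments = ["".join(seg.itertext()).strip() for seg in tuv.findall("seg")]
                texts[lang] = " ".join(segments)
        if source_lang in texts and target_lang in texts:
            pairs.append((texts[source_lang], texts[target_lang]))
        else:
            skipped += 1
            if skipped > MAX_SKIPPED:
                raise ValueError("too many invalid <tu> elements")
    return pairs, skipped
```

Calling extract_pairs(text, "de") on the example file above would be expected to return two pairs and zero skipped elements, assuming the file's contents are passed in as a string.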

Comma-separated values (.csv)

To upload sentence pairs using the AutoML API, create a comma-separated values (.csv) file that lists the .tsv and .tmx files to use and, optionally, indicates which set (training, validation, or testing) each file's sentence pairs belong to. The .csv file can have any name, but it must be UTF-8 encoded and must end with a .csv extension. The file has one row for each .tsv or .tmx file you are uploading, with two columns in each row:

  • Which set to assign the sentence pairs in this file to. This field is optional and can be one of these values:

    • TRAIN
    • VALIDATION
    • TEST
    • UNASSIGNED

      If a file is specified as UNASSIGNED, AutoML Translation automatically splits its sentence pairs to ensure that there is enough training, validation, and testing content.

  • The full path to a .tsv or .tmx document containing sentence pairs.

For example, you might have the following in your .csv file:

TRAIN,gs://my-project-vcm/csv/en-fr-train.tsv
VALIDATION,gs://my-project-vcm/csv/en-fr-validation.tsv
TEST,gs://my-project-vcm/csv/en-fr-test.tsv
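A file like the one above can be generated with Python's standard csv module. This is a minimal sketch: the function name build_import_csv is hypothetical, and it always writes an explicit set value in the first column (using UNASSIGNED for an automatic split) because this section does not show what a row looks like when the optional field is omitted.

```python
import csv
import io

ALLOWED_SETS = {"TRAIN", "VALIDATION", "TEST", "UNASSIGNED"}


def build_import_csv(entries):
    """Return CSV text with one 'SET,path' row per (set_name, path) entry."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for set_name, path in entries:
        if set_name not in ALLOWED_SETS:
            raise ValueError(f"unknown set: {set_name!r}")
        if not path.endswith((".tsv", ".tmx")):
            raise ValueError(f"expected a .tsv or .tmx file: {path!r}")
        writer.writerow([set_name, path])
    return buffer.getvalue()


text = build_import_csv([
    ("TRAIN", "gs://my-project-vcm/csv/en-fr-train.tsv"),
    ("VALIDATION", "gs://my-project-vcm/csv/en-fr-validation.tsv"),
    ("TEST", "gs://my-project-vcm/csv/en-fr-test.tsv"),
])
print(text, end="")
```

When saving the result, write it with UTF-8 encoding and a .csv extension, as required above.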