Custom translations overview

The default Google Neural Machine Translation (NMT) model covers a wide range of languages and works well for general-purpose text. However, in cases where you're translating domain-specific or style-sensitive text, custom translations can help you get more relevant translations.

Custom translations require you to provide your own example translations. Then, Cloud Translation can generate results that closely follow the style, tone, and vocabulary of your examples.

Cloud Translation provides two solutions for requesting custom translations: AutoML Translation, which trains custom models, and adaptive translation, which uses Google's large language models (LLMs). Each feature has its own data requirements, set of supported languages, and pricing.

AutoML Translation

With AutoML Translation, you import your data to train custom models that you own and maintain. After building a custom model, you can request translations that use your model instead of the default NMT model. Compared to adaptive translation, custom models work well for domain-specific text where getting the correct terminology is your highest priority. Custom models also require larger datasets for training than adaptive translation does.

You are charged for model training time and for the number of input characters that you send for translation.

Adaptive translation

Adaptive translation uses LLMs combined with small datasets to provide high-quality translations, often on par with AutoML Translation custom models. You don't train or maintain any models. Compared to custom models, adaptive translation works well for getting responses that are similar in style, tone, and voice to your example translations.

For adaptive translation, you are charged for the number of input and output characters.

Prepare example translations

Prepare example translations as segment pairs. Each pair consists of one sentence in the source language and the corresponding sentence translated into the target language. Save these segment pairs in a tab-separated values (TSV) file or a Translation Memory eXchange (TMX) file.

Choose examples that represent the linguistic domain of the content that you plan to translate. For additional guidance, see the Data preparation section in the AutoML Translation beginner's guide.

TSV

For tab-separated files, each row has the following format:

  • Source segment tab Translated segment

Don't include a header row with language codes to identify the source and target languages. You specify these languages when you create a dataset. The following example includes segment pairs for English-to-German translations, where \t stands for a single tab character:

It's a beautiful day.\tEs ist ein schöner Tag.
Tomorrow it will rain.\tMorgen wird es regnen.

All content in a TSV file must be plain text. If the text includes HTML tags or other markup, Cloud Translation treats the markup as plain text.
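The TSV rules above (one source segment, a tab, one translated segment, and no header row) can be sketched in a few lines of Python. This is a minimal illustration and not part of any Cloud Translation client library: the function name write_segment_pairs is hypothetical, and rejecting segments that contain tabs or line breaks is an assumption made so that every row keeps exactly two columns.

```python
import io


def write_segment_pairs(pairs, out):
    """Write (source, target) pairs as tab-separated rows without a header.

    Hypothetical helper for illustration only. Returns the number of rows written.
    """
    count = 0
    for source, target in pairs:
        for segment in (source, target):
            # Assumption: a tab or line break inside a segment would corrupt
            # the two-column layout, so such segments are rejected up front.
            if "\t" in segment or "\n" in segment or "\r" in segment:
                raise ValueError(f"segment contains a tab or line break: {segment!r}")
        out.write(f"{source}\t{target}\n")
        count += 1
    return count


pairs = [
    ("It's a beautiful day.", "Es ist ein schöner Tag."),
    ("Tomorrow it will rain.", "Morgen wird es regnen."),
]

buffer = io.StringIO()
rows_written = write_segment_pairs(pairs, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file object opened in text mode, for example open("pairs.tsv", "w", encoding="utf-8", newline=""). The file name and the UTF-8 encoding are assumptions of this sketch.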

TMX

TMX is a standard XML format for providing source and target translation segments. Cloud Translation supports input files in a format based on TMX version 1.4. The following example illustrates the required structure:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
  <header segtype="sentence" o-tmf="UTF-8"
  adminlang="en" srclang="en" datatype="PlainText"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>It's a beautiful day.</seg>
      </tuv>
      <tuv xml:lang="de">
        <seg>Es ist ein schöner Tag.</seg>
      </tuv>
    </tu>
    <tu>
      <tuv xml:lang="en">
        <seg>Tomorrow it will rain.</seg>
      </tuv>
      <tuv xml:lang="de">
        <seg>Morgen wird es regnen.</seg>
      </tuv>
    </tu>
  </body>
</tmx>

The <header> element of a well-formed TMX file must identify the source language by using the srclang attribute, and every <tuv> element must identify the language of the contained text using the xml:lang attribute.
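A file with this structure can also be generated programmatically. The following sketch uses Python's standard xml.etree.ElementTree module and copies the header attributes from the example above. The function name build_tmx is hypothetical, and the sketch does not emit the DOCTYPE line, which ElementTree does not write by itself.

```python
import xml.etree.ElementTree as ET

# ElementTree addresses xml:lang through the reserved XML namespace URI
# and serializes it back with the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, source_lang, target_lang):
    """Build a TMX 1.4 tree from (source, target) pairs (hypothetical helper)."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Header attributes are copied from the example in this section;
    # srclang identifies the source language.
    ET.SubElement(tmx, "header", {
        "segtype": "sentence",
        "o-tmf": "UTF-8",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "PlainText",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        # One <tuv> per language, each tagged with xml:lang.
        for lang, text in ((source_lang, source), (target_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


tmx_root = build_tmx(
    [
        ("It's a beautiful day.", "Es ist ein schöner Tag."),
        ("Tomorrow it will rain.", "Morgen wird es regnen."),
    ],
    source_lang="en",
    target_lang="de",
)
tmx_bytes = ET.tostring(tmx_root, encoding="utf-8", xml_declaration=True)
```

ElementTree escapes special characters such as & and < in segment text, so the pairs can be passed in as ordinary strings.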

All <tu> elements must contain a pair of <tuv> elements with the same source and target languages. If a <tu> element contains more than two <tuv> elements, Cloud Translation processes only the first <tuv> matching the source language and the first matching the target language and ignores the rest. If a <tu> element does not have a matching pair of <tuv> elements, Cloud Translation skips over the invalid <tu> element.

Cloud Translation strips the markup tags from around a <seg> element before processing it. If a <tuv> element contains more than one <seg> element, Cloud Translation concatenates their text into a single element with a space between them.

If the file contains XML tags other than those shown earlier, Cloud Translation ignores them.

If the file does not conform to proper XML and TMX format – for example, if it is missing an end tag or a <tmx> element – Cloud Translation aborts processing it. Cloud Translation also aborts processing if it skips more than 1024 invalid <tu> elements.
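Before uploading, it can help to run a local pre-check that approximates the rules described above. The following Python sketch is not Cloud Translation's implementation: the function name extract_pairs is hypothetical, language codes are compared case-insensitively as an assumption, and inline markup inside a <seg> element is dropped as an assumption. It keeps the first <tuv> for each language, joins multiple <seg> texts with a space, counts <tu> elements without a matching pair, and stops after more than 1024 of them.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
MAX_SKIPPED = 1024  # limit on invalid <tu> elements stated in this section


def _segment_text(tuv):
    # Collect the text of every <seg> under this <tuv> and join with a space.
    # Assumption: markup nested inside a <seg> is discarded, text is kept.
    parts = ["".join(seg.itertext()).strip() for seg in tuv.iter("seg")]
    return " ".join(part for part in parts if part)


def extract_pairs(tmx_text, target_lang):
    """Return (pairs, skipped) for a TMX document (hypothetical local check)."""
    root = ET.fromstring(tmx_text)  # raises ET.ParseError on malformed XML
    if root.tag != "tmx":
        raise ValueError("root element is not <tmx>")
    header = root.find("header")
    if header is None or not header.get("srclang"):
        raise ValueError("<header> is missing the srclang attribute")

    source_lang = header.get("srclang").lower()
    target_lang = target_lang.lower()
    pairs, skipped = [], 0
    for tu in root.iter("tu"):
        first_by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            first_by_lang.setdefault(lang, tuv)  # later duplicates are ignored
        if source_lang in first_by_lang and target_lang in first_by_lang:
            pairs.append((
                _segment_text(first_by_lang[source_lang]),
                _segment_text(first_by_lang[target_lang]),
            ))
        else:
            skipped += 1
            if skipped > MAX_SKIPPED:
                raise ValueError(f"more than {MAX_SKIPPED} invalid <tu> elements")
    return pairs, skipped


sample = """<tmx version="1.4">
  <header segtype="sentence" o-tmf="UTF-8" adminlang="en" srclang="en" datatype="PlainText"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>It's a beautiful day.</seg></tuv>
      <tuv xml:lang="de"><seg>Es ist ein schöner Tag.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Tomorrow it will rain.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs, skipped = extract_pairs(sample, target_lang="de")
```

In this sample, the second <tu> has no German <tuv>, so the check returns one usable pair and reports one skipped element.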

The minimum required and maximum allowed numbers of segment pairs differ for each feature. For more information, see the AutoML Translation data preparation or adaptive translation data requirements.

What's next