Media Translation basics

This document is a guide to the basics of using Media Translation. This conceptual guide covers the types of requests you can make to Media Translation, how to construct those requests, and how to handle their responses. We recommend that all users of Media Translation read this guide and one of the associated tutorials before diving into the API itself.

Speech translation requests

Media Translation currently offers only one method for performing speech translation:

  • Streaming Translation (gRPC only) performs translation on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time translation purposes, such as capturing live audio from a microphone. Streaming translation provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking. Streaming translation requests are limited to audio data of 5 minutes or less in duration.

Requests contain either configuration parameters or audio data. The following sections describe this type of speech translation request, the responses it generates, and how to handle those responses in more detail.

Streaming speech translation requests

A streaming Media Translation API request can contain either a speech translation configuration or audio data. A sample configuration request is shown below:

    "audio_config": {
        "audio_encoding": "linear16",
        "sample_rate_hertz": 16000,
        "source_language_code": "en-US",
        "target_language_code": "zh",
        "model" : "google-provided-model/video",
    "single_utterance" : False

A sample audio data request is shown below:

    "audio_content " : "\366c\256\375jQ\r\312\205j\271\243%/u\216z\330\354\221\360\253KJ\005\"

The first StreamingTranslateSpeechRequest must contain a configuration of type StreamingTranslateSpeechConfig without any accompanying audio. Subsequent StreamingTranslateSpeechRequests sent over the same stream will then consist of consecutive frames of raw audio bytes.
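The request ordering above can be sketched as a simple generator. This is a minimal illustration using plain dicts to stand in for the StreamingTranslateSpeechRequest proto messages (the wrapping field name streaming_config for the first request is an assumption here; the actual client library types are not shown):

```python
def request_stream(config, audio_chunks):
    """Yield the config-only first request, then one request per audio chunk."""
    # First request: configuration only, no accompanying audio.
    yield {"streaming_config": config}
    # Subsequent requests: consecutive frames of raw audio bytes.
    for chunk in audio_chunks:
        yield {"audio_content": chunk}

# Configuration mirroring the sample request above.
config = {
    "audio_config": {
        "audio_encoding": "linear16",
        "sample_rate_hertz": 16000,
        "source_language_code": "en-US",
        "target_language_code": "zh",
        "model": "google-provided-model/video",
    },
    "single_utterance": False,
}

requests = list(request_stream(config, [b"\x00\x01", b"\x02\x03"]))
```

In a real client, the generator would be passed to the streaming RPC, which consumes requests lazily as audio is captured.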

A StreamingTranslateSpeechConfig contains the following fields:

  • audio_config - (required) contains configuration information for the audio, of type TranslateSpeechConfig.
  • single_utterance - (optional, defaults to false) indicates whether this request should automatically end after speech is no longer detected. If set, Media Translation will detect pauses, silence, or non-speech audio to determine when to end translation. If not set, the stream will continue to listen and process audio until either the stream is closed directly, or the stream's length limit has been exceeded. Setting single_utterance to true is useful for processing voice commands.

A TranslateSpeechConfig contains the following sub-fields:

  • audio_encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). If you have a choice in codec, prefer a lossless encoding such as FLAC or LINEAR16 for best performance. (For more information, see Audio Encodings.)
  • sample_rate_hertz - (required) specifies the sample rate (in Hertz) of the supplied audio. (For more information on sample rates, see Sample Rates below.)
  • source_language_code - (required) contains the language + region/locale to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier. Note that language codes typically consist of primary language tags and secondary region subtags to indicate dialects (for example, 'en' for English and 'US' for the United States in the above example.) (For a list of supported languages, see Supported Languages.)
  • target_language_code - (required) contains the language to use for text translation of the supplied audio. The language code must be a BCP-47 identifier. Note that language codes typically consist of primary language tags only, since translated text does not consider dialects. However, zh-CN and zh-TW will produce different translated text. (For a list of supported languages, see Supported Languages.)

Audio is supplied to Media Translation through the audio_content field of type StreamingTranslateSpeechRequest. The audio_content field contains the audio to evaluate, embedded within the request. See Embedding Audio Content below for more information.

Streaming speech translation response

Streaming speech translation results are returned within a series of responses of type StreamingTranslateSpeechResponse. Such a response consists of the following fields:

  • speech_event_type contains events of type SpeechEventType. The value of these events indicates when a single utterance has been determined to have been completed. The speech events serve as markers within your stream's response. When the client receives END_OF_SINGLE_UTTERANCE, it should stop sending requests while waiting to receive the remaining translation responses.
  • results contains the list of results, which may be either interim or final results, of type StreamingTranslateSpeechResult. The results list contains the following sub-fields:
    • translation contains translation text.
    • is_final indicates whether the results obtained within this list entry are interim or are final.
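A minimal sketch of consuming these responses, using plain dicts in place of StreamingTranslateSpeechResponse messages (field names follow the descriptions above; a real client iterates over the gRPC response stream instead). Interim results are superseded as they arrive; only final results are kept, and reading stops once the service signals the end of the utterance:

```python
def collect_final_translations(responses):
    """Return the final translation texts from a stream of responses."""
    finals = []
    for response in responses:
        for result in response.get("results", []):
            # Interim results (is_final == False) are ignored here;
            # a UI would typically display and overwrite them.
            if result.get("is_final"):
                finals.append(result["translation"])
        # Stop reading once the service marks the utterance as complete.
        if response.get("speech_event_type") == "END_OF_SINGLE_UTTERANCE":
            break
    return finals

# Simulated response stream: one interim result, one final result,
# then the end-of-utterance event.
simulated = [
    {"results": [{"translation": "你", "is_final": False}]},
    {"results": [{"translation": "你好", "is_final": True}]},
    {"speech_event_type": "END_OF_SINGLE_UTTERANCE"},
]
finals = collect_final_translations(simulated)
```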

More information on these request and response parameters appears below.

Sample rates

You specify the sample rate of your audio in the sample_rate_hertz field of the configuration request, and it must match the sample rate of the associated audio content or stream. Sample rates between 8000 Hz and 48000 Hz are supported within Media Translation.

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Lower values may impair speech recognition accuracy and, as a consequence, reduce translation quality. Higher sample rates have no appreciable effect on speech recognition quality and may increase latency.

However, if your audio data has already been recorded at an existing sample rate other than 16000 Hz, do not resample your audio to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide the audio to the Media Translation API at its native sample rate.
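As a sketch of the guidance above, the sample rate of a WAV file can be checked against the supported range before sending it, using only the standard library (the range check mirrors the 8000-48000 Hz limits stated above):

```python
import io
import wave

def check_sample_rate(wav_bytes):
    """Return the WAV sample rate, raising if it is outside 8000-48000 Hz."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
    if not 8000 <= rate <= 48000:
        raise ValueError(f"Unsupported sample rate: {rate} Hz")
    return rate

# Build a tiny 16 kHz mono WAV in memory to exercise the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 16-bit samples, matching LINEAR16
    wav.setframerate(16000)    # recommended capture rate
    wav.writeframes(b"\x00\x00")
rate = check_sample_rate(buf.getvalue())
```

Note that the value passed in sample_rate_hertz must still match the actual rate of the audio; this check only guards against unsupported files.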


Languages

Media Translation's recognition/translation engine supports a variety of language/dialect pairs. You specify the language (and national or regional dialect) of your audio within the request configuration's source_language_code and target_language_code fields, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Language Support page.

Selecting models

Media Translation can use one of several machine learning models to translate your audio file. Google has trained these models for specific audio types and sources.

When you send an audio translation request to Media Translation, you can improve the results that you receive by specifying the model. This allows the Media Translation API to process your audio files using a machine learning model trained to recognize speech audio from that particular type of source.

To specify a model for speech translation, include the model field in the TranslateSpeechConfig object for your request, specifying the model that you want to use.

Media Translation can use the following types of machine learning models for translating your audio files.

  • Video - google-provided-model/video
    Use this model for transcribing audio from video clips or audio that includes multiple speakers. For best results, provide audio recorded at a 16,000 Hz or greater sampling rate.
    Note: This is a premium model that costs more than the standard rate.
  • Phone call - google-provided-model/phone-call or google-provided-model/enhanced-phone-call
    Use this model for transcribing audio from a phone call. Typically, phone audio is recorded at an 8,000 Hz sampling rate.
    Note: The enhanced phone model is a premium model that costs more than the standard rate.
  • Default - google-provided-model/default
    Use this model if your audio does not fit one of the previously described models. For example, you can use it for long-form audio recordings that feature only a single speaker. Ideally, the audio is high-fidelity, recorded at a 16,000 Hz or greater sampling rate.
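As an illustrative helper, the model strings listed above can be kept in a small lookup that falls back to the default model; the dictionary keys here are arbitrary labels chosen for this sketch, not API values:

```python
# Model name strings from the table above; keys are arbitrary labels.
MODELS = {
    "video": "google-provided-model/video",
    "phone_call": "google-provided-model/phone-call",
    "enhanced_phone_call": "google-provided-model/enhanced-phone-call",
    "default": "google-provided-model/default",
}

def model_for(source_type):
    """Return the model string for a source type, defaulting when unknown."""
    return MODELS.get(source_type, MODELS["default"])
```

The returned string is what you would place in the model field of TranslateSpeechConfig.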

Embedding audio content

Embedded audio is included in the streaming speech translation request by passing an audio_content field within the streaming request. Audio provided as content within a gRPC request must be compatible with proto3 serialization and provided as binary data.

When constructing a request using a Google Cloud client library, you will generally write this binary data directly into the audio_content field.
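A minimal sketch of producing that binary data: read the audio source as raw bytes and split it into successive chunks, each of which becomes the audio_content of one streaming request (the 4096-byte chunk size here is an arbitrary choice for illustration):

```python
import io

def audio_chunks(fileobj, chunk_size=4096):
    """Yield successive raw-byte chunks suitable for the audio_content field."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

# Stand-in for a file opened with open(path, "rb"): 10,240 bytes of data.
data = bytes(range(256)) * 40
chunks = list(audio_chunks(io.BytesIO(data)))
```

With a real microphone source, each chunk would be sent as it is captured rather than read from a file.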

To see code samples, see Translating streaming audio.