Best practices

This document contains recommendations for providing speech data to the Media Translation API. Following these guidelines improves efficiency and accuracy and helps ensure reasonable response times from the service. The Media Translation API works best when the data sent to it falls within the parameters described in this document.

For optimal results, follow these guidelines:
Capture audio at a sampling rate of 16,000 Hz or higher when possible. Otherwise, set sample_rate_hertz to match the native sample rate of the audio source instead of re-sampling. Lower sampling rates may reduce recognition accuracy, which in turn reduces translation accuracy. Even so, avoid re-sampling: in telephony, for example, the native rate is commonly 8000 Hz, and audio should be sent to the service at that rate.
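One way to honor the "match the native rate" rule is to read the rate from the source itself rather than hard-coding it. The sketch below (a local illustration using only the Python standard library; the function name is hypothetical) builds a short 8000 Hz telephony-style WAV clip in memory and reads the native rate back from its header, which is the value to pass as sample_rate_hertz:

```python
import io
import wave

def native_sample_rate(wav_bytes: bytes) -> int:
    """Read the native sample rate from a WAV header so it can be
    passed as sample_rate_hertz instead of re-sampling the audio."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()

# Build a short silent 8000 Hz telephony-style clip in memory for illustration.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)      # 16-bit samples, i.e. LINEAR16
    wav.setframerate(8000)   # common native rate for telephony audio
    wav.writeframes(b"\x00\x00" * 800)  # 100 ms of silence

rate = native_sample_rate(buf.getvalue())
# Send sample_rate_hertz=rate (here, the native 8000) rather than
# re-sampling the audio to 16000.
```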
Use a lossless codec, such as FLAC or LINEAR16, to record and transmit audio. Using mu-law or another lossy codec during recording or transmission may reduce recognition accuracy. If your audio is already in an encoding that the API does not support, transcode it to lossless FLAC or LINEAR16. If your application must use a lossy codec to conserve bandwidth, we recommend the AMR_WB or OGG_OPUS codecs, in that order of preference.
Use the LINEAR16 codec for the best streaming response latency. Other codecs may add latency because they require an extra decoding step. For the same codec, a higher sample rate may also increase latency.
Position the microphone as close to the speaker as possible, particularly when background noise is present. The recognizer service is designed to ignore background voices and noise without additional noise-canceling. However, excessive background noise and echoes may reduce accuracy, especially if a lossy codec is also used.
Use an enhanced model for better results with noisy background audio. A non-enhanced model may not perform well on noisy audio or audio with echo.
Specify source_language_code using a "language-region" code, and specify target_language_code using a language code without a region (except for zh-CN and zh-TW). If the region is not specified in source_language_code, the service chooses a default region, which may not match the actual speech region and can reduce accuracy. target_language_code does not need a region because the translation output is text; however, zh-CN and zh-TW produce different written forms.
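The language-code rules above can be enforced with a small client-side check before any request is sent. The following is a minimal sketch (the function name and error messages are hypothetical, not part of the API):

```python
# Regions that must be kept in target_language_code because the written
# forms differ (Simplified vs. Traditional Chinese).
REGIONAL_TARGETS = {"zh-CN", "zh-TW"}

def check_language_codes(source: str, target: str) -> None:
    """Reject configurations this document advises against."""
    if "-" not in source:
        raise ValueError(
            f"source_language_code {source!r} should include a region, "
            "e.g. 'en-US'; otherwise a default region is chosen."
        )
    if "-" in target and target not in REGIONAL_TARGETS:
        raise ValueError(
            f"target_language_code {target!r} should omit the region "
            "(zh-CN and zh-TW are the exceptions)."
        )

check_language_codes("en-US", "fr")     # valid: regional source, plain target
check_language_codes("en-US", "zh-TW")  # valid: zh-TW keeps its region
```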

Single utterance

For short queries or commands, use StreamingTranslateSpeechConfig with single_utterance set to true. This optimizes recognition for short utterances and minimizes latency, and the service stops translation automatically after a long silence or pause. In single_utterance mode, the service returns END_OF_SINGLE_UTTERANCE as the speech_event_type in a response. The client should stop sending requests when it receives the END_OF_SINGLE_UTTERANCE response, but continue receiving the remaining responses.
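The stop-sending-but-keep-receiving pattern can be sketched without calling the service. Below, a plain Python generator stands in for the streaming response iterator (the real service yields StreamingTranslateSpeechResponse messages; the dict fields here only mimic its shape for illustration):

```python
END_OF_SINGLE_UTTERANCE = "END_OF_SINGLE_UTTERANCE"

def fake_responses():
    """Stand-in for the streaming response iterator. The real service
    yields StreamingTranslateSpeechResponse messages; these dicts only
    mimic the relevant fields."""
    yield {"speech_event_type": None, "translation": "Bonjour"}
    yield {"speech_event_type": END_OF_SINGLE_UTTERANCE, "translation": ""}
    # Final results can still arrive after the event.
    yield {"speech_event_type": None, "translation": "Bonjour le monde"}

sending = True
translations = []
for response in fake_responses():
    if response["speech_event_type"] == END_OF_SINGLE_UTTERANCE:
        sending = False   # stop sending further audio requests here...
        continue          # ...but keep reading the remaining responses
    translations.append(response["translation"])
```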

Frame size

Streaming recognition recognizes live audio as it is captured from a microphone or other audio source. The audio stream is split into frames and sent in consecutive StreamingTranslateSpeechRequest messages. Any frame size is acceptable. Larger frames are more efficient, but add latency. A 100-millisecond frame size is recommended as a good tradeoff between latency and efficiency.
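The 100 ms recommendation translates directly into a byte count per frame. Here is a minimal, self-contained sketch (the helper name is hypothetical) that splits raw LINEAR16 audio into fixed-duration frames, one per StreamingTranslateSpeechRequest:

```python
def audio_frames(pcm: bytes, sample_rate_hertz: int, frame_ms: int = 100,
                 bytes_per_sample: int = 2):
    """Split raw LINEAR16 mono audio into fixed-duration frames.

    Each yielded chunk would go into one StreamingTranslateSpeechRequest.
    """
    frame_bytes = sample_rate_hertz * bytes_per_sample * frame_ms // 1000
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]

# One second of silent 16 kHz 16-bit mono audio splits into ten 100 ms
# frames of 3200 bytes each.
pcm = b"\x00\x00" * 16000
frames = list(audio_frames(pcm, sample_rate_hertz=16000))
```

At 16,000 Hz with 2 bytes per sample, each 100 ms frame is 3,200 bytes; smaller frames lower latency but add per-request overhead.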

Audio pre-processing

It's best to provide audio that is as clean as possible by using a good quality and well-positioned microphone. However, applying noise-reduction signal processing to the audio before sending it to the service typically reduces recognition accuracy. The recognition service is designed to handle noisy audio.

For best results:

  • Position the microphone as close as possible to the person who is speaking, particularly when background noise is present.
  • Avoid audio clipping.
  • Do not use automatic gain control (AGC).
  • Disable all noise-reduction processing.
  • Listen to some sample audio. It should sound clear, without distortion or unexpected noise.

Request configuration

Make sure that you accurately describe the audio data sent with your request to the Media Translation API. Ensuring that the TranslateSpeechConfig for your request specifies the correct sample_rate_hertz, audio_encoding, source_language_code, and target_language_code results in the most accurate transcription and billing for your request.
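A simple guard against an incompletely described request is to verify that every required field is present before sending. The sketch below is a local illustration: the field names mirror this document's description of TranslateSpeechConfig, but the helper itself is hypothetical and not part of any client library:

```python
# Fields this document says must accurately describe the audio.
REQUIRED_FIELDS = ("audio_encoding", "sample_rate_hertz",
                   "source_language_code", "target_language_code")

def missing_fields(config: dict) -> list:
    """Return the names of any required config fields not yet set."""
    return [field for field in REQUIRED_FIELDS if field not in config]

config = {
    "audio_encoding": "linear16",
    "sample_rate_hertz": 16000,
    "source_language_code": "en-US",
    "target_language_code": "fr",
}
issues = missing_fields(config)  # empty list: config fully described
```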