Package google.cloud.mediatranslation.v1beta1

Index

  • SpeechTranslationService (interface)
  • StreamingTranslateSpeechConfig (message)
  • StreamingTranslateSpeechRequest (message)
  • StreamingTranslateSpeechResponse (message)
  • SpeechEventType (enum)
  • StreamingTranslateSpeechResult (message)
  • TextTranslationResult (message)
  • TranslateSpeechConfig (message)

SpeechTranslationService

Provides translation from/to media types.

StreamingTranslateSpeech

rpc StreamingTranslateSpeech(StreamingTranslateSpeechRequest) returns (StreamingTranslateSpeechResponse)

Performs bidirectional streaming speech translation: receive results while sending audio. This method is only available via the gRPC API (not REST).
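
A minimal end-to-end sketch using the Python client library (google-cloud-media-translation); the file name, language codes, and chunk size below are illustrative assumptions, not part of this reference:

    # Minimal sketch, assuming the Python client library
    # (pip install google-cloud-media-translation). The file name,
    # language codes, and chunk size are placeholders.
    from google.cloud import mediatranslation_v1beta1 as mediatranslation

    client = mediatranslation.SpeechTranslationServiceClient()

    config = mediatranslation.StreamingTranslateSpeechConfig(
        audio_config=mediatranslation.TranslateSpeechConfig(
            audio_encoding="linear16",
            source_language_code="en-US",
            target_language_code="fr-FR",
        )
    )

    def requests():
        # The first message must carry only the streaming config.
        yield mediatranslation.StreamingTranslateSpeechRequest(streaming_config=config)
        # Every subsequent message carries only raw audio bytes.
        with open("audio.raw", "rb") as audio:
            while chunk := audio.read(4096):
                yield mediatranslation.StreamingTranslateSpeechRequest(audio_content=chunk)

    # Results arrive while audio is still being sent.
    for response in client.streaming_translate_speech(requests=requests()):
        result = response.result.text_translation_result
        print(result.is_final, result.translation)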

Authorization Scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

StreamingTranslateSpeechConfig

Config used for streaming translation.

Fields
audio_config

TranslateSpeechConfig

Required. The common configuration that applies to all the audio content that follows in the stream.

single_utterance

bool

Optional. If false or omitted, the system performs continuous translation (continuing to wait for and process audio even if the user pauses speaking) until the client closes the input stream (gRPC API) or until the maximum time limit has been reached. In this mode, the service may return multiple StreamingTranslateSpeechResult messages with the is_final flag set to true.

If true, the speech translator detects a single spoken utterance. When it detects that the user has paused or stopped speaking, it returns an END_OF_SINGLE_UTTERANCE event and ceases translation. When the client receives the END_OF_SINGLE_UTTERANCE event, it should stop sending requests but keep receiving the remaining responses until the stream is terminated. To construct the complete sentence in a streaming way, the client should overwrite its pending text if the previous response's is_final was false, or append to it if the previous response's is_final was true.
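
A sketch of the client-side handling described above, assuming the Python client library (google-cloud-media-translation); stop_sending is a hypothetical flag that the request generator is assumed to check before yielding further audio:

    # Sketch, assuming the Python client library; `stop_sending` is a
    # hypothetical flag checked by the request generator elsewhere.
    import threading

    from google.cloud import mediatranslation_v1beta1 as mediatranslation

    stop_sending = threading.Event()

    def drain(responses):
        EventType = mediatranslation.StreamingTranslateSpeechResponse.SpeechEventType
        for response in responses:
            if response.speech_event_type == EventType.END_OF_SINGLE_UTTERANCE:
                # Stop sending requests, but keep iterating: remaining
                # responses are still delivered until the stream terminates.
                stop_sending.set()
            print(response.result.text_translation_result.translation)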

stability

string

Optional. Stability control for the translated text. Stability and speed are a trade-off. Valid values are "LOW", "MEDIUM", and "HIGH"; the default empty string is treated as "LOW".

  • "LOW": The translation service starts translating as soon as it receives a recognition response. This is the fastest mode, but the interim translation is more likely to change later.

  • "MEDIUM": The translation service checks whether each recognition response is stable and translates only responses that are unlikely to change later.

  • "HIGH": The translation service waits for more stable recognition responses before translating, and subsequent recognition responses cannot modify previous ones. This may impact quality in some situations. "HIGH" stability generates "final" responses more frequently.
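
For example, a sketch of requesting medium stability on the streaming config, assuming the Python client library:

    from google.cloud import mediatranslation_v1beta1 as mediatranslation

    # "MEDIUM" trades some speed for interim translations that are
    # less likely to change; an empty string would be treated as "LOW".
    config = mediatranslation.StreamingTranslateSpeechConfig(
        audio_config=mediatranslation.TranslateSpeechConfig(
            audio_encoding="linear16",
            source_language_code="en-US",
            target_language_code="es-ES",
        ),
        stability="MEDIUM",
    )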

StreamingTranslateSpeechRequest

The top-level message sent by the client for the StreamingTranslateSpeech method. Multiple StreamingTranslateSpeechRequest messages are sent. The first message must contain a streaming_config message and must not contain audio_content data. All subsequent messages must contain audio_content data and must not contain a streaming_config message.

Fields
Union field streaming_request. The streaming request, which is either a streaming config or audio content. streaming_request can be only one of the following:
streaming_config

StreamingTranslateSpeechConfig

Provides information to the recognizer that specifies how to process the request. The first StreamingTranslateSpeechRequest message must contain a streaming_config message.

audio_content

bytes

The audio data to be translated. Sequential chunks of audio data are sent in sequential StreamingTranslateSpeechRequest messages. The first StreamingTranslateSpeechRequest message must not contain audio_content data, and all subsequent StreamingTranslateSpeechRequest messages must contain audio_content data. The audio bytes must be encoded as specified in StreamingTranslateSpeechConfig. Note: as with all bytes fields, protocol buffers use a pure binary representation (not base64).

StreamingTranslateSpeechResponse

A streaming speech translation response corresponding to a portion of the audio currently processed.

Fields
error

Status

Output only. If set, returns a google.rpc.Status message that specifies the error for the operation.

result

StreamingTranslateSpeechResult

Output only. The translation result that is currently being processed (is_final could be true or false).

speech_event_type

SpeechEventType

Output only. Indicates the type of speech event.

SpeechEventType

Indicates the type of speech event.

Enums
SPEECH_EVENT_TYPE_UNSPECIFIED No speech event specified.
END_OF_SINGLE_UTTERANCE This event indicates that the server has detected the end of the user's speech utterance and expects no additional speech. Therefore, the server will not process additional audio (although it may subsequently return additional results). When the client receives the END_OF_SINGLE_UTTERANCE event, it should stop sending requests but keep receiving the remaining responses until the stream is terminated. To construct the complete sentence in a streaming way, the client should overwrite its pending text if the previous response's is_final was false, or append to it if the previous response's is_final was true. This event is sent only if single_utterance was set to true, and is not used otherwise.

StreamingTranslateSpeechResult

A streaming speech translation result corresponding to a portion of the audio that is currently being processed.

Fields
recognition_result

string

Output only. The recognition result in the original language, intended for debugging only. It is set to an empty string when not available. This field is an implementation detail and is not guaranteed to be backward compatible.

text_translation_result

TextTranslationResult

Text translation result.

TextTranslationResult

Text translation result.

Fields
translation

string

Output only. The translated sentence.

is_final

bool

Output only. If false, this StreamingTranslateSpeechResult represents an interim result that may change. If true, this is the final time the translation service will return this particular StreamingTranslateSpeechResult; the streaming translator will not return any further hypotheses for this portion of the transcript and its corresponding audio.
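
A sketch of assembling the complete translation from a response stream under this rule: overwrite the pending text on interim results, append after final ones. The responses argument is assumed to be the iterator returned by streaming_translate_speech:

    def assemble(responses):
        # Text finalized so far, plus the latest interim hypothesis.
        finalized, pending = "", ""
        for response in responses:
            result = response.result.text_translation_result
            if result.is_final:
                finalized += result.translation  # append: this part is fixed
                pending = ""
            else:
                pending = result.translation  # overwrite the previous interim
            print(finalized + pending)
        return finalized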

TranslateSpeechConfig

Provides information to the speech translation service that specifies how to process the request.

Fields
audio_encoding

string

Required. Encoding of the audio data. Supported formats (a configuration sketch follows this list):

  • linear16

Uncompressed 16-bit signed little-endian samples (Linear PCM).

  • flac

FLAC (Free Lossless Audio Codec) is the recommended encoding because it is lossless, so recognition is not compromised, and it requires only about half the bandwidth of linear16.

  • mulaw

8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law.

  • amr

Adaptive Multi-Rate Narrowband codec. sample_rate_hertz must be 8000.

  • amr-wb

Adaptive Multi-Rate Wideband codec. sample_rate_hertz must be 16000.

  • ogg-opus

Opus-encoded audio frames in an Ogg container. sample_rate_hertz must be one of 8000, 12000, 16000, 24000, or 48000.

  • mp3

MP3 audio. All standard MP3 bitrates (32-320 kbps) are supported. When using this encoding, sample_rate_hertz must match the sample rate of the file being used.
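
A configuration sketch for 16 kHz FLAC input, assuming the Python client library; the language codes are placeholders:

    from google.cloud import mediatranslation_v1beta1 as mediatranslation

    # FLAC is lossless and needs roughly half the bandwidth of linear16.
    audio_config = mediatranslation.TranslateSpeechConfig(
        audio_encoding="flac",
        sample_rate_hertz=16000,  # 16000 Hz is optimal for this API
        source_language_code="en-US",
        target_language_code="de-DE",
    )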

source_language_code

string

Required. Source language code (BCP-47) of the input audio.

target_language_code

string

Required. Target language code (BCP-47) of the output.

sample_rate_hertz

int32

Optional. Sample rate in Hertz of the audio data. Valid values are 8000-48000; 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling).

model

string

Optional. The model to use for speech translation. google-provided-model/video and google-provided-model/enhanced-phone-call are premium models; google-provided-model/phone-call is not a premium model.
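
A sketch of selecting a model, assuming the Python client library; whether the premium models are available depends on the project:

    from google.cloud import mediatranslation_v1beta1 as mediatranslation

    audio_config = mediatranslation.TranslateSpeechConfig(
        audio_encoding="linear16",
        sample_rate_hertz=16000,
        source_language_code="en-US",
        target_language_code="ja-JP",
        # Premium model tuned for video audio; the non-premium
        # alternative is "google-provided-model/phone-call".
        model="google-provided-model/video",
    )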