RecognitionConfig(mapping=None, *, ignore_unknown_fields=False, **kwargs)
Provides information to the recognizer that specifies how to process the request.
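As a point of reference, a minimal sketch of building and using this config with the Python client; it assumes the ``google-cloud-speech`` package and the ``v1p1beta1`` surface, and the bucket URI is hypothetical::

    from google.cloud import speech_v1p1beta1 as speech

    client = speech.SpeechClient()

    # The three most common fields: encoding, sample rate, and the
    # required language code.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.raw")

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)

The fields below refine this request; most of them are optional.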
encoding
    Encoding of audio data sent in all ``RecognitionAudio`` messages. This
    field is optional for ``FLAC`` and ``WAV`` audio files and required for
    all other audio formats. For details, see ``AudioEncoding`` below.

sample_rate_hertz
    Sample rate in Hertz of the audio data sent in all ``RecognitionAudio``
    messages. Valid values are: 8000-48000. 16000 is optimal. For best
    results, set the sampling rate of the audio source to 16000 Hz. If that
    is not possible, use the native sample rate of the audio source (instead
    of re-sampling). This field is optional for ``FLAC`` and ``WAV`` audio
    files, but is required for all other audio formats. For details, see
    ``AudioEncoding`` below.

audio_channel_count
    The number of channels in the input audio data. ONLY set this for
    MULTI-CHANNEL recognition. Valid values for ``LINEAR16``, ``OGG_OPUS``
    and ``FLAC`` are ``1``-``8``. Valid value for ``MULAW``, ``AMR``,
    ``AMR_WB`` and ``SPEEX_WITH_HEADER_BYTE`` is only ``1``. If ``0`` or
    omitted, defaults to one channel (mono). Note: we only recognize the
    first channel by default. To perform independent recognition on each
    channel, set ``enable_separate_recognition_per_channel`` to 'true'.

enable_separate_recognition_per_channel
    This needs to be set to 'true' explicitly, and ``audio_channel_count``
    must be greater than 1, to get each channel recognized separately. The
    recognition result will contain a ``channel_tag`` field to state which
    channel that result belongs to. If this is not true, we will only
    recognize the first channel. The request is billed cumulatively for all
    channels recognized: ``audio_channel_count`` multiplied by the length of
    the audio.
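A short sketch of the multi-channel combination just described, again assuming the ``speech_v1p1beta1`` surface (values are illustrative)::

    from google.cloud import speech_v1p1beta1 as speech

    # Recognize each channel of a 2-channel LINEAR16 file independently.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        audio_channel_count=2,
        enable_separate_recognition_per_channel=True,
    )
    # Each returned result then carries a channel_tag naming its channel:
    #   for result in response.results:
    #       print(result.channel_tag, result.alternatives[0].transcript)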
language_code
    Required. The language of the supplied audio as a
    `BCP-47 <https://www.rfc-editor.org/rfc/bcp/bcp47.txt>`__ language tag.
    Example: "en-US". See `Language Support
    <https://cloud.google.com/speech-to-text/docs/languages>`__ for a list
    of the currently supported language codes.

alternative_language_codes
    A list of up to 3 additional
    `BCP-47 <https://www.rfc-editor.org/rfc/bcp/bcp47.txt>`__ language tags,
    listing possible alternative languages of the supplied audio. See
    `Language Support
    <https://cloud.google.com/speech-to-text/docs/languages>`__ for a list
    of the currently supported language codes. If alternative languages are
    listed, the recognition result will contain recognition in the most
    likely language detected, including the main language_code. The
    recognition result will include the language tag of the language
    detected in the audio.
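Sketch of the multi-language detection described above (``speech_v1p1beta1`` surface assumed; the language list is illustrative)::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="en-US",
        alternative_language_codes=["es-ES", "fr-FR", "de-DE"],
    )
    # The language actually detected is reported per result:
    #   response.results[0].language_code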
max_alternatives
    Maximum number of recognition hypotheses to be returned. Specifically,
    the maximum number of ``SpeechRecognitionAlternative`` messages within
    each ``SpeechRecognitionResult``. The server may return fewer than
    ``max_alternatives``. Valid values are ``0``-``30``. A value of ``0`` or
    ``1`` will return a maximum of one. If omitted, will return a maximum of
    one.

profanity_filter
    If set to ``true``, the server will attempt to filter out profanities,
    replacing all but the initial character in each filtered word with
    asterisks, e.g. "f***". If set to ``false`` or omitted, profanities
    won't be filtered out.
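A combined sketch of the two fields above (values illustrative)::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        max_alternatives=3,     # up to 3 hypotheses per result
        profanity_filter=True,  # filtered words come back as e.g. "f***"
    )
    # Alternatives are ordered by accuracy within each result:
    #   for alt in response.results[0].alternatives:
    #       print(alt.confidence, alt.transcript)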
adaptation
    Speech adaptation configuration improves the accuracy of speech
    recognition. For more information, see the `speech adaptation
    <https://cloud.google.com/speech-to-text/docs/adaptation>`__
    documentation. When speech adaptation is set, it supersedes the
    ``speech_contexts`` field.

transcript_normalization
    Use transcription normalization to automatically replace parts of the
    transcript with phrases of your choosing. For StreamingRecognize, this
    normalization only applies to stable partial transcripts (stability >
    0.8) and final transcripts.
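A sketch of such a replacement rule, assuming the client exposes a ``TranscriptNormalization`` message with an ``Entry(search, replace, case_sensitive)`` shape as in the RPC reference; the search/replace pair is purely illustrative::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        transcript_normalization=speech.TranscriptNormalization(
            entries=[
                # Hypothetical rule: canonicalize a frequently misheard name.
                speech.TranscriptNormalization.Entry(
                    search="cloud speach",
                    replace="Cloud Speech",
                    case_sensitive=False,
                )
            ]
        ),
    )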
speech_contexts
    Array of ``SpeechContext``. A means to provide context to assist the
    speech recognition. For more information, see `speech adaptation
    <https://cloud.google.com/speech-to-text/docs/adaptation>`__.
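Sketch: biasing recognition toward domain phrases with ``SpeechContext`` (the phrases are illustrative)::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        speech_contexts=[
            speech.SpeechContext(phrases=["weather", "forecast", "Sunnyvale"]),
        ],
    )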
enable_automatic_punctuation
    If 'true', adds punctuation to recognition result hypotheses. This
    feature is only available in select languages; setting this for requests
    in other languages has no effect. The default 'false' value does not add
    punctuation to result hypotheses.

enable_spoken_punctuation
    The spoken punctuation behavior for the call. If not set, uses default
    behavior based on model of choice, e.g. command_and_search will enable
    spoken punctuation by default. If 'true', replaces spoken punctuation
    with the corresponding symbols in the request. For example, "how are you
    question mark" becomes "how are you?". See
    https://cloud.google.com/speech-to-text/docs/spoken-punctuation for
    support. If 'false', spoken punctuation is not replaced.

enable_spoken_emojis
    The spoken emoji behavior for the call. If not set, uses default
    behavior based on model of choice. If 'true', adds spoken emoji
    formatting for the request. This will replace spoken emojis with the
    corresponding Unicode symbols in the final transcript. If 'false',
    spoken emojis are not replaced.
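A sketch combining the three punctuation-related flags above; the spoken_* fields are ``BoolValue`` wrappers in the RPC, and the assumption here is that the Python client accepts plain booleans for them::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_automatic_punctuation=True,
        enable_spoken_punctuation=True,  # "question mark" -> "?"
        enable_spoken_emojis=True,       # "smiley face" -> the emoji symbol
    )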
enable_speaker_diarization
    If 'true', enables speaker detection for each recognized word in the top
    alternative of the recognition result using a speaker_tag provided in
    the WordInfo. Note: Use ``diarization_config`` instead.

diarization_speaker_count
    If set, specifies the estimated number of speakers in the conversation.
    Defaults to '2'. Ignored unless enable_speaker_diarization is set to
    true. Note: Use ``diarization_config`` instead.

diarization_config
    Config to enable speaker diarization and set additional parameters to
    make diarization better suited for your application. Note: When this is
    enabled, we send all the words from the beginning of the audio for the
    top alternative in every consecutive STREAMING response. This is done in
    order to improve our speaker tags as our models learn to identify the
    speakers in the conversation over time. For non-streaming requests, the
    diarization results will be provided only in the top alternative of the
    FINAL SpeechRecognitionResult.
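Sketch: enabling diarization via ``diarization_config``, the preferred route per the notes above (speaker counts are illustrative)::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        diarization_config=speech.SpeakerDiarizationConfig(
            enable_speaker_diarization=True,
            min_speaker_count=2,
            max_speaker_count=4,
        ),
    )
    # Speaker labels land on the words of the top alternative of the
    # final result:
    #   for word in response.results[-1].alternatives[0].words:
    #       print(word.speaker_tag, word.word)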
metadata
    Metadata regarding this request.
model
    Which model to select for the given request. Select the model best
    suited to your domain to get best results. If a model is not explicitly
    specified, then we auto-select a model based on the parameters in the
    RecognitionConfig. Available models include ``command_and_search``,
    ``phone_call``, ``video``, and ``default``; see the `transcription
    models
    <https://cloud.google.com/speech-to-text/docs/transcription-model>`__
    documentation for the full list.

use_enhanced
    Set to true to use an enhanced model for speech recognition. If
    ``use_enhanced`` is set to true and the ``model`` field is not set, then
    an appropriate enhanced model is chosen if an enhanced model exists for
    the audio. If ``use_enhanced`` is true and an enhanced version of the
    specified model does not exist, then the speech is recognized using the
    standard version of the specified model.
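Sketch: requesting the enhanced phone-call model, relying on the fallback behavior described above (the model name is one of the documented options; the rate is typical for telephony audio)::

    from google.cloud import speech_v1p1beta1 as speech

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,  # typical telephony sample rate
        language_code="en-US",
        model="phone_call",
        use_enhanced=True,  # falls back to the standard phone_call model
                            # if no enhanced variant exists
    )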
AudioEncoding
    The encoding of the audio data sent in the request.

    All encodings support only 1 channel (mono) audio, unless the
    ``audio_channel_count`` and ``enable_separate_recognition_per_channel``
    fields are set.

    For best results, the audio source should be captured and transmitted
    using a lossless encoding (``FLAC`` or ``LINEAR16``). The accuracy of
    the speech recognition can be reduced if lossy codecs are used to
    capture or transmit audio, particularly if background noise is present.
    Lossy codecs include ``MULAW``, ``AMR``, ``AMR_WB``, ``OGG_OPUS``,
    ``SPEEX_WITH_HEADER_BYTE``, ``MP3``, and ``WEBM_OPUS``.

    The ``FLAC`` and ``WAV`` audio file formats include a header that
    describes the included audio content. You can request recognition for
    ``WAV`` files that contain either ``LINEAR16`` or ``MULAW`` encoded
    audio. If you send ``FLAC`` or ``WAV`` audio file format in your
    request, you do not need to specify an ``AudioEncoding``; the audio
    encoding format is determined from the file header. If you specify an
    ``AudioEncoding`` when you send ``FLAC`` or ``WAV`` audio, the encoding
    configuration must match the encoding described in the audio header;
    otherwise the request returns an ``INVALID_ARGUMENT`` error code.

    Values:
        ENCODING_UNSPECIFIED (0):
            Not specified.
        LINEAR16 (1):
            Uncompressed 16-bit signed little-endian samples (Linear PCM).
        FLAC (2):
            ``FLAC`` (Free Lossless Audio Codec) is the recommended encoding
            because it is lossless--therefore recognition is not
            compromised--and requires only about half the bandwidth of
            ``LINEAR16``. ``FLAC`` stream encoding supports 16-bit and
            24-bit samples, however, not all fields in ``STREAMINFO`` are
            supported.
        MULAW (3):
            8-bit samples that compand 14-bit audio samples using G.711
            PCMU/mu-law.
        AMR (4):
            Adaptive Multi-Rate Narrowband codec. ``sample_rate_hertz``
            must be 8000.
        AMR_WB (5):
            Adaptive Multi-Rate Wideband codec. ``sample_rate_hertz``
            must be 16000.
        OGG_OPUS (6):
            Opus encoded audio frames in Ogg container
            (`OggOpus <https://wiki.xiph.org/OggOpus>`__).
            ``sample_rate_hertz`` must be one of 8000, 12000, 16000, 24000,
            or 48000.
        SPEEX_WITH_HEADER_BYTE (7):
            Although the use of lossy encodings is not recommended, if a
            very low bitrate encoding is required, ``OGG_OPUS`` is highly
            preferred over Speex encoding. The
            `Speex <https://speex.org/>`__ encoding supported by Cloud
            Speech API has a header byte in each block, as in MIME type
            ``audio/x-speex-with-header-byte``. It is a variant of the RTP
            Speex encoding defined in `RFC 5574
            <https://tools.ietf.org/html/rfc5574>`__. The stream is a
            sequence of blocks, one block per RTP packet. Each block starts
            with a byte containing the length of the block, in bytes,
            followed by one or more frames of Speex data, padded to an
            integral number of bytes (octets) as specified in RFC 5574. In
            other words, each RTP header is replaced with a single byte
            containing the block length. Only Speex wideband is supported.
            ``sample_rate_hertz`` must be 16000.
        MP3 (8):
            MP3 audio. MP3 encoding is a Beta feature and only available in
            v1p1beta1. Supports all standard MP3 bitrates (which range from
            32-320 kbps). When using this encoding, ``sample_rate_hertz``
            has to match the sample rate of the file being used.
        WEBM_OPUS (9):
            Opus encoded audio frames in WebM container.
            ``sample_rate_hertz`` must be one of 8000, 12000, 16000, 24000,
            or 48000.
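To make the header-versus-config rule above concrete, a sketch contrasting a ``WAV`` request (encoding read from the file header) with raw PCM (both fields required); the audio payloads themselves are assumed, not shown::

    from google.cloud import speech_v1p1beta1 as speech

    # WAV/FLAC: the header carries encoding and sample rate, so neither
    # field needs to be set (and, if set, must match the header or the
    # request fails with INVALID_ARGUMENT).
    wav_config = speech.RecognitionConfig(language_code="en-US")

    # Headerless raw PCM: both encoding and sample rate are required.
    raw_config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )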