Method: speech.recognize

Performs synchronous speech recognition: results are returned only after all audio has been sent and processed.

HTTP request

POST https://speech.googleapis.com/v1p1beta1/speech:recognize

The URL uses Google API HTTP annotation syntax.

Request body

The request body contains data with the following structure:

JSON representation
{
  "config": {
    object(RecognitionConfig)
  },
  "audio": {
    object(RecognitionAudio)
  }
}
Fields
config

object(RecognitionConfig)

Required. Provides information to the recognizer that specifies how to process the request.

audio

object(RecognitionAudio)

Required. The audio data to be recognized.
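The request body above can be assembled as a plain JSON object. The sketch below builds one with inline audio content; the `encoding` and `sampleRateHertz` values are illustrative choices, not required settings, and the endpoint URL is the one shown above.

```python
import base64
import json

# Endpoint from the HTTP request section above.
RECOGNIZE_URL = "https://speech.googleapis.com/v1p1beta1/speech:recognize"

def build_recognize_request(audio_bytes, language_code="en-US"):
    """Assemble the JSON body for speech:recognize.

    RecognitionAudio accepts inline audio as a base64-encoded "content"
    field (a "uri" field referencing stored audio is the alternative).
    """
    return {
        "config": {
            "encoding": "LINEAR16",      # illustrative codec choice
            "sampleRateHertz": 16000,    # must match the actual audio
            "languageCode": language_code,
        },
        "audio": {
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }

body = build_recognize_request(b"\x00\x01fake-pcm-bytes")
print(json.dumps(body, indent=2))
```

The assembled `body` would then be POSTed to `RECOGNIZE_URL` with an authorized HTTP client.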

Response body

If successful, the response body contains data with the following structure:

The only message returned to the client by the speech.recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

JSON representation
{
  "results": [
    {
      object(SpeechRecognitionResult)
    }
  ]
}
Fields
results[]

object(SpeechRecognitionResult)

Output only. Sequential list of transcription results corresponding to sequential portions of audio.
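Because the results are sequential and each result's first alternative is the most probable, a full transcript can be recovered by concatenating top alternatives. A minimal sketch, using a fabricated response payload for illustration:

```python
def top_transcript(response):
    """Concatenate the most probable alternative of each sequential result."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:  # the first alternative is ranked most probable
            pieces.append(alternatives[0]["transcript"])
    return " ".join(pieces)

# Fabricated response, shaped like the JSON representation above.
sample = {
    "results": [
        {"alternatives": [{"transcript": "how old is", "confidence": 0.98}]},
        {"alternatives": [{"transcript": "the Brooklyn Bridge", "confidence": 0.95}]},
    ]
}
print(top_transcript(sample))  # → how old is the Brooklyn Bridge
```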

Authorization Scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

SpeechRecognitionResult

A speech recognition result corresponding to a portion of the audio.

JSON representation
{
  "alternatives": [
    {
      object(SpeechRecognitionAlternative)
    }
  ],
  "channelTag": number,
  "languageCode": string
}
Fields
alternatives[]

object(SpeechRecognitionAlternative)

Output only. May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

channelTag

number

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from '1' to 'N'.

languageCode

string

Output only. The BCP-47 language tag of the language in this result. This is the language that was detected as most likely to be spoken in the audio.
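For multi-channel audio, the channelTag field lets a client separate per-channel transcripts. The sketch below groups top transcripts by channel; the response payload is fabricated for illustration.

```python
from collections import defaultdict

def transcripts_by_channel(response):
    """Map each channelTag (1..audioChannelCount) to its top transcripts."""
    channels = defaultdict(list)
    for result in response.get("results", []):
        tag = result.get("channelTag", 1)
        if result.get("alternatives"):
            channels[tag].append(result["alternatives"][0]["transcript"])
    return dict(channels)

# Fabricated two-channel response.
sample = {
    "results": [
        {"channelTag": 1, "alternatives": [{"transcript": "hello"}]},
        {"channelTag": 2, "alternatives": [{"transcript": "hi there"}]},
        {"channelTag": 1, "alternatives": [{"transcript": "goodbye"}]},
    ]
}
print(transcripts_by_channel(sample))
# → {1: ['hello', 'goodbye'], 2: ['hi there']}
```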

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

JSON representation
{
  "transcript": string,
  "confidence": number,
  "words": [
    {
      object(WordInfo)
    }
  ]
}
Fields
transcript

string

Output only. Transcript text representing the words that the user spoke.

confidence

number

Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal=true. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

words[]

object(WordInfo)

Output only. A list of word-specific information for each recognized word. Note: When enableSpeakerDiarization is true, you will see all the words from the beginning of the audio.
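Since 0.0 is a sentinel rather than a genuine score, clients should distinguish "not set" from a low score. A small sketch, with fabricated alternatives for illustration:

```python
def describe_confidence(alternative):
    """Render an alternative's confidence, treating 0.0 as "not set"."""
    confidence = alternative.get("confidence", 0.0)
    if confidence == 0.0:  # sentinel: confidence was not populated
        return "unknown"
    return f"{confidence:.2f}"

alt = {"transcript": "how old is the Brooklyn Bridge", "confidence": 0.98}
print(describe_confidence(alt))                    # → 0.98
print(describe_confidence({"transcript": "hi"}))   # → unknown
```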

WordInfo

Word-specific information for recognized words.

JSON representation
{
  "startTime": string,
  "endTime": string,
  "word": string,
  "confidence": number,
  "speakerTag": number
}
Fields
startTime

string (Duration format)

Output only. Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, terminated by 's'. Example: "3.5s".

endTime

string (Duration format)

Output only. Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, terminated by 's'. Example: "3.5s".

word

string

Output only. The word corresponding to this set of information.

confidence

number

Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal=true. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

speakerTag

number

Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Value ranges from '1' to diarizationSpeakerCount. speakerTag is set if enableSpeakerDiarization = 'true' and only in the top alternative.
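Taken together, the WordInfo fields above support simple post-processing. The sketch below parses the Duration strings (e.g. "3.5s") into float seconds and collapses consecutive words sharing a speakerTag into speaker turns; the word list is fabricated for illustration.

```python
from itertools import groupby

def parse_duration(value):
    """Parse a Duration string such as "3.5s" into float seconds."""
    if not value.endswith("s"):
        raise ValueError(f"expected trailing 's' in duration: {value!r}")
    return float(value[:-1])

def speaker_turns(words):
    """Collapse consecutive words sharing a speakerTag into (tag, text) turns."""
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speakerTag"]):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns

# Fabricated WordInfo entries, shaped like the JSON representation above.
words = [
    {"startTime": "0s", "endTime": "0.300s", "word": "hello", "speakerTag": 1},
    {"startTime": "0.300s", "endTime": "0.600s", "word": "there", "speakerTag": 1},
    {"startTime": "0.900s", "endTime": "1.100s", "word": "hi", "speakerTag": 2},
]
print(speaker_turns(words))  # → [(1, 'hello there'), (2, 'hi')]
last = words[-1]
print(f"{parse_duration(last['endTime']) - parse_duration(last['startTime']):.3f}s")  # → 0.200s
```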
