- HTTP request
- Request body
- Response body
- Authorization Scopes
- SpeechRecognitionResult
- SpeechRecognitionAlternative
- WordInfo
Performs synchronous speech recognition: receive results after all audio has been sent and processed.
HTTP request
POST https://speech.googleapis.com/v1p1beta1/speech:recognize
The URL uses gRPC Transcoding syntax.
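For illustration only, a raw HTTP call to this endpoint from Python might look like the sketch below. The `requests` library, the `ACCESS_TOKEN` environment variable, and the `gs://my-bucket/audio.flac` URI are assumptions for the example; the body fields are defined under Request body.

```python
# Sketch only: a raw HTTP call to the recognize endpoint (not official
# client-library code). ACCESS_TOKEN is read from a hypothetical env var;
# see Authorization Scopes below for how to obtain one.
import os
import requests

ENDPOINT = "https://speech.googleapis.com/v1p1beta1/speech:recognize"

body = {
    "config": {"languageCode": "en-US"},            # RecognitionConfig
    "audio": {"uri": "gs://my-bucket/audio.flac"},  # hypothetical GCS URI
}

resp = requests.post(
    ENDPOINT,
    json=body,
    headers={"Authorization": f"Bearer {os.environ['ACCESS_TOKEN']}"},
)
resp.raise_for_status()
print(resp.json())  # a RecognizeResponse, described under Response body
```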
Request body
The request body contains data with the following structure:
JSON representation:

```
{
  "config": {
    object (RecognitionConfig)
  },
  "audio": {
    object (RecognitionAudio)
  }
}
```
| Field | Description |
|---|---|
| `config` | `object (RecognitionConfig)`. Required. Provides information to the recognizer that specifies how to process the request. |
| `audio` | `object (RecognitionAudio)`. Required. The audio data to be recognized. |
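The audio itself can be passed inline as base64-encoded bytes in `audio.content`, or by reference via `audio.uri`. Below is a sketch of assembling a body with inline content, assuming a hypothetical local WAV file; the `encoding`, `sampleRateHertz`, and `languageCode` fields belong to `RecognitionConfig`.

```python
# Sketch: build a recognize request body with inline (base64) audio content.
# The file path is hypothetical; config fields belong to RecognitionConfig.
import base64
import json

with open("speech.wav", "rb") as f:  # hypothetical local file
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

body = {
    "config": {
        "encoding": "LINEAR16",       # raw 16-bit PCM
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
    },
    "audio": {"content": audio_b64},  # inline bytes; alternatively {"uri": "gs://..."}
}

print(json.dumps(body)[:200])  # inspect the serialized request body
```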
Response body
If successful, the response body contains data with the following structure:
The only message returned to the client by the `speech.recognize` method. It contains the result as zero or more sequential `SpeechRecognitionResult` messages.
JSON representation:

```
{
  "results": [
    {
      object (SpeechRecognitionResult)
    }
  ]
}
```
| Field | Description |
|---|---|
| `results[]` | `object (SpeechRecognitionResult)`. Sequential list of transcription results corresponding to sequential portions of audio. |
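A sketch of walking the parsed response, using an inline stand-in for the JSON an actual call would return:

```python
# Sketch: walk a parsed RecognizeResponse dict.
response = {
    "results": [
        {"alternatives": [{"transcript": "hello world", "confidence": 0.92}]}
    ]
}  # stand-in for resp.json() from an actual call

for i, result in enumerate(response.get("results", [])):
    # Alternatives are ordered most-probable first; take the top one.
    top = result["alternatives"][0]
    print(f"result {i}: {top['transcript']!r} (confidence={top.get('confidence', 0.0)})")
```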
Authorization Scopes
Requires the following OAuth scope:
https://www.googleapis.com/auth/cloud-platform
For more information, see the Authentication Overview.
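One way to obtain a bearer token carrying this scope from Python is through Application Default Credentials with the `google-auth` package; a minimal sketch, assuming ADC is already configured in the environment:

```python
# Sketch: fetch an access token with the cloud-platform scope via
# Application Default Credentials (requires the google-auth package).
import google.auth
from google.auth.transport.requests import Request

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())  # populates credentials.token
print(credentials.token[:16], "...")  # use as the Authorization: Bearer token
```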
SpeechRecognitionResult
A speech recognition result corresponding to a portion of the audio.
JSON representation:

```
{
  "alternatives": [
    {
      object (SpeechRecognitionAlternative)
    }
  ],
  "channelTag": integer,
  "languageCode": string
}
```
| Field | Description |
|---|---|
| `alternatives[]` | `object (SpeechRecognitionAlternative)`. May contain one or more recognition hypotheses (up to the maximum specified in `maxAlternatives`). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
| `channelTag` | `integer`. For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For `audioChannelCount` = N, its output values can range from `1` to `N`. |
| `languageCode` | `string`. Output only. The BCP-47 language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio. |
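For multi-channel requests, results for all channels arrive interleaved in the same `results` list, so consumers typically regroup them by `channelTag`; a self-contained sketch:

```python
# Sketch: group SpeechRecognitionResult messages by channelTag (relevant when
# the request set audioChannelCount > 1 with separate per-channel recognition).
from collections import defaultdict

response = {  # stand-in for a parsed RecognizeResponse with two channels
    "results": [
        {"channelTag": 1, "alternatives": [{"transcript": "hello"}]},
        {"channelTag": 2, "alternatives": [{"transcript": "hi there"}]},
        {"channelTag": 1, "alternatives": [{"transcript": "how are you"}]},
    ]
}

by_channel = defaultdict(list)
for result in response["results"]:
    tag = result.get("channelTag", 1)  # channels are numbered 1..N
    by_channel[tag].append(result["alternatives"][0]["transcript"])

for tag, transcripts in sorted(by_channel.items()):
    print(f"channel {tag}: {' '.join(transcripts)}")
```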
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
JSON representation:

```
{
  "transcript": string,
  "confidence": number,
  "words": [
    {
      object (WordInfo)
    }
  ]
}
```
| Field | Description |
|---|---|
| `transcript` | `string`. Transcript text representing the words that the user spoke. |
| `confidence` | `number`. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where `isFinal` = `true`. This field is not guaranteed to be accurate, and users should not rely upon it to be always provided. The default of 0.0 is a sentinel value indicating that `confidence` was not set. |
| `words[]` | `object (WordInfo)`. A list of word-specific information for each recognized word. Note: when `enableSpeakerDiarization` is set to `true`, you will see all the words from the beginning of the audio. |
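Since 0.0 is a sentinel for "confidence not set", consumers should distinguish a missing estimate from a low one; a sketch:

```python
# Sketch: select the top alternative and interpret its confidence field.
def top_transcript(result: dict) -> tuple[str, float | None]:
    """Return (transcript, confidence) for the most probable alternative.

    A confidence of 0.0 is the sentinel for "not set", so report None.
    """
    alt = result["alternatives"][0]  # alternatives are ranked, best first
    conf = alt.get("confidence", 0.0)
    return alt["transcript"], (conf if conf > 0.0 else None)

text, conf = top_transcript(
    {"alternatives": [{"transcript": "hello world", "confidence": 0.92}]}
)
print(text, conf)
```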
WordInfo
Word-specific information for recognized words.
JSON representation:

```
{
  "startTime": string,
  "endTime": string,
  "word": string,
  "confidence": number,
  "speakerTag": integer
}
```
| Field | Description |
|---|---|
| `startTime` | `string (Duration format)`. Time offset relative to the beginning of the audio, corresponding to the start of the spoken word. This field is only set if `enableWordTimeOffsets` = `true` and only in the top hypothesis. This is an experimental feature, and the accuracy of the time offset can vary. A duration in seconds with up to nine fractional digits, terminated by `'s'`. Example: `"3.5s"`. |
| `endTime` | `string (Duration format)`. Time offset relative to the beginning of the audio, corresponding to the end of the spoken word. This field is only set if `enableWordTimeOffsets` = `true` and only in the top hypothesis. This is an experimental feature, and the accuracy of the time offset can vary. A duration in seconds with up to nine fractional digits, terminated by `'s'`. Example: `"3.5s"`. |
| `word` | `string`. The word corresponding to this set of information. |
| `confidence` | `number`. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where `isFinal` = `true`. This field is not guaranteed to be accurate, and users should not rely upon it to be always provided. The default of 0.0 is a sentinel value indicating that `confidence` was not set. |
| `speakerTag` | `integer`. Output only. A distinct integer value is assigned to every speaker within the audio. This field specifies which of those speakers was detected to have spoken this word. Values range from `1` to `diarizationSpeakerCount`. `speakerTag` is set only if `enableSpeakerDiarization` = `true` and only in the top alternative. |
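`startTime` and `endTime` arrive as duration strings such as `"3.5s"`, so consumers typically convert them to numeric seconds; a sketch that also groups words by `speakerTag` (the `words` list is a hypothetical stand-in):

```python
# Sketch: parse WordInfo timing strings ("3.5s") and group words by speaker.
from collections import defaultdict

def seconds(duration: str) -> float:
    """Convert a Duration JSON string like '3.5s' to float seconds."""
    return float(duration.rstrip("s"))

words = [  # hypothetical WordInfo messages from a top alternative
    {"startTime": "0s", "endTime": "0.4s", "word": "hello", "speakerTag": 1},
    {"startTime": "0.4s", "endTime": "1.1s", "word": "there", "speakerTag": 2},
]

by_speaker = defaultdict(list)
for w in words:
    by_speaker[w.get("speakerTag", 0)].append(w["word"])
    print(f'{seconds(w["startTime"]):5.2f}-{seconds(w["endTime"]):5.2f}s  {w["word"]}')

for tag, spoken in sorted(by_speaker.items()):
    print(f"speaker {tag}: {' '.join(spoken)}")
```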