Method: speech.recognize

Performs synchronous speech recognition: results are returned only after all audio has been sent and processed.

HTTP request

POST https://speech.googleapis.com/v1p1beta1/speech:recognize

The URL uses Google API HTTP annotation syntax.

Request body

The request body contains data with the following structure:

JSON representation
{
  "config": {
    object(RecognitionConfig)
  },
  "audio": {
    object(RecognitionAudio)
  }
}
Fields
config

object(RecognitionConfig)

Required. Provides information to the recognizer that specifies how to process the request.

audio

object(RecognitionAudio)

Required. The audio data to be recognized.
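The request body above can be assembled as a plain JSON object. The sketch below builds one with inline audio content; the `encoding` and `sampleRateHertz` values are illustrative choices, not required settings, and the endpoint URL is the one shown above.

```python
import base64
import json

# Endpoint from the HTTP request section above.
RECOGNIZE_URL = "https://speech.googleapis.com/v1p1beta1/speech:recognize"

def build_recognize_request(audio_bytes, language_code="en-US"):
    """Assemble the JSON body for speech:recognize.

    RecognitionAudio accepts inline audio as a base64-encoded "content"
    field (a "uri" field referencing stored audio is the alternative).
    """
    return {
        "config": {
            "encoding": "LINEAR16",      # illustrative codec choice
            "sampleRateHertz": 16000,    # must match the actual audio
            "languageCode": language_code,
        },
        "audio": {
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }

body = build_recognize_request(b"\x00\x01fake-pcm-bytes")
print(json.dumps(body, indent=2))
```

The assembled `body` would then be POSTed to `RECOGNIZE_URL` with an authorized HTTP client.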

Response body

If successful, the response body contains data with the following structure:

The only message returned to the client by the speech.recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.

JSON representation
{
  "results": [
    {
      object(SpeechRecognitionResult)
    }
  ]
}
Fields
results[]

object(SpeechRecognitionResult)

Output only. Sequential list of transcription results corresponding to sequential portions of audio.
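Because the results are sequential and each result's first alternative is the most probable, a full transcript can be recovered by concatenating top alternatives. A minimal sketch, using a fabricated response payload for illustration:

```python
def top_transcript(response):
    """Concatenate the most probable alternative of each sequential result."""
    pieces = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:  # the first alternative is ranked most probable
            pieces.append(alternatives[0]["transcript"])
    return " ".join(pieces)

# Fabricated response, shaped like the JSON representation above.
sample = {
    "results": [
        {"alternatives": [{"transcript": "how old is", "confidence": 0.98}]},
        {"alternatives": [{"transcript": "the Brooklyn Bridge", "confidence": 0.95}]},
    ]
}
print(top_transcript(sample))  # → how old is the Brooklyn Bridge
```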

Authorization Scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

SpeechRecognitionResult

A speech recognition result corresponding to a portion of the audio.

JSON representation
{
  "alternatives": [
    {
      object(SpeechRecognitionAlternative)
    }
  ],
  "channelTag": number,
  "languageCode": string
}
Fields
alternatives[]

object(SpeechRecognitionAlternative)

Output only. May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

channelTag

number

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from '1' to 'N'.

languageCode

string

Output only. The BCP-47 language tag of the language in this result. This is the language that was detected as most likely to be spoken in the audio.
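For multi-channel audio, the channelTag field lets a client separate per-channel transcripts. The sketch below groups top transcripts by channel; the response payload is fabricated for illustration.

```python
from collections import defaultdict

def transcripts_by_channel(response):
    """Map each channelTag (1..audioChannelCount) to its top transcripts."""
    channels = defaultdict(list)
    for result in response.get("results", []):
        tag = result.get("channelTag", 1)
        if result.get("alternatives"):
            channels[tag].append(result["alternatives"][0]["transcript"])
    return dict(channels)

# Fabricated two-channel response.
sample = {
    "results": [
        {"channelTag": 1, "alternatives": [{"transcript": "hello"}]},
        {"channelTag": 2, "alternatives": [{"transcript": "hi there"}]},
        {"channelTag": 1, "alternatives": [{"transcript": "goodbye"}]},
    ]
}
print(transcripts_by_channel(sample))
# → {1: ['hello', 'goodbye'], 2: ['hi there']}
```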

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

JSON representation
{
  "transcript": string,
  "confidence": number,
  "words": [
    {
      object(WordInfo)
    }
  ]
}
Fields
transcript

string

Output only. Transcript text representing the words that the user spoke.

confidence

number

Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal=true. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

words[]

object(WordInfo)

Output only. A list of word-specific information for each recognized word. Note: When enableSpeakerDiarization is true, you will see all the words from the beginning of the audio.
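Since 0.0 is a sentinel rather than a genuine score, clients should distinguish "not set" from a low score. A small sketch, with fabricated alternatives for illustration:

```python
def describe_confidence(alternative):
    """Render an alternative's confidence, treating 0.0 as "not set"."""
    confidence = alternative.get("confidence", 0.0)
    if confidence == 0.0:  # sentinel: confidence was not populated
        return "unknown"
    return f"{confidence:.2f}"

alt = {"transcript": "how old is the Brooklyn Bridge", "confidence": 0.98}
print(describe_confidence(alt))                    # → 0.98
print(describe_confidence({"transcript": "hi"}))   # → unknown
```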

WordInfo

Word-specific information for recognized words.

JSON representation
{
  "startTime": string,
  "endTime": string,
  "word": string,
  "confidence": number,
  "speakerTag": number
}
Fields
startTime

string (Duration format)

Output only. Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, terminated by 's'. Example: "3.5s".

endTime

string (Duration format)

Output only. Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, terminated by 's'. Example: "3.5s".

word

string

Output only. The word corresponding to this set of information.

confidence

number

Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal=true. This field is not guaranteed to be accurate, and users should not rely on it to always be provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

speakerTag

number

Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Value ranges from '1' to diarizationSpeakerCount. speakerTag is set if enableSpeakerDiarization = 'true' and only in the top alternative.
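Taken together, the WordInfo fields above support simple post-processing. The sketch below parses the Duration strings (e.g. "3.5s") into float seconds and collapses consecutive words sharing a speakerTag into speaker turns; the word list is fabricated for illustration.

```python
from itertools import groupby

def parse_duration(value):
    """Parse a Duration string such as "3.5s" into float seconds."""
    if not value.endswith("s"):
        raise ValueError(f"expected trailing 's' in duration: {value!r}")
    return float(value[:-1])

def speaker_turns(words):
    """Collapse consecutive words sharing a speakerTag into (tag, text) turns."""
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speakerTag"]):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns

# Fabricated WordInfo entries, shaped like the JSON representation above.
words = [
    {"startTime": "0s", "endTime": "0.300s", "word": "hello", "speakerTag": 1},
    {"startTime": "0.300s", "endTime": "0.600s", "word": "there", "speakerTag": 1},
    {"startTime": "0.900s", "endTime": "1.100s", "word": "hi", "speakerTag": 2},
]
print(speaker_turns(words))  # → [(1, 'hello there'), (2, 'hi')]
last = words[-1]
print(f"{parse_duration(last['endTime']) - parse_duration(last['startTime']):.3f}s")  # → 0.200s
```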
