Method: projects.locations.recognizers.recognize

Performs synchronous Speech recognition: receive results after all audio has been sent and processed.

HTTP request

POST https://{endpoint}/v2/{recognizer=projects/*/locations/*/recognizers/*}:recognize

Where {endpoint} is one of the supported service endpoints.

The URLs use gRPC Transcoding syntax.

Path parameters

Parameters
recognizer

string

Required. The name of the Recognizer to use during recognition. The expected format is projects/{project}/locations/{location}/recognizers/{recognizer}. The {recognizer} segment may be set to _ to use an empty implicit Recognizer.
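
A minimal sketch, in Python, of building the recognizer resource name; the project ID and location shown are placeholders, not values from this reference:

# Hypothetical project and location, for illustration only.
project_id = "my-project"
location = "global"

# Explicit Recognizer resource.
recognizer = f"projects/{project_id}/locations/{location}/recognizers/my-recognizer"

# Or set the {recognizer} segment to "_" to use the empty implicit Recognizer
# and supply all settings through the request's config field instead.
implicit_recognizer = f"projects/{project_id}/locations/{location}/recognizers/_"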

Request body

The request body contains data with the following structure:

JSON representation
{
  "config": {
    object (RecognitionConfig)
  },
  "configMask": string,

  // Union field audio_source can be only one of the following:
  "content": string,
  "uri": string
  // End of list of possible types for union field audio_source.
}
Fields
config

object (RecognitionConfig)

Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the configMask field can be used to override parts of the defaultRecognitionConfig of the Recognizer resource.

configMask

string (FieldMask format)

The list of fields in config that override the values in the defaultRecognitionConfig of the recognizer during this recognition request. If no mask is provided, all non-default valued fields in config override the values in the recognizer for this recognition request. If a mask is provided, only the fields listed in the mask override the config in the recognizer for this recognition request. If a wildcard (*) is provided, config completely overrides and replaces the config in the recognizer for this recognition request.

This is a comma-separated list of fully qualified names of fields. Example: "user.displayName,photo".

Union field audio_source. The audio source, which is either inline content or a Google Cloud Storage URI. audio_source can be only one of the following:
content

string (bytes format)

The audio data bytes encoded as specified in RecognitionConfig. As with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64.

A base64-encoded string.

uri

string

URI that points to a file that contains audio data bytes as specified in RecognitionConfig. The file must not be compressed (for example, gzip). Currently, only Google Cloud Storage URIs are supported, which must be specified in the following format: gs://bucket_name/object_name (other URI formats return INVALID_ARGUMENT). For more information, see Request URIs.
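
A sketch of an end-to-end request in Python, assuming the speech.googleapis.com endpoint, the google-auth library, and application default credentials; the RecognitionConfig fields shown (autoDecodingConfig, languageCodes, model) and the configMask paths are illustrative assumptions, not part of this reference:

import base64

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Hypothetical values used for illustration.
endpoint = "speech.googleapis.com"
recognizer = "projects/my-project/locations/global/recognizers/_"

# Inline audio must be base64-encoded in the JSON representation.
with open("audio.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

body = {
    # Assumed RecognitionConfig fields; see the RecognitionConfig reference.
    "config": {
        "autoDecodingConfig": {},
        "languageCodes": ["en-US"],
        "model": "long",
    },
    # Only the listed config fields override the recognizer's
    # defaultRecognitionConfig; a wildcard "*" would replace it entirely.
    "configMask": "languageCodes,model",
    # Union field audio_source: set "content" for inline bytes or
    # "uri" for a gs://bucket_name/object_name path, never both.
    "content": audio_content,
}

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)
response = session.post(f"https://{endpoint}/v2/{recognizer}:recognize", json=body)
response.raise_for_status()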

Response body

Response message for the recognizers.recognize method.

If successful, the response body contains data with the following structure:

JSON representation
{
  "results": [
    {
      object (SpeechRecognitionResult)
    }
  ],
  "metadata": {
    object (RecognitionResponseMetadata)
  }
}
Fields
results[]

object (SpeechRecognitionResult)

Sequential list of transcription results corresponding to sequential portions of audio.

metadata

object (RecognitionResponseMetadata)

Metadata about the recognition.
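
Continuing the request sketch above, a minimal way to read the response body in Python; the field names follow the structure shown here, everything else is an assumption:

# Pull the top hypothesis for each sequential portion of audio.
payload = response.json()
for result in payload.get("results", []):
    alternatives = result.get("alternatives", [])
    if alternatives:
        top = alternatives[0]  # the first alternative is the most probable
        print(top.get("transcript", ""), top.get("confidence"))

# Billed audio seconds, when available.
print(payload.get("metadata", {}).get("totalBilledDuration"))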

Authorization scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

IAM Permissions

Requires the following IAM permission on the recognizer resource:

  • speech.recognizers.recognize

For more information, see the IAM documentation.

SpeechRecognitionResult

A speech recognition result corresponding to a portion of the audio.

JSON representation
{
  "alternatives": [
    {
      object (SpeechRecognitionAlternative)
    }
  ],
  "channelTag": integer,
  "resultEndOffset": string,
  "languageCode": string
}
Fields
alternatives[]

object (SpeechRecognitionAlternative)

May contain one or more recognition hypotheses. These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

channelTag

integer

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from 1 to N.

resultEndOffset

string (Duration format)

Time offset of the end of this result relative to the beginning of the audio.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

languageCode

string

Output only. The BCP-47 language tag of the language in this result. This is the language detected as most likely to be spoken in the audio.
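
A small Python helper, continuing the sketch above, for the Duration wire format used by resultEndOffset (a decimal number of seconds followed by "s", such as "3.5s"), plus per-result metadata:

def parse_duration_seconds(duration: str) -> float:
    # "3.5s" -> 3.5; assumes the documented seconds-plus-"s" format.
    return float(duration.rstrip("s"))

for result in payload.get("results", []):
    print(
        result.get("channelTag"),                                     # 1..N for multi-channel audio
        parse_duration_seconds(result.get("resultEndOffset", "0s")),  # end of this result
        result.get("languageCode"),                                   # detected BCP-47 tag
    )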

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

JSON representation
{
  "transcript": string,
  "confidence": number,
  "words": [
    {
      object (WordInfo)
    }
  ]
}
Fields
transcript

string

Transcript text representing the words that the user spoke.

confidence

number

The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal is set to true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

words[]

object (WordInfo)

A list of word-specific information for each recognized word. When the SpeakerDiarizationConfig is set, you will see all the words from the beginning of the audio.

WordInfo

Word-specific information for recognized words.

JSON representation
{
  "startOffset": string,
  "endOffset": string,
  "word": string,
  "confidence": number,
  "speakerLabel": string
}
Fields
startOffset

string (Duration format)

Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets is true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

endOffset

string (Duration format)

Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets is true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

word

string

The word corresponding to this set of information.

confidence

number

The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal is set to true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

speakerLabel

string

A distinct label is assigned to each speaker in the audio. This field specifies which of those speakers was detected to have spoken this word. speakerLabel is set only if SpeakerDiarizationConfig is given, and only in the top alternative.
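
A sketch, reusing payload and parse_duration_seconds from the sketches above, that groups the words of the top hypothesis by speaker; it assumes word time offsets and speaker diarization were enabled in the RecognitionConfig, otherwise startOffset, endOffset, and speakerLabel are unset:

from collections import defaultdict

words_by_speaker = defaultdict(list)
for result in payload.get("results", []):
    alternatives = result.get("alternatives", [])
    if not alternatives:
        continue
    for word in alternatives[0].get("words", []):  # word info is only in the top alternative
        start = parse_duration_seconds(word.get("startOffset", "0s"))
        end = parse_duration_seconds(word.get("endOffset", "0s"))
        words_by_speaker[word.get("speakerLabel", "unknown")].append(
            (start, end, word.get("word", ""))
        )

for label, words in sorted(words_by_speaker.items()):
    print(label, words)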

RecognitionResponseMetadata

Metadata about the recognition request and response.

JSON representation
{
  "totalBilledDuration": string
}
Fields
totalBilledDuration

string (Duration format)

When available, billed audio seconds for the corresponding request.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".