Method: projects.locations.recognizers.recognize

Performs synchronous Speech recognition: receive results after all audio has been sent and processed.

HTTP request

POST https://{endpoint}/v2/{recognizer=projects/*/locations/*/recognizers/*}:recognize

Where {endpoint} is one of the supported service endpoints.

The URLs use gRPC Transcoding syntax.

Path parameters

Parameters
recognizer

string

Required. The name of the Recognizer to use during recognition. The expected format is projects/{project}/locations/{location}/recognizers/{recognizer}. The {recognizer} segment may be set to _ to use an empty implicit Recognizer.
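
A minimal sketch, in Python, of building the recognizer resource name; the project ID and location shown are placeholders, not values from this reference:

# Hypothetical project and location, for illustration only.
project_id = "my-project"
location = "global"

# Explicit Recognizer resource.
recognizer = f"projects/{project_id}/locations/{location}/recognizers/my-recognizer"

# Or set the {recognizer} segment to "_" to use the empty implicit Recognizer
# and supply all settings through the request's config field instead.
implicit_recognizer = f"projects/{project_id}/locations/{location}/recognizers/_"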

Request body

The request body contains data with the following structure:

JSON representation
{
  "config": {
    object (RecognitionConfig)
  },
  "configMask": string,

  // Union field audio_source can be only one of the following:
  "content": string,
  "uri": string
  // End of list of possible types for union field audio_source.
}
Fields
config

object (RecognitionConfig)

Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the configMask field can be used to override parts of the defaultRecognitionConfig of the Recognizer resource.

configMask

string (FieldMask format)

The list of fields in config that override the values in the defaultRecognitionConfig of the recognizer during this recognition request. If no mask is provided, all non-default valued fields in config override the values in the recognizer for this recognition request. If a mask is provided, only the fields listed in the mask override the config in the recognizer for this recognition request. If a wildcard (*) is provided, config completely overrides and replaces the config in the recognizer for this recognition request.

This is a comma-separated list of fully qualified names of fields. Example: "user.displayName,photo".

Union field audio_source. The audio source, which is either inline content or a Google Cloud Storage URI. audio_source can be only one of the following:
content

string (bytes format)

The audio data bytes encoded as specified in RecognitionConfig. As with all bytes fields, protocol buffers use a pure binary representation, whereas JSON representations use base64.

A base64-encoded string.

uri

string

URI that points to a file that contains audio data bytes as specified in RecognitionConfig. The file must not be compressed (for example, gzip). Currently, only Google Cloud Storage URIs are supported, which must be specified in the following format: gs://bucket_name/object_name (other URI formats return INVALID_ARGUMENT). For more information, see Request URIs.
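
A sketch of an end-to-end request in Python, assuming the speech.googleapis.com endpoint, the google-auth library, and application default credentials; the RecognitionConfig fields shown (autoDecodingConfig, languageCodes, model) and the configMask paths are illustrative assumptions, not part of this reference:

import base64

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Hypothetical values used for illustration.
endpoint = "speech.googleapis.com"
recognizer = "projects/my-project/locations/global/recognizers/_"

# Inline audio must be base64-encoded in the JSON representation.
with open("audio.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

body = {
    # Assumed RecognitionConfig fields; see the RecognitionConfig reference.
    "config": {
        "autoDecodingConfig": {},
        "languageCodes": ["en-US"],
        "model": "long",
    },
    # Only the listed config fields override the recognizer's
    # defaultRecognitionConfig; a wildcard "*" would replace it entirely.
    "configMask": "languageCodes,model",
    # Union field audio_source: set "content" for inline bytes or
    # "uri" for a gs://bucket_name/object_name path, never both.
    "content": audio_content,
}

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)
response = session.post(f"https://{endpoint}/v2/{recognizer}:recognize", json=body)
response.raise_for_status()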

Response body

Response message for the recognizers.recognize method.

If successful, the response body contains data with the following structure:

JSON representation
{
  "results": [
    {
      object (SpeechRecognitionResult)
    }
  ],
  "metadata": {
    object (RecognitionResponseMetadata)
  }
}
Fields
results[]

object (SpeechRecognitionResult)

Sequential list of transcription results corresponding to sequential portions of audio.

metadata

object (RecognitionResponseMetadata)

Metadata about the recognition.
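
Continuing the request sketch above, a minimal way to read the response body in Python; the field names follow the structure shown here, everything else is an assumption:

# Pull the top hypothesis for each sequential portion of audio.
payload = response.json()
for result in payload.get("results", []):
    alternatives = result.get("alternatives", [])
    if alternatives:
        top = alternatives[0]  # the first alternative is the most probable
        print(top.get("transcript", ""), top.get("confidence"))

# Billed audio seconds, when available.
print(payload.get("metadata", {}).get("totalBilledDuration"))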

Authorization scopes

Requires the following OAuth scope:

  • https://www.googleapis.com/auth/cloud-platform

For more information, see the Authentication Overview.

IAM Permissions

Requires the following IAM permission on the recognizer resource:

  • speech.recognizers.recognize

For more information, see the IAM documentation.

SpeechRecognitionResult

A speech recognition result corresponding to a portion of the audio.

JSON representation
{
  "alternatives": [
    {
      object (SpeechRecognitionAlternative)
    }
  ],
  "channelTag": integer,
  "resultEndOffset": string,
  "languageCode": string
}
Fields
alternatives[]

object (SpeechRecognitionAlternative)

May contain one or more recognition hypotheses. These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.

channelTag

integer

For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from 1 to N.

resultEndOffset

string (Duration format)

Time offset of the end of this result relative to the beginning of the audio.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

languageCode

string

Output only. The BCP-47 language tag of the language in this result. This is the language detected as most likely to be spoken in the audio.
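
A small Python helper, continuing the sketch above, for the Duration wire format used by resultEndOffset (a decimal number of seconds followed by "s", such as "3.5s"), plus per-result metadata:

def parse_duration_seconds(duration: str) -> float:
    # "3.5s" -> 3.5; assumes the documented seconds-plus-"s" format.
    return float(duration.rstrip("s"))

for result in payload.get("results", []):
    print(
        result.get("channelTag"),                                     # 1..N for multi-channel audio
        parse_duration_seconds(result.get("resultEndOffset", "0s")),  # end of this result
        result.get("languageCode"),                                   # detected BCP-47 tag
    )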

SpeechRecognitionAlternative

Alternative hypotheses (a.k.a. n-best list).

JSON representation
{
  "transcript": string,
  "confidence": number,
  "words": [
    {
      object (WordInfo)
    }
  ]
}
Fields
transcript

string

Transcript text representing the words that the user spoke.

confidence

number

The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal is set to true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

words[]

object (WordInfo)

A list of word-specific information for each recognized word. When the SpeakerDiarizationConfig is set, you will see all the words from the beginning of the audio.

WordInfo

Word-specific information for recognized words.

JSON representation
{
  "startOffset": string,
  "endOffset": string,
  "word": string,
  "confidence": number,
  "speakerLabel": string
}
Fields
startOffset

string (Duration format)

Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets is true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

endOffset

string (Duration format)

Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets is true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

word

string

The word corresponding to this set of information.

confidence

number

The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where isFinal is set to true. This field is not guaranteed to be accurate, and users should not rely on it always being provided. The default of 0.0 is a sentinel value indicating that confidence was not set.

speakerLabel

string

A distinct label is assigned to each speaker in the audio. This field specifies which of those speakers was detected to have spoken this word. speakerLabel is set only if SpeakerDiarizationConfig is given, and only in the top alternative.
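
A sketch, reusing payload and parse_duration_seconds from the sketches above, that groups the words of the top hypothesis by speaker; it assumes word time offsets and speaker diarization were enabled in the RecognitionConfig, otherwise startOffset, endOffset, and speakerLabel are unset:

from collections import defaultdict

words_by_speaker = defaultdict(list)
for result in payload.get("results", []):
    alternatives = result.get("alternatives", [])
    if not alternatives:
        continue
    for word in alternatives[0].get("words", []):  # word info is only in the top alternative
        start = parse_duration_seconds(word.get("startOffset", "0s"))
        end = parse_duration_seconds(word.get("endOffset", "0s"))
        words_by_speaker[word.get("speakerLabel", "unknown")].append(
            (start, end, word.get("word", ""))
        )

for label, words in sorted(words_by_speaker.items()):
    print(label, words)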

RecognitionResponseMetadata

Metadata about the recognition request and response.

JSON representation
{
  "totalBilledDuration": string
}
Fields
totalBilledDuration

string (Duration format)

When available, billed audio seconds for the corresponding request.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".