Index
Speech (interface)
AccessMetadata (message)
AccessMetadata.ConstraintType (enum)
AutoDetectDecodingConfig (message)
BatchRecognizeFileMetadata (message)
BatchRecognizeFileResult (message)
BatchRecognizeMetadata (message)
BatchRecognizeRequest (message)
BatchRecognizeRequest.ProcessingStrategy (enum)
BatchRecognizeResponse (message)
BatchRecognizeResults (message)
BatchRecognizeTranscriptionMetadata (message)
CloudStorageResult (message)
Config (message)
CreateCustomClassRequest (message)
CreatePhraseSetRequest (message)
CreateRecognizerRequest (message)
CustomClass (message)
CustomClass.ClassItem (message)
CustomClass.State (enum)
DeleteCustomClassRequest (message)
DeletePhraseSetRequest (message)
DeleteRecognizerRequest (message)
ExplicitDecodingConfig (message)
ExplicitDecodingConfig.AudioEncoding (enum)
GcsOutputConfig (message)
GetConfigRequest (message)
GetCustomClassRequest (message)
GetPhraseSetRequest (message)
GetRecognizerRequest (message)
InlineOutputConfig (message)
InlineResult (message)
LanguageMetadata (message)
ListCustomClassesRequest (message)
ListCustomClassesResponse (message)
ListPhraseSetsRequest (message)
ListPhraseSetsResponse (message)
ListRecognizersRequest (message)
ListRecognizersResponse (message)
LocationsMetadata (message)
ModelFeature (message)
ModelFeatures (message)
ModelMetadata (message)
NativeOutputFileFormatConfig (message)
OperationMetadata (message)
OutputFormatConfig (message)
PhraseSet (message)
PhraseSet.Phrase (message)
PhraseSet.State (enum)
RecognitionConfig (message)
RecognitionFeatures (message)
RecognitionFeatures.MultiChannelMode (enum)
RecognitionOutputConfig (message)
RecognitionResponseMetadata (message)
RecognizeRequest (message)
RecognizeResponse (message)
Recognizer (message)
Recognizer.State (enum)
SpeakerDiarizationConfig (message)
SpeechAdaptation (message)
SpeechAdaptation.AdaptationPhraseSet (message)
SpeechRecognitionAlternative (message)
SpeechRecognitionResult (message)
SrtOutputFileFormatConfig (message)
StreamingRecognitionConfig (message)
StreamingRecognitionFeatures (message)
StreamingRecognitionFeatures.VoiceActivityTimeout (message)
StreamingRecognitionResult (message)
StreamingRecognizeRequest (message)
StreamingRecognizeResponse (message)
StreamingRecognizeResponse.SpeechEventType (enum)
TranscriptNormalization (message)
TranscriptNormalization.Entry (message)
TranslationConfig (message)
UndeleteCustomClassRequest (message)
UndeletePhraseSetRequest (message)
UndeleteRecognizerRequest (message)
UpdateConfigRequest (message)
UpdateCustomClassRequest (message)
UpdatePhraseSetRequest (message)
UpdateRecognizerRequest (message)
VttOutputFileFormatConfig (message)
WordInfo (message)
Speech
Enables speech transcription and resource management.
Method | Description
---|---
BatchRecognize | Performs batch asynchronous speech recognition: send a request with N audio files and receive a long-running operation that can be polled to see when the transcriptions are finished.
CreateCustomClass | Creates a CustomClass.
CreatePhraseSet | Creates a PhraseSet.
CreateRecognizer | Creates a Recognizer.
DeleteCustomClass | Deletes the CustomClass.
DeletePhraseSet | Deletes the PhraseSet.
DeleteRecognizer | Deletes the Recognizer.
GetConfig | Returns the requested Config.
GetCustomClass | Returns the requested CustomClass.
GetPhraseSet | Returns the requested PhraseSet.
GetRecognizer | Returns the requested Recognizer.
ListCustomClasses | Lists CustomClasses.
ListPhraseSets | Lists PhraseSets.
ListRecognizers | Lists Recognizers.
Recognize | Performs synchronous Speech recognition: receive results after all audio has been sent and processed.
StreamingRecognize | Performs bidirectional streaming speech recognition: receive results while sending audio. This method is only available via the gRPC API (not REST).
UndeleteCustomClass | Undeletes the CustomClass.
UndeletePhraseSet | Undeletes the PhraseSet.
UndeleteRecognizer | Undeletes the Recognizer.
UpdateConfig | Updates the Config.
UpdateCustomClass | Updates the CustomClass.
UpdatePhraseSet | Updates the PhraseSet.
UpdateRecognizer | Updates the Recognizer.
AccessMetadata
The access metadata for a particular region. This can be applied if the org policy for the given project disallows a particular region.
Field | Description
---|---
constraint_type | Describes the different types of constraints that are applied.
ConstraintType
Describes the different types of constraints that can be applied on a region.
Enum | Description
---|---
CONSTRAINT_TYPE_UNSPECIFIED | Unspecified constraint applied.
RESOURCE_LOCATIONS_ORG_POLICY_CREATE_CONSTRAINT | The project's org policy disallows the given region.
AutoDetectDecodingConfig
This type has no fields.
Automatically detected decoding parameters. Supported for the following encodings:
WAV_LINEAR16: 16-bit signed little-endian PCM samples in a WAV container.
WAV_MULAW: 8-bit companded mulaw samples in a WAV container.
WAV_ALAW: 8-bit companded alaw samples in a WAV container.
RFC4867_5_AMR: AMR frames with an rfc4867.5 header.
RFC4867_5_AMRWB: AMR-WB frames with an rfc4867.5 header.
FLAC: FLAC frames in the "native FLAC" container format.
MP3: MPEG audio frames with optional (ignored) ID3 metadata.
OGG_OPUS: Opus audio frames in an Ogg container.
WEBM_OPUS: Opus audio frames in a WebM container.
MP4_AAC: AAC audio frames in an MP4 container.
M4A_AAC: AAC audio frames in an M4A container.
MOV_AAC: AAC audio frames in an MOV container.
BatchRecognizeFileMetadata
Metadata about a single file in a batch for BatchRecognize.
Field | Description
---|---
config | Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the
config_mask | The list of fields in
Union field audio_source. The audio source, which is a Google Cloud Storage URI. audio_source can be only one of the following:
uri | Cloud Storage URI for the audio file.
BatchRecognizeFileResult
Final results for a single file.
Field | Description
---|---
error | Error if one was encountered.
metadata |
uri | Deprecated. Use
transcript | Deprecated. Use
Union field |
cloud_storage_result | Recognition results written to Cloud Storage. This is populated only when
inline_result | Recognition results. This is populated only when
BatchRecognizeMetadata
Operation metadata for BatchRecognize.
Field | Description
---|---
transcription_metadata | Map from provided filename to the transcription metadata for that file.
BatchRecognizeRequest
Request message for the BatchRecognize method.
Field | Description
---|---
recognizer | Required. The name of the Recognizer to use during recognition. The expected format is
config | Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the
config_mask | The list of fields in
files[] | Audio files with file metadata for ASR. The maximum number of files allowed to be specified is 15.
recognition_output_config | Configuration options for where to output the transcripts of each file.
processing_strategy | Processing strategy to use for this request.
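The request shape above can be sketched as a plain JSON body for a REST BatchRecognize call. This is an illustrative sketch only: the project, recognizer, and bucket names are placeholders, and field names follow the message reference above in JSON (camelCase) form.

```python
# Hypothetical BatchRecognize request body as a plain dict (placeholder names).
batch_request = {
    "recognizer": "projects/my-project/locations/global/recognizers/my-recognizer",
    "config": {
        "autoDecodingConfig": {},  # let the service detect the container format
        "languageCodes": ["en-US"],
        "model": "long",
    },
    "files": [
        {"uri": "gs://my-bucket/audio-1.wav"},
        {"uri": "gs://my-bucket/audio-2.wav"},
    ],
    "recognitionOutputConfig": {
        "gcsOutputConfig": {"uri": "gs://my-bucket/transcripts/"}
    },
    "processingStrategy": "DYNAMIC_BATCHING",
}

# The reference caps a single request at 15 files.
assert len(batch_request["files"]) <= 15
```

Because each entry in `files[]` can carry its own `config` and `config_mask`, per-file settings can override the request-level config where needed.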
ProcessingStrategy
Possible processing strategies for batch requests.
Enum | Description
---|---
PROCESSING_STRATEGY_UNSPECIFIED | Default value for the processing strategy. The request is processed as soon as it's received.
DYNAMIC_BATCHING | If selected, processes the request during lower utilization periods for a price discount. The request is fulfilled within 24 hours.
BatchRecognizeResponse
Response message for BatchRecognize that is packaged into a long-running Operation.
Field | Description
---|---
results | Map from filename to the final result for that file.
total_billed_duration | When available, billed audio seconds for the corresponding request.
BatchRecognizeResults
Output type for Cloud Storage of BatchRecognize transcripts. Though this proto isn't returned in this API anywhere, the Cloud Storage transcripts will be this proto serialized and should be parsed as such.
Field | Description
---|---
results[] | Sequential list of transcription results corresponding to sequential portions of audio.
metadata | Metadata about the recognition.
BatchRecognizeTranscriptionMetadata
Metadata about transcription for a single file (for example, progress percent).
Field | Description
---|---
progress_percent | How much of the file has been transcribed so far.
error | Error if one was encountered.
uri | The Cloud Storage URI to which recognition results will be written.
CloudStorageResult
Final results written to Cloud Storage.
Field | Description
---|---
uri | The Cloud Storage URI to which recognition results were written.
vtt_format_uri | The Cloud Storage URI to which recognition results were written as VTT formatted captions. This is populated only when
srt_format_uri | The Cloud Storage URI to which recognition results were written as SRT formatted captions. This is populated only when
Config
Message representing the config for the Speech-to-Text API. This includes an optional KMS key with which incoming data will be encrypted.
Field | Description
---|---
name | Output only. Identifier. The name of the config resource. There is exactly one config resource per project per location. The expected format is
kms_key_name | Optional. An optional KMS key name that, if present, will be used to encrypt Speech-to-Text resources at rest. Updating this key will not re-encrypt existing resources; only new resources will be encrypted using this key. The expected format is
update_time | Output only. The most recent time this resource was modified.
CreateCustomClassRequest
Request message for the CreateCustomClass method.
Field | Description
---|---
custom_class | Required. The CustomClass to create.
validate_only | If set, validate the request and preview the CustomClass, but do not actually create it.
custom_class_id | The ID to use for the CustomClass, which will become the final component of the CustomClass's resource name. This value should be 4-63 characters, and valid characters are /[a-z][0-9]-/.
parent | Required. The project and location where this CustomClass will be created. The expected format is
CreatePhraseSetRequest
Request message for the CreatePhraseSet method.
Field | Description
---|---
phrase_set | Required. The PhraseSet to create.
validate_only | If set, validate the request and preview the PhraseSet, but do not actually create it.
phrase_set_id | The ID to use for the PhraseSet, which will become the final component of the PhraseSet's resource name. This value should be 4-63 characters, and valid characters are /[a-z][0-9]-/.
parent | Required. The project and location where this PhraseSet will be created. The expected format is
CreateRecognizerRequest
Request message for the CreateRecognizer method.
Field | Description
---|---
recognizer | Required. The Recognizer to create.
validate_only | If set, validate the request and preview the Recognizer, but do not actually create it.
recognizer_id | The ID to use for the Recognizer, which will become the final component of the Recognizer's resource name. This value should be 4-63 characters, and valid characters are /[a-z][0-9]-/.
parent | Required. The project and location where this Recognizer will be created. The expected format is
CustomClass
CustomClass for biasing in speech recognition. Used to define a set of words or phrases that represents a common concept or theme likely to appear in your audio, for example a list of passenger ship names.
Field | Description
---|---
name | Output only. Identifier. The resource name of the CustomClass. Format:
uid | Output only. System-assigned unique identifier for the CustomClass.
display_name | Optional. User-settable, human-readable name for the CustomClass. Must be 63 characters or less.
items[] | A collection of class items.
state | Output only. The CustomClass lifecycle state.
create_time | Output only. Creation time.
update_time | Output only. The most recent time this resource was modified.
delete_time | Output only. The time at which this resource was requested for deletion.
expire_time | Output only. The time at which this resource will be purged.
annotations | Optional. Allows users to store small amounts of arbitrary data. Both the key and the value must be 63 characters or less each. At most 100 annotations.
etag | Output only. This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding.
reconciling | Output only. Whether or not this CustomClass is in the process of being updated.
kms_key_name | Output only. The KMS key name with which the CustomClass is encrypted. The expected format is
kms_key_version_name | Output only. The KMS key version name with which the CustomClass is encrypted. The expected format is
ClassItem
An item of the class.
Field | Description
---|---
value | The class item's value.
State
Set of states that define the lifecycle of a CustomClass.
Enum | Description
---|---
STATE_UNSPECIFIED | Unspecified state. This is only used/useful for distinguishing unset values.
ACTIVE | The normal and active state.
DELETED | This CustomClass has been deleted.
DeleteCustomClassRequest
Request message for the DeleteCustomClass method.
Field | Description
---|---
name | Required. The name of the CustomClass to delete. Format:
validate_only | If set, validate the request and preview the deleted CustomClass, but do not actually delete it.
allow_missing | If set to true, and the CustomClass is not found, the request will succeed and be a no-op (no Operation is recorded in this case).
etag | This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding.
DeletePhraseSetRequest
Request message for the DeletePhraseSet method.
Field | Description
---|---
name | Required. The name of the PhraseSet to delete. Format:
validate_only | If set, validate the request and preview the deleted PhraseSet, but do not actually delete it.
allow_missing | If set to true, and the PhraseSet is not found, the request will succeed and be a no-op (no Operation is recorded in this case).
etag | This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding.
DeleteRecognizerRequest
Request message for the DeleteRecognizer method.
Field | Description
---|---
name | Required. The name of the Recognizer to delete. Format:
validate_only | If set, validate the request and preview the deleted Recognizer, but do not actually delete it.
allow_missing | If set to true, and the Recognizer is not found, the request will succeed and be a no-op (no Operation is recorded in this case).
etag | This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding.
ExplicitDecodingConfig
Explicitly specified decoding parameters.
Field | Description
---|---
encoding | Required. Encoding of the audio data sent for recognition.
sample_rate_hertz | Sample rate in Hertz of the audio data sent for recognition. Valid values are 8000-48000; 16000 is optimal. For best results, set the sampling rate of the audio source to 16000 Hz. If that's not possible, use the native sample rate of the audio source (instead of re-sampling). Supported for the following encodings:
audio_channel_count | Number of channels present in the audio data sent for recognition. Supported for the following encodings: The maximum allowed value is 8.
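The documented limits on explicit decoding (sample rate 8000-48000 Hz, at most 8 channels) can be checked client-side before sending a request. A minimal sketch; the helper name is ours, not part of the API:

```python
def make_explicit_decoding_config(encoding, sample_rate_hertz, audio_channel_count=1):
    """Build an explicitDecodingConfig dict, enforcing the documented limits."""
    if encoding not in ("LINEAR16", "MULAW", "ALAW"):
        raise ValueError("unsupported encoding: %s" % encoding)
    if not 8000 <= sample_rate_hertz <= 48000:
        raise ValueError("sample_rate_hertz must be within 8000-48000")
    if not 1 <= audio_channel_count <= 8:
        raise ValueError("audio_channel_count must be within 1-8")
    return {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hertz,
        "audioChannelCount": audio_channel_count,
    }

config = make_explicit_decoding_config("LINEAR16", 16000)  # 16 kHz is optimal
```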
AudioEncoding
Supported audio data encodings.
Enum | Description
---|---
AUDIO_ENCODING_UNSPECIFIED | Default value. This value is unused.
LINEAR16 | Headerless 16-bit signed little-endian PCM samples.
MULAW | Headerless 8-bit companded mulaw samples.
ALAW | Headerless 8-bit companded alaw samples.
GcsOutputConfig
Output configurations for Cloud Storage.
Field | Description
---|---
uri | The Cloud Storage URI prefix to which recognition results will be written.
GetConfigRequest
Request message for the GetConfig method.
Field | Description
---|---
name | Required. The name of the config to retrieve. There is exactly one config resource per project per location. The expected format is
GetCustomClassRequest
Request message for the GetCustomClass method.
Field | Description
---|---
name | Required. The name of the CustomClass to retrieve. The expected format is
GetPhraseSetRequest
Request message for the GetPhraseSet method.
Field | Description
---|---
name | Required. The name of the PhraseSet to retrieve. The expected format is
GetRecognizerRequest
Request message for the GetRecognizer method.
Field | Description
---|---
name | Required. The name of the Recognizer to retrieve. The expected format is
InlineOutputConfig
This type has no fields.
Output configurations for inline response.
InlineResult
Final results returned inline in the recognition response.
Field | Description
---|---
transcript | The transcript for the audio file.
vtt_captions | The transcript for the audio file as VTT formatted captions. This is populated only when
srt_captions | The transcript for the audio file as SRT formatted captions. This is populated only when
LanguageMetadata
The metadata about locales available in a given region. Currently this is just the models that are available for each locale.
Field | Description
---|---
models | Map of locale (language code) -> models.
ListCustomClassesRequest
Request message for the ListCustomClasses method.
Field | Description
---|---
parent | Required. The project and location of CustomClass resources to list. The expected format is
page_size | Number of results per request. A valid page_size ranges from 0 to 100 inclusive. If the page_size is zero or unspecified, a page size of 5 will be chosen. If the page size exceeds 100, it will be coerced down to 100. Note that a call might return fewer results than the requested page size.
page_token | A page token, received from a previous ListCustomClasses call. Provide this to retrieve the subsequent page. When paginating, all other parameters provided to ListCustomClasses must match the call that provided the page token.
show_deleted | Whether or not to show resources that have been deleted.
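The page_size/page_token contract above (and in the other List* methods below) can be exercised with a small loop: pass the returned next_page_token back until the service omits it. A sketch with a simulated backend; the helper names are ours:

```python
def list_all(fetch_page, page_size=100):
    """Drain a paginated List* method by passing next_page_token back
    until the service omits it. `fetch_page` stands in for any of
    ListCustomClasses / ListPhraseSets / ListRecognizers."""
    items, token = [], ""
    while True:
        page = fetch_page(page_size=page_size, page_token=token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken", "")
        if not token:  # omitted token means there are no subsequent pages
            return items

# Simulated two-page backend to exercise the loop.
def fake_fetch(page_size, page_token):
    if page_token == "":
        return {"items": ["class-a", "class-b"], "nextPageToken": "page-2"}
    return {"items": ["class-c"]}

assert list_all(fake_fetch) == ["class-a", "class-b", "class-c"]
```

Note the documented requirement: every call in the sequence must keep all parameters other than page_token identical to the call that produced the token.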
ListCustomClassesResponse
Response message for the ListCustomClasses method.
Field | Description
---|---
custom_classes[] | The list of requested CustomClasses.
next_page_token | A token, which can be sent as page_token to retrieve the next page. If this field is omitted, there are no subsequent pages.
ListPhraseSetsRequest
Request message for the ListPhraseSets method.
Field | Description
---|---
parent | Required. The project and location of PhraseSet resources to list. The expected format is
page_size | The maximum number of PhraseSets to return. The service may return fewer than this value. If unspecified, at most 5 PhraseSets will be returned. The maximum value is 100; values above 100 will be coerced to 100.
page_token | A page token, received from a previous ListPhraseSets call. Provide this to retrieve the subsequent page. When paginating, all other parameters provided to ListPhraseSets must match the call that provided the page token.
show_deleted | Whether or not to show resources that have been deleted.
ListPhraseSetsResponse
Response message for the ListPhraseSets method.
Field | Description
---|---
phrase_sets[] | The list of requested PhraseSets.
next_page_token | A token, which can be sent as page_token to retrieve the next page. If this field is omitted, there are no subsequent pages.
ListRecognizersRequest
Request message for the ListRecognizers method.
Field | Description
---|---
parent | Required. The project and location of Recognizers to list. The expected format is
page_size | The maximum number of Recognizers to return. The service may return fewer than this value. If unspecified, at most 5 Recognizers will be returned. The maximum value is 100; values above 100 will be coerced to 100.
page_token | A page token, received from a previous ListRecognizers call. Provide this to retrieve the subsequent page. When paginating, all other parameters provided to ListRecognizers must match the call that provided the page token.
show_deleted | Whether or not to show resources that have been deleted.
ListRecognizersResponse
Response message for the ListRecognizers method.
Field | Description
---|---
recognizers[] | The list of requested Recognizers.
next_page_token | A token, which can be sent as page_token to retrieve the next page. If this field is omitted, there are no subsequent pages.
LocationsMetadata
Main metadata for the Locations API for STT V2. Currently this is just the metadata about locales, models, and features.
Field | Description
---|---
languages | Information about available locales, models, and features, represented in the hierarchical structure locales -> models -> features.
access_metadata | Information about access metadata for the region and given project.
ModelFeature
Represents a singular feature of a model. If the feature is recognizer, the release_state of the feature represents the release_state of the model.
Field | Description
---|---
feature | The name of the feature (Note: the feature can be
release_state | The release state of the feature.
ModelFeatures
Represents the collection of features belonging to a model.
Field | Description
---|---
model_feature[] | Repeated field that contains all features of the model.
ModelMetadata
The metadata about the models in a given region for a specific locale. Currently this is just the features of the model.
Field | Description
---|---
model_features | Map of the model name -> features of that model.
NativeOutputFileFormatConfig
This type has no fields.
Output configurations for serialized BatchRecognizeResults protos.
OperationMetadata
Represents the metadata of a long-running operation.
Field | Description
---|---
create_time | The time the operation was created.
update_time | The time the operation was last updated.
resource | The resource path for the target of the operation.
method | The method that triggered the operation.
kms_key_name | The KMS key name with which the content of the Operation is encrypted. The expected format is
kms_key_version_name | The KMS key version name with which content of the Operation is encrypted. The expected format is
progress_percent | The percent progress of the Operation. Values can range from 0-100. If the value is 100, then the operation is finished.
Union field request. The request that spawned the Operation. request can be only one of the following:
batch_recognize_request | The BatchRecognizeRequest that spawned the Operation.
create_recognizer_request | The CreateRecognizerRequest that spawned the Operation.
update_recognizer_request | The UpdateRecognizerRequest that spawned the Operation.
delete_recognizer_request | The DeleteRecognizerRequest that spawned the Operation.
undelete_recognizer_request | The UndeleteRecognizerRequest that spawned the Operation.
create_custom_class_request | The CreateCustomClassRequest that spawned the Operation.
update_custom_class_request | The UpdateCustomClassRequest that spawned the Operation.
delete_custom_class_request | The DeleteCustomClassRequest that spawned the Operation.
undelete_custom_class_request | The UndeleteCustomClassRequest that spawned the Operation.
create_phrase_set_request | The CreatePhraseSetRequest that spawned the Operation.
update_phrase_set_request | The UpdatePhraseSetRequest that spawned the Operation.
delete_phrase_set_request | The DeletePhraseSetRequest that spawned the Operation.
undelete_phrase_set_request | The UndeletePhraseSetRequest that spawned the Operation.
update_config_request | The UpdateConfigRequest that spawned the Operation.
Union field metadata. Specific metadata per RPC. metadata can be only one of the following:
batch_recognize_metadata | Metadata specific to the BatchRecognize method.
OutputFormatConfig
Configuration for the format of the results stored to output.
Field | Description
---|---
native | Configuration for the native output format. If this field is set or if no other output format field is set, then transcripts will be written to the sink in the native format.
vtt | Configuration for the VTT output format. If this field is set, then transcripts will be written to the sink in the VTT format.
srt | Configuration for the SRT output format. If this field is set, then transcripts will be written to the sink in the SRT format.
PhraseSet
PhraseSet for biasing in speech recognition. A PhraseSet is used to provide "hints" to the speech recognizer to favor specific words and phrases in the results.
Field | Description
---|---
name | Output only. Identifier. The resource name of the PhraseSet. Format:
uid | Output only. System-assigned unique identifier for the PhraseSet.
phrases[] | A list of words and phrases.
boost | Hint Boost. Positive value will increase the probability that a specific phrase will be recognized over other similar sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Valid
display_name | User-settable, human-readable name for the PhraseSet. Must be 63 characters or less.
state | Output only. The PhraseSet lifecycle state.
create_time | Output only. Creation time.
update_time | Output only. The most recent time this resource was modified.
delete_time | Output only. The time at which this resource was requested for deletion.
expire_time | Output only. The time at which this resource will be purged.
annotations | Allows users to store small amounts of arbitrary data. Both the key and the value must be 63 characters or less each. At most 100 annotations.
etag | Output only. This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding.
reconciling | Output only. Whether or not this PhraseSet is in the process of being updated.
kms_key_name | Output only. The KMS key name with which the PhraseSet is encrypted. The expected format is
kms_key_version_name | Output only. The KMS key version name with which the PhraseSet is encrypted. The expected format is
Phrase
A Phrase contains words and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer.
List items can also include CustomClass references containing groups of words that represent common concepts that occur in natural language.
Field | Description
---|---
value | The phrase itself.
boost | Hint Boost. Overrides the boost set at the phrase set level. Positive value will increase the probability that a specific phrase will be recognized over other similar sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Negative boost values would correspond to anti-biasing. Anti-biasing is not enabled, so negative boost values will return an error. Boost values must be between 0 and 20. Any values outside that range will return an error. We recommend using a binary search approach to finding the optimal value for your use case as well as adding phrases both with and without boost to your requests.
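The documented boost constraints (0-20 inclusive; negative anti-biasing values are rejected) can be validated before a request is sent. A minimal sketch; the helper name and example phrases are ours:

```python
def make_phrase(value, boost=None):
    """Build a PhraseSet.Phrase dict, rejecting boost values outside the
    documented 0-20 range (negative anti-biasing values return an error)."""
    if boost is not None and not 0 <= boost <= 20:
        raise ValueError("boost %r is outside the allowed 0-20 range" % boost)
    phrase = {"value": value}
    if boost is not None:
        phrase["boost"] = boost
    return phrase

# Per the recommendation above: include phrases both with and without boost.
phrases = [make_phrase("weather forecast", boost=10), make_phrase("humidity")]
```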
State
Set of states that define the lifecycle of a PhraseSet.
Enum | Description
---|---
STATE_UNSPECIFIED | Unspecified state. This is only used/useful for distinguishing unset values.
ACTIVE | The normal and active state.
DELETED | This PhraseSet has been deleted.
RecognitionConfig
Provides information to the Recognizer that specifies how to process the recognition request.
Field | Description
---|---
model | Optional. Which model to use for recognition requests. Select the model best suited to your domain to get best results. Guidance for choosing which model to use can be found in the Transcription Models Documentation, and the models supported in each region can be found in the Table Of Supported Models.
language_codes[] | Optional. The language of the supplied audio as a BCP-47 language tag. Language tags are normalized to BCP-47 before they are used, e.g., "en-us" becomes "en-US". Supported languages for each model are listed in the Table of Supported Models. If additional languages are provided, the recognition result will contain recognition in the most likely language detected. The recognition result will include the language tag of the language detected in the audio.
features | Speech recognition features to enable.
adaptation | Speech adaptation context that weights recognizer predictions for specific words and phrases.
transcript_normalization | Optional. Use transcription normalization to automatically replace parts of the transcript with phrases of your choosing. For StreamingRecognize, this normalization only applies to stable partial transcripts (stability > 0.8) and final transcripts.
translation_config | Optional. Configuration used to automatically run translation on the given audio to the desired language for supported models.
Union field decoding_config. Decoding parameters for audio being sent for recognition. decoding_config can be only one of the following:
auto_decoding_config | Automatically detect decoding parameters. Preferred for supported formats.
explicit_decoding_config | Explicitly specified decoding parameters. Required if using headerless PCM audio (linear16, mulaw, alaw).
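Because decoding_config is a union, a RecognitionConfig sets exactly one of the two variants. Illustrative dicts in JSON (camelCase) form, with placeholder model and language values:

```python
# Container formats (WAV, FLAC, MP3, Ogg/WebM Opus, ...): let the service detect.
auto_config = {
    "autoDecodingConfig": {},
    "languageCodes": ["en-US"],
    "model": "long",
}

# Headerless PCM audio must describe itself explicitly.
explicit_config = {
    "explicitDecodingConfig": {
        "encoding": "MULAW",
        "sampleRateHertz": 8000,
        "audioChannelCount": 1,
    },
    "languageCodes": ["en-US"],
    "model": "telephony",
}

for cfg in (auto_config, explicit_config):
    # Exactly one member of the decoding_config union may be set.
    assert ("autoDecodingConfig" in cfg) != ("explicitDecodingConfig" in cfg)
```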
RecognitionFeatures
Available recognition features.
Field | Description
---|---
profanity_filter | If set to
enable_word_time_offsets | If
enable_word_confidence | If
enable_automatic_punctuation | If
enable_spoken_punctuation | The spoken punctuation behavior for the call. If
enable_spoken_emojis | The spoken emoji behavior for the call. If
multi_channel_mode | Mode for recognizing multi-channel audio.
diarization_config | Configuration to enable speaker diarization and set additional parameters to make diarization better suited for your application. When this is enabled, we send all the words from the beginning of the audio for the top alternative in every consecutive STREAMING response. This is done in order to improve our speaker tags as our models learn to identify the speakers in the conversation over time. For non-streaming requests, the diarization results will be provided only in the top alternative of the FINAL SpeechRecognitionResult.
max_alternatives | Maximum number of recognition hypotheses to be returned. The server may return fewer than
MultiChannelMode
Options for how to recognize multi-channel audio.
Enum | Description
---|---
MULTI_CHANNEL_MODE_UNSPECIFIED | Default value for the multi-channel mode. If the audio contains multiple channels, only the first channel will be transcribed; other channels will be ignored.
SEPARATE_RECOGNITION_PER_CHANNEL | If selected, each channel in the provided audio is transcribed independently. This cannot be selected if the selected model is latest_short.
RecognitionOutputConfig
Configuration options for the output(s) of recognition.
Field | Description
---|---
output_format_config | Optional. Configuration for the format of the results stored to
Union field |
gcs_output_config | If this message is populated, recognition results are written to the provided Google Cloud Storage URI.
inline_response_config | If this message is populated, recognition results are provided in the
RecognitionResponseMetadata
Metadata about the recognition request and response.
Field | Description
---|---
request_id | Global request identifier auto-generated by the API.
total_billed_duration | When available, billed audio seconds for the corresponding request.
RecognizeRequest
Request message for the Recognize method. Either content or uri must be supplied. Supplying both or neither returns INVALID_ARGUMENT. See content limits.
Field | Description
---|---
recognizer | Required. The name of the Recognizer to use during recognition. The expected format is
config | Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the
config_mask | The list of fields in
Union field audio_source. The audio source, which is either inline content or a Google Cloud Storage URI. audio_source can be only one of the following:
content | The audio data bytes encoded as specified in
uri | URI that points to a file that contains audio data bytes as specified in
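A synchronous Recognize body with inline audio can be sketched as a plain dict. In REST JSON, bytes fields such as content are base64-encoded; the recognizer name is a placeholder, and exactly one of content or uri may be set:

```python
import base64

audio_bytes = b"\x00\x01\x02\x03"  # stand-in for real audio in the declared encoding

recognize_body = {
    "recognizer": "projects/my-project/locations/global/recognizers/my-recognizer",
    "config": {"autoDecodingConfig": {}, "languageCodes": ["en-US"], "model": "short"},
    # `content` carries base64-encoded bytes in REST JSON; use `uri` instead
    # for audio already in Cloud Storage -- never both.
    "content": base64.b64encode(audio_bytes).decode("ascii"),
}

# Supplying both `content` and `uri` (or neither) returns INVALID_ARGUMENT.
assert ("content" in recognize_body) != ("uri" in recognize_body)
```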
RecognizeResponse
Response message for the Recognize method.
Field | Description
---|---
results[] | Sequential list of transcription results corresponding to sequential portions of audio.
metadata | Metadata about the recognition.
Recognizer
A Recognizer message. Stores recognition configuration and metadata.
Fields | |
---|---|
name |
Output only. Identifier. The resource name of the Recognizer. Format: projects/{project}/locations/{location}/recognizers/{recognizer}. |
uid |
Output only. System-assigned unique identifier for the Recognizer. |
display_name |
User-settable, human-readable name for the Recognizer. Must be 63 characters or less. |
model |
Optional. This field is now deprecated; prefer the model field in the default_recognition_config. Which model to use for recognition requests. Select the model best suited to your domain for the best results. Guidance for choosing a model can be found in the Transcription Models documentation, and the models supported in each region are listed in the Table of Supported Models. |
language_codes[] |
Optional. This field is now deprecated; prefer the language_codes field in the default_recognition_config. The language of the supplied audio as a BCP-47 language tag. Supported languages for each model are listed in the Table of Supported Models. If additional languages are provided, the recognition result will contain recognition in the most likely language detected and will include the language tag of that language. When you create or update a Recognizer, these values are stored in normalized BCP-47 form. For example, "en-us" is stored as "en-US". |
default_recognition_config |
Default configuration to use for requests with this Recognizer. This can be overwritten by inline configuration in the config field of the request. |
annotations |
Allows users to store small amounts of arbitrary data. Both the key and the value must be 63 characters or less each. At most 100 annotations. |
state |
Output only. The Recognizer lifecycle state. |
create_time |
Output only. Creation time. |
update_time |
Output only. The most recent time this Recognizer was modified. |
delete_time |
Output only. The time at which this Recognizer was requested for deletion. |
expire_time |
Output only. The time at which this Recognizer will be purged. |
etag |
Output only. This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding. |
reconciling |
Output only. Whether or not this Recognizer is in the process of being updated. |
kms_key_name |
Output only. The KMS key name with which the Recognizer is encrypted. The expected format is projects/{project}/locations/{location}/keyRings/{key_ring}/cryptoKeys/{crypto_key}. |
kms_key_version_name |
Output only. The KMS key version name with which the Recognizer is encrypted. The expected format is projects/{project}/locations/{location}/keyRings/{key_ring}/cryptoKeys/{crypto_key}/cryptoKeyVersions/{crypto_key_version}. |
State
Set of states that define the lifecycle of a Recognizer.
Enums | |
---|---|
STATE_UNSPECIFIED |
The default value. This value is used if the state is omitted. |
ACTIVE |
The Recognizer is active and ready for use. |
DELETED |
This Recognizer has been deleted. |
SpeakerDiarizationConfig
Configuration to enable speaker diarization.
Fields | |
---|---|
min_speaker_count |
Required. Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. To fix the number of speakers detected in the audio, set min_speaker_count = max_speaker_count. |
max_speaker_count |
Required. Maximum number of speakers in the conversation. Valid values are: 1-6. Must be >= min_speaker_count. |
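The constraints above (a maximum between 1 and 6, max >= min, and equal values to fix the speaker count) can be expressed as a small validator. This is a sketch of the documented rules, not an API call; the function name is invented for illustration.

```python
def check_diarization(min_speaker_count: int, max_speaker_count: int) -> bool:
    """Validate a SpeakerDiarizationConfig-style pair of counts.

    Returns True when the speaker count is fixed (min == max), and False when
    a range is given and the system chooses the count automatically.
    """
    if not 1 <= max_speaker_count <= 6:
        raise ValueError("max_speaker_count must be in 1-6")
    if min_speaker_count < 1:
        raise ValueError("min_speaker_count must be >= 1")
    if max_speaker_count < min_speaker_count:
        raise ValueError("max_speaker_count must be >= min_speaker_count")
    return min_speaker_count == max_speaker_count
```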
SpeechAdaptation
Provides "hints" to the speech recognizer to favor specific words and phrases in the results. PhraseSets can be specified as an inline resource, or a reference to an existing PhraseSet resource.
Fields | |
---|---|
phrase_sets[] |
A list of inline or referenced PhraseSets. |
custom_classes[] |
A list of inline CustomClasses. Existing CustomClass resources can be referenced directly in a PhraseSet. |
AdaptationPhraseSet
A biasing PhraseSet, which can be either a string referencing the name of an existing PhraseSet resource, or an inline definition of a PhraseSet.
Fields | |
---|---|
Union field value . The PhraseSet to use; value can be only one of the following: |
phrase_set |
The name of an existing PhraseSet resource. The user must have read access to the resource and it must not be deleted. |
inline_phrase_set |
An inline defined PhraseSet. |
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
Fields | |
---|---|
transcript |
Transcript text representing the words that the user spoke. |
confidence |
The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where is_final=true. |
words[] |
A list of word-specific information for each recognized word. When the SpeakerDiarizationConfig is set, you will see all the words from the beginning of the audio. |
SpeechRecognitionResult
A speech recognition result corresponding to a portion of the audio.
Fields | |
---|---|
alternatives[] |
May contain one or more recognition hypotheses. These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
channel_tag |
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
result_end_offset |
Time offset of the end of this result relative to the beginning of the audio. |
language_code |
Output only. The BCP-47 language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio. |
SrtOutputFileFormatConfig
This type has no fields.
Output configuration for a SubRip Text (SRT) formatted subtitle file.
StreamingRecognitionConfig
Provides configuration information for the StreamingRecognize request.
Fields | |
---|---|
config |
Required. Features and audio metadata to use for the Automatic Speech Recognition. This field in combination with the config_mask field can be used to override parts of the default_recognition_config of the Recognizer resource. |
config_mask |
The list of fields in config that override the values in the default_recognition_config of the recognizer during this recognition request. If no mask is provided, all given fields in config override the values in the recognizer. If a wildcard (*) is provided, config completely overrides and replaces the config in the recognizer for this request. |
streaming_features |
Speech recognition features to enable specific to streaming audio recognition requests. |
StreamingRecognitionFeatures
Available recognition features specific to streaming recognition requests.
Fields | |
---|---|
enable_voice_activity_events |
If true, responses with voice activity speech events will be returned as they are detected. |
interim_results |
Whether or not to stream interim results to the client. If set to true, interim results will be streamed to the client. Otherwise, only the final response will be streamed back. |
voice_activity_timeout |
If set, the server will automatically close the stream after the specified duration has elapsed after the last VOICE_ACTIVITY speech event has been sent. The field enable_voice_activity_events must also be set to true. |
VoiceActivityTimeout
Events that a timeout can be set on for voice activity.
Fields | |
---|---|
speech_start_timeout |
Duration to timeout the stream if no speech begins. If this is set and no speech is detected in this duration at the start of the stream, the server will close the stream. |
speech_end_timeout |
Duration to timeout the stream after speech ends. If this is set and no speech is detected in this duration after speech was detected, the server will close the stream. |
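The two timeouts can be reasoned about with a small simulation over voice-activity events. This is a sketch of the semantics described above, not server code; events are (offset_seconds, kind) pairs with kind "BEGIN" or "END".

```python
def stream_close_offset(events, speech_start_timeout, speech_end_timeout):
    """Return the audio offset (seconds) at which the server would close the
    stream, or None if neither timeout fires.

    - No speech ever begins: close at speech_start_timeout.
    - Speech ended and did not resume: close speech_end_timeout after the
      last END event.
    - Speech is ongoing: no timeout fires.
    """
    begins = [t for t, kind in events if kind == "BEGIN"]
    if not begins:
        return speech_start_timeout
    ends = [t for t, kind in events if kind == "END"]
    if ends and ends[-1] >= begins[-1]:
        return ends[-1] + speech_end_timeout
    return None
```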
StreamingRecognitionResult
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
Fields | |
---|---|
alternatives[] |
May contain one or more recognition hypotheses. These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer. |
is_final |
If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular result, and the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio. |
stability |
An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set. |
result_end_offset |
Time offset of the end of this result relative to the beginning of the audio. |
channel_tag |
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from 1 to N. |
language_code |
Output only. The BCP-47 language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio. |
StreamingRecognizeRequest
Request message for the StreamingRecognize method. Multiple StreamingRecognizeRequest messages are sent in one call.
If the Recognizer referenced by recognizer contains a fully specified request configuration, then the stream may contain only messages with only audio set.
Otherwise, the first message must contain a recognizer and a streaming_config message that together fully specify the request configuration, and must not contain audio. All subsequent messages must have only audio set.
Fields | |
---|---|
recognizer |
Required. The name of the Recognizer to use during recognition. The expected format is |
Union field streaming_request . streaming_request can be only one of the following: |
streaming_config |
StreamingRecognitionConfig to be used in this recognition attempt. If provided, it will override the default RecognitionConfig stored in the Recognizer. |
audio |
Inline audio bytes to be recognized. Maximum size for this field is 15 KB per request. |
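The ordering rule above (configuration first, audio-only afterwards) can be sketched as a request generator. Plain dicts stand in for the generated StreamingRecognizeRequest messages, and the size check reflects the 15 KB limit from the audio field description.

```python
def streaming_requests(recognizer, streaming_config, audio_chunks):
    """Yield request-shaped dicts: the first message carries the recognizer
    and streaming_config, every subsequent one carries only audio."""
    yield {"recognizer": recognizer, "streaming_config": streaming_config}
    for chunk in audio_chunks:
        if len(chunk) > 15 * 1024:  # per the `audio` field's documented limit
            raise ValueError("audio chunk exceeds 15 KB")
        yield {"audio": chunk}

# Two 1 KB chunks produce one config message followed by two audio messages.
reqs = list(streaming_requests(
    "projects/my-project/locations/global/recognizers/my-rec",
    {"config": {"language_codes": ["en-US"]}},
    [b"\x00" * 1024, b"\x01" * 1024],
))
```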
StreamingRecognizeResponse
StreamingRecognizeResponse is the only message returned to the client by StreamingRecognize. A series of zero or more StreamingRecognizeResponse messages are streamed back to the client. If there is no recognizable audio, then no messages are streamed back to the client.
Here are some examples of StreamingRecognizeResponses that might be returned while processing audio:
1. results { alternatives { transcript: "tube" } stability: 0.01 }
2. results { alternatives { transcript: "to be a" } stability: 0.01 }
3. results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 }
4. results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true }
5. results { alternatives { transcript: " that's" } stability: 0.01 }
6. results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 }
7. results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true }
Notes:
Only two of the above responses (#4 and #7) contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question".
The others contain interim results. #3 and #6 contain two interim results: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high-stability results.
The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary.
In each response, only one of these fields will be set: error, speech_event_type, or one or more (repeated) results.
Fields | |
---|---|
results[] |
This repeated list contains zero or more results that correspond to consecutive portions of the audio currently being processed. It contains zero or one is_final=true result (the newly settled portion), followed by zero or more is_final=false results (the interim results). |
speech_event_type |
Indicates the type of speech event. |
speech_event_offset |
Time offset between the beginning of the audio and event emission. |
metadata |
Metadata about the recognition. |
SpeechEventType
Indicates the type of speech event.
Enums | |
---|---|
SPEECH_EVENT_TYPE_UNSPECIFIED |
No speech event specified. |
END_OF_SINGLE_UTTERANCE |
This event indicates that the server has detected the end of the user's speech utterance and expects no additional speech. Therefore, the server will not process additional audio and will close the gRPC bidirectional stream. This event is only sent if there was a force cutoff due to silence being detected early. This event is only available through the latest_short model. |
SPEECH_ACTIVITY_BEGIN |
This event indicates that the server has detected the beginning of human voice activity in the stream. This event can be returned multiple times if speech starts and stops repeatedly throughout the stream. This event is only sent if voice_activity_events is set to true. |
SPEECH_ACTIVITY_END |
This event indicates that the server has detected the end of human voice activity in the stream. This event can be returned multiple times if speech starts and stops repeatedly throughout the stream. This event is only sent if voice_activity_events is set to true. |
TranscriptNormalization
Transcription normalization configuration. Use transcription normalization to automatically replace parts of the transcript with phrases of your choosing. For StreamingRecognize, this normalization only applies to stable partial transcripts (stability > 0.8) and final transcripts.
Fields | |
---|---|
entries[] |
A list of replacement entries. We will perform replacement with one entry at a time. For example, the second entry in ["cat" => "dog", "mountain cat" => "mountain dog"] will never be applied because we will always process the first entry before it. At most 100 entries. |
Entry
A single replacement configuration.
Fields | |
---|---|
search |
What to replace. Max length is 100 characters. |
replace |
What to replace with. Max length is 100 characters. |
case_sensitive |
Whether the search is case sensitive. |
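The one-entry-at-a-time semantics above can be sketched as follows. The dict shape is illustrative; the real Entry message has the search, replace, and case_sensitive fields described above.

```python
import re

def normalize_transcript(transcript, entries):
    """Apply replacement entries sequentially, in order. Because each entry
    is applied before the next, a later entry whose search text was already
    rewritten by an earlier one (e.g. "mountain cat" after "cat" => "dog")
    never matches."""
    for entry in entries:
        flags = 0 if entry.get("case_sensitive") else re.IGNORECASE
        transcript = re.sub(re.escape(entry["search"]),
                            lambda _m: entry["replace"],
                            transcript, flags=flags)
    return transcript
```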
TranslationConfig
Translation configuration. Use to translate the given audio into text for the desired language.
Fields | |
---|---|
target_language |
Required. The language code to translate to. |
UndeleteCustomClassRequest
Request message for the UndeleteCustomClass method.
Fields | |
---|---|
name |
Required. The name of the CustomClass to undelete. Format: projects/{project}/locations/{location}/customClasses/{custom_class}. |
validate_only |
If set, validate the request and preview the undeleted CustomClass, but do not actually undelete it. |
etag |
This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding. |
UndeletePhraseSetRequest
Request message for the UndeletePhraseSet method.
Fields | |
---|---|
name |
Required. The name of the PhraseSet to undelete. Format: projects/{project}/locations/{location}/phraseSets/{phrase_set}. |
validate_only |
If set, validate the request and preview the undeleted PhraseSet, but do not actually undelete it. |
etag |
This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding. |
UndeleteRecognizerRequest
Request message for the UndeleteRecognizer method.
Fields | |
---|---|
name |
Required. The name of the Recognizer to undelete. Format: projects/{project}/locations/{location}/recognizers/{recognizer}. |
validate_only |
If set, validate the request and preview the undeleted Recognizer, but do not actually undelete it. |
etag |
This checksum is computed by the server based on the value of other fields. This may be sent on update, undelete, and delete requests to ensure the client has an up-to-date value before proceeding. |
UpdateConfigRequest
Request message for the UpdateConfig method.
Fields | |
---|---|
config |
Required. The config to update. The config's name field is used to identify the config to be updated. The expected format is projects/{project}/locations/{location}/config. |
update_mask |
The list of fields to be updated. |
UpdateCustomClassRequest
Request message for the UpdateCustomClass method.
Fields | |
---|---|
custom_class |
Required. The CustomClass to update. The CustomClass's name field is used to identify the CustomClass to update. Format: projects/{project}/locations/{location}/customClasses/{custom_class}. |
update_mask |
The list of fields to be updated. If empty, all fields are considered for update. |
validate_only |
If set, validate the request and preview the updated CustomClass, but do not actually update it. |
UpdatePhraseSetRequest
Request message for the UpdatePhraseSet method.
Fields | |
---|---|
phrase_set |
Required. The PhraseSet to update. The PhraseSet's name field is used to identify the PhraseSet to update. Format: projects/{project}/locations/{location}/phraseSets/{phrase_set}. |
update_mask |
The list of fields to update. If empty, all non-default valued fields are considered for update. Use * to update the entire PhraseSet resource. |
validate_only |
If set, validate the request and preview the updated PhraseSet, but do not actually update it. |
UpdateRecognizerRequest
Request message for the UpdateRecognizer method.
Fields | |
---|---|
recognizer |
Required. The Recognizer to update. The Recognizer's name field is used to identify the Recognizer to update. Format: projects/{project}/locations/{location}/recognizers/{recognizer}. |
update_mask |
The list of fields to update. If empty, all non-default valued fields are considered for update. Use * to update the entire Recognizer resource. |
validate_only |
If set, validate the request and preview the updated Recognizer, but do not actually update it. |
VttOutputFileFormatConfig
This type has no fields.
Output configuration for a WebVTT formatted subtitle file.
WordInfo
Word-specific information for recognized words.
Fields | |
---|---|
start_offset |
Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enable_word_time_offsets is true. |
end_offset |
Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enable_word_time_offsets is true. |
word |
The word corresponding to this set of information. |
confidence |
The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where is_final=true. |
speaker_label |
A distinct label is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. |
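speaker_label, combined with word order, is enough to assemble a per-speaker transcript. The sketch below operates on WordInfo-shaped dicts; the helper name and dict keys mirror the fields above but are otherwise illustrative.

```python
def diarized_transcript(words):
    """Group consecutive words that share a speaker_label into
    "label: text" lines, preserving word order."""
    lines, current_label, current_words = [], None, []
    for info in words:
        label = info.get("speaker_label")
        if current_words and label != current_label:
            lines.append(f"{current_label}: {' '.join(current_words)}")
            current_words = []
        current_label = label
        current_words.append(info["word"])
    if current_words:
        lines.append(f"{current_label}: {' '.join(current_words)}")
    return lines
```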