Equivalent of `google/protobuf/duration.proto`; it is copied here to
keep the proto self-contained.
Field | Type | Label | Description |
seconds | int64 | optional | Signed seconds of the span of time. Must be from -315,576,000,000 to +315,576,000,000 inclusive. Note: these bounds are computed from: 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 years |
nanos | int32 | optional | Signed fractions of a second at nanosecond resolution of the span of time. Durations less than one second are represented with a 0 `seconds` field and a positive or negative `nanos` field. For durations of one second or more, a non-zero value for the `nanos` field must be of the same sign as the `seconds` field. Must be from -999,999,999 to +999,999,999 inclusive. |
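The field table above corresponds to the well-known type; a sketch of the definition (field numbers match `google/protobuf/duration.proto`):

```proto
// Equivalent of google.protobuf.Duration, copied for self-containment.
message Duration {
  // Signed seconds of the span of time.
  // Must be from -315,576,000,000 to +315,576,000,000 inclusive.
  optional int64 seconds = 1;

  // Signed fractions of a second at nanosecond resolution.
  // Must be from -999,999,999 to +999,999,999 inclusive.
  optional int32 nanos = 2;
}
```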
Field | Type | Label | Description |
recognition_event | SpeechEvent.RecognitionEvent | optional | Event sent on each speech recognition when the Speech pipeline has ASR capabilities. Contains transcribed text and metadata. |
vad_event | SpeechEvent.VADEvent | optional | Event sent on detection of voice activity, such as start of speech and end of speech. |
audio_event | SpeechEvent.AudioEvent | optional | Event with information about the audio stream. |
lang_id_event | SpeechEvent.LangIdEvent | optional | Event sent when the Speech pipeline has language identification capabilities. The event is sent for each audio frame. |
debug_event | SpeechEvent.DebugEvent | optional | Event sent with various debugging capabilities. |
hotword_event | SpeechEvent.HotwordEvent | optional | Event sent when Speech pipeline has hotwording and a hotword is detected. |
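The table above can be read as a message with one sub-event per capability; a sketch reconstructed from the field table (the field numbers and the `oneof` grouping are assumptions, not taken from the actual proto):

```proto
message SpeechEvent {
  oneof event {
    RecognitionEvent recognition_event = 1;  // ASR results
    VADEvent vad_event = 2;                  // voice activity
    AudioEvent audio_event = 3;              // audio stream info
    LangIdEvent lang_id_event = 4;           // language identification
    DebugEvent debug_event = 5;              // debugging info
    HotwordEvent hotword_event = 6;          // hotword detections
  }
}
```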
`AudioEvent` holds audio information.
Field | Type | Label | Description |
audio_event_type | SpeechEvent.AudioEvent.AudioEventType | optional | The type of audio event. For example, it can be used to indicate that a finite stream of audio has been processed through the `END_OF_AUDIO` event. |
`DebugEvent` holds debugging information.
`HotwordEvent` contains information from the hotwording engine.
Field | Type | Label | Description |
detection | SpeechEvent.HotwordEvent.Detection | optional | Contains information about the detected hotword. |
When the hotwording engine detects a hotword it will emit a detection
event.
Field | Type | Label | Description |
hotword | string | optional | Hotword that was triggered. This can be used to differentiate between the trigger source when there are multiple hotwords supported. |
`LangIdEvent` holds results from the execution of `LangId` graph.
If a LangId graph is executed `libgspeech` periodically emits
`LangIdEvent`(s). A `LangIdEvent` provides the likelihood that
some chunk of audio is spoken in a certain language, expressed through
the confidence and the ordering of the languages.
One use case aggregates all of the top language predictions. For a
single `LangIdEvent`, the top language prediction is the first entry in
the `results` list. The most prevalent top language prediction, weighted
by its confidence, can be used to predict the language spoken over some
interval.
Field | Type | Label | Description |
results | SpeechEvent.LangIdEvent.Result | repeated | Results are ordered from the highest likelihood to the lowest. Currently, only the highest likelihood contains a confidence value. |
Field | Type | Label | Description |
language | string | optional | Language recognized by the LangId model. |
confidence | float | optional | Confidence that the recognized language is correct. |
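The aggregation described above can be sketched as follows (plain dicts stand in for the `LangIdEvent` messages; the names are illustrative, not the generated API):

```python
from collections import defaultdict

def predict_language(lang_id_events):
    """Pick the most prevalent top-language prediction over an interval,
    weighting each event's top result by its confidence."""
    weights = defaultdict(float)
    for event in lang_id_events:
        results = event["results"]
        if not results:
            continue
        # Results are ordered from highest to lowest likelihood, and only
        # the highest-likelihood result carries a confidence value.
        top = results[0]
        weights[top["language"]] += top.get("confidence", 0.0)
    return max(weights, key=weights.get) if weights else None

events = [
    {"results": [{"language": "en-US", "confidence": 0.9}]},
    {"results": [{"language": "de-DE", "confidence": 0.4}]},
    {"results": [{"language": "en-US", "confidence": 0.7}]},
]
print(predict_language(events))  # en-US
```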
Field | Type | Label | Description |
results | SpeechEvent.RecognitionEvent.Result | repeated | Recognition results from the Recognition engine. The results are ordered by likelihood where the 0'th result has the highest likelihood. |
is_final | bool | optional | Indicates that this result is finalized. A full transcript is the concatenation of all finalized results. If this field is false, modified versions of the result can be emitted in the future. |
is_final_reason | SpeechEvent.RecognitionEvent.IsFinalReason | optional | Indicates why a result was finalized. |
Field | Type | Label | Description |
transcript | string | optional | Transcript result from the recognition engine. |
confidence | float | optional | Confidence the recognizer has in the transcript. Currently, this is only provided for the transcript with the highest likelihood. |
words | SpeechEvent.RecognitionEvent.Result.Word | repeated | Words detected in this result. |
Field | Type | Label | Description |
word | string | optional | The textual representation of the word. |
start_time | Duration | optional | Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. |
end_time | Duration | optional | Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. |
confidence | float | optional | Confidence the recognizer has in the word being correct. |
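As noted for `is_final`, a full transcript is the concatenation of all finalized results. A minimal sketch, using plain dicts in place of the event messages (joining with a space is an assumption):

```python
def full_transcript(recognition_events):
    """Concatenate the best transcript of every finalized result."""
    parts = []
    for event in recognition_events:
        if not event.get("is_final"):
            # Non-final results may be replaced by later events; skip them.
            continue
        results = event.get("results", [])
        if results:
            # Results are ordered by likelihood; index 0 is the best.
            parts.append(results[0]["transcript"])
    return " ".join(parts)

events = [
    {"is_final": True, "results": [{"transcript": "hello"}]},
    {"is_final": False, "results": [{"transcript": "wor"}]},
    {"is_final": True, "results": [{"transcript": "world"}]},
]
print(full_transcript(events))  # hello world
```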
`VADEvent` holds voice activity information.
Field | Type | Label | Description |
vad_type | SpeechEvent.VADEvent.VADType | optional | The type of Voice Activity Event. For example if it was `START_OF_SPEECH` or `END_OF_SPEECH`. |
Polling for Speech events yields this message, holding all available
Speech events.
Field | Type | Label | Description |
speech_events | SpeechEvent | repeated | All available Speech events. |
`SpeechInitConfig` configures the initialization of the Speech pipeline.
Initialization is a heavy operation and should only be done sparingly,
for example when changing language.
Field | Type | Label | Description |
asr_config | SpeechInitConfig.ASRConfig | optional | Configure the Speech pipeline with speech recognition capabilities. |
audio_config | SpeechInitConfig.AudioConfig | optional | Configure the audio input of the Speech pipeline. |
lang_id_config | SpeechInitConfig.LangIdConfig | optional | Configure the Speech pipeline with language identification capabilities. |
thread_config | SpeechThreadConfig | optional | Configure the threads used by Speech. The availability of the functionality will vary between platforms. |
`ASRConfig` configures the recognizer.
Field | Type | Label | Description |
language_pack | string | optional | `language_pack` is a path to a directory holding an inference graph and resources for ASR. |
formatting_config | SpeechInitConfig.ASRConfig.FormattingConfig | optional | Configure the formatting of ASR outputs. |
If `enable_text_formatting` is set to true and the ASR language_pack
contains necessary resources then punctuation and capitalization
will be applied to the recognition results.
Field | Type | Label | Description |
enable_text_formatting | bool | optional | If set to true, the original results from the recognizer will be dropped and replaced (not 1:1) with new results containing the formatted text. |
enable_spoken_punctuation | bool | optional | For example, if 'full stop' is spoken it will be transcribed as the full stop symbol '.', and 'question mark' as the question mark symbol '?', when the model determines the context to be one where the user intends the symbolic representation. |
enable_spoken_emoji | bool | optional | For example, if 'smiley face' is spoken it will be transcribed as the smiley face symbol ':)'. |
mask_offensive_words | bool | optional | If `mask_offensive_words` is set to true and the ASR language_pack contains the necessary resources, then certain offensive words will be masked. Masking is enabled by default if the field is unset. |
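Putting the ASR options together, an init config might look like the following text-format sketch (the path is a placeholder; whether a language pack ships the formatting resources varies):

```textproto
asr_config {
  language_pack: "/data/speech/en-us"  # placeholder path
  formatting_config {
    enable_text_formatting: true
    enable_spoken_punctuation: true
    mask_offensive_words: true
  }
}
```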
`AudioConfig` contains information about what audio will be pushed into
Speech.
Field | Type | Label | Description |
sample_rate_hz | int32 | optional | `sample_rate_hz` is the sampling frequency of the audio in hertz. |
channel_count | int32 | optional | `channel_count` is the number of channels in the audio. |
speaker_channel_count | int32 | optional | `speaker_channel_count` is the number of channels of loud speaker audio fed into libgspeech through loopback. |
sample_type | SpeechInitConfig.AudioConfig.SampleType | optional | The sample type the Speech pipeline is to expect. Prefer `INT16`. |
gain_config | SpeechInitConfig.AudioConfig.GainConfig | optional | Configure fixed audio gain at the start of Speech pipeline. |
Field | Type | Label | Description |
enable | bool | optional | Enable gain configuration. |
fixed_gain_multiplier | float | optional | Apply a fixed gain on ASR input. |
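The effect of `fixed_gain_multiplier` on int16 samples can be illustrated with a small sketch (clamping to the int16 range is an assumption; the actual pipeline may behave differently):

```python
INT16_MIN, INT16_MAX = -32768, 32767

def apply_fixed_gain(samples, gain):
    """Scale int16 samples by a fixed gain, clamping to the int16 range."""
    return [
        max(INT16_MIN, min(INT16_MAX, int(s * gain)))
        for s in samples
    ]

# 30000 * 2.0 exceeds the int16 range and is clamped to 32767.
print(apply_fixed_gain([1000, -2000, 30000], 2.0))  # [2000, -4000, 32767]
```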
`LangIdConfig` configures the LangId pipeline. The LangId pipeline can
be used to determine the language spoken by a user.
Field | Type | Label | Description |
language_pack | string | optional | `language_pack` is a path to a directory holding an inference graph and resources for LangId processing. |
`SpeechStartConfig` configures the start of the Speech pipeline.
Field | Type | Label | Description |
speech_adaptation | SpeechStartConfig.SpeechAdaptation | optional | If the Speech pipeline has speech recognition capabilities, this configuration can be used to adapt those capabilities for certain contexts. |
Speech adaptation configuration to improve the accuracy of speech
recognition.
Field | Type | Label | Description |
custom_classes | SpeechStartConfig.SpeechAdaptation.CustomClass | repeated | A collection of custom classes. Refer to the defined class in phrase hints by its unique `custom_class_id`. |
phrase_sets | SpeechStartConfig.SpeechAdaptation.PhraseSet | repeated | A collection of phrase sets, provides "hints" to the speech recognizer to favor specific words and phrases in the results. Any phrase set can use any custom class. |
A set of words or phrases that represents a common concept likely to
appear in your audio, for example a list of passenger ship names.
CustomClass items can be substituted into placeholders that you set in
PhraseSet phrases.
Field | Type | Label | Description |
custom_class_id | string | optional | A unique id to be referenced in phrases. Must match regex "[A-Za-z0-9_]+" (quotes are not part of the regex). |
items | string | repeated | A collection of class items. |
Phrase sets containing words and phrase "hints" so that the speech
recognition is more likely to recognize them. This can be used to improve
the accuracy for specific words and phrases, for example, if specific
commands are typically spoken by the user. This can also be used to add
additional words to the vocabulary of the recognizer.
List items can also include pre-built or custom classes containing groups
of words that represent common concepts that occur in natural language.
For example, rather than providing a phrase hint for every month of the
year (e.g. "i was born in january", "i was born in february", ...), using
the pre-built `$MONTH` class improves the likelihood of correctly
transcribing audio that includes months (e.g. "i was born in $MONTH"). To
refer to pre-built classes, use the class' symbol prepended with `$`,
e.g. `$MONTH`. To refer to custom classes that were defined inline in the
request, set the class's `custom_class_id` to a unique string, then use
the class' id wrapped in `${...}`, e.g. "${my-months}".
Field | Type | Label | Description |
phrases | string | repeated | A list of words and phrases. |
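Tying the pieces together, a `SpeechStartConfig` using the "${my-months}" example from above might look like this text-format sketch (the nesting follows the field tables; the items are illustrative):

```textproto
speech_adaptation {
  custom_classes {
    custom_class_id: "my-months"
    items: "january"
    items: "february"
    items: "march"
  }
  phrase_sets {
    phrases: "i was born in ${my-months}"
  }
}
```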
Field | Type | Label | Description |
affinity | SpeechThreadConfig.Affinity | optional | Configure the CPU affinity of the primary worker threads. |
Field | Type | Label | Description |
enable | bool | optional | If set to true, libgspeech will attempt to apply the CPU affinity. Configures the CPU affinity of the primary worker threads. For example, `affinity { enable: true core: 1 core: 2 }` will attempt to set the CPU affinity of the worker threads to cores 1 and 2. This is an experimental feature. On Linux variants it relies on the `sched_setaffinity` syscall and on child threads inheriting the affinity from their parent. Note: this will also set the affinity of the thread that is used to configure libgspeech. |
core | int32 | repeated | The cores to schedule the workers on. Has to be non-empty. |
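In text format, the full `SpeechThreadConfig` nesting inside the init config would look roughly like this (a sketch following the tables above):

```textproto
thread_config {
  affinity {
    enable: true
    core: 1
    core: 2
  }
}
```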
Name | Number | Description |
UNKNOWN | 0 | |
END_OF_AUDIO | 2 | `END_OF_AUDIO` indicates that Speech is done processing the audio. |
Give an indication as to why a result was finalized.
Name | Number | Description |
UNSET | 0 | |
END_OF_SPEECH | 1 | `END_OF_SPEECH` implies that `is_final` was emitted because the end of human speech was detected. This can occur at an utterance boundary or when a speaker stops speaking. |
END_OF_AUDIO | 2 | `END_OF_AUDIO` implies that the client marked the audio stream as finished, which triggered a final response. |
Name | Number | Description |
UNKNOWN | 0 | |
START_OF_SPEECH | 1 | `START_OF_SPEECH` indicates that the classifier detected start of human speech. |
END_OF_SPEECH | 2 | `END_OF_SPEECH` indicates that the classifier detected end of human speech. |
PAUSE_OF_SPEECH | 3 | `PAUSE_OF_SPEECH` indicates that the classifier detected a speech pause. |
`SampleType` has information about the sample width and type. Prefer
INT16.
Name | Number | Description |
UNKNOWN | 0 | |
INT16 | 1 | Int16 samples will be fed to the Speech pipeline. |
INT32 | 2 | Int32 samples will be fed to the Speech pipeline. |
FLOAT16 | 3 | Float16 samples will be fed to the Speech pipeline. |
FLOAT32 | 4 | Float32 samples will be fed to the Speech pipeline. |
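Since `INT16` is the preferred sample type, float audio in [-1.0, 1.0] must be converted before being fed in. A minimal sketch (the scaling convention and clamping are assumptions):

```python
def float_to_int16(samples):
    """Convert float samples in [-1.0, 1.0] to int16, clamping outliers."""
    out = []
    for s in samples:
        v = int(round(s * 32767))
        out.append(max(-32768, min(32767, v)))
    return out

print(float_to_int16([0.0, 0.5, 1.0, -1.5]))  # [0, 16384, 32767, -32768]
```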
.proto Type | Notes | C++ Type | Java Type | Python Type |
double | | double | double | float |
float | | float | float | float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long |
uint32 | Uses variable-length encoding. | uint32 | int | int/long |
uint64 | Uses variable-length encoding. | uint64 | long | int/long |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long |
sfixed32 | Always four bytes. | int32 | int | int |
sfixed64 | Always eight bytes. | int64 | long | int/long |
bool | | bool | boolean | boolean |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str |
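The efficiency note for `sint32`/`sint64` comes from ZigZag encoding, which maps signed integers to unsigned ones so that values of small magnitude (of either sign) produce small varints. A sketch for the 32-bit case:

```python
def zigzag32(n):
    """ZigZag-encode a signed 32-bit int: 0,-1,1,-2,2,... -> 0,1,2,3,4,..."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def unzigzag32(z):
    """Invert zigzag32."""
    return (z >> 1) ^ -(z & 1)

# Small negative numbers map to small unsigned values, so they encode as
# short varints; plain int32 would spend 10 bytes on any negative value.
print([zigzag32(n) for n in (0, -1, 1, -2)])  # [0, 1, 2, 3]
for n in (0, -1, 1, -2, 2147483647, -2147483648):
    assert unzigzag32(zigzag32(n)) == n
```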