Equivalent of `google/protobuf/duration.proto`; it is copied here to
keep the proto self-contained.
Field | Type | Label | Description |
seconds | int64 | optional | Signed seconds of the span of time. Must be from -315,576,000,000 to +315,576,000,000 inclusive. Note: these bounds are computed from: 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 years |
nanos | int32 | optional | Signed fractions of a second at nanosecond resolution of the span of time. Durations less than one second are represented with a 0 `seconds` field and a positive or negative `nanos` field. For durations of one second or more, a non-zero value for the `nanos` field must be of the same sign as the `seconds` field. Must be from -999,999,999 to +999,999,999 inclusive. |
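The field table above corresponds to the well-known type; a sketch of the definition (field numbers match `google/protobuf/duration.proto`):

```proto
// Equivalent of google.protobuf.Duration, copied for self-containment.
message Duration {
  // Signed seconds of the span of time.
  // Must be from -315,576,000,000 to +315,576,000,000 inclusive.
  optional int64 seconds = 1;

  // Signed fractions of a second at nanosecond resolution.
  // Must be from -999,999,999 to +999,999,999 inclusive.
  optional int32 nanos = 2;
}
```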
Field | Type | Label | Description |
recognition_event | SpeechEvent.RecognitionEvent | optional | Event sent on each speech recognition when the Speech pipeline has ASR capabilities. Contains transcribed text and metadata. |
vad_event | SpeechEvent.VADEvent | optional | Event sent on detection of voice activity, such as start of speech and end of speech. |
audio_event | SpeechEvent.AudioEvent | optional | Event with information about the audio stream. |
lang_id_event | SpeechEvent.LangIdEvent | optional | Event sent when the Speech pipeline has language identification capabilities. The event is sent for each audio frame. |
debug_event | SpeechEvent.DebugEvent | optional | Event sent with various debugging capabilities. |
hotword_event | SpeechEvent.HotwordEvent | optional | Event sent when Speech pipeline has hotwording and a hotword is detected. |
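The table above can be read as a message with one sub-event per capability; a sketch reconstructed from the field table (the field numbers and the `oneof` grouping are assumptions, not taken from the actual proto):

```proto
message SpeechEvent {
  oneof event {
    RecognitionEvent recognition_event = 1;  // ASR results
    VADEvent vad_event = 2;                  // voice activity
    AudioEvent audio_event = 3;              // audio stream info
    LangIdEvent lang_id_event = 4;           // language identification
    DebugEvent debug_event = 5;              // debugging info
    HotwordEvent hotword_event = 6;          // hotword detections
  }
}
```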
`AudioEvent` holds audio information.
Field | Type | Label | Description |
audio_event_type | SpeechEvent.AudioEvent.AudioEventType | optional | The type of audio event. For example, it can be used to indicate that a finite stream of audio has been processed through the `END_OF_AUDIO` event. |
`DebugEvent` holds debugging information.
`HotwordEvent` contains information from the hotwording engine.
Field | Type | Label | Description |
detection | SpeechEvent.HotwordEvent.Detection | optional | Contains information about the detected hotword. |
When the hotwording engine detects a hotword it will emit a detection
event.
Field | Type | Label | Description |
hotword | string | optional | Hotword that was triggered. This can be used to differentiate between the trigger source when there are multiple hotwords supported. |
`LangIdEvent` holds results from the execution of `LangId` graph.
If a LangId graph is executed `libgspeech` periodically emits
`LangIdEvent`(s). A `LangIdEvent` provides the likelihood that
some chunk of audio is spoken in a certain language, expressed through
the confidence and the ordering of the languages.
One use case aggregates all of the top language predictions. For a
single `LangIdEvent`, the top language prediction is the first entry in
the `results` list. The most prevalent top language prediction, weighted
by its confidence, can be used to predict the language spoken over some
interval.
Field | Type | Label | Description |
results | SpeechEvent.LangIdEvent.Result | repeated | Results are ordered from the highest likelihood to the lowest. Currently, only the highest likelihood contains a confidence value. |
Field | Type | Label | Description |
language | string | optional | Language recognized by the LangId model. |
confidence | float | optional | Confidence that the recognized language is correct. |
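The aggregation described above can be sketched as follows (plain dicts stand in for the `LangIdEvent` messages; the names are illustrative, not the generated API):

```python
from collections import defaultdict

def predict_language(lang_id_events):
    """Pick the most prevalent top-language prediction over an interval,
    weighting each event's top result by its confidence."""
    weights = defaultdict(float)
    for event in lang_id_events:
        results = event["results"]
        if not results:
            continue
        # Results are ordered from highest to lowest likelihood, and only
        # the highest-likelihood result carries a confidence value.
        top = results[0]
        weights[top["language"]] += top.get("confidence", 0.0)
    return max(weights, key=weights.get) if weights else None

events = [
    {"results": [{"language": "en-US", "confidence": 0.9}]},
    {"results": [{"language": "de-DE", "confidence": 0.4}]},
    {"results": [{"language": "en-US", "confidence": 0.7}]},
]
print(predict_language(events))  # en-US
```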
Field | Type | Label | Description |
results | SpeechEvent.RecognitionEvent.Result | repeated | Recognition results from the Recognition engine. The results are ordered by likelihood where the 0'th result has the highest likelihood. |
is_final | bool | optional | Indicates that this result is finalized. A full transcript is the concatenation of all finalized results. If this field is false, modified versions of the result can be emitted in the future. |
is_final_reason | SpeechEvent.RecognitionEvent.IsFinalReason | optional | Indicates why a result was finalized. |
Field | Type | Label | Description |
transcript | string | optional | Transcript result from the recognition engine. |
confidence | float | optional | Confidence the recognizer has in the transcript. Currently, this is only provided for the transcript with the highest likelihood. |
words | SpeechEvent.RecognitionEvent.Result.Word | repeated | Words detected in this result. |
Field | Type | Label | Description |
word | string | optional | The textual representation of the word. |
start_time | Duration | optional | Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. |
end_time | Duration | optional | Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. |
confidence | float | optional | Confidence the recognizer has in the word being correct. |
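As noted for `is_final`, a full transcript is the concatenation of all finalized results. A minimal sketch, using plain dicts in place of the event messages (joining with a space is an assumption):

```python
def full_transcript(recognition_events):
    """Concatenate the best transcript of every finalized result."""
    parts = []
    for event in recognition_events:
        if not event.get("is_final"):
            # Non-final results may be replaced by later events; skip them.
            continue
        results = event.get("results", [])
        if results:
            # Results are ordered by likelihood; index 0 is the best.
            parts.append(results[0]["transcript"])
    return " ".join(parts)

events = [
    {"is_final": True, "results": [{"transcript": "hello"}]},
    {"is_final": False, "results": [{"transcript": "wor"}]},
    {"is_final": True, "results": [{"transcript": "world"}]},
]
print(full_transcript(events))  # hello world
```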
`VADEvent` holds voice activity information.
Field | Type | Label | Description |
vad_type | SpeechEvent.VADEvent.VADType | optional | The type of Voice Activity Event. For example if it was `START_OF_SPEECH` or `END_OF_SPEECH`. |
Polling for Speech events yields this message, holding all available
Speech events.
Field | Type | Label | Description |
speech_events | SpeechEvent | repeated | All available Speech events. |
`SpeechInitConfig` configures the initialization of the Speech pipeline.
Initialization is a heavy operation and should only be done sparingly,
for example when changing language.
Field | Type | Label | Description |
asr_config | SpeechInitConfig.ASRConfig | optional | Configure the Speech pipeline with speech recognition capabilities. |
audio_config | SpeechInitConfig.AudioConfig | optional | Configure the audio input of the Speech pipeline. |
lang_id_config | SpeechInitConfig.LangIdConfig | optional | Configure the Speech pipeline with language identification capabilities. |
thread_config | SpeechThreadConfig | optional | Configure the threads used by Speech. The availability of the functionality will vary between platforms. |
`ASRConfig` configures the recognizer.
Field | Type | Label | Description |
language_pack | string | optional | `language_pack` is a path to a directory holding an inference graph and resources for ASR. |
formatting_config | SpeechInitConfig.ASRConfig.FormattingConfig | optional | Configure the formatting of ASR outputs. |
If `enable_text_formatting` is set to true and the ASR language_pack
contains necessary resources then punctuation and capitalization
will be applied to the recognition results.
Field | Type | Label | Description |
enable_text_formatting | bool | optional | If set to true, the original results from the recognizer will be dropped and replaced (not 1:1) with new results containing the formatted text. |
enable_spoken_punctuation | bool | optional | For example, if 'full stop' is spoken it will be transcribed as the full stop symbol '.', and 'question mark' as the question mark symbol '?', when the model determines the context to be one where the user intends the symbolic representation. |
enable_spoken_emoji | bool | optional | For example, if 'smiley face' is spoken it will be transcribed as the smiley face symbol ':)'. |
mask_offensive_words | bool | optional | If `mask_offensive_words` is set to true and the ASR language_pack contains the necessary resources, then certain offensive words will be masked. Masking is enabled by default if the field is unset. |
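Putting the ASR options together, an init config might look like the following text-format sketch (the path is a placeholder; whether a language pack ships the formatting resources varies):

```textproto
asr_config {
  language_pack: "/data/speech/en-us"  # placeholder path
  formatting_config {
    enable_text_formatting: true
    enable_spoken_punctuation: true
    mask_offensive_words: true
  }
}
```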
`AudioConfig` contains information about what audio will be pushed into
Speech.
Field | Type | Label | Description |
sample_rate_hz | int32 | optional | `sample_rate_hz` is the sampling frequency of the audio in hertz. |
channel_count | int32 | optional | `channel_count` is the number of channels in the audio. |
speaker_channel_count | int32 | optional | `speaker_channel_count` is the number of channels of loud speaker audio fed into libgspeech through loopback. |
sample_type | SpeechInitConfig.AudioConfig.SampleType | optional | The sample type the Speech pipeline is to expect. Prefer `INT16`. |
gain_config | SpeechInitConfig.AudioConfig.GainConfig | optional | Configure fixed audio gain at the start of Speech pipeline. |
Field | Type | Label | Description |
enable | bool | optional | Enable gain configuration. |
fixed_gain_multiplier | float | optional | Apply a fixed gain on ASR input. |
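The effect of `fixed_gain_multiplier` on int16 samples can be illustrated with a small sketch (clamping to the int16 range is an assumption; the actual pipeline may behave differently):

```python
INT16_MIN, INT16_MAX = -32768, 32767

def apply_fixed_gain(samples, gain):
    """Scale int16 samples by a fixed gain, clamping to the int16 range."""
    return [
        max(INT16_MIN, min(INT16_MAX, int(s * gain)))
        for s in samples
    ]

# 30000 * 2.0 exceeds the int16 range and is clamped to 32767.
print(apply_fixed_gain([1000, -2000, 30000], 2.0))  # [2000, -4000, 32767]
```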
`LangIdConfig` configures the LangId pipeline. The LangId pipeline can
be used to determine the language spoken by a user.
Field | Type | Label | Description |
language_pack | string | optional | `language_pack` is a path to a directory holding an inference graph and resources for LangId processing. |
`SpeechStartConfig` configures the start of the Speech pipeline.
Field | Type | Label | Description |
speech_adaptation | SpeechStartConfig.SpeechAdaptation | optional | If the Speech pipeline has speech recognition capabilities, this configuration can be used to adapt those capabilities for certain contexts. |
Speech adaptation configuration to improve the accuracy of speech
recognition.
Field | Type | Label | Description |
custom_classes | SpeechStartConfig.SpeechAdaptation.CustomClass | repeated | A collection of custom classes. Refer to the defined class in phrase hints by its unique `custom_class_id`. |
phrase_sets | SpeechStartConfig.SpeechAdaptation.PhraseSet | repeated | A collection of phrase sets, provides "hints" to the speech recognizer to favor specific words and phrases in the results. Any phrase set can use any custom class. |
A set of words or phrases that represents a common concept likely to
appear in your audio, for example a list of passenger ship names.
CustomClass items can be substituted into placeholders that you set in
PhraseSet phrases.
Field | Type | Label | Description |
custom_class_id | string | optional | A unique id to be referenced in phrases. Must match regex "[A-Za-z0-9_]+" (quotes are not part of the regex). |
items | string | repeated | A collection of class items. |
Phrase sets containing words and phrase "hints" so that the speech
recognition is more likely to recognize them. This can be used to improve
the accuracy for specific words and phrases, for example, if specific
commands are typically spoken by the user. This can also be used to add
additional words to the vocabulary of the recognizer.
List items can also include pre-built or custom classes containing groups
of words that represent common concepts that occur in natural language.
For example, rather than providing a phrase hint for every month of the
year (e.g. "i was born in january", "i was born in february", ...), using
the pre-built `$MONTH` class improves the likelihood of correctly
transcribing audio that includes months (e.g. "i was born in $MONTH"). To
refer to pre-built classes, use the class' symbol prepended with `$`,
e.g. `$MONTH`. To refer to custom classes that were defined inline in the
request, set the class's `custom_class_id` to a unique string, then use
the class' id wrapped in `${...}`, e.g. "${my-months}".
Field | Type | Label | Description |
phrases | string | repeated | A list of words and phrases. |
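Tying the pieces together, a `SpeechStartConfig` using the "${my-months}" example from above might look like this text-format sketch (the nesting follows the field tables; the items are illustrative):

```textproto
speech_adaptation {
  custom_classes {
    custom_class_id: "my-months"
    items: "january"
    items: "february"
    items: "march"
  }
  phrase_sets {
    phrases: "i was born in ${my-months}"
  }
}
```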
Field | Type | Label | Description |
affinity | SpeechThreadConfig.Affinity | optional | Configure the CPU affinity of the primary worker threads. |
Field | Type | Label | Description |
enable | bool | optional | If set to true, libgspeech will attempt to apply the CPU affinity. Configures the CPU affinity of the primary worker threads. For example, `affinity { enable: true core: 1 core: 2 }` will attempt to set the CPU affinity of the worker threads to cores 1 and 2. This is an experimental feature. On Linux variants it relies on the `sched_setaffinity` syscall and on child threads inheriting the affinity from their parent. Note: this will also set the affinity of the thread that is used to configure libgspeech. |
core | int32 | repeated | The cores to schedule the workers on. Has to be non-empty. |
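In text format, the full `SpeechThreadConfig` nesting inside the init config would look roughly like this (a sketch following the tables above):

```textproto
thread_config {
  affinity {
    enable: true
    core: 1
    core: 2
  }
}
```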
Name | Number | Description |
UNKNOWN | 0 | |
END_OF_AUDIO | 2 | `END_OF_AUDIO` indicates that Speech is done processing the audio. |
Give an indication as to why a result was finalized.
Name | Number | Description |
UNSET | 0 | |
END_OF_SPEECH | 1 | `END_OF_SPEECH` implies that `is_final` was emitted because the end of human speech was detected. This can occur at an utterance boundary or when a speaker stops speaking. |
END_OF_AUDIO | 2 | `END_OF_AUDIO` implies that the client marked the audio stream as finished, which triggered a final response. |
Name | Number | Description |
UNKNOWN | 0 | |
START_OF_SPEECH | 1 | `START_OF_SPEECH` indicates that the classifier detected start of human speech. |
END_OF_SPEECH | 2 | `END_OF_SPEECH` indicates that the classifier detected end of human speech. |
PAUSE_OF_SPEECH | 3 | `PAUSE_OF_SPEECH` indicates that the classifier detected a speech pause. |
`SampleType` has information about the sample width and type. Prefer
INT16.
Name | Number | Description |
UNKNOWN | 0 | |
INT16 | 1 | Int16 samples will be fed to the Speech pipeline. |
INT32 | 2 | Int32 samples will be fed to the Speech pipeline. |
FLOAT16 | 3 | Float16 samples will be fed to the Speech pipeline. |
FLOAT32 | 4 | Float32 samples will be fed to the Speech pipeline. |
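Since `INT16` is the preferred sample type, float audio in [-1.0, 1.0] must be converted before being fed in. A minimal sketch (the scaling convention and clamping are assumptions):

```python
def float_to_int16(samples):
    """Convert float samples in [-1.0, 1.0] to int16, clamping outliers."""
    out = []
    for s in samples:
        v = int(round(s * 32767))
        out.append(max(-32768, min(32767, v)))
    return out

print(float_to_int16([0.0, 0.5, 1.0, -1.5]))  # [0, 16384, 32767, -32768]
```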
.proto Type | Notes | C++ Type | Java Type | Python Type |
double | | double | double | float |
float | | float | float | float |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long |
uint32 | Uses variable-length encoding. | uint32 | int | int/long |
uint64 | Uses variable-length encoding. | uint64 | long | int/long |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long |
sfixed32 | Always four bytes. | int32 | int | int |
sfixed64 | Always eight bytes. | int64 | long | int/long |
bool | | bool | boolean | boolean |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode |
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str |
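The efficiency note for `sint32`/`sint64` comes from ZigZag encoding, which maps signed integers to unsigned ones so that values of small magnitude (of either sign) produce small varints. A sketch for the 32-bit case:

```python
def zigzag32(n):
    """ZigZag-encode a signed 32-bit int: 0,-1,1,-2,2,... -> 0,1,2,3,4,..."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def unzigzag32(z):
    """Invert zigzag32."""
    return (z >> 1) ^ -(z & 1)

# Small negative numbers map to small unsigned values, so they encode as
# short varints; plain int32 would spend 10 bytes on any negative value.
print([zigzag32(n) for n in (0, -1, 1, -2)])  # [0, 1, 2, 3]
for n in (0, -1, 1, -2, 2147483647, -2147483648):
    assert unzigzag32(zigzag32(n)) == n
```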