Protocol Documentation

Table of Contents

libgspeech.proto


Duration

Equivalent of google.protobuf.duration.proto; it is copied here to have a self-contained proto.

Field | Type | Label | Description
seconds int64 optional

Signed seconds of the span of time. Must be from -315,576,000,000 to +315,576,000,000 inclusive. Note: these bounds are computed from: 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 years

nanos int32 optional

Signed fractions of a second at nanosecond resolution of the span of time. Durations less than one second are represented with a 0 `seconds` field and a positive or negative `nanos` field. For durations of one second or more, a non-zero value for the `nanos` field must be of the same sign as the `seconds` field. Must be from -999,999,999 to +999,999,999 inclusive.
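A hypothetical helper (not part of libgspeech) can illustrate the sign rule above: truncating toward zero keeps `seconds` and `nanos` in the same sign, and sub-second spans keep `seconds` at 0:

```python
def to_duration(total_seconds: float) -> tuple:
    """Split a floating-point span into (seconds, nanos) per the rules
    above: truncation toward zero keeps both fields the same sign, and
    nanos stays within [-999999999, +999999999]."""
    seconds = int(total_seconds)                      # truncates toward zero
    nanos = int(round((total_seconds - seconds) * 1e9))
    return seconds, nanos

# The documented bound follows from 10,000 average (365.25-day) years:
assert 60 * 60 * 24 * 365.25 * 10000 == 315_576_000_000
```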

SpeechEvent

Field | Type | Label | Description
recognition_event SpeechEvent.RecognitionEvent optional

Event sent when the Speech pipeline has ASR capabilities and produces a recognition. Holds the transcribed text and metadata.

vad_event SpeechEvent.VADEvent optional

Event sent on detection of voice activity, such as start of speech and end of speech.

audio_event SpeechEvent.AudioEvent optional

Event with information about the audio stream.

lang_id_event SpeechEvent.LangIdEvent optional

Event sent when the Speech pipeline has language identification capabilities. The event is sent for each audio frame.

debug_event SpeechEvent.DebugEvent optional

Event sent with various debugging capabilities.

hotword_event SpeechEvent.HotwordEvent optional

Event sent when Speech pipeline has hotwording and a hotword is detected.

SpeechEvent.AudioEvent

`AudioEvent` holds audio information.

Field | Type | Label | Description
audio_event_type SpeechEvent.AudioEvent.AudioEventType optional

A type of audio event. For example, it can be used to indicate that a finite stream of audio has been processed, through an `END_OF_AUDIO` event.

SpeechEvent.DebugEvent

`DebugEvent` holds debugging information.

SpeechEvent.HotwordEvent

`HotwordEvent` contains information from the hotwording engine.

Field | Type | Label | Description
detection SpeechEvent.HotwordEvent.Detection optional

Contains information about the detected hotword.

SpeechEvent.HotwordEvent.Detection

When the hotwording engine detects a hotword it will emit a detection event.

Field | Type | Label | Description
hotword string optional

Hotword that was triggered. This can be used to differentiate between the trigger source when there are multiple hotwords supported.

SpeechEvent.LangIdEvent

`LangIdEvent` holds results from the execution of a `LangId` graph. If a LangId graph is executed, `libgspeech` periodically emits `LangIdEvent`s. A `LangIdEvent` provides the likelihood that some chunk of audio is spoken in a certain language, expressed through the confidence and the ordering of the languages.

One use case aggregates all of the top language predictions. For a single `LangIdEvent`, the top language prediction is the `language` of the first entry in the `results` list. The most prevalent top language prediction, weighted by its confidence, can be used to predict the language spoken over some interval.

Field | Type | Label | Description
results SpeechEvent.LangIdEvent.Result repeated

Results are ordered from the highest likelihood to the lowest. Currently, only the highest likelihood contains a confidence value.
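The aggregation strategy described above (most prevalent top language over an interval, weighted by confidence) might be sketched as follows; the event shape is simplified to (language, confidence) pairs, one per `LangIdEvent`:

```python
from collections import defaultdict

def predict_language(top_predictions):
    """Pick the most prevalent top language over an interval, weighted
    by confidence. `top_predictions` holds one (language, confidence)
    pair per LangIdEvent, taken from the first entry of its `results`
    list (the highest-likelihood result)."""
    scores = defaultdict(float)
    for language, confidence in top_predictions:
        scores[language] += confidence
    return max(scores, key=scores.get) if scores else None
```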

SpeechEvent.LangIdEvent.Result

Field | Type | Label | Description
language string optional

Language recognized by the LangId model.

confidence float optional

Confidence that the recognized language is correct.

SpeechEvent.RecognitionEvent

Field | Type | Label | Description
results SpeechEvent.RecognitionEvent.Result repeated

Recognition results from the Recognition engine. The results are ordered by likelihood, where the 0th result has the highest likelihood.

is_final bool optional

Indicates that this result is finalized. A full transcript is the concatenation of all finalized results. If this field is false, modified versions of the result can be emitted in the future.

is_final_reason SpeechEvent.RecognitionEvent.IsFinalReason optional

Indicates why a result was finalized.
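Since a full transcript is the concatenation of all finalized results, a client-side consumer might look like the sketch below; events are modeled as plain dicts here, while the real API returns protobuf messages:

```python
def full_transcript(recognition_events):
    """Concatenate the top transcript of every finalized RecognitionEvent.

    Results where `is_final` is false are interim hypotheses that may be
    replaced later, so only finalized results contribute.
    """
    parts = []
    for event in recognition_events:
        if event.get("is_final") and event.get("results"):
            # results are ordered by likelihood; take the top one
            parts.append(event["results"][0]["transcript"])
    return "".join(parts)
```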

SpeechEvent.RecognitionEvent.Result

Field | Type | Label | Description
transcript string optional

Transcript result from the recognition engine.

confidence float optional

Confidence the recognizer has in the transcript. Currently, this is only provided for the transcript with the highest likelihood.

words SpeechEvent.RecognitionEvent.Result.Word repeated

Words detected in this result.

SpeechEvent.RecognitionEvent.Result.Word

Field | Type | Label | Description
word string optional

The textual representation of the word.

start_time Duration optional

Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word.

end_time Duration optional

Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word.

confidence float optional

Confidence the recognizer has in the word being correct.
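The length of a spoken word can be derived from the two Duration offsets above. A small illustrative helper, with Duration fields modeled as (seconds, nanos) tuples:

```python
def duration_to_seconds(seconds: int, nanos: int) -> float:
    """Collapse a Duration message's (seconds, nanos) pair into float seconds."""
    return seconds + nanos / 1e9

def word_length_seconds(word: dict) -> float:
    """Span of a spoken word, computed from its start_time/end_time
    offsets. `word` is a dict stand-in for the Word message, with
    start_time and end_time given as (seconds, nanos) tuples."""
    return (duration_to_seconds(*word["end_time"])
            - duration_to_seconds(*word["start_time"]))
```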

SpeechEvent.VADEvent

`VADEvent` holds voice activity information.

Field | Type | Label | Description
vad_type SpeechEvent.VADEvent.VADType optional

The type of Voice Activity Event, for example `START_OF_SPEECH` or `END_OF_SPEECH`.

SpeechEvents

Polling for Speech events yields this message, holding all available Speech events.

Field | Type | Label | Description
speech_events SpeechEvent repeated

SpeechInitConfig

`SpeechInitConfig` configures the initialization of the Speech pipeline. Initialization is a heavy operation and should only be done sparingly, for example when changing language.

Field | Type | Label | Description
asr_config SpeechInitConfig.ASRConfig optional

Configure the Speech pipeline with Speech recognition capabilities.

audio_config SpeechInitConfig.AudioConfig optional

Configure the audio input of the Speech pipeline.

lang_id_config SpeechInitConfig.LangIdConfig optional

Configure the Speech pipeline with Language Identification capabilities.

thread_config SpeechThreadConfig optional

Configure the threads used by Speech. The availability of the functionality will vary between platforms.

SpeechInitConfig.ASRConfig

`ASRConfig` configures the recognizer.

Field | Type | Label | Description
language_pack string optional

`language_pack` is a path to a directory holding an inference graph and resources for ASR.

formatting_config SpeechInitConfig.ASRConfig.FormattingConfig optional

Configure the formatting of ASR outputs.

SpeechInitConfig.ASRConfig.FormattingConfig

If `enable_text_formatting` is set to true and the ASR language_pack contains the necessary resources, then punctuation and capitalization will be applied to the recognition results.

Field | Type | Label | Description
enable_text_formatting bool optional

The original results from the recognizer will be dropped and replaced (not 1:1) with new results containing the formatted text.

enable_spoken_punctuation bool optional

For example, if 'full stop' is spoken and the model determines the user intends the symbolic representation, it is transcribed as the full stop symbol '.', and 'question mark' as the question mark symbol '?'.

enable_spoken_emoji bool optional

For example, if 'smiley face' is spoken it would be transcribed as the smiley face symbol ':)'.

mask_offensive_words bool optional

If `mask_offensive_words` is set to true and the ASR language_pack contains the necessary resources, then certain offensive words will be masked. Masking is enabled by default if the field is unset.

SpeechInitConfig.AudioConfig

`AudioConfig` contains information about what audio will be pushed into Speech.

Field | Type | Label | Description
sample_rate_hz int32 optional

`sample_rate_hz` is the sampling frequency of the audio in hertz.

channel_count int32 optional

`channel_count` is the number of channels in the audio.

speaker_channel_count int32 optional

`speaker_channel_count` is the number of channels of loudspeaker audio fed into libgspeech through loopback.

sample_type SpeechInitConfig.AudioConfig.SampleType optional

The sample type the Speech pipeline is to expect. Prefer INT16.

gain_config SpeechInitConfig.AudioConfig.GainConfig optional

Configure fixed audio gain at the start of Speech pipeline.

SpeechInitConfig.AudioConfig.GainConfig

Field | Type | Label | Description
enable bool optional

Enable gain configuration.

fixed_gain_multiplier float optional

Apply a fixed gain on ASR input.
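A fixed gain stage like the one this config describes can be sketched for INT16 input; the saturation behavior is an assumption for illustration, not a documented detail of libgspeech:

```python
INT16_MIN, INT16_MAX = -32768, 32767

def apply_fixed_gain(samples, gain_multiplier):
    """Scale INT16 samples by a fixed gain multiplier, saturating at the
    int16 range so amplified audio clips instead of wrapping around."""
    return [max(INT16_MIN, min(INT16_MAX, int(s * gain_multiplier)))
            for s in samples]
```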

SpeechInitConfig.LangIdConfig

`LangIdConfig` configures the LangId pipeline. The LangId pipeline can be used to determine the language spoken by a user.

Field | Type | Label | Description
language_pack string optional

`language_pack` is a path to a directory holding an inference graph and resources for langId processing.

SpeechStartConfig

`SpeechStartConfig` configures the start of the Speech pipeline.

Field | Type | Label | Description
speech_adaptation SpeechStartConfig.SpeechAdaptation optional

If the Speech pipeline has Speech recognition capabilities, this configuration can be used to adapt the capabilities for certain contexts.

SpeechStartConfig.SpeechAdaptation

Speech adaptation configuration to improve the accuracy of speech recognition.

Field | Type | Label | Description
custom_classes SpeechStartConfig.SpeechAdaptation.CustomClass repeated

A collection of custom classes. Refer to the defined class in phrase hints by its unique `custom_class_id`.

phrase_sets SpeechStartConfig.SpeechAdaptation.PhraseSet repeated

A collection of phrase sets that provide "hints" to the speech recognizer to favor specific words and phrases in the results. Any phrase set can use any custom class.

SpeechStartConfig.SpeechAdaptation.CustomClass

A set of words or phrases that represents a common concept likely to appear in your audio, for example a list of passenger ship names. CustomClass items can be substituted into placeholders that you set in PhraseSet phrases.

Field | Type | Label | Description
custom_class_id string optional

A unique id to be referenced in phrases. Must match regex "[A-Za-z0-9_]+" (quotes are not part of the regex).

items string repeated

A collection of class items.
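Client code might validate ids against the documented pattern before building the config; a minimal sketch (the helper name is hypothetical):

```python
import re

# The documented pattern for custom_class_id: one or more of A-Z, a-z, 0-9, _
_CUSTOM_CLASS_ID = re.compile(r"[A-Za-z0-9_]+")

def is_valid_custom_class_id(class_id: str) -> bool:
    """Check an id against the documented pattern [A-Za-z0-9_]+.
    fullmatch ensures the whole string matches, not just a prefix."""
    return _CUSTOM_CLASS_ID.fullmatch(class_id) is not None
```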

SpeechStartConfig.SpeechAdaptation.PhraseSet

Phrase sets contain word and phrase "hints" so that the speech recognizer is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example if specific commands are typically spoken by the user. It can also be used to add additional words to the vocabulary of the recognizer.

List items can also include pre-built or custom classes containing groups of words that represent common concepts that occur in natural language. For example, rather than providing a phrase hint for every month of the year (e.g. "i was born in january", "i was born in february", ...), using the pre-built `$MONTH` class improves the likelihood of correctly transcribing audio that includes months (e.g. "i was born in $MONTH"). To refer to pre-built classes, use the class's symbol prepended with `$`, e.g. `$MONTH`. To refer to custom classes that were defined inline in the request, set the class's `custom_class_id` to a unique string, then use the class's id wrapped in `${...}`, e.g. "${my-months}".

Field | Type | Label | Description
phrases string repeated

A list of words and phrases.
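To illustrate how a custom class plugs into phrase hints, the sketch below expands `${...}` placeholders against a class's items. The class id `my_months` and the expansion itself are illustrative: the real substitution happens inside the recognizer, not in client code.

```python
def expand_phrase(phrase, custom_classes):
    """Expand ${custom_class_id} placeholders in a phrase hint.

    `custom_classes` maps custom_class_id -> list of items. Returns one
    phrase per item, mimicking what the recognizer effectively matches.
    """
    for class_id, items in custom_classes.items():
        placeholder = "${" + class_id + "}"
        if placeholder in phrase:
            return [phrase.replace(placeholder, item) for item in items]
    return [phrase]
```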

SpeechThreadConfig

Field | Type | Label | Description
affinity SpeechThreadConfig.Affinity optional

SpeechThreadConfig.Affinity

Field | Type | Label | Description
enable bool optional

Configure the CPU affinity of the primary worker threads. If set to true, libgspeech will attempt to apply the CPU affinity. For example, affinity { enable: true core: 1 core: 2 } will attempt to set the CPU affinity of the worker threads to cores 1 and 2. This is an experimental feature. On Linux variants it relies on the `sched_setaffinity` syscall and on child threads inheriting the affinity from their parent. Note: this will also set the affinity of the thread that is used to configure libgspeech.

core int32 repeated

The cores to schedule the workers on. Has to be non-empty.
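The inheritance mechanism the Affinity description relies on can be sketched with Python's wrapper around the same Linux facility; this is an illustrative stand-in for what libgspeech does internally, not its actual implementation:

```python
import os

def set_worker_affinity(cores):
    """Pin the calling process to `cores` so that worker threads spawned
    afterwards inherit the affinity, mirroring the behavior described
    above. Linux-only: uses the sched_setaffinity facility."""
    if not cores:
        raise ValueError("core list has to be non-empty")
    os.sched_setaffinity(0, set(cores))  # pid 0 targets the calling process
```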

SpeechEvent.AudioEvent.AudioEventType

Name | Number | Description
UNKNOWN 0

END_OF_AUDIO 2

`END_OF_AUDIO` indicates that Speech is done processing the audio.

SpeechEvent.RecognitionEvent.IsFinalReason

Gives an indication as to why a result was finalized.

Name | Number | Description
UNSET 0

END_OF_SPEECH 1

END_OF_SPEECH implies that `is_final` was emitted because the end of human speech was detected. This can occur at an utterance boundary or when a speaker stops speaking.

END_OF_AUDIO 2

END_OF_AUDIO implies that the client marked the audio stream as finished and it triggered a final response.

SpeechEvent.VADEvent.VADType

Name | Number | Description
UNKNOWN 0

START_OF_SPEECH 1

`START_OF_SPEECH` indicates that the classifier detected start of human speech.

END_OF_SPEECH 2

`END_OF_SPEECH` indicates that the classifier detected end of human speech.

PAUSE_OF_SPEECH 3

`PAUSE_OF_SPEECH` indicates that the classifier detected a speech pause.

SpeechInitConfig.AudioConfig.SampleType

`SampleType` has information about the sample width and type. Prefer INT16.

Name | Number | Description
UNKNOWN 0

INT16 1

Int16 samples will be fed to the Speech pipeline.

INT32 2

Int32 samples will be fed to the Speech pipeline.

FLOAT16 3

Float16 samples will be fed to the Speech pipeline.

FLOAT32 4

Float32 samples will be fed to the Speech pipeline.
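Since INT16 is the preferred sample type, a client holding FLOAT32 samples may need to convert before pushing audio. A common conversion sketch (the [-1.0, 1.0] normalization and clipping are conventional assumptions, not documented libgspeech behavior):

```python
def float32_to_int16(samples):
    """Convert FLOAT32 samples in [-1.0, 1.0] to INT16, the preferred
    sample type, clipping anything outside the representable range."""
    converted = []
    for sample in samples:
        value = int(round(sample * 32767.0))
        converted.append(max(-32768, min(32767, value)))
    return converted
```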

Scalar Value Types

.proto Type | Notes | C++ Type | Java Type | Python Type
double | | double | double | float
float | | float | float | float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long
uint32 | Uses variable-length encoding. | uint32 | int | int/long
uint64 | Uses variable-length encoding. | uint64 | long | int/long
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long
sfixed32 | Always four bytes. | int32 | int | int
sfixed64 | Always eight bytes. | int64 | long | int/long
bool | | bool | boolean | boolean
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str