Voice activity events and timeouts

Voice activity events indicate when the start or end of speech has been detected in a stream. Speech-to-Text sends the events in real time as it detects them. Voice activity events are useful for applications that rely on automatically detecting when a user has started or finished speaking. Speech-to-Text can also be configured to automatically close the stream based on voice activity.

Voice activity events are only available for StreamingRecognize gRPC requests.

Enable voice activity events

To receive voice activity events, set the enable_voice_activity_events flag to true in the streaming_features message.
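
As a minimal sketch, the setting can be pictured as a plain Python dictionary that mirrors the message layout. Only streaming_features and enable_voice_activity_events come from the text above; the helper name build_streaming_features is hypothetical, and the other fields a real StreamingRecognize request needs are omitted.

```python
import json


def build_streaming_features(enable_events: bool = True) -> dict:
    """Return a dict mirroring the streaming_features message.

    Hypothetical helper for illustration; not part of any client library.
    """
    return {
        "streaming_features": {
            "enable_voice_activity_events": enable_events,
        }
    }


config = build_streaming_features()
print(json.dumps(config, indent=2))
```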

Voice activity event types

Voice activity events are usually returned in real time as Speech-to-Text detects the start or end of speech during the stream. They usually arrive before the transcription results for the corresponding segment of speech. Voice activity events can also be sent for audio that produces empty transcription results.

Speech Activity Begin

Sent when Speech-to-Text detects that speech has started.

{
  "speechEventType": "SPEECH_ACTIVITY_BEGIN",
  "speechEventOffset": "1.070s"
}

Speech Activity End

Sent when Speech-to-Text detects that speech has ended.

{
  "speechEventType": "SPEECH_ACTIVITY_END",
  "speechEventOffset": "1.070s"
}

If the stream is closed before speech ends, a SPEECH_ACTIVITY_END event will not be sent.
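
The following sketch shows one way a client might pair the two event types into speech segments, assuming each response has already been decoded into a dictionary shaped like the JSON above. The function names and the sample offsets are hypothetical. A segment whose end is None represents the case where the stream closed before speech ended.

```python
from typing import Iterable, List, Optional, Tuple


def parse_offset(offset: str) -> float:
    """Convert an offset string such as "1.070s" into seconds."""
    return float(offset.rstrip("s"))


def speech_segments(events: Iterable[dict]) -> List[Tuple[float, Optional[float]]]:
    """Pair BEGIN and END events into (start, end) segments in seconds."""
    segments: List[Tuple[float, Optional[float]]] = []
    start: Optional[float] = None
    for event in events:
        kind = event.get("speechEventType")
        if kind == "SPEECH_ACTIVITY_BEGIN":
            start = parse_offset(event["speechEventOffset"])
        elif kind == "SPEECH_ACTIVITY_END" and start is not None:
            segments.append((start, parse_offset(event["speechEventOffset"])))
            start = None
    if start is not None:
        # Stream closed before speech ended: no END event was received.
        segments.append((start, None))
    return segments


# Sample offsets below are made up for illustration.
sample_events = [
    {"speechEventType": "SPEECH_ACTIVITY_BEGIN", "speechEventOffset": "1.070s"},
    {"speechEventType": "SPEECH_ACTIVITY_END", "speechEventOffset": "2.500s"},
    {"speechEventType": "SPEECH_ACTIVITY_BEGIN", "speechEventOffset": "4.000s"},
]
segments = speech_segments(sample_events)
print(segments)
```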

Enable voice activity timeouts

You can enable voice activity timeouts by setting values on the voice_activity_timeout message in streaming_features. Voice activity timeouts must be greater than 500ms and less than 60s. Speech begin and end timeouts can be set independently.
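
A minimal sketch of building and validating these values on the client side follows. The field names speech_start_timeout and speech_end_timeout are assumptions not stated in the text above, as is the helper name; the range check applies the stated limits (greater than 500ms, less than 60s).

```python
from typing import Optional


def build_voice_activity_timeout(
    speech_start_s: Optional[float] = None,
    speech_end_s: Optional[float] = None,
) -> dict:
    """Return a dict mirroring voice_activity_timeout inside streaming_features.

    Hypothetical helper; field names inside voice_activity_timeout are assumed.
    """

    def as_duration(name: str, seconds: float) -> str:
        # Timeouts must be greater than 500ms and less than 60s.
        if not 0.5 < seconds < 60:
            raise ValueError(f"{name} must be between 0.5s and 60s, got {seconds}")
        return f"{seconds:g}s"

    timeout = {}
    # The two timeouts are independent, so either one may be omitted.
    if speech_start_s is not None:
        timeout["speech_start_timeout"] = as_duration("speech_start_s", speech_start_s)
    if speech_end_s is not None:
        timeout["speech_end_timeout"] = as_duration("speech_end_s", speech_end_s)
    return {"streaming_features": {"voice_activity_timeout": timeout}}


timeouts = build_voice_activity_timeout(speech_start_s=5, speech_end_s=1.5)
print(timeouts)
```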

Speech begin timeout

When a speech begin timeout is set, Speech-to-Text automatically closes the stream if speech has not started before the timeout period elapses. Once a SPEECH_ACTIVITY_BEGIN event has been detected and returned, the timeout is canceled for the duration of the stream. This feature is useful for applications that expect a user to begin speaking within a given period of time.

Speech end timeout

When a speech end timeout is set, Speech-to-Text automatically closes the stream if no further speech is detected within the timeout duration after a SPEECH_ACTIVITY_END event. Once a SPEECH_ACTIVITY_BEGIN event has been detected and returned, the timeout is canceled; it starts again when the next SPEECH_ACTIVITY_END event is sent.
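
The interaction between the two timeouts can be modeled with a small simulation. This is a simplified client-side model of the behavior described above, not the service's implementation. It assumes audio keeps arriving after the last event with no further speech, and the function name and event tuples are hypothetical.

```python
from typing import List, Optional, Tuple


def stream_close_time(
    events: List[Tuple[float, str]],
    begin_timeout: Optional[float] = None,
    end_timeout: Optional[float] = None,
) -> Optional[float]:
    """Return the audio time (seconds) at which a timeout closes the stream.

    events is a list of (offset_seconds, "BEGIN" or "END") sorted by offset.
    Returns None if no timeout would fire.
    """
    # The begin timeout runs from the start of the stream.
    deadline = begin_timeout
    for offset, kind in events:
        if deadline is not None and offset > deadline:
            return deadline  # the timeout expired before this event
        if kind == "BEGIN":
            deadline = None  # speech started: the pending timeout is canceled
        elif kind == "END" and end_timeout is not None:
            deadline = offset + end_timeout  # the end timeout starts again
    return deadline


# Speech from 1.0s to 3.0s, then silence: the 2s end timeout fires at 5.0s.
print(stream_close_time([(1.0, "BEGIN"), (3.0, "END")], begin_timeout=5, end_timeout=2))
```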

Time measurement for timeouts

Elapsed time is measured by the number of bytes of audio sent in requests to Speech-to-Text, not by server time. This preserves accuracy when stream transmission speed varies. Sending very large audio chunks in requests, or sending requests in very rapid succession, reduces the accuracy of timeout measurement. Note: the size limit for audio chunks is 15360 bytes per request.
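
As a worked example of byte-based time measurement, the sketch below converts a byte count into audio seconds and splits a buffer into chunks that respect the 15360-byte limit. The audio format (16-bit linear PCM, 16 kHz, mono) is an assumption chosen for illustration, and both function names are hypothetical. Under that assumption, one maximum-size chunk carries 15360 / (16000 × 2 × 1) = 0.48 seconds of audio.

```python
from typing import Iterator

MAX_CHUNK_BYTES = 15360  # per-request limit stated above


def audio_seconds(
    num_bytes: int,
    sample_rate_hz: int = 16000,  # assumed format: 16 kHz
    bytes_per_sample: int = 2,    # assumed format: 16-bit samples
    channels: int = 1,            # assumed format: mono
) -> float:
    """Return the duration represented by num_bytes of uncompressed PCM audio."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


def chunk_audio(audio: bytes, chunk_size: int = MAX_CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of audio, each at most chunk_size bytes."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


chunk_sizes = [len(chunk) for chunk in chunk_audio(bytes(40000))]
print(chunk_sizes)
print(audio_seconds(MAX_CHUNK_BYTES))
```

Smaller, evenly paced chunks give the byte-based clock finer resolution, which is consistent with the advice above to avoid very large chunks.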