ConversationEvent

Represents a notification sent to Pub/Sub subscribers for conversation lifecycle events.

JSON representation
{
  "conversation": string,
  "type": enum (Type),
  "errorStatus": {
    object (Status)
  },

  // Union field payload can be only one of the following:
  "newMessagePayload": {
    object (Message)
  },
  "newRecognitionResultPayload": {
    object (StreamingRecognitionResult)
  }
  // End of list of possible types for union field payload.
}
Fields
conversation

string

Required. The unique identifier of the conversation this notification refers to. Format: projects/<Project ID>/conversations/<Conversation ID>.

type

enum (Type)

Required. The type of the event that this notification refers to.

errorStatus

object (Status)

Optional. More detailed information about an error. Only set for type UNRECOVERABLE_ERROR_IN_PHONE_CALL.

Union field payload. Payload of conversation event. payload can be only one of the following:
newMessagePayload

object (Message)

Payload of NEW_MESSAGE event.

newRecognitionResultPayload

object (StreamingRecognitionResult)

Payload of NEW_RECOGNITION_RESULT event.
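
As an illustration of the union field, the following Python sketch decodes one notification and reports which payload member, if any, is present. It assumes the Pub/Sub message data is a UTF-8 JSON encoding of ConversationEvent using the field names shown above. The function name parse_conversation_event and the shape of the returned dictionary are hypothetical, not part of the API.

```python
import json

# The two members of the "payload" union listed in the JSON representation.
PAYLOAD_FIELDS = ("newMessagePayload", "newRecognitionResultPayload")


def parse_conversation_event(data: bytes) -> dict:
    """Decode a JSON ConversationEvent and pick out its union payload.

    Assumption: `data` holds a UTF-8 JSON object with the documented fields.
    """
    event = json.loads(data.decode("utf-8"))

    # A union field can carry at most one of its members.
    present = [name for name in PAYLOAD_FIELDS if name in event]
    if len(present) > 1:
        raise ValueError("payload is a union field, got: " + ", ".join(present))

    payload_field = present[0] if present else None
    return {
        "conversation": event["conversation"],    # Required
        "type": event["type"],                    # Required
        "error_status": event.get("errorStatus"),  # Optional
        "payload_field": payload_field,
        "payload": event[payload_field] if payload_field else None,
    }
```

A caller might branch on the returned "type" value and read "payload" only for the NEW_MESSAGE and NEW_RECOGNITION_RESULT events.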

StreamingRecognitionResult

Contains a speech recognition result corresponding to a portion of the audio that is currently being processed or an indication that this is the end of the single requested utterance.

While end-user audio is being processed, Dialogflow sends a series of results. Each result may contain a transcript value. A transcript represents a portion of the utterance. While the recognizer is processing audio, transcript values may be interim values or finalized values. Once a transcript is finalized, the isFinal value is set to true and processing continues for the next transcript.

If StreamingDetectIntentRequest.query_input.audio_config.single_utterance was true and the recognizer has completed processing audio, the messageType value is set to END_OF_SINGLE_UTTERANCE and the following (last) result contains the last finalized transcript.

The complete end-user utterance is determined by concatenating the finalized transcript values received for the series of results.

In the following example, single utterance is enabled. If single utterance were not enabled, result 7 would not occur.

Num | transcript              | messageType            | isFinal
--- | ----------------------- | ----------------------- | --------
1   | "tube"                  | TRANSCRIPT              | false
2   | "to be a"               | TRANSCRIPT              | false
3   | "to be"                 | TRANSCRIPT              | false
4   | "to be or not to be"    | TRANSCRIPT              | true
5   | "that's"                | TRANSCRIPT              | false
6   | "that is"               | TRANSCRIPT              | false
7   | unset                   | END_OF_SINGLE_UTTERANCE | unset
8   | " that is the question" | TRANSCRIPT              | true

Concatenating the transcripts that have isFinal set to true (results 4 and 8) yields the complete utterance: "to be or not to be that is the question".
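
The concatenation rule can be sketched in Python using the eight results from the table. The results are modeled as plain dictionaries with the JSON field names from this page. The helper name complete_utterance is hypothetical.

```python
def complete_utterance(results):
    """Join the transcripts of all finalized TRANSCRIPT results, in order."""
    return "".join(
        r["transcript"]
        for r in results
        if r.get("messageType") == "TRANSCRIPT" and r.get("isFinal")
    )


# The series of results from the table above. Result 7 leaves transcript
# and isFinal unset, so those keys are absent.
RESULTS = [
    {"messageType": "TRANSCRIPT", "transcript": "tube", "isFinal": False},
    {"messageType": "TRANSCRIPT", "transcript": "to be a", "isFinal": False},
    {"messageType": "TRANSCRIPT", "transcript": "to be", "isFinal": False},
    {"messageType": "TRANSCRIPT", "transcript": "to be or not to be", "isFinal": True},
    {"messageType": "TRANSCRIPT", "transcript": "that's", "isFinal": False},
    {"messageType": "TRANSCRIPT", "transcript": "that is", "isFinal": False},
    {"messageType": "END_OF_SINGLE_UTTERANCE"},
    {"messageType": "TRANSCRIPT", "transcript": " that is the question", "isFinal": True},
]

print(complete_utterance(RESULTS))  # to be or not to be that is the question
```

The transcripts are joined without a separator because the finalized value in result 8 already carries its own leading space.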

JSON representation
{
  "messageType": enum (MessageType),
  "transcript": string,
  "isFinal": boolean,
  "confidence": number,
  "stability": number,
  "speechWordInfo": [
    {
      object (SpeechWordInfo)
    }
  ],
  "speechEndOffset": string,
  "languageCode": string,
  "dtmfDigits": {
    object (TelephonyDtmfEvents)
  }
}
Fields
messageType

enum (MessageType)

Type of the result message.

transcript

string

Transcript text representing the words that the user spoke. Populated if and only if messageType = TRANSCRIPT.

isFinal

boolean

If false, the StreamingRecognitionResult represents an interim result that may change. If true, the recognizer will not return any further hypotheses about this piece of the audio. May only be populated for messageType = TRANSCRIPT.

confidence

number

The Speech confidence between 0.0 and 1.0 for the current portion of audio. A higher number indicates an estimated greater likelihood that the recognized words are correct. The default of 0.0 is a sentinel value indicating that confidence was not set.

This field is typically only provided if isFinal is true, and you should not rely on it being accurate or even set.

stability

number

An estimate of the likelihood that the speech recognizer will not change its guess about this interim recognition result:

  • If the value is unspecified or 0.0, Dialogflow didn't compute the stability. In particular, Dialogflow will only provide stability for TRANSCRIPT results with isFinal = false.
  • Otherwise, the value is in (0.0, 1.0] where 0.0 means completely unstable and 1.0 means completely stable.
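
One way a client might honor the 0.0 sentinel is to map it to "not computed" before comparing against a threshold. This is a minimal sketch, assuming results are dictionaries with the JSON field names from this page. The function names and the 0.8 default threshold are hypothetical choices, not values taken from the API.

```python
from typing import Optional


def interim_stability(result: dict) -> Optional[float]:
    """Return the stability of an interim TRANSCRIPT result, or None.

    None means the value was not computed (unspecified or 0.0), or the
    result is not an interim transcript at all.
    """
    if result.get("messageType") != "TRANSCRIPT" or result.get("isFinal"):
        return None
    stability = result.get("stability", 0.0)
    return stability if stability > 0.0 else None


def is_stable_enough(result: dict, min_stability: float = 0.8) -> bool:
    """True if an interim result has a known stability at or above the threshold."""
    stability = interim_stability(result)
    return stability is not None and stability >= min_stability
```

Returning None instead of 0.0 keeps "not computed" distinct from a very low estimate, so callers can decide separately how to treat unknown values.
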
speechWordInfo[]

object (SpeechWordInfo)

Word-specific information for the words recognized by Speech in transcript. Populated if and only if messageType = TRANSCRIPT and InputAudioConfig.enable_word_info is set.

speechEndOffset

string (Duration format)

Time offset of the end of this Speech recognition result relative to the beginning of the audio. Only populated for messageType = TRANSCRIPT.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".
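
A value in this format can be converted to a number by stripping the trailing 's'. The following Python sketch uses Decimal so that all nine fractional digits survive the conversion. The function name parse_duration_seconds is hypothetical, and its validation is deliberately minimal.

```python
from decimal import Decimal, InvalidOperation


def parse_duration_seconds(value: str) -> Decimal:
    """Convert a Duration string such as "3.5s" into seconds."""
    if not value.endswith("s"):
        raise ValueError(f"duration must end with 's': {value!r}")
    try:
        # Decimal avoids binary floating-point rounding of the fraction.
        return Decimal(value[:-1])
    except InvalidOperation:
        raise ValueError(f"invalid duration: {value!r}") from None


print(parse_duration_seconds("3.5s"))  # 3.5
```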

languageCode

string

Detected language code for the transcript.

dtmfDigits

object (TelephonyDtmfEvents)

DTMF digits. Populated if and only if messageType = DTMF_DIGITS.

SpeechWordInfo

Information for a word recognized by the speech recognizer.

JSON representation
{
  "word": string,
  "startOffset": string,
  "endOffset": string,
  "confidence": number
}
Fields
word

string

The word this info is for.

startOffset

string (Duration format)

Time offset relative to the beginning of the audio that corresponds to the start of the spoken word. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

endOffset

string (Duration format)

Time offset relative to the beginning of the audio that corresponds to the end of the spoken word. This is an experimental feature and the accuracy of the time offset can vary.

A duration in seconds with up to nine fractional digits, ending with 's'. Example: "3.5s".

confidence

number

The Speech confidence between 0.0 and 1.0 for this word. A higher number indicates an estimated greater likelihood that the recognized word is correct. The default of 0.0 is a sentinel value indicating that confidence was not set.

This field is not guaranteed to be fully stable over time for the same audio input. Users should also not rely on it to always be provided.
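
Because 0.0 is a sentinel and the field may be absent, a consumer that flags uncertain words should skip entries with no usable confidence rather than treat them as low confidence. This is a minimal sketch, assuming each SpeechWordInfo is a dictionary with the JSON field names above. The helper name low_confidence_words and the 0.5 default threshold are hypothetical.

```python
def low_confidence_words(words, threshold=0.5):
    """Return the words whose confidence is set and below the threshold."""
    flagged = []
    for info in words:
        confidence = info.get("confidence", 0.0)
        if confidence == 0.0:
            # Sentinel or missing field: confidence was not set, so skip.
            continue
        if confidence < threshold:
            flagged.append(info["word"])
    return flagged
```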