CCAI Transcription lets you convert your streaming audio data to transcribed text in real time. Agent Assist makes suggestions based on text, so audio data must be converted before it can be used. You can also use transcribed streaming audio with CCAI Insights to gather real-time data about agent conversations (for example, Topic Modeling).
There are two ways to transcribe streaming audio for use with CCAI: by using the SIPREC feature, or by making gRPC calls with audio data as the payload. This page describes the process of transcribing streaming audio data using gRPC calls.
CCAI Transcription works using Speech-to-Text streaming speech recognition. Speech-to-Text offers multiple recognition models, in standard and enhanced versions. CCAI Transcription is supported at the GA level only when it's used with the enhanced phone call model. The enhanced phone call model is available for the following languages:
- en-US
- en-AU
- en-GB
- es-US
- fr-CA
- pt-BR
- fr-FR
- es-ES
- it-IT
- de-DE
- ja-JP
Prerequisites
- Create a project in Google Cloud.
- Enable the Dialogflow API.
- Contact your Google representative to make sure that your account has access to Speech-to-Text enhanced models.
Create a conversation profile
To create a conversation profile, call the `create` method on the `ConversationProfile` resource. The method is the same as when you create a conversation profile for chat data.
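For example, with the `google-cloud-dialogflow` v2beta1 Python client, a minimal conversation profile might be created as in the following sketch; the project ID, location, and display name are placeholders:

```python
from google.cloud import dialogflow_v2beta1 as dialogflow

# Placeholder values: replace with your own project and location.
project_id = "my-project-id"
location = "global"

client = dialogflow.ConversationProfilesClient()

conversation_profile = dialogflow.ConversationProfile(
    display_name="ccai-transcription-profile",
    language_code="en-US",
)

response = client.create_conversation_profile(
    parent=f"projects/{project_id}/locations/{location}",
    conversation_profile=conversation_profile,
)
print("Created conversation profile:", response.name)
```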
Create a conversation
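To create a conversation, call the `create` method on the `Conversation` resource and reference the conversation profile you created. A minimal sketch, assuming the `google-cloud-dialogflow` v2beta1 Python client and placeholder project, location, and profile IDs:

```python
from google.cloud import dialogflow_v2beta1 as dialogflow

# Placeholder values: replace with your own resources.
project_id = "my-project-id"
location = "global"
conversation_profile_name = (
    f"projects/{project_id}/locations/{location}/conversationProfiles/my-profile-id"
)

client = dialogflow.ConversationsClient()

# The conversation references the profile that configures transcription.
conversation = dialogflow.Conversation(
    conversation_profile=conversation_profile_name,
)

response = client.create_conversation(
    parent=f"projects/{project_id}/locations/{location}",
    conversation=conversation,
)
print("Created conversation:", response.name)
```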
Create participants
There are three types of participant. See the reference documentation for more details about their roles. Call the `create` method on the `participant` and specify the `role`. Only an `END_USER` or a `HUMAN_AGENT` participant can call `StreamingAnalyzeContent`, which is required to get a transcription.
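For example, creating an `END_USER` participant on an existing conversation might look like the following sketch (v2beta1 Python client; the conversation resource name is a placeholder):

```python
from google.cloud import dialogflow_v2beta1 as dialogflow

# Placeholder: full resource name of the conversation created earlier.
conversation_name = (
    "projects/my-project-id/locations/global/conversations/my-conversation-id"
)

client = dialogflow.ParticipantsClient()

# Only END_USER and HUMAN_AGENT participants can call StreamingAnalyzeContent.
participant = dialogflow.Participant(
    role=dialogflow.Participant.Role.END_USER,
)

response = client.create_participant(
    parent=conversation_name,
    participant=participant,
)
print("Created participant:", response.name)
```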
Send audio data and get a transcript
You can use `StreamingAnalyzeContent` to send a participant's audio to Google and get a transcription, with the following parameters:
- The first request in the stream must be `InputAudioConfig`. (It does not use `ConversationProfile.stt_config` right now.) Don't send any audio input until the second request. (See the configuration sketch after this list.)
  - `audioEncoding` needs to be set to `AUDIO_ENCODING_LINEAR_16` or `AUDIO_ENCODING_MULAW`.
  - `model`: This is the Speech-to-Text model that you want to use to transcribe your audio. Set this field to `phone_call`.
  - `speechModelVariant`: The `speechModelVariant` field determines whether Speech-to-Text will use the standard or enhanced version of your selected model. Set this field to `USE_ENHANCED`.
  - `singleUtterance` should be set to `false` for best transcription quality. You should not expect `END_OF_SINGLE_UTTERANCE` if `singleUtterance` is `false`, but you can depend on `isFinal==true` inside `StreamingAnalyzeContentResponse.recognition_result` to half-close the stream.
  - Optional additional parameters: The following parameters are optional. To gain access to these parameters, contact your Google representative.
    - `languageCode`: The `language_code` of the audio. The default value is `en-US`.
    - `alternativeLanguageCodes`: Additional languages that might be detected in the audio. Agent Assist uses the `language_code` field to automatically detect the language at the beginning of the audio and sticks to it in all following conversation turns. The `alternativeLanguageCodes` field lets you specify more options for Agent Assist to choose from.
    - `phraseSets`: The Speech-to-Text model adaptation `phraseSet` resource name. To use model adaptation with CCAI Transcription, you must first create the `phraseSet` using the Speech-to-Text API and specify the resource name here.
- After you send the second request with an audio payload, you should start receiving `StreamingAnalyzeContentResponses` from the stream.
  - You can half-close the stream (or stop sending in some languages, like Python) when you see `is_final` set to `true` in `StreamingAnalyzeContentResponse.recognition_result`.
  - After you half-close the stream, the server sends back the response containing the final transcript, along with potential Dialogflow suggestions or Agent Assist suggestions.
- You can find the final transcription in the following locations:
  - `StreamingAnalyzeContentResponse.message.content`.
  - If you enable Pub/Sub notifications, you can also see the transcription in Pub/Sub.
- Start a new stream after the previous stream is closed.
  - Audio re-send: Audio data generated after the last `speech_end_offset` of the response with `is_final=true`, up to the new stream's start time, needs to be re-sent to `StreamingAnalyzeContent` for best transcription quality.
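As a rough illustration, the configuration-only first request described in the list might look like the following sketch. It assumes 16 kHz LINEAR16 audio and a placeholder participant name; the optional parameters that require access from your Google representative are omitted:

```python
from google.cloud import dialogflow_v2beta1 as dialogflow

audio_config = dialogflow.InputAudioConfig(
    audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
    sample_rate_hertz=16000,  # assumption: 16 kHz input audio
    language_code="en-US",
    model="phone_call",
    model_variant=dialogflow.SpeechModelVariant.USE_ENHANCED,
    single_utterance=False,
)

# The first request carries only the configuration; audio starts with the
# second request.
first_request = dialogflow.StreamingAnalyzeContentRequest(
    participant="projects/my-project-id/locations/global/conversations/my-conversation-id/participants/my-participant-id",
    audio_config=audio_config,
)
```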
The following diagram illustrates how the stream works.
Streaming recognition request code sample
The following code sample illustrates how to send a streaming transcription request:
Python
To authenticate to Agent Assist, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
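A minimal sketch of such a request, assuming the `google-cloud-dialogflow` v2beta1 client, an existing `END_USER` or `HUMAN_AGENT` participant resource name, and a hypothetical `audio_chunks` iterable that yields raw LINEAR16 audio bytes:

```python
from google.cloud import dialogflow_v2beta1 as dialogflow


def transcribe_stream(participant_name, audio_chunks):
    """Streams a participant's audio and prints interim and final transcripts.

    participant_name: full resource name of an END_USER or HUMAN_AGENT participant.
    audio_chunks: iterable of raw LINEAR16 audio byte chunks (assumed helper).
    """
    client = dialogflow.ParticipantsClient()

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model="phone_call",
        model_variant=dialogflow.SpeechModelVariant.USE_ENHANCED,
        single_utterance=False,
    )

    def request_generator():
        # First request: configuration only, no audio.
        yield dialogflow.StreamingAnalyzeContentRequest(
            participant=participant_name,
            audio_config=audio_config,
        )
        # Subsequent requests: raw audio bytes.
        for chunk in audio_chunks:
            yield dialogflow.StreamingAnalyzeContentRequest(input_audio=chunk)
        # Returning from this generator stops sending, which half-closes the
        # stream from the client side.

    responses = client.streaming_analyze_content(requests=request_generator())
    for response in responses:
        result = response.recognition_result
        if result.transcript:
            print("Transcript:", result.transcript, "is_final:", result.is_final)
        if result.is_final:
            # Stop feeding audio_chunks here so the generator ends and the
            # stream is half-closed.
            pass
        if response.message.content:
            # Final transcript for this utterance.
            print("Final message:", response.message.content)
```

In practice the audio producer and the response loop usually run concurrently (for example, the generator reads from a queue), so that you can stop sending, and thereby half-close the stream, as soon as `is_final` is `true`, then open a new stream and re-send any audio captured after the last `speech_end_offset`.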