Applications often need a bot to talk back to the user. Dialogflow can now use Cloud Text-to-Speech powered by DeepMind WaveNet to generate speech responses from your agent. Here is an example that uses audio for both input and output when detecting an intent. This use case is common when developing apps that communicate with users via a purely audio interface.
For a list of supported languages, see the Text-to-Speech column on the Languages page.
Set up your GCP project and authentication
Create an agent
Import the example file to your agent
Importing will add intents and entities to your agent. If any existing intents or entities have the same name as those in the imported file, they will be replaced.
To import the file, follow these steps:
- Download the RoomReservation.zip file
- Go to the Dialogflow Console
- Select your agent
- Click the gear icon settings next to the agent name
- Select the Export and Import tab
- Select Import From Zip and import the zip file that you downloaded
Enable beta features
You may need to enable the beta API:
- Go to the Dialogflow Console.
- Select an agent.
- Click the gear icon settings next to the agent name to edit its settings.
- Scroll down while on the General tab and ensure that BETA FEATURES is enabled.
- If you have made changes, click SAVE.
Detect intent
curl command
Download the sample input audio file book_a_room.wav, which says "book a room". The audio file must be base64 encoded for this example so that it can be provided in the JSON request below. Here is a Linux example:
base64 -w 0 book_a_room.wav > book_a_room.b64
For examples on other platforms, see Embedding Base64 encoded audio in the Cloud Speech API documentation.
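If you prefer to do the encoding in code rather than with a shell utility, here is a minimal Python sketch equivalent to the Linux command above (file names carried over from this step):

import base64

# Read the raw WAV bytes and write a single-line base64 copy,
# equivalent to `base64 -w 0 book_a_room.wav > book_a_room.b64`.
with open("book_a_room.wav", "rb") as wav_file:
    encoded = base64.b64encode(wav_file.read())

with open("book_a_room.b64", "wb") as b64_file:
    b64_file.write(encoded)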
Use the following curl command to call the detectIntent method and specify base64 encoded audio. Replace project-id with your Google Cloud project ID, and replace base64-audio with the base64 content of the output file from the previous step.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
  'queryInput': {
    'audioConfig': {
      'languageCode': 'en-US'
    }
  },
  'outputAudioConfig': {
    'audioEncoding': 'OUTPUT_AUDIO_ENCODING_LINEAR_16'
  },
  'inputAudio': 'base64-audio'
}" \
"https://dialogflow.googleapis.com/v2beta1/projects/project-id/agent/sessions/123456789:detectIntent"
You should see a response similar to the following. Notice that the value of the queryResult.action field is room.reservation, and the outputAudio field contains a large base64-encoded audio string.

{
  "responseId": "b7405848-2a3a-4e26-b9c6-c4cf9c9a22ee",
  "queryResult": {
    "queryText": "book a room",
    "speechRecognitionConfidence": 0.8616504,
    "action": "room.reservation",
    "parameters": {
      "time": "",
      "date": "",
      "duration": "",
      "guests": "",
      "location": ""
    },
    "fulfillmentText": "I can help with that. Where would you like to reserve a room?",
    "fulfillmentMessages": [
      {
        "text": {
          "text": [
            "I can help with that. Where would you like to reserve a room?"
          ]
        },
        "platform": "FACEBOOK"
      },
      {
        "text": {
          "text": [
            "I can help with that. Where would you like to reserve a room?"
          ]
        }
      }
    ],
    "outputContexts": [
      {
        "name": "projects/myproject/agent/sessions/123456789/contexts/e8f6a63e-73da-4a1a-8bfc-857183f71228_id_dialog_context",
        "lifespanCount": 2,
        "parameters": {
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests.original": "",
          "location.original": "",
          "duration": "",
          "guests": "",
          "location": "",
          "date.original": ""
        }
      },
      {
        "name": "projects/myproject/agent/sessions/123456789/contexts/room_reservation_dialog_params_location",
        "lifespanCount": 1,
        "parameters": {
          "date.original": "",
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests": "",
          "duration": "",
          "location.original": "",
          "guests.original": "",
          "location": ""
        }
      },
      {
        "name": "projects/myproject/agent/sessions/123456789/contexts/room_reservation_dialog_context",
        "lifespanCount": 2,
        "parameters": {
          "date.original": "",
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests.original": "",
          "guests": "",
          "duration": "",
          "location.original": "",
          "location": ""
        }
      }
    ],
    "intent": {
      "name": "projects/myproject/agent/intents/e8f6a63e-73da-4a1a-8bfc-857183f71228",
      "displayName": "room.reservation"
    },
    "intentDetectionConfidence": 1,
    "diagnosticInfo": {},
    "languageCode": "en-us"
  },
  "outputAudio": "UklGRs6vAgBXQVZFZm10IBAAAAABAAEAwF0AAIC7AA..."
}
[Optional] Copy the text from the outputAudio field and save it in a file named output_audio.b64. This file needs to be converted back to audio. Here is a Linux example:

base64 -d output_audio.b64 > output_audio.wav

For examples on other platforms, see Decoding Base64-Encoded Audio Content in the Text-to-Speech API documentation. You can now play the output_audio.wav audio file and hear that it matches the text from the queryResult.fulfillmentMessages[1].text.text[0] field above. The second fulfillmentMessages element is chosen because it is the text response for the default platform.
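A cross-platform alternative, sketched in Python (assuming the file names above):

import base64

# Decode the saved base64 text back into playable WAV bytes,
# equivalent to `base64 -d output_audio.b64 > output_audio.wav`.
with open("output_audio.b64", "rb") as b64_file:
    audio_bytes = base64.b64decode(b64_file.read())

with open("output_audio.wav", "wb") as wav_file:
    wav_file.write(audio_bytes)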
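The same request can also be made through the client libraries. Below is a minimal Python sketch using the google-cloud-dialogflow package; the session ID and file names are carried over from the curl example, and the 16 kHz sample rate is an assumption about the sample WAV file.

from google.cloud import dialogflow

def detect_intent_with_output_audio(project_id, session_id, audio_file_path):
    """Sends an audio query and requests a spoken response."""
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)

    with open(audio_file_path, "rb") as audio_file:
        input_audio = audio_file.read()

    # Describe the input audio: 16-bit linear PCM, assumed here
    # to be sampled at 16 kHz like the sample book_a_room.wav file.
    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
        language_code="en-US",
        sample_rate_hertz=16000,
    )
    query_input = dialogflow.QueryInput(audio_config=audio_config)

    # Asking for output audio is what triggers text-to-speech.
    output_audio_config = dialogflow.OutputAudioConfig(
        audio_encoding=dialogflow.OutputAudioEncoding.OUTPUT_AUDIO_ENCODING_LINEAR_16,
    )

    response = session_client.detect_intent(
        request={
            "session": session,
            "query_input": query_input,
            "output_audio_config": output_audio_config,
            "input_audio": input_audio,
        }
    )

    print("Query text:", response.query_result.query_text)
    print("Detected intent:", response.query_result.intent.display_name)

    # response.output_audio holds the synthesized speech bytes.
    with open("output_audio.wav", "wb") as out:
        out.write(response.output_audio)

detect_intent_with_output_audio("project-id", "123456789", "book_a_room.wav")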
See the Detect intent responses section for a description of the relevant response fields.
Detect intent responses
The response for a detect intent request is a DetectIntentResponse object.

Normal detect intent processing controls the content of the DetectIntentResponse.queryResult.fulfillmentMessages field.

The DetectIntentResponse.outputAudio field is populated with audio based on the values of the default platform text responses found in the DetectIntentResponse.queryResult.fulfillmentMessages field. If multiple default text responses exist, they are concatenated when generating the audio. If no default platform text responses exist, the generated audio content is empty.

The DetectIntentResponse.outputAudioConfig field is populated with the audio settings used to generate the output audio.
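For example, continuing the Python sketch above, you could reconstruct the text that the output audio was generated from. This is a hedged sketch; the nested Platform enum path is an assumption based on the v2 client library types.

# Collect the default-platform text responses; these are the strings
# that Dialogflow concatenates when synthesizing outputAudio.
default_texts = [
    text
    for message in response.query_result.fulfillment_messages
    if message.platform == dialogflow.Intent.Message.Platform.PLATFORM_UNSPECIFIED
    for text in message.text.text
]
print(" ".join(default_texts))  # this is what the output audio says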
Detect intent from a stream
When detecting intent from a stream, you send requests similar to the example that does not use output audio: Detecting Intent from a Stream. However, you supply an OutputAudioConfig field in the request. The output_audio and output_audio_config fields are populated in the very last streaming response that you get from the Dialogflow API server. For more information, see StreamingDetectIntentRequest and StreamingDetectIntentResponse.
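A minimal Python sketch of the streaming flow, under the same assumptions as the earlier example (the chunk size and file names are illustrative):

from google.cloud import dialogflow

def detect_intent_stream_with_output_audio(project_id, session_id, audio_file_path):
    """Streams audio to Dialogflow and collects the spoken response."""
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)

    def request_generator():
        # The first request carries only configuration, including
        # output_audio_config to request synthesized speech.
        audio_config = dialogflow.InputAudioConfig(
            audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
            language_code="en-US",
            sample_rate_hertz=16000,
        )
        query_input = dialogflow.QueryInput(audio_config=audio_config)
        output_audio_config = dialogflow.OutputAudioConfig(
            audio_encoding=dialogflow.OutputAudioEncoding.OUTPUT_AUDIO_ENCODING_LINEAR_16,
        )
        yield dialogflow.StreamingDetectIntentRequest(
            session=session,
            query_input=query_input,
            output_audio_config=output_audio_config,
        )
        # Subsequent requests carry raw audio chunks.
        with open(audio_file_path, "rb") as audio_file:
            while chunk := audio_file.read(4096):
                yield dialogflow.StreamingDetectIntentRequest(input_audio=chunk)

    responses = session_client.streaming_detect_intent(requests=request_generator())
    last_response = None
    for response in responses:
        last_response = response  # output_audio arrives on the final response

    if last_response and last_response.output_audio:
        with open("output_audio.wav", "wb") as out:
            out.write(last_response.output_audio)

detect_intent_stream_with_output_audio("project-id", "123456789", "book_a_room.wav")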
Agent settings for speech
Here are the agent settings for text to speech and voice configuration:
- Text to Speech:
  - Enable Automatic Text To Speech: In the example above, the outputAudioConfig field needed to be supplied in order to trigger output audio. If you would like output audio for all detect intent requests, enable this setting.
  - Output Audio Encoding: Choose your desired output audio encoding when automatic text to speech is enabled.
- Agent Voice Configuration (these settings also have per-request counterparts; see the sketch after this list):
  - Voice: Choose a voice generation model.
  - Speaking Rate: Adjusts the voice speaking rate.
  - Pitch: Adjusts the voice pitch.
  - Volume Gain: Adjusts the audio volume gain.
  - Audio Effects Profile: Select the audio effects profiles you want applied to the synthesized voice. Speech audio is optimized for the devices associated with the selected profiles (for example, headphones, large speaker, phone call). For more information, see Audio Profiles in the Text-to-Speech documentation.
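These agent-level voice settings correspond to fields on OutputAudioConfig, so an individual request can also override them. A hedged Python sketch, with field names taken from the v2 client library and an illustrative voice name:

from google.cloud import dialogflow

# Per-request voice configuration; overrides the agent-level speech
# settings for this call only. The voice name below is illustrative.
output_audio_config = dialogflow.OutputAudioConfig(
    audio_encoding=dialogflow.OutputAudioEncoding.OUTPUT_AUDIO_ENCODING_LINEAR_16,
    synthesize_speech_config=dialogflow.SynthesizeSpeechConfig(
        speaking_rate=1.1,   # speaking rate; 1.0 is normal speed
        pitch=-2.0,          # pitch, in semitones
        volume_gain_db=0.0,  # volume gain
        effects_profile_id=["handset-class-device"],  # audio effects profile
        voice=dialogflow.VoiceSelectionParams(
            name="en-US-Wavenet-D",
            ssml_gender=dialogflow.SsmlVoiceGender.SSML_VOICE_GENDER_MALE,
        ),
    ),
)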
To access agent settings for speech:
- Go to the Dialogflow Console
- Select your agent
- Click the gear icon settings next to the agent name
- Select the Speech tab
Use the Dialogflow simulator
You can interact with the agent and receive audio responses via the Dialogflow simulator:
- Follow the steps above to enable automatic text to speech.
- Type or say "book a room" in the simulator.
- See the output audio section at the bottom of the simulator.