Detect intent with speech response

Applications often need a bot to talk back to the user. Dialogflow can use Cloud Text-to-Speech, powered by DeepMind WaveNet, to generate speech responses from your agent. The example in this guide uses audio for both input and output when detecting an intent, which is a common use case when developing apps that communicate with users through a purely audio interface.

For a list of supported languages, see the TTS column on the Languages page.

Before you begin

You should do the following before reading this guide:

  1. Read Dialogflow basics.
  2. Perform setup steps.

Create an agent

The steps in this guide make assumptions about your agent, so it's best to start with a new agent. You should delete any existing agent for your project before creating a new one. To delete an existing agent:

  1. Go to the Dialogflow Console.
  2. If requested, sign in to the Dialogflow Console. See Dialogflow console overview for more information.
  3. Select the agent you wish to delete.
  4. Click the settings button next to the agent's name.
  5. Scroll down to the bottom of the General settings tab.
  6. Click Delete this agent.
  7. Enter DELETE in the text field.
  8. Click Delete.

To create an agent:

  1. Go to the Dialogflow Console.
  2. If requested, sign in to the Dialogflow Console. See Dialogflow console overview for more information.
  3. Click Create Agent in the left sidebar menu. (If you already have other agents, click the agent name, scroll to the bottom and click Create new agent.)
  4. Enter your agent's name, default language, and default time zone.
  5. If you have already created a project, enter that project. If you want to allow the Dialogflow Console to create the project, select Create a new Google project.
  6. Click the Create button.

Import the example file to your agent

Importing will add intents and entities to your agent. If any existing intents or entities have the same name as those in the imported file, they will be replaced.

To import the file, follow these steps:

  1. Download the file.
  2. Go to the Dialogflow Console.
  3. Select your agent.
  4. Click the settings button next to the agent name.
  5. Select the Export and Import tab.
  6. Select Import From Zip and import the zip file that you downloaded.

Detect intent


1. Prepare audio content

Download the sample input_audio file, which says "book a room". The audio file must be base64 encoded for this example, so it can be provided in the JSON request below. Here is a Linux example:

base64 -w 0 book_a_room.wav > book_a_room.b64

For examples on other platforms, see Embedding Base64 encoded audio in the Cloud Speech API documentation.
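If the base64 command-line tool is not available, the same encoding can be done with a few lines of Python. This is a minimal sketch using only the standard library; the function name encode_audio_file is hypothetical and chosen for illustration.

```python
import base64
from pathlib import Path


def encode_audio_file(path):
    """Return the file's bytes as a single-line base64 string."""
    raw = Path(path).read_bytes()
    # b64encode emits no line breaks, which matches `base64 -w 0`.
    return base64.b64encode(raw).decode("ascii")
```

For example, Path("book_a_room.b64").write_text(encode_audio_file("book_a_room.wav")) would produce the same output file as the Linux command above.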

2. Make detect intent request

Call the detectIntent method and specify base64 encoded audio.

Before using any of the request data below, make the following replacements:

  • project-id: your GCP project ID
  • base64-audio: the base64 content from the output file above

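HTTP method and URL:

POST https://dialogflow.googleapis.com/v2/projects/project-id/agent/sessions/123456789:detectIntent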

Request JSON body:

{
  "queryInput": {
    "audioConfig": {
      "languageCode": "en-US"
    }
  },
  "outputAudioConfig": {
    "audioEncoding": "OUTPUT_AUDIO_ENCODING_LINEAR_16"
  },
  "inputAudio": "base64-audio"
}

Send the request with any HTTP client, such as curl, and include an OAuth access token for your project in the Authorization header.
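As one way to assemble this request programmatically, the sketch below builds it with Python's standard library without sending it. The function name, the session ID argument, and the way the access token is obtained are assumptions for illustration; the endpoint shown is the v2 REST URL for detectIntent as I understand it, so verify it against the API reference before relying on it.

```python
import json
import urllib.request

# Assumed v2 REST endpoint; the session ID is a caller-chosen value.
URL_TEMPLATE = (
    "https://dialogflow.googleapis.com/v2/projects/{project_id}"
    "/agent/sessions/{session_id}:detectIntent"
)


def build_detect_intent_request(project_id, session_id, access_token, b64_audio):
    """Build (but do not send) a detectIntent request with audio in and out."""
    body = {
        "queryInput": {"audioConfig": {"languageCode": "en-US"}},
        "outputAudioConfig": {
            "audioEncoding": "OUTPUT_AUDIO_ENCODING_LINEAR_16"
        },
        "inputAudio": b64_audio,
    }
    return urllib.request.Request(
        URL_TEMPLATE.format(project_id=project_id, session_id=session_id),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen and parsing the reply with json.load would yield the response as a dictionary. The access token could come from a command such as gcloud auth application-default print-access-token, depending on how you performed the setup steps.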

You should receive a JSON response similar to the following:

{
  "responseId": "b7405848-2a3a-4e26-b9c6-c4cf9c9a22ee",
  "queryResult": {
    "queryText": "book a room",
    "speechRecognitionConfidence": 0.8616504,
    "action": "room.reservation",
    "parameters": {
      "time": "",
      "date": "",
      "duration": "",
      "guests": "",
      "location": ""
    },
    "fulfillmentText": "I can help with that. Where would you like to reserve a room?",
    "fulfillmentMessages": [
      {
        "text": {
          "text": [
            "I can help with that. Where would you like to reserve a room?"
          ]
        },
        "platform": "FACEBOOK"
      },
      {
        "text": {
          "text": [
            "I can help with that. Where would you like to reserve a room?"
          ]
        }
      }
    ],
    "outputContexts": [
      {
        "name": "projects/project-id/agent/sessions/123456789/contexts/e8f6a63e-73da-4a1a-8bfc-857183f71228_id_dialog_context",
        "lifespanCount": 2,
        "parameters": {
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests.original": "",
          "location.original": "",
          "duration": "",
          "guests": "",
          "location": "",
          "date.original": ""
        }
      },
      {
        "name": "projects/project-id/agent/sessions/123456789/contexts/room_reservation_dialog_params_location",
        "lifespanCount": 1,
        "parameters": {
          "date.original": "",
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests": "",
          "duration": "",
          "location.original": "",
          "guests.original": "",
          "location": ""
        }
      },
      {
        "name": "projects/project-id/agent/sessions/123456789/contexts/room_reservation_dialog_context",
        "lifespanCount": 2,
        "parameters": {
          "date.original": "",
          "time.original": "",
          "time": "",
          "duration.original": "",
          "date": "",
          "guests.original": "",
          "guests": "",
          "duration": "",
          "location.original": "",
          "location": ""
        }
      }
    ],
    "intent": {
      "name": "projects/project-id/agent/intents/e8f6a63e-73da-4a1a-8bfc-857183f71228",
      "displayName": "room.reservation"
    },
    "intentDetectionConfidence": 1,
    "diagnosticInfo": {},
    "languageCode": "en-us"
  },
  "outputAudio": "UklGRs6vAgBXQVZFZm10IBAAAAABAAEAwF0AAIC7AA..."
}

Notice that the value of the queryResult.action field is room.reservation, and the outputAudio field contains a large base64 audio string.
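Once the response has been parsed into a dictionary, the fields highlighted above can be pulled out with ordinary lookups. The helper below is a hypothetical sketch, not part of any client library; it uses .get so that a response without output audio does not raise an error.

```python
def summarize_response(response):
    """Pull the fields this guide highlights out of a parsed JSON response."""
    query_result = response.get("queryResult", {})
    return {
        "query_text": query_result.get("queryText", ""),
        "action": query_result.get("action", ""),
        "fulfillment_text": query_result.get("fulfillmentText", ""),
        # Length of the base64 string, not of the decoded audio.
        "audio_base64_chars": len(response.get("outputAudio", "")),
    }
```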

3. Play output audio

Copy the text from the outputAudio field and save it in a file named output_audio.b64. This file needs to be converted to audio. Here is a Linux example:

base64 -d output_audio.b64 > output_audio.wav

For examples on other platforms, see Decoding Base64-Encoded Audio Content in the Text-to-Speech API documentation.
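The decoding step can also be done in Python. The sketch below decodes the outputAudio string, writes it to a file, and reads back basic properties with the standard wave module. It assumes the LINEAR16 output is delivered with a WAV header, which the .wav file name in this guide suggests; both function names are hypothetical.

```python
import base64
import io
import wave


def save_output_audio(output_audio_b64, path="output_audio.wav"):
    """Decode the base64 outputAudio string and write it to a file."""
    audio_bytes = base64.b64decode(output_audio_b64)
    with open(path, "wb") as out:
        out.write(audio_bytes)
    return audio_bytes


def describe_wav(audio_bytes):
    """Return (channels, sample_width_bytes, frame_rate) of WAV data."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()
```

Checking the header this way is a quick sanity test that the copied string was complete before you try to play the file.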

You can now play the output_audio.wav audio file and hear that it matches the text from the queryResult.fulfillmentMessages[1].text.text[0] field above. The second fulfillmentMessages element is chosen because it is the text response for the default platform.


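Java

import com.google.cloud.dialogflow.v2.DetectIntentRequest;
import com.google.cloud.dialogflow.v2.DetectIntentResponse;
import com.google.cloud.dialogflow.v2.OutputAudioConfig;
import com.google.cloud.dialogflow.v2.OutputAudioEncoding;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.TextInput;
import com.google.common.collect.Maps;
import java.util.List;
import java.util.Map;

public class DetectIntentWithTextToSpeechResponse {
  /**
   * Returns the result of detect intent with texts as inputs.
   *
   * <p>Using the same `session_id` between requests allows continuation of the conversation.
   *
   * @param projectId    Project/Agent Id.
   * @param texts        The text intents to be detected based on what a user says.
   * @param sessionId    Identifier of the DetectIntent session.
   * @param languageCode Language code of the query.
   * @return The QueryResult for each input text.
   */
  public static Map<String, QueryResult> detectIntentWithTexttoSpeech(
      String projectId,
      List<String> texts,
      String sessionId,
      String languageCode) throws Exception {
    Map<String, QueryResult> queryResults = Maps.newHashMap();
    // Instantiates a client
    try (SessionsClient sessionsClient = SessionsClient.create()) {
      // Set the session name using the sessionId (UUID) and projectID (my-project-id)
      SessionName session = SessionName.of(projectId, sessionId);
      System.out.println("Session Path: " + session.toString());

      // Detect intents for each text input
      for (String text : texts) {
        // Set the text (hello) and language code (en-US) for the query
        TextInput.Builder textInput =
            TextInput.newBuilder().setText(text).setLanguageCode(languageCode);

        // Build the query with the TextInput
        QueryInput queryInput = QueryInput.newBuilder().setText(textInput).build();

        // Request 16 kHz LINEAR16 audio in the response
        OutputAudioEncoding audioEncoding = OutputAudioEncoding.OUTPUT_AUDIO_ENCODING_LINEAR_16;
        int sampleRateHertz = 16000;
        OutputAudioConfig outputAudioConfig =
            OutputAudioConfig.newBuilder()
                .setAudioEncoding(audioEncoding)
                .setSampleRateHertz(sampleRateHertz)
                .build();

        DetectIntentRequest dr =
            DetectIntentRequest.newBuilder()
                .setQueryInput(queryInput)
                .setOutputAudioConfig(outputAudioConfig)
                .setSession(session.toString())
                .build();

        // Performs the detect intent request
        DetectIntentResponse response = sessionsClient.detectIntent(dr);

        // Display the query result
        QueryResult queryResult = response.getQueryResult();

        System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
        System.out.format(
            "Detected Intent: %s (confidence: %f)\n",
            queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
        System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());

        queryResults.put(text, queryResult);
      }
    }
    return queryResults;
  }
}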


Node.js

// Imports the Dialogflow client library
const dialogflow = require('dialogflow').v2;
const fs = require(`fs`);
const util = require(`util`);

// Instantiate a Dialogflow client.
const sessionClient = new dialogflow.SessionsClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = 'ID of GCP project associated with your Dialogflow agent';
// const sessionId = `user specific ID of session, e.g. 12345`;
// const query = `phrase(s) to pass to detect, e.g. I'd like to reserve a room for six people`;
// const languageCode = 'BCP-47 language code, e.g. en-US';
// const outputFile = `path for audio output file, e.g. ./resources/myOutput.wav`;

// Define session path
const sessionPath = sessionClient.sessionPath(projectId, sessionId);

async function detectIntentwithTTSResponse() {
  // The text query request, with audio output requested
  const request = {
    session: sessionPath,
    queryInput: {
      text: {
        text: query,
        languageCode: languageCode,
      },
    },
    outputAudioConfig: {
      audioEncoding: `OUTPUT_AUDIO_ENCODING_LINEAR_16`,
    },
  };
  const [response] = await sessionClient.detectIntent(request);
  console.log('Detected intent:');
  const audioFile = response.outputAudio;
  // Wait for the write to finish before reporting success
  await util.promisify(fs.writeFile)(outputFile, audioFile, 'binary');
  console.log(`Audio content written to file: ${outputFile}`);
}
detectIntentwithTTSResponse();


Python

def detect_intent_with_texttospeech_response(project_id, session_id, texts,
                                             language_code):
    """Returns the result of detect intent with texts as inputs and includes
    the response in an audio format.

    Using the same `session_id` between requests allows continuation
    of the conversation."""
    import dialogflow_v2 as dialogflow
    session_client = dialogflow.SessionsClient()

    session_path = session_client.session_path(project_id, session_id)
    print('Session path: {}\n'.format(session_path))

    for text in texts:
        text_input = dialogflow.types.TextInput(
            text=text, language_code=language_code)

        query_input = dialogflow.types.QueryInput(text=text_input)

        # Request LINEAR16 audio output in the response
        output_audio_config = dialogflow.types.OutputAudioConfig(
            audio_encoding=dialogflow.enums.OutputAudioEncoding
            .OUTPUT_AUDIO_ENCODING_LINEAR_16)

        response = session_client.detect_intent(
            session=session_path, query_input=query_input,
            output_audio_config=output_audio_config)

        print('=' * 20)
        print('Query text: {}'.format(response.query_result.query_text))
        print('Detected intent: {} (confidence: {})\n'.format(
            response.query_result.intent.display_name,
            response.query_result.intent_detection_confidence))
        print('Fulfillment text: {}\n'.format(
            response.query_result.fulfillment_text))
        # response.output_audio holds the binary audio content.
        # Note: each input text overwrites the same output.wav file.
        with open('output.wav', 'wb') as out:
            out.write(response.output_audio)
            print('Audio content written to file "output.wav"')

See the Detect intent responses section for a description of the relevant response fields.

Detect intent responses

The response for a detect intent request is a DetectIntentResponse object.

Normal detect intent processing controls the content of the DetectIntentResponse.queryResult.fulfillmentMessages field.

The DetectIntentResponse.outputAudio field is populated with audio based on the values of default platform text responses found in the DetectIntentResponse.queryResult.fulfillmentMessages field. If multiple default text responses exist, they will be concatenated when generating audio. If no default platform text responses exist, the generated audio content will be empty.
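To see which text feeds the generated audio, you can filter the parsed response for default platform text messages. The sketch below works on the JSON form of queryResult shown earlier. In that JSON the default platform message simply has no platform key; treating an explicit PLATFORM_UNSPECIFIED value the same way is an assumption, and the helper name is hypothetical. How Dialogflow joins multiple texts is not described in this guide, so the function returns them as a list.

```python
def default_platform_texts(query_result):
    """Collect text from fulfillment messages that target the default platform."""
    texts = []
    for message in query_result.get("fulfillmentMessages", []):
        # Messages for a specific integration carry a "platform" value
        # such as "FACEBOOK"; default platform messages omit it.
        platform = message.get("platform", "PLATFORM_UNSPECIFIED")
        if platform != "PLATFORM_UNSPECIFIED":
            continue
        texts.extend(message.get("text", {}).get("text", []))
    return texts
```

An empty list from this helper corresponds to the case where the generated audio content is empty.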

The DetectIntentResponse.outputAudioConfig field is populated with audio settings used to generate the output audio.

Detect intent from a stream

When detecting intent from a stream, you send requests similar to those in the example that does not use output audio (see Detecting Intent from a Stream), except that you also supply an OutputAudioConfig field in the request. The output_audio and output_audio_config fields are populated in the very last streaming response that you get from the Dialogflow API server. For more information, see StreamingDetectIntentRequest and StreamingDetectIntentResponse.
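Because the audio arrives only in the final streaming response, client code typically iterates over the whole stream and keeps the last non-empty audio payload. The sketch below illustrates that pattern with stand-in objects rather than real API responses; the function name is hypothetical, and the only assumption about a response is that it exposes an output_audio attribute.

```python
from types import SimpleNamespace


def final_output_audio(responses):
    """Return the last non-empty output_audio seen in a response stream."""
    audio = b""
    for response in responses:
        chunk = getattr(response, "output_audio", b"")
        if chunk:
            audio = chunk
    return audio


# Stand-in objects for illustration only; a real stream comes from the API.
fake_stream = [
    SimpleNamespace(output_audio=b""),
    SimpleNamespace(output_audio=b""),
    SimpleNamespace(output_audio=b"RIFF-example-bytes"),
]
print(len(final_output_audio(fake_stream)))
```

Scanning every response instead of indexing the last one keeps the code safe when the stream is a generator that cannot be indexed.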

Agent settings for speech

Here are the agent settings for text to speech and voice configuration:

  • Text to Speech:
    • Enable Automatic Text To Speech: In the example above, the outputAudioConfig field needed to be supplied in order to trigger output audio. If you would like output audio for all detect intent requests, enable this setting.
    • Output Audio Encoding: Choose your desired output audio encoding when automatic text to speech is enabled.
  • Agent Voice Configuration:
    • Voice: Choose a voice generation model.
    • Speaking Rate: Adjusts the voice speaking rate.
    • Pitch: Adjusts the voice pitch.
    • Volume Gain: Adjusts the audio volume gain.
    • Audio Effects Profile: Select the audio effects profiles you want applied to the synthesized voice. Speech audio is optimized for the devices associated with the selected profiles (for example, headphones, large speaker, phone call). For more information, see Audio Profiles in the Text-to-Speech documentation.

To access agent settings for speech:

  1. Go to the Dialogflow Console.
  2. Select your agent.
  3. Click the settings (gear icon) button next to the agent name.
  4. Select the Speech tab.

Use the Dialogflow simulator

You can interact with the agent and receive audio responses via the Dialogflow simulator:

  1. Follow the steps above to enable automatic text to speech.
  2. Type or say "book a room" in the simulator.
  3. See the output audio section at the bottom of the simulator.