Detect intent with streaming audio input

This page shows you how to stream audio input to a detect intent request using the API. Dialogflow processes the audio and converts it to text before attempting an intent match. This conversion is known as audio input, speech recognition, speech-to-text, or STT.

Before you begin

This feature is only relevant when using the API for end-user interactions. If you are using an integration, you can skip this guide.

Do the following before reading this guide:

  1. Read Dialogflow basics.
  2. Perform the setup steps.

Create an agent

If you have not already created an agent, create one now:

  1. Go to the Dialogflow Console.
  2. If requested, sign in to the Dialogflow Console. See the Dialogflow Console overview for more information.
  3. Click Create Agent in the left sidebar menu. If you already have other agents, click the agent name, scroll to the bottom of the page, and click Create new agent.
  4. Enter your agent's name, default language, and default time zone.
  5. If you have already created a project, enter that project. If you want to allow the Dialogflow Console to create the project, select Create a new Google project.
  6. Click the Create button.

Import the example file to your agent

The steps in this guide make assumptions about your agent, so you need to import an agent prepared for this guide. When importing, these steps use the restore option, which overwrites all agent settings, intents, and entities.

To import the file, follow these steps:

  1. Download the room-booking-agent.zip file.
  2. Go to the Dialogflow Console.
  3. Select your agent.
  4. Click the settings button next to the agent name.
  5. Select the Export and Import tab.
  6. Select Restore from zip and follow the instructions to restore the zip file you downloaded.

Streaming basics

The Sessions type's streamingDetectIntent method returns a bidirectional gRPC streaming object. The available methods for this object vary by language, so see the reference documentation for your client library for details.

The streaming object is used to send and receive data concurrently. Using this object, your client streams audio content to Dialogflow while simultaneously listening for StreamingDetectIntentResponse messages.

The streamingDetectIntent method has a query_input.audio_config.single_utterance parameter that affects speech recognition:

  • If false (default), speech recognition does not stop until the client closes the stream.
  • If true, Dialogflow detects a single spoken utterance in the input audio. When Dialogflow detects that the audio's voice has stopped or paused, it ceases speech recognition and sends a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE to your client. Any audio sent to Dialogflow on the stream after receipt of END_OF_SINGLE_UTTERANCE is ignored by Dialogflow. A configuration sketch follows this list.
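
As a rough illustration of setting this parameter, the sketch below uses the legacy dialogflow_v2 Python client (the same client as the Python sample later on this page); the encoding, language code, and sample rate are assumptions chosen to match those samples:

import dialogflow_v2 as dialogflow

# single_utterance travels with the rest of the audio configuration in
# the first StreamingDetectIntentRequest of the stream.
audio_config = dialogflow.types.InputAudioConfig(
    audio_encoding=dialogflow.enums.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
    language_code='en-US',      # assumption: matches the samples below
    sample_rate_hertz=16000,    # assumption: matches the samples below
    single_utterance=True)      # stop recognition after one utterance
query_input = dialogflow.types.QueryInput(audio_config=audio_config)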

In bidirectional streaming, a client can close the stream object to signal to the server that it won't send more data. For example, in Java and Go this method is called closeSend. It is important to close (but not cancel) streams in the following situations:

  • Your client has finished sending data.
  • Your client is configured with single_utterance set to true, and it receives a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE.

After closing a stream, your client should start a new request with a new stream as needed. A sketch of handling END_OF_SINGLE_UTTERANCE follows.
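
When single_utterance is true, a response loop like the following minimal sketch (again using the dialogflow_v2 Python client) can react to END_OF_SINGLE_UTTERANCE; stop_sending is a hypothetical threading.Event your request generator would check before yielding more audio:

import dialogflow_v2 as dialogflow

def consume_responses(responses, stop_sending):
    # `responses` is the iterator returned by streaming_detect_intent;
    # `stop_sending` is a hypothetical threading.Event that the request
    # generator checks so that it stops yielding audio, which closes
    # the request side of the stream.
    end = dialogflow.enums.StreamingRecognitionResult.MessageType.END_OF_SINGLE_UTTERANCE
    for response in responses:
        if response.recognition_result.message_type == end:
            # Dialogflow ignores any further audio, so stop sending and
            # let the stream close; the final result arrives afterwards.
            stop_sending.set()
        if response.query_result.query_text:
            # The last response carries the final query result.
            return response.query_result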

Streaming detect intent

The following samples use the Sessions type's streamingDetectIntent method to stream audio.

C#

public static async Task<object> DetectIntentFromStreamAsync(
    string projectId,
    string sessionId,
    string filePath)
{
    var sessionsClient = SessionsClient.Create();
    var sessionName = SessionName.FromProjectSession(projectId, sessionId).ToString();

    // Initialize streaming call, retrieving the stream object
    var streamingDetectIntent = sessionsClient.StreamingDetectIntent();

    // Define a task to process results from the API
    var responseHandlerTask = Task.Run(async () =>
    {
        var responseStream = streamingDetectIntent.GetResponseStream();
        while (await responseStream.MoveNextAsync())
        {
            var response = responseStream.Current;
            var queryResult = response.QueryResult;

            if (queryResult != null)
            {
                Console.WriteLine($"Query text: {queryResult.QueryText}");
                if (queryResult.Intent != null)
                {
                    Console.Write("Intent detected:");
                    Console.WriteLine(queryResult.Intent.DisplayName);
                }
            }
        }
    });

    // Instructs the speech recognizer how to process the audio content.
    // Note: hard coding audioEncoding, sampleRateHertz for simplicity.
    var queryInput = new QueryInput
    {
        AudioConfig = new InputAudioConfig
        {
            AudioEncoding = AudioEncoding.Linear16,
            LanguageCode = "en-US",
            SampleRateHertz = 16000
        }
    };

    // The first request must **only** contain the audio configuration:
    await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
    {
        QueryInput = queryInput,
        Session = sessionName
    });

    using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
    {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in
        // fixed-size chunks. In reality you would split the user input
        // by time.
        var buffer = new byte[32 * 1024];
        int bytesRead;
        while ((bytesRead = await fileStream.ReadAsync(
            buffer, 0, buffer.Length)) > 0)
        {
            await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
            {
                Session = sessionName,
                InputAudio = Google.Protobuf.ByteString.CopyFrom(buffer, 0, bytesRead)
            });
        }
    }

    // Tell the service you are done sending data
    await streamingDetectIntent.WriteCompleteAsync();

    // This will complete once all server responses have been processed.
    await responseHandlerTask;

    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"

	dialogflow "cloud.google.com/go/dialogflow/apiv2"
	dialogflowpb "google.golang.org/genproto/googleapis/cloud/dialogflow/v2"
)

func DetectIntentStream(projectID, sessionID, audioFile, languageCode string) (string, error) {
	ctx := context.Background()

	sessionClient, err := dialogflow.NewSessionsClient(ctx)
	if err != nil {
		return "", err
	}
	defer sessionClient.Close()

	if projectID == "" || sessionID == "" {
		return "", fmt.Errorf("received empty project (%s) or session (%s)", projectID, sessionID)
	}

	sessionPath := fmt.Sprintf("projects/%s/agent/sessions/%s", projectID, sessionID)

	// In this example, we hard code the encoding and sample rate for simplicity.
	audioConfig := dialogflowpb.InputAudioConfig{AudioEncoding: dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16, SampleRateHertz: 16000, LanguageCode: languageCode}

	queryAudioInput := dialogflowpb.QueryInput_AudioConfig{AudioConfig: &audioConfig}

	queryInput := dialogflowpb.QueryInput{Input: &queryAudioInput}

	streamer, err := sessionClient.StreamingDetectIntent(ctx)
	if err != nil {
		return "", err
	}

	f, err := os.Open(audioFile)
	if err != nil {
		return "", err
	}

	defer f.Close()

	go func() {
		audioBytes := make([]byte, 1024)

		request := dialogflowpb.StreamingDetectIntentRequest{Session: sessionPath, QueryInput: &queryInput}
		err = streamer.Send(&request)
		if err != nil {
			log.Fatal(err)
		}

		for {
			n, err := f.Read(audioBytes)
			if err == io.EOF {
				streamer.CloseSend()
				break
			}
			if err != nil {
				log.Fatal(err)
			}

			// Send only the bytes actually read; reusing the full buffer
			// would resend stale data from previous reads.
			request = dialogflowpb.StreamingDetectIntentRequest{InputAudio: audioBytes[:n]}
			err = streamer.Send(&request)
			if err != nil {
				log.Fatal(err)
			}
		}
	}()

	var queryResult *dialogflowpb.QueryResult

	for {
		response, err := streamer.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}

		recognitionResult := response.GetRecognitionResult()
		transcript := recognitionResult.GetTranscript()
		log.Printf("Recognition transcript: %s\n", transcript)

		queryResult = response.GetQueryResult()
	}

	fulfillmentText := queryResult.GetFulfillmentText()
	return fulfillmentText, nil
}

Java


import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.BidiStream;
import com.google.cloud.dialogflow.v2.AudioEncoding;
import com.google.cloud.dialogflow.v2.InputAudioConfig;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentRequest;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentResponse;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;

class DetectIntentStream {

  // Dialogflow API detect intent sample that processes an audio file as an audio stream.
  static void detectIntentStream(String projectId, String audioFilePath, String sessionId)
      throws IOException, ApiException {
    // String projectId = "YOUR_PROJECT_ID";
    // String audioFilePath = "path_to_your_audio_file";
    // Using the same `sessionId` between requests allows continuation of the conversation.
    // String sessionId = "Identifier of the DetectIntent session";

    // Instantiates a client
    try (SessionsClient sessionsClient = SessionsClient.create()) {
      // Set the session name using the sessionId (UUID) and projectID (my-project-id)
      SessionName session = SessionName.of(projectId, sessionId);

      // Instructs the speech recognizer how to process the audio content.
      // Note: hard coding audioEncoding and sampleRateHertz for simplicity.
      // Audio encoding of the audio content sent in the query request.
      InputAudioConfig inputAudioConfig =
          InputAudioConfig.newBuilder()
              .setAudioEncoding(AudioEncoding.AUDIO_ENCODING_LINEAR_16)
              .setLanguageCode("en-US") // languageCode = "en-US"
              .setSampleRateHertz(16000) // sampleRateHertz = 16000
              .build();

      // Build the query with the InputAudioConfig
      QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();

      // Create the Bidirectional stream
      BidiStream<StreamingDetectIntentRequest, StreamingDetectIntentResponse> bidiStream =
          sessionsClient.streamingDetectIntentCallable().call();

      // The first request must **only** contain the audio configuration:
      bidiStream.send(
          StreamingDetectIntentRequest.newBuilder()
              .setSession(session.toString())
              .setQueryInput(queryInput)
              .build());

      try (FileInputStream audioStream = new FileInputStream(audioFilePath)) {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in fixed-size chunks. In reality
        // you would split the user input by time.
        byte[] buffer = new byte[4096];
        int bytes;
        while ((bytes = audioStream.read(buffer)) != -1) {
          bidiStream.send(
              StreamingDetectIntentRequest.newBuilder()
                  .setInputAudio(ByteString.copyFrom(buffer, 0, bytes))
                  .build());
        }
      }

      // Tell the service you are done sending data
      bidiStream.closeSend();

      for (StreamingDetectIntentResponse response : bidiStream) {
        QueryResult queryResult = response.getQueryResult();
        System.out.println("====================");
        System.out.format("Intent Display Name: %s\n", queryResult.getIntent().getDisplayName());
        System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
        System.out.format(
            "Detected Intent: %s (confidence: %f)\n",
            queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
        System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());
      }
    }
  }
}

Node.js

const fs = require('fs');
const util = require('util');
const {Transform, pipeline} = require('stream');
const {struct} = require('pb-util');

const pump = util.promisify(pipeline);
// Imports the Dialogflow library
const dialogflow = require('@google-cloud/dialogflow');

// Instantiates a session client
const sessionClient = new dialogflow.SessionsClient();

// The path to the local file on which to perform speech recognition, e.g.
// /path/to/audio.raw
// const filename = '/path/to/audio.raw';

// The encoding of the audio file, e.g. 'AUDIO_ENCODING_LINEAR_16'
// const encoding = 'AUDIO_ENCODING_LINEAR_16';

// The sample rate of the audio file in hertz, e.g. 16000
// const sampleRateHertz = 16000;

// The BCP-47 language code to use, e.g. 'en-US'
// const languageCode = 'en-US';

// The project and session IDs used to build the session path, e.g.
// const projectId = 'my-project';
// const sessionId = 'some-session-id';
const sessionPath = sessionClient.projectAgentSessionPath(
  projectId,
  sessionId
);

const initialStreamRequest = {
  session: sessionPath,
  queryInput: {
    audioConfig: {
      audioEncoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    },
    singleUtterance: true,
  },
};

// Create a stream for the streaming request.
const detectStream = sessionClient
  .streamingDetectIntent()
  .on('error', console.error)
  .on('data', data => {
    if (data.recognitionResult) {
      console.log(
        `Intermediate transcript: ${data.recognitionResult.transcript}`
      );
    } else {
      console.log('Detected intent:');

      const result = data.queryResult;
      // Instantiates a context client
      const contextClient = new dialogflow.ContextsClient();

      console.log(`  Query: ${result.queryText}`);
      console.log(`  Response: ${result.fulfillmentText}`);
      if (result.intent) {
        console.log(`  Intent: ${result.intent.displayName}`);
      } else {
        console.log('  No intent matched.');
      }
      const parameters = JSON.stringify(struct.decode(result.parameters));
      console.log(`  Parameters: ${parameters}`);
      if (result.outputContexts && result.outputContexts.length) {
        console.log('  Output contexts:');
        result.outputContexts.forEach(context => {
          const contextId = contextClient.matchContextFromProjectAgentSessionContextName(
            context.name
          );
          const contextParameters = JSON.stringify(
            struct.decode(context.parameters)
          );
          console.log(`    ${contextId}`);
          console.log(`      lifespan: ${context.lifespanCount}`);
          console.log(`      parameters: ${contextParameters}`);
        });
      }
    }
  });

// Write the initial stream request to config for audio input.
detectStream.write(initialStreamRequest);

// Stream an audio file from disk to the Conversation API, e.g.
// "./resources/audio.raw". Note that `await` requires this code to
// run inside an async function.
await pump(
  fs.createReadStream(filename),
  // Format the audio stream into the request format.
  new Transform({
    objectMode: true,
    transform: (obj, _, next) => {
      next(null, {inputAudio: obj});
    },
  }),
  detectStream
);

PHP

namespace Google\Cloud\Samples\Dialogflow;

use Google\Cloud\Dialogflow\V2\SessionsClient;
use Google\Cloud\Dialogflow\V2\AudioEncoding;
use Google\Cloud\Dialogflow\V2\InputAudioConfig;
use Google\Cloud\Dialogflow\V2\QueryInput;
use Google\Cloud\Dialogflow\V2\StreamingDetectIntentRequest;

/**
 * Returns the result of detect intent with streaming audio as input.
 * Using the same `session_id` between requests allows continuation
 * of the conversation.
 */
function detect_intent_stream($projectId, $path, $sessionId, $languageCode = 'en-US')
{
    // need to use gRPC
    if (!defined('Grpc\STATUS_OK')) {
        throw new \Exception('Install the grpc extension ' .
            '(pecl install grpc)');
    }

    // new session
    $sessionsClient = new SessionsClient();
    $session = $sessionsClient->sessionName($projectId, $sessionId ?: uniqid());
    printf('Session path: %s' . PHP_EOL, $session);

    // hard coding audio_encoding and sample_rate_hertz for simplicity
    $audioConfig = new InputAudioConfig();
    $audioConfig->setAudioEncoding(AudioEncoding::AUDIO_ENCODING_LINEAR_16);
    $audioConfig->setLanguageCode($languageCode);
    $audioConfig->setSampleRateHertz(16000);

    // create query input
    $queryInput = new QueryInput();
    $queryInput->setAudioConfig($audioConfig);

    // first request contains the configuration
    $request = new StreamingDetectIntentRequest();
    $request->setSession($session);
    $request->setQueryInput($queryInput);
    $requests = [$request];

    // we are going to read small chunks of audio data from
    // a local audio file. in practice, these chunks should
    // come from an audio input device.
    $audioStream = fopen($path, 'rb');
    while (true) {
        $chunk = stream_get_contents($audioStream, 4096);
        if (!$chunk) {
            break;
        }
        $request = new StreamingDetectIntentRequest();
        $request->setInputAudio($chunk);
        $requests[] = $request;
    }

    // intermediate transcript info
    print(PHP_EOL . str_repeat("=", 20) . PHP_EOL);
    $stream = $sessionsClient->streamingDetectIntent();
    foreach ($requests as $request) {
        $stream->write($request);
    }
    foreach ($stream->closeWriteAndReadAll() as $response) {
        $recognitionResult = $response->getRecognitionResult();
        if ($recognitionResult) {
            $transcript = $recognitionResult->getTranscript();
            printf('Intermediate transcript: %s' . PHP_EOL, $transcript);
        }
    }

    // get final response and relevant info
    if ($response) {
        print(str_repeat("=", 20) . PHP_EOL);
        $queryResult = $response->getQueryResult();
        $queryText = $queryResult->getQueryText();
        $intent = $queryResult->getIntent();
        $displayName = $intent->getDisplayName();
        $confidence = $queryResult->getIntentDetectionConfidence();
        $fulfilmentText = $queryResult->getFulfillmentText();

        // output relevant info
        printf('Query text: %s' . PHP_EOL, $queryText);
        printf('Detected intent: %s (confidence: %f)' . PHP_EOL, $displayName,
            $confidence);
        print(PHP_EOL);
        printf('Fulfilment text: %s' . PHP_EOL, $fulfilmentText);
    }

    $sessionsClient->close();
}

Python

def detect_intent_stream(project_id, session_id, audio_file_path,
                         language_code):
    """Returns the result of detect intent with streaming audio as input.

    Using the same `session_id` between requests allows continuation
    of the conversation."""
    import dialogflow_v2 as dialogflow
    session_client = dialogflow.SessionsClient()

    # Note: hard coding audio_encoding and sample_rate_hertz for simplicity.
    audio_encoding = dialogflow.enums.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = 16000

    session_path = session_client.session_path(project_id, session_id)
    print('Session path: {}\n'.format(session_path))

    def request_generator(audio_config, audio_file_path):
        query_input = dialogflow.types.QueryInput(audio_config=audio_config)

        # The first request contains the configuration.
        yield dialogflow.types.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input)

        # Here we are reading small chunks of audio data from a local
        # audio file.  In practice these chunks should come from
        # an audio input device.
        with open(audio_file_path, 'rb') as audio_file:
            while True:
                chunk = audio_file.read(4096)
                if not chunk:
                    break
                # The later requests contains audio data.
                yield dialogflow.types.StreamingDetectIntentRequest(
                    input_audio=chunk)

    audio_config = dialogflow.types.InputAudioConfig(
        audio_encoding=audio_encoding, language_code=language_code,
        sample_rate_hertz=sample_rate_hertz)

    requests = request_generator(audio_config, audio_file_path)
    responses = session_client.streaming_detect_intent(requests)

    print('=' * 20)
    for response in responses:
        print('Intermediate transcript: "{}".'.format(
                response.recognition_result.transcript))

    # Note: The result from the last response is the final transcript along
    # with the detected content.
    query_result = response.query_result

    print('=' * 20)
    print('Query text: {}'.format(query_result.query_text))
    print('Detected intent: {} (confidence: {})\n'.format(
        query_result.intent.display_name,
        query_result.intent_detection_confidence))
    print('Fulfillment text: {}\n'.format(
        query_result.fulfillment_text))

Ruby

# project_id = "Your Google Cloud project ID"
# session_id = "mysession"
# audio_file_path = "resources/book_a_room.wav"
# language_code = "en-US"

require "google/cloud/dialogflow"
require "monitor"

session_client = Google::Cloud::Dialogflow.sessions
session = session_client.session_path project: project_id,
                                      session: session_id
puts "Session path: #{session}"

audio_config = {
  audio_encoding:    :AUDIO_ENCODING_LINEAR_16,
  sample_rate_hertz: 16_000,
  language_code:     language_code
}
query_input = { audio_config: audio_config }
streaming_config = { session: session, query_input: query_input }

# Set up a stream of audio data
request_stream = Gapic::StreamInput.new

# Initiate the call
response_stream = session_client.streaming_detect_intent request_stream

# Process the response stream in a separate thread
response_thread = Thread.new do
  response_stream.each do |response|
    if response.recognition_result
      puts "Intermediate transcript: #{response.recognition_result.transcript}\n"
    else
      # the last response has the actual query result
      query_result = response.query_result
      puts "Query text:        #{query_result.query_text}"
      puts "Intent detected:   #{query_result.intent.display_name}"
      puts "Intent confidence: #{query_result.intent_detection_confidence}"
      puts "Fulfillment text:  #{query_result.fulfillment_text}\n"
    end
  end
end

# The first request needs to be the configuration.
request_stream.push streaming_config

# Send chunks of audio data in the request stream
begin
  audio_file = File.open audio_file_path, "rb"
  loop do
    chunk = audio_file.read 4096
    break unless chunk
    request_stream.push input_audio: chunk
    sleep 0.5
  end
ensure
  audio_file&.close
  # Close the request stream to signal that you are finished sending data.
  request_stream.close
end

# Wait until the response processing thread is complete.
response_thread.join

Samples

See the samples page for best practices on streaming from a browser microphone to Dialogflow.