
Detect intent with streaming audio input

This page shows you how to stream audio input to a detect intent request using the API. Dialogflow processes the audio and converts it to text before attempting an intent match. This conversion is known as audio input, speech recognition, speech-to-text, or STT.

Before you begin

This feature is only applicable when using the API for end-user interactions. If you are using an integration, you can skip this guide.

Do the following before reading this guide:

  1. Read Dialogflow basics.
  2. Perform the setup steps.

Create an agent

If you have not already created an agent, create one now:

  1. Go to the Dialogflow ES Console.
  2. If requested, sign in to the Dialogflow Console. See the Dialogflow console overview for more information.
  3. Click Create agent in the left sidebar menu. If you already have other agents, click the agent name, scroll to the bottom of the page, and click Create new agent.
  4. Enter your agent's name, default language, and default time zone.
  5. If you have already created a project, enter that project. If you want to allow the Dialogflow Console to create the project, select Create a new Google project.
  6. Click the Create button.

Import the example file to your agent

The steps in this guide make assumptions about your agent, so you need to import an agent prepared for this guide. When importing, these steps use the restore option, which overwrites all agent settings, intents, and entities.

To import the file, follow these steps:

  1. Download the room-booking-agent.zip file.
  2. Go to the Dialogflow ES Console.
  3. Select your agent.
  4. Click the settings button next to the agent name.
  5. Select the Export and Import tab.
  6. Select Restore from zip and follow the instructions to restore the zip file that you downloaded.

Streaming basics

The Sessions type's streamingDetectIntent method returns a bidirectional gRPC streaming object. The available methods for this object vary by language, so see the reference documentation for your client library for details.

The streaming object is used to both send and receive data concurrently. Using this object, your client streams audio content to Dialogflow, while simultaneously listening for StreamingDetectIntentResponse messages.

The streamingDetectIntent method has a query_input.audio_config.single_utterance parameter that affects speech recognition:

  • If false (the default), speech recognition does not stop until the client closes the stream.
  • If true, Dialogflow detects a single spoken utterance in the input audio. When Dialogflow detects that the audio's voice has stopped or paused, it ceases speech recognition and sends a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE to your client. Any audio sent to Dialogflow on the stream after receipt of END_OF_SINGLE_UTTERANCE is ignored by Dialogflow.
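As a minimal sketch of this configuration using the Python client library (the language code and sample rate are placeholder values, matching the samples below):

```python
from google.cloud import dialogflow

# single_utterance=True tells Dialogflow to stop recognition after the
# first detected utterance and send END_OF_SINGLE_UTTERANCE.
audio_config = dialogflow.InputAudioConfig(
    audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
    language_code="en-US",
    sample_rate_hertz=16000,
    single_utterance=True,
)

# This configuration goes in the first StreamingDetectIntentRequest.
query_input = dialogflow.QueryInput(audio_config=audio_config)
```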

In bidirectional streaming, a client can close the stream object to signal to the server that it will not send more data. For example, in Java and Go this method is called closeSend. It is important to close (but not cancel) streams in the following situations:

  • Your client has finished sending data.
  • Your client is configured with single_utterance set to true, and it receives a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE.

After closing a stream, your client should start a new request with a new stream as needed.
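In client libraries that take a request iterator (such as the Python sample below), exhausting the iterator is what closes the send side. The following library-free sketch illustrates the pattern: plain dicts stand in for StreamingDetectIntentRequest messages, and stop_sending would be set by the response loop when END_OF_SINGLE_UTTERANCE arrives.

```python
import threading

def request_generator(session_path, audio_config, chunks, stop_sending):
    # The first request carries only the session and audio configuration.
    yield {"session": session_path,
           "query_input": {"audio_config": audio_config}}
    for chunk in chunks:
        if stop_sending.is_set():
            # Stop yielding: generator exhaustion closes the send side,
            # analogous to closeSend in Java and Go.
            break
        yield {"input_audio": chunk}

stop_sending = threading.Event()
requests = request_generator(
    "projects/my-project/agent/sessions/my-session",  # hypothetical path
    {"single_utterance": True},
    [b"chunk-1", b"chunk-2", b"chunk-3"],
    stop_sending,
)
config_request = next(requests)  # configuration request
first_audio = next(requests)     # first audio chunk
stop_sending.set()               # as if END_OF_SINGLE_UTTERANCE arrived
leftover = list(requests)        # remaining chunks are never sent
```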

Streaming detect intent

The following samples use the Sessions type's streamingDetectIntent method to stream audio.

C#

public static async Task<object> DetectIntentFromStreamAsync(
    string projectId,
    string sessionId,
    string filePath)
{
    var sessionsClient = SessionsClient.Create();
    var sessionName = SessionName.FromProjectSession(projectId, sessionId).ToString();

    // Initialize streaming call, retrieving the stream object
    var streamingDetectIntent = sessionsClient.StreamingDetectIntent();

    // Define a task to process results from the API
    var responseHandlerTask = Task.Run(async () =>
    {
        var responseStream = streamingDetectIntent.GetResponseStream();
        while (await responseStream.MoveNextAsync())
        {
            var response = responseStream.Current;
            var queryResult = response.QueryResult;

            if (queryResult != null)
            {
                Console.WriteLine($"Query text: {queryResult.QueryText}");
                if (queryResult.Intent != null)
                {
                    Console.Write("Intent detected:");
                    Console.WriteLine(queryResult.Intent.DisplayName);
                }
            }
        }
    });

    // Instructs the speech recognizer how to process the audio content.
    // Note: hard coding audioEncoding, sampleRateHertz for simplicity.
    var queryInput = new QueryInput
    {
        AudioConfig = new InputAudioConfig
        {
            AudioEncoding = AudioEncoding.Linear16,
            LanguageCode = "en-US",
            SampleRateHertz = 16000
        }
    };

    // The first request must **only** contain the audio configuration:
    await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
    {
        QueryInput = queryInput,
        Session = sessionName
    });

    using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
    {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in
        // fixed-size chunks. In reality you would split the user input
        // by time.
        var buffer = new byte[32 * 1024];
        int bytesRead;
        while ((bytesRead = await fileStream.ReadAsync(
            buffer, 0, buffer.Length)) > 0)
        {
            await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
            {
                Session = sessionName,
                InputAudio = Google.Protobuf.ByteString.CopyFrom(buffer, 0, bytesRead)
            });
        };
    }

    // Tell the service you are done sending data
    await streamingDetectIntent.WriteCompleteAsync();

    // This will complete once all server responses have been processed.
    await responseHandlerTask;

    return 0;
}

Go

func DetectIntentStream(projectID, sessionID, audioFile, languageCode string) (string, error) {
	ctx := context.Background()

	sessionClient, err := dialogflow.NewSessionsClient(ctx)
	if err != nil {
		return "", err
	}
	defer sessionClient.Close()

	if projectID == "" || sessionID == "" {
		return "", fmt.Errorf("received empty project (%s) or session (%s)", projectID, sessionID)
	}

	sessionPath := fmt.Sprintf("projects/%s/agent/sessions/%s", projectID, sessionID)

	// In this example, we hard code the encoding and sample rate for simplicity.
	audioConfig := dialogflowpb.InputAudioConfig{AudioEncoding: dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16, SampleRateHertz: 16000, LanguageCode: languageCode}

	queryAudioInput := dialogflowpb.QueryInput_AudioConfig{AudioConfig: &audioConfig}

	queryInput := dialogflowpb.QueryInput{Input: &queryAudioInput}

	streamer, err := sessionClient.StreamingDetectIntent(ctx)
	if err != nil {
		return "", err
	}

	f, err := os.Open(audioFile)
	if err != nil {
		return "", err
	}

	defer f.Close()

	go func() {
		audioBytes := make([]byte, 1024)

		request := dialogflowpb.StreamingDetectIntentRequest{Session: sessionPath, QueryInput: &queryInput}
		err = streamer.Send(&request)
		if err != nil {
			log.Fatal(err)
		}

		for {
			n, err := f.Read(audioBytes)
			if err == io.EOF {
				// Close the send side once the audio file is exhausted.
				streamer.CloseSend()
				break
			}
			if err != nil {
				log.Fatal(err)
			}

			// Send only the bytes actually read from the file.
			request = dialogflowpb.StreamingDetectIntentRequest{InputAudio: audioBytes[:n]}
			err = streamer.Send(&request)
			if err != nil {
				log.Fatal(err)
			}
		}
	}()

	var queryResult *dialogflowpb.QueryResult

	for {
		response, err := streamer.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}

		recognitionResult := response.GetRecognitionResult()
		transcript := recognitionResult.GetTranscript()
		log.Printf("Recognition transcript: %s\n", transcript)

		queryResult = response.GetQueryResult()
	}

	fulfillmentText := queryResult.GetFulfillmentText()
	return fulfillmentText, nil
}

Java


import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.BidiStream;
import com.google.cloud.dialogflow.v2.AudioEncoding;
import com.google.cloud.dialogflow.v2.InputAudioConfig;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentRequest;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentResponse;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;

class DetectIntentStream {

  // DialogFlow API Detect Intent sample with audio files processes as an audio stream.
  static void detectIntentStream(String projectId, String audioFilePath, String sessionId)
      throws IOException, ApiException {
    // String projectId = "YOUR_PROJECT_ID";
    // String audioFilePath = "path_to_your_audio_file";
    // Using the same `sessionId` between requests allows continuation of the conversation.
    // String sessionId = "Identifier of the DetectIntent session";

    // Instantiates a client
    try (SessionsClient sessionsClient = SessionsClient.create()) {
      // Set the session name using the sessionId (UUID) and projectID (my-project-id)
      SessionName session = SessionName.of(projectId, sessionId);

      // Instructs the speech recognizer how to process the audio content.
      // Note: hard coding audioEncoding and sampleRateHertz for simplicity.
      // Audio encoding of the audio content sent in the query request.
      InputAudioConfig inputAudioConfig =
          InputAudioConfig.newBuilder()
              .setAudioEncoding(AudioEncoding.AUDIO_ENCODING_LINEAR_16)
              .setLanguageCode("en-US") // languageCode = "en-US"
              .setSampleRateHertz(16000) // sampleRateHertz = 16000
              .build();

      // Build the query with the InputAudioConfig
      QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();

      // Create the Bidirectional stream
      BidiStream<StreamingDetectIntentRequest, StreamingDetectIntentResponse> bidiStream =
          sessionsClient.streamingDetectIntentCallable().call();

      // The first request must **only** contain the audio configuration:
      bidiStream.send(
          StreamingDetectIntentRequest.newBuilder()
              .setSession(session.toString())
              .setQueryInput(queryInput)
              .build());

      try (FileInputStream audioStream = new FileInputStream(audioFilePath)) {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in fixed-size chunks. In reality
        // you would split the user input by time.
        byte[] buffer = new byte[4096];
        int bytes;
        while ((bytes = audioStream.read(buffer)) != -1) {
          bidiStream.send(
              StreamingDetectIntentRequest.newBuilder()
                  .setInputAudio(ByteString.copyFrom(buffer, 0, bytes))
                  .build());
        }
      }

      // Tell the service you are done sending data
      bidiStream.closeSend();

      for (StreamingDetectIntentResponse response : bidiStream) {
        QueryResult queryResult = response.getQueryResult();
        System.out.println("====================");
        System.out.format("Intent Display Name: %s\n", queryResult.getIntent().getDisplayName());
        System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
        System.out.format(
            "Detected Intent: %s (confidence: %f)\n",
            queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
        System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());
      }
    }
  }
}

Node.js

const fs = require('fs');
const util = require('util');
const {Transform, pipeline} = require('stream');
const {struct} = require('pb-util');

const pump = util.promisify(pipeline);
// Imports the Dialogflow library
const dialogflow = require('@google-cloud/dialogflow');

// Instantiates a session client
const sessionClient = new dialogflow.SessionsClient();

// The path to the local file on which to perform speech recognition, e.g.
// /path/to/audio.raw
// const filename = '/path/to/audio.raw';

// The encoding of the audio file, e.g. 'AUDIO_ENCODING_LINEAR_16'
// const encoding = 'AUDIO_ENCODING_LINEAR_16';

// The sample rate of the audio file in hertz, e.g. 16000
// const sampleRateHertz = 16000;

// The BCP-47 language code to use, e.g. 'en-US'
// const languageCode = 'en-US';
const sessionPath = sessionClient.projectAgentSessionPath(
  projectId,
  sessionId
);

const initialStreamRequest = {
  session: sessionPath,
  queryInput: {
    audioConfig: {
      audioEncoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    },
    singleUtterance: true,
  },
};

// Create a stream for the streaming request.
const detectStream = sessionClient
  .streamingDetectIntent()
  .on('error', console.error)
  .on('data', data => {
    if (data.recognitionResult) {
      console.log(
        `Intermediate transcript: ${data.recognitionResult.transcript}`
      );
    } else {
      console.log('Detected intent:');

      const result = data.queryResult;
      // Instantiates a context client
      const contextClient = new dialogflow.ContextsClient();

      console.log(`  Query: ${result.queryText}`);
      console.log(`  Response: ${result.fulfillmentText}`);
      if (result.intent) {
        console.log(`  Intent: ${result.intent.displayName}`);
      } else {
        console.log('  No intent matched.');
      }
      const parameters = JSON.stringify(struct.decode(result.parameters));
      console.log(`  Parameters: ${parameters}`);
      if (result.outputContexts && result.outputContexts.length) {
        console.log('  Output contexts:');
        result.outputContexts.forEach(context => {
          const contextId = contextClient.matchContextFromProjectAgentSessionContextName(
            context.name
          );
          const contextParameters = JSON.stringify(
            struct.decode(context.parameters)
          );
          console.log(`    ${contextId}`);
          console.log(`      lifespan: ${context.lifespanCount}`);
          console.log(`      parameters: ${contextParameters}`);
        });
      }
    }
  });

// Write the initial stream request to config for audio input.
detectStream.write(initialStreamRequest);

// Stream an audio file from disk to the Conversation API, e.g.
// "./resources/audio.raw"
await pump(
  fs.createReadStream(filename),
  // Format the audio stream into the request format.
  new Transform({
    objectMode: true,
    transform: (obj, _, next) => {
      next(null, {inputAudio: obj});
    },
  }),
  detectStream
);

PHP

namespace Google\Cloud\Samples\Dialogflow;

use Google\Cloud\Dialogflow\V2\SessionsClient;
use Google\Cloud\Dialogflow\V2\AudioEncoding;
use Google\Cloud\Dialogflow\V2\InputAudioConfig;
use Google\Cloud\Dialogflow\V2\QueryInput;
use Google\Cloud\Dialogflow\V2\StreamingDetectIntentRequest;

/**
* Returns the result of detect intent with streaming audio as input.
* Using the same `session_id` between requests allows continuation
* of the conversation.
*/
function detect_intent_stream($projectId, $path, $sessionId, $languageCode = 'en-US')
{
    // need to use gRPC
    if (!defined('Grpc\STATUS_OK')) {
        throw new \Exception('Install the grpc extension ' .
            '(pecl install grpc)');
    }

    // new session
    $sessionsClient = new SessionsClient();
    $session = $sessionsClient->sessionName($projectId, $sessionId ?: uniqid());
    printf('Session path: %s' . PHP_EOL, $session);

    // hard coding audio_encoding and sample_rate_hertz for simplicity
    $audioConfig = new InputAudioConfig();
    $audioConfig->setAudioEncoding(AudioEncoding::AUDIO_ENCODING_LINEAR_16);
    $audioConfig->setLanguageCode($languageCode);
    $audioConfig->setSampleRateHertz(16000);

    // create query input
    $queryInput = new QueryInput();
    $queryInput->setAudioConfig($audioConfig);

    // first request contains the configuration
    $request = new StreamingDetectIntentRequest();
    $request->setSession($session);
    $request->setQueryInput($queryInput);
    $requests = [$request];

    // we are going to read small chunks of audio data from
    // a local audio file. in practice, these chunks should
    // come from an audio input device.
    $audioStream = fopen($path, 'rb');
    while (true) {
        $chunk = stream_get_contents($audioStream, 4096);
        if (!$chunk) {
            break;
        }
        $request = new StreamingDetectIntentRequest();
        $request->setInputAudio($chunk);
        $requests[] = $request;
    }

    // intermediate transcript info
    print(PHP_EOL . str_repeat("=", 20) . PHP_EOL);
    $stream = $sessionsClient->streamingDetectIntent();
    foreach ($requests as $request) {
        $stream->write($request);
    }
    foreach ($stream->closeWriteAndReadAll() as $response) {
        $recognitionResult = $response->getRecognitionResult();
        if ($recognitionResult) {
            $transcript = $recognitionResult->getTranscript();
            printf('Intermediate transcript: %s' . PHP_EOL, $transcript);
        }
    }

    // get final response and relevant info
    if ($response) {
        print(str_repeat("=", 20) . PHP_EOL);
        $queryResult = $response->getQueryResult();
        $queryText = $queryResult->getQueryText();
        $intent = $queryResult->getIntent();
        $displayName = $intent->getDisplayName();
        $confidence = $queryResult->getIntentDetectionConfidence();
        $fulfilmentText = $queryResult->getFulfillmentText();

        // output relevant info
        printf('Query text: %s' . PHP_EOL, $queryText);
        printf('Detected intent: %s (confidence: %f)' . PHP_EOL, $displayName,
            $confidence);
        print(PHP_EOL);
        printf('Fulfilment text: %s' . PHP_EOL, $fulfilmentText);
    }

    $sessionsClient->close();
}

Python

def detect_intent_stream(project_id, session_id, audio_file_path,
                         language_code):
    """Returns the result of detect intent with streaming audio as input.

    Using the same `session_id` between requests allows continuation
    of the conversation."""
    from google.cloud import dialogflow
    session_client = dialogflow.SessionsClient()

    # Note: hard coding audio_encoding and sample_rate_hertz for simplicity.
    audio_encoding = dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = 16000

    session_path = session_client.session_path(project_id, session_id)
    print('Session path: {}\n'.format(session_path))

    def request_generator(audio_config, audio_file_path):
        query_input = dialogflow.QueryInput(audio_config=audio_config)

        # The first request contains the configuration.
        yield dialogflow.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input)

        # Here we are reading small chunks of audio data from a local
        # audio file.  In practice these chunks should come from
        # an audio input device.
        with open(audio_file_path, 'rb') as audio_file:
            while True:
                chunk = audio_file.read(4096)
                if not chunk:
                    break
                # The later requests contains audio data.
                yield dialogflow.StreamingDetectIntentRequest(
                    input_audio=chunk)

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=audio_encoding, language_code=language_code,
        sample_rate_hertz=sample_rate_hertz)

    requests = request_generator(audio_config, audio_file_path)
    responses = session_client.streaming_detect_intent(requests=requests)

    print('=' * 20)
    for response in responses:
        print('Intermediate transcript: "{}".'.format(
                response.recognition_result.transcript))

    # Note: The result from the last response is the final transcript along
    # with the detected content.
    query_result = response.query_result

    print('=' * 20)
    print('Query text: {}'.format(query_result.query_text))
    print('Detected intent: {} (confidence: {})\n'.format(
        query_result.intent.display_name,
        query_result.intent_detection_confidence))
    print('Fulfillment text: {}\n'.format(
        query_result.fulfillment_text))

Ruby

# project_id = "Your Google Cloud project ID"
# session_id = "mysession"
# audio_file_path = "resources/book_a_room.wav"
# language_code = "en-US"

require "google/cloud/dialogflow"
require "monitor"

session_client = Google::Cloud::Dialogflow.sessions
session = session_client.session_path project: project_id,
                                      session: session_id
puts "Session path: #{session}"

audio_config = {
  audio_encoding:    :AUDIO_ENCODING_LINEAR_16,
  sample_rate_hertz: 16_000,
  language_code:     language_code
}
query_input = { audio_config: audio_config }
streaming_config = { session: session, query_input: query_input }

# Set up a stream of audio data
request_stream = Gapic::StreamInput.new

# Initiate the call
response_stream = session_client.streaming_detect_intent request_stream

# Process the response stream in a separate thread
response_thread = Thread.new do
  response_stream.each do |response|
    if response.recognition_result
      puts "Intermediate transcript: #{response.recognition_result.transcript}\n"
    else
      # the last response has the actual query result
      query_result = response.query_result
      puts "Query text:        #{query_result.query_text}"
      puts "Intent detected:   #{query_result.intent.display_name}"
      puts "Intent confidence: #{query_result.intent_detection_confidence}"
      puts "Fulfillment text:  #{query_result.fulfillment_text}\n"
    end
  end
end

# The first request needs to be the configuration.
request_stream.push streaming_config

# Send chunks of audio data in the request stream
begin
  audio_file = File.open audio_file_path, "rb"
  loop do
    chunk = audio_file.read 4096
    break unless chunk
    request_stream.push input_audio: chunk
    sleep 0.5
  end
ensure
  audio_file.close
  # Close the request stream to signal that you are finished sending data.
  request_stream.close
end

# Wait until the response processing thread is complete.
response_thread.join

Samples

See the samples page for best practices for streaming audio from a browser microphone to Dialogflow.