从音频输入流中检测意图

本页面介绍如何使用 API 将音频输入流式传输给检测意图请求。Dialogflow 会处理音频并将其转换为文字,然后再尝试匹配意图。这种转换称为音频输入、语音识别语音转文字 (STT)。

准备工作

此功能仅适用于使用 API 与最终用户互动的情况。如果您使用的是集成服务,则可以跳过本指南。

在阅读本指南之前,请先完成以下事项:

  1. 阅读 Dialogflow 基础知识
  2. 执行设置步骤

创建代理

如果尚未创建代理,请立即创建一个:

  1. 转到 Dialogflow ES 控制台
  2. 如果系统要求登录 Dialogflow 控制台,请登录。如需了解详情,请参阅 Dialogflow 控制台概览
  3. 点击左侧边栏菜单中的创建代理 (Create Agent)。如果您已有其他代理,请点击代理名称,滚动到底部,然后点击创建新代理 (Create new agent)。
  4. 输入您的代理名称、默认语言和默认时区。
  5. 如果您已经创建了项目,请输入该项目。如果要允许 Dialogflow 控制台创建项目,请选择创建新 Google 项目 (Create a new Google project)。
  6. 点击创建 (Create) 按钮。

将示例文件导入代理

本指南中的步骤对您的代理进行了假设,因此您需要导入为本指南准备的代理。 导入时,这些步骤使用“恢复”(restore) 选项,该选项会覆盖所有代理设置、意图和实体。

如需导入文件,请按以下步骤操作:

  1. 下载 room-booking-agent.zip 文件。
  2. 转到 Dialogflow ES 控制台
  3. 选择您的代理。
  4. 点击代理名称旁边的设置 按钮。
  5. 选择导出和导入 (Export and Import) 标签页。
  6. 选择从 ZIP 文件恢复 (Restore from ZIP),然后按照说明恢复下载的 zip 文件。

流式传输基础知识

Session 类型的 streamingDetectIntent 方法返回双向 gRPC 流式传输对象。此对象的可用方法随语言而变,因此请参阅与您的客户端库相应的参考文档,以了解详情。

流式传输对象用于并发收发数据。使用此对象,客户端可将音频内容流式传输到 Dialogflow,并同时侦听 StreamingDetectIntentResponse

streamingDetectIntent 方法的 query_input.audio_config.single_utterance 参数会影响语音识别:

  • 如果为 false(默认值),则在客户端关闭数据流之前,不会停止语音识别。
  • 如果为 true,则 Dialogflow 将检测输入音频中的单独一条话语。Dialogflow 检测到音频的语音停止或暂停时,它会停止语音识别,并向客户端发送 StreamingDetectIntentResponseEND_OF_SINGLE_UTTERANCE 的识别结果。收到 END_OF_SINGLE_UTTERANCE 后,Dialogflow 会忽略该流中向其发送的任何音频。

在双向流式传输中,客户端可以半关闭流式对象,以告知服务器它不会再继续发送数据。例如,在 Java 和 Go 中,此方法称为 closeSend。在以下情况下,半关闭(并非取消)数据流非常重要:

  • 客户端已完成数据发送。
  • 客户端配置为 single_utterance 设为 true,并且收到了 StreamingDetectIntentResponse 以及识别结果 END_OF_SINGLE_UTTERANCE

关闭数据流后,客户端应根据需要使用新数据流发起新请求。

流式传输检测意图

以下示例使用 Session 类型的 streamingDetectIntent 方法流式传输音频。

C#

public static async Task<object> DetectIntentFromStreamAsync(
    string projectId,
    string sessionId,
    string filePath)
{
    var sessionsClient = SessionsClient.Create();
    var sessionName = SessionName.FromProjectSession(projectId, sessionId).ToString();

    // Initialize streaming call, retrieving the stream object
    var streamingDetectIntent = sessionsClient.StreamingDetectIntent();

    // Define a task to process results from the API
    var responseHandlerTask = Task.Run(async () =>
    {
        var responseStream = streamingDetectIntent.GetResponseStream();
        while (await responseStream.MoveNextAsync())
        {
            var response = responseStream.Current;
            var queryResult = response.QueryResult;

            if (queryResult != null)
            {
                Console.WriteLine($"Query text: {queryResult.QueryText}");
                if (queryResult.Intent != null)
                {
                    Console.Write("Intent detected:");
                    Console.WriteLine(queryResult.Intent.DisplayName);
                }
            }
        }
    });

    // Instructs the speech recognizer how to process the audio content.
    // Note: hard coding audioEncoding, sampleRateHertz for simplicity.
    var queryInput = new QueryInput
    {
        AudioConfig = new InputAudioConfig
        {
            AudioEncoding = AudioEncoding.Linear16,
            LanguageCode = "en-US",
            SampleRateHertz = 16000
        }
    };

    // The first request must **only** contain the audio configuration:
    await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
    {
        QueryInput = queryInput,
        Session = sessionName
    });

    using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
    {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in
        // fixed-size chunks. In reality you would split the user input
        // by time.
        var buffer = new byte[32 * 1024];
        int bytesRead;
        while ((bytesRead = await fileStream.ReadAsync(
            buffer, 0, buffer.Length)) > 0)
        {
            await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
            {
                Session = sessionName,
                InputAudio = Google.Protobuf.ByteString.CopyFrom(buffer, 0, bytesRead)
            });
        };
    }

    // Tell the service you are done sending data
    await streamingDetectIntent.WriteCompleteAsync();

    // This will complete once all server responses have been processed.
    await responseHandlerTask;

    return 0;
}

Go

func DetectIntentStream(projectID, sessionID, audioFile, languageCode string) (string, error) {
	ctx := context.Background()

	sessionClient, err := dialogflow.NewSessionsClient(ctx)
	if err != nil {
		return "", err
	}
	defer sessionClient.Close()

	if projectID == "" || sessionID == "" {
		return "", errors.New(fmt.Sprintf("Received empty project (%s) or session (%s)", projectID, sessionID))
	}

	sessionPath := fmt.Sprintf("projects/%s/agent/sessions/%s", projectID, sessionID)

	// In this example, we hard code the encoding and sample rate for simplicity.
	audioConfig := dialogflowpb.InputAudioConfig{AudioEncoding: dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16, SampleRateHertz: 16000, LanguageCode: languageCode}

	queryAudioInput := dialogflowpb.QueryInput_AudioConfig{AudioConfig: &audioConfig}

	queryInput := dialogflowpb.QueryInput{Input: &queryAudioInput}

	streamer, err := sessionClient.StreamingDetectIntent(ctx)
	if err != nil {
		return "", err
	}

	f, err := os.Open(audioFile)
	if err != nil {
		return "", err
	}

	defer f.Close()

	go func() {
		audioBytes := make([]byte, 1024)

		request := dialogflowpb.StreamingDetectIntentRequest{Session: sessionPath, QueryInput: &queryInput}
		err = streamer.Send(&request)
		if err != nil {
			log.Fatal(err)
		}

		for {
			_, err := f.Read(audioBytes)
			if err == io.EOF {
				streamer.CloseSend()
				break
			}
			if err != nil {
				log.Fatal(err)
			}

			request = dialogflowpb.StreamingDetectIntentRequest{InputAudio: audioBytes}
			err = streamer.Send(&request)
			if err != nil {
				log.Fatal(err)
			}
		}
	}()

	var queryResult *dialogflowpb.QueryResult

	for {
		response, err := streamer.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}

		recognitionResult := response.GetRecognitionResult()
		transcript := recognitionResult.GetTranscript()
		log.Printf("Recognition transcript: %s\n", transcript)

		queryResult = response.GetQueryResult()
	}

	fulfillmentText := queryResult.GetFulfillmentText()
	return fulfillmentText, nil
}

Java


import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.BidiStream;
import com.google.cloud.dialogflow.v2.AudioEncoding;
import com.google.cloud.dialogflow.v2.InputAudioConfig;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentRequest;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentResponse;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;

class DetectIntentStream {

  // DialogFlow API Detect Intent sample with audio files processes as an audio stream.
  static void detectIntentStream(String projectId, String audioFilePath, String sessionId)
      throws IOException, ApiException {
    // String projectId = "YOUR_PROJECT_ID";
    // String audioFilePath = "path_to_your_audio_file";
    // Using the same `sessionId` between requests allows continuation of the conversation.
    // String sessionId = "Identifier of the DetectIntent session";

    // Instantiates a client
    try (SessionsClient sessionsClient = SessionsClient.create()) {
      // Set the session name using the sessionId (UUID) and projectID (my-project-id)
      SessionName session = SessionName.of(projectId, sessionId);

      // Instructs the speech recognizer how to process the audio content.
      // Note: hard coding audioEncoding and sampleRateHertz for simplicity.
      // Audio encoding of the audio content sent in the query request.
      InputAudioConfig inputAudioConfig =
          InputAudioConfig.newBuilder()
              .setAudioEncoding(AudioEncoding.AUDIO_ENCODING_LINEAR_16)
              .setLanguageCode("en-US") // languageCode = "en-US"
              .setSampleRateHertz(16000) // sampleRateHertz = 16000
              .build();

      // Build the query with the InputAudioConfig
      QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();

      // Create the Bidirectional stream
      BidiStream<StreamingDetectIntentRequest, StreamingDetectIntentResponse> bidiStream =
          sessionsClient.streamingDetectIntentCallable().call();

      // The first request must **only** contain the audio configuration:
      bidiStream.send(
          StreamingDetectIntentRequest.newBuilder()
              .setSession(session.toString())
              .setQueryInput(queryInput)
              .build());

      try (FileInputStream audioStream = new FileInputStream(audioFilePath)) {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in fixed-size chunks. In reality
        // you would split the user input by time.
        byte[] buffer = new byte[4096];
        int bytes;
        while ((bytes = audioStream.read(buffer)) != -1) {
          bidiStream.send(
              StreamingDetectIntentRequest.newBuilder()
                  .setInputAudio(ByteString.copyFrom(buffer, 0, bytes))
                  .build());
        }
      }

      // Tell the service you are done sending data
      bidiStream.closeSend();

      for (StreamingDetectIntentResponse response : bidiStream) {
        QueryResult queryResult = response.getQueryResult();
        System.out.println("====================");
        System.out.format("Intent Display Name: %s\n", queryResult.getIntent().getDisplayName());
        System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
        System.out.format(
            "Detected Intent: %s (confidence: %f)\n",
            queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
        System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());
      }
    }
  }
}

Node.js

const fs = require('fs');
const util = require('util');
const {Transform, pipeline} = require('stream');
const {struct} = require('pb-util');

const pump = util.promisify(pipeline);
// Imports the Dialogflow library
const dialogflow = require('@google-cloud/dialogflow');

// Instantiates a session client
const sessionClient = new dialogflow.SessionsClient();

// The path to the local file on which to perform speech recognition, e.g.
// /path/to/audio.raw const filename = '/path/to/audio.raw';

// The encoding of the audio file, e.g. 'AUDIO_ENCODING_LINEAR_16'
// const encoding = 'AUDIO_ENCODING_LINEAR_16';

// The sample rate of the audio file in hertz, e.g. 16000
// const sampleRateHertz = 16000;

// The BCP-47 language code to use, e.g. 'en-US'
// const languageCode = 'en-US';
const sessionPath = sessionClient.projectAgentSessionPath(
  projectId,
  sessionId
);

const initialStreamRequest = {
  session: sessionPath,
  queryInput: {
    audioConfig: {
      audioEncoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    },
    singleUtterance: true,
  },
};

// Create a stream for the streaming request.
const detectStream = sessionClient
  .streamingDetectIntent()
  .on('error', console.error)
  .on('data', data => {
    if (data.recognitionResult) {
      console.log(
        `Intermediate transcript: ${data.recognitionResult.transcript}`
      );
    } else {
      console.log('Detected intent:');

      const result = data.queryResult;
      // Instantiates a context client
      const contextClient = new dialogflow.ContextsClient();

      console.log(`  Query: ${result.queryText}`);
      console.log(`  Response: ${result.fulfillmentText}`);
      if (result.intent) {
        console.log(`  Intent: ${result.intent.displayName}`);
      } else {
        console.log('  No intent matched.');
      }
      const parameters = JSON.stringify(struct.decode(result.parameters));
      console.log(`  Parameters: ${parameters}`);
      if (result.outputContexts && result.outputContexts.length) {
        console.log('  Output contexts:');
        result.outputContexts.forEach(context => {
          const contextId = contextClient.matchContextFromProjectAgentSessionContextName(
            context.name
          );
          const contextParameters = JSON.stringify(
            struct.decode(context.parameters)
          );
          console.log(`    ${contextId}`);
          console.log(`      lifespan: ${context.lifespanCount}`);
          console.log(`      parameters: ${contextParameters}`);
        });
      }
    }
  });

// Write the initial stream request to config for audio input.
detectStream.write(initialStreamRequest);

// Stream an audio file from disk to the Conversation API, e.g.
// "./resources/audio.raw"
await pump(
  fs.createReadStream(filename),
  // Format the audio stream into the request format.
  new Transform({
    objectMode: true,
    transform: (obj, _, next) => {
      next(null, {inputAudio: obj});
    },
  }),
  detectStream
);

PHP

namespace Google\Cloud\Samples\Dialogflow;

use Google\Cloud\Dialogflow\V2\SessionsClient;
use Google\Cloud\Dialogflow\V2\AudioEncoding;
use Google\Cloud\Dialogflow\V2\InputAudioConfig;
use Google\Cloud\Dialogflow\V2\QueryInput;
use Google\Cloud\Dialogflow\V2\StreamingDetectIntentRequest;

/**
* Returns the result of detect intent with streaming audio as input.
* Using the same `session_id` between requests allows continuation
* of the conversation.
*/
function detect_intent_stream($projectId, $path, $sessionId, $languageCode = 'en-US')
{
    // need to use gRPC
    if (!defined('Grpc\STATUS_OK')) {
        throw new \Exception('Install the grpc extension ' .
            '(pecl install grpc)');
    }

    // new session
    $sessionsClient = new SessionsClient();
    $session = $sessionsClient->sessionName($projectId, $sessionId ?: uniqid());
    printf('Session path: %s' . PHP_EOL, $session);

    // hard coding audio_encoding and sample_rate_hertz for simplicity
    $audioConfig = new InputAudioConfig();
    $audioConfig->setAudioEncoding(AudioEncoding::AUDIO_ENCODING_LINEAR_16);
    $audioConfig->setLanguageCode($languageCode);
    $audioConfig->setSampleRateHertz(16000);

    // create query input
    $queryInput = new QueryInput();
    $queryInput->setAudioConfig($audioConfig);

    // first request contains the configuration
    $request = new StreamingDetectIntentRequest();
    $request->setSession($session);
    $request->setQueryInput($queryInput);
    $requests = [$request];

    // we are going to read small chunks of audio data from
    // a local audio file. in practice, these chunks should
    // come from an audio input device.
    $audioStream = fopen($path, 'rb');
    while (true) {
        $chunk = stream_get_contents($audioStream, 4096);
        if (!$chunk) {
            break;
        }
        $request = new StreamingDetectIntentRequest();
        $request->setInputAudio($chunk);
        $requests[] = $request;
    }

    // intermediate transcript info
    print(PHP_EOL . str_repeat("=", 20) . PHP_EOL);
    $stream = $sessionsClient->streamingDetectIntent();
    foreach ($requests as $request) {
        $stream->write($request);
    }
    foreach ($stream->closeWriteAndReadAll() as $response) {
        $recognitionResult = $response->getRecognitionResult();
        if ($recognitionResult) {
            $transcript = $recognitionResult->getTranscript();
            printf('Intermediate transcript: %s' . PHP_EOL, $transcript);
        }
    }

    // get final response and relevant info
    if ($response) {
        print(str_repeat("=", 20) . PHP_EOL);
        $queryResult = $response->getQueryResult();
        $queryText = $queryResult->getQueryText();
        $intent = $queryResult->getIntent();
        $displayName = $intent->getDisplayName();
        $confidence = $queryResult->getIntentDetectionConfidence();
        $fulfilmentText = $queryResult->getFulfillmentText();

        // output relevant info
        printf('Query text: %s' . PHP_EOL, $queryText);
        printf('Detected intent: %s (confidence: %f)' . PHP_EOL, $displayName,
            $confidence);
        print(PHP_EOL);
        printf('Fulfilment text: %s' . PHP_EOL, $fulfilmentText);
    }

    $sessionsClient->close();
}

Python

def detect_intent_stream(project_id, session_id, audio_file_path,
                         language_code):
    """Returns the result of detect intent with streaming audio as input.

    Using the same `session_id` between requests allows continuation
    of the conversation."""
    import dialogflow_v2 as dialogflow
    session_client = dialogflow.SessionsClient()

    # Note: hard coding audio_encoding and sample_rate_hertz for simplicity.
    audio_encoding = dialogflow.enums.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = 16000

    session_path = session_client.session_path(project_id, session_id)
    print('Session path: {}\n'.format(session_path))

    def request_generator(audio_config, audio_file_path):
        query_input = dialogflow.types.QueryInput(audio_config=audio_config)

        # The first request contains the configuration.
        yield dialogflow.types.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input)

        # Here we are reading small chunks of audio data from a local
        # audio file.  In practice these chunks should come from
        # an audio input device.
        with open(audio_file_path, 'rb') as audio_file:
            while True:
                chunk = audio_file.read(4096)
                if not chunk:
                    break
                # The later requests contains audio data.
                yield dialogflow.types.StreamingDetectIntentRequest(
                    input_audio=chunk)

    audio_config = dialogflow.types.InputAudioConfig(
        audio_encoding=audio_encoding, language_code=language_code,
        sample_rate_hertz=sample_rate_hertz)

    requests = request_generator(audio_config, audio_file_path)
    responses = session_client.streaming_detect_intent(requests)

    print('=' * 20)
    for response in responses:
        print('Intermediate transcript: "{}".'.format(
                response.recognition_result.transcript))

    # Note: The result from the last response is the final transcript along
    # with the detected content.
    query_result = response.query_result

    print('=' * 20)
    print('Query text: {}'.format(query_result.query_text))
    print('Detected intent: {} (confidence: {})\n'.format(
        query_result.intent.display_name,
        query_result.intent_detection_confidence))
    print('Fulfillment text: {}\n'.format(
        query_result.fulfillment_text))

Ruby

# project_id = "Your Google Cloud project ID"
# session_id = "mysession"
# audio_file_path = "resources/book_a_room.wav"
# language_code = "en-US"

require "google/cloud/dialogflow"
require "monitor"

session_client = Google::Cloud::Dialogflow.sessions
session = session_client.session_path project: project_id,
                                      session: session_id
puts "Session path: #{session}"

audio_config = {
  audio_encoding:    :AUDIO_ENCODING_LINEAR_16,
  sample_rate_hertz: 16_000,
  language_code:     language_code
}
query_input = { audio_config: audio_config }
streaming_config = { session: session, query_input: query_input }

# Set up a stream of audio data
request_stream = Gapic::StreamInput.new

# Initiate the call
response_stream = session_client.streaming_detect_intent request_stream

# Process the response stream in a separate thread
response_thread = Thread.new do
  response_stream.each do |response|
    if response.recognition_result
      puts "Intermediate transcript: #{response.recognition_result.transcript}\n"
    else
      # the last response has the actual query result
      query_result = response.query_result
      puts "Query text:        #{query_result.query_text}"
      puts "Intent detected:   #{query_result.intent.display_name}"
      puts "Intent confidence: #{query_result.intent_detection_confidence}"
      puts "Fulfillment text:  #{query_result.fulfillment_text}\n"
    end
  end
end

# The first request needs to be the configuration.
request_stream.push streaming_config

# Send chunks of audio data in the request stream
begin
  audio_file = File.open audio_file_path, "rb"
  loop do
    chunk = audio_file.read 4096
    break unless chunk
    request_stream.push input_audio: chunk
    sleep 0.5
  end
ensure
  audio_file.close
  # Close the request stream to signal that you are finished sending data.
  request_stream.close
end

# Wait until the response processing thread is complete.
response_thread.join

示例

如需了解从浏览器麦克风流式传输到 Dialogflow 的最佳做法,请参阅示例页面