Detect intent from an audio input stream

This page shows how to use the API to stream audio input to a detect intent request. Dialogflow processes the audio and converts it to text before attempting an intent match. This conversion is known as audio input, speech recognition, speech-to-text, or STT.

Before you begin

This feature only applies when you use the API for end-user interactions. If you are using an integration, you can skip this guide.

Before reading this guide, do the following:

  1. Read Dialogflow basics.
  2. Perform setup steps.

Create an agent

If you have not already created an agent, create one now:

  1. Go to the Dialogflow ES Console.
  2. If requested, sign in to the Dialogflow Console. See the Dialogflow console overview for more information.
  3. Click Create Agent in the left sidebar menu. If you already have other agents, click the agent name, scroll to the bottom, and then click Create new agent.
  4. Enter your agent's name, default language, and default time zone.
  5. If you have already created a project, enter that project. If you want to allow the Dialogflow Console to create the project, select Create a new Google project.
  6. Click the Create button.

Import the example file to your agent

The steps in this guide make assumptions about your agent, so you need to import the agent prepared for this guide. When importing, these steps use the restore option, which overwrites all agent settings, intents, and entities.

To import the file, follow these steps:

  1. Download the room-booking-agent.zip file.
  2. Go to the Dialogflow ES Console.
  3. Select your agent.
  4. Click the settings button next to the agent name.
  5. Select the Export and Import tab.
  6. Select Restore from ZIP and follow the instructions to restore the zip file that you downloaded.

Streaming basics

The Session type's streamingDetectIntent method returns a bidirectional gRPC streaming object. The methods available for this object vary by language, so see the reference documentation for your client library for details.

The streaming object is used to send and receive data concurrently. Using this object, your client streams audio content to Dialogflow while simultaneously listening for StreamingDetectIntentResponse messages.

The streamingDetectIntent method's query_input.audio_config.single_utterance parameter affects speech recognition:

  • If false (default), speech recognition does not cease until the client closes the stream.
  • If true, Dialogflow detects a single spoken utterance in the input audio. When Dialogflow detects that the voice in the audio has stopped or paused, it ceases speech recognition and sends your client a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE. Dialogflow ignores any audio sent to it on that stream after it receives END_OF_SINGLE_UTTERANCE. A minimal sketch of this behavior follows this list.
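
For illustration, here is a minimal Go sketch of single-utterance handling. It is not part of the official sample; it assumes the same dialogflowpb types and the streamer variable used in the Go sample later on this page:

// Enable single-utterance mode in the audio config sent with the first request.
audioConfig := dialogflowpb.InputAudioConfig{
	AudioEncoding:   dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16,
	SampleRateHertz: 16000,
	LanguageCode:    "en-US",
	SingleUtterance: true,
}
// audioConfig is then wrapped in a QueryInput and sent in the first request, as in the samples below.

// In the receive loop, check the message type of each recognition result.
response, err := streamer.Recv()
if err != nil {
	log.Fatal(err)
}
if response.GetRecognitionResult().GetMessageType() ==
	dialogflowpb.StreamingRecognitionResult_END_OF_SINGLE_UTTERANCE {
	// Dialogflow ignores any further audio on this stream,
	// so stop sending and half-close it (see below).
}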

In bidirectional streaming, a client can half-close the streaming object to signal to the server that it won't send any more data. For example, in Java and Go this method is called closeSend. It is important to half-close (but not abort) the stream in the following situations:

  • Your client has finished sending data.
  • Your client is configured with single_utterance set to true, and it has received a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE.

After closing a stream, your client should start a new request with a new stream as needed.
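
As a hedged sketch (not part of the official sample), the following Go fragment half-closes the stream when the audio source is exhausted or when the receive loop has signaled END_OF_SINGLE_UTTERANCE, and then notes where a new stream would be opened. It reuses ctx, sessionClient, sessionPath, and queryInput from the Go sample below; nextAudioChunk is a hypothetical audio source, and stopSending is a hypothetical channel used for cross-goroutine signaling.

// stopSending is a hypothetical channel that your receive loop closes when it
// sees a recognition result of END_OF_SINGLE_UTTERANCE.
stopSending := make(chan struct{})

streamer, err := sessionClient.StreamingDetectIntent(ctx)
if err != nil {
	log.Fatal(err)
}

// The first request carries the session and audio config only.
if err := streamer.Send(&dialogflowpb.StreamingDetectIntentRequest{
	Session:    sessionPath,
	QueryInput: &queryInput,
}); err != nil {
	log.Fatal(err)
}

done := false
for !done {
	select {
	case <-stopSending:
		// single_utterance was true and END_OF_SINGLE_UTTERANCE arrived.
		done = true
	default:
		chunk, ok := nextAudioChunk() // hypothetical audio source returning ([]byte, bool)
		if !ok {
			done = true // no more audio to send
			break
		}
		if err := streamer.Send(&dialogflowpb.StreamingDetectIntentRequest{InputAudio: chunk}); err != nil {
			log.Fatal(err)
		}
	}
}

// Half-close (not cancel): the server can still deliver any remaining responses.
if err := streamer.CloseSend(); err != nil {
	log.Fatal(err)
}

// A later request in the same conversation should use a new stream:
// streamer, err = sessionClient.StreamingDetectIntent(ctx)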

Streaming detect intent

The following samples use the Session type's streamingDetectIntent method to stream audio.

Go

To authenticate to Dialogflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

func DetectIntentStream(projectID, sessionID, audioFile, languageCode string) (string, error) {
	ctx := context.Background()

	sessionClient, err := dialogflow.NewSessionsClient(ctx)
	if err != nil {
		return "", err
	}
	defer sessionClient.Close()

	if projectID == "" || sessionID == "" {
		return "", errors.New(fmt.Sprintf("Received empty project (%s) or session (%s)", projectID, sessionID))
	}

	sessionPath := fmt.Sprintf("projects/%s/agent/sessions/%s", projectID, sessionID)

	// In this example, we hard code the encoding and sample rate for simplicity.
	audioConfig := dialogflowpb.InputAudioConfig{AudioEncoding: dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16, SampleRateHertz: 16000, LanguageCode: languageCode}

	queryAudioInput := dialogflowpb.QueryInput_AudioConfig{AudioConfig: &audioConfig}

	queryInput := dialogflowpb.QueryInput{Input: &queryAudioInput}

	streamer, err := sessionClient.StreamingDetectIntent(ctx)
	if err != nil {
		return "", err
	}

	f, err := os.Open(audioFile)
	if err != nil {
		return "", err
	}

	defer f.Close()

	go func() {
		audioBytes := make([]byte, 1024)

		// The first request contains the session and the audio config only.
		request := dialogflowpb.StreamingDetectIntentRequest{Session: sessionPath, QueryInput: &queryInput}
		err = streamer.Send(&request)
		if err != nil {
			log.Fatal(err)
		}

		for {
			n, err := f.Read(audioBytes)
			if err == io.EOF {
				// Audio source exhausted: half-close the stream.
				streamer.CloseSend()
				break
			}
			if err != nil {
				log.Fatal(err)
			}

			// Subsequent requests contain audio data only.
			// Send only the bytes actually read in this iteration.
			request = dialogflowpb.StreamingDetectIntentRequest{InputAudio: audioBytes[:n]}
			err = streamer.Send(&request)
			if err != nil {
				log.Fatal(err)
			}
		}
	}()

	var queryResult *dialogflowpb.QueryResult

	for {
		response, err := streamer.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}

		recognitionResult := response.GetRecognitionResult()
		transcript := recognitionResult.GetTranscript()
		log.Printf("Recognition transcript: %s\n", transcript)

		queryResult = response.GetQueryResult()
	}

	fulfillmentText := queryResult.GetFulfillmentText()
	return fulfillmentText, nil
}

Java

To authenticate to Dialogflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.BidiStream;
import com.google.cloud.dialogflow.v2.AudioEncoding;
import com.google.cloud.dialogflow.v2.InputAudioConfig;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentRequest;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentResponse;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;

class DetectIntentStream {

  // Dialogflow API Detect Intent sample that processes an audio file as an audio stream.
  static void detectIntentStream(String projectId, String audioFilePath, String sessionId)
      throws IOException, ApiException {
    // String projectId = "YOUR_PROJECT_ID";
    // String audioFilePath = "path_to_your_audio_file";
    // Using the same `sessionId` between requests allows continuation of the conversation.
    // String sessionId = "Identifier of the DetectIntent session";

    // Instantiates a client
    try (SessionsClient sessionsClient = SessionsClient.create()) {
      // Set the session name using the sessionId (UUID) and projectID (my-project-id)
      SessionName session = SessionName.of(projectId, sessionId);

      // Instructs the speech recognizer how to process the audio content.
      // Note: hard coding audioEncoding and sampleRateHertz for simplicity.
      // Audio encoding of the audio content sent in the query request.
      InputAudioConfig inputAudioConfig =
          InputAudioConfig.newBuilder()
              .setAudioEncoding(AudioEncoding.AUDIO_ENCODING_LINEAR_16)
              .setLanguageCode("en-US") // languageCode = "en-US"
              .setSampleRateHertz(16000) // sampleRateHertz = 16000
              .build();

      // Build the query with the InputAudioConfig
      QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();

      // Create the Bidirectional stream
      BidiStream<StreamingDetectIntentRequest, StreamingDetectIntentResponse> bidiStream =
          sessionsClient.streamingDetectIntentCallable().call();

      // The first request must **only** contain the audio configuration:
      bidiStream.send(
          StreamingDetectIntentRequest.newBuilder()
              .setSession(session.toString())
              .setQueryInput(queryInput)
              .build());

      try (FileInputStream audioStream = new FileInputStream(audioFilePath)) {
        // Subsequent requests must **only** contain the audio data.
        // Following messages: audio chunks. We just read the file in fixed-size chunks. In reality
        // you would split the user input by time.
        byte[] buffer = new byte[4096];
        int bytes;
        while ((bytes = audioStream.read(buffer)) != -1) {
          bidiStream.send(
              StreamingDetectIntentRequest.newBuilder()
                  .setInputAudio(ByteString.copyFrom(buffer, 0, bytes))
                  .build());
        }
      }

      // Tell the service you are done sending data
      bidiStream.closeSend();

      for (StreamingDetectIntentResponse response : bidiStream) {
        QueryResult queryResult = response.getQueryResult();
        System.out.println("====================");
        System.out.format("Intent Display Name: %s\n", queryResult.getIntent().getDisplayName());
        System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
        System.out.format(
            "Detected Intent: %s (confidence: %f)\n",
            queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
        System.out.format(
            "Fulfillment Text: '%s'\n",
            queryResult.getFulfillmentMessagesCount() > 0
                ? queryResult.getFulfillmentMessages(0).getText()
                : "Triggered Default Fallback Intent");
      }
    }
  }
}

Node.js

To authenticate to Dialogflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

const fs = require('fs');
const util = require('util');
const {Transform, pipeline} = require('stream');
const {struct} = require('pb-util');

const pump = util.promisify(pipeline);
// Imports the Dialogflow library
const dialogflow = require('@google-cloud/dialogflow');

// Instantiates a session client
const sessionClient = new dialogflow.SessionsClient();

// The path to the local file on which to perform speech recognition, e.g.
// /path/to/audio.raw
// const filename = '/path/to/audio.raw';

// The encoding of the audio file, e.g. 'AUDIO_ENCODING_LINEAR_16'
// const encoding = 'AUDIO_ENCODING_LINEAR_16';

// The sample rate of the audio file in hertz, e.g. 16000
// const sampleRateHertz = 16000;

// The BCP-47 language code to use, e.g. 'en-US'
// const languageCode = 'en-US';
const sessionPath = sessionClient.projectAgentSessionPath(
  projectId,
  sessionId
);

const initialStreamRequest = {
  session: sessionPath,
  queryInput: {
    audioConfig: {
      audioEncoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    },
  },
};

// Create a stream for the streaming request.
const detectStream = sessionClient
  .streamingDetectIntent()
  .on('error', console.error)
  .on('data', data => {
    if (data.recognitionResult) {
      console.log(
        `Intermediate transcript: ${data.recognitionResult.transcript}`
      );
    } else {
      console.log('Detected intent:');

      const result = data.queryResult;
      // Instantiates a context client
      const contextClient = new dialogflow.ContextsClient();

      console.log(`  Query: ${result.queryText}`);
      console.log(`  Response: ${result.fulfillmentText}`);
      if (result.intent) {
        console.log(`  Intent: ${result.intent.displayName}`);
      } else {
        console.log('  No intent matched.');
      }
      const parameters = JSON.stringify(struct.decode(result.parameters));
      console.log(`  Parameters: ${parameters}`);
      if (result.outputContexts && result.outputContexts.length) {
        console.log('  Output contexts:');
        result.outputContexts.forEach(context => {
          const contextId =
            contextClient.matchContextFromProjectAgentSessionContextName(
              context.name
            );
          const contextParameters = JSON.stringify(
            struct.decode(context.parameters)
          );
          console.log(`    ${contextId}`);
          console.log(`      lifespan: ${context.lifespanCount}`);
          console.log(`      parameters: ${contextParameters}`);
        });
      }
    }
  });

// Write the initial stream request to config for audio input.
detectStream.write(initialStreamRequest);

// Stream an audio file from disk to the Conversation API, e.g.
// "./resources/audio.raw"
await pump(
  fs.createReadStream(filename),
  // Format the audio stream into the request format.
  new Transform({
    objectMode: true,
    transform: (obj, _, next) => {
      next(null, {inputAudio: obj});
    },
  }),
  detectStream
);

Python

To authenticate to Dialogflow, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

def detect_intent_stream(project_id, session_id, audio_file_path, language_code):
    """Returns the result of detect intent with streaming audio as input.

    Using the same `session_id` between requests allows continuation
    of the conversation."""
    from google.cloud import dialogflow

    session_client = dialogflow.SessionsClient()

    # Note: hard coding audio_encoding and sample_rate_hertz for simplicity.
    audio_encoding = dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16
    sample_rate_hertz = 16000

    session_path = session_client.session_path(project_id, session_id)
    print("Session path: {}\n".format(session_path))

    def request_generator(audio_config, audio_file_path):
        query_input = dialogflow.QueryInput(audio_config=audio_config)

        # The first request contains the configuration.
        yield dialogflow.StreamingDetectIntentRequest(
            session=session_path, query_input=query_input
        )

        # Here we are reading small chunks of audio data from a local
        # audio file.  In practice these chunks should come from
        # an audio input device.
        with open(audio_file_path, "rb") as audio_file:
            while True:
                chunk = audio_file.read(4096)
                if not chunk:
                    break
                # Subsequent requests contain audio data only.
                yield dialogflow.StreamingDetectIntentRequest(input_audio=chunk)

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=audio_encoding,
        language_code=language_code,
        sample_rate_hertz=sample_rate_hertz,
    )

    requests = request_generator(audio_config, audio_file_path)
    responses = session_client.streaming_detect_intent(requests=requests)

    print("=" * 20)
    for response in responses:
        print(
            'Intermediate transcript: "{}".'.format(
                response.recognition_result.transcript
            )
        )

    # Note: The result from the last response is the final transcript along
    # with the detected content.
    query_result = response.query_result

    print("=" * 20)
    print("Query text: {}".format(query_result.query_text))
    print(
        "Detected intent: {} (confidence: {})\n".format(
            query_result.intent.display_name, query_result.intent_detection_confidence
        )
    )
    print("Fulfillment text: {}\n".format(query_result.fulfillment_text))

Additional languages

C#: Follow the C# setup instructions on the client libraries page and then visit the Dialogflow reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page and then visit the Dialogflow reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page and then visit the Dialogflow reference documentation for Ruby.

Samples

For best practices on streaming audio from a browser microphone to Dialogflow, see the samples page.