This page shows how to stream audio input to a detect intent request using the API. Dialogflow processes the audio and converts it to text before attempting an intent match. This conversion is known as audio input, speech recognition, or speech-to-text (STT).
Before you begin
This feature is only applicable when using the API for end-user interactions. If you are using an integration, you can skip this guide.
Before reading this guide, do the following:
- Read the Dialogflow basics.
- Perform the setup steps.
Create an agent
If you have not already created an agent, create one now:
- Go to the Dialogflow ES Console.
- If requested, sign in to the Dialogflow Console. See the Dialogflow Console overview for more information.
- Click Create Agent in the left sidebar menu. If you already have other agents, click the agent name, scroll to the bottom, then click Create new agent.
- Enter your agent's name, default language, and default time zone.
- If you have already created a project, enter that project. If you want to allow the Dialogflow Console to create the project, select Create a new Google project.
- Click the Create button.
Import the example file to your agent
The steps in this guide make assumptions about your agent, so you need to import the agent prepared for this guide. When importing, these steps use the restore option, which overwrites all agent settings, intents, and entities.
To import the file, follow these steps:
- Download the room-booking-agent.zip file.
- Go to the Dialogflow ES Console.
- Select your agent.
- Click the settings button next to the agent name.
- Select the Export and Import tab.
- Select Restore from ZIP and follow the instructions to restore the zip file you downloaded.
Streaming basics
The Session type's streamingDetectIntent method returns a bidirectional gRPC streaming object. The methods available for this object vary by language, so see the reference documentation for your client library for details.
The streaming object is used to send and receive data concurrently. Using this object, the client streams audio content to Dialogflow, while simultaneously listening for StreamingDetectIntentResponse messages.
The query_input.audio_config.single_utterance parameter of the streamingDetectIntent method affects speech recognition:
- If false (the default), speech recognition does not stop until the client closes the stream.
- If true, Dialogflow detects a single spoken utterance in the input audio. When Dialogflow detects that the voice in the audio has stopped or paused, it stops speech recognition and sends a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE to the client. Any audio sent to Dialogflow on the stream after it receives END_OF_SINGLE_UTTERANCE is ignored.
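For illustration, here is a minimal Python sketch of the first streaming request with single_utterance enabled. It is a sketch only, assuming the google-cloud-dialogflow client library; the session path, sample rate, and language code are placeholder values, not part of this guide's sample agent.
from google.cloud import dialogflow

# First streaming request: configuration only. Note single_utterance,
# which controls the behavior described above.
audio_config = dialogflow.InputAudioConfig(
    audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
    sample_rate_hertz=16000,   # placeholder: match your audio
    language_code="en-US",     # placeholder: match your agent
    single_utterance=True,
)
config_request = dialogflow.StreamingDetectIntentRequest(
    session="projects/PROJECT_ID/agent/sessions/SESSION_ID",  # placeholder
    query_input=dialogflow.QueryInput(audio_config=audio_config),
)
# Every later request on the same stream carries audio only, for example:
#     dialogflow.StreamingDetectIntentRequest(input_audio=chunk)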
With bidirectional streaming, the client can half-close the streaming object to signal to the server that it won't send any more data. For example, in Java and Go, this method is called closeSend. It is important to half-close (but not abort) streams in the following situations:
- The client has finished sending data.
- The client is configured with single_utterance set to true, and it receives a StreamingDetectIntentResponse with a recognition result of END_OF_SINGLE_UTTERANCE.
After closing a stream, the client should start a new request with a new stream as needed.
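As a sketch of this pattern (not one of the full samples below), the following Python fragment stops sending audio once END_OF_SINGLE_UTTERANCE arrives; with the Python client, returning from the request generator plays the role of the half-close (closeSend in Java and Go). The client, session_path, audio_config, and audio_chunks values are assumed to be set up elsewhere.
import threading
from google.cloud import dialogflow

def detect_single_utterance(client, session_path, audio_config, audio_chunks):
    """Streams audio until Dialogflow signals END_OF_SINGLE_UTTERANCE."""
    stop_sending = threading.Event()

    def requests():
        # First request: configuration only.
        yield dialogflow.StreamingDetectIntentRequest(
            session=session_path,
            query_input=dialogflow.QueryInput(audio_config=audio_config),
        )
        # Later requests: audio only. Returning from this generator
        # half-closes (does not cancel) the request stream.
        for chunk in audio_chunks:
            if stop_sending.is_set():
                break
            yield dialogflow.StreamingDetectIntentRequest(input_audio=chunk)

    query_result = None
    for response in client.streaming_detect_intent(requests=requests()):
        message_type = response.recognition_result.message_type
        if message_type == dialogflow.StreamingRecognitionResult.MessageType.END_OF_SINGLE_UTTERANCE:
            # Dialogflow ignores any further audio on this stream, so stop sending.
            stop_sending.set()
        if response.query_result.query_text:
            query_result = response.query_result
    return query_result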
Streaming detect intent
The following samples use the Session type's streamingDetectIntent method to stream audio.
C#
using Google.Cloud.Dialogflow.V2;
using System;
using System.IO;
using System.Threading.Tasks;

public static async Task<object> DetectIntentFromStreamAsync(
string projectId,
string sessionId,
string filePath)
{
var sessionsClient = SessionsClient.Create();
var sessionName = SessionName.FromProjectSession(projectId, sessionId).ToString();
// Initialize streaming call, retrieving the stream object
var streamingDetectIntent = sessionsClient.StreamingDetectIntent();
// Define a task to process results from the API
var responseHandlerTask = Task.Run(async () =>
{
var responseStream = streamingDetectIntent.GetResponseStream();
while (await responseStream.MoveNextAsync())
{
var response = responseStream.Current;
var queryResult = response.QueryResult;
if (queryResult != null)
{
Console.WriteLine($"Query text: {queryResult.QueryText}");
if (queryResult.Intent != null)
{
Console.Write("Intent detected:");
Console.WriteLine(queryResult.Intent.DisplayName);
}
}
}
});
// Instructs the speech recognizer how to process the audio content.
// Note: hard coding audioEncoding, sampleRateHertz for simplicity.
var queryInput = new QueryInput
{
AudioConfig = new InputAudioConfig
{
AudioEncoding = AudioEncoding.Linear16,
LanguageCode = "en-US",
SampleRateHertz = 16000
}
};
// The first request must **only** contain the audio configuration:
await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
{
QueryInput = queryInput,
Session = sessionName
});
using (FileStream fileStream = new FileStream(filePath, FileMode.Open))
{
// Subsequent requests must **only** contain the audio data.
// Following messages: audio chunks. We just read the file in
// fixed-size chunks. In reality you would split the user input
// by time.
var buffer = new byte[32 * 1024];
int bytesRead;
while ((bytesRead = await fileStream.ReadAsync(
buffer, 0, buffer.Length)) > 0)
{
await streamingDetectIntent.WriteAsync(new StreamingDetectIntentRequest
{
Session = sessionName,
InputAudio = Google.Protobuf.ByteString.CopyFrom(buffer, 0, bytesRead)
});
};
}
// Tell the service you are done sending data
await streamingDetectIntent.WriteCompleteAsync();
// This will complete once all server responses have been processed.
await responseHandlerTask;
return 0;
}
Go
// Requires the cloud.google.com/go/dialogflow module.
import (
	"context"
	"errors"
	"fmt"
	"io"
	"log"
	"os"

	dialogflow "cloud.google.com/go/dialogflow/apiv2"
	dialogflowpb "cloud.google.com/go/dialogflow/apiv2/dialogflowpb"
)

func DetectIntentStream(projectID, sessionID, audioFile, languageCode string) (string, error) {
ctx := context.Background()
sessionClient, err := dialogflow.NewSessionsClient(ctx)
if err != nil {
return "", err
}
defer sessionClient.Close()
if projectID == "" || sessionID == "" {
return "", errors.New(fmt.Sprintf("Received empty project (%s) or session (%s)", projectID, sessionID))
}
sessionPath := fmt.Sprintf("projects/%s/agent/sessions/%s", projectID, sessionID)
// In this example, we hard code the encoding and sample rate for simplicity.
audioConfig := dialogflowpb.InputAudioConfig{AudioEncoding: dialogflowpb.AudioEncoding_AUDIO_ENCODING_LINEAR_16, SampleRateHertz: 16000, LanguageCode: languageCode}
queryAudioInput := dialogflowpb.QueryInput_AudioConfig{AudioConfig: &audioConfig}
queryInput := dialogflowpb.QueryInput{Input: &queryAudioInput}
streamer, err := sessionClient.StreamingDetectIntent(ctx)
if err != nil {
return "", err
}
f, err := os.Open(audioFile)
if err != nil {
return "", err
}
defer f.Close()
go func() {
audioBytes := make([]byte, 1024)
request := dialogflowpb.StreamingDetectIntentRequest{Session: sessionPath, QueryInput: &queryInput}
err = streamer.Send(&request)
if err != nil {
log.Fatal(err)
}
for {
			n, err := f.Read(audioBytes)
			if err == io.EOF {
				// Half-close the stream once all audio has been sent.
				streamer.CloseSend()
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			// Send only the bytes actually read from the file.
			request = dialogflowpb.StreamingDetectIntentRequest{InputAudio: audioBytes[:n]}
err = streamer.Send(&request)
if err != nil {
log.Fatal(err)
}
}
}()
var queryResult *dialogflowpb.QueryResult
for {
response, err := streamer.Recv()
if err == io.EOF {
break
}
if err != nil {
log.Fatal(err)
}
recognitionResult := response.GetRecognitionResult()
transcript := recognitionResult.GetTranscript()
log.Printf("Recognition transcript: %s\n", transcript)
queryResult = response.GetQueryResult()
}
fulfillmentText := queryResult.GetFulfillmentText()
return fulfillmentText, nil
}
Java
import com.google.api.gax.rpc.ApiException;
import com.google.api.gax.rpc.BidiStream;
import com.google.cloud.dialogflow.v2.AudioEncoding;
import com.google.cloud.dialogflow.v2.InputAudioConfig;
import com.google.cloud.dialogflow.v2.QueryInput;
import com.google.cloud.dialogflow.v2.QueryResult;
import com.google.cloud.dialogflow.v2.SessionName;
import com.google.cloud.dialogflow.v2.SessionsClient;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentRequest;
import com.google.cloud.dialogflow.v2.StreamingDetectIntentResponse;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.io.IOException;
class DetectIntentStream {
// DialogFlow API Detect Intent sample with an audio file processed as an audio stream.
static void detectIntentStream(String projectId, String audioFilePath, String sessionId)
throws IOException, ApiException {
// String projectId = "YOUR_PROJECT_ID";
// String audioFilePath = "path_to_your_audio_file";
// Using the same `sessionId` between requests allows continuation of the conversation.
// String sessionId = "Identifier of the DetectIntent session";
// Instantiates a client
try (SessionsClient sessionsClient = SessionsClient.create()) {
// Set the session name using the sessionId (UUID) and projectID (my-project-id)
SessionName session = SessionName.of(projectId, sessionId);
// Instructs the speech recognizer how to process the audio content.
// Note: hard coding audioEncoding and sampleRateHertz for simplicity.
// Audio encoding of the audio content sent in the query request.
InputAudioConfig inputAudioConfig =
InputAudioConfig.newBuilder()
.setAudioEncoding(AudioEncoding.AUDIO_ENCODING_LINEAR_16)
.setLanguageCode("en-US") // languageCode = "en-US"
.setSampleRateHertz(16000) // sampleRateHertz = 16000
.build();
// Build the query with the InputAudioConfig
QueryInput queryInput = QueryInput.newBuilder().setAudioConfig(inputAudioConfig).build();
// Create the Bidirectional stream
BidiStream<StreamingDetectIntentRequest, StreamingDetectIntentResponse> bidiStream =
sessionsClient.streamingDetectIntentCallable().call();
// The first request must **only** contain the audio configuration:
bidiStream.send(
StreamingDetectIntentRequest.newBuilder()
.setSession(session.toString())
.setQueryInput(queryInput)
.build());
try (FileInputStream audioStream = new FileInputStream(audioFilePath)) {
// Subsequent requests must **only** contain the audio data.
// Following messages: audio chunks. We just read the file in fixed-size chunks. In reality
// you would split the user input by time.
byte[] buffer = new byte[4096];
int bytes;
while ((bytes = audioStream.read(buffer)) != -1) {
bidiStream.send(
StreamingDetectIntentRequest.newBuilder()
.setInputAudio(ByteString.copyFrom(buffer, 0, bytes))
.build());
}
}
// Tell the service you are done sending data
bidiStream.closeSend();
for (StreamingDetectIntentResponse response : bidiStream) {
QueryResult queryResult = response.getQueryResult();
System.out.println("====================");
System.out.format("Intent Display Name: %s\n", queryResult.getIntent().getDisplayName());
System.out.format("Query Text: '%s'\n", queryResult.getQueryText());
System.out.format(
"Detected Intent: %s (confidence: %f)\n",
queryResult.getIntent().getDisplayName(), queryResult.getIntentDetectionConfidence());
System.out.format("Fulfillment Text: '%s'\n", queryResult.getFulfillmentText());
}
}
}
}
Node.js
const fs = require('fs');
const util = require('util');
const {Transform, pipeline} = require('stream');
const {struct} = require('pb-util');
const pump = util.promisify(pipeline);
// Imports the Dialogflow library
const dialogflow = require('@google-cloud/dialogflow');
// Instantiates a session client
const sessionClient = new dialogflow.SessionsClient();
// The path to the local file on which to perform speech recognition, e.g.
// /path/to/audio.raw
// const filename = '/path/to/audio.raw';
// The Dialogflow project ID and session ID to use, e.g.
// const projectId = 'my-project-id';
// const sessionId = 'some-session-id';
// The encoding of the audio file, e.g. 'AUDIO_ENCODING_LINEAR_16'
// const encoding = 'AUDIO_ENCODING_LINEAR_16';
// The sample rate of the audio file in hertz, e.g. 16000
// const sampleRateHertz = 16000;
// The BCP-47 language code to use, e.g. 'en-US'
// const languageCode = 'en-US';
const sessionPath = sessionClient.projectAgentSessionPath(
projectId,
sessionId
);
const initialStreamRequest = {
session: sessionPath,
queryInput: {
audioConfig: {
audioEncoding: encoding,
sampleRateHertz: sampleRateHertz,
languageCode: languageCode,
},
singleUtterance: true,
},
};
// Create a stream for the streaming request.
const detectStream = sessionClient
.streamingDetectIntent()
.on('error', console.error)
.on('data', data => {
if (data.recognitionResult) {
console.log(
`Intermediate transcript: ${data.recognitionResult.transcript}`
);
} else {
console.log('Detected intent:');
const result = data.queryResult;
// Instantiates a context client
const contextClient = new dialogflow.ContextsClient();
console.log(` Query: ${result.queryText}`);
console.log(` Response: ${result.fulfillmentText}`);
if (result.intent) {
console.log(` Intent: ${result.intent.displayName}`);
} else {
console.log(' No intent matched.');
}
const parameters = JSON.stringify(struct.decode(result.parameters));
console.log(` Parameters: ${parameters}`);
if (result.outputContexts && result.outputContexts.length) {
console.log(' Output contexts:');
result.outputContexts.forEach(context => {
const contextId = contextClient.matchContextFromProjectAgentSessionContextName(
context.name
);
const contextParameters = JSON.stringify(
struct.decode(context.parameters)
);
console.log(` ${contextId}`);
console.log(` lifespan: ${context.lifespanCount}`);
console.log(` parameters: ${contextParameters}`);
});
}
}
});
// Write the initial stream request to config for audio input.
detectStream.write(initialStreamRequest);
// Stream an audio file from disk to the Conversation API, e.g.
// "./resources/audio.raw"
await pump(
fs.createReadStream(filename),
// Format the audio stream into the request format.
new Transform({
objectMode: true,
transform: (obj, _, next) => {
next(null, {inputAudio: obj});
},
}),
detectStream
);
PHP
<?php

namespace Google\Cloud\Samples\Dialogflow;
use Google\Cloud\Dialogflow\V2\SessionsClient;
use Google\Cloud\Dialogflow\V2\AudioEncoding;
use Google\Cloud\Dialogflow\V2\InputAudioConfig;
use Google\Cloud\Dialogflow\V2\QueryInput;
use Google\Cloud\Dialogflow\V2\StreamingDetectIntentRequest;
/**
* Returns the result of detect intent with streaming audio as input.
* Using the same `session_id` between requests allows continuation
* of the conversation.
*/
function detect_intent_stream($projectId, $path, $sessionId, $languageCode = 'en-US')
{
// need to use gRPC
if (!defined('Grpc\STATUS_OK')) {
throw new \Exception('Install the grpc extension ' .
'(pecl install grpc)');
}
// new session
$sessionsClient = new SessionsClient();
$session = $sessionsClient->sessionName($projectId, $sessionId ?: uniqid());
printf('Session path: %s' . PHP_EOL, $session);
// hard coding audio_encoding and sample_rate_hertz for simplicity
$audioConfig = new InputAudioConfig();
$audioConfig->setAudioEncoding(AudioEncoding::AUDIO_ENCODING_LINEAR_16);
$audioConfig->setLanguageCode($languageCode);
$audioConfig->setSampleRateHertz(16000);
// create query input
$queryInput = new QueryInput();
$queryInput->setAudioConfig($audioConfig);
// first request contains the configuration
$request = new StreamingDetectIntentRequest();
$request->setSession($session);
$request->setQueryInput($queryInput);
$requests = [$request];
// we are going to read small chunks of audio data from
// a local audio file. in practice, these chunks should
// come from an audio input device.
$audioStream = fopen($path, 'rb');
while (true) {
$chunk = stream_get_contents($audioStream, 4096);
if (!$chunk) {
break;
}
$request = new StreamingDetectIntentRequest();
$request->setInputAudio($chunk);
$requests[] = $request;
}
// intermediate transcript info
print(PHP_EOL . str_repeat("=", 20) . PHP_EOL);
$stream = $sessionsClient->streamingDetectIntent();
foreach ($requests as $request) {
$stream->write($request);
}
foreach ($stream->closeWriteAndReadAll() as $response) {
$recognitionResult = $response->getRecognitionResult();
if ($recognitionResult) {
$transcript = $recognitionResult->getTranscript();
printf('Intermediate transcript: %s' . PHP_EOL, $transcript);
}
}
// get final response and relevant info
if ($response) {
print(str_repeat("=", 20) . PHP_EOL);
$queryResult = $response->getQueryResult();
$queryText = $queryResult->getQueryText();
$intent = $queryResult->getIntent();
$displayName = $intent->getDisplayName();
$confidence = $queryResult->getIntentDetectionConfidence();
$fulfilmentText = $queryResult->getFulfillmentText();
// output relevant info
printf('Query text: %s' . PHP_EOL, $queryText);
printf('Detected intent: %s (confidence: %f)' . PHP_EOL, $displayName,
$confidence);
print(PHP_EOL);
printf('Fulfilment text: %s' . PHP_EOL, $fulfilmentText);
}
$sessionsClient->close();
}
Python
def detect_intent_stream(project_id, session_id, audio_file_path,
language_code):
"""Returns the result of detect intent with streaming audio as input.
Using the same `session_id` between requests allows continuation
of the conversation."""
from google.cloud import dialogflow
session_client = dialogflow.SessionsClient()
# Note: hard coding audio_encoding and sample_rate_hertz for simplicity.
audio_encoding = dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16
sample_rate_hertz = 16000
session_path = session_client.session_path(project_id, session_id)
print('Session path: {}\n'.format(session_path))
def request_generator(audio_config, audio_file_path):
query_input = dialogflow.QueryInput(audio_config=audio_config)
# The first request contains the configuration.
yield dialogflow.StreamingDetectIntentRequest(
session=session_path, query_input=query_input)
# Here we are reading small chunks of audio data from a local
# audio file. In practice these chunks should come from
# an audio input device.
with open(audio_file_path, 'rb') as audio_file:
while True:
chunk = audio_file.read(4096)
if not chunk:
break
# The later requests contains audio data.
yield dialogflow.StreamingDetectIntentRequest(
input_audio=chunk)
audio_config = dialogflow.InputAudioConfig(
audio_encoding=audio_encoding, language_code=language_code,
sample_rate_hertz=sample_rate_hertz)
requests = request_generator(audio_config, audio_file_path)
responses = session_client.streaming_detect_intent(requests=requests)
print('=' * 20)
for response in responses:
print('Intermediate transcript: "{}".'.format(
response.recognition_result.transcript))
# Note: The result from the last response is the final transcript along
# with the detected content.
query_result = response.query_result
print('=' * 20)
print('Query text: {}'.format(query_result.query_text))
print('Detected intent: {} (confidence: {})\n'.format(
query_result.intent.display_name,
query_result.intent_detection_confidence))
print('Fulfillment text: {}\n'.format(
query_result.fulfillment_text))
Ruby
# project_id = "Your Google Cloud project ID"
# session_id = "mysession"
# audio_file_path = "resources/book_a_room.wav"
# language_code = "en-US"
require "google/cloud/dialogflow"
require "monitor"
session_client = Google::Cloud::Dialogflow.sessions
session = session_client.session_path project: project_id,
session: session_id
puts "Session path: #{session}"
audio_config = {
audio_encoding: :AUDIO_ENCODING_LINEAR_16,
sample_rate_hertz: 16_000,
language_code: language_code
}
query_input = { audio_config: audio_config }
streaming_config = { session: session, query_input: query_input }
# Set up a stream of audio data
request_stream = Gapic::StreamInput.new
# Initiate the call
response_stream = session_client.streaming_detect_intent request_stream
# Process the response stream in a separate thread
response_thread = Thread.new do
response_stream.each do |response|
if response.recognition_result
puts "Intermediate transcript: #{response.recognition_result.transcript}\n"
else
# the last response has the actual query result
query_result = response.query_result
puts "Query text: #{query_result.query_text}"
puts "Intent detected: #{query_result.intent.display_name}"
puts "Intent confidence: #{query_result.intent_detection_confidence}"
puts "Fulfillment text: #{query_result.fulfillment_text}\n"
end
end
end
# The first request needs to be the configuration.
request_stream.push streaming_config
# Send chunks of audio data in the request stream
begin
audio_file = File.open audio_file_path, "rb"
loop do
chunk = audio_file.read 4096
break unless chunk
request_stream.push input_audio: chunk
sleep 0.5
end
ensure
audio_file.close
# Close the request stream to signal that you are finished sending data.
request_stream.close
end
# Wait until the response processing thread is complete.
response_thread.join
Samples
See the samples page for best practices on streaming audio from a browser microphone to Dialogflow.