通过增强型语音自适应提高转录准确率

准备工作

增强型语音自适应是可选的语音自适应功能。 在使用增强型功能之前,请确保您已阅读语音自适应文档。如需查看增强型功能是否支持您的语言,请参阅语言支持页面

概览

默认情况下,语音自适应的影响相对较小,对单字词短语来说尤其如此。增强型语音自适应允许您为某些短语分配比其他短语更高的权重,从而提高识别模型偏差。例如,您的很多录音中讲话人在询问“fare to get into the county fair”,其中“fair”一词出现的频率高于“fare”。在这种情况下,您希望 Speech-to-Text 更频繁地识别“fair”和“fare”,而不是如“hare”或“lair”这些词。不过,“fair”的识别频率要高于“fare”,因为它在音频中出现的频率更高。

在这种情况下,您可能希望提高“fair”和“fare”的权重,以提高这些字词被正确识别的可能性。不过,由于“fair”出现的次数多于“fare”,您可以为“fair”指定较高的权重值,以使 Speech-to-Text API 选择该词的频率高于“fare”。

设置增强值

使用增强型功能时,您可以为 SpeechContext 对象指定权重值。Speech-to-Text 为音频数据中的字词选择可能的转录时,会参考此权重值。值越高,Speech-to-Text 从可能的备选项中选择该短语的可能性就越大。

较高的增强值可以减少假负例,假负例是指音频中出现的字词或短语未被 Speech-to-Text 正确识别的情况。但是,增强型功能也会增加出现假正例的可能性;假正例是指音频中不包含的字词或短语出现在转录中的情况。

增强值必须是大于 0 的浮点值。增强值的实际上限为 20。为获得最佳转录结果,请选择一个初始增强值进行实验,并根据需要调大或调小。

增强型语音自适应示例

如需在语音转录请求中为“fair”和“fare”设置不同的增强值,请将两个 SpeechContext 对象设置为 RecognitionConfig 对象的 speechContexts 数组。对于每个 SpeechContext 对象(一个包含“fair”,另一个包含“fare”),请将 boost 值设置为非负浮点值。

以下代码段显示了发送至 Speech-to-Text API 的 JSON 载荷示例。该 JSON 代码段包含一个 RecognitionConfig 对象,该对象使用增强值对“fair”和“fare”字词赋予不同的权重。

"config": {
    "encoding":"LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode":"en-US",
    "speechContexts": [{
      "phrases": ["fair"],
      "boost": 15
     }, {
      "phrases": ["fare"],
      "boost": 2
     }]
  }

以下代码示例演示了如何使用增强型语音自适应发送请求。

REST 和命令行

如需详细了解 API 端点,请参阅 speech:recognize

在使用下面的请求数据之前,请先进行以下替换:

  • language-code:音频剪辑中所用语言的 BCP-47 代码。
  • phrases-to-boost:您希望 Speech-to-Text 增强的短语或短语组(以一组字符串的形式提供)。
  • storage-bucket:Cloud Storage 存储分区;
  • input-audio:您要转录的音频数据。

HTTP 方法和网址:

POST https://speech.googleapis.com/v1p1beta1/speech:recognize

请求 JSON 正文:

{
  "config":{
      "languageCode":"language-code",
      "speechContexts":[{
          "phrases":[phrases-to-boost],
          "boost": 2
      }]
  },
  "audio":{
    "uri":"gs:storage-bucket/input-file"
  }
}

如需发送您的请求,请展开以下选项之一:

您应会收到如下所示的 JSON 响应:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "When deciding whether to bring an umbrella, I consider the weather",
          "confidence": 0.9463943
        }
      ],
      "languageCode": "en-us"
    }
  ]
}

Java

import com.google.cloud.speech.v1p1beta1.RecognitionAudio;
import com.google.cloud.speech.v1p1beta1.RecognitionConfig;
import com.google.cloud.speech.v1p1beta1.RecognizeRequest;
import com.google.cloud.speech.v1p1beta1.RecognizeResponse;
import com.google.cloud.speech.v1p1beta1.SpeechClient;
import com.google.cloud.speech.v1p1beta1.SpeechContext;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionResult;
import java.io.IOException;

public class SpeechAdaptation {

  public void speechAdaptation() throws IOException {
    String uriPath = "gs://cloud-samples-data/speech/brooklyn_bridge.mp3";
    speechAdaptation(uriPath);
  }

  public static void speechAdaptation(String uriPath) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (SpeechClient speechClient = SpeechClient.create()) {

      // Provides "hints" to the speech recognizer to favor specific words and phrases in the
      // results.
      // https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1p1beta1#google.cloud.speech.v1p1beta1.SpeechContext
      SpeechContext speechContext =
          SpeechContext.newBuilder().addPhrases("Brooklyn Bridge").setBoost(20.0F).build();
      // Configure recognition config to match your audio file.
      RecognitionConfig config =
          RecognitionConfig.newBuilder()
              .setEncoding(RecognitionConfig.AudioEncoding.MP3)
              .setSampleRateHertz(44100)
              .setLanguageCode("en-US")
              .addSpeechContexts(speechContext)
              .build();
      // Set the path to your audio file
      RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(uriPath).build();

      // Make the request
      RecognizeRequest request =
          RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();

      // Display the results
      RecognizeResponse response = speechClient.recognize(request);
      for (SpeechRecognitionResult result : response.getResultsList()) {
        // First alternative is the most probable result
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcript: %s\n", alternative.getTranscript());
      }
    }
  }
}

Node.js


const speech = require('@google-cloud/speech').v1p1beta1;

/**
 * Performs synchronous speech recognition with speech adaptation.
 *
 * @param sampleRateHertz {number} Sample rate in Hertz of the audio data sent in all
 * `RecognitionAudio` messages. Valid values are: 8000-48000.
 * @param languageCode {string} The language of the supplied audio.
 * @param phrase {string} Phrase "hints" help Speech-to-Text API recognize the specified phrases from
 * your audio data.
 * @param boost {number} Positive value will increase the probability that a specific phrase will be
 * recognized over other similar sounding phrases.
 * @param uriPath {string} Path to the audio file stored on GCS.
 */
function sampleRecognize(
  sampleRateHertz,
  languageCode,
  phrase,
  boost,
  uriPath
) {
  const client = new speech.SpeechClient();
  // const sampleRateHertz = 44100;
  // const languageCode = 'en-US';
  // const phrase = 'Brooklyn Bridge';
  // const boost = 20.0;
  // const uriPath = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3';
  const encoding = 'MP3';
  const phrases = [phrase];
  const speechContextsElement = {
    phrases: phrases,
    boost: boost,
  };
  const speechContexts = [speechContextsElement];
  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    speechContexts: speechContexts,
  };
  const audio = {
    uri: uriPath,
  };
  const request = {
    config: config,
    audio: audio,
  };
  client
    .recognize(request)
    .then(responses => {
      const response = responses[0];
      for (const result of response.results) {
        // First alternative is the most probable result
        const alternative = result.alternatives[0];
        console.log(`Transcript: ${alternative.transcript}`);
      }
    })
    .catch(err => {
      console.error(err);
    });
}

Python

from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1 import enums

def sample_recognize(storage_uri, phrase):
    """
    Transcribe a short audio file with speech adaptation.

    Args:
      storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
      phrase Phrase "hints" help recognize the specified phrases from your audio.
    """

    client = speech_v1p1beta1.SpeechClient()

    # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3'
    # phrase = 'Brooklyn Bridge'
    phrases = [phrase]

    # Hint Boost. This value increases the probability that a specific
    # phrase will be recognized over other similar sounding phrases.
    # The higher the boost, the higher the chance of false positive
    # recognition as well. Can accept wide range of positive values.
    # Most use cases are best served with values between 0 and 20.
    # Using a binary search happroach may help you find the optimal value.
    boost = 20.0
    speech_contexts_element = {"phrases": phrases, "boost": boost}
    speech_contexts = [speech_contexts_element]

    # Sample rate in Hertz of the audio data sent
    sample_rate_hertz = 44100

    # The language of the supplied audio
    language_code = "en-US"

    # Encoding of audio data sent. This sample sets this explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = enums.RecognitionConfig.AudioEncoding.MP3
    config = {
        "speech_contexts": speech_contexts,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    response = client.recognize(config, audio)
    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))

后续步骤