Improve transcription accuracy with speech adaptation boost

Before you begin

Boost is an optional feature of speech adaptation. Before you use boost, make sure that you have read the speech adaptation documentation. To see whether boost is supported for your language, see the language support page.

Overview

By default, speech adaptation provides a relatively small effect, especially for one-word phrases. Speech adaptation boost lets you increase the bias of the recognition model by assigning more weight to some phrases than others. For example, suppose that in many of your recordings speakers ask about the "fare to get into the county fair," and the word "fair" occurs more frequently than "fare." In this case, you want Speech-to-Text to recognize both "fair" and "fare" more often than words like "hare" or "lair." However, "fair" should be recognized more often than "fare" because it appears more frequently in the audio.

In this scenario, you want to boost both "fair" and "fare" to increase the probability that these words are recognized correctly. However, because "fair" occurs more often than "fare," you can assign a higher boost value to "fair" so that the Speech-to-Text API chooses that word more frequently than "fare."

Set boost values

When you use boost, you assign a weighted value to a SpeechContext object. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that phrase from the possible alternatives.
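For illustration, here is a minimal sketch of this weighting, written as plain dictionaries in the shape accepted by the v1p1beta1 Python client used in the samples later on this page. The phrases and values simply mirror the "fair"/"fare" example above:

    # Each SpeechContext carries its own phrases and boost value; the boost
    # applies to every phrase listed in that context.
    fair_context = {"phrases": ["fair"], "boost": 15.0}
    fare_context = {"phrases": ["fare"], "boost": 2.0}

    config = {
        "language_code": "en-US",
        "speech_contexts": [fair_context, fare_context],
    }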

Higher boost values can reduce false negatives, which are cases where a word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text. However, boost can also increase the likelihood of false positives, that is, cases where a word or phrase appears in the transcription even though it didn't occur in the audio.

Boost values must be a float value greater than 0. The practical maximum for a boost value is 20. For best results, experiment with your transcriptions by picking an initial boost value and adjusting it up or down as needed.
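One way to run that experiment, echoed by a comment in the Python sample later on this page, is a binary-search-style narrowing of the boost range. The sketch below assumes a hypothetical helper, transcription_accuracy(boost), that you would write yourself to run a recognition request at the given boost value and score the transcript against a reference; it is not part of the client library:

    def tune_boost(transcription_accuracy, low=0.0, high=20.0, steps=5):
        """Repeatedly halve the [low, high] boost range toward the better side.

        `transcription_accuracy` is a hypothetical, user-supplied function that
        transcribes test audio with the given boost and returns an accuracy score.
        """
        for _ in range(steps):
            mid = (low + high) / 2
            left_score = transcription_accuracy((low + mid) / 2)
            right_score = transcription_accuracy((mid + high) / 2)
            # Keep whichever half of the range scored better.
            if left_score >= right_score:
                high = mid
            else:
                low = mid
        return (low + high) / 2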

Speech adaptation boost example

To set different boost values for "fair" and "fare" in a speech transcription request, assign two SpeechContext objects to the speechContexts array of the RecognitionConfig object. In each SpeechContext object (one containing "fair", the other containing "fare"), set the boost field to a positive float value.

The following snippet shows an example of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes a RecognitionConfig object that uses boost values to weight the words "fair" and "fare" differently.

    "config": {
        "encoding":"LINEAR16",
        "sampleRateHertz": 8000,
        "languageCode":"en-US",
        "speechContexts": [{
          "phrases": ["fair"],
          "boost": 15
         }, {
          "phrases": ["fare"],
          "boost": 2
         }]
      }
    

The following code samples demonstrate how to send a request with speech adaptation boost.

REST &amp; command line

For details about the API endpoint, see speech:recognize.

Before using any of the request data below, make the following replacements:

  • language-code: the BCP-47 code of the language spoken in your audio clip.
  • phrases-to-boost: the phrase or phrases that you want Speech-to-Text to boost, as an array of strings.
  • storage-bucket: a Cloud Storage bucket.
  • input-audio: the audio data that you want to transcribe.

HTTP method and URL:

POST https://speech.googleapis.com/v1p1beta1/speech:recognize

Request JSON body:

    {
      "config":{
          "languageCode":"language-code",
          "speechContexts":[{
              "phrases":[phrases-to-boost],
              "boost": 2
          }]
      },
      "audio":{
        "uri":"gs:storage-bucket/input-file"
      }
    }
    

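One way to send the request from the command line is with curl. The example below assumes that you have saved the request body above to a file named request.json and that you have the Google Cloud CLI installed to provide an access token:

    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        -H "Content-Type: application/json; charset=utf-8" \
        -d @request.json \
        "https://speech.googleapis.com/v1p1beta1/speech:recognize"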

You should receive a JSON response similar to the following:

    {
      "results": [
        {
          "alternatives": [
            {
              "transcript": "When deciding whether to bring an umbrella, I consider the weather",
              "confidence": 0.9463943
            }
          ],
          "languageCode": "en-us"
        }
      ]
    }
    

Java

    import com.google.cloud.speech.v1p1beta1.RecognitionAudio;
    import com.google.cloud.speech.v1p1beta1.RecognitionConfig;
    import com.google.cloud.speech.v1p1beta1.RecognizeRequest;
    import com.google.cloud.speech.v1p1beta1.RecognizeResponse;
    import com.google.cloud.speech.v1p1beta1.SpeechClient;
    import com.google.cloud.speech.v1p1beta1.SpeechContext;
    import com.google.cloud.speech.v1p1beta1.SpeechRecognitionAlternative;
    import com.google.cloud.speech.v1p1beta1.SpeechRecognitionResult;
    import java.io.IOException;

    public class SpeechAdaptation {

      public void speechAdaptation() throws IOException {
        String uriPath = "gs://cloud-samples-data/speech/brooklyn_bridge.mp3";
        speechAdaptation(uriPath);
      }

      public static void speechAdaptation(String uriPath) throws IOException {
        // Initialize client that will be used to send requests. This client only needs to be created
        // once, and can be reused for multiple requests. After completing all of your requests, call
        // the "close" method on the client to safely clean up any remaining background resources.
        try (SpeechClient speechClient = SpeechClient.create()) {

          // Provides "hints" to the speech recognizer to favor specific words and phrases in the
          // results.
          // https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1p1beta1#google.cloud.speech.v1p1beta1.SpeechContext
          SpeechContext speechContext =
              SpeechContext.newBuilder().addPhrases("Brooklyn Bridge").setBoost(20.0F).build();
          // Configure recognition config to match your audio file.
          RecognitionConfig config =
              RecognitionConfig.newBuilder()
                  .setEncoding(RecognitionConfig.AudioEncoding.MP3)
                  .setSampleRateHertz(44100)
                  .setLanguageCode("en-US")
                  .addSpeechContexts(speechContext)
                  .build();
          // Set the path to your audio file
          RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(uriPath).build();

          // Make the request
          RecognizeRequest request =
              RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();

          // Display the results
          RecognizeResponse response = speechClient.recognize(request);
          for (SpeechRecognitionResult result : response.getResultsList()) {
            // First alternative is the most probable result
            SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
            System.out.printf("Transcript: %s\n", alternative.getTranscript());
          }
        }
      }
    }

Node.js


    const speech = require('@google-cloud/speech').v1p1beta1;

    /**
     * Performs synchronous speech recognition with speech adaptation.
     *
     * @param sampleRateHertz {number} Sample rate in Hertz of the audio data sent in all
     * `RecognitionAudio` messages. Valid values are: 8000-48000.
     * @param languageCode {string} The language of the supplied audio.
     * @param phrase {string} Phrase "hints" help Speech-to-Text API recognize the specified phrases from
     * your audio data.
     * @param boost {number} Positive value will increase the probability that a specific phrase will be
     * recognized over other similar sounding phrases.
     * @param uriPath {string} Path to the audio file stored on GCS.
     */
    function sampleRecognize(
      sampleRateHertz,
      languageCode,
      phrase,
      boost,
      uriPath
    ) {
      const client = new speech.SpeechClient();
      // const sampleRateHertz = 44100;
      // const languageCode = 'en-US';
      // const phrase = 'Brooklyn Bridge';
      // const boost = 20.0;
      // const uriPath = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3';
      const encoding = 'MP3';
      const phrases = [phrase];
      const speechContextsElement = {
        phrases: phrases,
        boost: boost,
      };
      const speechContexts = [speechContextsElement];
      const config = {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode,
        speechContexts: speechContexts,
      };
      const audio = {
        uri: uriPath,
      };
      const request = {
        config: config,
        audio: audio,
      };
      client
        .recognize(request)
        .then(responses => {
          const response = responses[0];
          for (const result of response.results) {
            // First alternative is the most probable result
            const alternative = result.alternatives[0];
            console.log(`Transcript: ${alternative.transcript}`);
          }
        })
        .catch(err => {
          console.error(err);
        });
    }
    

Python

    from google.cloud import speech_v1p1beta1
    from google.cloud.speech_v1p1beta1 import enums

    def sample_recognize(storage_uri, phrase):
        """
        Transcribe a short audio file with speech adaptation.

        Args:
          storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
          phrase Phrase "hints" help recognize the specified phrases from your audio.
        """

        client = speech_v1p1beta1.SpeechClient()

        # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3'
        # phrase = 'Brooklyn Bridge'
        phrases = [phrase]

        # Hint Boost. This value increases the probability that a specific
        # phrase will be recognized over other similar sounding phrases.
        # The higher the boost, the higher the chance of false positive
        # recognition as well. Can accept wide range of positive values.
        # Most use cases are best served with values between 0 and 20.
        # Using a binary search approach may help you find the optimal value.
        boost = 20.0
        speech_contexts_element = {"phrases": phrases, "boost": boost}
        speech_contexts = [speech_contexts_element]

        # Sample rate in Hertz of the audio data sent
        sample_rate_hertz = 44100

        # The language of the supplied audio
        language_code = "en-US"

        # Encoding of audio data sent. This sample sets this explicitly.
        # This field is optional for FLAC and WAV audio formats.
        encoding = enums.RecognitionConfig.AudioEncoding.MP3
        config = {
            "speech_contexts": speech_contexts,
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
            "encoding": encoding,
        }
        audio = {"uri": storage_uri}

        response = client.recognize(config, audio)
        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))

    
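For reference, invoking the sample with the values from its inline comments would look like this:

    sample_recognize(
        "gs://cloud-samples-data/speech/brooklyn_bridge.mp3", "Brooklyn Bridge"
    )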

What's next