使用增强模型转录手机音频

本教程介绍了如何使用 Speech-to-Text 转录从电话录制的音频。

音频文件可能来自许多不同的来源。音频数据可能来自电话(如语音邮件)或视频文件所包含的音轨。

Speech-to-Text 可以从多种机器学习模型中选择一种来转录音频文件,以便完美匹配音频的原始来源。为了获得更好的语音转录结果,您可以指定原始音频的来源。这样,Speech-to-Text 就可以在处理您的音频文件时使用针对类似数据训练过的机器学习模型。

目标

  • 向 Speech-to-Text 发送音频转录请求,要求转录从电话录制的音频(如语音邮件)。
  • 为音频转录请求指定增强型语音识别模型

费用

本教程使用 Cloud Platform 的以下收费组件:

  • Speech-to-Text

请使用价格计算器根据您的预计使用情况来估算费用。Cloud Platform 新用户可能有资格免费试用

准备工作

本教程有几个前提条件:

发送请求

为了在转录电话上录制的音频(如通话或语音邮件)时获得最佳效果,您可以将 RecognitionConfig 载荷的 model 字段设置为 phone_callmodel 字段会告知 Speech-to-Text API 为转录请求使用哪种语音识别模型。

使用增强型模型可以改善电话音频转录的效果。如需使用增强型模型,您可以将 RecognitionConfig 载荷的 useEnhanced 字段设置为 true

以下代码示例演示了如何在调用 Speech-to-Text 时选择特定的转录模型。

协议

如需了解完整的详细信息,请参阅 speech:recognize API 端点。

如需执行同步语音识别,请发出 POST 请求并提供相应的请求正文。以下示例展示了一个使用 curl 发出的 POST 请求。该示例使用通过 Google Cloud Cloud SDK 为项目设置的服务帐号的访问令牌。如需了解有关安装 Cloud SDK、建立项目和服务帐号以及获取访问令牌的说明,请参阅快速入门

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "enableWordTimeOffsets": false,
        "enableAutomaticPunctuation": true,
        "model": "phone_call",
        "useEnhanced": true
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    }
}'

如需详细了解如何配置请求正文,请参阅 RecognitionConfig 参考文档。

如果请求成功,服务器将返回一个 200 OK HTTP 状态代码以及 JSON 格式的响应:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Hi, I'd like to buy a Chromecast. I was wondering whether you could help me with that.",
          "confidence": 0.8930228
        }
      ],
      "resultEndTime": "5.640s"
    },
    {
      "alternatives": [
        {
          "transcript": " Certainly, which color would you like? We are blue black and red.",
          "confidence": 0.9101991
        }
      ],
      "resultEndTime": "10.220s"
    },
    {
      "alternatives": [
        {
          "transcript": " Let's go with the black one.",
          "confidence": 0.8818244
        }
      ],
      "resultEndTime": "13.870s"
    },
    {
      "alternatives": [
        {
          "transcript": " Would you like the new Chromecast Ultra model or the regular Chromecast?",
          "confidence": 0.94733626
        }
      ],
      "resultEndTime": "18.460s"
    },
    {
      "alternatives": [
        {
          "transcript": " Regular Chromecast is fine. Thank you. Okay. Sure. Would you like to ship it regular or Express?",
          "confidence": 0.9519095
        }
      ],
      "resultEndTime": "25.930s"
    },
    {
      "alternatives": [
        {
          "transcript": " Express, please.",
          "confidence": 0.9101229
        }
      ],
      "resultEndTime": "28.260s"
    },
    {
      "alternatives": [
        {
          "transcript": " Terrific. It's on the way. Thank you. Thank you very much. Bye.",
          "confidence": 0.9321616
        }
      ],
      "resultEndTime": "34.150s"
    }
 ]
}

C#

static object SyncRecognizeEnhancedModel(string filePath)
{
    var speech = SpeechClient.Create();
    var response = speech.Recognize(new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 8000,
        LanguageCode = "en-US",
        // Enhanced models are only available for projects that
        // opt into audio data logging.
        UseEnhanced = true,
        // A model must be specified to use an enhanced model.
        Model = "phone_call",
    }, RecognitionAudio.FromFile(filePath));
    foreach (var result in response.Results)
    {
        foreach (var alternative in result.Alternatives)
        {
            Console.WriteLine(alternative.Transcript);
        }
    }
    return 0;
}

Go


func enhancedModel(w io.Writer, path string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}

	// path = "../testdata/commercial_mono.wav"
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return fmt.Errorf("ReadFile: %v", err)
	}

	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 8000,
			LanguageCode:    "en-US",
			UseEnhanced:     true,
			// A model must be specified to use enhanced model.
			Model: "phone_call",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	})
	if err != nil {
		return fmt.Errorf("Recognize: %v", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

/*
 * Please include the following imports to run this sample.
 *
 * import com.google.cloud.speech.v1.RecognitionAudio;
 * import com.google.cloud.speech.v1.RecognitionConfig;
 * import com.google.cloud.speech.v1.RecognizeRequest;
 * import com.google.cloud.speech.v1.RecognizeResponse;
 * import com.google.cloud.speech.v1.SpeechClient;
 * import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
 * import com.google.cloud.speech.v1.SpeechRecognitionResult;
 * import com.google.protobuf.ByteString;
 * import java.nio.file.Files;
 * import java.nio.file.Path;
 * import java.nio.file.Paths;
 */

public static void sampleRecognize() {
  // TODO(developer): Replace these variables before running the sample.
  String localFilePath = "resources/hello.wav";
  String model = "phone_call";
  sampleRecognize(localFilePath, model);
}

/**
 * Transcribe a short audio file using a specified transcription model
 *
 * @param localFilePath Path to local audio file, e.g. /path/audio.wav
 * @param model The transcription model to use, e.g. video, phone_call, default For a list of
 *     available transcription models, see:
 *     https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
 */
public static void sampleRecognize(String localFilePath, String model) {
  try (SpeechClient speechClient = SpeechClient.create()) {

    // The language of the supplied audio
    String languageCode = "en-US";
    RecognitionConfig config =
        RecognitionConfig.newBuilder().setModel(model).setLanguageCode(languageCode).build();
    Path path = Paths.get(localFilePath);
    byte[] data = Files.readAllBytes(path);
    ByteString content = ByteString.copyFrom(data);
    RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(content).build();
    RecognizeRequest request =
        RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();
    RecognizeResponse response = speechClient.recognize(request);
    for (SpeechRecognitionResult result : response.getResultsList()) {
      // First alternative is the most probable result
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript: %s\n", alternative.getTranscript());
    }
  } catch (Exception exception) {
    System.err.println("Failed to create the client due to: " + exception);
  }
}

Node.js

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);

Python

def transcribe_model_selection(speech_file, model):
    """Transcribe the given audio file synchronously with
    the selected model."""
    from google.cloud import speech

    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model=model,
    )

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print("First alternative of result {}".format(i))
        print(u"Transcript: {}".format(alternative.transcript))

清理

为避免因本教程中使用的资源而导致您的 Google Cloud Platform 帐号产生费用,请执行以下操作:

删除项目

为了避免产生费用,最简单的方法是删除您为本教程创建的项目。

如需删除项目,请执行以下操作:

  1. 在 Cloud Console 中,转到管理资源页面。

    转到“管理资源”页面

  2. 在项目列表中,选择要删除的项目,然后点击删除
  3. 在对话框中输入项目 ID,然后点击关闭以删除项目。

删除实例

如需删除 Compute Engine 实例,请执行以下操作:

  1. 在 Cloud Console 中,转到虚拟机实例页面。

    转到“虚拟机实例”页面

  2. 点击您要删除的实例对应的复选框。
  3. 点击删除 以删除实例。

删除默认网络的防火墙规则

如需删除防火墙规则,请执行以下操作:

  1. 在 Cloud Console 中,转到防火墙规则页面。

    转到“防火墙规则”页面

  2. 点击 要删除的防火墙规则。
  3. 点击删除 以删除防火墙规则。