
Transcribe audio from a video file using Speech-to-Text

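This tutorial shows how to use Speech-to-Text to transcribe the audio track of a video file.

Audio files can come from many different sources. Audio data might come from a phone (such as voicemail) or from the audio track included in a video file.

Speech-to-Text can use one of several machine learning models to transcribe your audio file, so that the model best matches the original source of the audio. You can get better transcription results by specifying the source of the original audio. This lets Speech-to-Text process your audio file with a machine learning model trained on similar data.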
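As a rough illustration of that choice, the following sketch maps an audio source to a model name and builds the `config` part of a request body. The helper `build_config` and the source labels (`"video"`, `"phone"`) are hypothetical names introduced here; the model names `video`, `phone_call`, and `default` are the ones mentioned in this tutorial's samples.

```python
# Hypothetical mapping from where the audio came from to a model name.
# The model names are the ones listed in this tutorial's samples.
MODEL_BY_SOURCE = {
    "video": "video",
    "phone": "phone_call",
}


def build_config(source, language_code="en-US", sample_rate_hertz=16000):
    """Return a dict shaped like the JSON `config` object of a recognize request."""
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        # Fall back to the general-purpose model for unknown sources.
        "model": MODEL_BY_SOURCE.get(source, "default"),
    }


print(build_config("video")["model"])
```

Keeping the mapping in one place makes it easy to see which model a given kind of recording will use.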

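Objectives

Send an audio transcription request for a video file to Speech-to-Text.

Costs

In this document, you use the following billable components of Google Cloud:

Speech-to-Text

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.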

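Before you begin

This tutorial has several prerequisites:

You have set up a Speech-to-Text project in the Google Cloud console.
You have set up your environment using Application Default Credentials in the Google Cloud console.
You have set up the development environment for your chosen programming language.
You have installed the Google Cloud client library for your chosen programming language.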

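Prepare the audio data

Before you can transcribe audio from a video, you must extract the audio data from the video file. After you extract the audio data, you must either store it in a Cloud Storage bucket or convert it to base64 encoding.

Extract the audio data

You can use any file conversion tool that handles audio and video files, such as FFmpeg.

Use the following command to convert a video file to an audio file with ffmpeg.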

ffmpeg -i video-input-file audio-output-file
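The samples later in this tutorial declare the audio as LINEAR16 at 16000 Hz, so the extracted file should match those settings. The sketch below builds an ffmpeg command that drops the video stream and writes 16-bit PCM at 16 kHz. The function names and file names are hypothetical, and writing a single (mono) channel is an assumption made here to keep the output simple.

```python
import subprocess


def ffmpeg_extract_command(video_path, audio_path, sample_rate_hertz=16000):
    """Build an ffmpeg command that writes a mono, 16-bit PCM audio file."""
    return [
        "ffmpeg",
        "-i", video_path,               # input video file
        "-vn",                          # drop the video stream
        "-acodec", "pcm_s16le",         # 16-bit little-endian PCM (LINEAR16)
        "-ar", str(sample_rate_hertz),  # output sample rate in Hz
        "-ac", "1",                     # one audio channel (assumption: mono)
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_extract_command(video_path, audio_path), check=True)
```

For example, `extract_audio("talk.mp4", "talk.wav")` would write `talk.wav` next to the video, provided ffmpeg is installed and on the PATH.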

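Store or convert the audio data

You can transcribe an audio file stored on your local machine or in a Cloud Storage bucket.

Use the following command to upload your audio file to an existing Cloud Storage bucket with the gsutil tool.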

gsutil cp audio-output-file storage-bucket-uri
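Once the file is uploaded, a request refers to it by its `gs://` URI in the `audio.uri` field. The sketch below shows one way to assemble that URI and the matching `audio` object; the helper names and the bucket and object names are hypothetical.

```python
def gcs_uri(bucket_name, object_name):
    """Return the gs:// URI for an object in a Cloud Storage bucket."""
    return f"gs://{bucket_name}/{object_name.lstrip('/')}"


def remote_audio(bucket_name, object_name):
    """Return a dict shaped like the JSON `audio` object for a remote file."""
    return {"uri": gcs_uri(bucket_name, object_name)}


print(remote_audio("my-bucket", "audio/talk.wav"))
```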

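If you use a local file and plan to send the request with the curl tool from the command line, you must first convert the audio file to base64-encoded data.

Use the following command to convert the audio file to a base64-encoded text file.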

base64 audio-output-file -w 0 > audio-data-text
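The base64 text goes into the `audio.content` field of the JSON request body, in place of a `uri`. The sketch below performs the same encoding in Python and assembles a request body. The function name `build_request_body` is hypothetical, and the config values mirror the ones used in this tutorial's samples.

```python
import base64
import json


def build_request_body(audio_bytes, model="video"):
    """Return a recognize request body that embeds the audio as base64 text."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": model,
        },
        "audio": {
            # JSON cannot carry raw bytes, so the audio is base64-encoded.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


# Tiny stand-in payload; real audio would be read from a file in "rb" mode.
print(json.dumps(build_request_body(b"\x00\x01"), indent=2))
```

Saving the JSON to a file lets you pass it to curl with `--data @filename` instead of pasting a very long string on the command line.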

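Send a transcription request

Use the following code to send a transcription request to Speech-to-Text.

Local file request

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following example shows a POST request made with curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.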

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

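For more information on configuring the request body, see the RecognitionConfig reference documentation.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format: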

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}
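A response like this can be reduced to plain text by taking the first alternative of each result, which is what the client library samples in this tutorial do. The sketch below does this for a parsed JSON response; the function name `top_transcripts` and the shortened sample data are illustrative only.

```python
def top_transcripts(response):
    """Return the first alternative's transcript for each result."""
    return [
        result["alternatives"][0]["transcript"]
        for result in response.get("results", [])
        if result.get("alternatives")  # skip results with no alternatives
    ]


# Shortened stand-in for the parsed JSON response.
sample = {
    "results": [
        {"alternatives": [{"transcript": "OK Google", "confidence": 0.92}]}
    ]
}
print(top_transcripts(sample))
```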

Go

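To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.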


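import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// modelSelection transcribes the local audio file at path synchronously
// with the "video" model and writes the transcripts to w.
func modelSelection(w io.Writer, path string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	// path = "../testdata/Google_Gnome.wav"
	data, err := os.ReadFile(path)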
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("Recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

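To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.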

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to an audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

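To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.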

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

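// Detects speech in the audio file. `await` is only valid inside an async
// function in a CommonJS module, so the call is wrapped in one.
async function transcribe() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribe().catch(console.error);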

Python

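To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.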

from google.cloud import speech


def transcribe_model_selection(
    speech_file: str,
    model: str,
) -> speech.RecognizeResponse:
    """Transcribe the given audio file synchronously with
    the selected model."""
    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model=model,
    )

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print(f"First alternative of result {i}")
        print(f"Transcript: {alternative.transcript}")

    return response

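Additional languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for Ruby.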

Remote file request

Go

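To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.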


import (
	"context"
	"fmt"
	"io"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribe_model_selection_gcs Transcribes the given audio file asynchronously with
// the selected model.
func transcribe_model_selection_gcs(w io.Writer, gcsUri string, model string) error {
	// Google Cloud Storage URI pointing to the audio content.
	// gcsUri := "gs://bucket-name/path_to_audio_file"

	// The speech recognition model to use
	// See, https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model
	// model := "default"
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsUri},
	}

	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz: 16000,
		LanguageCode:    "en-US",
		Model:           model,
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize: %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}
	for i, result := range response.Results {
		alternative := result.Alternatives[0]
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "First alternative of result %d\n", i)
		fmt.Fprintf(w, "Transcript: %s\n", alternative.Transcript)
	}
	return nil
}

Java

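To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.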

/**
 * Performs transcription of the remote audio file asynchronously with the selected model.
 *
 * @param gcsUri the path to the remote audio file to transcribe.
 */
public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure request with video media type
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    // Just print the first result here.
    SpeechRecognitionResult result = results.get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

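To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.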

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

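// Detects speech in the audio file. `await` is only valid inside an async
// function in a CommonJS module, so the call is wrapped in one.
async function transcribeGcs() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribeGcs().catch(console.error);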

Python

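To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.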

# Imported at module level so the return annotation below can be resolved.
from google.cloud import speech


def transcribe_model_selection_gcs(
    gcs_uri: str,
    model: str,
) -> speech.LongRunningRecognizeResponse:
    """Transcribe the given audio file asynchronously with
    the selected model."""

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model=model,
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print(f"First alternative of result {i}")
        print(f"Transcript: {alternative.transcript}")

    return response

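Additional languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for Ruby.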

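Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

The easiest way to avoid charges is to delete the project that you created for this tutorial.

To delete the project:

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.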

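Delete instances

To delete a Compute Engine instance:

In the Google Cloud console, go to the VM instances page.
Go to VM instances
Select the checkbox for the instance that you want to delete.
To delete the instance, click More actions, click Delete, and then follow the instructions.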

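Delete firewall rules for the default network

To delete a firewall rule:

In the Google Cloud console, go to the Firewall page.
Go to Firewall
Select the checkbox for the firewall rule that you want to delete.
To delete the firewall rule, click Delete.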

What's next

Learn how to get timestamps for audio.
Identify different speakers in an audio file.

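Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Speech-to-Text performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Try Speech-to-Text free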