
This page describes how to get time offset values for audio transcribed by Speech-to-Text.

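Speech-to-Text can include time offset (timestamp) values in the response text of a recognition request. Time offset values show the start and end of each spoken word recognized in the supplied audio. A time offset value represents the amount of time that has elapsed since the beginning of the audio, in increments of 100 ms.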
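As a minimal sketch of working with these values, the following Python snippet converts offset strings of the form used in the JSON responses on this page (for example "1.400s" or "3s") into seconds and into a count of 100 ms increments. The helper names offset_to_seconds and offset_to_ticks are hypothetical and are not part of any Speech-to-Text library.

```python
from decimal import Decimal


def offset_to_seconds(offset: str) -> Decimal:
    """Convert an offset string such as "1.400s" or "3s" to seconds.

    Hypothetical helper: assumes the value is a decimal number of seconds
    followed by the letter "s", as in the sample responses on this page.
    """
    if not offset.endswith("s"):
        raise ValueError(f"unexpected offset format: {offset!r}")
    return Decimal(offset[:-1])


def offset_to_ticks(offset: str) -> int:
    """Express an offset as a whole number of 100 ms increments."""
    return int(offset_to_seconds(offset) * 10)


start = offset_to_seconds("1.400s")
end = offset_to_seconds("1.800s")
print(f"word lasts {end - start} seconds")
print(f"start is {offset_to_ticks('1.400s')} increments of 100 ms into the audio")
```

Decimal is used instead of float so that arithmetic on the offsets does not pick up binary rounding noise.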

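Time offsets are especially useful when analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Speech-to-Text supports time offsets for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming.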
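The search-and-seek use case can be sketched as follows, assuming the words list has already been extracted from a recognition response in the JSON shape shown later on this page. The function name find_word is hypothetical.

```python
def find_word(words, target):
    """Return the startTime of every occurrence of target (case-insensitive).

    `words` is assumed to be a list of dicts with "word", "startTime", and
    "endTime" keys, matching the JSON responses shown on this page.
    """
    target = target.casefold()
    return [w["startTime"] for w in words if w["word"].casefold() == target]


# A few entries copied from the sample response on this page.
words = [
    {"startTime": "2.700s", "endTime": "3s", "word": "here"},
    {"startTime": "3s", "endTime": "3.300s", "word": "why"},
    {"startTime": "3.500s", "endTime": "3.500s", "word": "here"},
]

matches = find_word(words, "here")
print(matches)
```

Each returned offset can then be passed to an audio player as a seek position.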

Speech-to-Text includes time offset values only for the first alternative provided in the recognition response.

To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your request configuration.
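For illustration, a request body with this parameter set might be assembled as in the following Python sketch. The function name build_request_body and the placeholder URI are assumptions; only the config, languageCode, enableWordTimeOffsets, audio, and uri field names come from the REST request shown on this page.

```python
import json


def build_request_body(gcs_uri, language_code="en-US"):
    """Build a recognition request body that asks for word time offsets.

    Hypothetical helper: it only assembles a dict and sends nothing.
    """
    return {
        "config": {
            "languageCode": language_code,
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Placeholder URI; replace it with the location of your own audio file.
body = build_request_body("gs://BUCKET_NAME/FILE_NAME")
payload = json.dumps(body, indent=2)
print(payload)
```

The resulting JSON string can be supplied as the body of the POST request.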


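To transcribe audio with the speech:longrunningrecognize method, make a POST request and provide the appropriate request body. The following example shows a POST request made with curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.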

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US',
    'enableWordTimeOffsets': true
  },
  'audio': {
    'uri': 'gs://{BUCKET_NAME}/{FILE_NAME}'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"

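If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. If the operation is incomplete (still processing), the response looks similar to the following: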

{
  "name": "2885768779530032514",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 97,
    "startTime": "2020-12-14T03:11:54.492593Z",
    "lastUpdateTime": "2020-12-14T03:15:57.484509Z",
    "uri": "gs://{BUCKET_NAME}/{FILE_NAME}"
  }
}

A newly created operation may return only its name:

{
  "name": "7612202767953098924"
}

Here, name is the name of the long-running operation created for the request.

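Processing the vr.flac file takes about 30 seconds. To retrieve the result of the operation, make a GET request to the https://speech.googleapis.com/v1/operations/ endpoint. Replace your-operation-name with the name received from your longrunningrecognize request.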

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/your-operation-name"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "okay so what am I doing here...(etc)...",
            "confidence": 0.96596134,
            "words": [
              {
                "startTime": "1.400s",
                "endTime": "1.800s",
                "word": "okay"
              },
              {
                "startTime": "1.800s",
                "endTime": "2.300s",
                "word": "so"
              },
              {
                "startTime": "2.300s",
                "endTime": "2.400s",
                "word": "what"
              },
              {
                "startTime": "2.400s",
                "endTime": "2.600s",
                "word": "am"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.600s",
                "word": "I"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.700s",
                "word": "doing"
              },
              {
                "startTime": "2.700s",
                "endTime": "3s",
                "word": "here"
              },
              {
                "startTime": "3s",
                "endTime": "3.300s",
                "word": "why"
              },
              {
                "startTime": "3.300s",
                "endTime": "3.400s",
                "word": "am"
              },
              {
                "startTime": "3.400s",
                "endTime": "3.500s",
                "word": "I"
              },
              {
                "startTime": "3.500s",
                "endTime": "3.500s",
                "word": "here"
              }
            ]
          }
        ]
      },
      {
        "alternatives": [
          {
            "transcript": "so so what am I doing here...(etc)...",
            "confidence": 0.9642093
          }
        ]
      }
    ]
  }
}

If the operation has not completed, you can poll this endpoint by repeatedly making the GET request until the done property of the response is true.
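The polling loop can be sketched in Python as follows. To keep the example self-contained, the HTTP call is abstracted behind a fetch callable that is assumed to return the decoded JSON of the operation; wait_for_operation, its parameters, and the fake responses are all hypothetical.

```python
import time


def wait_for_operation(fetch, interval_seconds=5.0, max_attempts=60):
    """Call fetch() repeatedly until the operation reports done == true.

    `fetch` is any zero-argument callable that returns the operation as a
    dict, for example a function that performs the GET request and decodes
    the JSON body. An unfinished operation may omit "done" entirely, so
    .get() is used to treat a missing key as not done.
    """
    for _ in range(max_attempts):
        operation = fetch()
        if operation.get("done"):
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError(f"operation not done after {max_attempts} attempts")


# Simulated responses standing in for real GET requests.
_responses = iter([
    {"name": "op", "metadata": {"progressPercent": 97}},
    {"name": "op", "done": True, "response": {"results": []}},
])

finished = wait_for_operation(lambda: next(_responses), interval_seconds=0)
print(finished["done"])
```

A bounded number of attempts keeps the loop from running forever if the operation never completes.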


To perform asynchronous speech recognition, use the Google Cloud CLI and provide the path of a local file or a Google Cloud Storage URL. Include the --include-word-time-offsets flag.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US' --include-word-time-offsets --async

If the request is successful, the server returns the ID of the long-running operation in JSON format.

{
  "name": OPERATION_ID
}


To get information about the operation, run the following command:

gcloud ml speech operations describe OPERATION_ID


To poll the operation until it completes, run the following command:

gcloud ml speech operations wait OPERATION_ID

After the operation completes, it returns a transcript of the audio in JSON format.

{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge",
          "words": [
            {
              "endTime": "0.300s",
              "startTime": "0s",
              "word": "how"
            },
            {
              "endTime": "0.600s",
              "startTime": "0.300s",
              "word": "old"
            },
            {
              "endTime": "0.800s",
              "startTime": "0.600s",
              "word": "is"
            },
            {
              "endTime": "0.900s",
              "startTime": "0.800s",
              "word": "the"
            },
            {
              "endTime": "1.100s",
              "startTime": "0.900s",
              "word": "Brooklyn"
            },
            {
              "endTime": "1.500s",
              "startTime": "1.100s",
              "word": "Bridge"
            }
          ]
        }
      ]
    }
  ]
}
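As a rough sketch of consuming this output, the following Python snippet reads a JSON transcript of the shape shown above and lists each word with its offsets. It reads only the first alternative of each result, because that is the only alternative that carries time offsets. The function name first_alternative_words and the shortened sample are assumptions for illustration.

```python
import json


def first_alternative_words(response_json):
    """Return (word, startTime, endTime) tuples from each result's first alternative."""
    response = json.loads(response_json)
    rows = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Only the first alternative is expected to contain "words".
        for w in alternatives[0].get("words", []):
            rows.append((w["word"], w["startTime"], w["endTime"]))
    return rows


# Shortened copy of the transcript shown above.
sample = """
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge",
          "words": [
            {"endTime": "0.300s", "startTime": "0s", "word": "how"},
            {"endTime": "0.600s", "startTime": "0.300s", "word": "old"}
          ]
        }
      ]
    }
  ]
}
"""

rows = first_alternative_words(sample)
for word, start, end in rows:
    print(f"{word}: {start} - {end}")
```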


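To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.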

import (
	"context"
	"fmt"
	"io"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

func asyncWords(client *speech.Client, out io.Writer, gcsURI string) error {
	ctx := context.Background()

	// Send the URI of the audio file, along with the encoding and
	// sample rate information, to be transcribed.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:              speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz:       16000,
			LanguageCode:          "en-US",
			EnableWordTimeOffsets: true,
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	// Block until the long-running operation finishes.
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(out, "\"%v\" (confidence=%.3f)\n", alt.Transcript, alt.Confidence)
			for _, w := range alt.Words {
				// Convert the protobuf durations to seconds for printing.
				fmt.Fprintf(out,
					"Word: \"%v\" (startTime=%.3f, endTime=%.3f)\n",
					w.Word,
					w.StartTime.AsDuration().Seconds(),
					w.EndTime.AsDuration().Seconds(),
				)
			}
		}
	}
	return nil
}


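To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.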

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.speech.v1.*;
import java.util.List;

/**
 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription as
 * well as word time offsets.
 *
 * @param gcsUri the path to the remote FLAC audio file to transcribe.
 */
public static void asyncRecognizeWords(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure remote file request for FLAC.
    // The encoding and language code below are assumed values; adjust them to match your audio.
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(RecognitionConfig.AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setEnableWordTimeOffsets(true)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      // Pause between checks instead of spinning in a tight loop.
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
      for (WordInfo wordInfo : alternative.getWordsList()) {
        System.out.println(wordInfo.getWord());
        // Nanos are reduced to tenths of a second, matching the 100 ms offset increments.
        System.out.printf(
            "\t%s.%s sec - %s.%s sec\n",
            wordInfo.getStartTime().getSeconds(),
            wordInfo.getStartTime().getNanos() / 100000000,
            wordInfo.getEndTime().getSeconds(),
            wordInfo.getEndTime().getNanos() / 100000000);
      }
    }
  }
}


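To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.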

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  enableWordTimeOffsets: true,
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// `await` is not allowed at the top level of a CommonJS module, so the
// calls are wrapped in an async function.
async function main() {
  // Detects speech in the audio file. This creates a recognition job that you
  // can wait for now, or get its result later.
  const [operation] = await client.longRunningRecognize(request);

  // Get a Promise representation of the final result of the job
  const [response] = await operation.promise();
  response.results.forEach(result => {
    console.log(`Transcription: ${result.alternatives[0].transcript}`);
    result.alternatives[0].words.forEach(wordInfo => {
      // NOTE: If you have a time offset exceeding 2^32 seconds, use the
      // wordInfo.{x}Time.seconds.high to calculate seconds.
      const startSecs =
        `${wordInfo.startTime.seconds}` +
        '.' +
        wordInfo.startTime.nanos / 100000000;
      const endSecs =
        `${wordInfo.endTime.seconds}` +
        '.' +
        wordInfo.endTime.nanos / 100000000;
      console.log(`Word: ${wordInfo.word}`);
      console.log(`\t ${startSecs} secs - ${endSecs} secs`);
    });
  });
}

main().catch(console.error);


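To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.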

from google.cloud import speech


def transcribe_gcs_with_word_time_offsets(
    audio_uri: str,
) -> speech.LongRunningRecognizeResponse:
    """Transcribe the given audio file asynchronously and output the word time
    offsets.

    Args:
        audio_uri (str): The Google Cloud Storage URI of the input audio file.
            E.g., gs://[BUCKET]/[FILE]

    Returns:
        speech.LongRunningRecognizeResponse: The response containing the
            transcription results with word time offsets.
    """
    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=audio_uri)
    # The encoding and language code are assumed values; adjust them to match
    # your audio. enable_word_time_offsets requests per-word timestamps.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="en-US",
        enable_word_time_offsets=True,
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    # Use a separate loop variable so the full response is what gets returned.
    for result in response.results:
        alternative = result.alternatives[0]
        print(f"Transcript: {alternative.transcript}")
        print(f"Confidence: {alternative.confidence}")

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time

            print(
                f"Word: {word}, start_time: {start_time.total_seconds()}, end_time: {end_time.total_seconds()}"
            )

    return response


