Getting Word Timestamps

This page describes how to get time offset values for audio transcribed by Cloud Speech-to-Text.

Speech-to-Text can include Time offset (timestamp) values in the response text for your recognize request. Time offset values show the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Speech-to-Text supports time offsets for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and StreamingRecognizeRequest.

Time offset values are only included for the first alternative provided in the recognition response.

To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your request configuration.

Protocol

Refer to the speech:longrunningrecognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud Platform Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {
    'language_code': 'en-US',
    'enableWordTimeOffsets': true
  },
  'audio':{
    'uri':'gs://gcs-test-data/vr.flac'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"

See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924"
}

where name is the name of the long running operation created for the request.

Processing the vr.flac file takes about 30 seconds to complete. To retrieve the result of the operation, make a GET request to the https://speech.googleapis.com/v1/operations/ endpoint. Replace your-operation-name with the name received from your longrunningrecognize request.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8"
     "https://speech.googleapis.com/v1/operations/your-operation-name"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "okay so what am I doing here...(etc)...",
            "confidence": 0.96596134,
            "words": [
              {
                "startTime": "1.400s",
                "endTime": "1.800s",
                "word": "okay"
              },
              {
                "startTime": "1.800s",
                "endTime": "2.300s",
                "word": "so"
              },
              {
                "startTime": "2.300s",
                "endTime": "2.400s",
                "word": "what"
              },
              {
                "startTime": "2.400s",
                "endTime": "2.600s",
                "word": "am"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.600s",
                "word": "I"
              },
              {
                "startTime": "2.600s",
                "endTime": "2.700s",
                "word": "doing"
              },
              {
                "startTime": "2.700s",
                "endTime": "3s",
                "word": "here"
              },
              {
                "startTime": "3s",
                "endTime": "3.300s",
                "word": "why"
              },
              {
                "startTime": "3.300s",
                "endTime": "3.400s",
                "word": "am"
              },
              {
                "startTime": "3.400s",
                "endTime": "3.500s",
                "word": "I"
              },
              {
                "startTime": "3.500s",
                "endTime": "3.500s",
                "word": "here"
              },
              ...
            ]
          }
        ]
      },
      {
        "alternatives": [
          {
            "transcript": "so so what am I doing here...(etc)...",
            "confidence": 0.9642093,
          }
        ]
      }
    ]
  }
}

If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.

GCLOUD COMMAND

Refer to the recognize-long-running command for complete details.

To perform asynchronous speech recognition, use the gcloud command line tool, providing the path of a local file or a Google Cloud Storage URL. Include the --include-word-time-offsets flag.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US' --include-word-time-offsets --async

If the request is successful, the server returns the ID of the long-running operation in JSON format.

{
  "name": OPERATION_ID
}

You can then get information about the operation by running the following command.

gcloud ml speech operations describe OPERATION_ID

You can also poll the operation until it completes by running the following command.

gcloud ml speech operations wait OPERATION_ID

Once the operation completes, the operation returns the audio in JSON-format.

{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge",
          "words": [
            {
              "endTime": "0.300s",
              "startTime": "0s",
              "word": "how"
            },
            {
              "endTime": "0.600s",
              "startTime": "0.300s",
              "word": "old"
            },
            {
              "endTime": "0.800s",
              "startTime": "0.600s",
              "word": "is"
            },
            {
              "endTime": "0.900s",
              "startTime": "0.800s",
              "word": "the"
            },
            {
              "endTime": "1.100s",
              "startTime": "0.900s",
              "word": "Brooklyn"
            },
            {
              "endTime": "1.500s",
              "startTime": "1.100s",
              "word": "Bridge"
            }
          ]
        }
      ]
    }
  ]
}

C#

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

static object AsyncRecognizeGcsWords(string storageUri)
{
    var speech = SpeechClient.Create();
    var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 16000,
        LanguageCode = "en",
        EnableWordTimeOffsets = true,
    }, RecognitionAudio.FromStorageUri(storageUri));
    longOperation = longOperation.PollUntilCompleted();
    var response = longOperation.Result;
    foreach (var result in response.Results)
    {
        foreach (var alternative in result.Alternatives)
        {
            Console.WriteLine($"Transcript: { alternative.Transcript}");
            Console.WriteLine("Word details:");
            Console.WriteLine($" Word count:{alternative.Words.Count}");
            foreach (var item in alternative.Words)
            {
                Console.WriteLine($"  {item.Word}");
                Console.WriteLine($"    WordStartTime: {item.StartTime}");
                Console.WriteLine($"    WordEndTime: {item.EndTime}");
            }
        }
    }
    return 0;
}

Go

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

func asyncWords(client *speech.Client, out io.Writer, gcsURI string) error {
	ctx := context.Background()

	// Send the contents of the audio file with the encoding and
	// and sample rate information to be transcripted.
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:              speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz:       16000,
			LanguageCode:          "en-US",
			EnableWordTimeOffsets: true,
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(out, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
			for _, w := range alt.Words {
				fmt.Fprintf(out,
					"Word: \"%v\" (startTime=%3f, endTime=%3f)\n",
					w.Word,
					float64(w.StartTime.Seconds)+float64(w.StartTime.Nanos)*1e-9,
					float64(w.EndTime.Seconds)+float64(w.EndTime.Nanos)*1e-9,
				)
			}
		}
	}
	return nil
}

Java

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

/**
 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription as
 * well as word time offsets.
 *
 * @param gcsUri the path to the remote LINEAR16 audio file to transcribe.
 */
public static void asyncRecognizeWords(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure remote file request for Linear16
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .setEnableWordTimeOffsets(true)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
      for (WordInfo wordInfo : alternative.getWordsList()) {
        System.out.println(wordInfo.getWord());
        System.out.printf(
            "\t%s.%s sec - %s.%s sec\n",
            wordInfo.getStartTime().getSeconds(),
            wordInfo.getStartTime().getNanos() / 100000000,
            wordInfo.getEndTime().getSeconds(),
            wordInfo.getEndTime().getNanos() / 100000000);
      }
    }
  }
}

Node.js

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Eencoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  enableWordTimeOffsets: true,
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file. This creates a recognition job that you
// can wait for now, or get its result later.
client
  .longRunningRecognize(request)
  .then(data => {
    const operation = data[0];
    // Get a Promise representation of the final result of the job
    return operation.promise();
  })
  .then(data => {
    const response = data[0];
    response.results.forEach(result => {
      console.log(`Transcription: ${result.alternatives[0].transcript}`);
      result.alternatives[0].words.forEach(wordInfo => {
        // NOTE: If you have a time offset exceeding 2^32 seconds, use the
        // wordInfo.{x}Time.seconds.high to calculate seconds.
        const startSecs =
          `${wordInfo.startTime.seconds}` +
          `.` +
          wordInfo.startTime.nanos / 100000000;
        const endSecs =
          `${wordInfo.endTime.seconds}` +
          `.` +
          wordInfo.endTime.nanos / 100000000;
        console.log(`Word: ${wordInfo.word}`);
        console.log(`\t ${startSecs} secs - ${endSecs} secs`);
      });
    });
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

PHP

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

use Google\Cloud\Speech\SpeechClient;
use Google\Cloud\Core\ExponentialBackoff;

/**
 * Transcribe an audio file using Google Cloud Speech API
 * Example:
 * ```
 * transcribe_async_words('/path/to/audiofile.wav');
 * ```.
 *
 * @param string $audioFile path to an audio file.
 * @param string $languageCode The language of the content to
 *     be recognized. Accepts BCP-47 (e.g., `"en-US"`, `"es-ES"`).
 * @param array $options configuration options.
 *
 * @return string the text transcription
 */
function transcribe_async_words($audioFile, $languageCode = 'en-US', $options = [])
{
    // Create the speech client
    $speech = new SpeechClient([
        'languageCode' => $languageCode,
    ]);

    // When true, time offsets for every word will be included in the response.
    $options['enableWordTimeOffsets'] = true;

    // Create the asyncronous recognize operation
    $operation = $speech->beginRecognizeOperation(
        fopen($audioFile, 'r'),
        $options
    );

    // Wait for the operation to complete
    $backoff = new ExponentialBackoff(10);
    $backoff->execute(function () use ($operation) {
        print('Waiting for operation to complete' . PHP_EOL);
        $operation->reload();
        if (!$operation->isComplete()) {
            throw new Exception('Job has not yet completed', 500);
        }
    });

    // Print the results
    if ($operation->isComplete()) {
        $results = $operation->results();
        foreach ($results as $result) {
            $alternative = $result->alternatives()[0];
            printf('Transcript: %s' . PHP_EOL, $alternative['transcript']);
            printf('Confidence: %s' . PHP_EOL, $alternative['confidence']);
            foreach ($alternative['words'] as $wordInfo) {
                printf('  Word: %s (start: %s, end: %s)' . PHP_EOL,
                    $wordInfo['word'],
                    $wordInfo['startTime'],
                    $wordInfo['endTime']);
            }
        }
    }
}

Python

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

def transcribe_gcs_with_word_time_offsets(gcs_uri):
    """Transcribe the given audio file asynchronously and output the word time
    offsets."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        enable_word_time_offsets=True)

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    result = operation.result(timeout=90)

    for result in result.results:
        alternative = result.alternatives[0]
        print(u'Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))

Ruby

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

# storage_path = "Path to file in Cloud Storage, eg. gs://bucket/audio.raw"

require "google/cloud/speech"

speech = Google::Cloud::Speech.new

config = { encoding:                 :LINEAR16,
           sample_rate_hertz:        16000,
           language_code:            "en-US",
           enable_word_time_offsets: true }
audio  = { uri: storage_path }

operation = speech.long_running_recognize config, audio

puts "Operation started"

operation.wait_until_done!

raise operation.results.message if operation.error?

results = operation.response.results

alternatives = results.first.alternatives
alternatives.each do |alternative|
  puts "Transcription: #{alternative.transcript}"

  alternative.words.each do |word|
    start_time = word.start_time.seconds + word.start_time.nanos/1000000000.0
    end_time   = word.end_time.seconds + word.end_time.nanos/1000000000.0

    puts "Word: #{word.word} #{start_time} #{end_time}"
  end
end

Was this page helpful? Let us know how we did:

Send feedback about...

Cloud Speech API Documentation