Generate WebVTT and SRT captions

This page describes how to use the Speech-to-Text V2 API to automatically generate captions from audio files in SRT and WebVTT formats.

Overview

You can use the Speech-to-Text V2 API to automatically generate accurate captions in both SubRip (.srt) and WebVTT (.vtt) formats. These formats store time-aligned transcript text, making it possible to display subtitles or closed captions in sync with the media.
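The two formats differ mainly in their framing: SRT numbers each cue and uses a comma as the millisecond separator, while WebVTT starts with a WEBVTT header and uses a period. The following sketch illustrates the difference (the cue text and times are made up for illustration):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (period before milliseconds)."""
    return srt_timestamp(seconds).replace(",", ".")

# The same cue rendered in each format:
start, end = 1.5, 3.25
srt_cue = f"1\n{srt_timestamp(start)} --> {srt_timestamp(end)}\nAsk not what your country can do for you.\n"
vtt_cue = f"WEBVTT\n\n{vtt_timestamp(start)} --> {vtt_timestamp(end)}\nAsk not what your country can do for you.\n"
```

Speech-to-Text produces these files for you; the snippet only shows what the timing lines inside each file look like.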

Caption outputs are supported only in the V2 API, and only through the BatchRecognize method, which transcribes long audio files. You can save outputs to a Cloud Storage bucket, or have them returned inline. For the Cloud Storage output configuration, you can specify multiple formats at the same time; each is written to the specified bucket with a different file extension.

Enable caption outputs in a request

To generate SRT or VTT caption outputs for your audio using Google Speech-to-Text, follow these steps to enable caption outputs in your transcription request:

  1. Make a request to the Speech-to-Text V2 API BatchRecognize method with the output_format_config field populated. Supported values are:
    • srt, for output in the SubRip (.srt) format.
    • vtt, for output in the WebVTT (.vtt) format.
    • native, for output as a serialized BatchRecognizeResults message. This is the default if no format is specified.
  2. Because the operation is asynchronous, poll the request until it completes.
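The request body from step 1 can be sketched as a plain JSON payload. A minimal sketch in Python, assuming a hypothetical bucket and audio file (the camelCase field names follow the REST JSON convention used elsewhere on this page):

```python
import json

# Sketch of a BatchRecognize request body; the bucket and file
# names below are placeholders, not real resources.
request_body = {
    "files": [{"uri": "gs://my-bucket/jfk_and_the_press.wav"}],
    "config": {
        "features": {"enableWordTimeOffsets": True},
        "autoDecodingConfig": {},
        "model": "long",
        "languageCodes": ["en-US"],
    },
    "recognitionOutputConfig": {
        # Write results to Cloud Storage and request both caption
        # formats at once; each lands in the bucket with its own
        # file extension (.srt and .vtt).
        "gcsOutputConfig": {"uri": "gs://my-bucket"},
        "outputFormatConfig": {"srt": {}, "vtt": {}},
    },
}

payload = json.dumps(request_body, indent=2)
```

The serialized payload is what you would POST to the batchRecognize endpoint, as shown in the curl example below.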

Multiple formats can be specified at the same time for the Cloud Storage output configuration. Each is written to the specified bucket with a different file extension: .json for native, .srt for SRT, and .vtt for WebVTT.

If multiple formats are specified for the inline output configuration, each format is available as a field in the BatchRecognizeFileResult.inline_result message.
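When you request inline output, you can read each format from the per-file result in the response. A minimal sketch, assuming the JSON representation of the response and the srtCaptions/vttCaptions field names of the inline_result message (verify the exact names against the V2 API reference for your client version):

```python
def extract_captions(file_result: dict) -> dict:
    """Pull whichever caption formats are present from the JSON
    representation of a BatchRecognizeFileResult with inline output."""
    inline = file_result.get("inlineResult", {})
    return {
        "srt": inline.get("srtCaptions"),
        "vtt": inline.get("vttCaptions"),
    }

# Hypothetical response fragment with only SRT requested:
sample_result = {
    "inlineResult": {
        "srtCaptions": "1\n00:00:00,000 --> 00:00:01,000\nHello\n"
    }
}
captions = extract_captions(sample_result)
```

Formats that were not requested are simply absent from the message, so the corresponding entries come back as None.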

The following code snippet demonstrates how to enable caption outputs in a transcription request to Speech-to-Text for an audio file stored in Cloud Storage:

API

  curl -X POST \
    -H "Content-Type: application/json; charset=utf-8" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v2/projects/my-project/locations/global/recognizers/_:batchRecognize \
    --data '{
      "files": [{
        "uri": "gs://my-bucket/jfk_and_the_press.wav"
      }],
      "config": {
        "features": { "enableWordTimeOffsets": true },
        "autoDecodingConfig": {},
        "model": "long",
        "languageCodes": ["en-US"]
      },
      "recognitionOutputConfig": {
        "gcsOutputConfig": { "uri": "gs://my-bucket" },
        "output_format_config": { "srt": {} }
      }
    }'
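Once the operation completes, the caption file appears in the bucket with the matching extension. Because both formats carry the same cues, you can also derive one from the other after the fact. A minimal sketch converting downloaded SRT text to WebVTT (illustrative only; requesting "vtt" directly in the output config is simpler):

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to WebVTT: add the WEBVTT header,
    drop numeric cue-index lines, and switch the millisecond
    separator from comma to period in timestamp lines."""
    lines = []
    for line in srt_text.splitlines():
        if re.fullmatch(r"\d+", line.strip()):
            continue  # WebVTT does not require cue index lines
        # 00:00:01,500 --> 00:00:03,250  becomes  00:00:01.500 --> 00:00:03.250
        line = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines).strip() + "\n"
```

This keeps the cue text and timing untouched; only the container-level formatting changes.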

What's next

  • Learn how to [transcribe long audio files][batch-recognize].
  • Learn how to choose the best transcription model.
  • Transcribe audio files using [Chirp][chirp].
  • For best performance, accuracy, and other tips, see the [best practices][best-practices] documentation.