
Speech transcription

Speech Transcription transcribes spoken audio in a video or video segment into text and returns blocks of text for each portion of the transcribed audio.

Supported models

The Video Intelligence API supports English (US) only. For other languages, use the Speech-to-Text API, which supports all available languages. For the list of available languages, see Language support in the Speech-to-Text documentation.

To transcribe speech from a video, call the annotate method and specify SPEECH_TRANSCRIPTION in the features field.
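
As a rough illustration of the call described above, the following minimal sketch builds the JSON body for an annotate request without sending it. The field names (inputUri, features, videoContext, speechTranscriptionConfig, languageCode) and the endpoint URL follow the REST representation of the API as commonly documented; verify them against the API reference before relying on them. The helper name build_annotate_request and the Cloud Storage path are hypothetical.

```python
import json

# Endpoint for the annotate method (assumed v1 REST URL; confirm in the API reference).
ANNOTATE_URL = "https://videointelligence.googleapis.com/v1/videos:annotate"


def build_annotate_request(input_uri, language_code="en-US"):
    """Return a request body that asks for speech transcription only.

    build_annotate_request is a hypothetical helper written for this example.
    """
    return {
        "inputUri": input_uri,
        # SPEECH_TRANSCRIPTION is the feature named in the text above.
        "features": ["SPEECH_TRANSCRIPTION"],
        "videoContext": {
            "speechTranscriptionConfig": {"languageCode": language_code}
        },
    }


# The bucket and file name are placeholders, not real resources.
request_body = build_annotate_request("gs://my-bucket/my-video.mp4")
print(json.dumps(request_body, indent=2))
```

To send the request, you would POST this body to the endpoint with an authorized client; authentication is omitted here to keep the sketch self-contained.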

You can use the following features when transcribing speech:

Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions to include in the response. This value can be an integer from 1 to 30. The default is 1. The API returns the alternatives in descending order of confidence. Alternative transcriptions do not include word-level entries.
Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.
Transcription hints: Use the speechContexts option to provide common or unusual phrases in your audio. Those phrases are then used to assist the transcription service to create more accurate transcriptions. You provide a transcription hint as a SpeechContext object.
Audio track selection: Use the audioTracks option to specify which track to transcribe from a multi-track video. You can specify up to two tracks. The default is 0. When the language code is set to en-US, the request is routed to the enhanced mode, which is trained on en-US audio; the model does not identify the spoken language itself. If you submit audio in another language, such as Spanish, transcription still runs, but the results may have low confidence scores or be empty, which is the expected behavior.
Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.
Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies which speaker the recognized word is attributed to.
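
The options listed above can be combined in a single speech transcription configuration. The sketch below assembles such a configuration as a plain dictionary and enforces the two limits stated in the list (maxAlternatives from 1 to 30, at most two audio tracks). The helper build_speech_config and its parameter names are hypothetical; the JSON keys mirror the option names used above, and the "phrases" key inside each SpeechContext is an assumption to check against the API reference.

```python
import json


def build_speech_config(max_alternatives=1, filter_profanity=False, phrases=None,
                        audio_tracks=None, automatic_punctuation=False,
                        speaker_diarization=False):
    """Build a speechTranscriptionConfig dictionary (hypothetical helper)."""
    # The text above limits maxAlternatives to an integer from 1 to 30.
    if not 1 <= max_alternatives <= 30:
        raise ValueError("maxAlternatives must be an integer from 1 to 30")
    # The text above allows up to two audio tracks.
    if audio_tracks is not None and len(audio_tracks) > 2:
        raise ValueError("audioTracks accepts at most two tracks")

    config = {
        "languageCode": "en-US",  # the only language supported for this feature
        "maxAlternatives": max_alternatives,
        "filterProfanity": filter_profanity,
        "enableAutomaticPunctuation": automatic_punctuation,
        "enableSpeakerDiarization": speaker_diarization,
    }
    if phrases:
        # Transcription hints are passed as SpeechContext objects
        # (the "phrases" key is an assumption).
        config["speechContexts"] = [{"phrases": list(phrases)}]
    if audio_tracks is not None:
        # When omitted, the service default (track 0) applies.
        config["audioTracks"] = list(audio_tracks)
    return config


config = build_speech_config(
    max_alternatives=3,
    phrases=["Video Intelligence"],
    audio_tracks=[0, 1],
    automatic_punctuation=True,
    speaker_diarization=True,
)
print(json.dumps({"speechTranscriptionConfig": config}, indent=2))
```

Validating the limits on the client side is a design choice for this example; the service performs its own validation and remains the authority on accepted values.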

For best results, provide audio recorded at a sampling rate of 16,000 Hz or higher.

Check out the Video Intelligence API visualizer to see this feature in action.

For examples of requesting speech transcription, see Speech Transcription.