Speech Transcription transcribes spoken audio in a video or video segment into text and returns blocks of text for each portion of the transcribed audio.
The Video Intelligence API supports speech transcription in English (US) only. For other languages, use the Speech-to-Text API, which supports all available languages. For the list of available languages, see Language support in the Speech-to-Text documentation.
You can use the following features when transcribing speech:
Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions of the recognized speech to include in the response. This value can be an integer from 1 to 30. The default is 1. The API returns multiple transcriptions in descending order of confidence. Alternative transcriptions do not include word-level entries.
Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.
Transcription hints: Use the speechContexts option to provide common or unusual phrases that occur in your audio. Those phrases help the transcription service produce more accurate transcriptions. You provide transcription hints as a SpeechContext object.
Audio track selection: Use the audioTracks option to specify which track to transcribe from multi-track video. You can specify up to two tracks. The default is 0. Note that when the language code is set to en-US, the request is routed to an enhanced model trained on en-US audio. If you submit audio in another language, such as Spanish, transcription still runs, but the results may have low confidence scores or be empty.
Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.
Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies which speaker the word is attributed to.
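The options above can be combined in a single annotation request. The following sketch builds the JSON body of a videos:annotate request with speech transcription enabled; the field names follow the API's SpeechTranscriptionConfig message, but the input URI and hint phrases are placeholders, not values from this document.

```python
import json

# Hypothetical request body for the videos:annotate REST method.
# The inputUri and speechContexts phrases are placeholders.
request = {
    "inputUri": "gs://my-bucket/my-video.mp4",
    "features": ["SPEECH_TRANSCRIPTION"],
    "videoContext": {
        "speechTranscriptionConfig": {
            "languageCode": "en-US",            # only supported language
            "maxAlternatives": 3,               # integer from 1 to 30; default 1
            "filterProfanity": True,            # mask known profanities
            "speechContexts": [
                {"phrases": ["Video Intelligence"]}  # transcription hints
            ],
            "audioTracks": [0],                 # up to two tracks; default 0
            "enableAutomaticPunctuation": True, # default false
            "enableSpeakerDiarization": True,   # tag words per speaker
        }
    },
}

print(json.dumps(request, indent=2))
```

You would send this body in a POST request to the videos:annotate endpoint (or pass the equivalent config through a client library); the operation runs asynchronously and the transcription is returned when it completes.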
For best results, provide audio recorded at a sampling rate of 16,000 Hz or greater.
Check out the Video Intelligence API visualizer to see this feature in action.
For examples of requesting speech transcription, see Speech Transcription.
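To illustrate the response shape described above (alternatives ordered by confidence, with per-word speakerTag fields), here is a minimal sketch that walks a hand-written sample response; the transcript, confidence value, and speaker tags are invented for illustration, not real API output.

```python
# Hand-written sample in the shape of a speech-transcription response.
# All values here are invented for illustration.
response = {
    "speechTranscriptions": [
        {
            "alternatives": [
                {
                    "transcript": "hello and welcome",
                    "confidence": 0.92,
                    "words": [
                        {"word": "hello", "speakerTag": 1},
                        {"word": "and", "speakerTag": 1},
                        {"word": "welcome", "speakerTag": 2},
                    ],
                }
            ]
        }
    ]
}

for transcription in response["speechTranscriptions"]:
    # Alternatives arrive in descending order of confidence,
    # so the first one is the most likely transcription.
    best = transcription["alternatives"][0]
    print(f"{best['confidence']:.2f}: {best['transcript']}")
    for word in best["words"]:
        print(f"  speaker {word['speakerTag']}: {word['word']}")
```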