Speech Transcription

The Video Intelligence API can transcribe speech to text from supported video files.

Video Intelligence speech transcription supports the following features:

  • Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions of the recognized speech to include in the response. This value can be an integer from 1 to 30. The default is 1. The API returns multiple transcriptions in descending order based on the confidence value for the transcription. Alternative transcriptions do not include word-level entries.

  • Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.

  • Transcription hints: Use the speechContexts option to provide common or unusual phrases in your audio. Those phrases are then used to assist the transcription service to create more accurate transcriptions. You provide a transcription hint as a SpeechContext object.

  • Audio track selection: Use the audioTracks option to specify which track to transcribe from multi-track audio. This value can be an integer from 0 to 2. Default is 0.

  • Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.

  • Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies which speaker the recognized word is attributed to.
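Taken together, these options map onto fields of the request's speechTranscriptionConfig. As an illustrative sketch (the bucket URI and hint phrases below are placeholders; field names follow the v1 REST reference), a request body exercising the options above might be assembled like this:

```python
# Sketch: an annotate request body exercising the options listed above.
# The inputUri and speechContexts phrases are placeholders, not required values.
request_body = {
    "inputUri": "gs://bucket-name-123/sample-video-short.mp4",
    "features": ["SPEECH_TRANSCRIPTION"],
    "videoContext": {
        "speechTranscriptionConfig": {
            "languageCode": "en-US",
            "maxAlternatives": 3,            # 1 to 30; default 1
            "filterProfanity": True,         # mask profanities with asterisks
            "speechContexts": [              # transcription hints
                {"phrases": ["Video Intelligence API", "keynote"]},
            ],
            "audioTracks": [0],              # track(s) to transcribe; 0 to 2
            "enableAutomaticPunctuation": True,
            "enableSpeakerDiarization": True,
        },
    },
}
```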

Request Speech Transcription for a Video


To perform speech transcription, make a POST request to the v1/videos:annotate endpoint.

The example uses the gcloud auth application-default print-access-token command to obtain an access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK and setting up a project with a service account, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
      'inputUri': 'gs://bucket-name-123/sample-video-short.mp4',
      'features': ['SPEECH_TRANSCRIPTION'],
      'videoContext': {
        'speechTranscriptionConfig': {
          'languageCode': 'en-US',
          'enableAutomaticPunctuation': true,
          'filterProfanity': true
        }
      }
    }" "https://videointelligence.googleapis.com/v1/videos:annotate"

An operation ID is returned:

  "name": "us-east1.12938669590037241992"

To retrieve the operation results, replace NAME in the command below with the value of name in your previous result:

curl -X GET -H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
"https://videointelligence.googleapis.com/v1/operations/NAME"

When the operation has completed, the result looks like:

  "name": "us-east1.12938669590037241992",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoProgress",
    "annotationProgress": [
        "inputUri": "/bucket-name-123/sample-video-short.mp4",
        "progressPercent": 100,
        "startTime": "2018-04-09T15:19:38.919779Z",
        "updateTime": "2018-04-09T15:21:17.652470Z"
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
        "speechTranscriptions": [
            "alternatives": [
                "transcript": "and laughing going to talk about is the video intelligence API how many of
you saw it at the keynote yesterday",
                "confidence": 0.8442509,
                "words": [
                    "startTime": "0.200s",
                    "endTime": "0.800s",
                    "word": "and"
                    "startTime": "0.800s",
                    "endTime": "1.100s",
                    "word": "laughing"
                    "startTime": "1.100s",
                    "endTime": "1.200s",
                    "word": "going"


public static object TranscribeVideo(string uri)
{
    Console.WriteLine("Processing video for speech transcription.");

    var client = VideoIntelligenceServiceClient.Create();
    var request = new AnnotateVideoRequest
    {
        InputUri = uri,
        Features = { Feature.SpeechTranscription },
        VideoContext = new VideoContext
        {
            SpeechTranscriptionConfig = new SpeechTranscriptionConfig
            {
                LanguageCode = "en-US",
                EnableAutomaticPunctuation = true
            }
        }
    };
    var op = client.AnnotateVideo(request).PollUntilCompleted();

    // There is only one annotation result since only one video is
    // processed.
    var annotationResults = op.Result.AnnotationResults[0];
    foreach (var transcription in annotationResults.SpeechTranscriptions)
    {
        // The number of alternatives for each transcription is limited
        // by SpeechTranscriptionConfig.MaxAlternatives.
        // Each alternative is a different possible transcription
        // and has its own confidence score.
        foreach (var alternative in transcription.Alternatives)
        {
            Console.WriteLine("Alternative level information:");

            Console.WriteLine($"Transcript: {alternative.Transcript}");
            Console.WriteLine($"Confidence: {alternative.Confidence}");

            foreach (var wordInfo in alternative.Words)
            {
                Console.WriteLine($"\t{wordInfo.StartTime} - " +
                                  $"{wordInfo.EndTime}: " +
                                  $"{wordInfo.Word}");
            }
        }
    }
    return 0;
}


func speechTranscription(w io.Writer, file string) error {
	ctx := context.Background()
	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	fileBytes, err := ioutil.ReadFile(file)
	if err != nil {
		return err
	}

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputContent: fileBytes,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9,     // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}
	return nil
}


// Instantiate a com.google.cloud.videointelligence.v1.VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config = SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder()
          .setSpeechTranscriptionConfig(config)
          .build();

  // Create the request
  AnnotateVideoRequest request = AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
          client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results : response.get(600, TimeUnit.SECONDS)
          .getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime = wordInfo.getStartTime().getSeconds()
                    + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime = wordInfo.getEndTime().getSeconds()
                    + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf("\t%4.2fs - %4.2fs: %s\n",
                    startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

const videoContext = {
  speechTranscriptionConfig: {
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
  },
};

const request = {
  inputUri: gcsUri,
  features: ['SPEECH_TRANSCRIPTION'],
  videoContext: videoContext,
};

const [operation] = await client.annotateVideo(request);
console.log('Waiting for operation to complete...');
const [operationResult] = await operation.promise();
console.log('Word level information:');
const alternative =
  operationResult.annotationResults[0].speechTranscriptions[0].alternatives[0];
alternative.words.forEach(wordInfo => {
  const start_time =
    wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
  const end_time = wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
  console.log('\t' + start_time + 's - ' + end_time + 's: ' + wordInfo.word);
});
console.log('Transcription: ' + alternative.transcript);


use Google\Cloud\VideoIntelligence\V1\VideoIntelligenceServiceClient;
use Google\Cloud\VideoIntelligence\V1\Feature;
use Google\Cloud\VideoIntelligence\V1\VideoContext;
use Google\Cloud\VideoIntelligence\V1\SpeechTranscriptionConfig;

/**
 * Transcribe speech from a video stored on GCS.
 * @param string $uri The cloud storage object to analyze.
 */
function analyze_transcription($uri, array $options = [])
{
    # set configs
    $features = [Feature::SPEECH_TRANSCRIPTION];
    $speechTranscriptionConfig = (new SpeechTranscriptionConfig())
        ->setLanguageCode('en-US')
        ->setEnableAutomaticPunctuation(true);
    $videoContext = (new VideoContext())
        ->setSpeechTranscriptionConfig($speechTranscriptionConfig);

    # instantiate a client
    $client = new VideoIntelligenceServiceClient();

    # execute a request.
    $operation = $client->annotateVideo([
        'inputUri' => $uri,
        'features' => $features,
        'videoContext' => $videoContext
    ]);

    print('Processing video for speech transcription...' . PHP_EOL);
    # Wait for the request to complete.
    $operation->pollUntilComplete($options);

    # Print the result.
    if ($operation->operationSucceeded()) {
        $result = $operation->getResult();
        # there is only one annotation_result since only
        # one video is processed.
        $annotationResults = $result->getAnnotationResults()[0];
        $speechTranscriptions = $annotationResults->getSpeechTranscriptions();

        foreach ($speechTranscriptions as $transcription) {
            # the number of alternatives for each transcription is limited by
            # $max_alternatives in SpeechTranscriptionConfig
            # each alternative is a different possible transcription
            # and has its own confidence score.
            foreach ($transcription->getAlternatives() as $alternative) {
                print('Alternative level information' . PHP_EOL);

                printf('Transcript: %s' . PHP_EOL, $alternative->getTranscript());
                printf('Confidence: %s' . PHP_EOL, $alternative->getConfidence());

                print('Word level information:' . PHP_EOL);
                foreach ($alternative->getWords() as $wordInfo) {
                    printf('%s s - %s s: %s' . PHP_EOL,
                        $wordInfo->getStartTime()->getSeconds(),
                        $wordInfo->getEndTime()->getSeconds(),
                        $wordInfo->getWord());
                }
            }
        }
    }
}


"""Transcribe speech from a video stored on GCS."""
from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]

config = videointelligence.types.SpeechTranscriptionConfig(
video_context = videointelligence.types.VideoContext(

operation = video_client.annotate_video(
    path, features=features,

print('\nProcessing video for speech transcription.')

result = operation.result(timeout=600)

# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:

    # The number of alternatives for each transcription is limited by
    # SpeechTranscriptionConfig.max_alternatives.
    # Each alternative is a different possible transcription
    # and has its own confidence score.
    for alternative in speech_transcription.alternatives:
        print('Alternative level information:')

        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}\n'.format(alternative.confidence))

        print('Word level information:')
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('\t{}s - {}s: {}'.format(
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9,


# path = "Path to a video file on Google Cloud Storage: gs://bucket/video.mp4"

require "google/cloud/video_intelligence"

video = Google::Cloud::VideoIntelligence.new

context = {
  speech_transcription_config: {
    language_code: "en-US",
    enable_automatic_punctuation: true

# Register a callback during the method call
operation = video.annotate_video input_uri: path, features: [:SPEECH_TRANSCRIPTION], video_context: context do |operation|
  raise operation.results.message? if operation.error?
  puts "Finished Processing."

  transcriptions = operation.results.annotation_results.first.speech_transcriptions

  transcriptions.each do |transcription|
    transcription.alternatives.each do |alternative|
      puts "Alternative level information:"

      puts "Transcript: #{alternative.transcript}"
      puts "Confidence: #{alternative.confidence}"

      puts "Word level information:"
      alternative.words.each do |word_info|
        start_time = ( word_info.start_time.seconds +
                       word_info.start_time.nanos / 1e9 )
        end_time =   ( word_info.end_time.seconds +
                       word_info.end_time.nanos / 1e9 )

        puts "#{word_info.word}: #{start_time} to #{end_time}"

puts "Processing video for speech transcriptions:"

Cloud Video Intelligence API Documentation