Speech Transcription

The Video Intelligence API can transcribe speech to text from supported video files.

Video Intelligence speech transcription supports the following features:

  • Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions to include in the response. This value can be an integer from 1 to 30; the default is 1. The API returns the alternatives in descending order of confidence. Alternative transcriptions do not include word-level entries.

  • Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.

  • Transcription hints: Use the speechContexts option to provide common or unusual phrases in your audio. Those phrases are then used to assist the transcription service to create more accurate transcriptions. You provide a transcription hint as a SpeechContext object.

  • Audio track selection: Use the audioTracks option to specify which track to transcribe from multi-track audio. This value can be an integer from 0 to 2. Default is 0.

  • Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.

  • Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies which speaker the recognized word is attributed to.
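
For example, several of these options can be combined in a single SpeechTranscriptionConfig. The following Python sketch (using the same client library as the Python sample later on this page) shows one possible combination; the phrase, track index, and alternative count are illustrative placeholder values, not recommendations:

from google.cloud import videointelligence

# A minimal sketch that exercises the options described above.
# The specific values (phrases, track index, alternative count) are
# placeholders for illustration only.
config = videointelligence.types.SpeechTranscriptionConfig(
    language_code='en-US',
    max_alternatives=2,                   # return up to 2 of the 30 allowed alternatives
    filter_profanity=True,                # mask known profanities with asterisks
    speech_contexts=[videointelligence.types.SpeechContext(
        phrases=['Video Intelligence API'])],  # transcription hints
    audio_tracks=[0],                     # transcribe the first audio track
    enable_automatic_punctuation=True,    # punctuate the transcript
    enable_speaker_diarization=True)      # attach a speakerTag to each word

video_context = videointelligence.types.VideoContext(
    speech_transcription_config=config)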

Request Speech Transcription for a Video

Protocol

To perform speech transcription, make a POST request to the v1/videos:annotate endpoint.

The example uses the gcloud auth application-default print-access-token command to obtain an access token for a service account set up for the project with the Google Cloud SDK. For instructions on installing the Cloud SDK and setting up a project with a service account, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
      'inputUri': 'gs://bucket-name-123/sample-video-short.mp4',
      'features': ['SPEECH_TRANSCRIPTION'],
      'videoContext': {
        'speechTranscriptionConfig': {
          'languageCode': 'en-US',
          'enableAutomaticPunctuation': true,
          'filterProfanity':  true
        }
      }
    }" "https://videointelligence.googleapis.com/v1/videos:annotate"

If the request is successful, the Video Intelligence API returns the name of your operation:

{
  "name": "us-east1.12938669590037241992"
}

To retrieve the operation results, replace NAME in the command below with the value of name in your previous result:

curl -X GET -H "Content-Type: application/json" -H \
"Authorization: Bearer  $(gcloud auth application-default print-access-token)" \
"https://videointelligence.googleapis.com/v1/operations/NAME"

When the operation has completed, the result looks like:

{
  "name": "us-east1.12938669590037241992",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoProgress",
    "annotationProgress": [
      {
        "inputUri": "/bucket-name-123/sample-video-short.mp4",
        "progressPercent": 100,
        "startTime": "2018-04-09T15:19:38.919779Z",
        "updateTime": "2018-04-09T15:21:17.652470Z"
      }
    ]
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
      {
        "speechTranscriptions": [
          {
            "alternatives": [
              {
                "transcript": "and laughing going to talk about is the video intelligence API how many of
you saw it at the keynote yesterday",
                "confidence": 0.8442509,
                "words": [
                  {
                    "startTime": "0.200s",
                    "endTime": "0.800s",
                    "word": "and"
                  },
                  {
                    "startTime": "0.800s",
                    "endTime": "1.100s",
                    "word": "laughing"
                  },
                  {
                    "startTime": "1.100s",
                    "endTime": "1.200s",
                    "word": "going"
                  },
      ...
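
If you call the REST endpoint directly rather than through a client library, you can extract transcripts from the JSON response above with any JSON parser. The following Python sketch assumes the completed operation response shown above has been saved to a local file named operation_result.json (a hypothetical name):

import json

# Load a local copy of the operations response shown above (hypothetical file).
with open('operation_result.json') as f:
    payload = json.load(f)

for result in payload['response']['annotationResults']:
    for transcription in result.get('speechTranscriptions', []):
        for alternative in transcription.get('alternatives', []):
            print('Transcript:', alternative.get('transcript'))
            print('Confidence:', alternative.get('confidence'))
            for word in alternative.get('words', []):
                print('\t{} - {}: {}'.format(
                    word['startTime'], word['endTime'], word['word']))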

C#

using System;
using Google.Cloud.VideoIntelligence.V1;

public static object TranscribeVideo(string uri)
{
    Console.WriteLine("Processing video for speech transcription.");

    var client = VideoIntelligenceServiceClient.Create();
    var request = new AnnotateVideoRequest
    {
        InputUri = uri,
        Features = { Feature.SpeechTranscription },
        VideoContext = new VideoContext
        {
            SpeechTranscriptionConfig = new SpeechTranscriptionConfig
            {
                LanguageCode = "en-US",
                EnableAutomaticPunctuation = true
            }
        },
    };
    var op = client.AnnotateVideo(request).PollUntilCompleted();

    // There is only one annotation result since only one video is
    // processed.
    var annotationResults = op.Result.AnnotationResults[0];
    foreach (var transcription in annotationResults.SpeechTranscriptions)
    {
        // The number of alternatives for each transcription is limited
        // by SpeechTranscriptionConfig.MaxAlternatives.
        // Each alternative is a different possible transcription
        // and has its own confidence score.
        foreach (var alternative in transcription.Alternatives)
        {
            Console.WriteLine("Alternative level information:");

            Console.WriteLine($"Transcript: {alternative.Transcript}");
            Console.WriteLine($"Confidence: {alternative.Confidence}");

            foreach (var wordInfo in alternative.Words)
            {
                Console.WriteLine($"\t{wordInfo.StartTime} - " +
                                  $"{wordInfo.EndTime}:" +
                                  $"{wordInfo.Word}");
            }
        }
    }

    return 0;
}

Go

import (
	"context"
	"fmt"
	"io"
	"io/ioutil"

	video "cloud.google.com/go/videointelligence/apiv1"
	videopb "google.golang.org/genproto/googleapis/cloud/videointelligence/v1"
)

// speechTranscription transcribes speech from a local video file and writes
// transcript, confidence, and word-level timing information to w.
func speechTranscription(w io.Writer, file string) error {
	ctx := context.Background()
	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}

	fileBytes, err := ioutil.ReadFile(file)
	if err != nil {
		return err
	}

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputContent: fileBytes,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9,     // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}

	return nil
}

Java

// Instantiate a com.google.cloud.videointelligence.v1.VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config = SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder()
          .setSpeechTranscriptionConfig(config)
          .build();

  // Create the request
  AnnotateVideoRequest request = AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
          client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results : response.get(600, TimeUnit.SECONDS)
          .getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime = wordInfo.getStartTime().getSeconds()
                    + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime = wordInfo.getEndTime().getSeconds()
                    + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf("\t%4.2fs - %4.2fs: %s\n",
                    startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}

Node.js

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

const videoContext = {
  speechTranscriptionConfig: {
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
  },
};

const request = {
  inputUri: gcsUri,
  features: ['SPEECH_TRANSCRIPTION'],
  videoContext: videoContext,
};

const [operation] = await client.annotateVideo(request);
console.log('Waiting for operation to complete...');
const [operationResult] = await operation.promise();
console.log('Word level information:');
const alternative =
  operationResult.annotationResults[0].speechTranscriptions[0]
    .alternatives[0];
alternative.words.forEach(wordInfo => {
  const start_time =
    wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
  const end_time = wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
  console.log('\t' + start_time + 's - ' + end_time + 's: ' + wordInfo.word);
});
console.log('Transcription: ' + alternative.transcript);

PHP

use Google\Cloud\VideoIntelligence\V1\VideoIntelligenceServiceClient;
use Google\Cloud\VideoIntelligence\V1\Feature;
use Google\Cloud\VideoIntelligence\V1\VideoContext;
use Google\Cloud\VideoIntelligence\V1\SpeechTranscriptionConfig;

/** Uncomment and populate these variables in your code */
// $uri = 'The cloud storage object to analyze (gs://your-bucket-name/your-object-name)';
// $options = [];

# set configs
$features = [Feature::SPEECH_TRANSCRIPTION];
$speechTranscriptionConfig = (new SpeechTranscriptionConfig())
    ->setLanguageCode('en-US')
    ->setEnableAutomaticPunctuation(true);
$videoContext = (new VideoContext())
    ->setSpeechTranscriptionConfig($speechTranscriptionConfig);

# instantiate a client
$client = new VideoIntelligenceServiceClient();

# execute a request.
$operation = $client->annotateVideo([
    'inputUri' => $uri,
    'features' => $features,
    'videoContext' => $videoContext
]);

print('Processing video for speech transcription...' . PHP_EOL);
# Wait for the request to complete.
$operation->pollUntilComplete($options);

# Print the result.
if ($operation->operationSucceeded()) {
    $result = $operation->getResult();
    # there is only one annotation_result since only
    # one video is processed.
    $annotationResults = $result->getAnnotationResults()[0];
    $speechTranscriptions = $annotationResults->getSpeechTranscriptions();

    foreach ($speechTranscriptions as $transcription) {
        # the number of alternatives for each transcription is limited by
        # $max_alternatives in SpeechTranscriptionConfig
        # each alternative is a different possible transcription
        # and has its own confidence score.
        foreach ($transcription->getAlternatives() as $alternative) {
            print('Alternative level information' . PHP_EOL);

            printf('Transcript: %s' . PHP_EOL, $alternative->getTranscript());
            printf('Confidence: %s' . PHP_EOL, $alternative->getConfidence());

            print('Word level information:' . PHP_EOL);
            foreach ($alternative->getWords() as $wordInfo) {
                printf(
                    '%s s - %s s: %s' . PHP_EOL,
                    $wordInfo->getStartTime()->getSeconds(),
                    $wordInfo->getEndTime()->getSeconds(),
                    $wordInfo->getWord()
                );
            }
        }
    }
}
$client->close();

Python

"""Transcribe speech from a video stored on GCS."""
from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]

config = videointelligence.types.SpeechTranscriptionConfig(
    language_code='en-US',
    enable_automatic_punctuation=True)
video_context = videointelligence.types.VideoContext(
    speech_transcription_config=config)

operation = video_client.annotate_video(
    path, features=features,
    video_context=video_context)

print('\nProcessing video for speech transcription.')

result = operation.result(timeout=600)

# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:

    # The number of alternatives for each transcription is limited by
    # SpeechTranscriptionConfig.max_alternatives.
    # Each alternative is a different possible transcription
    # and has its own confidence score.
    for alternative in speech_transcription.alternatives:
        print('Alternative level information:')

        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}\n'.format(alternative.confidence))

        print('Word level information:')
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('\t{}s - {}s: {}'.format(
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9,
                word))

Ruby

# path = "Path to a video file on Google Cloud Storage: gs://bucket/video.mp4"

require "google/cloud/video_intelligence"

video = Google::Cloud::VideoIntelligence.new

context = {
  speech_transcription_config: {
    language_code:                "en-US",
    enable_automatic_punctuation: true
  }
}

# Register a callback during the method call
operation = video.annotate_video input_uri: path, features: [:SPEECH_TRANSCRIPTION], video_context: context do |operation|
  raise operation.error.message if operation.error?
  puts "Finished Processing."

  transcriptions = operation.results.annotation_results.first.speech_transcriptions

  transcriptions.each do |transcription|
    transcription.alternatives.each do |alternative|
      puts "Alternative level information:"

      puts "Transcript: #{alternative.transcript}"
      puts "Confidence: #{alternative.confidence}"

      puts "Word level information:"
      alternative.words.each do |word_info|
        start_time = (word_info.start_time.seconds +
                       word_info.start_time.nanos / 1e9)
        end_time =   (word_info.end_time.seconds +
                       word_info.end_time.nanos / 1e9)

        puts "#{word_info.word}: #{start_time} to #{end_time}"
      end
    end
  end
end

puts "Processing video for speech transcriptions:"
operation.wait_until_done!
