Getting audio track transcription

The Video Intelligence API can transcribe speech to text from supported video files.

Video Intelligence speech transcription supports the following features:

  • Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions of the recognized speech to include in the response. This value can be an integer from 1 to 30. The default is 1. The API returns multiple transcriptions in descending order based on the confidence value for the transcription. Alternative transcriptions do not include word-level entries.

  • Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.

  • Transcription hints: Use the speechContexts option to provide common or unusual phrases in your audio. Those phrases are then used to assist the transcription service to create more accurate transcriptions. You provide a transcription hint as a SpeechContext object.

  • Audio track selection: Use the audioTracks option to specify which track to transcribe from multi-track audio. This value can be an integer from 0 to 2. Default is 0.

  • Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.

  • Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies which speaker the recognized word is attributed to.
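Together, these options are set in the speechTranscriptionConfig of the request's videoContext. The following sketch (illustrative values only; field names follow the v1 REST API) shows how such a fragment might look, expressed as a plain Python dict:

```python
# Illustrative speechTranscriptionConfig fragment built as a plain dict.
# Values are examples only; field names follow the v1 REST API.
speech_transcription_config = {
    "languageCode": "en-US",             # language of the audio track
    "maxAlternatives": 3,                # up to 30 alternative transcriptions
    "filterProfanity": True,             # mask profanities, e.g. "f***"
    "speechContexts": [                  # transcription hints
        {"phrases": ["Cloud Video Intelligence"]}
    ],
    "audioTracks": [0],                  # which track(s) to transcribe
    "enableAutomaticPunctuation": True,  # punctuate the transcript
    "enableSpeakerDiarization": True,    # adds speakerTag to each word
}

video_context = {"speechTranscriptionConfig": speech_transcription_config}
```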

Request Speech Transcription for a Video


Send the process request

The following shows how to send a POST request to the videos:annotate method. The example uses the access token for a service account set up for the project using the Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Video Intelligence quickstart.

Before using any of the request data below, make the following replacements:

  • input-uri: a Cloud Storage bucket that contains the file you want to annotate, including the file name. Must start with gs://.
    For example: "inputUri": "gs://cloud-videointelligence-demo/assistant.mp4",
  • language-code: [Optional] See supported languages

HTTP method and URL:

POST https://videointelligence.googleapis.com/v1/videos:annotate

Request JSON body:

{
  "inputUri": "input-uri",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "language-code",
      "enableAutomaticPunctuation": true,
      "filterProfanity": true
    }
  }
}

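To send the request with the curl tool (a sketch only: it assumes the Cloud SDK is installed and that the JSON body above has been saved as request.json):

```shell
# Sketch: POST the annotate request with an access token from the Cloud SDK.
# Assumes request.json contains the request body shown above.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://videointelligence.googleapis.com/v1/videos:annotate"
```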

You should receive a JSON response similar to the following:

{
  "name": "projects/project-number/locations/location-id/operations/operation-id"
}

If the request is successful, Video Intelligence returns the name for your operation. The above shows an example of such a response, where project-number is the number of your project and operation-id is the ID of the long-running operation created for the request.
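For illustration, the components of an operation name can be pulled apart with a small helper (hypothetical; not part of any client library):

```python
import re

def parse_operation_name(name):
    """Split an operation name of the form
    projects/PROJECT/locations/LOCATION/operations/OPERATION
    into its three components."""
    m = re.fullmatch(
        r"projects/([^/]+)/locations/([^/]+)/operations/([^/]+)", name)
    if m is None:
        raise ValueError("unexpected operation name: %r" % name)
    return m.groups()

parse_operation_name("projects/123/locations/us-east1/operations/456")
# -> ('123', 'us-east1', '456')
```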

Get the results

To get the results of your request, you must send a GET request, using the operation name returned from the call to videos:annotate, as shown in the following example.

Before using any of the request data below, make the following replacements:

  • operation-name: the name of the operation as returned by Video Intelligence API. The operation name has the format projects/project-number/locations/location-id/operations/operation-id

HTTP method and URL:

GET https://videointelligence.googleapis.com/v1/operation-name


You should receive a JSON response similar to the following:
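The full response is lengthy; the abridged sketch below (illustrative values, not actual output) shows the shape of a completed speech transcription operation:

```json
{
  "name": "projects/project-number/locations/location-id/operations/operation-id",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
      {
        "speechTranscriptions": [
          {
            "alternatives": [
              {
                "transcript": "Example transcript of the recognized speech.",
                "confidence": 0.92,
                "words": [
                  {
                    "startTime": "0.300s",
                    "endTime": "0.800s",
                    "word": "Example",
                    "speakerTag": 1
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
```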


C#

public static object TranscribeVideo(string uri)
{
    Console.WriteLine("Processing video for speech transcription.");

    var client = VideoIntelligenceServiceClient.Create();
    var request = new AnnotateVideoRequest
    {
        InputUri = uri,
        Features = { Feature.SpeechTranscription },
        VideoContext = new VideoContext
        {
            SpeechTranscriptionConfig = new SpeechTranscriptionConfig
            {
                LanguageCode = "en-US",
                EnableAutomaticPunctuation = true
            }
        }
    };
    var op = client.AnnotateVideo(request).PollUntilCompleted();

    // There is only one annotation result since only one video is
    // processed.
    var annotationResults = op.Result.AnnotationResults[0];
    foreach (var transcription in annotationResults.SpeechTranscriptions)
    {
        // The number of alternatives for each transcription is limited
        // by SpeechTranscriptionConfig.MaxAlternatives.
        // Each alternative is a different possible transcription
        // and has its own confidence score.
        foreach (var alternative in transcription.Alternatives)
        {
            Console.WriteLine("Alternative level information:");

            Console.WriteLine($"Transcript: {alternative.Transcript}");
            Console.WriteLine($"Confidence: {alternative.Confidence}");

            foreach (var wordInfo in alternative.Words)
            {
                Console.WriteLine($"\t{wordInfo.StartTime} - " +
                                  $"{wordInfo.EndTime}: " +
                                  $"{wordInfo.Word}");
            }
        }
    }
    return 0;
}


Go

func speechTranscription(w io.Writer, file string) error {
	ctx := context.Background()
	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	fileBytes, err := ioutil.ReadFile(file)
	if err != nil {
		return err
	}

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputContent: fileBytes,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9,     // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}

	return nil
}


Java

// Instantiate a VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config = SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder()
          .setSpeechTranscriptionConfig(config)
          .build();

  // Create the request
  AnnotateVideoRequest request = AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // Asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
          client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results : response.get(600, TimeUnit.SECONDS)
          .getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime = wordInfo.getStartTime().getSeconds()
                    + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime = wordInfo.getEndTime().getSeconds()
                    + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf("\t%4.2fs - %4.2fs: %s\n",
                    startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}

Node.js

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

const videoContext = {
  speechTranscriptionConfig: {
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
  },
};

const request = {
  inputUri: gcsUri,
  features: ['SPEECH_TRANSCRIPTION'],
  videoContext: videoContext,
};

const [operation] = await client.annotateVideo(request);
console.log('Waiting for operation to complete...');
const [operationResult] = await operation.promise();
// There is only one annotation result since only one video is processed.
const alternative =
  operationResult.annotationResults[0].speechTranscriptions[0].alternatives[0];
console.log('Word level information:');
alternative.words.forEach(wordInfo => {
  const start_time =
    wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
  const end_time = wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
  console.log('\t' + start_time + 's - ' + end_time + 's: ' + wordInfo.word);
});
console.log('Transcription: ' + alternative.transcript);


PHP

use Google\Cloud\VideoIntelligence\V1\VideoIntelligenceServiceClient;
use Google\Cloud\VideoIntelligence\V1\Feature;
use Google\Cloud\VideoIntelligence\V1\VideoContext;
use Google\Cloud\VideoIntelligence\V1\SpeechTranscriptionConfig;

/** Uncomment and populate these variables in your code */
// $uri = 'The cloud storage object to analyze (gs://your-bucket-name/your-object-name)';
// $options = [];

# set configs
$features = [Feature::SPEECH_TRANSCRIPTION];
$speechTranscriptionConfig = (new SpeechTranscriptionConfig())
    ->setLanguageCode('en-US')
    ->setEnableAutomaticPunctuation(true);
$videoContext = (new VideoContext())
    ->setSpeechTranscriptionConfig($speechTranscriptionConfig);

# instantiate a client
$client = new VideoIntelligenceServiceClient();

# execute a request.
$operation = $client->annotateVideo([
    'inputUri' => $uri,
    'features' => $features,
    'videoContext' => $videoContext
]);

print('Processing video for speech transcription...' . PHP_EOL);
# Wait for the request to complete.
$operation->pollUntilComplete($options);

# Print the result.
if ($operation->operationSucceeded()) {
    $result = $operation->getResult();
    # there is only one annotation_result since only
    # one video is processed.
    $annotationResults = $result->getAnnotationResults()[0];
    $speechTranscriptions = $annotationResults->getSpeechTranscriptions();

    foreach ($speechTranscriptions as $transcription) {
        # the number of alternatives for each transcription is limited by
        # $max_alternatives in SpeechTranscriptionConfig
        # each alternative is a different possible transcription
        # and has its own confidence score.
        foreach ($transcription->getAlternatives() as $alternative) {
            print('Alternative level information' . PHP_EOL);

            printf('Transcript: %s' . PHP_EOL, $alternative->getTranscript());
            printf('Confidence: %s' . PHP_EOL, $alternative->getConfidence());

            print('Word level information:' . PHP_EOL);
            foreach ($alternative->getWords() as $wordInfo) {
                printf(
                    '%s s - %s s: %s' . PHP_EOL,
                    $wordInfo->getStartTime()->getSeconds(),
                    $wordInfo->getEndTime()->getSeconds(),
                    $wordInfo->getWord()
                );
            }
        }
    }
}


Python

"""Transcribe speech from a video stored on GCS."""
from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]

config = videointelligence.types.SpeechTranscriptionConfig(
    language_code='en-US',
    enable_automatic_punctuation=True)
video_context = videointelligence.types.VideoContext(
    speech_transcription_config=config)

operation = video_client.annotate_video(
    path, features=features,
    video_context=video_context)

print('\nProcessing video for speech transcription.')

result = operation.result(timeout=600)

# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:

    # The number of alternatives for each transcription is limited by
    # SpeechTranscriptionConfig.max_alternatives.
    # Each alternative is a different possible transcription
    # and has its own confidence score.
    for alternative in speech_transcription.alternatives:
        print('Alternative level information:')

        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}\n'.format(alternative.confidence))

        print('Word level information:')
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('\t{}s - {}s: {}'.format(
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9,
                word))

Ruby

# path = "Path to a video file on Google Cloud Storage: gs://bucket/video.mp4"

require "google/cloud/video_intelligence"

video = Google::Cloud::VideoIntelligence.new

context = {
  speech_transcription_config: {
    language_code:                "en-US",
    enable_automatic_punctuation: true
  }
}

# Register a callback during the method call
operation = video.annotate_video input_uri: path, features: [:SPEECH_TRANSCRIPTION], video_context: context do |operation|
  raise operation.results.message? if operation.error?
  puts "Finished Processing."

  transcriptions = operation.results.annotation_results.first.speech_transcriptions

  transcriptions.each do |transcription|
    transcription.alternatives.each do |alternative|
      puts "Alternative level information:"

      puts "Transcript: #{alternative.transcript}"
      puts "Confidence: #{alternative.confidence}"

      puts "Word level information:"
      alternative.words.each do |word_info|
        start_time = (word_info.start_time.seconds +
                       word_info.start_time.nanos / 1e9)
        end_time =   (word_info.end_time.seconds +
                       word_info.end_time.nanos / 1e9)

        puts "#{word_info.word}: #{start_time} to #{end_time}"
      end
    end
  end
end

puts "Processing video for speech transcriptions:"
operation.wait_until_done!

Cloud Video Intelligence API Documentation