Speech Transcription

The Video Intelligence API can transcribe speech to text from supported video files.

It supports the following features:

  • Alternative transcriptions: The API can return up to 30 alternative transcriptions for each block of recognized speech, in descending order of confidence.

  • Profanity filtering: The service attempts to filter out profanities, replacing all but the initial character in each filtered word with asterisks.

  • Transcription hints: You can provide a list of words or phrases to assist the transcription service with a specific request.

  • Audio track selection: Specify the tracks to transcribe (up to 2) from multi-track video files.

  • Automatic punctuation: Include punctuation in the transcribed text.

See the Speech transcription options section for details.

Making a speech transcription request

Protocol

To perform speech transcription, make a POST request to the v1p1beta1/videos:annotate endpoint.

The example uses the gcloud auth application-default print-access-token command to obtain an access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK and setting up a project with a service account, see the Quickstart.

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
      'inputUri': 'gs://bucket-name-123/google-next-17-videointelligence-short.mp4',
      'features': ['SPEECH_TRANSCRIPTION'],
      'videoContext': {
        'speechTranscriptionConfig': {
          'languageCode': 'en-US',
          'enableAutomaticPunctuation': true
        }
      }
    }" "https://videointelligence.googleapis.com/v1p1beta1/videos:annotate"

An operation name is returned:

{
  "name": "us-east1.12938669590037241992"
}

To retrieve the operation results, replace NAME in the command below with the value of name in your previous result:

curl -X GET -H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
"https://videointelligence.googleapis.com/v1/operations/NAME"

When the operation has completed, the result looks like:

{
  "name": "us-east1.12938669590037241992",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1p1beta1.AnnotateVideoProgress",
    "annotationProgress": [
      {
        "inputUri": "/bucket-name-123/google-next-17-videointelligence-short.mp4",
        "progressPercent": 100,
        "startTime": "2018-04-09T15:19:38.919779Z",
        "updateTime": "2018-04-09T15:21:17.652470Z"
      },
      {
        "inputUri": "/bucket-name-123/google-next-17-videointelligence-short.mp4",
        "progressPercent": 100,
        "startTime": "2018-04-09T15:19:38.919779Z",
        "updateTime": "2018-04-09T15:19:51.884318Z"
      }
    ]
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1p1beta1.AnnotateVideoResponse",
    "annotationResults": [
      {
        "speechTranscriptions": [
          {
            "alternatives": [
              {
                "transcript": "and laughing going to talk about is the video intelligence API how many of
you saw it at the keynote yesterday",
                "confidence": 0.8442509,
                "words": [
                  {
                    "startTime": "0.200s",
                    "endTime": "0.800s",
                    "word": "and"
                  },
                  {
                    "startTime": "0.800s",
                    "endTime": "1.100s",
                    "word": "laughing"
                  },
                  {
                    "startTime": "1.100s",
                    "endTime": "1.200s",
                    "word": "going"
                  },
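
The transcript, confidence, and word timings are nested inside response.annotationResults[].speechTranscriptions[].alternatives[]. As a rough illustration, the following Python sketch walks that structure, assuming the completed operation JSON above has been saved to a local file named response.json (a hypothetical filename):

import json

# Load the JSON returned by the operations endpoint.
with open('response.json') as f:
    operation = json.load(f)

for result in operation['response']['annotationResults']:
    for transcription in result['speechTranscriptions']:
        # The first alternative is the most likely transcription.
        best = transcription['alternatives'][0]
        print('Transcript:', best['transcript'])
        print('Confidence:', best['confidence'])
        for word in best.get('words', []):
            print('\t{} - {}: {}'.format(
                word['startTime'], word['endTime'], word['word']))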

Java

/**
 * Transcribe speech from a video stored on GCS.
 *
 * @param gcsUri the path to the video file to analyze.
 */
public static void speechTranscription(String gcsUri) throws Exception {
  // Instantiate a com.google.cloud.videointelligence.v1p1beta1.VideoIntelligenceServiceClient
  try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
    // Set the language code
    SpeechTranscriptionConfig config = SpeechTranscriptionConfig.newBuilder()
        .setLanguageCode("en-US")
        .setEnableAutomaticPunctuation(true)
        .build();

    // Set the video context with the above configuration
    VideoContext context = VideoContext.newBuilder()
        .setSpeechTranscriptionConfig(config)
        .build();

    // Create the request
    AnnotateVideoRequest request = AnnotateVideoRequest.newBuilder()
        .setInputUri(gcsUri)
        .addFeatures(Feature.SPEECH_TRANSCRIPTION)
        .setVideoContext(context)
        .build();

    // asynchronously perform speech transcription on videos
    OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
        client.annotateVideoAsync(request);

    System.out.println("Waiting for operation to complete...");
    // Display the results
    for (VideoAnnotationResults results : response.get(300, TimeUnit.SECONDS)
        .getAnnotationResultsList()) {
      for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
        try {
          // Print the transcription
          if (speechTranscription.getAlternativesCount() > 0) {
            SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

            System.out.printf("Transcript: %s\n", alternative.getTranscript());
            System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

            System.out.println("Word level information:");
            for (WordInfo wordInfo : alternative.getWordsList()) {
              double startTime = wordInfo.getStartTime().getSeconds()
                  + wordInfo.getStartTime().getNanos() / 1e9;
              double endTime = wordInfo.getEndTime().getSeconds()
                  + wordInfo.getEndTime().getNanos() / 1e9;
              System.out.printf("\t%4.2fs - %4.2fs: %s\n",
                  startTime, endTime, wordInfo.getWord());
            }
          } else {
            System.out.println("No transcription found");
          }
        } catch (IndexOutOfBoundsException ioe) {
          System.out.println("Could not retrieve frame: " + ioe.getMessage());
        }
      }
    }
  }
}

Node.js

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence')
  .v1p1beta1;

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

const videoContext = {
  speechTranscriptionConfig: {
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
  },
};

const request = {
  inputUri: gcsUri,
  features: ['SPEECH_TRANSCRIPTION'],
  videoContext: videoContext,
};

client
  .annotateVideo(request)
  .then(results => {
    const operation = results[0];
    console.log('Waiting for operation to complete...');
    return operation.promise();
  })
  .then(results => {
    console.log('Word level information:');
    const alternative =
      results[0].annotationResults[0].speechTranscriptions[0].alternatives[0];
    alternative.words.forEach(wordInfo => {
      const start_time =
        wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
      const end_time =
        wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
      console.log(
        '\t' + start_time + 's - ' + end_time + 's: ' + wordInfo.word
      );
    });
    console.log('Transcription: ' + alternative.transcript);
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

Python

Beta features are available from the regular Video Intelligence API client library for Python.

Your requirements.txt should include:

google-cloud-videointelligence==1.2.0

In your Python application, import the v1p1beta1 version of the API:

from google.cloud import videointelligence_v1p1beta1 as videointelligence
"""Transcribe speech from a video stored on GCS."""
from google.cloud import videointelligence_v1p1beta1 as videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()

features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]

config = videointelligence.types.SpeechTranscriptionConfig(
    language_code='en-US',
    enable_automatic_punctuation=True)
video_context = videointelligence.types.VideoContext(
    speech_transcription_config=config)

operation = video_client.annotate_video(
    input_uri, features=features,
    video_context=video_context)

print('\nProcessing video for speech transcription.')

result = operation.result(timeout=180)

# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:

    # The number of alternatives for each transcription is limited by
    # SpeechTranscriptionConfig.max_alternatives.
    # Each alternative is a different possible transcription
    # and has its own confidence score.
    for alternative in speech_transcription.alternatives:
        print('Alternative level information:')

        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}\n'.format(alternative.confidence))

        print('Word level information:')
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print('\t{}s - {}s: {}'.format(
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9,
                word))

Speech transcription options

The following options can be specified with your transcription request:

  • languageCode is required. Only "en-US" is currently supported.

  • maxAlternatives (optional) specifies the maximum number of recognition hypotheses to be returned (from 1 to 30). The default is 1. Specifying more than one may return additional transcript messages for each block of the transcription, in descending order of confidence. Alternative transcriptions do not include word-level entries.

  • filterProfanity (optional, boolean) attempts to filter out known profanities. Matched words are replaced with the leading character followed by asterisks. Default is false.

  • speechContexts (optional) can be used to provide common or unusual phrases from your audio to assist the transcription service. Provided as a list of SpeechContext objects.

  • audioTracks (optional) specifies which tracks (up to two) to transcribe from multi-track video files. Accepts a list of track numbers. Default is track 0.

  • enableAutomaticPunctuation (optional) can be used to include punctuation in the transcribed text. The default is false.

For example:

{
  "inputUri": "gs://bucket-name-123/google-next-17-videointelligence-short.mp4",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "en-US",
      "maxAlternatives": 3,
      "filterProfanity": true,
      "enableAutomaticPunctuation": true,
      "speechContexts": [{
        "phrases": [
          "video intelligence API",
          "keynote"
        ]
      }],
      "audioTracks": [1, 3]
    }
  }
}
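
For comparison, the same options can be set through the Python client library shown earlier. This is a minimal sketch, not an official sample; field names follow the client library's snake_case convention, and the input URI is the placeholder bucket from the example above.

from google.cloud import videointelligence_v1p1beta1 as videointelligence

config = videointelligence.types.SpeechTranscriptionConfig(
    language_code='en-US',
    max_alternatives=3,
    filter_profanity=True,
    enable_automatic_punctuation=True,
    speech_contexts=[videointelligence.types.SpeechContext(
        phrases=['video intelligence API', 'keynote'])],
    audio_tracks=[1, 3])
context = videointelligence.types.VideoContext(
    speech_transcription_config=config)

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    'gs://bucket-name-123/google-next-17-videointelligence-short.mp4',
    features=[videointelligence.enums.Feature.SPEECH_TRANSCRIPTION],
    video_context=context)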

Pricing

Speech transcription pricing is listed on the Pricing page.
