Ottieni la trascrizione di tracce audio

L'API Video Intelligence trascrive la voce in testo dei file video supportati. Esistono due modelli supportati, "predefinito" e "video".

Richiedere la trascrizione vocale per un video

REST

Invia la richiesta di processo

Di seguito viene mostrato come inviare una richiesta POST al metodo videos:annotate. L'esempio utilizza il token di accesso per un account di servizio configurato per il progetto utilizzando Google Cloud CLI. Per istruzioni sull'installazione di Google Cloud CLI, sulla configurazione di un progetto con un account di servizio e sull'ottenimento di un token di accesso, consulta la guida rapida di Video Intelligence.

Prima di utilizzare i dati della richiesta, effettua le seguenti sostituzioni:

INPUT_URI: un bucket Cloud Storage che contiene il file su cui vuoi annotare il nome, incluso il nome. Deve iniziare con gs://.
Ad esempio: "inputUri": "gs://cloud-videointelligence-demo/assistant.mp4",
LANGUAGE_CODE: [facoltativo] consulta le lingue supportate
PROJECT_NUMBER: l'identificatore numerico del tuo progetto Google Cloud

Metodo HTTP e URL:

POST https://videointelligence.googleapis.com/v1/videos:annotate

Corpo JSON della richiesta:


{
"inputUri": "INPUT_URI",
  "features": ["SPEECH_TRANSCRIPTION"],
  "videoContext": {
    "speechTranscriptionConfig": {
      "languageCode": "LANGUAGE_CODE",
      "enableAutomaticPunctuation": true,
      "filterProfanity": true
    }
  }
}

Per inviare la richiesta, espandi una di queste opzioni:

curl (Linux, macOS o Cloud Shell)

Nota: il comando seguente presuppone che tu abbia eseguito l'accesso all'interfaccia a riga di comando gcloud con il tuo account utente eseguendo gcloud init o gcloud auth login oppure utilizzando Cloud Shell, che ti consente di accedere automaticamente all'interfaccia a riga di comando gcloud. Puoi controllare l'account attualmente attivo eseguendo gcloud auth list.

Salva il corpo della richiesta in un file denominato request.json ed esegui questo comando:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: PROJECT_NUMBER" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://videointelligence.googleapis.com/v1/videos:annotate"

PowerShell (Windows)

Salva il corpo della richiesta in un file denominato request.json ed esegui questo comando:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://videointelligence.googleapis.com/v1/videos:annotate" | Select-Object -Expand Content

Dovresti ricevere una risposta JSON simile alla seguente:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID"
}

Se la richiesta ha esito positivo, Video Intelligence restituisce name per la tua operazione. Quanto riportato sopra mostra un esempio di questa risposta, dove project-number è il numero del progetto e operation-id è l'ID dell'operazione a lunga esecuzione creata per la richiesta.

Ottieni i risultati

Per ottenere i risultati della richiesta, devi inviare un codice GET utilizzando il nome dell'operazione restituito dalla chiamata a videos:annotate, come mostrato nell'esempio seguente.

Prima di utilizzare i dati della richiesta, effettua le seguenti sostituzioni:

OPERATION_NAME: il nome dell'operazione restituito dall'API Video Intelligence. Il nome dell'operazione ha il formato projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID

PROJECT_NUMBER: l'identificatore numerico del tuo progetto Google Cloud

Metodo HTTP e URL:

GET https://videointelligence.googleapis.com/v1/OPERATION_NAME

Per inviare la richiesta, espandi una di queste opzioni:

curl (Linux, macOS o Cloud Shell)

Esegui questo comando:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "x-goog-user-project: PROJECT_NUMBER" \
    "https://videointelligence.googleapis.com/v1/OPERATION_NAME"

PowerShell (Windows)

Esegui questo comando:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred"; "x-goog-user-project" = "PROJECT_NUMBER" }

Invoke-WebRequest `
    -Method GET `
    -Headers $headers `
    -Uri "https://videointelligence.googleapis.com/v1/OPERATION_NAME" | Select-Object -Expand Content

Dovresti ricevere una risposta JSON simile alla seguente:

Risposta

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoProgress",
    "annotationProgress": [{
      "inputUri": "/bucket-name-123/sample-video-short.mp4",
      "progressPercent": 100,
      "startTime": "2018-04-09T15:19:38.919779Z",
      "updateTime": "2018-04-09T15:21:17.652470Z"
    }]
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.videointelligence.v1.AnnotateVideoResponse",
    "annotationResults": [
    {
          "speechTranscriptions": [
      {
            "alternatives": [
          {
            "transcript": "and laughing going to talk about is the video intelligence API how many of you saw it at the keynote yesterday ",
            "confidence": 0.8442509,
            "words": [
          {
            "startTime": "0.200s",
            "endTime": "0.800s",
            "word": "and"
          },
          {
            "startTime": "0.800s",
            "endTime": "1.100s",
            "word": "laughing"
          },
          {
            "startTime": "1.100s",
            "endTime": "1.200s",
            "word": "going"
          },
    ...

Scarica risultati annotazioni

Copia l'annotazione dall'origine al bucket di destinazione: (vedi Copiare file e oggetti)

gsutil cp gcs_uri gs://my-bucket

Nota: se l'URI GCS di output viene fornito dall'utente, l'annotazione viene archiviata in quell'URI.

Go

Per eseguire l'autenticazione a Video Intelligence, configura Credenziali predefinite dell'applicazione. Per maggiori informazioni, consulta Configurare l'autenticazione per un ambiente di sviluppo locale.


func speechTranscriptionURI(w io.Writer, file string) error {
	ctx := context.Background()
	client, err := video.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	op, err := client.AnnotateVideo(ctx, &videopb.AnnotateVideoRequest{
		Features: []videopb.Feature{
			videopb.Feature_SPEECH_TRANSCRIPTION,
		},
		VideoContext: &videopb.VideoContext{
			SpeechTranscriptionConfig: &videopb.SpeechTranscriptionConfig{
				LanguageCode:               "en-US",
				EnableAutomaticPunctuation: true,
			},
		},
		InputUri: file,
	})
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// A single video was processed. Get the first result.
	result := resp.AnnotationResults[0]

	for _, transcription := range result.SpeechTranscriptions {
		// The number of alternatives for each transcription is limited by
		// SpeechTranscriptionConfig.MaxAlternatives.
		// Each alternative is a different possible transcription
		// and has its own confidence score.
		for _, alternative := range transcription.GetAlternatives() {
			fmt.Fprintf(w, "Alternative level information:\n")
			fmt.Fprintf(w, "\tTranscript: %v\n", alternative.GetTranscript())
			fmt.Fprintf(w, "\tConfidence: %v\n", alternative.GetConfidence())

			fmt.Fprintf(w, "Word level information:\n")
			for _, wordInfo := range alternative.GetWords() {
				startTime := wordInfo.GetStartTime()
				endTime := wordInfo.GetEndTime()
				fmt.Fprintf(w, "\t%4.1f - %4.1f: %v (speaker %v)\n",
					float64(startTime.GetSeconds())+float64(startTime.GetNanos())*1e-9, // start as seconds
					float64(endTime.GetSeconds())+float64(endTime.GetNanos())*1e-9,     // end as seconds
					wordInfo.GetWord(),
					wordInfo.GetSpeakerTag())
			}
		}
	}

	return nil
}

Java

// Instantiate a com.google.cloud.videointelligence.v1.VideoIntelligenceServiceClient
try (VideoIntelligenceServiceClient client = VideoIntelligenceServiceClient.create()) {
  // Set the language code
  SpeechTranscriptionConfig config =
      SpeechTranscriptionConfig.newBuilder()
          .setLanguageCode("en-US")
          .setEnableAutomaticPunctuation(true)
          .build();

  // Set the video context with the above configuration
  VideoContext context = VideoContext.newBuilder().setSpeechTranscriptionConfig(config).build();

  // Create the request
  AnnotateVideoRequest request =
      AnnotateVideoRequest.newBuilder()
          .setInputUri(gcsUri)
          .addFeatures(Feature.SPEECH_TRANSCRIPTION)
          .setVideoContext(context)
          .build();

  // asynchronously perform speech transcription on videos
  OperationFuture<AnnotateVideoResponse, AnnotateVideoProgress> response =
      client.annotateVideoAsync(request);

  System.out.println("Waiting for operation to complete...");
  // Display the results
  for (VideoAnnotationResults results :
      response.get(600, TimeUnit.SECONDS).getAnnotationResultsList()) {
    for (SpeechTranscription speechTranscription : results.getSpeechTranscriptionsList()) {
      try {
        // Print the transcription
        if (speechTranscription.getAlternativesCount() > 0) {
          SpeechRecognitionAlternative alternative = speechTranscription.getAlternatives(0);

          System.out.printf("Transcript: %s\n", alternative.getTranscript());
          System.out.printf("Confidence: %.2f\n", alternative.getConfidence());

          System.out.println("Word level information:");
          for (WordInfo wordInfo : alternative.getWordsList()) {
            double startTime =
                wordInfo.getStartTime().getSeconds() + wordInfo.getStartTime().getNanos() / 1e9;
            double endTime =
                wordInfo.getEndTime().getSeconds() + wordInfo.getEndTime().getNanos() / 1e9;
            System.out.printf(
                "\t%4.2fs - %4.2fs: %s\n", startTime, endTime, wordInfo.getWord());
          }
        } else {
          System.out.println("No transcription found");
        }
      } catch (IndexOutOfBoundsException ioe) {
        System.out.println("Could not retrieve frame: " + ioe.getMessage());
      }
    }
  }
}

Node.js

// Imports the Google Cloud Video Intelligence library
const videoIntelligence = require('@google-cloud/video-intelligence');

// Creates a client
const client = new videoIntelligence.VideoIntelligenceServiceClient();

/**
 * TODO(developer): Uncomment the following line before running the sample.
 */
// const gcsUri = 'GCS URI of video to analyze, e.g. gs://my-bucket/my-video.mp4';

async function analyzeVideoTranscript() {
  const videoContext = {
    speechTranscriptionConfig: {
      languageCode: 'en-US',
      enableAutomaticPunctuation: true,
    },
  };

  const request = {
    inputUri: gcsUri,
    features: ['SPEECH_TRANSCRIPTION'],
    videoContext: videoContext,
  };

  const [operation] = await client.annotateVideo(request);
  console.log('Waiting for operation to complete...');
  const [operationResult] = await operation.promise();
  // There is only one annotation_result since only
  // one video is processed.
  const annotationResults = operationResult.annotationResults[0];

  for (const speechTranscription of annotationResults.speechTranscriptions) {
    // The number of alternatives for each transcription is limited by
    // SpeechTranscriptionConfig.max_alternatives.
    // Each alternative is a different possible transcription
    // and has its own confidence score.
    for (const alternative of speechTranscription.alternatives) {
      console.log('Alternative level information:');
      console.log(`Transcript: ${alternative.transcript}`);
      console.log(`Confidence: ${alternative.confidence}`);

      console.log('Word level information:');
      for (const wordInfo of alternative.words) {
        const word = wordInfo.word;
        const start_time =
          wordInfo.startTime.seconds + wordInfo.startTime.nanos * 1e-9;
        const end_time =
          wordInfo.endTime.seconds + wordInfo.endTime.nanos * 1e-9;
        console.log('\t' + start_time + 's - ' + end_time + 's: ' + word);
      }
    }
  }
}

analyzeVideoTranscript();

Python

"""Transcribe speech from a video stored on GCS."""
from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.Feature.SPEECH_TRANSCRIPTION]

config = videointelligence.SpeechTranscriptionConfig(
    language_code="en-US", enable_automatic_punctuation=True
)
video_context = videointelligence.VideoContext(speech_transcription_config=config)

operation = video_client.annotate_video(
    request={
        "features": features,
        "input_uri": path,
        "video_context": video_context,
    }
)

print("\nProcessing video for speech transcription.")

result = operation.result(timeout=600)

# There is only one annotation_result since only
# one video is processed.
annotation_results = result.annotation_results[0]
for speech_transcription in annotation_results.speech_transcriptions:
    # The number of alternatives for each transcription is limited by
    # SpeechTranscriptionConfig.max_alternatives.
    # Each alternative is a different possible transcription
    # and has its own confidence score.
    for alternative in speech_transcription.alternatives:
        print("Alternative level information:")

        print("Transcript: {}".format(alternative.transcript))
        print("Confidence: {}\n".format(alternative.confidence))

        print("Word level information:")
        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            print(
                "\t{}s - {}s: {}".format(
                    start_time.seconds + start_time.microseconds * 1e-6,
                    end_time.seconds + end_time.microseconds * 1e-6,
                    word,
                )
            )

Linguaggi aggiuntivi

C#: segui le istruzioni di configurazione di C# nella pagina delle librerie client e poi consulta la documentazione di riferimento di Video Intelligence per .NET.

PHP: segui le istruzioni per la configurazione dei file PHP nella pagina delle librerie client e consulta la documentazione di riferimento di Video Intelligence per PHP.

Ruby: segui le istruzioni per la configurazione di Ruby nella pagina delle librerie client e visita la documentazione di riferimento di Video Intelligence per Ruby.