Transcribing long audio files

This page shows how to transcribe long audio files (longer than one minute) to text using asynchronous speech recognition.

Asynchronous speech recognition starts a long-running audio processing operation. Use asynchronous speech recognition to recognize audio that is longer than a minute. For shorter audio, synchronous speech recognition is faster and simpler.

You can retrieve the results of the operation through the google.longrunning.Operations interface. Results remain available for retrieval for 5 days (120 hours). You can either send audio content directly to Speech-to-Text or process audio content that already resides in Google Cloud Storage. See the audio limits for asynchronous speech recognition requests.

Speech-to-Text v1 is officially released and is generally available from the https://speech.googleapis.com/v1/speech endpoint. The client libraries are released as alpha and may be changed in backward-incompatible ways. They are currently not recommended for production use.

These samples require that you have set up gcloud and have created and activated a service account. For information about setting up gcloud, as well as creating and activating a service account, see the quickstart.
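
For reference, a minimal credential setup might look like the following sketch; the key file path and project ID are placeholders, and the quickstart covers the full procedure.

    # Point application default credentials at a service account key (path is a placeholder).
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

    # Select the project to bill requests against (project ID is a placeholder).
    gcloud config set project your-project-id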

Transcribing long audio files using a Google Cloud Storage file

These samples use a Cloud Storage bucket to store the raw audio input for the long-running transcription process.
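
If your audio is not yet in Cloud Storage, you can create a bucket and copy a local file into it with gsutil, for example (bucket and file names are placeholders):

    # Create a bucket and upload a local audio file to it.
    gsutil mb gs://your-bucket-name
    gsutil cp your-audio-file.flac gs://your-bucket-name/your-audio-file.flac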

Protocol

Refer to the speech:longrunningrecognize API endpoint for complete details.

To perform asynchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

    curl -X POST \
         -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
         -H "Content-Type: application/json; charset=utf-8" \
         --data "{
      'config': {
        'language_code': 'en-US'
      },
      'audio':{
        'uri':'gs://gcs-test-data/vr.flac'
      }
    }" "https://speech.googleapis.com/v1/speech:longrunningrecognize"
    

See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

    {
      "name": "7612202767953098924"
    }
    

where name is the name of the long-running operation created for the request.

Wait for the operation to complete. Processing time depends on the source audio; in most cases, you will get results in half the length of the source audio. You can get the status of your long-running operation by making a GET request to the https://speech.googleapis.com/v1/operations/ endpoint. Replace your-operation-name with the name returned by your longrunningrecognize request. You can see the estimated progress of the request in the progressPercent field.

    curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
         -H "Content-Type: application/json; charset=utf-8" \
         "https://speech.googleapis.com/v1/operations/your-operation-name"
    

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

    {
      "name": "7612202767953098924",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
        "progressPercent": 100,
        "startTime": "2017-07-20T16:36:55.033650Z",
        "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
      },
      "done": true,
      "response": {
        "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
        "results": [
          {
            "alternatives": [
              {
                "transcript": "okay so what am I doing here...(etc)...",
                "confidence": 0.96096134,
              }
            ]
          },
          {
            "alternatives": [
              {
                ...
              }
            ]
          }
        ]
      }
    }
    

If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.
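
For example, a minimal polling loop over the REST endpoint could look like the following sketch. It assumes the placeholder operation name from above and simply greps the raw JSON response, which is fine for experimentation but no substitute for the client libraries' built-in polling.

    # Poll the operation every 30 seconds until the response reports "done": true.
    # your-operation-name is a placeholder for the name returned by longrunningrecognize.
    until curl -s -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
               "https://speech.googleapis.com/v1/operations/your-operation-name" \
          | grep -q '"done": true'
    do
      sleep 30
    done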

gcloud command

Refer to the recognize-long-running command for complete details.

To perform asynchronous speech recognition, use the gcloud command-line tool and provide the path of a local file or a Google Cloud Storage URL.

    gcloud ml speech recognize-long-running \
        'gs://cloud-samples-tests/speech/brooklyn.flac' \
         --language-code='en-US' --async
    

If the request is successful, the server returns the ID of the long-running operation in JSON format.

    {
      "name": OPERATION_ID
    }
    

You can then get information about the operation with the following command:

    gcloud ml speech operations describe OPERATION_ID
    

You can also poll the operation until it completes with the following command:

    gcloud ml speech operations wait OPERATION_ID
    

Once the operation has completed, it returns a transcript of the audio in JSON format:

    {
      "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
      "results": [
        {
          "alternatives": [
            {
              "confidence": 0.9840146,
              "transcript": "how old is the Brooklyn Bridge"
            }
          ]
        }
      ]
    }
    

C#

    static object AsyncRecognizeGcs(string storageUri)
    {
        var speech = SpeechClient.Create();
        var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en",
        }, RecognitionAudio.FromStorageUri(storageUri));
        longOperation = longOperation.PollUntilCompleted();
        var response = longOperation.Result;
        foreach (var result in response.Results)
        {
            foreach (var alternative in result.Alternatives)
            {
                Console.WriteLine($"Transcript: { alternative.Transcript}");
            }
        }
        return 0;
    }

Go


    func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
    	ctx := context.Background()

    	// Send the contents of the audio file with the encoding and
    	// sample rate information to be transcribed.
    	req := &speechpb.LongRunningRecognizeRequest{
    		Config: &speechpb.RecognitionConfig{
    			Encoding:        speechpb.RecognitionConfig_LINEAR16,
    			SampleRateHertz: 16000,
    			LanguageCode:    "en-US",
    		},
    		Audio: &speechpb.RecognitionAudio{
    			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
    		},
    	}

    	op, err := client.LongRunningRecognize(ctx, req)
    	if err != nil {
    		return err
    	}
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return err
    	}

    	// Print the results.
    	for _, result := range resp.Results {
    		for _, alt := range result.Alternatives {
    			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
    		}
    	}
    	return nil
    }
    

Java

    /**
     * Performs non-blocking speech recognition on a remote FLAC file and prints the transcription.
     *
     * @param gcsUri the path to the remote FLAC audio file to transcribe.
     */
    public static void asyncRecognizeGcs(String gcsUri) throws Exception {
      // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
      try (SpeechClient speech = SpeechClient.create()) {

        // Configure remote file request for FLAC
        RecognitionConfig config =
            RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.FLAC)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .build();
        RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

        // Use non-blocking call for getting file transcription
        OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
            speech.longRunningRecognizeAsync(config, audio);
        while (!response.isDone()) {
          System.out.println("Waiting for response...");
          Thread.sleep(10000);
        }

        List<SpeechRecognitionResult> results = response.get().getResultsList();

        for (SpeechRecognitionResult result : results) {
          // There can be several alternative transcripts for a given chunk of speech. Just use the
          // first (most likely) one here.
          SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
          System.out.printf("Transcription: %s\n", alternative.getTranscript());
        }
      }
    }

Node.js

    // Imports the Google Cloud client library
    const speech = require('@google-cloud/speech');

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const gcsUri = 'gs://my-bucket/audio.raw';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    };

    const audio = {
      uri: gcsUri,
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file. This creates a recognition job that you
    // can wait for now, or get its result later.
    const [operation] = await client.longRunningRecognize(request);
    // Get a Promise representation of the final result of the job
    const [response] = await operation.promise();
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);

PHP

    use Google\Cloud\Speech\V1\SpeechClient;
    use Google\Cloud\Speech\V1\RecognitionAudio;
    use Google\Cloud\Speech\V1\RecognitionConfig;
    use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

    /** Uncomment and populate these variables in your code */
    // $uri = 'The Cloud Storage object to transcribe (gs://your-bucket-name/your-object-name)';

    // change these variables if necessary
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    // set string as audio content
    $audio = (new RecognitionAudio())
        ->setUri($uri);

    // set config
    $config = (new RecognitionConfig())
        ->setEncoding($encoding)
        ->setSampleRateHertz($sampleRateHertz)
        ->setLanguageCode($languageCode);

    // create the speech client
    $client = new SpeechClient();

    // create the asynchronous recognize operation
    $operation = $client->longRunningRecognize($config, $audio);
    $operation->pollUntilComplete();

    if ($operation->operationSucceeded()) {
        $response = $operation->getResult();

        // each result is for a consecutive portion of the audio. iterate
        // through them to get the transcripts for the entire audio file.
        foreach ($response->getResults() as $result) {
            $alternatives = $result->getAlternatives();
            $mostLikely = $alternatives[0];
            $transcript = $mostLikely->getTranscript();
            $confidence = $mostLikely->getConfidence();
            printf('Transcript: %s' . PHP_EOL, $transcript);
            printf('Confidence: %s' . PHP_EOL, $confidence);
        }
    } else {
        print_r($operation->getError());
    }

    $client->close();

Python

    from google.cloud import speech_v1
    from google.cloud.speech_v1 import enums

    def sample_long_running_recognize(storage_uri):
        """
        Transcribe long audio file from Cloud Storage using asynchronous speech
        recognition

        Args:
          storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
        """

        client = speech_v1.SpeechClient()

        # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.raw'

        # Sample rate in Hertz of the audio data sent
        sample_rate_hertz = 16000

        # The language of the supplied audio
        language_code = "en-US"

        # Encoding of audio data sent. This sample sets this explicitly.
        # This field is optional for FLAC and WAV audio formats.
        encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16
        config = {
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
            "encoding": encoding,
        }
        audio = {"uri": storage_uri}

        operation = client.long_running_recognize(config, audio)

        print(u"Waiting for operation to complete...")
        response = operation.result()

        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))

    

Ruby

    # storage_path = "Path to file in Cloud Storage, e.g. gs://bucket/audio.raw"

    require "google/cloud/speech"

    speech = Google::Cloud::Speech.new

    config = { encoding:          :LINEAR16,
               sample_rate_hertz: 16_000,
               language_code:     "en-US" }
    audio = { uri: storage_path }

    operation = speech.long_running_recognize config, audio

    puts "Operation started"

    operation.wait_until_done!

    raise operation.results.message if operation.error?

    results = operation.response.results

    alternatives = results.first.alternatives
    alternatives.each do |alternative|
      puts "Transcription: #{alternative.transcript}"
    end

Transcribing long audio files using a local file

These samples use a local file to store the raw audio input for the long-running transcription process.
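
If you want to call the REST interface with a local file, you can embed the audio directly in the request body as a base64-encoded content field instead of a uri. The following is a minimal sketch with curl; the file name is a placeholder, the -w 0 flag (GNU coreutils base64) disables line wrapping, and inline content is subject to tighter request size limits than Cloud Storage URIs.

    # Base64-encode a local audio file (file name is a placeholder).
    AUDIO_CONTENT=$(base64 -w 0 your-audio-file.raw)

    curl -X POST \
         -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
         -H "Content-Type: application/json; charset=utf-8" \
         --data "{
      'config': {
        'encoding': 'LINEAR16',
        'sample_rate_hertz': 16000,
        'language_code': 'en-US'
      },
      'audio': {
        'content': '${AUDIO_CONTENT}'
      }
    }" "https://speech.googleapis.com/v1/speech:longrunningrecognize"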

C#

    static object LongRunningRecognize(string filePath)
    {
        var speech = SpeechClient.Create();
        var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en",
        }, RecognitionAudio.FromFile(filePath));
        longOperation = longOperation.PollUntilCompleted();
        var response = longOperation.Result;
        foreach (var result in response.Results)
        {
            foreach (var alternative in result.Alternatives)
            {
                Console.WriteLine(alternative.Transcript);
            }
        }
        return 0;
    }

Go


    func send(w io.Writer, client *speech.Client, filename string) error {
    	ctx := context.Background()
    	data, err := ioutil.ReadFile(filename)
    	if err != nil {
    		return err
    	}

    	// Send the contents of the audio file with the encoding and
    	// sample rate information to be transcribed.
    	req := &speechpb.LongRunningRecognizeRequest{
    		Config: &speechpb.RecognitionConfig{
    			Encoding:        speechpb.RecognitionConfig_LINEAR16,
    			SampleRateHertz: 16000,
    			LanguageCode:    "en-US",
    		},
    		Audio: &speechpb.RecognitionAudio{
    			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
    		},
    	}

    	op, err := client.LongRunningRecognize(ctx, req)
    	if err != nil {
    		return err
    	}
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return err
    	}

    	// Print the results.
    	for _, result := range resp.Results {
    		for _, alt := range result.Alternatives {
    			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
    		}
    	}
    	return nil
    }
    

Java

    /**
     * Performs non-blocking speech recognition on raw PCM audio and prints the transcription. Note
     * that transcription is limited to 60 seconds of audio.
     *
     * @param fileName the path to a PCM audio file to transcribe.
     */
    public static void asyncRecognizeFile(String fileName) throws Exception {
      // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
      try (SpeechClient speech = SpeechClient.create()) {

        Path path = Paths.get(fileName);
        byte[] data = Files.readAllBytes(path);
        ByteString audioBytes = ByteString.copyFrom(data);

        // Configure request with local raw PCM audio
        RecognitionConfig config =
            RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .build();
        RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(audioBytes).build();

        // Use non-blocking call for getting file transcription
        OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
            speech.longRunningRecognizeAsync(config, audio);

        while (!response.isDone()) {
          System.out.println("Waiting for response...");
          Thread.sleep(10000);
        }

        List<SpeechRecognitionResult> results = response.get().getResultsList();

        for (SpeechRecognitionResult result : results) {
          // There can be several alternative transcripts for a given chunk of speech. Just use the
          // first (most likely) one here.
          SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
          System.out.printf("Transcription: %s%n", alternative.getTranscript());
        }
      }
    }

Node.js

    // Imports the Google Cloud client library
    const speech = require('@google-cloud/speech');
    const fs = require('fs');

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    };
    const audio = {
      content: fs.readFileSync(filename).toString('base64'),
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file. This creates a recognition job that you
    // can wait for now, or get its result later.
    const [operation] = await client.longRunningRecognize(request);

    // Get a Promise representation of the final result of the job
    const [response] = await operation.promise();
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);

PHP

    use Google\Cloud\Speech\V1\SpeechClient;
    use Google\Cloud\Speech\V1\RecognitionAudio;
    use Google\Cloud\Speech\V1\RecognitionConfig;
    use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

    /** Uncomment and populate these variables in your code */
    // $audioFile = 'path to an audio file';

    // change these variables if necessary
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    // get contents of a file into a string
    $content = file_get_contents($audioFile);

    // set string as audio content
    $audio = (new RecognitionAudio())
        ->setContent($content);

    // set config
    $config = (new RecognitionConfig())
        ->setEncoding($encoding)
        ->setSampleRateHertz($sampleRateHertz)
        ->setLanguageCode($languageCode);

    // create the speech client
    $client = new SpeechClient();

    // create the asynchronous recognize operation
    $operation = $client->longRunningRecognize($config, $audio);
    $operation->pollUntilComplete();

    if ($operation->operationSucceeded()) {
        $response = $operation->getResult();

        // each result is for a consecutive portion of the audio. iterate
        // through them to get the transcripts for the entire audio file.
        foreach ($response->getResults() as $result) {
            $alternatives = $result->getAlternatives();
            $mostLikely = $alternatives[0];
            $transcript = $mostLikely->getTranscript();
            $confidence = $mostLikely->getConfidence();
            printf('Transcript: %s' . PHP_EOL, $transcript);
            printf('Confidence: %s' . PHP_EOL, $confidence);
        }
    } else {
        print_r($operation->getError());
    }

    $client->close();

Python

    from google.cloud import speech_v1
    from google.cloud.speech_v1 import enums
    import io

    def sample_long_running_recognize(local_file_path):
        """
        Transcribe a long audio file using asynchronous speech recognition

        Args:
          local_file_path Path to local audio file, e.g. /path/audio.wav
        """

        client = speech_v1.SpeechClient()

        # local_file_path = 'resources/brooklyn_bridge.raw'

        # The language of the supplied audio
        language_code = "en-US"

        # Sample rate in Hertz of the audio data sent
        sample_rate_hertz = 16000

        # Encoding of audio data sent. This sample sets this explicitly.
        # This field is optional for FLAC and WAV audio formats.
        encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16
        config = {
            "language_code": language_code,
            "sample_rate_hertz": sample_rate_hertz,
            "encoding": encoding,
        }
        with io.open(local_file_path, "rb") as f:
            content = f.read()
        audio = {"content": content}

        operation = client.long_running_recognize(config, audio)

        print(u"Waiting for operation to complete...")
        response = operation.result()

        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))

    

Ruby

    # audio_file_path = "Path to file on which to perform speech recognition"

    require "google/cloud/speech"

    speech = Google::Cloud::Speech.new

    audio_file = File.binread audio_file_path
    config     = { encoding:          :LINEAR16,
                   sample_rate_hertz: 16_000,
                   language_code:     "en-US" }
    audio      = { content: audio_file }

    operation = speech.long_running_recognize config, audio

    puts "Operation started"

    operation.wait_until_done!

    raise operation.results.message if operation.error?

    results = operation.response.results

    alternatives = results.first.alternatives
    alternatives.each do |alternative|
      puts "Transcription: #{alternative.transcript}"
    end