Getting word timestamps

This page describes how to get time offset values for audio transcribed by Speech-to-Text.

Speech-to-Text can include time offset (timestamp) values in the response text for your recognize request. Time offset values show the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it in the original audio. Speech-to-Text supports time offsets for all of its speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming.

Time offset values are only included for the first alternative provided in the recognition response.
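
For example, given a response obtained with word time offsets enabled, you can scan the word offsets of the first alternative to locate every occurrence of a word in the audio. The following Python sketch is illustrative only: the find_word helper and the target word are hypothetical, and it assumes a response object like the one produced by the Python sample later on this page.

    def find_word(response, target):
        """Print the start offset of each occurrence of `target` in the audio.

        `response` is a LongRunningRecognizeResponse; word-level offsets are
        only populated on the first alternative.
        """
        for result in response.results:
            alternative = result.alternatives[0]  # offsets are on the first alternative
            for word_info in alternative.words:
                if word_info.word.lower() == target.lower():
                    seconds = word_info.start_time.seconds + word_info.start_time.nanos * 1e-9
                    print(u"'{}' starts at {:.1f}s".format(word_info.word, seconds))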

To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your request configuration.

Protocol

Refer to the speech:longrunningrecognize API endpoint for complete details.

To perform asynchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

    curl -X POST \
         -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
         -H "Content-Type: application/json; charset=utf-8" \
         --data "{
      'config': {
        'language_code': 'en-US',
        'enableWordTimeOffsets': true
      },
      'audio':{
        'uri':'gs://gcs-test-data/vr.flac'
      }
    }" "https://speech.googleapis.com/v1/speech:longrunningrecognize"
    

See the RecognitionConfig and RecognitionAudio reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

    {
      "name": "7612202767953098924"
    }
    

where name is the name of the long-running operation created for the request.

Processing the vr.flac file takes approximately 30 seconds. To get the result of the operation, make a GET request to the https://speech.googleapis.com/v1/operations/ endpoint, replacing your-operation-name with the name received from your longrunningrecognize request.

    curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
         -H "Content-Type: application/json; charset=utf-8"
         "https://speech.googleapis.com/v1/operations/your-operation-name"
    

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

    {
      "name": "7612202767953098924",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
        "progressPercent": 100,
        "startTime": "2017-07-20T16:36:55.033650Z",
        "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
      },
      "done": true,
      "response": {
        "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
        "results": [
          {
            "alternatives": [
              {
                "transcript": "okay so what am I doing here...(etc)...",
                "confidence": 0.96596134,
                "words": [
                  {
                    "startTime": "1.400s",
                    "endTime": "1.800s",
                    "word": "okay"
                  },
                  {
                    "startTime": "1.800s",
                    "endTime": "2.300s",
                    "word": "so"
                  },
                  {
                    "startTime": "2.300s",
                    "endTime": "2.400s",
                    "word": "what"
                  },
                  {
                    "startTime": "2.400s",
                    "endTime": "2.600s",
                    "word": "am"
                  },
                  {
                    "startTime": "2.600s",
                    "endTime": "2.600s",
                    "word": "I"
                  },
                  {
                    "startTime": "2.600s",
                    "endTime": "2.700s",
                    "word": "doing"
                  },
                  {
                    "startTime": "2.700s",
                    "endTime": "3s",
                    "word": "here"
                  },
                  {
                    "startTime": "3s",
                    "endTime": "3.300s",
                    "word": "why"
                  },
                  {
                    "startTime": "3.300s",
                    "endTime": "3.400s",
                    "word": "am"
                  },
                  {
                    "startTime": "3.400s",
                    "endTime": "3.500s",
                    "word": "I"
                  },
                  {
                    "startTime": "3.500s",
                    "endTime": "3.500s",
                    "word": "here"
                  },
                  ...
                ]
              }
            ]
          },
          {
            "alternatives": [
              {
                "transcript": "so so what am I doing here...(etc)...",
                "confidence": 0.9642093,
              }
            ]
          }
        ]
      }
    }
    

If the operation has not completed, you can poll the endpoint by repeatedly making the GET request until the done property of the response is true.
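
As an illustration of this polling flow, the following Python sketch repeats the GET request until done is true. The access token and operation name are placeholders, and the use of the requests library and the two-second interval are assumptions made for the sketch, not requirements of the API.

    import time

    import requests

    # Placeholders: obtain a token with `gcloud auth application-default print-access-token`
    # and substitute the operation name returned by your longrunningrecognize request.
    access_token = "your-access-token"
    operation_name = "your-operation-name"

    url = "https://speech.googleapis.com/v1/operations/" + operation_name
    headers = {"Authorization": "Bearer " + access_token}

    while True:
        operation = requests.get(url, headers=headers).json()
        if operation.get("done"):
            break
        time.sleep(2)  # wait briefly before polling again

    # The transcript and word time offsets are under operation["response"]["results"].
    print(operation["response"])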

gcloud command

Refer to the recognize-long-running command for complete details.

To perform asynchronous speech recognition, use the gcloud command-line tool, providing the path of a local file or a Google Cloud Storage URL. Include the --include-word-time-offsets flag.

    gcloud ml speech recognize-long-running \
        'gs://cloud-samples-tests/speech/brooklyn.flac' \
        --language-code='en-US' --include-word-time-offsets --async
    

If the request is successful, the server returns the ID of the long-running operation in JSON format:

    {
      "name": OPERATION_ID
    }
    

You can then get information about the operation by running the following command:

    gcloud ml speech operations describe OPERATION_ID
    

You can also poll the operation until it completes by running the following command:

    gcloud ml speech operations wait OPERATION_ID
    

Once the operation completes, it returns a transcript of the audio in JSON format:

    {
      "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
      "results": [
        {
          "alternatives": [
            {
              "confidence": 0.9840146,
              "transcript": "how old is the Brooklyn Bridge",
              "words": [
                {
                  "endTime": "0.300s",
                  "startTime": "0s",
                  "word": "how"
                },
                {
                  "endTime": "0.600s",
                  "startTime": "0.300s",
                  "word": "old"
                },
                {
                  "endTime": "0.800s",
                  "startTime": "0.600s",
                  "word": "is"
                },
                {
                  "endTime": "0.900s",
                  "startTime": "0.800s",
                  "word": "the"
                },
                {
                  "endTime": "1.100s",
                  "startTime": "0.900s",
                  "word": "Brooklyn"
                },
                {
                  "endTime": "1.500s",
                  "startTime": "1.100s",
                  "word": "Bridge"
                }
              ]
            }
          ]
        }
      ]
    }
    

C#

static object AsyncRecognizeGcsWords(string storageUri)
    {
        var speech = SpeechClient.Create();
        var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en",
            EnableWordTimeOffsets = true,
        }, RecognitionAudio.FromStorageUri(storageUri));
        longOperation = longOperation.PollUntilCompleted();
        var response = longOperation.Result;
        foreach (var result in response.Results)
        {
            foreach (var alternative in result.Alternatives)
            {
                Console.WriteLine($"Transcript: { alternative.Transcript}");
                Console.WriteLine("Word details:");
                Console.WriteLine($" Word count:{alternative.Words.Count}");
                foreach (var item in alternative.Words)
                {
                    Console.WriteLine($"  {item.Word}");
                    Console.WriteLine($"    WordStartTime: {item.StartTime}");
                    Console.WriteLine($"    WordEndTime: {item.EndTime}");
                }
            }
        }
        return 0;
    }

Go


    func asyncWords(client *speech.Client, out io.Writer, gcsURI string) error {
    	ctx := context.Background()

    	// Send the contents of the audio file with the encoding and
    	// sample rate information to be transcribed.
    	req := &speechpb.LongRunningRecognizeRequest{
    		Config: &speechpb.RecognitionConfig{
    			Encoding:              speechpb.RecognitionConfig_LINEAR16,
    			SampleRateHertz:       16000,
    			LanguageCode:          "en-US",
    			EnableWordTimeOffsets: true,
    		},
    		Audio: &speechpb.RecognitionAudio{
    			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
    		},
    	}

    	op, err := client.LongRunningRecognize(ctx, req)
    	if err != nil {
    		return err
    	}
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return err
    	}

    	// Print the results.
    	for _, result := range resp.Results {
    		for _, alt := range result.Alternatives {
    			fmt.Fprintf(out, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
    			for _, w := range alt.Words {
    				fmt.Fprintf(out,
    					"Word: \"%v\" (startTime=%3f, endTime=%3f)\n",
    					w.Word,
    					float64(w.StartTime.Seconds)+float64(w.StartTime.Nanos)*1e-9,
    					float64(w.EndTime.Seconds)+float64(w.EndTime.Nanos)*1e-9,
    				)
    			}
    		}
    	}
    	return nil
    }
    

Java

/**
     * Performs non-blocking speech recognition on remote FLAC file and prints the transcription as
     * well as word time offsets.
     *
     * @param gcsUri the path to the remote FLAC audio file to transcribe.
     */
    public static void asyncRecognizeWords(String gcsUri) throws Exception {
      // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
      try (SpeechClient speech = SpeechClient.create()) {

        // Configure remote file request for FLAC
        RecognitionConfig config =
            RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.FLAC)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .setEnableWordTimeOffsets(true)
                .build();
        RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

        // Use non-blocking call for getting file transcription
        OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
            speech.longRunningRecognizeAsync(config, audio);
        while (!response.isDone()) {
          System.out.println("Waiting for response...");
          Thread.sleep(10000);
        }

        List<SpeechRecognitionResult> results = response.get().getResultsList();

        for (SpeechRecognitionResult result : results) {
          // There can be several alternative transcripts for a given chunk of speech. Just use the
          // first (most likely) one here.
          SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
          System.out.printf("Transcription: %s\n", alternative.getTranscript());
          for (WordInfo wordInfo : alternative.getWordsList()) {
            System.out.println(wordInfo.getWord());
            System.out.printf(
                "\t%s.%s sec - %s.%s sec\n",
                wordInfo.getStartTime().getSeconds(),
                wordInfo.getStartTime().getNanos() / 100000000,
                wordInfo.getEndTime().getSeconds(),
                wordInfo.getEndTime().getNanos() / 100000000);
          }
        }
      }
    }

Node.js

// Imports the Google Cloud client library
    const speech = require('@google-cloud/speech');

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const gcsUri = 'gs://my-bucket/audio.raw';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      enableWordTimeOffsets: true,
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
    };

    const audio = {
      uri: gcsUri,
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file. This creates a recognition job that you
    // can wait for now, or get its result later.
    const [operation] = await client.longRunningRecognize(request);

    // Get a Promise representation of the final result of the job
    const [response] = await operation.promise();
    response.results.forEach(result => {
      console.log(`Transcription: ${result.alternatives[0].transcript}`);
      result.alternatives[0].words.forEach(wordInfo => {
        // NOTE: If you have a time offset exceeding 2^32 seconds, use the
        // wordInfo.{x}Time.seconds.high to calculate seconds.
        const startSecs =
          `${wordInfo.startTime.seconds}` +
          '.' +
          wordInfo.startTime.nanos / 100000000;
        const endSecs =
          `${wordInfo.endTime.seconds}` +
          '.' +
          wordInfo.endTime.nanos / 100000000;
        console.log(`Word: ${wordInfo.word}`);
        console.log(`\t ${startSecs} secs - ${endSecs} secs`);
      });
    });

PHP

use Google\Cloud\Speech\V1\SpeechClient;
    use Google\Cloud\Speech\V1\RecognitionAudio;
    use Google\Cloud\Speech\V1\RecognitionConfig;
    use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

    /** Uncomment and populate these variables in your code */
    // $audioFile = 'path to an audio file';

    // change these variables if necessary
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    if (!extension_loaded('grpc')) {
        throw new \Exception('Install the grpc extension (pecl install grpc)');
    }

    // When true, time offsets for every word will be included in the response.
    $enableWordTimeOffsets = true;

    // get contents of a file into a string
    $content = file_get_contents($audioFile);

    // set string as audio content
    $audio = (new RecognitionAudio())
        ->setContent($content);

    // set config
    $config = (new RecognitionConfig())
        ->setEncoding($encoding)
        ->setSampleRateHertz($sampleRateHertz)
        ->setLanguageCode($languageCode)
        ->setEnableWordTimeOffsets($enableWordTimeOffsets);

    // create the speech client
    $client = new SpeechClient();

    // create the asynchronous recognize operation
    $operation = $client->longRunningRecognize($config, $audio);
    $operation->pollUntilComplete();

    if ($operation->operationSucceeded()) {
        $response = $operation->getResult();

        // each result is for a consecutive portion of the audio. iterate
        // through them to get the transcripts for the entire audio file.
        foreach ($response->getResults() as $result) {
            $alternatives = $result->getAlternatives();
            $mostLikely = $alternatives[0];
            $transcript = $mostLikely->getTranscript();
            $confidence = $mostLikely->getConfidence();
            printf('Transcript: %s' . PHP_EOL, $transcript);
            printf('Confidence: %s' . PHP_EOL, $confidence);
            foreach ($mostLikely->getWords() as $wordInfo) {
                $startTime = $wordInfo->getStartTime();
                $endTime = $wordInfo->getEndTime();
                printf('  Word: %s (start: %s, end: %s)' . PHP_EOL,
                    $wordInfo->getWord(),
                    $startTime->serializeToJsonString(),
                    $endTime->serializeToJsonString());
            }
        }
    } else {
        print_r($operation->getError());
    }

    $client->close();

Python

from google.cloud import speech_v1

    def sample_long_running_recognize(storage_uri):
        """
        Print start and end time of each word spoken in audio file from Cloud Storage

        Args:
          storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
        """

        client = speech_v1.SpeechClient()

        # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.flac'

        # When enabled, the first result returned by the API will include a list
        # of words and the start and end time offsets (timestamps) for those words.
        enable_word_time_offsets = True

        # The language of the supplied audio
        language_code = "en-US"
        config = {
            "enable_word_time_offsets": enable_word_time_offsets,
            "language_code": language_code,
        }
        audio = {"uri": storage_uri}

        operation = client.long_running_recognize(config, audio)

        print(u"Waiting for operation to complete...")
        response = operation.result()

        # The first result includes start and end time word offsets
        result = response.results[0]
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))
        # Print the start and end time of each word
        for word in alternative.words:
            print(u"Word: {}".format(word.word))
            print(
                u"Start time: {} seconds {} nanos".format(
                    word.start_time.seconds, word.start_time.nanos
                )
            )
            print(
                u"End time: {} seconds {} nanos".format(
                    word.end_time.seconds, word.end_time.nanos
                )
            )

    

Ruby

# storage_path = "Path to file in Cloud Storage, eg. gs://bucket/audio.raw"

    require "google/cloud/speech"

    speech = Google::Cloud::Speech.new

    config = { encoding:                 :LINEAR16,
               sample_rate_hertz:        16_000,
               language_code:            "en-US",
               enable_word_time_offsets: true }
    audio  = { uri: storage_path }

    operation = speech.long_running_recognize config, audio

    puts "Operation started"

    operation.wait_until_done!

    raise operation.results.message if operation.error?

    results = operation.response.results

    alternatives = results.first.alternatives
    alternatives.each do |alternative|
      puts "Transcription: #{alternative.transcript}"

      alternative.words.each do |word|
        start_time = word.start_time.seconds + word.start_time.nanos / 1_000_000_000.0
        end_time   = word.end_time.seconds + word.end_time.nanos / 1_000_000_000.0

        puts "Word: #{word.word} #{start_time} #{end_time}"
      end
    end