Selecting a transcription model

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Transcription models

Speech-to-Text detects words in an audio clip by comparing the input against one of several machine learning models. Each model has been trained by analyzing millions of examples, in this case a very large number of audio recordings of human speech.

Speech-to-Text has specialized models that are trained on audio from specific sources, such as phone calls or videos. Because of this training, these specialized models produce better results when applied to similar kinds of audio data.

For example, Speech-to-Text has a transcription model trained to recognize speech recorded over the phone. When Speech-to-Text uses this model to transcribe phone call audio, it produces significantly better results than it would with any of the other models.

The following table lists the transcription models available for use with Speech-to-Text.

Model name          Description

command_and_search  Best for short or single-word utterances such as voice commands or voice search.

phone_call          Best for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate.

video               Best for audio that originated from video or that includes multiple speakers. Ideally the audio is recorded at a 16 kHz or higher sampling rate. This is a premium model that costs more than the standard rate; see the pricing page for details.

default             Best for audio that does not fit the other models, such as long-form audio or dictation. Ideally the audio is high-fidelity, recorded at a 16 kHz or higher sampling rate.
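
As a concrete illustration of matching the model to the audio source, the following is a minimal sketch of a request configuration for telephone audio. It assumes the pre-2.0 google-cloud-speech Python client used in the samples later on this page; the Cloud Storage URI is hypothetical.

    from google.cloud import speech_v1

    client = speech_v1.SpeechClient()

    # Phone-call recordings are typically 8 kHz, so pair them with the
    # "phone_call" model. The sample rate must match the audio file itself.
    config = {
        "model": "phone_call",
        "encoding": speech_v1.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        "sample_rate_hertz": 8000,
        "language_code": "en-US",
    }
    audio = {"uri": "gs://my-bucket/support-call.wav"}  # hypothetical URI

    response = client.recognize(config, audio)
    for result in response.results:
        print(result.alternatives[0].transcript)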

Selecting a model for audio transcription

To specify a particular model for audio transcription, set the model field in the RecognitionConfig parameters of your request to one of the allowed values: video, phone_call, command_and_search, or default. Speech-to-Text supports model selection for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming.
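
For instance, the same model field applies to asynchronous recognition through speech:longrunningrecognize. The sketch below assumes the pre-2.0 google-cloud-speech Python client used elsewhere on this page and a hypothetical Cloud Storage URI.

    from google.cloud import speech_v1

    client = speech_v1.SpeechClient()

    # The "model" field works the same way for long-running recognition.
    config = {
        "model": "video",            # or phone_call, command_and_search, default
        "language_code": "en-US",
    }
    audio = {"uri": "gs://my-bucket/lecture.wav"}  # hypothetical URI

    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=90)

    for result in response.results:
        print(result.alternatives[0].transcript)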

Transcribing a local audio file

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

    curl -s -H "Content-Type: application/json" \
        -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        https://speech.googleapis.com/v1/speech:recognize \
        --data '{
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": "video"
        },
        "audio": {
            "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
        }
    }'
    

For more information on configuring the request body, see the RecognitionConfig reference documentation.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format.

    {
      "results": [
        {
          "alternatives": [
            {
              "transcript": "OK Google stream stranger things from
                Netflix to my TV okay stranger things from
                Netflix playing on TV from the people that brought you
                Google home comes the next evolution of the smart home
                and it's just outside your window me Google know hi
                how can I help okay no what's the weather like outside
                the weather outside is sunny and 76 degrees he's right
                okay no turn on the hose I'm holding sure okay no I'm can
                I eat this lemon tree leaf yes what about this Daisy yes
                but I wouldn't recommend it but I could eat it okay
                Nomad milk to my shopping list I'm sorry that sounds like
                an indoor request I keep doing that sorry you do keep
                doing that okay no is this compost really we're all
                compost if you think about it pretty much everything is
                made up of organic matter and will return",
              "confidence": 0.9251011
            }
          ]
        }
      ]
    }
    

C#

    static object SyncRecognizeModelSelection(string filePath, string model)
    {
        var speech = SpeechClient.Create();
        var response = speech.Recognize(new RecognitionConfig()
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en",
            // The `model` value must be one of the following:
            // "video", "phone_call", "command_and_search", "default"
            Model = model
        }, RecognitionAudio.FromFile(filePath));
        foreach (var result in response.Results)
        {
            foreach (var alternative in result.Alternatives)
            {
                Console.WriteLine(alternative.Transcript);
            }
        }
        return 0;
    }

Go


    func modelSelection(w io.Writer, path string) error {
    	ctx := context.Background()

    	client, err := speech.NewClient(ctx)
    	if err != nil {
    		return fmt.Errorf("NewClient: %v", err)
    	}

    	// path = "../testdata/Google_Gnome.wav"
    	data, err := ioutil.ReadFile(path)
    	if err != nil {
    		return fmt.Errorf("ReadFile: %v", err)
    	}

    	req := &speechpb.RecognizeRequest{
    		Config: &speechpb.RecognitionConfig{
    			Encoding:        speechpb.RecognitionConfig_LINEAR16,
    			SampleRateHertz: 16000,
    			LanguageCode:    "en-US",
    			Model:           "video",
    		},
    		Audio: &speechpb.RecognitionAudio{
    			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
    		},
    	}

    	resp, err := client.Recognize(ctx, req)
    	if err != nil {
    		return fmt.Errorf("Recognize: %v", err)
    	}

    	for i, result := range resp.Results {
    		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
    		fmt.Fprintf(w, "Result %d\n", i+1)
    		for j, alternative := range result.Alternatives {
    			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
    		}
    	}
    	return nil
    }
    

Java

    /**
     * Performs transcription of the given audio file synchronously with the selected model.
     *
     * @param fileName the path to a audio file to transcribe
     */
    public static void transcribeModelSelection(String fileName) throws Exception {
      Path path = Paths.get(fileName);
      byte[] content = Files.readAllBytes(path);

      try (SpeechClient speech = SpeechClient.create()) {
        // Configure request with video media type
        RecognitionConfig recConfig =
            RecognitionConfig.newBuilder()
                // encoding may either be omitted or must match the value in the file header
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                // sample rate hertz may be either be omitted or must match the value in the file
                // header
                .setSampleRateHertz(16000)
                .setModel("video")
                .build();

        RecognitionAudio recognitionAudio =
            RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

        RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
        // Just print the first result here.
        SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
        // There can be several alternative transcripts for a given chunk of speech. Just use the
        // first (most likely) one here.
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcript : %s\n", alternative.getTranscript());
      }
    }

Node.js

    // Imports the Google Cloud client library for Beta API
    /**
     * TODO(developer): Update client library import to use new
     * version of API when desired features become available
     */
    const speech = require('@google-cloud/speech').v1p1beta1;
    const fs = require('fs');

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
    // const model = 'Model to use, e.g. phone_call, video, default';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
      model: model,
    };
    const audio = {
      content: fs.readFileSync(filename).toString('base64'),
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file
    const [response] = await client.recognize(request);
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log('Transcription: ', transcription);

PHP

    use Google\Cloud\Speech\V1\SpeechClient;
    use Google\Cloud\Speech\V1\RecognitionAudio;
    use Google\Cloud\Speech\V1\RecognitionConfig;
    use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

    /** Uncomment and populate these variables in your code */
    // $audioFile = 'path to an audio file';
    // $model = 'video';

    // change these variables if necessary
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    // get contents of a file into a string
    $content = file_get_contents($audioFile);

    // set string as audio content
    $audio = (new RecognitionAudio())
        ->setContent($content);

    // set config
    $config = (new RecognitionConfig())
        ->setEncoding($encoding)
        ->setSampleRateHertz($sampleRateHertz)
        ->setLanguageCode($languageCode)
        ->setModel($model);

    // create the speech client
    $client = new SpeechClient();

    // make the API call
    $response = $client->recognize($config, $audio);
    $results = $response->getResults();

    // print results
    foreach ($results as $result) {
        $alternatives = $result->getAlternatives();
        $mostLikely = $alternatives[0];
        $transcript = $mostLikely->getTranscript();
        $confidence = $mostLikely->getConfidence();
        printf('Transcript: %s' . PHP_EOL, $transcript);
        printf('Confidence: %s' . PHP_EOL, $confidence);
    }

    $client->close();

Python

    from google.cloud import speech_v1
    import io

    def sample_recognize(local_file_path, model):
        """
        Transcribe a short audio file using a specified transcription model

        Args:
          local_file_path Path to local audio file, e.g. /path/audio.wav
          model The transcription model to use, e.g. video, phone_call, default
          For a list of available transcription models, see:
          https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
        """

        client = speech_v1.SpeechClient()

        # local_file_path = 'resources/hello.wav'
        # model = 'phone_call'

        # The language of the supplied audio
        language_code = "en-US"
        config = {"model": model, "language_code": language_code}
        with io.open(local_file_path, "rb") as f:
            content = f.read()
        audio = {"content": content}

        response = client.recognize(config, audio)
        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))

    

Ruby

    # file_path = "path/to/audio.wav"
    # model     = "video"

    require "google/cloud/speech"

    speech = Google::Cloud::Speech.new

    config = {
      encoding:          :LINEAR16,
      sample_rate_hertz: 16_000,
      language_code:     "en-US",
      model:             model
    }

    file  = File.binread file_path
    audio = { content: file }

    operation = speech.long_running_recognize config, audio

    puts "Operation started"

    operation.wait_until_done!

    raise operation.results.message if operation.error?

    results = operation.response.results

    results.each_with_index do |result, i|
      alternative = result.alternatives.first
      puts "-" * 20
      puts "First alternative of result #{i}"
      puts "Transcript: #{alternative.transcript}"
    end

Transcribing an audio file in Google Cloud Storage

Java

    /**
     * Performs transcription of the remote audio file asynchronously with the selected model.
     *
     * @param gcsUri the path to the remote audio file to transcribe.
     */
    public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
      try (SpeechClient speech = SpeechClient.create()) {

        // Configure request with video media type
        RecognitionConfig config =
            RecognitionConfig.newBuilder()
                // encoding may either be omitted or must match the value in the file header
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                // sample rate hertz may be either be omitted or must match the value in the file
                // header
                .setSampleRateHertz(16000)
                .setModel("video")
                .build();

        RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

        // Use non-blocking call for getting file transcription
        OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
            speech.longRunningRecognizeAsync(config, audio);

        while (!response.isDone()) {
          System.out.println("Waiting for response...");
          Thread.sleep(10000);
        }

        List<SpeechRecognitionResult> results = response.get().getResultsList();

        // Just print the first result here.
        SpeechRecognitionResult result = results.get(0);
        // There can be several alternative transcripts for a given chunk of speech. Just use the
        // first (most likely) one here.
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcript : %s\n", alternative.getTranscript());
      }
    }

Node.js

    // Imports the Google Cloud client library for Beta API
    /**
     * TODO(developer): Update client library import to use new
     * version of API when desired features become available
     */
    const speech = require('@google-cloud/speech').v1p1beta1;

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const gcsUri = 'gs://my-bucket/audio.raw';
    // const model = 'Model to use, e.g. phone_call, video, default';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
      model: model,
    };
    const audio = {
      uri: gcsUri,
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file
    const [response] = await client.recognize(request);
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log('Transcription: ', transcription);

Python

    from google.cloud import speech_v1

    def sample_recognize(storage_uri, model):
        """
        Transcribe a short audio file from Cloud Storage using a specified
        transcription model

        Args:
          storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
          model The transcription model to use, e.g. video, phone_call, default
          For a list of available transcription models, see:
          https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
        """

        client = speech_v1.SpeechClient()

        # storage_uri = 'gs://cloud-samples-data/speech/hello.wav'
        # model = 'phone_call'

        # The language of the supplied audio
        language_code = "en-US"
        config = {"model": model, "language_code": language_code}
        audio = {"uri": storage_uri}

        response = client.recognize(config, audio)
        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))