Selecting a transcription model

This page describes how to use a specific machine learning model for audio transcription requests sent to Speech-to-Text.

Transcription models

Speech-to-Text compares your input against one of several machine learning models to detect the words in an audio clip. Each model has been trained by analyzing millions of examples, in this case a very large number of recordings of real people speaking.

Speech-to-Text has specialized models trained on audio from specific sources, such as phone calls or video. Because of this training, these specialized models produce better results when applied to similar kinds of audio data.

For example, Speech-to-Text has a transcription model trained to recognize speech captured over the phone. When Speech-to-Text uses this model to transcribe phone audio, it produces noticeably better results than the other models would.

The following table lists the transcription models available to Speech-to-Text.

command_and_search: Best for short or single-word utterances, such as voice commands or voice search.

phone_call: Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).

video: Best for audio that originated from video or that includes multiple speakers. Ideally the audio is recorded at a 16 kHz or higher sampling rate. This is a premium model that costs more than the standard rate; see the pricing page for details.

default: Best for audio that does not fit the other models, such as long-form audio or dictation. Ideally the audio is high fidelity and recorded at a 16 kHz or higher sampling rate.

Selecting a model for audio transcription

To specify a particular model for audio transcription, set the model field of the RecognitionConfig parameter in your request to one of the allowed values: video, phone_call, command_and_search, or default. Speech-to-Text supports model selection for all of its speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming recognition.
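
Streaming recognition is not demonstrated in the language samples below. The following is a minimal sketch of how the same model field carries over to the streaming method. It assumes the pre-2.0 google-cloud-speech Python client library used in the Python samples on this page; the sample_streaming_recognize function name and the local_file_path and model arguments are placeholders, not part of the library.

    from google.cloud import speech_v1
    import io

    def sample_streaming_recognize(local_file_path, model):
        """Sketch: stream a local audio file using the chosen transcription model."""
        client = speech_v1.SpeechClient()

        # The model field is set on the same RecognitionConfig used by the
        # non-streaming methods; allowed values are "video", "phone_call",
        # "command_and_search", and "default".
        config = speech_v1.types.RecognitionConfig(
            encoding=speech_v1.enums.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
            model=model,
        )
        streaming_config = speech_v1.types.StreamingRecognitionConfig(config=config)

        with io.open(local_file_path, "rb") as f:
            content = f.read()

        # A real application would send smaller chunks as the audio is captured;
        # a single request is enough to illustrate model selection.
        requests = [speech_v1.types.StreamingRecognizeRequest(audio_content=content)]

        responses = client.streaming_recognize(streaming_config, requests)
        for response in responses:
            for result in response.results:
                print(u"Transcript: {}".format(result.alternatives[0].transcript))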

Performing transcription of a local audio file

Protocol

See the [speech:recognize] API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

    curl -s -H "Content-Type: application/json" \
        -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        https://speech.googleapis.com/v1/speech:recognize \
        --data '{
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": "video"
        },
        "audio": {
            "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
        }
    }'
    

For more information on configuring the request body, see the RecognitionConfig reference documentation.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format.

    {
      "results": [
        {
          "alternatives": [
            {
              "transcript": "OK Google stream stranger things from
                Netflix to my TV okay stranger things from
                Netflix playing on TV from the people that brought you
                Google home comes the next evolution of the smart home
                and it's just outside your window me Google know hi
                how can I help okay no what's the weather like outside
                the weather outside is sunny and 76 degrees he's right
                okay no turn on the hose I'm holding sure okay no I'm can
                I eat this lemon tree leaf yes what about this Daisy yes
                but I wouldn't recommend it but I could eat it okay
                Nomad milk to my shopping list I'm sorry that sounds like
                an indoor request I keep doing that sorry you do keep
                doing that okay no is this compost really we're all
                compost if you think about it pretty much everything is
                made up of organic matter and will return",
              "confidence": 0.9251011
            }
          ]
        }
      ]
    }
    

C#

    static object SyncRecognizeModelSelection(string filePath, string model)
    {
        var speech = SpeechClient.Create();
        var response = speech.Recognize(new RecognitionConfig()
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
            SampleRateHertz = 16000,
            LanguageCode = "en",
            // The `model` value must be one of the following:
            // "video", "phone_call", "command_and_search", "default"
            Model = model
        }, RecognitionAudio.FromFile(filePath));
        foreach (var result in response.Results)
        {
            foreach (var alternative in result.Alternatives)
            {
                Console.WriteLine(alternative.Transcript);
            }
        }
        return 0;
    }

Go


    func modelSelection(w io.Writer, path string) error {
    	ctx := context.Background()

    	client, err := speech.NewClient(ctx)
    	if err != nil {
    		return fmt.Errorf("NewClient: %v", err)
    	}

    	// path = "../testdata/Google_Gnome.wav"
    	data, err := ioutil.ReadFile(path)
    	if err != nil {
    		return fmt.Errorf("ReadFile: %v", err)
    	}

    	req := &speechpb.RecognizeRequest{
    		Config: &speechpb.RecognitionConfig{
    			Encoding:        speechpb.RecognitionConfig_LINEAR16,
    			SampleRateHertz: 16000,
    			LanguageCode:    "en-US",
    			Model:           "video",
    		},
    		Audio: &speechpb.RecognitionAudio{
    			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
    		},
    	}

    	resp, err := client.Recognize(ctx, req)
    	if err != nil {
    		return fmt.Errorf("Recognize: %v", err)
    	}

    	for i, result := range resp.Results {
    		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
    		fmt.Fprintf(w, "Result %d\n", i+1)
    		for j, alternative := range result.Alternatives {
    			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
    		}
    	}
    	return nil
    }
    

Java

    /**
     * Performs transcription of the given audio file synchronously with the selected model.
     *
     * @param fileName the path to an audio file to transcribe
     */
    public static void transcribeModelSelection(String fileName) throws Exception {
      Path path = Paths.get(fileName);
      byte[] content = Files.readAllBytes(path);

      try (SpeechClient speech = SpeechClient.create()) {
        // Configure request with video media type
        RecognitionConfig recConfig =
            RecognitionConfig.newBuilder()
                // encoding may either be omitted or must match the value in the file header
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                // sample rate hertz may either be omitted or must match the value in the file
                // header
                .setSampleRateHertz(16000)
                .setModel("video")
                .build();

        RecognitionAudio recognitionAudio =
            RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

        RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
        // Just print the first result here.
        SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
        // There can be several alternative transcripts for a given chunk of speech. Just use the
        // first (most likely) one here.
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcript : %s\n", alternative.getTranscript());
      }
    }

Node.js

    // Imports the Google Cloud client library for Beta API
    /**
     * TODO(developer): Update client library import to use new
     * version of API when desired features become available
     */
    const speech = require('@google-cloud/speech').v1p1beta1;
    const fs = require('fs');

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
    // const model = 'Model to use, e.g. phone_call, video, default';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
      model: model,
    };
    const audio = {
      content: fs.readFileSync(filename).toString('base64'),
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file
    const [response] = await client.recognize(request);
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log('Transcription: ', transcription);

PHP

    use Google\Cloud\Speech\V1\SpeechClient;
    use Google\Cloud\Speech\V1\RecognitionAudio;
    use Google\Cloud\Speech\V1\RecognitionConfig;
    use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

    /** Uncomment and populate these variables in your code */
    // $audioFile = 'path to an audio file';
    // $model = 'video';

    // change these variables if necessary
    $encoding = AudioEncoding::LINEAR16;
    $sampleRateHertz = 32000;
    $languageCode = 'en-US';

    // get contents of a file into a string
    $content = file_get_contents($audioFile);

    // set string as audio content
    $audio = (new RecognitionAudio())
        ->setContent($content);

    // set config
    $config = (new RecognitionConfig())
        ->setEncoding($encoding)
        ->setSampleRateHertz($sampleRateHertz)
        ->setLanguageCode($languageCode)
        ->setModel($model);

    // create the speech client
    $client = new SpeechClient();

    // make the API call
    $response = $client->recognize($config, $audio);
    $results = $response->getResults();

    // print results
    foreach ($results as $result) {
        $alternatives = $result->getAlternatives();
        $mostLikely = $alternatives[0];
        $transcript = $mostLikely->getTranscript();
        $confidence = $mostLikely->getConfidence();
        printf('Transcript: %s' . PHP_EOL, $transcript);
        printf('Confidence: %s' . PHP_EOL, $confidence);
    }

    $client->close();

Python

    from google.cloud import speech_v1
    import io

    def sample_recognize(local_file_path, model):
        """
        Transcribe a short audio file using a specified transcription model

        Args:
          local_file_path Path to local audio file, e.g. /path/audio.wav
          model The transcription model to use, e.g. video, phone_call, default
          For a list of available transcription models, see:
          https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
        """

        client = speech_v1.SpeechClient()

        # local_file_path = 'resources/hello.wav'
        # model = 'phone_call'

        # The language of the supplied audio
        language_code = "en-US"
        config = {"model": model, "language_code": language_code}
        with io.open(local_file_path, "rb") as f:
            content = f.read()
        audio = {"content": content}

        response = client.recognize(config, audio)
        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))

    

Ruby

    # file_path = "path/to/audio.wav"
    # model     = "video"  # one of "video", "phone_call", "command_and_search", "default"

    require "google/cloud/speech"

    speech = Google::Cloud::Speech.new

    config = {
      encoding:          :LINEAR16,
      sample_rate_hertz: 16_000,
      language_code:     "en-US",
      model:             model
    }

    file  = File.binread file_path
    audio = { content: file }

    operation = speech.long_running_recognize config, audio

    puts "Operation started"

    operation.wait_until_done!

    raise operation.results.message if operation.error?

    results = operation.response.results

    results.each_with_index do |result, i|
      alternative = result.alternatives.first
      puts "-" * 20
      puts "First alternative of result #{i}"
      puts "Transcript: #{alternative.transcript}"
    end

Performing transcription of a Google Cloud Storage audio file

Java

    /**
     * Performs transcription of the remote audio file asynchronously with the selected model.
     *
     * @param gcsUri the path to the remote audio file to transcribe.
     */
    public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
      try (SpeechClient speech = SpeechClient.create()) {

        // Configure request with video media type
        RecognitionConfig config =
            RecognitionConfig.newBuilder()
                // encoding may either be omitted or must match the value in the file header
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                // sample rate hertz may either be omitted or must match the value in the file
                // header
                .setSampleRateHertz(16000)
                .setModel("video")
                .build();

        RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

        // Use non-blocking call for getting file transcription
        OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
            speech.longRunningRecognizeAsync(config, audio);

        while (!response.isDone()) {
          System.out.println("Waiting for response...");
          Thread.sleep(10000);
        }

        List<SpeechRecognitionResult> results = response.get().getResultsList();

        // Just print the first result here.
        SpeechRecognitionResult result = results.get(0);
        // There can be several alternative transcripts for a given chunk of speech. Just use the
        // first (most likely) one here.
        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
        System.out.printf("Transcript : %s\n", alternative.getTranscript());
      }
    }

Node.js

    // Imports the Google Cloud client library for Beta API
    /**
     * TODO(developer): Update client library import to use new
     * version of API when desired features become available
     */
    const speech = require('@google-cloud/speech').v1p1beta1;

    // Creates a client
    const client = new speech.SpeechClient();

    /**
     * TODO(developer): Uncomment the following lines before running the sample.
     */
    // const gcsUri = 'gs://my-bucket/audio.raw';
    // const model = 'Model to use, e.g. phone_call, video, default';
    // const encoding = 'Encoding of the audio file, e.g. LINEAR16';
    // const sampleRateHertz = 16000;
    // const languageCode = 'BCP-47 language code, e.g. en-US';

    const config = {
      encoding: encoding,
      sampleRateHertz: sampleRateHertz,
      languageCode: languageCode,
      model: model,
    };
    const audio = {
      uri: gcsUri,
    };

    const request = {
      config: config,
      audio: audio,
    };

    // Detects speech in the audio file
    const [response] = await client.recognize(request);
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log('Transcription: ', transcription);

Python

    from google.cloud import speech_v1

    def sample_recognize(storage_uri, model):
        """
        Transcribe a short audio file from Cloud Storage using a specified
        transcription model

        Args:
          storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
          model The transcription model to use, e.g. video, phone_call, default
          For a list of available transcription models, see:
          https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
        """

        client = speech_v1.SpeechClient()

        # storage_uri = 'gs://cloud-samples-data/speech/hello.wav'
        # model = 'phone_call'

        # The language of the supplied audio
        language_code = "en-US"
        config = {"model": model, "language_code": language_code}
        audio = {"uri": storage_uri}

        response = client.recognize(config, audio)
        for result in response.results:
            # First alternative is the most probable result
            alternative = result.alternatives[0]
            print(u"Transcript: {}".format(alternative.transcript))