音声文字変換モデルの選択

このページでは、Speech-to-Text への音声文字変換リクエストにおいて、特定の機械学習モデルを使用する方法について説明します。

音声文字変換モデル

Speech-to-Text は、入力音声を多数の機械学習モデルの 1 つと比較して、音声クリップ内の単語を検出します。各モデルは、膨大な数のサンプル(この場合は人の会話に関する多数の音声録音)を分析することでトレーニングを行ってきました。

Speech-to-Text には、特定のソース(電話での通話や動画など)の音声によってトレーニングされた特殊なモデルがあります。こうしたトレーニング プロセスのため、これらの特殊モデルは同様の音声データに適用された場合により良い結果をもたらします。

たとえば、Speech-to-Text には、電話でキャプチャされた音声を認識するようにトレーニングされた音声文字変換モデルがあります。Speech-to-Text でこのモデルを使用して電話音声の音声文字変換を行うと、他のモデルよりも著しく良い結果が得られます。

次の表に、Speech-to-Text で使用可能な音声文字変換モデルを示します。

モデル名 説明
command_and_search 音声コマンドや音声検索など、短い発話や 1 つの単語からなる発話に最適です。
phone_call 電話の通話音声に最適です(通常は 8 kHz のサンプリング レートで録音されています)。
video

動画の音声や話者が複数いる音声に最適です。音声は 16 kHz 以上のサンプリング レートで録音されていることが理想的です。

これは、標準レートよりも費用の高いプレミアム モデルです。詳細については、料金ページをご覧ください。

default 長時間の録音や口述といった、他の音声モデルに適合しない音声に最適です。16 kHz 以上のサンプリング レートで録音されたハイファイ音声であることが理想的です。

音声文字変換のモデルの選択

音声文字変換で使用する特定のモデルを指定するには、リクエストの RecognitionConfig パラメータの model フィールドに許される値のいずれか(videophone_callcommand_and_searchdefault)を設定する必要があります。Speech-to-Text では、speech:recognizespeech:longrunningrecognizeStreaming のどの音声認識メソッドでもモデルを選択できます。

ローカル音声ファイルの音声文字変換を実行する

プロトコル

詳細については、[speech:recognize] API エンドポイントをご覧ください。

同期音声認識を実行するには、POST リクエストを作成し、適切なリクエスト本文を指定します。次は、curl を使用した POST リクエストの例です。この例では、Google Cloud Cloud SDK を使用して、プロジェクト用に設定されたサービス アカウントのアクセス トークンを扱います。Cloud SDK のインストール、サービス アカウントがあるプロジェクトの設定、アクセス トークンの取得などの手順については、クイックスタートをご覧ください。

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

リクエスト本文の構成の詳細については、RecognitionConfig のリファレンス ドキュメントをご覧ください。

リクエストが成功すると、サーバーは 200 OK HTTP ステータス コードと JSON 形式のレスポンスを返します。

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}

C#

static object SyncRecognizeModelSelection(string filePath, string model)
{
    var speech = SpeechClient.Create();
    var response = speech.Recognize(new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 16000,
        LanguageCode = "en",
        // The `model` value must be one of the following:
        // "video", "phone_call", "command_and_search", "default"
        Model = model
    }, RecognitionAudio.FromFile(filePath));
    foreach (var result in response.Results)
    {
        foreach (var alternative in result.Alternatives)
        {
            Console.WriteLine(alternative.Transcript);
        }
    }
    return 0;
}

Go


func modelSelection(w io.Writer, path string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}

	// path = "../testdata/Google_Gnome.wav"
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return fmt.Errorf("ReadFile: %v", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("Recognize: %v", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to a audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may be either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);

PHP

use Google\Cloud\Speech\V1\SpeechClient;
use Google\Cloud\Speech\V1\RecognitionAudio;
use Google\Cloud\Speech\V1\RecognitionConfig;
use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

/** Uncomment and populate these variables in your code */
// $audioFile = 'path to an audio file';
// $model = 'video';

// change these variables if necessary
$encoding = AudioEncoding::LINEAR16;
$sampleRateHertz = 32000;
$languageCode = 'en-US';

// get contents of a file into a string
$content = file_get_contents($audioFile);

// set string as audio content
$audio = (new RecognitionAudio())
    ->setContent($content);

// set config
$config = (new RecognitionConfig())
    ->setEncoding($encoding)
    ->setSampleRateHertz($sampleRateHertz)
    ->setLanguageCode($languageCode)
    ->setModel($model);

// create the speech client
$client = new SpeechClient();

// make the API call
$response = $client->recognize($config, $audio);
$results = $response->getResults();

// print results
foreach ($results as $result) {
    $alternatives = $result->getAlternatives();
    $mostLikely = $alternatives[0];
    $transcript = $mostLikely->getTranscript();
    $confidence = $mostLikely->getConfidence();
    printf('Transcript: %s' . PHP_EOL, $transcript);
    printf('Confidence: %s' . PHP_EOL, $confidence);
}

$client->close();

Python

from google.cloud import speech_v1
import io

def sample_recognize(local_file_path, model):
    """
    Transcribe a short audio file using a specified transcription model

    Args:
      local_file_path Path to local audio file, e.g. /path/audio.wav
      model The transcription model to use, e.g. video, phone_call, default
      For a list of available transcription models, see:
      https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
    """

    client = speech_v1.SpeechClient()

    # local_file_path = 'resources/hello.wav'
    # model = 'phone_call'

    # The language of the supplied audio
    language_code = "en-US"
    config = {"model": model, "language_code": language_code}
    with io.open(local_file_path, "rb") as f:
        content = f.read()
    audio = {"content": content}

    response = client.recognize(config, audio)
    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))

Ruby

# file_path = "path/to/audio.wav"

require "google/cloud/speech"

speech = Google::Cloud::Speech.speech

config = {
  encoding:          :LINEAR16,
  sample_rate_hertz: 16_000,
  language_code:     "en-US",
  model:             model
}

file  = File.binread file_path
audio = { content: file }

operation = speech.long_running_recognize config: config, audio: audio

puts "Operation started"

operation.wait_until_done!

raise operation.results.message if operation.error?

results = operation.response.results

results.each_with_index do |result, i|
  alternative = result.alternatives.first
  puts "-" * 20
  puts "First alternative of result #{i}"
  puts "Transcript: #{alternative.transcript}"
end

Google Cloud Storage 音声ファイルの音声文字変換を実行する

Java

/**
 * Performs transcription of the remote audio file asynchronously with the selected model.
 *
 * @param gcsUri the path to the remote audio file to transcribe.
 */
public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure request with video media type
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may be either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    // Just print the first result here.
    SpeechRecognitionResult result = results.get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);

Python

from google.cloud import speech_v1

def sample_recognize(storage_uri, model):
    """
    Transcribe a short audio file from Cloud Storage using a specified
    transcription model

    Args:
      storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
      model The transcription model to use, e.g. video, phone_call, default
      For a list of available transcription models, see:
      https://cloud.google.com/speech-to-text/docs/transcription-model#transcription_models
    """

    client = speech_v1.SpeechClient()

    # storage_uri = 'gs://cloud-samples-data/speech/hello.wav'
    # model = 'phone_call'

    # The language of the supplied audio
    language_code = "en-US"
    config = {"model": model, "language_code": language_code}
    audio = {"uri": storage_uri}

    response = client.recognize(config, audio)
    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))