Select a transcription model

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Transcription models

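Speech-to-Text detects the words in an audio clip by comparing the input audio to one of many machine learning models. Each model has been trained by analyzing a very large number of samples (in this case, many audio recordings of people speaking).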

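Speech-to-Text has specialized models trained on audio from specific sources, such as phone calls or videos. Because of this training process, these specialized models produce better results when applied to similar kinds of audio data.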

For example, Speech-to-Text has transcription models trained to recognize speech recorded over the phone. When Speech-to-Text transcribes phone audio with the telephony or telephony_short model, the results are more accurate than when it transcribes the same audio with the latest_short or latest_long model.

The following table shows the transcription models available in Speech-to-Text.

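Model name	Description
`latest_long`	Use this model for any kind of long-form content, such as media or spontaneous conversation. Consider using it in place of the video model, especially when the video model is not available in your target language. You can also use it in place of the default model.
`latest_short`	Use this model for short utterances that are a few seconds long. It is useful for capturing commands and other single-shot speech use cases. Consider using it instead of the command and search model.
`telephony`	Improved version of the "phone_call" model. Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
`telephony_short`	Dedicated version of the modern "telephony" model for short or single-word utterances in audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
`medical_dictation`	Use this model to transcribe notes dictated by a medical professional. This is a premium model that costs more than the standard rate. See the pricing page for details.
`medical_conversation`	Use this model to transcribe a conversation between a medical professional and a patient. This is a premium model that costs more than the standard rate. See the pricing page for details.
The following models are based on earlier, non-Conformer architectures and are kept mainly for backward compatibility.
`command_and_search`	Best for short or single-word utterances, such as voice commands or voice search.
`default`	Best for audio that does not fit the other audio models, such as long-form recordings or dictation. The default model produces transcription results for any type of audio, including audio such as video clips for which a specifically tailored model exists. However, recognizing video clip audio with the default model yields lower-quality results than using the video model. Ideally, the audio is high-fidelity and recorded at a 16 kHz or greater sampling rate.
`phone_call`	Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
`video`	Best for audio from video clips or other sources (such as podcasts) that have multiple speakers. This model is also often the best choice for audio recorded with a high-quality microphone or audio with a lot of background noise. For best results, provide audio recorded at a 16,000 Hz or greater sampling rate.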
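The guidance in the table can be condensed into a small lookup. The following is only a sketch: the function name `suggest_model` and the `source` labels ("phone", "medical_dictation", "medical_conversation") are hypothetical names chosen for illustration and are not part of the Speech-to-Text API. It returns only model names that appear in the table.

```python
def suggest_model(source: str, short_utterance: bool = False) -> str:
    """Return a model name from the table for a coarse audio description.

    `source` is a hypothetical label used only in this sketch:
    "phone", "medical_dictation", "medical_conversation", or anything else.
    """
    if source == "phone":
        # Phone audio: the telephony models are trained on phone call recordings.
        return "telephony_short" if short_utterance else "telephony"
    if source in ("medical_dictation", "medical_conversation"):
        # The medical models share their names with the labels used here.
        return source
    # Everything else falls back to the general-purpose latest_* models.
    return "latest_short" if short_utterance else "latest_long"


print(suggest_model("phone"))
print(suggest_model("podcast"))
print(suggest_model("voice_command", short_utterance=True))
```

Whether a given model is actually available depends on the target language, so a real application would still need to check availability rather than rely on a mapping like this one.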

Select a model for audio transcription

To specify the model to use for audio transcription, set the model field in the request's RecognitionConfig parameters to one of the allowed values (latest_long, latest_short, telephony, telephony_short). Speech-to-Text supports model selection for all of its speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming.
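As a rough illustration of where the model field sits in a request, the sketch below assembles a JSON request body with the standard library only. The helper name `build_recognize_body` and the local `KNOWN_MODELS` check are assumptions made for this example; the set simply lists the model names from the table on this page, and the service itself remains the authority on which values it accepts.

```python
import json

# Model names copied from the table on this page (an illustrative local list).
KNOWN_MODELS = {
    "latest_long", "latest_short", "telephony", "telephony_short",
    "medical_dictation", "medical_conversation",
    "command_and_search", "default", "phone_call", "video",
}


def build_recognize_body(gcs_uri, model, language_code="en-US",
                         sample_rate_hertz=16000, encoding="LINEAR16"):
    """Return a dict shaped like a speech:recognize request body.

    Hypothetical helper: rejects model names that are not in KNOWN_MODELS
    before any request is sent.
    """
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return {
        "config": {
            "encoding": encoding,
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "model": model,  # the field that selects the transcription model
        },
        "audio": {"uri": gcs_uri},
    }


body = build_recognize_body(
    "gs://cloud-samples-tests/speech/Google_Gnome.wav", "video")
print(json.dumps(body, indent=4))
```

Validating the name locally only catches typos early; it does not confirm that the model is available for the chosen language.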

Transcribe a local audio file

Protocol

For complete details, see the speech:recognize API endpoint.

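To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following is an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.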

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

For more information on configuring the request body, see the RecognitionConfig reference documentation.

If the request succeeds, the server returns a 200 OK HTTP status code and a response in JSON format.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}
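The response above nests each transcript under results, then alternatives. The following sketch shows one way to pull out the first alternative of each result using only the standard library. The function name `best_transcripts` is hypothetical, and the sample string is a shortened stand-in for the response shown above, not real service output.

```python
import json

# Shortened stand-in for the JSON response shown above.
response_text = """
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from Netflix to my TV",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}
"""


def best_transcripts(response: dict) -> list:
    """Return (transcript, confidence) for the first alternative of each result."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:  # skip results that carry no alternatives
            top = alternatives[0]
            pairs.append((top.get("transcript", ""), top.get("confidence")))
    return pairs


for transcript, confidence in best_transcripts(json.loads(response_text)):
    print(f"{confidence}: {transcript}")
```

Using `.get` with defaults keeps the sketch from raising on a response that has no results, which is a defensive choice for this example rather than documented behavior.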

Go

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


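import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// modelSelection transcribes a local audio file synchronously with the selected model.
func modelSelection(w io.Writer) error {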
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/Google_Gnome.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to an audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

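// Detects speech in the audio file.
// `await` must run inside an async function in a CommonJS module.
async function transcribe() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribe().catch(console.error);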

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

from google.cloud import speech

# Instantiates a client
client = speech.SpeechClient()
# Reads a file as bytes
with open("resources/Google_Gnome.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")
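The samples on this page hard-code a 16000 Hz sample rate, and the comments in the Java sample note that the configured rate must match the value in the file header. For WAV files, one way to avoid a mismatch is to read the rate from the header with Python's standard `wave` module. This is a minimal sketch: the helper name `wav_sample_rate` is hypothetical, and the demo file is generated locally so the example runs without any audio assets.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (in Hz) recorded in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Demo: write one second of 16-bit mono silence at 16 kHz, then read the rate back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit PCM
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

print(wav_sample_rate(demo_path))  # 16000
```

The returned value could then be passed as `sample_rate_hertz` when building the recognition config, instead of a literal.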

Additional languages

C#: Follow the C# setup instructions on the client libraries page, and then see the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then see the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then see the Speech-to-Text reference documentation for Ruby.

Transcribe a Cloud Storage audio file

Go


import (
	"context"
	"fmt"
	"io"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribeModelSelectionGCS transcribes the given Cloud Storage audio file
// asynchronously with the selected model.
func transcribeModelSelectionGCS(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: "gs://cloud-samples-tests/speech/Google_Gnome.wav"},
	}

	// The speech recognition model to use
	// See, https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model
	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz: 16000,
		LanguageCode:    "en-US",
		Model:           "video",
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}
	for i, result := range response.Results {
		alternative := result.Alternatives[0]
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "First alternative of result %d\n", i)
		fmt.Fprintf(w, "Transcript: %s\n", alternative.Transcript)
	}
	return nil
}