This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.
Transcription models
Speech-to-Text detects words in an audio clip by comparing the input to one of many machine learning models. Each model has been trained on millions of examples, in this case audio recordings of people speaking.
Speech-to-Text has specialized models that are trained on audio from specific sources. These models provide better results when applied to audio similar to the data they were trained on.
For example, Speech-to-Text has a transcription model trained to recognize speech recorded over the phone. When Speech-to-Text uses the `telephony` model to transcribe phone audio, it produces more accurate transcription results than if it had transcribed the phone audio using the `short` or `long` models.
The following table shows the transcription models available for use with Speech-to-Text.
Model name | Description |
---|---|
long | Use this model for any type of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the `video` or the `default` model, especially if they aren't available in your target language. |
short | Use this model for short utterances that are a few seconds in length. It is useful for capturing commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model. |
telephony | Use this model for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications. |
medical_dictation | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
medical_conversation | Use this model for conversations between a medical provider, for example, a doctor or nurse, and a patient. Use the `medical_conversation` model when both a provider and a patient are speaking. Words uttered by each speaker are automatically detected and labeled. |
chirp | Use our Universal large Speech Model (USM) for state-of-the-art non-streaming transcription of linguistically diverse content, with multilingual capabilities. |
chirp_telephony | Universal large Speech Model (USM) fine-tuned for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
chirp_2 | Use the next generation of our Universal large Speech Model (USM), powered by Gemini, for non-streaming transcription and translation of linguistically diverse content, with multilingual capabilities. |
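As a rough illustration of how the guidance in the table might be turned into a selection rule, the following Python sketch maps a clip's source and length to a model name. The helper `suggest_model`, its `source` labels, and the 5-second cutoff for "a few seconds" are all assumptions made for this example. They are not part of Speech-to-Text, and the sketch leaves out the `chirp` family of models.

```python
def suggest_model(source: str, duration_seconds: float) -> str:
    """Suggest a model name from the table (hypothetical helper, not an API call)."""
    # Specialized models are matched first, because they are trained on
    # audio from that specific kind of source.
    if source == "phone":
        return "telephony"
    if source == "medical_dictation":
        return "medical_dictation"
    if source == "medical_conversation":
        return "medical_conversation"
    # The table only says "a few seconds"; 5 seconds is an assumed cutoff.
    if duration_seconds <= 5.0:
        return "short"
    # Everything else is treated as long-form content.
    return "long"


if __name__ == "__main__":
    print(suggest_model("phone", 180.0))
    print(suggest_model("microphone", 2.5))
    print(suggest_model("microphone", 600.0))
```

In practice, also confirm that the suggested model is available for your target language before using it.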
Select a model for audio transcription
The model is specified by the Recognizer used for the recognition request. Call speech/projects.locations.recognizers/create to create a recognizer, and use the `model` field to specify the model. Valid models for each language can be found in the Supported Languages table.
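The following Python sketch shows what a request to create a recognizer with a chosen model might look like. It only builds the URL and JSON body and does not send anything. The helper `build_create_recognizer_request` is hypothetical, as are the project and recognizer IDs. The sketch also makes three assumptions: the global `speech.googleapis.com` endpoint, the `recognizerId` query parameter, and `model` and `languageCodes` placed under `defaultRecognitionConfig`. Check each of these against the API reference before relying on it.

```python
import json
from urllib.parse import urlencode


def build_create_recognizer_request(project_id, location, recognizer_id,
                                    model, language_codes):
    """Build the URL and JSON body for a create-recognizer call (sketch only)."""
    parent = f"projects/{project_id}/locations/{location}"
    # Assumed endpoint shape: POST /v2/{parent}/recognizers?recognizerId=...
    query = urlencode({"recognizerId": recognizer_id})
    url = f"https://speech.googleapis.com/v2/{parent}/recognizers?{query}"
    # Assumed body shape: the model is set in the recognizer's default config.
    body = {
        "defaultRecognitionConfig": {
            "model": model,
            "languageCodes": list(language_codes),
        }
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    # Placeholder IDs; replace them with your own values.
    url, body = build_create_recognizer_request(
        "my-project", "global", "my-recognizer", "telephony", ["en-US"]
    )
    print(url)
    print(body)
```

An authenticated POST of this body to the URL would then create the recognizer, and recognition requests that use that recognizer would pick up the `telephony` model.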