Select a transcription model

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Transcription models

Speech-to-Text detects words in an audio clip by comparing the input to one of many machine learning models. Each model has been trained by analyzing millions of examples, in this case audio recordings of people speaking.

Speech-to-Text has specialized models that are trained on audio from specific sources. These models produce better results when applied to audio that resembles the data they were trained on.

For example, Speech-to-Text has a transcription model trained to recognize speech recorded over the phone. When Speech-to-Text uses the telephony model to transcribe phone audio, it produces more accurate transcription results than if it had transcribed phone audio using the short or long models.

The following table shows the transcription models available for use with Speech-to-Text.

| Model name | Description |
| --- | --- |
| `long` | Use this model for any type of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the `video` or the `default` model, especially if they aren't available in your target language. |
| `short` | Use this model for short utterances that are a few seconds in length. It is useful for capturing commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model. |
| `telephony` | Use this model for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications. |
| `medical_dictation` | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
| `medical_conversation` | Use this model for conversations between a medical provider, for example, a doctor or nurse, and a patient. Use the `medical_conversation` model when both a provider and a patient are speaking. Words uttered by each speaker are automatically detected and labeled. |
| `chirp` | Use our Universal large Speech Model (USM) for state-of-the-art non-streaming transcription of linguistically diverse content, with multilingual capabilities. |
| `chirp_telephony` | Universal large Speech Model (USM) fine-tuned for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
| `chirp_2` | Use the next generation of our Universal large Speech Model (USM), powered by Gemini, for non-streaming transcription and translation of linguistically diverse content, with multilingual capabilities. |
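
As a rough illustration of how the table might be applied in application code, the following sketch maps a coarse description of the audio to a model name. The `pick_model` helper, the `MODEL_BY_AUDIO_KIND` mapping, and its category labels are hypothetical and not part of Speech-to-Text; only the model names come from the table. Falling back to `long` is an assumption made for this example.

```python
# Hypothetical lookup: the keys are illustrative labels chosen for this
# sketch, not API values. The model names are taken from the table above.
MODEL_BY_AUDIO_KIND = {
    "long_form": "long",
    "short_command": "short",
    "phone_call": "telephony",
    "medical_dictation": "medical_dictation",
    "medical_conversation": "medical_conversation",
}


def pick_model(audio_kind: str, fallback: str = "long") -> str:
    """Return the model name for an audio kind, or the fallback if unknown.

    The fallback of "long" is an assumption of this sketch, not a
    documented default.
    """
    return MODEL_BY_AUDIO_KIND.get(audio_kind, fallback)


if __name__ == "__main__":
    print(pick_model("phone_call"))
```

Before relying on a choice like this, confirm that the selected model is available for your target language.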

Select a model for audio transcription

The model is specified by the Recognizer used for the recognition request. Call speech/projects.locations.recognizers/create to create a recognizer, and use the model field to specify the model. Valid models for each language can be found in the Supported Languages table.
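
The following is a minimal sketch of what a request to the create method could look like, built with the Python standard library only. It constructs the URL and JSON body but does not send anything or handle authentication. The endpoint root, the `recognizerId` query parameter, and the `languageCodes` field name are assumptions based on the V2 REST surface and should be checked against the API reference; `build_create_recognizer_request` and the example values are hypothetical.

```python
import json
from urllib.parse import quote, urlencode

# Assumed REST root for the V2 API; verify against the API reference.
API_ROOT = "https://speech.googleapis.com/v2"


def build_create_recognizer_request(project_id, location, recognizer_id,
                                    model, language_codes):
    """Build the URL and JSON body for a recognizers.create call.

    Hypothetical helper: it only assembles the request and never sends it.
    """
    parent = (
        f"projects/{quote(project_id, safe='')}"
        f"/locations/{quote(location, safe='')}"
    )
    # recognizerId is assumed to be passed as a query parameter.
    query = urlencode({"recognizerId": recognizer_id})
    url = f"{API_ROOT}/{parent}/recognizers?{query}"
    # The model field selects the transcription model, as described above.
    # languageCodes is an assumed field name for the recognizer's languages.
    body = {"model": model, "languageCodes": list(language_codes)}
    return url, json.dumps(body)


if __name__ == "__main__":
    url, body = build_create_recognizer_request(
        "my-project", "global", "my-recognizer", "telephony", ["en-US"]
    )
    print(url)
    print(body)
```

In practice the request would be sent as an authenticated POST, or issued through a client library, which handles the URL and serialization for you.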