Select a transcription model

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Transcription models

Speech-to-Text detects words in an audio clip by comparing input to one of many machine learning models. Each model has been trained by analyzing millions of examples, in this case audio recordings of people speaking.

Speech-to-Text has specialized models that are trained on audio from specific sources. These models give better results when applied to audio that resembles the data they were trained on.

The following table shows the transcription models that are available for use with the Speech-to-Text V2 API.

Model name	Description
`chirp_3`	Use the latest generation of Google's multilingual Automatic Speech Recognition (ASR)-specific generative models, designed around user feedback and experience. Chirp 3 improves on the accuracy and speed of earlier Chirp models and provides diarization and automatic language detection.
`chirp_2`	Use the next generation of our Universal large Speech Model (USM), powered by our large language model (LLM) technology, for streaming and batch transcription and translation of linguistically diverse, multilingual content.
`telephony`	Use this model for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications.

The following models are based on earlier architectures; they aren't actively maintained and are primarily kept for legacy and backwards compatibility.

Model name	Description
`chirp`	Use our Universal large Speech Model (USM) for state-of-the-art non-streaming transcription of linguistically diverse, multilingual content.
`chirp_telephony`	Universal large Speech Model (USM) fine-tuned for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
`long`	Use this model for any type of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the `video` or `default` model, especially if they aren't available in your target language.
`short`	Use this model for short utterances that are a few seconds in length. It's useful for capturing commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model.
`telephony_short`	Dedicated version of the `telephony` model for short or even single-word utterances in audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications.
`medical_conversation`	Use this model for conversations between a medical provider (for example, a doctor or nurse) and a patient. Use the `medical_conversation` model when both a provider and a patient are speaking. Words uttered by each speaker are automatically detected and labeled.
`medical_dictation`	Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results.
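
As a quick way to check whether a model name belongs to the actively maintained group or the legacy group, the two tables above can be captured in a small lookup. This is a minimal sketch: the model names are copied from this page, but the helper `model_status` and the labels `"current"` and `"legacy"` are hypothetical and are not part of the Speech-to-Text API.

```python
# Model names copied from the tables above. The "current"/"legacy" labels and
# this helper are illustrative only; they are not part of the Speech-to-Text API.
CURRENT_MODELS = frozenset({"chirp_3", "chirp_2", "telephony"})
LEGACY_MODELS = frozenset({
    "chirp",
    "chirp_telephony",
    "long",
    "short",
    "telephony_short",
    "medical_conversation",
    "medical_dictation",
})


def model_status(model: str) -> str:
    """Return "current" or "legacy" for a model name listed on this page."""
    if model in CURRENT_MODELS:
        return "current"
    if model in LEGACY_MODELS:
        return "legacy"
    # Anything else is not listed on this page, so refuse to guess.
    raise ValueError(f"unknown model name: {model!r}")
```

A check like this could run before a recognizer is created, so that a legacy model is chosen deliberately rather than by accident.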

Select a model for audio transcription

The model is specified by the recognizer used for the recognition request. Call `speech/projects.locations.recognizers/create` to create a recognizer, and use the `model` field to specify the model. Valid models for each language can be found in the Supported Languages table.
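
To make the shape of that call concrete, the sketch below builds (but does not send) a REST request for creating a recognizer with a chosen model. Several details are assumptions to verify against the API reference: the `https://speech.googleapis.com/v2` base URL, the `recognizerId` query parameter, and the placement of `model` and `languageCodes` under `defaultRecognitionConfig`. The function name and the project and recognizer IDs are hypothetical, and authentication is omitted.

```python
import json
from urllib.parse import urlencode

# Assumed base URL for the V2 REST API; regional locations may use a
# different host, so check the API reference for your location.
API_ROOT = "https://speech.googleapis.com/v2"


def build_create_recognizer_request(project_id, location, recognizer_id,
                                    model, language_codes):
    """Return (url, json_body) for a recognizers.create call. Nothing is sent."""
    parent = f"projects/{project_id}/locations/{location}"
    # Assumption: the new recognizer's ID is passed as a query parameter.
    url = f"{API_ROOT}/{parent}/recognizers?" + urlencode(
        {"recognizerId": recognizer_id}
    )
    # Assumption: the model is set inside the default recognition config.
    body = {
        "defaultRecognitionConfig": {
            "model": model,
            "languageCodes": list(language_codes),
        }
    }
    return url, json.dumps(body)


# Hypothetical project and recognizer IDs, for illustration only.
url, body = build_create_recognizer_request(
    "my-project", "global", "my-recognizer", "telephony", ["en-US"]
)
print(url)
print(body)
```

Keeping request construction separate from sending makes it easy to inspect which model a recognizer will use before any authenticated call is made.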
