Select a transcription model

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Transcription models

Speech-to-Text detects words in an audio clip by comparing the input to one of many machine learning models. Each model has been trained on millions of examples, in this case audio recordings of people speaking.

Speech-to-Text offers specialized models trained on audio from specific sources. These models produce better results when applied to audio that is similar to the data they were trained on.

The following models are available for use with the Speech-to-Text V2 API:

chirp_3: The latest generation of Google's multilingual Automatic Speech Recognition (ASR) generative models, designed to meet users' needs based on feedback and experience. Chirp 3 improves on the accuracy and speed of earlier Chirp models and adds diarization and automatic language detection.
chirp_2: The next generation of the Universal large Speech Model (USM), powered by large language model (LLM) technology. Supports both streaming and batch processing, and both transcription and translation, across diverse linguistic content with multilingual capabilities.
telephony: Use this model for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications.

The following models are based on earlier architectures. They aren't actively maintained and are kept primarily for legacy use and backwards compatibility.

chirp: The Universal large Speech Model (USM), for state-of-the-art non-streaming transcription across diverse linguistic content, with multilingual capabilities.
chirp_telephony: The Universal large Speech Model (USM), fine-tuned for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
long: Use this model for any type of long-form content, such as media or spontaneous speech and conversations. Consider using it instead of the video or default model, especially if those aren't available in your target language.
short: Use this model for short utterances that are a few seconds in length, such as commands or other single-shot directed speech. Consider using it instead of the command and search model.
telephony_short: Dedicated version of the telephony model for short or even single-word utterances in audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications.
medical_conversation: Use this model for conversations between a medical provider (for example, a doctor or nurse) and a patient. Use it when both a provider and a patient are speaking; words uttered by each speaker are automatically detected and labeled.
medical_dictation: Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results.
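As a rough illustration of how the models above map to audio characteristics, the following sketch encodes a few of the rules of thumb from the tables. The helper function, its parameters, and the duration threshold are hypothetical, not part of the API; the only real values are the model name strings.

```python
def choose_model(sample_rate_hz: int, duration_s: float,
                 medical: bool = False, dictation: bool = False) -> str:
    """Hypothetical helper: picks a Speech-to-Text model name based on
    audio characteristics. The decision rules are illustrative only."""
    if medical:
        # Dictated notes vs. a two-party provider/patient conversation.
        return "medical_dictation" if dictation else "medical_conversation"
    if sample_rate_hz <= 8000:
        # Phone-call audio; use the short variant for brief utterances.
        return "telephony_short" if duration_s < 10 else "telephony"
    if duration_s < 10:
        # Short commands or other single-shot directed speech.
        return "short"
    # General-purpose default: the current-generation model.
    return "chirp_3"
```

For example, `choose_model(8000, 120)` returns `"telephony"`, while `choose_model(44100, 3)` returns `"short"`.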

Select a model for audio transcription

The model is specified by the recognizer used for the recognition request. Call the projects.locations.recognizers.create method to create a recognizer, and use its model field to specify the model. Valid models for each language are listed in the Supported Languages table.
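The snippet below sketches the REST request body for creating a recognizer with a specific model. The project ID, recognizer ID, and language code are placeholder values, and a real call also needs authentication (for example, an OAuth bearer token); this sketch only builds and prints the URL and JSON payload.

```python
import json

# Placeholder identifiers; substitute your own values.
PROJECT_ID = "my-project"
LOCATION = "global"
RECOGNIZER_ID = "my-telephony-recognizer"

# Endpoint for the V2 projects.locations.recognizers.create method.
url = (
    f"https://speech.googleapis.com/v2/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/recognizers?recognizerId={RECOGNIZER_ID}"
)

# The model field selects one of the transcription models listed above.
body = {
    "model": "telephony",          # e.g. for 8 kHz phone-call audio
    "languageCodes": ["en-US"],    # must be valid for the chosen model
}

print(url)
print(json.dumps(body, indent=2))
```

Once created, the recognizer is referenced by name in subsequent recognition requests, so all requests that use it are transcribed with the model chosen here.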