This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.
Transcription models
Speech-to-Text detects words in an audio clip by comparing input to one of many machine learning models. Each model has been trained by analyzing millions of examples—in this case, many, many audio recordings of people speaking.
Speech-to-Text has specialized models that are trained on audio from specific sources. These models provide better results when applied to audio data similar to the data they were trained on.
The following table shows the transcription models that are available for use with the Speech-to-Text V2 API.
| Model name | Description |
|---|---|
| chirp_3 | Use the latest generation of Google's multilingual generative Automatic Speech Recognition (ASR) models, designed to meet your users' needs based on feedback and experience. Chirp 3 provides better accuracy and speed than earlier Chirp models, and adds diarization and automatic language detection. |
| chirp_2 | Use the next generation of our Universal Speech Model (USM), powered by our large language model (LLM) technology, for streaming and batch transcription and translation across diverse languages and multilingual content. |
| telephony | Use this model for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications. |
The following models are based on earlier architectures; they aren't actively maintained and are primarily kept for legacy and backwards compatibility.
| Model name | Description |
|---|---|
| chirp | Use our Universal Speech Model (USM) for state-of-the-art non-streaming transcription across diverse languages and multilingual content. |
| chirp_telephony | Universal Speech Model (USM) fine-tuned for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. |
| long | Use this model for any kind of long-form content, such as media or spontaneous speech and conversations. Consider using this model instead of the video or default model, especially if those aren't available in your target language. |
| short | Use this model for short utterances that are a few seconds in length. It's useful for capturing commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model. |
| telephony_short | Dedicated version of the telephony model for short or even single-word utterances that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications. |
| medical_conversation | Use this model for conversations between a medical provider (for example, a doctor or nurse) and a patient. Use the medical_conversation model when both a provider and a patient are speaking. Words uttered by each speaker are automatically detected and labeled. |
| medical_dictation | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
Select a model for audio transcription
The model is specified by the Recognizer used for the recognition request. Call projects.locations.recognizers.create to create a recognizer, and use the model field to specify the model. Valid models for each language can be found in the Supported Languages table.
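For illustration, the following is a minimal sketch of creating a recognizer with the google-cloud-speech Python client for the V2 API. The project ID, location, and recognizer ID are placeholder values, and the telephony model with the en-US language code is one example pairing from the table above; confirm your combination in the Supported Languages table.

```python
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# Placeholder values; substitute your own project, location, and recognizer ID.
PROJECT_ID = "my-project"
LOCATION = "global"
RECOGNIZER_ID = "my-telephony-recognizer"

client = SpeechClient()

# Create a recognizer whose default configuration selects the telephony model.
operation = client.create_recognizer(
    parent=f"projects/{PROJECT_ID}/locations/{LOCATION}",
    recognizer_id=RECOGNIZER_ID,
    recognizer=cloud_speech.Recognizer(
        default_recognition_config=cloud_speech.RecognitionConfig(
            model="telephony",          # a model name from the table above
            language_codes=["en-US"],   # must be supported by the chosen model
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        ),
    ),
)

# create_recognizer returns a long-running operation; wait for the resource.
recognizer = operation.result()
print(recognizer.name)
```

Recognition requests that reference this recognizer then default to the telephony model. The exact client surface can vary between library versions, so treat this as a sketch rather than a drop-in snippet.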