Compare transcription models

This page describes how to use a specific machine learning model for audio transcription requests to Speech-to-Text.

Select the right transcription model

Speech-to-Text detects words in an audio clip by comparing the input to one of many machine learning models. Each model has been trained on millions of examples, in this case audio recordings of people speaking.

Speech-to-Text has specialized models that are trained on audio from specific sources. These models give better results when applied to audio that resembles the data they were trained on.

The following table shows the transcription models that are available for use with the Speech-to-Text V2 API.

Model name	Description
`chirp_3`	Use the latest generation of Google's multilingual Automatic Speech Recognition (ASR)-specific generative models, designed to meet your users' needs based on feedback and experience. Chirp 3 provides better accuracy and speed than earlier Chirp models, and supports diarization and automatic language detection.
`chirp_2`	Use the Universal large Speech Model (USM), powered by our large language model (LLM) technology, for streaming and batch recognition. It provides transcription and translation across diverse linguistic content, with multilingual capabilities.
`telephony`	Use this model for audio that originates from a phone call, typically recorded at an 8 kHz sampling rate. Ideal for customer service, teleconferencing, and automated kiosk applications.
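
As a rough illustration of how the table might translate into code, the following sketch suggests a model name from the sampling rate of the audio. The `suggest_model` helper and the 8 kHz rule of thumb are assumptions made for this example, based only on the `telephony` description above. They are not part of the Speech-to-Text API, and a real application would likely weigh other factors such as language and audio source.

```python
# Hypothetical helper, not part of the Speech-to-Text client library.
# Assumption: audio sampled at exactly 8 kHz is treated as phone-call audio.
TELEPHONY_SAMPLE_RATE_HZ = 8000


def suggest_model(sample_rate_hz: int, default: str = "chirp_3") -> str:
    """Return a model name from the table above for the given sampling rate."""
    if sample_rate_hz == TELEPHONY_SAMPLE_RATE_HZ:
        return "telephony"
    # Fall back to a general-purpose model for everything else.
    return default


if __name__ == "__main__":
    print(suggest_model(8000))
    print(suggest_model(16000))
```

The returned string could then be passed as the `model` value of a recognition config.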

Select a model for audio transcription

To transcribe short audio clips (under 60 seconds), synchronous recognition is the simplest method. It processes your audio and returns the full transcription result in a single response after all audio has been processed.

Python

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"

# Instantiates a client
client = SpeechClient()

# Reads a file as bytes
with open("resources/audio.wav", "rb") as f:
    audio_content = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
)

request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
    config=config,
    content=audio_content,
)

# Transcribes the audio into text
response = client.recognize(request=request)

for result in response.results:
    print(f"Transcript: {result.alternatives[0].transcript}")

To transcribe audio files longer than 60 seconds, or to transcribe audio in real time, use one of the following methods:

Batch Recognition: Ideal for transcribing long audio files (minutes to hours) stored in a Cloud Storage bucket. This is an asynchronous operation. To learn more about batch recognition, see Batch Recognition.

Streaming Recognition: Perfect for capturing and transcribing audio in real time, such as from a microphone feed or a live stream. To learn more about streaming recognition, see Streaming Recognition.
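
The choice between the three methods can be sketched as a small decision helper. This is a minimal example that uses only the Python standard library. The function names (`wav_duration_seconds`, `pick_recognition_method`, `make_silent_wav`) are hypothetical, the 60-second threshold comes from the text above, and sending audio of exactly 60 seconds to batch recognition is an assumption, since the text does not cover that boundary.

```python
import io
import wave

# Threshold taken from the guidance above: clips under 60 seconds
# are candidates for synchronous recognition.
SYNC_LIMIT_SECONDS = 60


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Compute the duration of WAV audio from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def pick_recognition_method(duration_seconds: float, live: bool = False) -> str:
    """Return "streaming", "synchronous", or "batch" (hypothetical labels)."""
    if live:
        # Real-time sources such as a microphone feed.
        return "streaming"
    if duration_seconds < SYNC_LIMIT_SECONDS:
        return "synchronous"
    # Assumption: audio of exactly 60 seconds is also sent to batch.
    return "batch"


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> bytes:
    """Build silent 16-bit mono WAV audio in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(2)
    print(pick_recognition_method(wav_duration_seconds(clip)))
```

The helper only decides which method to call. The actual requests would still go through the client library, as in the synchronous sample above.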

What's next

Learn how to transcribe streaming audio.
Learn how to transcribe long audio files.
Learn how to transcribe short audio files.
For best performance, accuracy, and other tips, see the best practices documentation.