Categorizing audio content using machine learning

This document describes the architecture for an audio categorization pipeline that uses machine learning to review audio files, transcribe them, and analyze them for sentiment. Using machine learning can make this process faster and more accurate than having people perform these tasks. The document is intended for architects or others who want to learn how to use Google Cloud products and Google Cloud machine learning APIs to help categorize audio content. An accompanying tutorial helps developers walk through the process of deploying a sample that illustrates this pattern.

Introduction

If you collect audio files as part of your business, you might want to extract text transcriptions of the audio. You might also want to categorize the content or add tags for search indexing. It takes time for people to create the transcriptions, and there's often so much audio content that it's impractical to have people transcribe or even categorize it all. In addition, when people tag content, if they supply their own tags, they might not include useful ones, or they might not tag content accurately.

By using machine learning, you can instead build an automated categorization pipeline. This document describes an approach to automating the review process for audio files by using Google Cloud machine learning APIs. This approach increases the efficiency of the categorization process and broadens its coverage by extending categorization to all audio content and by automatically generating tags. By using this approach, you can shift to proactively classifying all content and adding accurate tags for search operations.

The first step in categorizing an audio file is to transcribe the audio to text. After you've converted the audio file to text, you can use machine learning to get a summary of the content. You can then use this summary to extract common entities (proper nouns and common nouns) in the text, analyze the overall content, and then offer categorization and tags.

Challenges of building an audio processing pipeline

Building a pipeline to process user-generated audio files poses the following challenges:

  • Scalability: The number of audio file submissions can increase or decrease quickly, and the number can vary significantly over time. There might be peak upload times during an event or campaign that significantly increase the number of content submissions and therefore the processing load.
  • Performance: Processing each audio file requires an efficient pipeline. Audio files can be large and require significant time to process. The app must scale to be able to efficiently store and process each submitted audio file, store the resulting text, and then call the machine learning APIs to analyze the sentiment and perspective and store the results.
  • Intelligence: An audio format isn't conducive to traditional analysis, so the audio must first be converted to text. Converting audio to text is either a labor-intensive manual process or requires a machine learning-based approach. After the audio is converted to text, the entities, concepts, and sentiment must be reviewed for categorization and tagging.

Architecture

The following diagram shows the architecture for the audio categorization pipeline solution described in this document. The pipeline has the following fundamental characteristics that you can use for any use case that involves processing audio files:

  • An event-driven pipeline that starts automatically when audio content is uploaded to a storage location.
  • Scalable, serverless processing that's invoked automatically in response to events in the pipeline.
  • Machine learning that performs the tasks of transcribing the audio files and analyzing sentiment and entities. It uses existing machine learning models, so you don't need to create or find custom models.

Architecture of pipeline that processes audio files.

The pipeline illustrated in the diagram consists of the following processing steps:

  1. Upload audio file. An app or process uploads audio files to Cloud Storage. You could use a Dataflow pipeline or a real-time uploading process; this step is independent of the pipeline itself.
  2. Store audio file. The audio files are stored in a Cloud Storage bucket that operates as a staging bucket to hold the files before they run through the rest of the pipeline.
  3. Trigger Cloud Function. A Cloud Storage Object Finalize notification is generated whenever the audio files are uploaded to the staging bucket. The notification triggers a Cloud Function.
  4. Call the Speech-to-Text API. The Cloud Function calls the Speech-to-Text API to get a transcription of the audio file. This process is asynchronous, so the Speech-to-Text API returns a job ID to the Cloud Function.
  5. Publish Speech-to-Text job IDs. The job ID and the name of the audio file are published to a Pub/Sub topic. These published Pub/Sub messages with job IDs from different audio file submissions accumulate in the topic as more uploads occur.
  6. Speech-to-Text polling. On a scheduled frequency (for example, every 10 minutes), Cloud Scheduler publishes a message to a Pub/Sub topic, which triggers a second Cloud Function.
  7. Get Speech-to-Text API results. The second Cloud Function pulls all messages from the first Pub/Sub topic and extracts the job IDs and filenames for each message. It then calls the Speech-to-Text API to check the status of each job ID:

    • If a job is done, the transcription results for that job are written to a second Cloud Storage bucket. The Cloud Function then moves the audio file from the staging Cloud Storage bucket to a Cloud Storage bucket for processed files.
    • If the job is not done, a Pub/Sub message with the job ID and filename is added back to the Pub/Sub topic. That way, the job is rechecked the next time Cloud Scheduler triggers the Cloud Function.

    If for any reason a transcript is not returned from Speech-to-Text, the Cloud Function moves the audio file from the staging bucket to a Cloud Storage error bucket.

  8. Store Speech-to-Text API results. A text file that contains a transcript of the audio file is written to a Cloud Storage bucket.

  9. Trigger Cloud Functions. When the transcription file is uploaded to Cloud Storage, another Object Finalize notification is sent; this notification triggers further processing by two additional Cloud Functions.

  10. Call Perspective API. The third Cloud Function calls the Perspective API, which returns the probability of "toxicity" in the transcription. The Perspective API website describes this model as follows:

    The Perspective API model was trained by asking people to rate internet comments on a scale from "Very toxic" to "Very healthy" contribution. A toxic comment is defined as "a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion."

    When this analysis is complete, the Cloud Function writes the results to another Cloud Storage bucket.

  11. Call the Cloud Natural Language API. The fourth Cloud Function calls the Natural Language API to analyze the overall attitude or sentiment of the transcription, and to determine which entities are discussed. When the Cloud Function receives the results from the Natural Language API, the function writes the results to another Cloud Storage bucket.

  12. Analyze results. You get the results of the analysis by Speech-to-Text, the Natural Language API, and the Perspective API and integrate the information into your own review pipeline or app. In the preceding diagram, a web app that's hosted on App Engine provides a simple UI to view the results; the web app pulls its data from the outputs stored in Cloud Storage buckets.

Processing the audio files

After you've uploaded an audio file, the file is processed using several APIs. First, you use Speech-to-Text to convert the audio file to text. Next, the text is submitted to the Natural Language API and Perspective API to extract sentiment and entities. Processing the audio file to text can take a significant amount of time. Thus, the architecture uses Cloud Functions, Pub/Sub, and Cloud Storage notification features to implement a scalable, asynchronous event processing pipeline.

Cloud Functions

Cloud Functions provide a way to construct purpose-built, asynchronous functions without managing infrastructure. Calling the three APIs in the architecture described in this document requires an event-driven orchestrator. Cloud Functions can be triggered by Cloud Storage notifications or by Pub/Sub messages; both methods of triggering allow you to construct an event-driven architecture. Cloud Functions can dynamically scale up and down as new files are added and as processing volumes change.
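
As a minimal sketch, an event-driven, first-generation Python Cloud Function triggered by a Cloud Storage finalize event might look like the following. The function and message contents here are illustrative, not taken from the sample code.

```python
# main.py: minimal sketch of an event-driven Cloud Function (1st gen, Python),
# deployed with a google.storage.object.finalize trigger on the staging bucket.
# The function name and log message are illustrative, not from the sample code.

def handle_audio_upload(event, context):
    """Runs automatically each time an object is finalized in the trigger bucket."""
    bucket = event["bucket"]  # bucket that received the file
    name = event["name"]      # object name of the uploaded audio file
    print(f"New audio file gs://{bucket}/{name}; submitting it for transcription.")
    # The downstream steps (calling Speech-to-Text and publishing the job ID)
    # are sketched in the sections that follow.
```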

Pub/Sub messages

Pub/Sub provides a scalable messaging service that can be used to send and receive data and can also be used to trigger Cloud Functions. Pub/Sub provides the asynchronous messaging in this architecture in two ways, as illustrated in the sketch that follows this list:

  1. It provides a messaging queue for job IDs that are generated by the Speech-to-Text API. Later you can use the IDs to determine whether the jobs have completed.
  2. The Cloud Scheduler job sends a Pub/Sub message that triggers the Cloud Function that checks the results from Speech-to-Text for any messages that have not yet been processed.
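
The following sketch illustrates both roles, assuming that job IDs are queued as JSON messages; the topic name and function names are placeholders rather than names from the sample.

```python
# Sketch of the two Pub/Sub roles in this pipeline. The topic name and
# function names are placeholders, not names from the sample.
import base64
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()


def publish_job_id(project_id, job_id, audio_file):
    """Queues a Speech-to-Text job ID so the polling function can check it later."""
    topic_path = publisher.topic_path(project_id, "stt-job-ids")
    payload = json.dumps({"job_id": job_id, "audio_file": audio_file}).encode("utf-8")
    publisher.publish(topic_path, payload)


def check_jobs(event, context):
    """Pub/Sub-triggered Cloud Function invoked by the Cloud Scheduler message."""
    tick = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else ""
    print(f"Scheduler tick {tick!r}: pulling queued job IDs to check their status.")
```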

Cloud Storage notifications

Cloud Storage notifications provide hooks into events that occur in Cloud Storage. Each time an object is uploaded or modified, a notification is generated. You can use this notification along with Cloud Functions to build an event-based system.

Cloud Storage notifications are used in two places in the architecture:

  • To notify a Cloud Function that a new audio file has been uploaded. This notification then starts the processing flow.
  • To notify two of the Cloud Functions that the text output of the Speech-to-Text API has been stored and to then run the text output through the Natural Language API and Perspective API.

Converting speech to text

Converting the audio recording to the text version of the content—that is, transcribing the audio recording—is a key step in this process. One way to perform this task is to implement a custom speech-to-text algorithm in one of the popular machine learning frameworks, such as TensorFlow. Another way is to use a pre-trained machine learning API such as the Speech-to-Text API. Each approach has its advantages and disadvantages.

Option 1: Build your own model in TensorFlow

Advantages

  • Allows you to train the model against specific phrases or domain-specific terms.

Disadvantages

  • Implementing speech-to-text is a solved problem, and creating a custom algorithm doesn't add value unless you have domain-specific terms.
  • Requires machine learning expertise to implement the model.
  • Requires significant effort to implement, select, and tune the model.
  • Requires you to manage and tune the model over time.

Option 2: Use Speech-to-Text

Advantages

  • Easy for developers to use because using an API is easier than developing a model and then building your own API to use it.
  • Doesn't require machine-learning expertise.
  • Provides coverage for more than 120 languages.

Disadvantages

  • Doesn't allow you to train the model against specific phrases or domain-specific terms.

Generally, using Speech-to-Text is an excellent option if your audio doesn't contain specific technical terms and doesn't contain uncommon phrases or phrases that are hard to decipher as words. Because the scenario described earlier evaluates the probability that an audio recording contains inappropriate content, the transcription from Speech-to-Text is accurate enough for that classification even if it misses some technical terms.

Given the broad range of languages covered by Speech-to-Text (more than 120) and the ease of integrating it into a Google Cloud-based solution, it's the recommended choice for the scenario described in this document.

Calling the Speech-to-Text API

The Speech-to-Text API provides both synchronous and asynchronous transcription services. To process lengthy audio files, it makes the most sense to use the asynchronous transcription mode. When you use asynchronous mode, you submit an audio file to the API, and the API returns a job ID. You can then poll with the job ID to check the job status.
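
As a sketch of the submission step, the following uses the google-cloud-speech client to start a long-running transcription job. The Cloud Storage URI and language code are placeholders, and the attribute used to read the operation name can vary by client library version.

```python
# Sketch: submit an asynchronous (long-running) transcription request.
# The Cloud Storage URI and language code are placeholders.
from google.cloud import speech_v1


def submit_transcription(gcs_uri):
    """Starts an asynchronous Speech-to-Text job and returns its operation (job) name."""
    client = speech_v1.SpeechClient()
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.FLAC,  # lossless source
        language_code="en-US",
    )
    audio = speech_v1.RecognitionAudio(uri=gcs_uri)

    operation = client.long_running_recognize(config=config, audio=audio)
    # The long-running operation name serves as the job ID that the pipeline
    # publishes to Pub/Sub for later polling.
    return operation.operation.name
```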

In the architecture described in this document, Cloud Scheduler is used to regularly check the job status for all outstanding jobs. Cloud Scheduler triggers a Cloud Function that pulls all of the job IDs from the Pub/Sub topic. The Cloud Function checks the status of each job by calling the Speech-to-Text API. The API returns the results for any jobs that are completed; these results are then stored as text files in Cloud Storage. For jobs that aren't yet complete, the job ID is sent back to the Pub/Sub topic for processing during the next iteration.
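
The polling step might look like the following sketch, which assumes job IDs were queued in Pub/Sub as JSON messages (as in the earlier sketch) and reads job status from the Speech-to-Text operations endpoint; the subscription and topic names are placeholders, and the sample's own code may organize this differently.

```python
# Sketch: check the status of outstanding Speech-to-Text jobs.
# Assumes job IDs were queued as JSON Pub/Sub messages (see the earlier sketch)
# and that job status is read from the Speech-to-Text operations endpoint.
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import pubsub_v1


def poll_jobs(project_id, subscription_id="stt-job-ids-sub", max_messages=50):
    credentials, _ = google.auth.default()
    session = AuthorizedSession(credentials)

    subscriber = pubsub_v1.SubscriberClient()
    publisher = pubsub_v1.PublisherClient()
    sub_path = subscriber.subscription_path(project_id, subscription_id)
    topic_path = publisher.topic_path(project_id, "stt-job-ids")

    response = subscriber.pull(subscription=sub_path, max_messages=max_messages)
    for received in response.received_messages:
        job = json.loads(received.message.data.decode("utf-8"))
        status = session.get(
            f"https://speech.googleapis.com/v1/operations/{job['job_id']}"
        ).json()
        if status.get("done"):
            print(f"{job['audio_file']}: transcription complete")
            # Write status["response"] to the transcript bucket here.
        else:
            # Not done yet: requeue the same message for the next scheduled run.
            publisher.publish(topic_path, received.message.data)
        subscriber.acknowledge(subscription=sub_path, ack_ids=[received.ack_id])
```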

The Speech-to-Text API accepts a list of specific audio encodings. For best results, you should work with audio sources that have been captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition might be reduced if you use lossy codecs to capture or transmit audio, particularly if there's background noise in the recording. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.

Analyzing sentiment and entities

After the audio content has been converted to text, it's possible to perform further analysis on the content to determine which entities are mentioned in the audio along with the sentiment for each. There are several ways to extract entities and sentiment from text; some approaches are more complex than others.

The first option is to build your own entity and sentiment extractor using a machine learning framework such as TensorFlow. Another option is to leverage the work already done by others and use transfer learning to customize an existing model. Using a pre-trained machine learning API such as the Natural Language API or AutoML Natural Language is an additional option. Each approach has its advantages and disadvantages.

Option 1: Build your own model in TensorFlow or use transfer learning

Advantages:

  • Allows you to train the model against specific entities or domain-specific terms.

Disadvantages:

  • Implementing entity extraction is a solved problem and doesn't add value unless you have domain-specific terms.
  • Requires machine learning expertise to implement the model.
  • Requires significant effort in model implementation, model selection, and tuning.
  • Requires you to manage and tune the model over time.

Option 2: Use the Natural Language API

Advantages:

  • Easy for developers to use because using an API is easier than developing a model and then building the API to use it.
  • Doesn't require machine learning expertise, because the model is pre-built, trained, and hosted.
  • Includes sentiment analysis, entity analysis, entity sentiment analysis, content classification, and syntax analysis.

Disadvantages:

  • Doesn't allow you to train against specific entities or domain-specific terms.

Option 3: Use AutoML Natural Language

Advantages:

  • Doesn't require machine learning expertise, because model training and hosting are managed for you.
  • Allows you to train the model against specific entities or domain-specific terms.
  • Surfaces the resulting trained model as an API.

Disadvantages:

  • Requires you to supply training data.
  • Requires you to manage and tune the model over time.

Generally, using the Natural Language API is an excellent option if your text doesn't contain technical terms and doesn't contain uncommon phrases or phrases that are hard to decipher as words. In the use case described in this document, because the sample extracts sentiment and entities from the text, the Natural Language API provides an acceptable level of performance. Given the ease of using the API as compared to developing a custom machine learning model, you should use the Natural Language API.

Calling the Natural Language API

In the architecture described in this document, the Natural Language API is used to extract the entities and perform entity sentiment analysis. This information is then used to tag the text extracted from the audio file and provide you with an understanding of the overall theme of the text.
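
A minimal sketch of that call with the google-cloud-language client might look like the following; the transcript text passed in is a placeholder.

```python
# Sketch: extract overall sentiment plus entities and entity sentiment from a
# transcript with the Natural Language API. The transcript text is a placeholder.
from google.cloud import language_v1


def analyze_transcript(text):
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )

    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entity_sentiment(document=document).entities

    print(f"Overall sentiment: score={sentiment.score:.2f}, "
          f"magnitude={sentiment.magnitude:.2f}")
    for entity in entities:
        print(f"{entity.name} ({language_v1.Entity.Type(entity.type_).name}): "
              f"sentiment score={entity.sentiment.score:.2f}")
```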

Analyzing text perspective

Another signal in the analysis is the negative impact that a remark or comment might have on a given topic. The Perspective API was created by Jigsaw and Google's Counter Abuse Technology team and launched as an open source project. It determines the probability that a given comment can be perceived as "toxic." The API uses machine learning models to score the perceived impact that a comment might have on a conversation.

The Perspective API

The Perspective API offers several models, including TOXICITY, PROFANITY, and INSULT. The sample architecture uses the TOXICITY model, which reports the probability that the supplied text is a "rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion." Because the text content includes the start and end times in the corresponding audio file, the results from the Perspective API are stored with those start and end times. This lets you associate the Perspective API results with a given section of the original audio file.
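
As a sketch, a Cloud Function might request a TOXICITY score for one transcript segment as follows; the API key and the way segments are passed in are assumptions for illustration.

```python
# Sketch: request a TOXICITY score from the Perspective API for one transcript
# segment. The API key and segment text are placeholders.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def score_toxicity(text, api_key):
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    response.raise_for_status()
    result = response.json()
    # Probability (0-1) that the text would be perceived as toxic.
    return result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```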

Viewing audio pipeline results

The processing pipeline is initiated when you upload an audio file to Cloud Storage. In practice, you can integrate this step into any user-upload process. The upload can happen in real time from your web app, or in a batch that uploads many audio files at once through a data pipeline.

When the processing has completed, you can read the results of the analysis that was created by the APIs used in the architecture. The architecture diagram illustrates a web UI that displays the text content, the entities and the sentiment identified in the text, and the start and end time for the related audio clip. As shown in the diagram, the web app uses App Engine and Cloud Storage.

App Engine

The web app is implemented as an App Engine app. App Engine is a platform-as-a-service (PaaS) product that supports many common web programming languages and can automatically scale up and down based on user traffic. App Engine is well integrated with Google Cloud, which simplifies the process of uploading the audio files to Cloud Storage.
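
A minimal sketch of such a web app, assuming the analysis results are stored as JSON objects in a results bucket (the bucket and object naming here are assumptions), might look like this:

```python
# Sketch: a minimal App Engine (Python) handler that reads stored analysis
# results from Cloud Storage. Bucket and object naming are assumptions.
import json

from flask import Flask, jsonify
from google.cloud import storage

app = Flask(__name__)
RESULTS_BUCKET = "audio-pipeline-results"  # placeholder bucket name


@app.route("/results/<path:audio_file>")
def results(audio_file):
    """Returns the stored transcription and analysis results for one audio file."""
    bucket = storage.Client().bucket(RESULTS_BUCKET)
    blob = bucket.blob(f"{audio_file}.json")
    return jsonify(json.loads(blob.download_as_text()))
```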

What's next