Try Gemini 3, our best model for reasoning, coding, and multimodal understanding in Vertex AI

Text-to-Speech

Text-to-Speech AI

Convert text into natural-sounding speech using an API powered by the best of Google’s AI technologies.

New customers get up to $300 in free credits to try Text-to-Speech and other Google Cloud products.

Deliver intelligent, lifelike responses to users with natural AI voices
Build voice interfaces for apps with integrated text to speech
Personalize your communication and audio based on user voice and language preferences

Learn how to create synthetic speech using Text-to-Speech API

Start self-paced lab

Benefits

High fidelity speech

Deploy Google’s groundbreaking technologies to generate speech with humanlike intonation. Built based on DeepMind’s speech synthesis expertise, the API delivers voices that are near human quality.

Widest voice selection

Choose from a set of 380+ voices across 75+ languages and variants, including Mandarin, Hindi, Spanish, Arabic, Russian, and more. Pick the voice that works best for your user and application.

One-of-a-kind voice

Create a unique voice to represent your brand across all your customer touchpoints, instead of using a common voice shared with other organizations.

Demo

Put Text-to-Speech into action

Type what you want, select a language then click “Speak It” to hear.

Key features

Gemini-TTS

Synthesize single or multispeaker speech from short snippets to full-length narratives, maintaining contextuality. Precisely dictate style, accent, pace, tone, and emotional expression—all steerable through simple natural-language prompts in 75+ locales. Head to Media Studio or check out our documentation to learn more.

Chirp 3: HD voices

Build engaging agents using the latest spontaneous conversational voices based on AudioML. These voices offer high-quality audio, low-latency streaming, and natural-sounding speech, incorporating human disfluencies, emotional range, and accurate intonation. Head to Media Studio or check out our documentation to learn more.

Chirp 3: instant custom voice

Create personalized voice models with as little as 10 seconds of audio input. Perfect for video games, audiobooks, podcasts, and more. Available in more than 30+ locales. Head to Media Studio or check out our documentation to learn more.

Prompt, text, and SSML support

Control number and time formatting, delivery, pronunciation, and emotion using simple plaintext scripting, SSML tags, or even powerful natural-language prompts depending on model support. Head to Media Studio or check out our documentation to learn more.

What's new

Vector art of people saying ‘Hi’ in different languages

Blog post

Google Cloud Text-to-Speech API now supports custom voicesRead the blog

Person holding smartphone showing an audiobook created with text to speech

Video

How to convert PDFs to audiobooks with machine learningWatch video

sketch demonstrating AI-powered conversation with Contact CenterAI

Blog post

Conversational AI drives better customer experiencesRead the blog

Woman holding mobile phone in front of her and speaking into it

Video

Solving for accessible phone calls with Speech-to-Text and Text-to-SpeechWatch video

Caption of Cloud Text-to-Speech Languages and Voices above rows of ~25 flags of the world

Blog post

New voices and languages for Text-to-SpeechRead the blog

Documentation

Quickstart

Gemini-TTS

Learn how to precisely control speech synthesis with Gemini-TTS, using natural language prompts to dictate style, tone, pace, and emotional expression.

Quickstart

Chirp 3: HD voices overview

Learn how to synthesize realistic, emotionally resonant speech using Chirp 3: HD voices, and fine-tune the audio with advanced controls and scripting best practices.

Quickstart

Chirp 3: instant custom voice overview

Create personalized and unique voice models using just 10 seconds of audio recordings for your organization. It allows for the rapid generation of personal voices.

Tutorial

Speaking addresses with SSML

Learn how to use Speech Synthesis Markup Language (SSML) to speak a text file of addresses.

Google Cloud Basics

Text-to-Speech basics

A guide to the fundamental concepts of using the Text-to-Speech API.

Google Cloud Basics

Supported voices and languages

Browse guides and resources for this product.

Not seeing what you’re looking for?

Release notes

Read about the latest releases for Text-to-Speech

Use cases

Use case

Voicebots in contact centers

Deliver a better voice experience for customer service with voicebots on Dialogflow that dynamically generate speech, instead of playing static, pre-recorded audio. Engage with high-quality synthesized voices that give callers a sense of familiarity and personalization.

Use case

Voice generation in devices

Enable natural communications with your users by empowering your devices to speak humanlike voices as a text reader. Build an end-to-end voice user interface together with Speech-to-Text and Natural Language to improve user experience with easy and engaging interactions.

Use case

Accessible EPGs (Electronic Program Guides)

Easily have the EPGs read text aloud to provide a better user experience to your customers and meet accessibility requirements for your services and applications. Try the EPG demo.

Easily implement text-to-speech functionality in EPGs to provide a better user experience to your customers and meet accessibility requirements for your services and applications.

Generate a solution

What problem are you trying to solve?

What you'll get:

Step-by-step guide

Reference architecture

Available pre-built solutions

This service was built with Gemini Enterprise Agent Platform. You must be 18 or older to use it. Do not enter sensitive, confidential, or personal info.

All features

Streaming audio synthesis	Power your AI agents with ultra-low-latency speech for seamless, real-time conversations with streaming audio synthesis.
Long audio synthesis	Asynchronously synthesize up to 1 million bytes of input with long audio synthesis.
Voice and language selection	Choose from an extensive selection of 380+ voices across 75+ languages and variants, with more to come soon.
Text and SSML support	Customize your speech with SSML tags that allow you to add pauses, numbers, date and time formatting, and other pronunciation instructions.
Pitch tuning	Personalize the pitch of your selected voice, up to 20 semitones more or less than the default.
Speaking rate tuning	Adjust your speaking rate to be 4x faster or slower than the normal rate.
Volume gain control	Increase the volume of the output by up to 16 db or decrease the volume up to -96 db.
Integrated REST and gRPC APIs	Easily integrate with any application or device that can send a REST or gRPC request including phones, PCs, tablets, and IoT devices (for example cars, TVs, speakers).
Audio format flexibility	Convert text to MP3, Linear16, OGG Opus, and a number of other audio formats.
Audio profiles	Optimize for the type of speaker from which your speech is intended to play, such as headphones or phone lines.

Pricing

Text-to-Speech is priced based on the number of characters sent to the service to be synthesized into audio each month. The first 1 million characters for WaveNet voices are free each month. For Standard (non-WaveNet) voices, the first 4 million characters are free each month. After the free tier has been reached, Text-to-Speech is priced per 1 million characters of text processed.

If you pay in a currency other than USD, the prices listed in your currency on Google Cloud SKUs apply.

Take the next step

New customers get $300 in free credits to try Text-to-Speech and other Google Cloud products.

Need help getting started?
Contact sales
Work with a trusted partner
Find a partner
Continue browsing
See all products

Text-to-Speech AI

High fidelity speech

Widest voice selection

One-of-a-kind voice

Put Text-to-Speech into action

Key features

Gemini-TTS

Chirp 3: HD voices

Chirp 3: instant custom voice

Prompt, text, and SSML support

What's new

Documentation

Gemini-TTS

Chirp 3: HD voices overview

Chirp 3: instant custom voice overview

Speaking addresses with SSML

Text-to-Speech basics

Supported voices and languages

Not seeing what you’re looking for?

Explore more docs

Use cases

Voicebots in contact centers

Voice generation in devices

Accessible EPGs (Electronic Program Guides)

All features

Pricing

Take the next step

Need help getting started?

Work with a trusted partner

Continue browsing