Types of voices

Text-to-Speech generates natural, human-like audio that sounds like a real person speaking. To get started, specify a voice when you send a synthesis request.

Text-to-Speech offers a variety of voices that differ by language, gender, and accent. Some languages offer several voices. For the full list, see the Supported Voices page. To select a voice, set the VoiceSelectionParams field in your API request. For instructions on making a synthesize request, see the Quickstarts.
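As an illustration of voice selection, the following minimal sketch builds the JSON body of a REST text:synthesize request using only the Python standard library. The helper name build_synthesize_body and the default MP3 encoding are choices made for this example, not part of the API. The "voice" object corresponds to VoiceSelectionParams. Sending the request and authenticating are not shown.

```python
import json

# REST endpoint for non-streaming synthesis (v1 API). Shown for reference
# only; this sketch builds the request body and does not send it.
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_body(text, language_code, voice_name, audio_encoding="MP3"):
    """Return the JSON-serializable body of a text:synthesize request.

    build_synthesize_body is a hypothetical helper written for this example.
    """
    return {
        "input": {"text": text},
        # This object corresponds to VoiceSelectionParams.
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": audio_encoding},
    }


body = build_synthesize_body("Hello, world!", "en-US", "en-US-Studio-O")
print(json.dumps(body, indent=2))
```

To try a different voice, change only the language code and voice name; the rest of the body stays the same.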

Overview

Voice type	Speakers	Intended for	Launch stage	Controllability	Streaming
Chirp HD voices	-	Conversational agents	Preview	-	Yes
Studio	Two speakers	Media: discussions and interviews	Experimental	-	-
Studio	One speaker	Media: narration	GA	SSML	-
Neural2	-	General purpose	GA	SSML	-
Standard	-	Cost efficient	GA	SSML	-
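The overview table can also be expressed as data, which makes it easy to shortlist voice types by requirement. The sketch below is an assumption-laden illustration: the VoiceType record and find_voice_types helper are hypothetical names, and a "-" in the table is read here as "not listed as supported".

```python
from typing import NamedTuple


class VoiceType(NamedTuple):
    """One row of the overview table (hypothetical record for this example)."""
    name: str
    intended_for: str
    launch_stage: str
    supports_ssml: bool       # Controllability column lists SSML
    supports_streaming: bool  # Streaming column says Yes


# Assumption: "-" in the table means the feature is not listed as supported.
VOICE_TYPES = [
    VoiceType("Chirp HD", "Conversational agents", "Preview", False, True),
    VoiceType("Studio (two speakers)", "Media: discussions and interviews",
              "Experimental", False, False),
    VoiceType("Studio (one speaker)", "Media: narration", "GA", True, False),
    VoiceType("Neural2", "General purpose", "GA", True, False),
    VoiceType("Standard", "Cost efficient", "GA", True, False),
]


def find_voice_types(ssml=None, streaming=None, launch_stage=None):
    """Return names of voice types matching every requirement that is not None."""
    matches = []
    for vt in VOICE_TYPES:
        if ssml is not None and vt.supports_ssml != ssml:
            continue
        if streaming is not None and vt.supports_streaming != streaming:
            continue
        if launch_stage is not None and vt.launch_stage != launch_stage:
            continue
        matches.append(vt.name)
    return matches


print(find_voice_types(ssml=True))
print(find_voice_types(streaming=True))
```

Leaving a filter as None means that column is ignored, so the same helper answers both "which types accept SSML?" and "which GA types accept SSML?".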

Pricing Details

Chirp HD voices

Chirp HD voices are powered by the AudioLM engine. They let you create more engaging and empathetic speech for conversational applications. Through text streaming, Chirp HD voices support low-latency, real-time communication. They are available in the languages listed in the table of supported voices.

Use case	Voice
Chat experiences	en-US-Chirp-HD-F
Virtual assistants	en-US-Chirp-HD-D
Customer service chatbots	en-US-Chirp-HD-F
Interactive education applications	en-US-Chirp-HD-O
Sales and pitches	en-US-Chirp-HD-D
Storytime	en-US-Chirp-HD-F

Studio multispeaker voices

Create discussions and interviews with the new multispeaker Studio voices, which are based on the same technology as Chirp HD voices.

Studio voices

Studio voices are designed for news reading and broadcast content.

Example 1. The en-US-Studio-O voice reading The Great Gatsby.
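The overview table lists SSML as the controllability option for single-speaker Studio voices. As a hedged sketch of what that can look like, the code below builds a synthesize request body whose input is SSML rather than plain text, inserting a pause between sentences. The helper build_ssml_body, the sample sentences, and the 500 ms pause are illustrative assumptions. Which SSML tags a particular voice honors is not covered here.

```python
from xml.sax.saxutils import escape


def build_ssml_body(sentences, language_code, voice_name, pause_ms=500):
    """Return a synthesize request body that uses SSML input.

    build_ssml_body is a hypothetical helper written for this example.
    """
    # Pause inserted between consecutive sentences.
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the sentence text cannot break the SSML markup.
    ssml = "<speak>" + pause.join(escape(s) for s in sentences) + "</speak>"
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


ssml_body = build_ssml_body(
    ["Good evening.", "Markets rose & bonds fell."],
    "en-US",
    "en-US-Studio-O",
)
print(ssml_body["input"]["ssml"])
```

Escaping the text before wrapping it in markup matters for broadcast-style copy, where characters such as "&" are common.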

Neural2 voices

The Text-to-Speech API provides a voice tier called Neural2. Neural2 voices are based on the same technology used to create a Custom Voice, so anyone can use Custom Voice technology without training their own custom voice. Neural2 voices are available on global and single-region endpoints.

Example 1. Neural2 voice

Standard voices

The voices offered by Text-to-Speech differ in how they are produced, that is, in the synthetic speech technology used to create the machine model of the voice. One common speech technology, parametric text-to-speech, typically generates audio data by passing outputs through signal processing algorithms known as vocoders. Many of the Standard voices available in Text-to-Speech use a variation of this technology.