Cloud Text-to-Speech API Basics

Cloud Text-to-Speech API allows developers to include natural-sounding, synthetic human speech as playable audio in their applications. The Text-to-Speech API converts text or Speech Synthesis Markup Language (SSML) input into audio data like MP3 or LINEAR16 (the encoding used in WAV files).

This document is a guide to the fundamental concepts of using the Cloud Text-to-Speech API. Before diving into the API itself, review the quickstart.

Basic example

The Text-to-Speech API is ideal for any application that plays audio of human speech to users. It allows you to convert arbitrary strings, words, and sentences into the sound of a person speaking those same words.

Imagine that you have a voice assistant app that provides natural language feedback to your users as playable audio files. Your app might take an action and then provide human speech as feedback to the user.

For example, your app may want to report that it successfully added an event to the user's calendar. Your app constructs a response string to report the success to the user, something like "I've added the event to your calendar."

With the Text-to-Speech API, you can convert that response string to actual human speech to play back to the user, similar to the example provided below.


Example 1. Audio file generated from Text-to-Speech API

To create an audio file like example 1, you send a request like the following to the Text-to-Speech API.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) -H "Content-Type: application/json; charset=utf-8" --data "{
  'input':{
    'text':'I\'ve added the event to your calendar.'
  },
  'voice':{
    'languageCode':'en-gb',
    'name':'en-GB-Standard-A',
    'ssmlGender':'FEMALE'
  },
  'audioConfig':{
    'audioEncoding':'MP3'
  }
}" "https://texttospeech.googleapis.com/v1beta1/text:synthesize"

Speech synthesis

The process of translating text input into audio data is called synthesis and the output of synthesis is called synthetic speech. The Text-to-Speech API takes two types of input: raw text or SSML-formatted data (discussed below). To create a new audio file, you call the synthesize endpoint of the API.

The speech synthesis process generates raw audio data as a base64-encoded string. You must decode the base64-encoded string into an audio file before an application can play it. The Text-to-Speech API client libraries convert the audio data into audio files for you. Otherwise, most platforms and operating systems have tools for decoding base64 text into playable media files.
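For example, the following sketch decodes the audioContent field of a synthesize response that was saved to a file. The filename response.json is illustrative, and jq is assumed to be installed.

# Extract the base64-encoded "audioContent" field from the JSON response
# and decode it into a playable MP3 file.
# (--decode is the GNU coreutils flag; some systems use -d or -D instead.)
jq -r '.audioContent' response.json | base64 --decode > output.mp3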

To learn more about synthesis, review the quickstart or the Creating Voice Audio Files page.

Voices

The Text-to-Speech API creates raw audio data of natural, human speech. That is, it creates audio that sounds like a person talking. When you send a synthesis request to the Text-to-Speech API, you must specify a voice that 'speaks' the words.

The Text-to-Speech API has a wide selection of voices available for you to use. The voices differ by language, gender, and accent (for some languages). For example, you can create audio that mimics the sound of a female English speaker with a British accent, like example 1 above. You can also convert the same text into a different voice, say a male English speaker with an Australian accent.


Example 2. Audio file generated with en-AU speaker
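To create audio like example 2, you change only the voice object in the request; the rest of the request stays the same. The voice name below, en-AU-Standard-B, is an illustrative choice; check the list of supported voices for current names.

"voice":{
  "languageCode":"en-AU",
  "name":"en-AU-Standard-B",
  "ssmlGender":"MALE"
}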

To see the complete list of the available voices, see Supported Voices.
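You can also retrieve the available voices programmatically by calling the API's voices endpoint, as sketched below.

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  "https://texttospeech.googleapis.com/v1beta1/voices"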

WaveNet voices

Along with other, traditional synthetic voices, the Text-to-Speech API also provides premium, WaveNet-generated voices. Users find WaveNet-generated voices warmer and more human-like than other synthetic voices.

The key difference in a WaveNet voice is the WaveNet model used to generate it. WaveNet models are trained on raw audio samples of actual humans speaking. As a result, these models generate synthetic speech with more human-like emphasis and inflection on syllables, phonemes, and words.

Compare the following two samples of synthetic speech.


Example 3. Audio file generated with a standard voice


Example 4. Audio file generated with a WaveNet voice
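You request a WaveNet voice the same way as any other voice, by specifying it in the voice object of the request; WaveNet voices include "Wavenet" in their names. The voice below, en-GB-Wavenet-A, is an illustrative choice.

"voice":{
  "languageCode":"en-GB",
  "name":"en-GB-Wavenet-A",
  "ssmlGender":"FEMALE"
}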

To learn more about the benefits of WaveNet-generated voices, see WaveNet and Other Synthetic Voices.

Other audio output settings

Besides the voice, you can also configure other aspects of the audio data output created by speech synthesis. The Text-to-Speech API supports configuring the speaking rate, pitch, volume gain, and sample rate (in hertz).
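For example, an audioConfig object like the following sketch (the values shown are illustrative) slows the speech slightly, lowers the pitch, raises the volume, and sets the output sample rate.

"audioConfig":{
  "audioEncoding":"MP3",
  "speakingRate":0.85,
  "pitch":-2.0,
  "volumeGainDb":1.0,
  "sampleRateHertz":44100
}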

Review the AudioConfig reference for more information.

Speech Synthesis Markup Language (SSML) support

You can enhance the synthetic speech produced by the Text-to-Speech API by marking up the text using Speech Synthesis Markup Language (SSML). SSML enables you to insert pauses, acronym pronunciations, and other details into the audio data created by the Text-to-Speech API. The Text-to-Speech API supports a subset of the available SSML elements.

For example, you can ensure that the synthetic speech correctly pronounces ordinal numbers by providing Text-to-Speech API with SSML input that marks ordinal numbers as such.
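To synthesize from SSML, you replace the text field of the request's input object with an ssml field, as in the following sketch (the sentence is illustrative).

"input":{
  "ssml":"<speak>I was number <say-as interpret-as=\"ordinal\">1</say-as> in line.</speak>"
}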


Example 5. Audio file generated from plain text input


Example 6. Audio file generated from SSML input

To learn more about how to synthesize speech from SSML, see Creating Voice Audio Files.
