Creating Voice Audio Files

The Cloud Text-to-Speech API converts words and sentences into base64-encoded audio data of natural human speech. You can then decode the base64 data into a playable audio file, such as an MP3. The API accepts input as raw text or Speech Synthesis Markup Language (SSML).

This document describes how to create an audio file from either text or SSML input using the Text-to-Speech API. You can also review the Text-to-Speech API basics article if you are unfamiliar with concepts like speech synthesis or SSML.

These samples require that you have set up gcloud and have created and activated a service account. For setup instructions, see Quickstart: Text-to-Speech.

Converting text to synthetic voice audio

The following code samples demonstrate how to convert a string into audio data.

You can configure the output of speech synthesis in a variety of ways, including selecting a specific voice or adjusting the pitch, volume, speaking rate, and sample rate of the output.
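
As a rough illustration of the tuning options mentioned above, the following Python sketch builds a dictionary shaped like the audioConfig section of a synthesis request. The helper name build_audio_config is hypothetical, and the value ranges it checks are assumptions recalled from the API reference, so verify them against the text:synthesize documentation before relying on them.

```python
def build_audio_config(audio_encoding="MP3", speaking_rate=1.0, pitch=0.0,
                       volume_gain_db=0.0, sample_rate_hertz=None):
    """Return a dict shaped like the audioConfig section of a request body.

    Hypothetical helper for illustration only; it does not call the API.
    """
    # The ranges below are assumptions taken from the API reference.
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking_rate should be between 0.25 and 4.0")
    if not -20.0 <= pitch <= 20.0:
        raise ValueError("pitch should be between -20.0 and 20.0 semitones")
    if not -96.0 <= volume_gain_db <= 16.0:
        raise ValueError("volume_gain_db should be between -96.0 and 16.0")

    config = {
        "audioEncoding": audio_encoding,
        "speakingRate": speaking_rate,
        "pitch": pitch,
        "volumeGainDb": volume_gain_db,
    }
    # Only send a sample rate when the caller asks for a specific one.
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    return config
```

For example, build_audio_config(speaking_rate=1.25, pitch=-2.0) describes slightly faster, slightly lower-pitched MP3 output.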

Protocol

Refer to the text:synthesize API endpoint for complete details.

To synthesize audio from text, make an HTTP POST request to the text:synthesize endpoint. In the body of your POST request, specify the type of voice to synthesize in the voice configuration section, specify the text to synthesize in the text field of the input section, and specify the type of audio to create in the audioConfig section.
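
The three sections of the request body described above can be assembled programmatically. The sketch below is a minimal example, assuming you want strict JSON produced by Python's standard json module; the function name build_synthesis_request and its defaults are hypothetical and only construct the body, they do not send it.

```python
import json


def build_synthesis_request(text, language_code="en-GB", voice_name=None,
                            ssml_gender="FEMALE", audio_encoding="MP3"):
    """Return the JSON body for a text:synthesize request as a string."""
    # voice section: which voice to synthesize with.
    voice = {"languageCode": language_code, "ssmlGender": ssml_gender}
    if voice_name:
        voice["name"] = voice_name

    body = {
        # input section: the raw text to synthesize.
        "input": {"text": text},
        "voice": voice,
        # audioConfig section: the type of audio to create.
        "audioConfig": {"audioEncoding": audio_encoding},
    }
    return json.dumps(body)
```

The returned string can be passed as the body of the HTTP POST request, for example by saving it to a file and using that file as the curl request data.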

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the results to a file named synthesize-text.txt.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt

The Cloud Text-to-Speech API returns the synthesized audio as base64-encoded data contained in the JSON output. The JSON output in the synthesize-text.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the results from the Cloud Text-to-Speech API as an MP3 audio file, run the following command from the same directory as the synthesize-text.txt file.

sed 's|audioContent| |' < synthesize-text.txt > tmp-output.txt && \
tr -d '\n ":{}' < tmp-output.txt > tmp-output-2.txt && \
base64 tmp-output-2.txt --decode > synthesize-text-audio.mp3 && \
rm tmp-output*.txt
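
If you prefer not to rely on sed, tr, and base64, the same decoding step can be sketched in Python using only the standard library. The function name decode_audio_response is hypothetical; the sketch assumes the file contains the JSON response shown above, with the audio in the audioContent field.

```python
import base64
import json


def decode_audio_response(response_path, audio_path):
    """Decode the base64 audioContent of a saved response into an audio file.

    Returns the number of audio bytes written.
    """
    with open(response_path, "r", encoding="utf-8") as f:
        response = json.load(f)

    # A failed request returns a JSON error object without audioContent.
    if "audioContent" not in response:
        raise ValueError("no audioContent field found in the response")

    audio_bytes = base64.b64decode(response["audioContent"])
    with open(audio_path, "wb") as out:
        out.write(audio_bytes)
    return len(audio_bytes)
```

For the files used in this section, you would call decode_audio_response("synthesize-text.txt", "synthesize-text-audio.mp3").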

C#

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

/// <summary>
/// Creates an audio file from the text input.
/// </summary>
/// <param name="text">Text to synthesize into audio</param>
/// <remarks>
/// Generates a file named 'output.mp3' in project folder.
/// </remarks>
public static void SynthesizeText(string text)
{
    TextToSpeechClient client = TextToSpeechClient.Create();
    var response = client.SynthesizeSpeech(new SynthesizeSpeechRequest
    {
        Input = new SynthesisInput
        {
            Text = text
        },
        // Note: voices can also be specified by name
        Voice = new VoiceSelectionParams
        {
            LanguageCode = "en-US",
            SsmlGender = SsmlVoiceGender.Female
        },
        AudioConfig = new AudioConfig
        {
            AudioEncoding = AudioEncoding.Mp3
        }
    });

    using (Stream output = File.Create("output.mp3"))
    {
        response.AudioContent.WriteTo(output);
    }
}

Go

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

// SynthesizeText synthesizes plain text and saves the output to outputFile.
func SynthesizeText(w io.Writer, text, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	// Release the client's underlying connections when the function returns.
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

/**
 * Demonstrates using the Text-to-Speech client to synthesize text.
 *
 * @param text the raw text to be synthesized. (e.g., "Hello there!")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static void synthesizeText(String text) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the text input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setText(text).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en-US"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
    }
  }
}

Node.js

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const text = 'Text to synthesize, e.g. hello';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

const request = {
  input: {text: text},
  voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
  audioConfig: {audioEncoding: 'MP3'},
};

client.synthesizeSpeech(request, (err, response) => {
  if (err) {
    console.error('ERROR:', err);
    return;
  }

  fs.writeFile(outputFile, response.audioContent, 'binary', err => {
    if (err) {
      console.error('ERROR:', err);
      return;
    }
    console.log(`Audio content written to file: ${outputFile}`);
  });
});

Python

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

def synthesize_text(text):
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.types.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.types.VoiceSelectionParams(
        language_code='en-US',
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)

    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3)

    response = client.synthesize_speech(input_text, voice, audio_config)

    # The response's audio_content is binary.
    with open('output.mp3', 'wb') as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Ruby

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

require "google/cloud/text_to_speech"

client = Google::Cloud::TextToSpeech.new

input_text = { text: text }

# Note: the voice can also be specified by name.
# Names of voices can be retrieved with client.list_voices
voice = {
  language_code: "en-US",
  ssml_gender:   "FEMALE"
}

audio_config = { audio_encoding: "MP3" }

response = client.synthesize_speech input_text, voice, audio_config

# The response's audio_content is binary.
File.open("output.mp3", "wb") do |file|
  # Write the response to the output file.
  file.write(response.audio_content)
end

puts "Audio content written to file 'output.mp3'"

Converting SSML to synthetic voice audio

Using SSML in your audio synthesis request can produce audio that is more similar to natural human speech. Specifically, SSML gives you finer-grained control over how the audio output represents pauses in the speech and how it pronounces dates, times, acronyms, and abbreviations.

For more details on the SSML elements supported by Cloud Text-to-Speech API, see the SSML reference.
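
To illustrate how SSML can control pauses, the following sketch joins plain-text sentences into a single SSML document with a break element between them. The helper name build_ssml and the default pause length are hypothetical. It escapes characters such as & and < so that the result stays well-formed XML.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=500):
    """Join plain-text sentences into one SSML string with pauses between them."""
    # <break> inserts a pause of the given duration between sentences.
    pause = '<break time="{}ms"/>'.format(int(pause_ms))
    # Escape &, <, and > so user text cannot break the XML structure.
    escaped = [escape(sentence) for sentence in sentences]
    return "<speak>" + pause.join(escaped) + "</speak>"
```

The resulting string can be used as the SSML input in any of the samples in this section.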

Protocol

Refer to the text:synthesize API endpoint for complete details.

To synthesize audio from SSML, make an HTTP POST request to the text:synthesize endpoint. In the body of your POST request, specify the type of voice to synthesize in the voice configuration section, specify the SSML to synthesize in the ssml field of the input section, and specify the type of audio to create in the audioConfig section.
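
Because the ssml field must contain well-formed markup, it can help to run a quick local check before sending the request. The sketch below is one possible approach using Python's standard XML parser; the function name check_ssml is hypothetical, and the check only covers XML well-formedness and the speak root element, not whether each element is supported by the service.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml):
    """Return (ok, message) after a basic well-formedness check of an SSML string."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return False, "not well-formed XML: {}".format(exc)
    # Assumes the document does not declare an XML namespace on <speak>;
    # with a namespace, root.tag would carry a "{namespace}" prefix.
    if root.tag != "speak":
        return False, "root element is <{}>, expected <speak>".format(root.tag)
    return True, "ok"
```

A failed check returns a message describing the first problem found, which is usually easier to read than an error returned by the API.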

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the results to a file named synthesize-ssml.txt.

curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  -H "Content-Type: application/json; charset=utf-8" --data "{
    'input':{
     'ssml':'<speak>The <say-as interpret-as=\"characters\">SSML</say-as> standard
          is defined by the <sub alias=\"World Wide Web Consortium\">W3C</sub>.</speak>'
    },
    'voice':{
      'languageCode':'en-us',
      'name':'en-US-Standard-B',
      'ssmlGender':'MALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-ssml.txt

The Text-to-Speech API returns the synthesized audio as base64-encoded data contained in the JSON output. The JSON output in the synthesize-ssml.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the results from the Text-to-Speech API as an MP3 audio file, run the following command from the same directory as the synthesize-ssml.txt file.

sed 's|audioContent| |' < synthesize-ssml.txt > tmp-output.txt && \
tr -d '\n ":{}' < tmp-output.txt > tmp-output-2.txt && \
base64 tmp-output-2.txt --decode > synthesize-ssml-audio.mp3 && \
rm tmp-output*.txt

C#

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

/// <summary>
/// Creates an audio file from the SSML-formatted string.
/// </summary>
/// <param name="ssml">SSML string to synthesize</param>
/// <remarks>
/// Generates a file named 'output.mp3' in project folder.
/// Note: SSML must be well-formed according to:
///    https://www.w3.org/TR/speech-synthesis/
/// </remarks>
public static void SynthesizeSSML(string ssml)
{
    TextToSpeechClient client = TextToSpeechClient.Create();
    var response = client.SynthesizeSpeech(new SynthesizeSpeechRequest
    {
        Input = new SynthesisInput
        {
            Ssml = ssml
        },
        // Note: voices can also be specified by name
        Voice = new VoiceSelectionParams
        {
            LanguageCode = "en-US",
            SsmlGender = SsmlVoiceGender.Female
        },
        AudioConfig = new AudioConfig
        {
            AudioEncoding = AudioEncoding.Mp3
        }
    });

    using (Stream output = File.Create("output.mp3"))
    {
        response.AudioContent.WriteTo(output);
    }
}

Go

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

// SynthesizeSSML synthesizes ssml and saves the output to outputFile.
//
// ssml must be well-formed according to:
//   https://www.w3.org/TR/speech-synthesis/
// Example: <speak>Hello there.</speak>
func SynthesizeSSML(w io.Writer, ssml, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	// Release the client's underlying connections when the function returns.
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Ssml{Ssml: ssml},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

/**
 * Demonstrates using the Text-to-Speech client to synthesize SSML.
 *
 * <p>Note: ssml must be well-formed according to https://www.w3.org/TR/speech-synthesis/.
 * Example: <speak>Hello there.</speak>
 *
 * @param ssml the ssml document to be synthesized. (e.g., "<?xml...")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static void synthesizeSsml(String ssml) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the ssml input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setSsml(ssml).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en-US"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
    }
  }
}

Node.js

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const ssml = '<speak>Hello there.</speak>';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

const request = {
  input: {ssml: ssml},
  voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
  audioConfig: {audioEncoding: 'MP3'},
};

client.synthesizeSpeech(request, (err, response) => {
  if (err) {
    console.error('ERROR:', err);
    return;
  }

  fs.writeFile(outputFile, response.audioContent, 'binary', err => {
    if (err) {
      console.error('ERROR:', err);
      return;
    }
    console.log(`Audio content written to file: ${outputFile}`);
  });
});

Python

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

def synthesize_ssml(ssml):
    """Synthesizes speech from the input string of ssml.

    Note: ssml must be well-formed according to:
        https://www.w3.org/TR/speech-synthesis/

    Example: <speak>Hello there.</speak>
    """
    from google.cloud import texttospeech
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.types.SynthesisInput(ssml=ssml)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.types.VoiceSelectionParams(
        language_code='en-US',
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)

    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.MP3)

    response = client.synthesize_speech(input_text, voice, audio_config)

    # The response's audio_content is binary.
    with open('output.mp3', 'wb') as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Ruby

For more on installing and creating a Text-to-Speech API client, refer to Text-to-Speech API Client Libraries.

require "google/cloud/text_to_speech"

client = Google::Cloud::TextToSpeech.new

input_text = { ssml: ssml }

# Note: the voice can also be specified by name.
# Names of voices can be retrieved with client.list_voices
voice = {
  language_code: "en-US",
  ssml_gender:   "FEMALE"
}

audio_config = { audio_encoding: "MP3" }

response = client.synthesize_speech input_text, voice, audio_config

# The response's audio_content is binary.
File.open("output.mp3", "wb") do |file|
  # Write the response to the output file.
  file.write(response.audio_content)
end

puts "Audio content written to file 'output.mp3'"
