Create voice audio files

Cloud Text-to-Speech lets you convert words and sentences into base64-encoded audio data of natural human speech. You then decode the base64 data to turn it into a playable audio file, such as an MP3. The Text-to-Speech API accepts input as raw text or as Speech Synthesis Markup Language (SSML).

This document describes how to create an audio file from text input or SSML input using Text-to-Speech. If you are unfamiliar with concepts such as speech synthesis or SSML, see the article Text-to-Speech basics.

These samples require that gcloud is set up and that a service account has been created or activated. For information about setting up gcloud and about creating and selecting a service account, see Quickstart: Text-to-Speech.

Convert text to synthetic voice audio

The following code samples show how to convert a string into audio data.

You can configure the output of speech synthesis in several ways. For example, you can select a specific voice or adjust the pitch, volume, speaking rate, and sample rate of the output.
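As a rough illustration of these options, the following Python sketch assembles an audioConfig section with only the settings that were explicitly chosen. The field names (speakingRate, pitch, volumeGainDb, sampleRateHertz) are taken from the AudioConfig message of the v1 REST API as we understand it, so check them against the API reference. The helper name build_audio_config and the sample values are hypothetical.

```python
import json


def build_audio_config(encoding="MP3", speaking_rate=None, pitch=None,
                       volume_gain_db=None, sample_rate_hertz=None):
    """Return an audioConfig dict that contains only the options that were set.

    build_audio_config is a hypothetical helper, not part of any client library.
    """
    config = {"audioEncoding": encoding}
    optional = {
        "speakingRate": speaking_rate,         # speaking rate
        "pitch": pitch,                        # pitch adjustment
        "volumeGainDb": volume_gain_db,        # volume adjustment
        "sampleRateHertz": sample_rate_hertz,  # sample rate
    }
    # Leave out unset options so the service applies its own defaults.
    config.update({key: value for key, value in optional.items() if value is not None})
    return config


# Illustrative values only: slightly faster speech at a lower pitch.
audio_config = build_audio_config(speaking_rate=1.25, pitch=-2.0)
print(json.dumps(audio_config, indent=2))
```

The resulting dictionary can be used as the audioConfig section of a synthesis request body.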

Protocol

For complete details, see the text:synthesize API endpoint.

To synthesize audio from text, make an HTTP POST request to the text:synthesize endpoint. In the body of the POST request, specify the type of voice to synthesize in the voice configuration section, the text to synthesize in the text field of the input section, and the type of audio to create in the audioConfig section.

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the results to a file named synthesize-text.txt.

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt
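The request body used in the command above can also be assembled programmatically, which avoids hand-written quoting. The following Python sketch builds the same three sections (input, voice, audioConfig) and serializes them as JSON. It does not call the API, and the helper name build_text_request is hypothetical.

```python
import json


def build_text_request(text, language_code="en-GB", voice_name="en-GB-Standard-A"):
    """Build a text:synthesize request body for plain-text input.

    build_text_request is a hypothetical helper used only for illustration.
    """
    return {
        # The text to synthesize goes in the text field of the input section.
        "input": {"text": text},
        # The voice section selects the type of voice to synthesize.
        "voice": {
            "languageCode": language_code,
            "name": voice_name,
            "ssmlGender": "FEMALE",
        },
        # The audioConfig section selects the type of audio to create.
        "audioConfig": {"audioEncoding": "MP3"},
    }


body = json.dumps(
    build_text_request("Android is a mobile operating system developed by Google.")
)
print(body)
```

One way to use the output is to save it to a file such as request.json (an assumed file name) and pass it to curl with --data @request.json.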

The Text-to-Speech API returns the synthesized audio as base64-encoded data in the JSON output. The JSON output in the synthesize-text.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the results from the Text-to-Speech API as an MP3 audio file, run the following command from the same directory as the synthesize-text.txt file.

cat synthesize-text.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-text-audio.mp3 && \
rm tmp.txt
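The decoding step above can also be done without shell text processing. The following Python sketch parses the JSON response, reads the audioContent field, and decodes it from base64. The function names are hypothetical, and the demo uses a stand-in response instead of real API output so that it runs offline.

```python
import base64
import json


def decode_audio_content(response_json):
    """Extract and base64-decode the audioContent field of a synthesize response."""
    payload = json.loads(response_json)
    return base64.b64decode(payload["audioContent"])


def save_audio(json_path, audio_path):
    """Decode the response stored at json_path and write the audio bytes to audio_path."""
    with open(json_path, "r", encoding="utf-8") as source:
        audio_bytes = decode_audio_content(source.read())
    with open(audio_path, "wb") as target:
        target.write(audio_bytes)
    return len(audio_bytes)


# Stand-in response so the sketch runs without calling the API.
fake_response = json.dumps(
    {"audioContent": base64.b64encode(b"demo audio bytes").decode("ascii")}
)
decoded = decode_audio_content(fake_response)
print(len(decoded), "bytes decoded")
```

With a real response, calling save_audio("synthesize-text.txt", "synthesize-text-audio.mp3") should produce the same MP3 file as the shell command.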

C#

/// <summary>
/// Creates an audio file from the text input.
/// </summary>
/// <param name="text">Text to synthesize into audio</param>
/// <remarks>
/// Generates a file named 'output.mp3' in project folder.
/// </remarks>
public static void SynthesizeText(string text)
{
    TextToSpeechClient client = TextToSpeechClient.Create();
    var response = client.SynthesizeSpeech(new SynthesizeSpeechRequest
    {
        Input = new SynthesisInput
        {
            Text = text
        },
        // Note: voices can also be specified by name
        Voice = new VoiceSelectionParams
        {
            LanguageCode = "en-US",
            SsmlGender = SsmlVoiceGender.Female
        },
        AudioConfig = new AudioConfig
        {
            AudioEncoding = AudioEncoding.Mp3
        }
    });

    using (Stream output = File.Create("output.mp3"))
    {
        response.AudioContent.WriteTo(output);
    }
}

Go


// SynthesizeText synthesizes plain text and saves the output to outputFile.
func SynthesizeText(w io.Writer, text, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	// Release the client's resources when the function returns.
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

/**
 * Demonstrates using the Text to Speech client to synthesize text or ssml.
 *
 * @param text the raw text to be synthesized. (e.g., "Hello there!")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static ByteString synthesizeText(String text) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the text input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setText(text).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en_us"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
      return audioContents;
    }
  }
}

Node.js

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const text = 'Text to synthesize, eg. hello';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

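async function synthesizeText() {
  // Construct the request.
  const request = {
    input: {text: text},
    // Select the language and the SSML voice gender.
    voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
    // Select the type of audio encoding.
    audioConfig: {audioEncoding: 'MP3'},
  };

  // Perform the text-to-speech request.
  const [response] = await client.synthesizeSpeech(request);
  // Write the binary audio content to a local file.
  const writeFile = util.promisify(fs.writeFile);
  await writeFile(outputFile, response.audioContent, 'binary');
  console.log(`Audio content written to file: ${outputFile}`);
}

synthesizeText().catch(console.error);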

PHP

use Google\Cloud\TextToSpeech\V1\AudioConfig;
use Google\Cloud\TextToSpeech\V1\AudioEncoding;
use Google\Cloud\TextToSpeech\V1\SsmlVoiceGender;
use Google\Cloud\TextToSpeech\V1\SynthesisInput;
use Google\Cloud\TextToSpeech\V1\TextToSpeechClient;
use Google\Cloud\TextToSpeech\V1\VoiceSelectionParams;

/** Uncomment and populate these variables in your code */
// $text = 'Text to synthesize';

// create client object
$client = new TextToSpeechClient();

$input_text = (new SynthesisInput())
    ->setText($text);

// note: the voice can also be specified by name
// names of voices can be retrieved with $client->listVoices()
$voice = (new VoiceSelectionParams())
    ->setLanguageCode('en-US')
    ->setSsmlGender(SsmlVoiceGender::FEMALE);

$audioConfig = (new AudioConfig())
    ->setAudioEncoding(AudioEncoding::MP3);

$response = $client->synthesizeSpeech($input_text, $voice, $audioConfig);
$audioContent = $response->getAudioContent();

file_put_contents('output.mp3', $audioContent);
print('Audio content written to "output.mp3"' . PHP_EOL);

$client->close();

Python

def synthesize_text(text):
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Ruby

require "google/cloud/text_to_speech"

client = Google::Cloud::TextToSpeech.text_to_speech

input_text = { text: text }

# Note: the voice can also be specified by name.
# Names of voices can be retrieved with client.list_voices
voice = {
  language_code: "en-US",
  ssml_gender:   "FEMALE"
}

audio_config = { audio_encoding: "MP3" }

response = client.synthesize_speech(
  input:        input_text,
  voice:        voice,
  audio_config: audio_config
)

# The response's audio_content is binary.
File.open output_file, "wb" do |file|
  # Write the response to the output file.
  file.write response.audio_content
end

puts "Audio content written to file '#{output_file}'"

Convert SSML to synthetic voice audio

Using SSML in your audio synthesis request can produce output that is closer to natural human speech. In particular, SSML gives you finer control over pauses in speech and over how dates, times, acronyms, and abbreviations are pronounced.
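To make this concrete, the following Python sketch composes a small SSML document from helper functions and checks that it is well-formed XML. It uses the say-as and sub elements that appear in the request example in this document, plus the break element from the W3C SSML specification for a pause. The helper names are hypothetical, and the sketch only builds a string. It does not call the API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def spell_out(text):
    """Wrap text so that it is read character by character, for example an acronym."""
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


def substitute(text, alias):
    """Wrap text so that the alias is spoken instead, for example an abbreviation."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def pause(milliseconds):
    """Insert a pause of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


ssml = (
    "<speak>The " + spell_out("SSML") + " standard" + pause(300)
    + " is defined by the " + substitute("W3C", "World Wide Web Consortium")
    + ".</speak>"
)

# Parsing raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
ET.fromstring(ssml)
print(ssml)
```

Escaping the text content matters because characters such as & or < would otherwise make the SSML invalid.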

For more information about the SSML elements supported by the Text-to-Speech API, see the SSML reference.

Protocol

For complete details, see the text:synthesize API endpoint.

To synthesize audio from SSML, make an HTTP POST request to the text:synthesize endpoint. In the body of the POST request, specify the type of voice to synthesize in the voice configuration section, the SSML to synthesize in the ssml field of the input section, and the type of audio to create in the audioConfig section.

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the results to a file named synthesize-ssml.txt.

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" --data "{
    'input':{
     'ssml':'<speak>The <say-as interpret-as=\"characters\">SSML</say-as> standard
          is defined by the <sub alias=\"World Wide Web Consortium\">W3C</sub>.</speak>'
    },
    'voice':{
      'languageCode':'en-us',
      'name':'en-US-Standard-B',
      'ssmlGender':'MALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-ssml.txt
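In the command above, the double quotes inside the SSML attributes have to be escaped by hand. As a sketch of an alternative, the following Python snippet puts the same SSML into the ssml field of the input section and lets json.dumps handle the escaping. The variable names are illustrative, and nothing is sent to the API.

```python
import json

# The same SSML as in the request above, written without manual escaping.
ssml = (
    '<speak>The <say-as interpret-as="characters">SSML</say-as> standard '
    'is defined by the <sub alias="World Wide Web Consortium">W3C</sub>.</speak>'
)

request_body = {
    # SSML input goes in the ssml field instead of the text field.
    "input": {"ssml": ssml},
    "voice": {
        "languageCode": "en-US",
        "name": "en-US-Standard-B",
        "ssmlGender": "MALE",
    },
    "audioConfig": {"audioEncoding": "MP3"},
}

# json.dumps escapes the embedded double quotes in the SSML string.
encoded = json.dumps(request_body)
print(encoded)
```

Decoding the serialized body returns the original SSML unchanged, so the markup survives the round trip through JSON.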

The Text-to-Speech API returns the synthesized audio as base64-encoded data in the JSON output. The JSON output in the synthesize-ssml.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the results from the Text-to-Speech API as an MP3 audio file, run the following command from the same directory as the synthesize-ssml.txt file.

cat synthesize-ssml.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-ssml-audio.mp3 && \
rm tmp.txt

C#

/// <summary>
/// Creates an audio file from the SSML-formatted string.
/// </summary>
/// <param name="ssml">SSML string to synthesize</param>
/// <remarks>
/// Generates a file named 'output.mp3' in project folder.
/// Note: SSML must be well-formed according to:
///    https://www.w3.org/TR/speech-synthesis/
/// </remarks>
public static void SynthesizeSSML(string ssml)
{
    var client = TextToSpeechClient.Create();
    var response = client.SynthesizeSpeech(new SynthesizeSpeechRequest
    {
        Input = new SynthesisInput
        {
            Ssml = ssml
        },
        // Note: voices can also be specified by name
        Voice = new VoiceSelectionParams
        {
            LanguageCode = "en-US",
            SsmlGender = SsmlVoiceGender.Female
        },
        AudioConfig = new AudioConfig
        {
            AudioEncoding = AudioEncoding.Mp3
        }
    });

    using (Stream output = File.Create("output.mp3"))
    {
        response.AudioContent.WriteTo(output);
    }
}

Go


// SynthesizeSSML synthesizes ssml and saves the output to outputFile.
//
// ssml must be well-formed according to:
//   https://www.w3.org/TR/speech-synthesis/
// Example: <speak>Hello there.</speak>
func SynthesizeSSML(w io.Writer, ssml, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	// Release the client's resources when the function returns.
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Ssml{Ssml: ssml},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

/**
 * Demonstrates using the Text to Speech client to synthesize text or ssml.
 *
 * <p>Note: ssml must be well-formed according to: https://www.w3.org/TR/speech-synthesis/
 * Example: <speak>Hello there.</speak>
 *
 * @param ssml the ssml document to be synthesized. (e.g., "<?xml...")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static ByteString synthesizeSsml(String ssml) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the ssml input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setSsml(ssml).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en_us"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
      return audioContents;
    }
  }
}

Node.js

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const ssml = '<speak>Hello there.</speak>';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

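async function synthesizeSsml() {
  // Construct the request.
  const request = {
    input: {ssml: ssml},
    // Select the language and the SSML voice gender.
    voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
    // Select the type of audio encoding.
    audioConfig: {audioEncoding: 'MP3'},
  };

  // Perform the text-to-speech request.
  const [response] = await client.synthesizeSpeech(request);
  // Write the binary audio content to a local file.
  const writeFile = util.promisify(fs.writeFile);
  await writeFile(outputFile, response.audioContent, 'binary');
  console.log(`Audio content written to file: ${outputFile}`);
}

synthesizeSsml().catch(console.error);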

PHP

use Google\Cloud\TextToSpeech\V1\AudioConfig;
use Google\Cloud\TextToSpeech\V1\AudioEncoding;
use Google\Cloud\TextToSpeech\V1\SsmlVoiceGender;
use Google\Cloud\TextToSpeech\V1\SynthesisInput;
use Google\Cloud\TextToSpeech\V1\TextToSpeechClient;
use Google\Cloud\TextToSpeech\V1\VoiceSelectionParams;

/** Uncomment and populate these variables in your code */
// $ssml = 'SSML to synthesize';

// create client object
$client = new TextToSpeechClient();

$input_text = (new SynthesisInput())
    ->setSsml($ssml);

// note: the voice can also be specified by name
// names of voices can be retrieved with $client->listVoices()
$voice = (new VoiceSelectionParams())
    ->setLanguageCode('en-US')
    ->setSsmlGender(SsmlVoiceGender::FEMALE);

$audioConfig = (new AudioConfig())
    ->setAudioEncoding(AudioEncoding::MP3);

$response = $client->synthesizeSpeech($input_text, $voice, $audioConfig);
$audioContent = $response->getAudioContent();

file_put_contents('output.mp3', $audioContent);
print('Audio content written to "output.mp3"' . PHP_EOL);

$client->close();

Python

def synthesize_ssml(ssml):
    """Synthesizes speech from the input string of ssml.

    Note: ssml must be well-formed according to:
        https://www.w3.org/TR/speech-synthesis/

    Example: <speak>Hello there.</speak>
    """
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(ssml=ssml)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text, voice=voice, audio_config=audio_config
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

Ruby

require "google/cloud/text_to_speech"

client = Google::Cloud::TextToSpeech.text_to_speech

input_text = { ssml: ssml }

# Note: the voice can also be specified by name.
# Names of voices can be retrieved with client.list_voices
voice = {
  language_code: "en-US",
  ssml_gender:   "FEMALE"
}

audio_config = { audio_encoding: "MP3" }

response = client.synthesize_speech(
  input:        input_text,
  voice:        voice,
  audio_config: audio_config
)

# The response's audio_content is binary.
File.open output_file, "wb" do |file|
  # Write the response to the output file.
  file.write response.audio_content
end

puts "Audio content written to file '#{output_file}'"