Speech Synthesis Markup Language (SSML) reference (Beta)

Text-to-Speech supports a number of SSML Beta features in addition to the Text-to-Speech standard SSML elements. For more information on how to customize your synthesized speech results using SSML elements, see the Text-to-Speech SSML tutorial and SSML reference documentation.

Summary of supported Beta SSML features:

  • <phoneme>: Customize the pronunciation of specific words.
  • <say-as interpret-as="duration">: Specify durations.
  • <voice>: Switch between voices in the same request.
  • <lang>: Use multiple languages in the same request.
  • Timepoints: Use the <mark> tag to return the timepoint of a specified point in your transcript.

<phoneme>

You can use the <phoneme> tag to produce custom pronunciations of words inline. Text-to-Speech accepts the IPA and X-SAMPA phonetic alphabets. See the phonemes page for a list of supported languages and phonemes.

Each application of the <phoneme> tag directs the pronunciation of a single word:

  <phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>
  <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>

Stress markers

There are up to three levels of stress that can be placed in a transcription:

  1. Primary stress: Denoted with /ˈ/ in IPA and /"/ in X-SAMPA.
  2. Secondary stress: Denoted with /ˌ/ in IPA and /%/ in X-SAMPA.
  3. Unstressed: Not denoted with a symbol (in either notation).

Some languages might have fewer than three levels, or might not denote stress placement at all. See the phonemes page for the stress levels available in your language. Stress markers are placed at the start of each stressed syllable. For example, in US English:

Example word   IPA            X-SAMPA
water          ˈwɑːtɚ         "wA:t@`
underwater     ˌʌndɚˈwɑːtɚ    %Vnd@`"wA:t@`
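
For example, the transcriptions above can be passed to the <phoneme> tag directly. As in the earlier example, the X-SAMPA value is quoted with single quotes because it contains a double quote:

  <phoneme alphabet="ipa" ph="ˌʌndɚˈwɑːtɚ">underwater</phoneme>
  <phoneme alphabet="x-sampa" ph='%Vnd@`"wA:t@`'>underwater</phoneme>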

Broad versus narrow transcriptions

As a general rule, keep your transcriptions broad and phonemic rather than narrow. For example, in US English, transcribe intervocalic /t/ as /t/ (instead of using a tap):

Example word   IPA                       X-SAMPA
butter         ˈbʌtɚ instead of ˈbʌɾɚ    "bVt@` instead of "bV4@`

There are some instances where using the phonemic representation makes your TTS results sound unnatural (for example, if the sequence of phonemes is anatomically difficult to pronounce).

One example of this is voicing assimilation for /s/ in English. In this case, the assimilation should be reflected in the transcription:

Example word   IPA                        X-SAMPA
cats           ˈkæts                      "k{ts
dogs           ˈdɑːgz instead of ˈdɑːgs   "dA:gz instead of "dA:gs

Reduction

Every syllable must contain one (and only one) vowel. This means that you should avoid syllabic consonants and instead transcribe them with a reduced vowel. For example:

Example word   IPA                        X-SAMPA
kitten         ˈkɪtən instead of ˈkɪtn    "kIt@n instead of "kItn
kettle         ˈkɛtəl instead of ˈkɛtl    "kEt@l instead of "kEtl

Syllabification

You can optionally specify syllable boundaries by using /./. Each syllable must contain one (and only one) vowel. For example:

Example word   IPA                  X-SAMPA
readability    ˌɹiː.də.ˈbɪ.lə.tiː   %r\i:.d@."bI.l@.ti:
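
Passed inline with the same <phoneme> syntax as above, this looks like:

  <phoneme alphabet="ipa" ph="ˌɹiː.də.ˈbɪ.lə.tiː">readability</phoneme>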

Durations

Text-to-Speech supports <say-as interpret-as="duration"> to correctly read durations. For example, the following SSML is verbalized as "five hours and thirty minutes":

<say-as interpret-as="duration" format="h:m">5:30</say-as>

The format string supports the following values:

Abbreviation   Value
h              hours
m              minutes
s              seconds
ms             milliseconds
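
The documented example above uses format="h:m". Assuming other format strings compose from the same abbreviations, a sketch like the following should be verbalized as "five minutes and thirty seconds":

<say-as interpret-as="duration" format="m:s">5:30</say-as>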

<voice>

The <voice> tag allows you to use more than one voice in a single SSML request. In the following example, the default voice is an English male voice. All words will be synthesized in this voice except for "qu'est-ce qui t'amène ici", which will be verbalized in French using a female voice instead of the default language (English) and gender (male).

<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui
t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>

Alternatively, you can use a <voice> tag to specify an individual voice (the voice name on the supported voices page) rather than specifying a language and/or gender:

<speak>The dog is friendly<voice name="fr-CA-Wavenet-B">mais le chat est
mignon</voice><break time="250ms"/> said a pet shop owner</speak>

When you use the <voice> tag, Text-to-Speech expects to receive either a name (the name of the voice you want to use) or a combination of the following attributes. All three attributes are optional, but you must provide at least one if you don't provide a name.

  • gender: One of "male", "female" or "neutral".
  • variant: Used as a tiebreaker in cases where there are multiple possibilities of which voice to use based on your configuration.
  • language: Your desired language. Only one language can be specified in a given <voice> tag. Specify your language in BCP-47 format. You can find the BCP-47 code for your language in the language code column on the supported voices and languages page.
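
For example, the following sketch selects a voice by attributes alone, without a name; the variant value here is illustrative:

<speak><voice language="en-GB" gender="female" variant="2">It is a lovely
day.</voice></speak>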

You can also control the relative priority of the gender, variant, and language attributes using two additional attributes: required and ordering.

  • required: If an attribute is designated as required and not configured properly, the request will fail.
  • ordering: Any attributes listed after an ordering tag are considered as preferred attributes rather than required. The Text-to-Speech API considers preferred attributes on a best effort basis in the order they are listed after the ordering tag. If any preferred attributes are configured incorrectly, Text-to-Speech might still return a valid voice but with the incorrect configuration dropped.

Examples of configurations using the required and ordering tags:

<speak>And there it was <voice language="en-GB" gender="male" required="gender"
ordering="gender language">a flying bird</voice> roaring in the skies for the
first time.</speak>

<speak>Today is supposed to be <voice language="en-GB" gender="female"
ordering="language gender">Sunday Funday.</voice></speak>

<lang>

You can use <lang> to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. The xml:lang string must contain the target language in BCP-47 format (this value is listed as "language code" in the supported voices table). In the following example, "chat" will be verbalized in French instead of the default language (English):

<speak>The french word for cat is <lang xml:lang="fr-FR">chat</lang></speak>

Text-to-Speech supports the <lang> tag on a best-effort basis. Not all language combinations produce the same quality results if specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle, or that is perceived as negative. Known issues:

  • Japanese with Kanji characters is not supported by the <lang> tag. The input is transliterated and read as Chinese characters.
  • Semitic languages such as Arabic, Hebrew, and Persian are not supported by the <lang> tag and will result in silence. If you want to use any of these languages we recommend using the <voice> tag to switch to a voice that speaks your desired language (if available).
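
For example, the following sketch reads the Arabic word for cat by switching voices instead of using <lang>, assuming an Arabic voice (language code ar-XA) is available:

<speak>The Arabic word for cat is <voice language="ar-XA"
gender="female">قطة</voice>.</speak>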

SSML timepoints

The Text-to-Speech API supports the use of timepoints in your created audio data. A timepoint is a timestamp (in seconds, measured from the beginning of the generated audio) that corresponds to a designated point in the script. You can set a timepoint in your script using the <mark> tag. When the audio is generated, the API then returns the time offset between the beginning of the audio and the timepoint.

There are two steps to setting a timepoint:

  1. Add a <mark> SSML tag to the point in the script that you want a timestamp for.
  2. Set TimepointType to SSML_MARK. If this field is not set, timepoints are not returned by default.

The following example returns two timepoints:

  • timepoint_1: Indicates the time (in seconds) that the word "Mark" appears in the generated audio.
  • timepoint_2: Indicates the time (in seconds) that the word "see" appears in the generated audio.

<speak>Hello <mark name="timepoint_1"/> Mark. Good to <mark name="timepoint_2"/> see you.</speak>