Text-to-Speech supports a number of SSML Beta features in addition to the Text-to-Speech standard SSML elements. For more information on how to customize your synthesized speech results using SSML elements, see the Text-to-Speech SSML tutorial and SSML reference documentation.
Summary of supported Beta SSML features:
- <phoneme>: Customize the pronunciation of specific words.
- <say-as interpret-as="duration">: Specify durations.
- <voice>: Switch between voices in the same request.
- <lang>: Use multiple languages in the same request.
- Timepoints: Use the <mark> tag to return the timepoint of a specified point in your transcript.
You can use the <phoneme> tag to produce custom pronunciations of words inline. Text-to-Speech accepts the IPA and X-SAMPA phonetic alphabets. See the phonemes page for a list of supported languages.
Each application of the <phoneme> tag directs the pronunciation of a single word:
<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme> <phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>
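As an illustration of the examples above, a small helper can assemble `<phoneme>` elements safely. The sketch below is ours, not part of any Text-to-Speech API; it uses Python's standard library to quote the ph attribute, which matters because X-SAMPA transcriptions routinely contain quote characters:

```python
from xml.sax.saxutils import escape, quoteattr

def phoneme_tag(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Build a <phoneme> SSML element (illustrative helper, not an official API).

    quoteattr picks single quotes when the transcription contains
    double quotes, as X-SAMPA primary-stress marks do.
    """
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )

print(phoneme_tag("manitoba", "ˌmænɪˈtoʊbə"))
print(phoneme_tag("mahogany", 'm@"hA:g@%ni:', alphabet="x-sampa"))
```

This reproduces both examples above, including the single-quoted ph attribute in the X-SAMPA case.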
There are up to three levels of stress that can be placed in a transcription:
- Primary stress: Denoted with /ˈ/ in IPA and /"/ in X-SAMPA.
- Secondary stress: Denoted with /ˌ/ in IPA and /%/ in X-SAMPA.
- Unstressed: Not denoted with a symbol (in either notation).
Some languages might have fewer than three levels, or might not denote stress placement at all. See the phonemes page for the stress levels available in your language. Stress markers are placed at the start of each stressed syllable.
Broad vs Narrow Transcriptions
As a general rule, keep your transcriptions more broad and phonemic in nature. For example, in US English, transcribe intervocalic /t/ (instead of using a tap):
| Word | IPA | X-SAMPA |
| --- | --- | --- |
| butter | ˈbʌtɚ instead of ˈbʌɾɚ | "bVt@` instead of "bV4@` |
There are some instances where using the phonemic representation makes your TTS results sound unnatural (for example, if the sequence of phonemes is anatomically difficult to pronounce).
One example of this is voicing assimilation for /s/ in English. In this case, the assimilation should be reflected in the transcription:
| Word | IPA | X-SAMPA |
| --- | --- | --- |
| dogs | ˈdɑːgz instead of ˈdɑːgs | "dA:gz instead of "dA:gs |
Every syllable must contain one (and only one) vowel. This means that you should avoid syllabic consonants and instead transcribe them with a reduced vowel. For example:
| Word | IPA | X-SAMPA |
| --- | --- | --- |
| kitten | ˈkɪtən instead of ˈkɪtn | "kIt@n instead of "kItn |
| kettle | ˈkɛtəl instead of ˈkɛtl | "kEt@l instead of "kEtl |
You can optionally specify syllable boundaries by using /./. Each syllable must contain one (and only one) vowel.
Text-to-Speech supports <say-as interpret-as="duration"> to correctly read durations. For example, the following SSML would be verbalized as "five hours and thirty minutes":
<say-as interpret-as="duration" format="h:m">5:30</say-as>
The format string controls how each component of the value is read; in the example above, h:m reads 5:30 as hours and minutes.
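As a sketch, a trivial helper (the function name duration_ssml is ours, not part of any client library) can emit the element shown above; the format value h:m is the one demonstrated in this document:

```python
def duration_ssml(value: str, fmt: str = "h:m") -> str:
    """Wrap a clock-style value in a <say-as interpret-as="duration"> element.

    Illustrative helper; fmt must be a format string the
    Text-to-Speech API accepts (h:m is shown in the docs).
    """
    return f'<say-as interpret-as="duration" format="{fmt}">{value}</say-as>'

print(duration_ssml("5:30"))
```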
The <voice> tag allows you to use more than one voice in a single SSML
request. In the following example, the default voice is an English male voice.
All words will be synthesized in this voice except for "qu'est-ce qui t'amène
ici", which will be verbalized in French using a female voice instead of the
default language (English) and gender (male).
<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>
Alternatively, you can use a <voice> tag to specify an individual voice (the voice name on the supported voices page) rather than specifying a language and gender:
<speak>The dog is friendly<voice name="fr-CA-Wavenet-B">mais le chat est mignon</voice><break time="250ms"/> said a pet shop owner</speak>
When you use the <voice> tag, Text-to-Speech expects to receive either a name (the name of the voice you want to use) or a combination of the following attributes. All three attributes are optional, but you must provide at least one if you don't provide a name:
- gender: One of "male", "female", or "neutral".
- variant: Used as a tiebreaker in cases where there are multiple possibilities of which voice to use based on your configuration.
- language: Your desired language. Only one language can be specified in a given <voice> tag. Specify your language in BCP-47 format. You can find the BCP-47 code for your language in the language code column on the supported voices and languages page.
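The rule above ("name, or at least one of the three attributes") can be sketched in a few lines of Python. The helper name voice_tag is ours and not part of the Text-to-Speech client library; it only assembles SSML text:

```python
from xml.sax.saxutils import escape

def voice_tag(text, name=None, language=None, gender=None, variant=None):
    """Build a <voice> element (illustrative helper, not an official API).

    Mirrors the documented rule: give a voice name, or at least one
    of language, gender, variant.
    """
    if name is not None:
        attrs = [f'name="{name}"']
    else:
        attrs = [
            f'{key}="{val}"'
            for key, val in (("language", language),
                             ("gender", gender),
                             ("variant", variant))
            if val is not None
        ]
        if not attrs:
            raise ValueError("provide name, or at least one of language, gender, variant")
    return f"<voice {' '.join(attrs)}>{escape(text)}</voice>"

print(voice_tag("qu'est-ce qui t'amène ici", language="fr-FR", gender="female"))
```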
You can also control the relative priority of each of these attributes using two additional attributes:
- required: If an attribute is designated as required and not configured properly, the request will fail.
- ordering: Any attributes listed after an ordering attribute are considered preferred attributes rather than required. The Text-to-Speech API considers preferred attributes on a best-effort basis, in the order they are listed after the ordering attribute. If any preferred attributes are configured incorrectly, Text-to-Speech might still return a valid voice but with the incorrect configuration dropped.
Examples of configurations using the required and ordering attributes:
<speak>And there it was <voice language="en-GB" gender="male" required="gender" ordering="gender language">a flying bird </voice>roaring in the skies for the first time.</speak>
<speak>Today is supposed to be <voice language="en-GB" gender="female" ordering="language gender">Sunday Funday.</voice></speak>
You can use <lang> to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. The xml:lang string must contain the target language in BCP-47 format (this value is listed as "language code" in the supported voices table). In the following example, "chat" will be verbalized in French instead of the default language (English):
<speak>The French word for cat is <lang xml:lang="fr-FR">chat</lang></speak>
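A language-span helper in the same spirit as the earlier sketches (lang_span is our name, not an API function) shows how the xml:lang attribute carries the BCP-47 code:

```python
from xml.sax.saxutils import escape

def lang_span(text: str, bcp47: str) -> str:
    """Wrap text in a <lang> element targeting a BCP-47 language code.

    Illustrative helper; escape keeps the embedded text XML-safe.
    """
    return f'<lang xml:lang="{bcp47}">{escape(text)}</lang>'

ssml = f"<speak>The French word for cat is {lang_span('chat', 'fr-FR')}</speak>"
print(ssml)
```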
Text-to-Speech supports the <lang> tag on a best-effort basis. Not all language combinations produce the same quality results if specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle, or that is perceived as negative. Known issues:
- Japanese with Kanji characters is not supported by the <lang> tag. The input is transliterated and read as Chinese characters.
- Arabic, Hebrew, Persian, and similar languages are not supported by the <lang> tag and will result in silence. If you want to use any of these languages, we recommend using the <voice> tag to switch to a voice that speaks your desired language (if available).
The Text-to-Speech API supports the use of timepoints in your created audio
data. A timepoint is a timestamp (in seconds, measured from the beginning of
the generated audio) that corresponds to a designated point in the script. You
can set a timepoint in your script using the
<mark> tag. When the audio is
generated, the API then returns the time offset between the beginning of the
audio and the timepoint.
There are two steps to setting a timepoint:
- Add a <mark> SSML tag at the point in the script that you want a timestamp for.
- Set TimepointType to SSML_MARK. If this field is not set, timepoints are not returned by default.
The following example returns two timepoints:
- timepoint_1: Indicates the time (in seconds) that the word "Mark" appears in the generated audio.
- timepoint_2: Indicates the time (in seconds) that the word "see" appears in the generated audio.
<speak>Hello <mark name="timepoint_1"/> Mark. Good to <mark name="timepoint_2"/> see you.</speak>
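The mark-insertion step above can be automated with plain string handling. The sketch below (add_marks and its timepoint_N naming are our illustrative choices, not part of the API; the API only requires that each mark have a unique name) produces exactly the SSML shown in the example:

```python
def add_marks(words_to_mark, text):
    """Insert a <mark/> tag before each listed word, numbering marks in order.

    Illustrative helper: splits on whitespace and strips trailing
    punctuation when matching, which is enough for this example.
    """
    out, counter = [], 0
    for token in text.split():
        if token.strip(".,!?") in words_to_mark:
            counter += 1
            out.append(f'<mark name="timepoint_{counter}"/>')
        out.append(token)
    return "<speak>" + " ".join(out) + "</speak>"

print(add_marks({"Mark", "see"}, "Hello Mark. Good to see you."))
```

Sending the resulting SSML with TimepointType set to SSML_MARK makes the API report the time offset, in seconds, at which each named mark occurs in the generated audio.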