Overview
You can use the speech adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word "weather". When Speech-to-Text encounters the word "weather," you want it to transcribe the word as "weather" more often than "whether." In this case, you might use speech adaptation to bias Speech-to-Text toward recognizing "weather."
Speech adaptation is particularly helpful in the following use cases:
- Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.
- Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using speech adaptation.
- Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.
Optionally, you can fine-tune the biasing of the recognition model using the speech adaptation boost feature (Beta).
Improve recognition of specified words
To increase the probability that Speech-to-Text recognizes the word "weather" when it transcribes your audio data, pass "weather" in the phrases field of a SpeechContext object. Assign the SpeechContext object to the speechContexts field of the RecognitionConfig object in your request to the Speech-to-Text API.
The following snippet shows part of a JSON payload sent to the Speech-to-Text API. The JSON snippet provides the word "weather" for speech adaptation.
"config": { "encoding":"LINEAR16", "sampleRateHertz": 8000, "languageCode":"en-US", "speechContexts": [{ "phrases": ["weather"] }] }
Improve recognition of multi-word phrases
When you provide a multi-word phrase, Speech-to-Text is more likely to recognize those words in sequence. Providing a phrase also increases the probability of recognizing portions of the phrase, including individual words. See the content limits page for limits on the number and size of these phrases.
The following snippet shows part of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes an array of multi-word phrases assigned to the phrases field of a SpeechContext object.
"config": { "encoding":"LINEAR16", "sampleRateHertz": 8000, "languageCode":"en-US", "speechContexts": [{ "phrases": ["weather is hot", "weather is cold"] }] }
Improve recognition using classes
Classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class allows you to improve transcription accuracy for large groups of words that map to a common concept, but that don't always include identical words or phrases.
For example, suppose that your audio data includes recordings of people saying
their street address. You might have an audio recording of someone saying
"My house is 123 Main Street, the fourth house on the left." In this case, you
want Speech-to-Text to recognize the first sequence of numerals ("123")
as an address rather than as an ordinal ("one-hundred twenty-third"). However,
not all people live at "123 Main Street." It's impractical to list every
possible street address in a SpeechContext
object. Instead, you can use a
class to indicate that a street number should be recognized no matter what the
number actually is. In this example, Speech-to-Text could then more
accurately transcribe phrases like "123 Main Street" and "987 Grand Boulevard"
because they are both recognized as address numbers.
Class tokens
To use a class in speech adaptation, include a class token in the phrases field of the SpeechContext object. Refer to the list of supported class tokens to see which tokens are available for your language. For example, to improve the transcription of address numbers from your source audio, provide the value $ADDRESSNUM in your SpeechContext object.
You can use classes either as stand-alone items in the phrases array or embed one or more class tokens in longer multi-word phrases. For example, you can indicate an address number in a larger phrase by including the class token in a string: ["my address is $ADDRESSNUM"]. However, this phrase will not help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to additionally include the class token by itself: ["my address is $ADDRESSNUM", "$ADDRESSNUM"]. If you use an invalid or malformed class token, Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.
The following snippet shows an example of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes a SpeechContext object that uses a class token.
"config": { "encoding":"LINEAR16", "sampleRateHertz": 8000, "languageCode":"en-US", "speechContexts": [{ "phrases": ["$ADDRESSNUM"] }] }
Fine-tune transcription results using boost (Beta)
By default, speech adaptation provides a relatively small effect, especially for one-word phrases. The speech adaptation boost feature allows you to increase the recognition model bias by assigning more weight to some phrases than others. We recommend that you implement boost if 1) you have already implemented speech adaptation, and 2) you would like to further adjust the strength of speech adaptation effects on your transcription results. To see whether the boost feature is available for your language, see the language support page.
For example, suppose that you have many recordings of people asking about the "fare to get into the county fair," with the word "fair" occurring more frequently than "fare." In this case, you can use speech adaptation to increase the probability of the model recognizing both "fair" and "fare" by adding them as phrases in a SpeechContext object. This tells Speech-to-Text to recognize "fair" and "fare" more often than, for example, "hare" or "lair."
However, "fair" should be recognized more often than "fare" due to its more frequent appearances in the audio. You might have already transcribed your audio using the Speech-to-Text API and found a high number of errors recognizing the correct word ("fair"). In this case, you might want to use the boost feature to assign a higher boost value to "fair" than "fare". The higher weighted value assigned to "fair" biases the Speech-to-Text API toward picking "fair" more frequently than "fare". Without boost values, the recognition model will recognize "fare" and "fare" with equal probability.
Boost basics
When you use boost, you assign a weighted value to phrases items in a SpeechContext object. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
If you assign a boost value to a multi-word phrase, boost is applied to the entire phrase and only the entire phrase. For example, suppose that you want to assign a boost value to the phrase "My favorite exhibit at the American Museum of Natural History is the blue whale". If you add that phrase to a SpeechContext object and assign a boost value, the recognition model is more likely to recognize that phrase in its entirety, word-for-word.
If you don't get the results you're looking for by boosting a multi-word phrase, we suggest that you add all bigrams (two words, in order) that make up the phrase as additional phrases items and assign boost values to each. Continuing the above example, you could investigate adding additional bigrams and N-grams (more than two words), such as "my favorite", "my favorite exhibit", "favorite exhibit", "my favorite exhibit at the American Museum of Natural History", "American Museum of Natural History", "blue whale", and so on. The Speech-to-Text recognition model is then more likely to recognize related phrases in your audio that contain parts of the original boosted phrase but don't match it word-for-word.
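One way to generate those shorter phrases programmatically is to split the long phrase into bigrams before adding them to your SpeechContext. The helper below is a hypothetical illustration (not part of any library), written in Python.

def bigrams(phrase):
    """Return all two-word sequences (bigrams) of a phrase, in order."""
    words = phrase.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

phrase = ("My favorite exhibit at the American Museum of Natural History "
          "is the blue whale")

# Boost the full phrase plus each bigram so near matches also benefit.
boosted_phrases = [phrase] + bigrams(phrase)

You could then pass boosted_phrases to the phrases field of a SpeechContext object, along with a boost value.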
Setting boost values
Boost values must be float values greater than 0. The practical maximum for boost values is 20. For best results, experiment by adjusting your boost values up or down until you get accurate transcription results.
Higher boost values can result in fewer false negatives, which are cases where the word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text. However, boost can also increase the likelihood of false positives; that is, cases where the word or phrase appears in the transcription even though it didn't occur in the audio.
Example of speech adaptation boost
To set different boost values for "fair" and "fare" in your speech transcription request, add two SpeechContext objects to the speechContexts array of the RecognitionConfig object. Set the boost field of each SpeechContext object to a non-negative float value, with one object containing "fair" and the other containing "fare".
The following snippet shows an example of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes a RecognitionConfig object that uses boost values to weight the words "fair" and "fare" differently.
"config": { "encoding":"LINEAR16", "sampleRateHertz": 8000, "languageCode":"en-US", "speechContexts": [{ "phrases": ["fair"], "boost": 15 }, { "phrases": ["fare"], "boost": 2 }] }
What's next
- Learn how to use speech adaptation in a request to Speech-to-Text.
- Learn how to use speech adaptation boost to fine-tune your speech adaptation results.
- Review the list of supported class tokens.