Overview
You can use the model adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word "weather". When Speech-to-Text encounters the word "weather," you want it to transcribe the word as "weather" more often than "whether." In this case, you might use model adaptation to bias Speech-to-Text toward recognizing "weather."
Model adaptation is particularly helpful in the following use cases:
- Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.
- Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using model adaptation.
- Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.
To see whether the model adaptation feature is available for your language, please refer to the language support page.
Improve recognition of words and phrases
To increase the probability that Speech-to-Text recognizes the word "weather" when it transcribes your audio data, you can pass the single word "weather" in the PhraseSet object in a SpeechAdaptation resource.
When you provide a multi-word phrase, Speech-to-Text is more likely to recognize those words in sequence. Providing a phrase also increases the probability of recognizing portions of the phrase, including individual words. See the content limits page for limits on the number and size of these phrases.
Optionally, you can fine-tune the strength of model adaptation using the model adaptation boost feature.
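As a minimal sketch of the idea above, the following Python snippet assembles a recognition request body that passes phrases inline. The field names (adaptation, phraseSets, phrases, value) follow the example requests later on this page; the helper name build_recognize_request and the audio URI are hypothetical, and the snippet only builds the JSON without calling the API.

```python
import json


def build_recognize_request(audio_uri, phrases, language_code="en-US"):
    """Return a recognize request body that biases recognition toward `phrases`."""
    return {
        "config": {
            "languageCode": language_code,
            "adaptation": {
                # One inline phrase set holding every phrase, without boost values.
                "phraseSets": [{"phrases": [{"value": p} for p in phrases]}]
            },
        },
        "audio": {"uri": audio_uri},
    }


request = build_recognize_request(
    "gs://my-bucket/forecast.wav",  # hypothetical audio location
    ["weather", "weather forecast"],
)
print(json.dumps(request, indent=2))
```

The printed JSON could be used as the -d payload of a speech:recognize call like the ones in the example use case.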
Improve recognition using classes
Classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class allows you to improve transcription accuracy for large groups of words that map to a common concept, but that don't always include identical words or phrases.
For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal ("one-hundred twenty-third"). However, not all people live at "123 Main Street." It's impractical to list every possible street address in a PhraseSet resource. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is. In this example, Speech-to-Text could then more accurately transcribe phrases like "123 Main Street" and "987 Grand Boulevard" because they are both recognized as address numbers.
Class tokens
To use a class in model adaptation, include a class token in the phrases field of a PhraseSet resource. Refer to the list of supported class tokens to see which tokens are available for your language. For example, to improve the transcription of address numbers from your source audio, provide the value $ADDRESSNUM in your SpeechContext object.
You can use classes either as stand-alone items in the phrases array or embed one or more class tokens in longer multi-word phrases. For example, you can indicate an address number in a larger phrase by including the class token in a string: ["my address is $ADDRESSNUM"]. However, this phrase will not help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to additionally include the class token by itself: ["my address is $ADDRESSNUM", "$ADDRESSNUM"]. If you use an invalid or malformed class token, Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.
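The advice to list each class token on its own can be automated. The sketch below is one possible approach, not part of the API: the helper with_standalone_tokens is hypothetical, and the regular expression assumes that pre-built class tokens consist of a dollar sign followed by uppercase letters, digits, or underscores.

```python
import re

# Assumed shape of a pre-built class token, for example $ADDRESSNUM.
CLASS_TOKEN = re.compile(r"\$[A-Z][A-Z0-9_]*")


def with_standalone_tokens(phrases):
    """Return the phrases plus every embedded class token as its own phrase."""
    result = list(phrases)
    for phrase in phrases:
        for token in CLASS_TOKEN.findall(phrase):
            if token not in result:
                result.append(token)
    return result


phrases = with_standalone_tokens(["my address is $ADDRESSNUM"])
print(phrases)  # prints ['my address is $ADDRESSNUM', '$ADDRESSNUM']
```

Because an invalid token is ignored without an error, it is still worth checking each token against the list of supported class tokens for your language.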
Custom classes
You can also create your own CustomClass, a class composed of your own custom list of related items or values. For example, suppose you want to transcribe audio data that is likely to include the name of any one of several hundred regional restaurants. Restaurant names are relatively rare in general speech and therefore less likely to be chosen as the "correct" answer by the recognition model. You can bias the recognition model toward correctly identifying these names when they appear in your audio using a custom class.
To use a custom class, create a CustomClass resource that includes each restaurant name as a ClassItem. Custom classes function in the same way as the pre-built class tokens. A phrase can include both pre-built class tokens and custom classes.
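To illustrate, the following sketch builds the items payload for a CustomClass and the ${...} reference that embeds it in a phrase, using the same shapes as the example requests later on this page. The helper names, the restaurant names, and the class ID "restaurants" are hypothetical.

```python
def custom_class_body(values):
    """Build the `items` payload for a CustomClass from a list of values."""
    return {"items": [{"value": v} for v in values]}


def custom_class_reference(project_id, class_id, location="global"):
    """Return the ${...} string that refers to a CustomClass inside a phrase."""
    return "${projects/%s/locations/%s/customClasses/%s}" % (
        project_id,
        location,
        class_id,
    )


# Hypothetical restaurant names and class ID, for illustration only.
body = custom_class_body(["Luigi's Trattoria", "Blue Heron Diner"])
reference = custom_class_reference("project_id", "restaurants")
phrase = "book a table at " + reference
print(phrase)
```

The reference can also be used as a phrase by itself, in the same way as a stand-alone class token.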
ABNF Grammars
You can also use grammars in augmented Backus–Naur form (ABNF) to specify patterns of words. Including an ABNF grammar in the request's model adaptation will increase the probability that Speech-to-Text recognizes all words that match the specified grammar.
To use this feature, include an ABNF grammar object in your request's SpeechAdaptation field. ABNF grammars can also include references to CustomClass and PhraseSet resources. To learn more about the syntax for this field, see the Speech Recognition Grammar Specification and our code sample below.
Fine-tune transcription results using boost
By default, model adaptation already provides a sufficient effect in most cases. The model adaptation boost feature allows you to increase the recognition model bias by assigning more weight to some phrases than others. We recommend that you implement boost only if 1) you have already implemented model adaptation, and 2) you would like to further adjust the strength of model adaptation effects on your transcription results.
For example, suppose you have many recordings of people asking about the "fare to get into the county fair," with the word "fair" occurring more frequently than "fare." In this case, you can use model adaptation to increase the probability of the model recognizing both "fair" and "fare" by adding them as phrases in a PhraseSet resource. This tells Speech-to-Text to recognize "fair" and "fare" more often than, for example, "hare" or "lair."
However, "fair" should be recognized more often than "fare" because it appears more frequently in the audio. You might have already transcribed your audio using the Speech-to-Text API and found a high number of errors recognizing the correct word ("fair"). In this case, you might additionally use boost with the phrases, assigning a higher boost value to "fair" than to "fare". The higher weighted value assigned to "fair" biases the Speech-to-Text API toward picking "fair" more frequently than "fare". Without boost values, the recognition model will recognize "fair" and "fare" with equal probability.
Boost basics
When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
For example, suppose you want to assign a boost value to the phrase "My favorite exhibit at the American Museum of Natural History is the blue whale". If you add that phrase to a phrase object and assign a boost value, the recognition model will be more likely to recognize that phrase in its entirety, word-for-word.
If you don't get the results you're looking for by boosting a multi-word phrase, we suggest that you add all bigrams (two-word sequences, in order) that make up the phrase as additional phrase items and assign boost values to each. Continuing the above example, you could investigate adding additional bigrams and N-grams (more than 2 words) such as "my favorite", "my favorite exhibit", "favorite exhibit", "my favorite exhibit at the American Museum of Natural History", "American Museum of Natural History", "blue whale", and so on. The Speech-to-Text recognition model is then more likely to recognize related phrases in your audio that contain parts of the original boosted phrase but don't match it word-for-word.
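The bigram and N-gram suggestion above can be sketched as a small generator. The helper ngram_phrases is hypothetical; it emits every in-order run of words between min_n and max_n words long and gives each the same boost, which is a simplifying assumption (you may prefer different values per phrase).

```python
def ngram_phrases(phrase, boost, min_n=2, max_n=None):
    """Return phrase items for every in-order n-gram of `phrase`, shortest first."""
    words = phrase.split()
    if max_n is None:
        max_n = len(words)
    items = []
    seen = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            gram = " ".join(words[start:start + n])
            if gram not in seen:  # skip n-grams that repeat within the phrase
                seen.add(gram)
                items.append({"value": gram, "boost": boost})
    return items


items = ngram_phrases("my favorite exhibit", boost=10)
print([item["value"] for item in items])
# prints ['my favorite', 'favorite exhibit', 'my favorite exhibit']
```

The number of n-grams grows quickly with phrase length, so for long phrases consider capping max_n and checking the result against the limits on the content limits page.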
Set boost values
A boost value must be a float greater than 0. The practical maximum for boost values is 20. For best results, experiment by adjusting your boost values up or down until you get accurate transcription results.
Higher boost values can result in fewer false negatives, which are cases where the word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text. However, boost can also increase the likelihood of false positives; that is, cases where the word or phrase appears in the transcription even though it didn't occur in the audio.
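A simple guard can enforce these constraints before you send a request. The sketch below is one possible client-side check: the helper checked_boost is hypothetical, and capping values at the practical maximum of 20 (rather than rejecting them) is a design choice made for this example, not documented API behavior.

```python
PRACTICAL_MAX_BOOST = 20.0  # practical upper limit stated above


def checked_boost(value):
    """Return `value` as a float, rejecting non-positive values and capping at 20."""
    boost = float(value)
    if not boost > 0:  # also rejects NaN, which compares false to everything
        raise ValueError("boost must be greater than 0, got %r" % (value,))
    return min(boost, PRACTICAL_MAX_BOOST)


print(checked_boost(10))  # prints 10.0
print(checked_boost(35))  # prints 20.0
```

When you experiment, change boost values in small steps and compare transcripts, because a value that removes false negatives for one phrase can introduce false positives elsewhere.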
Get timeout notifications
Speech-to-Text responses include a SpeechAdaptationInfo field, which provides information on model adaptation behavior during recognition. If a timeout related to model adaptation occurred, adaptationTimeout will be true and timeoutMessage will specify which adaptation configuration caused the timeout. When a timeout occurs, model adaptation has no effect on the returned transcript.
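A client can check for this condition after each request. The following sketch inspects a response that has already been parsed from JSON into a Python dict. It assumes the field appears under the key speechAdaptationInfo in the JSON response; the helper name and the sample response are hypothetical.

```python
def adaptation_timeout_message(response):
    """Return the timeout message if model adaptation timed out, else None."""
    # "speechAdaptationInfo" is the assumed JSON key for SpeechAdaptationInfo.
    info = response.get("speechAdaptationInfo", {})
    if info.get("adaptationTimeout"):
        return info.get("timeoutMessage", "")
    return None


# Hypothetical parsed response, for illustration only.
sample_response = {
    "results": [],
    "speechAdaptationInfo": {
        "adaptationTimeout": True,
        "timeoutMessage": "example timeout message",
    },
}
message = adaptation_timeout_message(sample_response)
if message is not None:
    print("Model adaptation was not applied:", message)
```

Logging the message makes it easier to tell a transcript that was produced without adaptation from one where adaptation simply did not help.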
Example use case using model adaptation
The following example walks you through the process of using model adaptation to transcribe an audio recording of someone saying "call me fionity and oh my gosh what do we have here ionity". In this case it is important that the model identify "fionity" and "ionity" correctly.
The following command performs recognition on the audio without model adaptation. The resulting transcription is incorrect: "call me Fiona tea and oh my gosh what do we have here I own a day".
curl -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/speech:recognize" -d '{"config": {"languageCode": "en-US"}, "audio": {"uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav"}}'
Example request:
{ "config":{ "languageCode":"en-US" }, "audio":{ "uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav" } }
Improve transcription using a PhraseSet
Create a PhraseSet:
curl -X POST -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/phraseSets" -d '{"phraseSetId": "test-phrase-set-1"}'
Example request:
{ "phraseSetId":"test-phrase-set-1" }
Get the PhraseSet:
curl -X GET -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/phraseSets/test-phrase-set-1"
Add the phrases "fionity" and "ionity" to the PhraseSet and assign a boost value of 10 to each:
curl -X PATCH -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/phraseSets/test-phrase-set-1?updateMask=phrases" -d '{"phrases": [{"value": "ionity", "boost": 10}, {"value": "fionity", "boost": 10}]}'
The PhraseSet is now updated to:
{ "phrases":[ { "value":"ionity", "boost":10 }, { "value":"fionity", "boost":10 } ] }
Recognize the audio again, this time using model adaptation and the PhraseSet created previously. The transcribed results are now correct: "call me fionity and oh my gosh what do we have here ionity".
curl -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/speech:recognize" -d '{"config": {"adaptation": {"phrase_set_references": ["projects/project_id/locations/global/phraseSets/test-phrase-set-1"]}, "languageCode": "en-US"}, "audio": {"uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav"}}'
Example request:
{ "config":{ "adaptation":{ "phrase_set_references":[ "projects/project_id/locations/global/phraseSets/test-phrase-set-1" ] }, "languageCode":"en-US" }, "audio":{ "uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav" } }
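The request that refers to a stored PhraseSet can also be assembled programmatically. This sketch reproduces the example request body above from its parts; the helper names phrase_set_name and recognize_with_phrase_sets are hypothetical, and the snippet only builds the JSON without sending it.

```python
import json


def phrase_set_name(project_id, phrase_set_id, location="global"):
    """Return the resource name of a PhraseSet."""
    return f"projects/{project_id}/locations/{location}/phraseSets/{phrase_set_id}"


def recognize_with_phrase_sets(audio_uri, phrase_set_names, language_code="en-US"):
    """Return a recognize request body that refers to stored PhraseSet resources."""
    return {
        "config": {
            "adaptation": {"phrase_set_references": list(phrase_set_names)},
            "languageCode": language_code,
        },
        "audio": {"uri": audio_uri},
    }


name = phrase_set_name("project_id", "test-phrase-set-1")
request = recognize_with_phrase_sets(
    "gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav", [name]
)
print(json.dumps(request))
```

Building the resource name in one place helps avoid path typos, such as a missing slash between the project ID and locations.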
Improve transcription results using a CustomClass
Create a CustomClass:
curl -X POST -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/customClasses" -d '{"customClassId": "test-custom-class-1"}'
Example request:
{ "customClassId": "test-custom-class-1" }
Get the CustomClass:
curl -X GET -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/customClasses/test-custom-class-1"
Recognize the test audio clip. The CustomClass is empty, so the returned transcript is still incorrect: "call me Fiona tea and oh my gosh what do we have here I own a day".
curl -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/speech:recognize" -d '{"config": {"adaptation": {"phraseSets": [{"phrases": [{"value": "${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost": "10"}]}]}, "languageCode": "en-US"}, "audio": {"uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav"}}'
Example request:
{ "config":{ "adaptation":{ "phraseSets":[ { "phrases":[ { "value":"${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost":"10" } ] } ] }, "languageCode":"en-US" }, "audio":{ "uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav" } }
Add the phrases "fionity" and "ionity" to the custom class:
curl -X PATCH -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/customClasses/test-custom-class-1?updateMask=items" -d '{"items": [{"value": "ionity"}, {"value": "fionity"}]}'
This updates the custom class to the following:
{ "items":[ { "value":"ionity" }, { "value":"fionity" } ] }
Recognize the sample audio again, this time with "fionity" and "ionity" in the CustomClass. The transcript is now correct: "call me fionity and oh my gosh what do we have here ionity".
curl -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/speech:recognize" -d '{"config": {"adaptation": {"phraseSets": [{"phrases": [{"value": "${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost": "10"}]}]}, "languageCode": "en-US"}, "audio": {"uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav"}}'
Example request:
{ "config":{ "adaptation":{ "phraseSets":[ { "phrases":[ { "value":"${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost":"10" } ] } ] }, "languageCode":"en-US" }, "audio":{ "uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav" } }
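For readers who prefer Python to curl, the PATCH step that adds items to the custom class could be prepared as follows. This is a sketch using only the standard library: the helper patch_custom_class_request is hypothetical, the access token is a placeholder that you would obtain separately (for example with the gcloud command shown in the curl examples), and the request is built but not sent.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://speech.googleapis.com/v1p1beta1"


def patch_custom_class_request(project_id, class_id, values, access_token):
    """Build (but do not send) a PATCH request that sets a CustomClass's items."""
    query = urllib.parse.urlencode({"updateMask": "items"})
    url = (
        f"{API_ROOT}/projects/{project_id}/locations/global"
        f"/customClasses/{class_id}?{query}"
    )
    body = {"items": [{"value": v} for v in values]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="PATCH",
    )


req = patch_custom_class_request(
    "project_id", "test-custom-class-1", ["ionity", "fionity"], "ACCESS_TOKEN"
)
print(req.get_method(), req.full_url)
```

To send the request, pass req to urllib.request.urlopen with a valid token in place of the placeholder.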
Refer to a CustomClass in a PhraseSet
Update the PhraseSet resource created earlier to refer to the CustomClass:
curl -X PATCH -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/phraseSets/test-phrase-set-1?updateMask=phrases" -d '{"phrases": [{"value": "${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost": 10}]}'
Example request:
{ "phrases":[ { "value":"${projects/project_id/locations/global/customClasses/test-custom-class-1}", "boost":10 } ] }
Recognize the audio using the PhraseSet resource (which refers to the CustomClass). The transcript is correct: "call me fionity and oh my gosh what do we have here ionity".
curl -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/speech:recognize" -d '{"config": {"adaptation": {"phrase_set_references": ["projects/project_id/locations/global/phraseSets/test-phrase-set-1"]}, "languageCode": "en-US"}, "audio": {"uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav"}}'
Example request:
{ "config":{ "adaptation":{ "phrase_set_references":[ "projects/project_id/locations/global/phraseSets/test-phrase-set-1" ] }, "languageCode":"en-US" }, "audio":{ "uri":"gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav" } }
Delete the CustomClass and PhraseSet
Delete the PhraseSet:
curl -X DELETE -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/phraseSets/test-phrase-set-1"
Delete the CustomClass:
curl -X DELETE -H "Authorization: Bearer $(gcloud auth --impersonate-service-account=$SA_EMAIL print-access-token)" -H "Content-Type: application/json; charset=utf-8" "https://speech.googleapis.com/v1p1beta1/projects/project_id/locations/global/customClasses/test-custom-class-1"
Improve transcription results using an ABNF Grammar
Recognize the audio using an abnf_grammar. This example refers to a CustomClass resource (projects/project_id/locations/global/customClasses/test-custom-class-1), an inlined CustomClass (test-custom-class-2), a class token ($ADDRESSNUM), and a PhraseSet resource (projects/project_id/locations/global/phraseSets/test-phrase-set-1). The first rule in the strings (after external declarations) is treated as the root.
Example request:
{ "config":{ "adaptation":{ "abnf_grammar":{ "abnf_strings": [ "external ${projects/project_id/locations/global/phraseSets/test-phrase-set-1}" , "external ${projects/project_id/locations/global/customClasses/test-custom-class-1}" , "external ${test-custom-class-2}" , "external $ADDRESSNUM" , "$root = $test-phrase-set-1 $name lives in $ADDRESSNUM;" , "$name = $title $test-custom-class-1 $test-custom-class-2;" , "$title = Mr | Mrs | Miss | Dr | Prof ;" ] } } } }
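Because the first rule after the external declarations is treated as the root, it can help to assemble abnf_strings in a fixed order. The sketch below is one way to do that; the helper abnf_adaptation and the small grammar passed to it are hypothetical, and the caller is responsible for writing valid rules.

```python
def abnf_adaptation(externals, root_rule, other_rules=()):
    """Return an adaptation object whose grammar lists externals, then the root rule."""
    strings = [f"external {ref}" for ref in externals]
    strings.append(root_rule)  # first rule after the externals, so it is the root
    strings.extend(other_rules)
    return {"abnf_grammar": {"abnf_strings": strings}}


# Hypothetical grammar, for illustration only.
adaptation = abnf_adaptation(
    externals=["$ADDRESSNUM"],
    root_rule="$root = $title lives at $ADDRESSNUM;",
    other_rules=["$title = Mr | Mrs | Miss | Dr | Prof ;"],
)
print(adaptation["abnf_grammar"]["abnf_strings"])
```

The returned dict can be placed under the adaptation key of a recognition config.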
What's next
- Learn how to use model adaptation in a request to Speech-to-Text.
- Review the list of supported class tokens.