What a user says often depends strongly on the particular context they are in. That context can include, to highlight a few examples, the options presented to the user, the conversation topic, the time of day or location, or prior information about the user (such as favorite radio stations or top contacts).
Model adaptation provides this context to the recognizer. It changes the underlying probabilities of the Speech-to-Text model so that the contextual words or phrases are more likely to be selected by the recognizer than other options that might otherwise be chosen.
Relevant contextual phrases can significantly improve recognition performance. Irrelevant phrases pose a risk of degrading the recognition performance: if the users don't speak those phrases, the recognizer is guided in the wrong direction.
To increase the probability that the libgspeech library recognizes the words and phrases you need when it transcribes your audio data, pass them as phrases within the phrase_sets field of a SpeechAdaptation object. Assign the SpeechAdaptation object to the adaptation field of the RecognitionConfig object in your request:
adaptation: {
  phrase_sets {
    phrases: "weather is hot"
    phrases: "weather is cold"
  }
}
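Put together, the nesting described above looks like the following sketch of a full recognition config. Only the adaptation field and its contents come from this guide; the enclosing config wrapper and the language_code field are illustrative assumptions about the overall request shape.

```
# Sketch only: the enclosing `config` wrapper and the `language_code` field
# are assumptions for illustration; this guide specifies only the
# `adaptation` field and its contents.
config: {
  language_code: "en-US"  # assumed field, for illustration
  adaptation: {
    phrase_sets {
      phrases: "weather is hot"
      phrases: "weather is cold"
    }
  }
}
```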
Use class tokens to bias the model
Classes represent common concepts that occur in natural language, such as numeric values and calendar dates. A class lets you improve transcription accuracy for large groups of words that map to a common concept but don't always include identical words or phrases.
For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal number ("one-hundred twenty-third"). However, not all people live at "123 Main Street." It's impractical to list every possible street address in phrases. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is.
To use class tokens, include them in your speech adaptation phrases. You can use classes either as stand-alone items in the phrases array or embed them in longer multi-word phrases. For example, to improve the transcription of address numbers from your source audio, use the $ADDRESSNUM class. You can indicate an address number in a larger phrase by including the class token in a string: "my address is $ADDRESSNUM". However, this phrase doesn't help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to also add the class token by itself:
adaptation: {
  phrase_sets {
    phrases: "my address is $ADDRESSNUM"
    phrases: "$ADDRESSNUM"
  }
}
To learn which class tokens are available in your locale of interest, contact Google.
Improve recognition using predefined classes
A custom class is a customized list of related items or values. Google provides several predefined custom classes (such as contacts or navigation) that we recommend for use with the libgspeech library. These custom classes are likely to represent phrases that your application might have and yield better recognition accuracy than your own custom classes would.
Predefined custom classes are grouped into two categories, which you need to reference differently in your requests:
- Regular custom classes, for which you need to provide phrases and items, for example, contacts.
- Phraseless custom classes, which are referenced by custom_class_id only, for example, navigation.
To use a regular custom class that requires phrases and items, create a CustomClass object that includes each value in items, and reference this class by its custom_class_id in your phrases. For example:
adaptation: {
  custom_classes {
    custom_class_id: "contacts"
    items: "Asia"
    items: "Alex"
    items: "Nuno Pereira"
  }
  phrase_sets {
    phrases: "call ${contacts}"
  }
}
You don't need to provide phrases and items for phraseless custom classes. In that case, the phrases list is a Google-provided list of the most common phrases users are likely to use in that context (for example, "take highway" in navigation).
adaptation: {
  custom_classes {
    custom_class_id: "navigation"
  }
}
You can provide several custom classes in your request. In that case, each phrase set should only contain phrases that correspond to a single custom class. For example:
adaptation: {
  # An example of a regular custom class.
  custom_classes {
    custom_class_id: "radio-stations"
    items: "90s Rock & Hip-Hop"
    items: "Z100"
  }
  # An example of a phraseless custom class. (Note that no items or phrases
  # need to be provided.)
  custom_classes {
    custom_class_id: "navigation"
  }
  custom_classes {
    custom_class_id: "contacts"
    items: "Nuno Pereira"
    items: "Carl Jung"
  }
  phrase_sets {
    phrases: "play ${radio-stations}"
    phrases: "tune to ${radio-stations}"
  }
  phrase_sets {
    phrases: "message ${contacts}"
  }
  phrase_sets {
    phrases: "send email to ${contacts}"
  }
}
Create custom classes
If you have a specific business need that isn't met by the predefined custom classes, you can create custom classes.
For example, you might want to transcribe audio data that is likely to include the name of any one of several hundred regional restaurants. Restaurant names are relatively rare in general speech and less likely to be chosen as "correct" by the recognition model. So, specify the names in a custom class. For example:
adaptation: {
  custom_classes {
    custom_class_id: "restaurants"
    items: "sushido"
    items: "taneda sushi"
    items: "altura"
  }
  phrase_sets {
    phrases: "visit restaurants like ${restaurants}"
  }
}
Recommendations
The recommended maximums for the number of phrases and custom classes in your requests are as follows:
SpeechAdaptation {
  CustomClass { [max of 20]
    class id
    items [max of 100]
  }
  PhraseSet { [max of 5]
    phrases [max of 300 across all 5 PhraseSets]
  }
}
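To make the shared phrase limit concrete, here is a small sketch (the phrase strings are hypothetical): the 300-phrase budget is shared across all phrase sets in the request, so the two PhraseSet entries below together consume 3 of the 300 allowed phrases.

```
adaptation: {
  # PhraseSet entry 1 of the 5 allowed.
  phrase_sets {
    phrases: "weather is hot"   # counts toward the shared 300-phrase budget
    phrases: "weather is cold"  # counts toward the shared 300-phrase budget
  }
  # PhraseSet entry 2 of the 5 allowed.
  phrase_sets {
    phrases: "weather is mild"  # also counts toward the same shared budget
  }
}
```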
Exceeding these limits may result in the following issues:
- Quality degradation:
- Overtriggering or recognizing biased phrases not present in the audio.
- The recognizer getting stuck, or truncating parts of a transcript.
- Nondeterministic or flaky transcription results for the same audio. This is caused by providing many similar speech adaptation phrases, or irrelevant phrases, which can lead the recognizer to choose among the options at random.
- Increased latency (possibly, seconds) and memory as the underlying models become larger.
Example adaptation configurations
The following are sample adaptation configurations for common cases.
Contacts
An example that adapts the recognizer for a list of contacts:
adaptation: {
  custom_classes {
    custom_class_id: "contacts"
    items: "Nuno Pereira"
    items: "Carl Jung"
  }
  phrase_sets {
    phrases: "message ${contacts}"
  }
  phrase_sets {
    phrases: "send email to ${contacts}"
  }
}
See the Improve recognition using predefined classes section for more information.