Speech adaptation

When performing a detect intent request, you can optionally supply phrase_hints to provide hints to the speech recognizer. These hints can help with recognition in a specific conversation state.
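As a rough illustration, the body of a detect intent request with audio input might carry the hints as sketched below. This is a minimal sketch that assumes the REST field names `queryInput.audio.config.phraseHints`, `audioEncoding`, and `sampleRateHertz`; verify them against the API version you target. The helper name `build_detect_intent_body` and the sample phrases are hypothetical, and nothing is sent over the network.

```python
import base64
import json


def build_detect_intent_body(audio_bytes, phrase_hints, language_code="en"):
    """Build a JSON-serializable detect intent body that carries phrase hints.

    Field names are assumed from the REST reference, not confirmed by this page.
    """
    return {
        "queryInput": {
            "audio": {
                "config": {
                    "audioEncoding": "AUDIO_ENCODING_LINEAR_16",
                    "sampleRateHertz": 16000,
                    # Hints that are relevant to the current conversation state.
                    "phraseHints": list(phrase_hints),
                },
                # Raw audio bytes are base64-encoded in a JSON body.
                "audio": base64.b64encode(audio_bytes).decode("ascii"),
            },
            "languageCode": language_code,
        }
    }


if __name__ == "__main__":
    body = build_detect_intent_body(b"\x00\x01", ["stuffy nose", "sore throat"])
    print(json.dumps(body, indent=2))
```

Keeping the hint list short and specific to the current conversation state matches the intent described above: the hints bias recognition for this request only.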

Auto speech adaptation

The auto speech adaptation feature improves the speech recognition accuracy of your agent by automatically using conversation state to pass relevant entities and training phrases as speech context hints for all detect intent requests. This feature is disabled by default.

Enable or disable auto speech adaptation

To enable or disable auto speech adaptation:

Console

Open the Dialogflow CX Console.
Choose your GCP project.
Select your agent.
Click Agent Settings.
Click the Speech and IVR tab.
Toggle Enable auto speech adaptation on or off.
Click Save.

API

See the get and patch/update methods for the Agent type.

Select a protocol and version for the Agent reference:

Protocol	V3	V3beta1
REST	Agent resource	Agent resource
RPC	Agent interface	Agent interface
C++	AgentsClient	Not available
C#	AgentsClient	Not available
Go	AgentsClient	Not available
Java	AgentsClient	AgentsClient
Node.js	AgentsClient	AgentsClient
PHP	Not available	Not available
Python	AgentsClient	AgentsClient
Ruby	Not available	Not available
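
The patch call could look roughly like the sketch below, which only describes the request and does not send it. It assumes the Agent resource exposes the toggle as `speechToTextSettings.enableSpeechAdaptation` and that the update method accepts an `updateMask` query parameter; the helper name `build_agent_patch` and the agent name used in the example are hypothetical.

```python
def build_agent_patch(agent_name, enable):
    """Describe a PATCH call that toggles auto speech adaptation on an agent.

    agent_name has the form projects/<project>/locations/<location>/agents/<agent>.
    The field path below is an assumption; check the Agent reference.
    """
    return {
        "method": "PATCH",
        "path": f"/v3/{agent_name}",
        # Restrict the update to the single field being changed.
        "params": {"updateMask": "speechToTextSettings.enableSpeechAdaptation"},
        "body": {"speechToTextSettings": {"enableSpeechAdaptation": bool(enable)}},
    }


if __name__ == "__main__":
    request = build_agent_patch(
        "projects/my-project/locations/global/agents/my-agent-id", True
    )
    print(request)
```

Using an update mask keeps the rest of the agent configuration untouched when only this one setting changes.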

Agent design for speech recognition improvements

With auto speech adaptation enabled, you can design your agent to take advantage of it. The following sections explain how certain changes to your agent's training phrases and entities may improve speech recognition.

Training phrases

If you define training phrases with a phrase like "stuffy nose", a similar sounding end-user utterance is reliably recognized as "stuffy nose" and not "stuff he knows".

When you have a required parameter that forces Dialogflow into form-filling prompts, auto speech adaptation will strongly bias towards the entity being filled.

In all cases, auto speech adaptation is only biasing the speech recognition, not limiting it. For example, even if Dialogflow is prompting a user for a required parameter, users will still be able to trigger other intents such as a top-level "talk to an agent" intent.

System entities

If you define a training phrase that uses the @sys.number system entity, and the end user says "I want two", it may be recognized as "to", "too", "2", or "two".

With auto speech adaptation enabled, Dialogflow uses the @sys.number entity as a hint during speech recognition, and the parameter is more likely to be extracted as "2".

Custom entities

If you define a custom entity for product or service names offered by your company, and the end-user mentions these terms in an utterance, they are more likely to be recognized. A training phrase "I love Dialogflow", where "Dialogflow" is annotated as the @product entity, will tell auto speech adaptation to bias for "I love Dialogflow", "I love Cloud Speech", and all other entries in the @product entity.

It is especially important to define clean entity synonyms when using Dialogflow to detect speech. Imagine you have two @product entity entries, "Dialogflow" and "Dataflow". Your synonyms for "Dialogflow" might be "Dialogflow", "dialogue flow", "dialogue builder", "Speaktoit", "speak to it", "API.ai", "API dot AI". These are good synonyms because they cover the most common variations. You don't need to add "the dialogue flow builder" because "dialogue flow" already covers that.

Note: Why is this important? Suppose you have two entity entries, "Dialogflow" and "Dataflow", and two of their synonyms are "the dialogue flow builder" and "Google Cloud Dataflow". An end-user might very reasonably say "Google Cloud Dialogflow", but because there is no "Google Cloud Dialogflow" synonym, speech recognition will likely hear "Google Cloud Dataflow", because the entity definitions are biased towards that phrase. Likewise, if someone says "the dataflow builder", speech recognition will most likely hear "the dialogue flow builder", because it's the only synonym defined with "builder". Instead, you will get better performance by defining only the key phrases listed above. In summary, be careful not to add generic data to entity definitions; this is what intent training phrases are designed for. A training phrase "Google Cloud Dataflow", where "Dataflow" is annotated as the @product entity, allows auto speech adaptation to listen for "Google Cloud Dataflow" and "Google Cloud Dialogflow" with equal weight. See Agent design for more best practices.

User utterances with consecutive but distinct number entities can be ambiguous. For example, "I want two sixteen packs" might mean a quantity of 2 sixteen-packs, or a quantity of 216 packs. Speech adaptation can help disambiguate these cases if you set up entities with spelled-out values:
- Define a quantity entity with entries:
  zero
  one
  ...
  twenty
- Define a product or size entity with entries:
  sixteen pack
  two ounce
  ...
  five liter
- Only entity synonyms are used in speech adaptation, so you can define an entity with reference value 1 and single synonym one to simplify your fulfillment logic.
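
The entity setup above can be sketched as plain data. This is a minimal sketch: the dictionary keys (`displayName`, `kind`, `entities`, `value`, `synonyms`) are assumed to mirror the entity type resource, and the helper names are hypothetical. Each quantity entry uses a numeric reference value with a single spelled-out synonym, as the last bullet suggests.

```python
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
]


def build_quantity_entity():
    """Quantity entity: reference value is the digit string, synonym is the word."""
    return {
        "displayName": "quantity",
        "kind": "KIND_MAP",
        "entities": [
            {"value": str(number), "synonyms": [word]}
            for number, word in enumerate(NUMBER_WORDS)
        ],
    }


def build_size_entity(sizes):
    """Product or size entity with spelled-out entries such as 'sixteen pack'."""
    return {
        "displayName": "size",
        "kind": "KIND_MAP",
        "entities": [{"value": size, "synonyms": [size]} for size in sizes],
    }


if __name__ == "__main__":
    print(build_quantity_entity()["entities"][:3])
    print(build_size_entity(["sixteen pack", "two ounce", "five liter"]))
```

Because only the synonyms feed speech adaptation, fulfillment code can read the numeric reference value directly without parsing number words.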

Regexp entities

Regexp entities can trigger auto speech adaptation for alphanumeric and digit sequences like "ABC123" or "12345" when configured and tested properly.

To recognize these sequences over voice, implement all four of the requirements below:

1. Regexp entry requirement

While any regular expression can be used to extract entities from text inputs, only certain expressions will tell auto speech adaptation to bias for spelled-out alphanumeric or digit sequences when recognizing speech.

In the regexp entity, at least one entry must follow all of these rules:

Should match some alphanumeric characters, for example: \d, \w, [a-zA-Z0-9]
Should not contain whitespace or \s, though \s* and \s? are allowed
Should not contain capture or non-capture groups ()
Should not try to match any special characters or punctuation like: ` ~ ! @ # $ % ^ & * ( ) - _ = + , . < > / ? ; ' : " [ ] { } \ |

This entry can have character sets [] and repetition quantifiers like *, ?, +, {3,5}.

See Examples.

2. Parameter definition requirement

Mark the regexp entity as a required form parameter, so it can be collected during form filling. This allows auto speech adaptation to strongly bias for sequence recognition instead of trying to recognize an intent and sequence at the same time. Otherwise, "Where is my package for ABC123" might be misrecognized as "Where is my package 4ABC123".

3. Training phrases annotation requirement

Do not use the regexp entity for an intent training phrase annotation. This ensures that the parameter is resolved as part of form filling.

4. Testing requirement

See Testing speech adaptation.

Examples

For example, a regexp entity with a single entry ([a-zA-Z0-9]\s?){5,9} will not trigger the speech sequence recognizer because it contains a capture group. To fix this, simply add another entry for [a-zA-Z0-9]{5,9}. Now you will benefit from the sequence recognizer when matching "ABC123", yet the NLU will still match inputs like "ABC 123" thanks to the original rule that allows spaces.

The following examples of regular expressions adapt for alphanumeric sequences:

^[A-Za-z0-9]{1,10}$
WAC\d+
215[2-8]{3}[A-Z]+
[a-zA-Z]\s?[a-zA-Z]\s?[0-9]\s?[0-9]\s?[0-9]\s?[a-zA-Z]\s?[a-zA-Z]

The following examples of regular expressions adapt for digit sequences:

\d{2,8}
^[0-9]+$
2[0-9]{7}
[2-9]\d{2}[0-8]{3}\d{4}
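
To check candidate entries before adding them to an agent, the entry rules from the regexp entry requirement can be approximated with a small linter. This is a heuristic sketch written for this guide, not Dialogflow's actual validation logic; the function name `supports_sequence_adaptation` is hypothetical, and the checker is deliberately conservative (for example, it rejects negated character sets).

```python
import re

_CLASS_BODY = re.compile(r"(?:[A-Za-z0-9](?:-[A-Za-z0-9])?)+")
_REPEAT = re.compile(r"\{\d+(?:,\d*)?\}")


def supports_sequence_adaptation(pattern):
    """Return True if the entry appears to follow the documented entry rules."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    i, n, saw_alnum = 0, len(pattern), False
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            esc = pattern[i + 1] if i + 1 < n else ""
            if esc in ("d", "w"):
                saw_alnum = True
                i += 2
            elif esc == "s" and i + 2 < n and pattern[i + 2] in "*?":
                i += 3  # only \s* and \s? are allowed
            else:
                return False  # bare \s, escaped punctuation, other escapes
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1 or not _CLASS_BODY.fullmatch(pattern[i + 1:end]):
                return False  # character sets may hold only letters and digits
            saw_alnum = True
            i = end + 1
        elif ch == "{":
            match = _REPEAT.match(pattern, i)
            if not match:
                return False
            i = match.end()
        elif ch in "*+?":
            i += 1  # repetition quantifiers
        elif (ch == "^" and i == 0) or (ch == "$" and i == n - 1):
            i += 1  # anchors at the very start or end
        elif ch.isascii() and ch.isalnum():
            saw_alnum = True
            i += 1
        else:
            return False  # groups, whitespace, punctuation
    return saw_alnum


if __name__ == "__main__":
    for entry in [r"([a-zA-Z0-9]\s?){5,9}", r"[a-zA-Z0-9]{5,9}", r"WAC\d+"]:
        print(entry, supports_sequence_adaptation(entry))
```

The first entry printed is rejected because of its capture group, while the other two pass, which matches the fix described in the Examples section.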

Regexp workaround

Auto speech adaptation's built-in support for regexp entities varies by language. Check Speech class tokens for $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_DIGIT_SEQUENCE supported languages.

If your language is not listed, you can work around this limitation. For example, if you want an employee ID that is three letters followed by three digits to be accurately recognized, you could build your agent with the following entities and parameters:

Define a digit entity that contains 10 entity entries (with synonyms):
0, 0
1, 1
...
9, 9
Define a letter entity that contains 26 entity entries (with synonyms):
A, A
B, B
...
Z, Z
Define an employee-id entity that contains a single entity entry (without synonyms):
@letter @letter @letter @digit @digit @digit
Use @employee-id as a parameter in a training phrase.
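
The workaround entities can be generated rather than typed by hand. The sketch below assumes the same dictionary layout for entity definitions as the entity type resource (`displayName`, `entities`, `value`, `synonyms`); the helper names and the "L"/"D" shape notation are hypothetical conveniences introduced only for this example.

```python
import string


def build_char_entity(display_name, characters):
    """One entry per character, with the character itself as the only synonym."""
    return {
        "displayName": display_name,
        "entities": [{"value": c, "synonyms": [c]} for c in characters],
    }


def composite_entry(shape):
    """Turn a shape such as 'LLLDDD' into '@letter @letter ... @digit'."""
    names = {"L": "@letter", "D": "@digit"}
    return " ".join(names[symbol] for symbol in shape)


def build_workaround_entities(shape="LLLDDD"):
    return [
        build_char_entity("digit", string.digits),
        build_char_entity("letter", string.ascii_uppercase),
        # A single entry without synonyms that references the two entities above.
        {"displayName": "employee-id", "entities": [{"value": composite_entry(shape)}]},
    ]


if __name__ == "__main__":
    digit, letter, employee_id = build_workaround_entities()
    print(len(digit["entities"]), len(letter["entities"]))
    print(employee_id["entities"][0]["value"])
```

Changing the shape string is enough to adapt the composite entry to a different ID format, for example two letters followed by four digits.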

Manual speech adaptation

Manual speech adaptation allows you to manually configure speech adaptation phrases for a flow or a page. It also overrides implicit speech contexts generated by auto speech adaptation when the latter is enabled.

The flow level and page level speech adaptation settings have a hierarchical relation, which means that a page inherits speech adaptation settings from the flow level by default and the more fine-grained page level always overrides flow level if the page has a customized setting.

The flow-level and page-level settings can be enabled independently. If the flow-level adaptation setting is not enabled, you can still choose Customize at page level to enable manual speech adaptation for that specific page. Similarly, if you disable manual speech adaptation in the flow-level setting, pages in the flow with Customize selected are not affected.

However, the flow-level and page-level settings cannot be disabled independently. If a flow has manual speech adaptation enabled, you cannot disable it for a page under the flow through the Customize choice. Therefore, if you want to mix manual and auto speech adaptation for pages within a flow, do not enable manual speech adaptation at flow level; use only page-level adaptation settings instead. Refer to the table below to see which combination of flow and page settings fits your use case.

Target effect	Recommended use of adaptation settings
Disable auto adaptation for a flow	Flow enabled with no phrase sets (pages within the flow by default use flow setting).
Disable auto adaptation for a page	Flow disabled and page enabled (Customize chosen) with no phrase sets.
Only use manual speech adaptation for all pages within a flow	Flow enabled. Customize pages that need to use phrase sets different from the flow's.
Mix use of auto and manual adaptation within a flow	Flow disabled. Customize pages to which you want to apply manual adaptation.
Only use auto speech adaptation for all pages within a flow	Flow disabled.
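
One way to reason about the table is as a small resolution function. The sketch below is a mental model of the hierarchy described in this section, not the product's implementation; `ManualSetting` and `effective_adaptation` are hypothetical names, and a page value of `None` stands for the "Use flow level" choice.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManualSetting:
    enabled: bool = False
    phrase_sets: List[str] = field(default_factory=list)


def effective_adaptation(flow, page=None, auto_enabled=True):
    """Return 'manual', 'auto', or 'none' for a page.

    page=None means the page uses the flow level setting;
    a ManualSetting means Customize was chosen for the page.
    """
    if page is not None and page.enabled:
        active = page  # a customized page overrides the flow
    elif flow.enabled:
        active = flow  # inherited; cannot be switched off per page
    else:
        return "auto" if auto_enabled else "none"
    # Manual adaptation overrides auto; with no phrase sets nothing is biased.
    return "manual" if active.phrase_sets else "none"


if __name__ == "__main__":
    flow_off = ManualSetting(enabled=False)
    page_on = ManualSetting(enabled=True, phrase_sets=["how to sell phones"])
    print(effective_adaptation(flow_off, page_on))  # customized page
    print(effective_adaptation(flow_off))           # page using flow level
```

Under this model, the "empty adaptation override" rows of the table correspond to an enabled setting whose phrase set list is empty.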

Enable or disable manual speech adaptation

To enable or disable manual speech adaptation at flow or page level:

Flow Settings

Open the Dialogflow CX Console.
Choose your GCP project.
Hover your mouse over the flow in the Flows section.
Click the options button.
Select Flow Settings in the dropdown menu.
Select the checkbox Enable manual speech adaptation or deselect it.
Edit, add or delete phrase sets in the phrase set table.
Click Save.

Page Settings

Open the Dialogflow CX Console.
Choose your GCP project.
Hover your mouse over the page in the Pages section.
Click the options button.
Select Page Settings in the dropdown menu.
Use flow level is chosen by default; when it is chosen, flow level adaptation phrases are reused for this page. You can choose Customize to configure adaptation phrases that differ from the flow level settings. Even if manual speech adaptation is disabled at flow level, you can still enable and configure manual speech adaptation for a page in that flow through the Customize option.
Edit, add or delete phrase sets in the adaptation phrase set table.
Click Save.

Manual phrase set configuration for speech recognition improvements

1. Words and phrases

In an adaptation phrase set, you can define single-word or multi-word phrases with optional references to speech class tokens. For example, you can add phrases like "great rate", "tracking number is $OOV_CLASS_ALPHANUMERIC_SEQUENCE", or "$FULLPHONENUM". The phrases you provide become more likely to be transcribed than other phonetically similar phrases. When you add a multi-word phrase without any boost, the bias is applied to both the whole phrase and the continuous portions within the phrase.

In general, keep the number of phrases small, and only add phrases that speech recognition struggles to get right without speech adaptation. If Speech-to-Text can already recognize a phrase correctly, there's no need to add it to the speech adaptation settings. If you see a few phrases that Speech-to-Text often misrecognizes on a page or flow, you can add the correct phrases to the corresponding adaptation settings.

Recognition error correction example

Here's an example of how you can use speech adaptation to correct recognition issues. Suppose you are designing an agent for trading phone devices, and the user may say something that includes either "sell phones" or "cell phone" after the agent asks its first question, "What do you need help with?". How can you use speech adaptation to improve recognition accuracy for both phrases?

If you include both phrases in the adaptation settings, Speech-to-Text may still be confused, as they sound similar. If you provide only one of the two phrases, Speech-to-Text may misrecognize one phrase as the other. To improve speech recognition accuracy for both phrases, you need to provide Speech-to-Text with more context clues to distinguish between when it should hear "sell phones" and when it should hear "cell phone". For example, you may notice that people often use "sell phones" as part of utterances like "how to sell phones", "want to sell phones" or "do you sell phones", whereas "cell phone" appears as part of utterances like "purchase cell phone", "cell phone bill", and "cell phone service". If you provide these more precise phrases to the model instead of the short original phrases "cell phone" and "sell phones", Speech-to-Text will learn that "sell phones" as a verb phrase is more likely to follow words like "how to", "want to" and "do you", while "cell phone" as a noun phrase is more likely to follow words like "purchase" or be followed by words like "bill" or "service". Therefore, as a rule of thumb for configuring adaptation phrases, it is usually better to provide more precise phrases like "how to sell phones" or "do you sell phones" than to include only "sell phones".

2. Speech class tokens

Apart from natural language words, you can also embed references to speech class tokens into a phrase. Speech class tokens represent common concepts that usually follow a certain format in writing. For example, for the address number in an address like "123 Main Street", people would usually expect to see the numerical format "123" instead of the fully spelled-out version "one hundred twenty-three". If you expect certain formatting in the transcription results, especially for alphanumeric sequences, refer to the list of supported class tokens to see which tokens are available for your language and your use case.

If the page already has intent routes or parameters with references to system entities, here's a reference table for mappings between common system entities and speech class tokens:

System entities	Speech class tokens
`@sys.date`	`$MONTH $DAY $YEAR`
`@sys.date-time`	`$MONTH $DAY $YEAR`
`@sys.date-period`	`$MONTH $DAY $YEAR`
`@sys.time`	`$TIME`
`@sys.time-period`	`$TIME`
`@sys.age`	`$OPERAND`
`@sys.number`	`$OPERAND`
`@sys.number-integer`	`$OPERAND`
`@sys.cardinal`	`$OPERAND`
`@sys.ordinal`	`$OPERAND`
`@sys.percentage`	`$OPERAND`
`@sys.duration`	`$OPERAND`
`@sys.currency-name`	`$MONEY`
`@sys.unit-currency`	`$MONEY`
`@sys.phone-number`	`$FULLPHONENUM`
`@sys.zip-code`	`$POSTALCODE` or `$OOV_CLASS_POSTALCODE`
`@sys.address`	`$ADDRESSNUM $STREET $POSTALCODE`
`@sys.street-address`	`$ADDRESSNUM $STREET $POSTALCODE`
`@sys.temperature`	`$OOV_CLASS_TEMPERATURE`
`@sys.number-sequence`	`$OOV_CLASS_DIGIT_SEQUENCE`
`@sys.flight-number`	`$OOV_CLASS_ALPHANUMERIC_SEQUENCE`
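
When phrase sets are generated from page parameters, the table above can be kept as a lookup. This is a minimal sketch that copies the mappings from the table; the names `SYSTEM_ENTITY_TO_CLASS_TOKENS` and `phrase_for` are hypothetical, and token availability for a given language still needs to be checked against the list of supported class tokens.

```python
SYSTEM_ENTITY_TO_CLASS_TOKENS = {
    "@sys.date": "$MONTH $DAY $YEAR",
    "@sys.date-time": "$MONTH $DAY $YEAR",
    "@sys.date-period": "$MONTH $DAY $YEAR",
    "@sys.time": "$TIME",
    "@sys.time-period": "$TIME",
    "@sys.age": "$OPERAND",
    "@sys.number": "$OPERAND",
    "@sys.number-integer": "$OPERAND",
    "@sys.cardinal": "$OPERAND",
    "@sys.ordinal": "$OPERAND",
    "@sys.percentage": "$OPERAND",
    "@sys.duration": "$OPERAND",
    "@sys.currency-name": "$MONEY",
    "@sys.unit-currency": "$MONEY",
    "@sys.phone-number": "$FULLPHONENUM",
    # The table also lists $OOV_CLASS_POSTALCODE as an alternative.
    "@sys.zip-code": "$POSTALCODE",
    "@sys.address": "$ADDRESSNUM $STREET $POSTALCODE",
    "@sys.street-address": "$ADDRESSNUM $STREET $POSTALCODE",
    "@sys.temperature": "$OOV_CLASS_TEMPERATURE",
    "@sys.number-sequence": "$OOV_CLASS_DIGIT_SEQUENCE",
    "@sys.flight-number": "$OOV_CLASS_ALPHANUMERIC_SEQUENCE",
}


def phrase_for(system_entity, prefix=""):
    """Build an adaptation phrase: optional carrier words plus class tokens."""
    tokens = SYSTEM_ENTITY_TO_CLASS_TOKENS[system_entity]
    return f"{prefix} {tokens}".strip()


if __name__ == "__main__":
    print(phrase_for("@sys.phone-number"))
    print(phrase_for("@sys.flight-number", prefix="my flight is"))
```

An unknown entity name raises `KeyError`, which is a deliberate choice here so that unmapped entities are noticed rather than silently skipped.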

3. Boost value

If adding phrases without a boost value does not provide a strong enough biasing effect, you can use the boost value to strengthen the biasing effect further.

Boost applies additional bias when set to values greater than 0 and no more than 20. When boost is empty or 0, the default biasing effect helps recognize the whole phrase and the continuous portions within the phrase. For example, a non-boosted phrase "are you open to sell phones" helps recognize that phrase and also similar phrases like "I sell phones" and "Hi are you open".

When positive boost is applied, the biasing effect is stronger, but it only applies to the exact phrase. For example, a boosted phrase "sell phones" helps recognize "can you sell phones" but not "do you sell any phones".

For these reasons, you will get the best results if you provide phrases both with and without boost.

Higher boost values can result in fewer false negatives, which are cases where the word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text (underbiasing). However, boost can also increase the likelihood of false positives; that is, cases where the word or phrase appears in the transcription even though it didn't occur in the audio (overbiasing). You usually need to fine-tune your biasing phrases to find a good trade-off point between the two biasing issues.

You can learn more about how to fine-tune the boost value for phrases in the Cloud Speech-to-Text documentation on boost.
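
The boost range and the advice to mix boosted and non-boosted phrases can be captured in a small data model. This is an illustrative sketch only: `AdaptationPhrase` is a hypothetical class for organizing phrases before entering them in the console, and the boost value of 10 is an arbitrary starting point that would need tuning.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOOST = 20.0


@dataclass(frozen=True)
class AdaptationPhrase:
    text: str
    boost: Optional[float] = None  # None or 0 keeps the default biasing

    def __post_init__(self):
        if self.boost is not None and not 0 <= self.boost <= MAX_BOOST:
            raise ValueError(f"boost must be between 0 and {MAX_BOOST}")

    @property
    def is_boosted(self):
        """True only for a positive boost, which biases the exact phrase."""
        return bool(self.boost)


# Longer phrase without boost: biases the phrase and its continuous portions.
# Short phrase with boost: stronger bias, but only for the exact phrase.
PHRASE_SET = [
    AdaptationPhrase("are you open to sell phones"),
    AdaptationPhrase("sell phones", boost=10),
]

if __name__ == "__main__":
    for phrase in PHRASE_SET:
        print(phrase.text, "boosted" if phrase.is_boosted else "default")
```

Validating the range up front makes it harder to enter a value that the settings would reject, and keeping both kinds of phrases in one list reflects the recommendation to provide phrases with and without boost.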

When to use auto or manual speech adaptation

In general, if you are not sure if speech adaptation will improve speech recognition quality for your agent (no clear transcription error patterns in mind), you are encouraged to try auto speech adaptation first before resorting to manual speech adaptation. For more nuanced decisions, consider the following factors to decide between auto speech adaptation or manual speech adaptation:

1. Form filling

Auto speech adaptation works very well with form filling since it uses ABNF grammar context for the form parameters and enforces grammar rules based on their entity types. Since manual speech adaptation doesn't support ABNF grammars yet, auto speech adaptation is generally preferred over manual speech adaptation for a form filling page. Still, for pages with only system entity parameters and simple regexp entities that are supported by speech class tokens, you can also use manual speech adaptation to achieve a biasing effect similar to auto speech adaptation without the need to tune regexp entities.

2. Page or flow transition complexity

For a simple page or flow with a few intent routes, auto speech adaptation will likely generate representative biasing phrases and perform reasonably well.

However, if a page or flow has a large number of intent routes (for a page, also consider the number of flow-level routes), or if any of the intents have unimportant training phrases that are overly long or short (for example, a whole sentence or a single word with only one or two syllables), it is very likely that the speech adaptation model won't work well with these phrases. You should first try disabling speech adaptation for the open-ended pages with high complexity by enabling manual speech adaptation with empty phrase sets (empty adaptation override). After that, evaluate whether there are special non-ambiguous phrases that still need to be provided to Speech-to-Text to improve recognition quality.

Another symptom of this complexity issue is seeing a wide range of underbiasing or overbiasing issues when auto speech adaptation is enabled. As in the case above, first test with speech adaptation disabled for the specific page. If erroneous behaviors persist after disabling speech adaptation, add the correct phrases that you want recognition to favor to the speech adaptation settings, and add boost values to further strengthen the biasing effect when necessary.

Testing speech adaptation

When testing your agent's speech adaptation capabilities for a particular training phrase or entity match, you should not jump directly to testing the match with the first voice utterance of a conversation. You should use only voice or event inputs for the entire conversation prior to the match you want to test. The behavior of your agent when tested in this manner will be similar to the behavior in actual production conversations.

Limitations

The following limitations apply:

Speech adaptation is not available for all speech model and language combinations. Refer to the Cloud Speech language support page to verify whether "model adaptation" is available for your speech model and language combination.

Manual speech adaptation does not currently support custom classes or ABNF grammar. You can enable auto speech adaptation or use a runtime detect intent request to make use of these adaptation features.
The same boost value may perform differently for different speech models and languages, so use caution when configuring them manually for agents using multiple languages or speech models. Currently, manual speech adaptation applies to all languages in an agent, so multilingual agents should only use language-agnostic phrases or split each language into a separate agent. Since the default biasing behavior (not providing boost or 0 boost) usually performs reasonably well for all languages and models, you don't need to configure language-specific boost values unless stronger biasing is required for your recognition use case. You can learn more about how to fine-tune boost value in this Cloud Speech-to-Text guide.

Recognizing long character sequences is challenging. The number of characters that are captured in a single turn is directly related to the quality of your input audio. If you have followed all of the regexp entity guidelines and tried using relevant speech class tokens in manual speech adaptation settings and are still struggling to capture the entire sequence in a single turn, you may consider some more conversational alternatives:
- When validating the sequence against a database, consider cross-referencing other collected parameters like dates, names, or phone numbers to allow for incomplete matches. For example, instead of just asking a user for their order number, also ask for their phone number. Now, when your webhook queries your database for order status, it can rely first on the phone number, then return the closest matching order for that account. This could allow Dialogflow to mishear "ABC" as "AVC", yet still return the correct order status for the user.
- For extra long sequences, consider designing a flow that encourages end-users to pause in the middle so that the bot can confirm as you go.
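
The cross-referencing idea in the first bullet can be sketched with fuzzy matching from the Python standard library. This is a minimal sketch under stated assumptions: the in-memory `ORDERS_BY_PHONE` table stands in for a real database, the function name `closest_order` and the similarity cutoff of 0.6 are hypothetical choices, and a production webhook would need its own lookup and confirmation logic.

```python
import difflib

# Hypothetical data: order IDs grouped by the caller's phone number.
ORDERS_BY_PHONE = {
    "+15550100": ["ABC123", "XYZ789"],
}


def closest_order(phone, heard_order_id, orders_by_phone=ORDERS_BY_PHONE, cutoff=0.6):
    """Return the account's order ID most similar to what was transcribed.

    The phone number narrows the candidates first; the possibly misheard
    order ID is then matched approximately against that short list.
    """
    candidates = orders_by_phone.get(phone, [])
    matches = difflib.get_close_matches(
        heard_order_id.upper(), candidates, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    # "ABC" misheard as "AVC" still resolves to the caller's real order.
    print(closest_order("+15550100", "AVC123"))
    print(closest_order("+15550199", "AVC123"))
```

Restricting the candidates to one account keeps the fuzzy match from selecting another customer's similar-looking order, which is the point of collecting the phone number first.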

Speech models

Speech model migration Q1 2024