Improve transcription results with speech adaptation

Overview

You can use the speech adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word "weather". When Speech-to-Text encounters the word "weather," you want it to transcribe the word as "weather" more often than "whether." In this case, you might use speech adaptation to bias Speech-to-Text toward recognizing "weather."

Speech adaptation is particularly helpful in the following use cases:

  • Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.

  • Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using speech adaptation.

  • Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.

Optionally, you can fine-tune the biasing of the recognition model using the speech adaptation boost feature (Beta).

Improve recognition of specified words

To increase the probability that Speech-to-Text recognizes the word "weather" when it transcribes your audio data, pass "weather" in the phrases field of a SpeechContext object. Assign the SpeechContext object to the speechContexts field of the RecognitionConfig object in your request to the Speech-to-Text API.

The following snippet shows part of a JSON payload sent to the Speech-to-Text API. The JSON snippet provides the word "weather" for speech adaptation.

"config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode": "en-US",
    "speechContexts": [{
      "phrases": ["weather"]
    }]
}
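
The same configuration can also be assembled programmatically. The following is a minimal Python sketch that uses only the standard library to build the JSON fragment shown above. The helper name build_recognition_config and its default arguments are assumptions made for illustration; they are not part of the Speech-to-Text API.

```python
import json


def build_recognition_config(phrases, language_code="en-US",
                             encoding="LINEAR16", sample_rate_hertz=8000):
    """Return a dict that mirrors the RecognitionConfig JSON shown above.

    Hypothetical helper: only the field names come from the documentation.
    """
    return {
        "encoding": encoding,
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        # One SpeechContext object carrying the adaptation phrases.
        "speechContexts": [{"phrases": list(phrases)}],
    }


config = build_recognition_config(["weather"])
# Serialize the fragment as it would appear inside a request body.
print(json.dumps({"config": config}, indent=2))
```

The sketch only builds the payload. Sending it to the API, and supplying the audio itself, is left to whichever client you already use.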

Improve recognition of multi-word phrases

When you provide a multi-word phrase, Speech-to-Text is more likely to recognize those words in sequence. Providing a phrase also increases the probability of recognizing portions of the phrase, including individual words. See the content limits page for limits on the number and size of these phrases.

The following snippet shows part of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes an array of multi-word phrases assigned to the phrases field of a SpeechContext object.

"config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode": "en-US",
    "speechContexts": [{
      "phrases": ["weather is hot", "weather is cold"]
    }]
}
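
When phrases are generated from application data rather than typed by hand, it can help to tidy the list before sending it. The following Python sketch builds a SpeechContext dict from generated multi-word phrases, collapsing extra whitespace and dropping blanks and duplicates. The helper name make_speech_context is hypothetical, and the cleanup is a client-side convenience, not a documented requirement of the API.

```python
def make_speech_context(phrases):
    """Build a SpeechContext dict from an iterable of phrases.

    Hypothetical helper: collapses runs of whitespace, skips empty
    phrases, and removes duplicates while preserving the original order.
    """
    seen = set()
    cleaned = []
    for phrase in phrases:
        normalized = " ".join(phrase.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return {"phrases": cleaned}


# Generate multi-word phrases from a list of conditions (note the repeat).
conditions = ["hot", "cold", "hot"]
context = make_speech_context(f"weather is {c}" for c in conditions)
print(context)
```

Removing duplicates also keeps the list smaller, which matters because the number and size of phrases are subject to the content limits mentioned above.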

Improve recognition using classes

Classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class allows you to improve transcription accuracy for large groups of words that map to a common concept, but that don't always include identical words or phrases.

For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal ("one-hundred twenty-third"). However, not all people live at "123 Main Street." It's impractical to list every possible street address in a SpeechContext object. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is. In this example, Speech-to-Text could then more accurately transcribe phrases like "123 Main Street" and "987 Grand Boulevard" because they are both recognized as address numbers.

Class tokens

To use a class in speech adaptation, include a class token in the phrases field of the SpeechContext object. Refer to the list of supported class tokens to see which tokens are available for your language. For example, to improve the transcription of address numbers from your source audio, provide the value $ADDRESSNUM in your SpeechContext object.

You can use classes either as stand-alone items in the phrases array or embed one or more class tokens in longer multi-word phrases. For example, you can indicate an address number in a larger phrase by including the class token in a string: ["my address is $ADDRESSNUM"]. However, this phrase will not help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to additionally include the class token by itself: ["my address is $ADDRESSNUM", "$ADDRESSNUM"]. If you use an invalid or malformed class token, Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.

The following snippet shows part of a JSON payload sent to the Speech-to-Text API. The JSON snippet includes a SpeechContext object that uses a class token.

"config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode": "en-US",
    "speechContexts": [{
      "phrases": ["$ADDRESSNUM"]
    }]
}
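
The advice to include each class token by itself, in addition to any longer phrase that embeds it, can be automated. The following Python sketch scans a list of phrases and appends any embedded class token that is not already present as a stand-alone entry. The helper name with_standalone_tokens is hypothetical, and the regular expression assumes that class tokens consist of a dollar sign followed by uppercase letters and underscores. Check that assumption against the list of supported class tokens for your language.

```python
import re

# Assumption: a class token is "$" followed by uppercase letters/underscores.
CLASS_TOKEN = re.compile(r"\$[A-Z][A-Z_]*")


def with_standalone_tokens(phrases):
    """Return the phrases plus each embedded class token as its own entry.

    Hypothetical helper: tokens already listed by themselves are not
    added a second time, and the original phrase order is preserved.
    """
    original = list(phrases)
    result = list(original)
    for phrase in original:
        for token in CLASS_TOKEN.findall(phrase):
            if token not in result:
                result.append(token)
    return result


phrases = with_standalone_tokens(["my address is $ADDRESSNUM"])
print(phrases)
```

The sketch does not validate tokens against the supported list. Because Speech-to-Text ignores an invalid token without raising an error, a misspelled token would pass through silently.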

What's next