Improving accuracy with speech adaptation

This page describes how to improve the accuracy of speech transcription results from Cloud Speech-to-Text.

When you send a transcription request to Cloud Speech-to-Text, you can include a list of phrases that act as "hints" to Cloud Speech-to-Text. Providing these hints, a technique called speech adaptation, helps Speech-to-Text API recognize the specified phrases in your audio data. For example, if your source audio includes a speaker saying "meet" frequently and you specify the phrase "meat" as a speech adaptation hint, Speech-to-Text API is more likely to transcribe the word as "meat" rather than "meet."

When you use speech adaptation, you can also specify how strongly Cloud Speech-to-Text should favor those phrases, and you can provide phrases as types or classes of entities.

Speech adaptation

For any given recognition task, you can provide speechContexts (of type SpeechContext) that provide information to aid in processing the given audio. Currently, a context can hold a list of phrases that act as "hints" to the recognizer; these phrases can boost the probability that such words or phrases are recognized.

You might use speech adaptation in a few ways:

  • Improve the accuracy for specific words and phrases that tend to be overrepresented in your audio data. For example, if specific commands are typically spoken by the user, you can provide these as speech adaptations. Such additional phrases may be particularly useful if the supplied audio contains noise or the contained speech is not very clear.

  • Add words to the vocabulary of the recognition task. Cloud Speech-to-Text includes a very large vocabulary. However, if proper names or domain-specific words are out-of-vocabulary, you can add them to the phrases provided in the speechContexts field of your request.

You can provide speech adaptation phrases either as small groups of words or as single words. (See the content limits for limits on the number and size of these phrases.) When provided as a multi-word phrase, speech adaptation boosts the probability of recognizing those words in sequence, but also, to a lesser extent, boosts the probability of recognizing portions of the phrase, including individual words. The following snippet shows a request configuration that provides the phrase "Weather is hot" as a hint.

"config": {
    "encoding":"LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode":"en-US",
    "speechContexts": [{
      "phrases": ["Weather is hot"]
    }]
}
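For reference, the same request body can be assembled programmatically. The following is a minimal sketch in plain Python dictionaries that mirrors the JSON shape above (the helper name build_recognition_config is illustrative, not part of the client library):

```python
def build_recognition_config(phrases, sample_rate_hertz=8000, language_code="en-US"):
    """Build a RecognitionConfig-shaped dict with speech adaptation phrases."""
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        # Each speechContexts entry holds a list of hint phrases.
        "speechContexts": [{"phrases": list(phrases)}],
    }

config = build_recognition_config(["Weather is hot"])
```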

Boost

In your source audio, some common phrases may occur often, but with different frequencies. For example, if your audio data includes recordings of customer phone calls, the audio might frequently contain product or brand names. However, if the same recordings also include command prompts spoken by the customer—"check balance" or "talk to representative" as examples—then these phrases may occur even more frequently than the product or brand names.

By default, speech adaptation provides a relatively small effect, especially for one-word phrases (also known as unigrams). To amplify the effect of speech adaptation, you can use boost-based adaptation. With boost adaptation, you provide a relative value that biases Cloud Speech-to-Text towards a speech adaptation phrase during transcription, and you can provide different values for different phrases. Give the highest boost values to the phrases that occur most frequently in your audio data. For other common phrases, you might still want to bias Cloud Speech-to-Text towards them, but not to the same degree as the most frequent phrases.
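Extending the customer-call scenario above, you might give frequently spoken command prompts a higher boost than product names. The sketch below builds a speechContexts list with a different boost per phrase (the boost values and the product name "Acme Widget" are illustrative assumptions, not from the source):

```python
# Map each phrase to a boost value; higher values bias recognition more strongly.
phrase_boosts = {
    "check balance": 15.0,           # most frequent phrases get the highest boost
    "talk to representative": 15.0,
    "Acme Widget": 5.0,              # less frequent product name: smaller boost
}

# One speechContexts entry per phrase, each with its own boost.
speech_contexts = [
    {"phrases": [phrase], "boost": boost}
    for phrase, boost in phrase_boosts.items()
]
```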

To use boost adaptation, you set the boost field in a speechContexts entry of the RecognitionConfig. The boost value must be a non-negative float, ideally between 0 and 20. The higher the value, the more likely Cloud Speech-to-Text is to choose the phrase over the possible alternatives.

Higher boost values can result in fewer false negatives—cases where the utterance occurred in the audio but wasn't recognized by Cloud Speech-to-Text. However, boost can also increase the likelihood of false positives, that is, cases where the utterance doesn't occur in the audio but appears in the transcription. For best results, experiment with an initial value and adjust up or down as needed.
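One way to run that experiment is to transcribe a labeled test set at several candidate boost values and compare word error rates. The helper below is a generic word-error-rate sketch (a standard word-level Levenshtein distance, not part of the Speech-to-Text API) that you could use to score each candidate value:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

You would then pick the boost value that yields the lowest aggregate error rate across your test set.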

The following snippet shows an example of a RecognitionConfig field that uses boost adaptation.

"config": {
    "encoding":"LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode":"en-US",
    "speechContexts": [{
      "phrases": ["Weather is hot"],
      "boost": 2
     }]
  }

Classes

Classes are groups of words that represent common concepts occurring in natural language. For example, a typical English speaker might categorize the words "February," "July," "August," and "November" all as 'months'.

In cases where you want to detect a specific type of concept in speech recognition, like a month, you can use a class token in the phrases collection of your RecognitionConfig. Extending the earlier example, rather than providing all possible months in your speechContexts field, you only need to provide the value $MONTH. By using the $MONTH class in your recognition config, Cloud Speech-to-Text is more likely to correctly transcribe audio that includes months.

You can use classes either as stand-alone phrases or embedded in larger phrases, such as Meet me at $ADDRESSNUM. If you use an invalid or malformed class token, Cloud Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.

The following code snippet shows an example of speech context using a class token.

  "config": {
    "encoding":"LINEAR16",
    "sampleRateHertz": 8000,
    "languageCode":"en-US",
    "speechContexts": [{
      "phrases": ["$MONTH"]
     }]
  }

Supported class tokens

The following list shows the class tokens supported in Cloud Speech-to-Text, with example utterances for each. Note that not all classes are available in all languages.

  • $ADDRESSNUM ("335," "1"): Street number for an address. You can also use this token to recognize a whole number.

  • $POSTALCODE ("98024," "10011"): 5-digit postal code used in the United States.

  • $FULLPHONENUM ("718-212-6101," "1 123-456-1234," "911"): Phone number, with or without country code. Also recognizes special numbers such as "1-800" and "911." The audio can include hyphens.

  • $TIME ("5 o'clock," "7:30"): Time within a day. The suffixes "AM" and "PM" are not supported.

  • $DAY ("3rd," "15th," "31st"): Date within a month, as an ordinal number from "1st" to "31st." Phrases with "day within a week" are not supported.

  • $MONTH ("May," "October"): Month in a year. Contextual phrases like "2 months from now" are not supported.

  • $YEAR ("1999," "2020"): Year.

  • $FULLDATE ("9/3/2019"): Full date in D/M/YYYY format. Only available in de-DE.

  • $MONEY ("eight dollars five cents"): Number followed by a currency unit name.

  • $OOV_CLASS_DIGIT_SEQUENCE ("123 45 6789," "1234"): Digit sequence of any length (for example, a social security number).

  • $OPERAND ("1," "0.46," "345," "1 1/2"): Numerical value, including whole numbers, fractions, and decimals.

  • $ORDINAL ("1st," "third"): Ordinal number.

  • $PERCENT ("50%," "0.01%"): Percentage, in number-percent format.

Using speech contexts in speech recognition

The following code samples demonstrate how to specify boost for speech contexts provided in a transcription request sent to Speech-to-Text API.

REST API

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

The following example shows how to send a POST request using curl, where the body of the request sets the boost for specific speech contexts.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    'https://speech.googleapis.com/v1p1beta1/speech:recognize' \
    -d '{
        "config":{
            "languageCode":"en-US",
            "speechContexts":[{
                "phrases":["meat"],
                "boost": 2
            }]
        },
        "audio":{
            "uri":"gs://cloud-samples-tests/speech/context-test.wav"
        }
    }' > context-strength.txt

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format, saved to a file named context-strength.txt.


{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "I am testing speech context with saying for children went for school and wanted to meet for a party and then heat meat",
          "confidence": 0.9463943
        }
      ],
      "languageCode": "en-us"
    }
  ]
}
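Downstream code typically reads only the top alternative from such a response. A minimal sketch in plain Python (standard-library json parsing of the example response above; not part of the client library) that extracts the most probable transcript and its confidence:

```python
import json

# The example response above, as a JSON string.
response_json = '''
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "I am testing speech context with saying for children went for school and wanted to meet for a party and then heat meat",
          "confidence": 0.9463943
        }
      ],
      "languageCode": "en-us"
    }
  ]
}
'''

response = json.loads(response_json)
# The first alternative of each result is the most probable one.
top = response["results"][0]["alternatives"][0]
print(top["transcript"])
print(top["confidence"])
```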

Java

/*
 * Please include the following imports to run this sample.
 *
 * import com.google.cloud.speech.v1p1beta1.RecognitionAudio;
 * import com.google.cloud.speech.v1p1beta1.RecognitionConfig;
 * import com.google.cloud.speech.v1p1beta1.RecognizeRequest;
 * import com.google.cloud.speech.v1p1beta1.RecognizeResponse;
 * import com.google.cloud.speech.v1p1beta1.SpeechClient;
 * import com.google.cloud.speech.v1p1beta1.SpeechContext;
 * import com.google.cloud.speech.v1p1beta1.SpeechRecognitionAlternative;
 * import com.google.cloud.speech.v1p1beta1.SpeechRecognitionResult;
 * import java.util.Arrays;
 * import java.util.List;
 */

/**
 * Performs synchronous speech recognition with speech adaptation.
 *
 * @param sampleRateHertz Sample rate in Hertz of the audio data sent in all `RecognitionAudio`
 *     messages. Valid values are: 8000-48000.
 * @param languageCode The language of the supplied audio.
 * @param phrase Phrase "hints" help Speech-to-Text API recognize the specified phrases from your
 *     audio data.
 * @param boost Positive value will increase the probability that a specific phrase will be
 *     recognized over other similar sounding phrases.
 * @param uriPath Path to the audio file stored on GCS.
 */
public static void sampleRecognize(
    int sampleRateHertz, String languageCode, String phrase, float boost, String uriPath) {
  try (SpeechClient speechClient = SpeechClient.create()) {
    // sampleRateHertz = 44100;
    // languageCode = "en-US";
    // phrase = "Brooklyn Bridge";
    // boost = 20.0F;
    // uriPath = "gs://cloud-samples-data/speech/brooklyn_bridge.mp3";
    RecognitionConfig.AudioEncoding encoding = RecognitionConfig.AudioEncoding.MP3;
    List<String> phrases = Arrays.asList(phrase);
    SpeechContext speechContextsElement =
        SpeechContext.newBuilder().addAllPhrases(phrases).setBoost(boost).build();
    List<SpeechContext> speechContexts = Arrays.asList(speechContextsElement);
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(encoding)
            .setSampleRateHertz(sampleRateHertz)
            .setLanguageCode(languageCode)
            .addAllSpeechContexts(speechContexts)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(uriPath).build();
    RecognizeRequest request =
        RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();
    RecognizeResponse response = speechClient.recognize(request);
    for (SpeechRecognitionResult result : response.getResultsList()) {
      // First alternative is the most probable result
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript: %s\n", alternative.getTranscript());
    }
  } catch (Exception exception) {
    System.err.println("Failed to create the client due to: " + exception);
  }
}

Node.js


const speech = require('@google-cloud/speech').v1p1beta1;

/**
 * Performs synchronous speech recognition with speech adaptation.
 *
 * @param sampleRateHertz {number} Sample rate in Hertz of the audio data sent in all
 * `RecognitionAudio` messages. Valid values are: 8000-48000.
 * @param languageCode {string} The language of the supplied audio.
 * @param phrase {string} Phrase "hints" help Speech-to-Text API recognize the specified phrases from
 * your audio data.
 * @param boost {number} Positive value will increase the probability that a specific phrase will be
 * recognized over other similar sounding phrases.
 * @param uriPath {string} Path to the audio file stored on GCS.
 */
function sampleRecognize(
  sampleRateHertz,
  languageCode,
  phrase,
  boost,
  uriPath
) {
  const client = new speech.SpeechClient();
  // const sampleRateHertz = 44100;
  // const languageCode = 'en-US';
  // const phrase = 'Brooklyn Bridge';
  // const boost = 20.0;
  // const uriPath = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3';
  const encoding = 'MP3';
  const phrases = [phrase];
  const speechContextsElement = {
    phrases: phrases,
    boost: boost,
  };
  const speechContexts = [speechContextsElement];
  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    speechContexts: speechContexts,
  };
  const audio = {
    uri: uriPath,
  };
  const request = {
    config: config,
    audio: audio,
  };
  client
    .recognize(request)
    .then(responses => {
      const response = responses[0];
      for (const result of response.results) {
        // First alternative is the most probable result
        const alternative = result.alternatives[0];
        console.log(`Transcript: ${alternative.transcript}`);
      }
    })
    .catch(err => {
      console.error(err);
    });
}

Python

from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1 import enums


def sample_recognize(storage_uri, phrase):
    """
    Transcribe a short audio file with speech adaptation.

    Args:
      storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
      phrase Phrase "hints" help recognize the specified phrases from your audio.
    """

    client = speech_v1p1beta1.SpeechClient()

    # storage_uri = 'gs://cloud-samples-data/speech/brooklyn_bridge.mp3'
    # phrase = 'Brooklyn Bridge'
    phrases = [phrase]

    # Hint Boost. This value increases the probability that a specific
    # phrase will be recognized over other similar sounding phrases.
    # The higher the boost, the higher the chance of false positive
    # recognition as well. Can accept wide range of positive values.
    # Most use cases are best served with values between 0 and 20.
    # Using a binary search approach may help you find the optimal value.
    boost = 20.0
    speech_contexts_element = {"phrases": phrases, "boost": boost}
    speech_contexts = [speech_contexts_element]

    # Sample rate in Hertz of the audio data sent
    sample_rate_hertz = 44100

    # The language of the supplied audio
    language_code = "en-US"

    # Encoding of audio data sent. This sample sets this explicitly.
    # This field is optional for FLAC and WAV audio formats.
    encoding = enums.RecognitionConfig.AudioEncoding.MP3
    config = {
        "speech_contexts": speech_contexts,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "encoding": encoding,
    }
    audio = {"uri": storage_uri}

    response = client.recognize(config, audio)
    for result in response.results:
        # First alternative is the most probable result
        alternative = result.alternatives[0]
        print(u"Transcript: {}".format(alternative.transcript))
