Improve transcription results with model adaptation

Overview

You can use the model adaptation feature to help Speech-to-Text recognize specific words or phrases more frequently than other options that might otherwise be suggested. For example, suppose that your audio data often includes the word "weather." When Speech-to-Text encounters the word "weather," you want it to transcribe the word as "weather" more often than "whether." In this case, you might use model adaptation to bias Speech-to-Text toward recognizing "weather."

Model adaptation is particularly helpful in the following use cases:

  • Improving the accuracy of words and phrases that occur frequently in your audio data. For example, you can alert the recognition model to voice commands that are typically spoken by your users.

  • Expanding the vocabulary of words recognized by Speech-to-Text. Speech-to-Text includes a very large vocabulary. However, if your audio data often contains words that are rare in general language use (such as proper names or domain-specific words), you can add them using model adaptation.

  • Improving the accuracy of speech transcription when the supplied audio contains noise or is not very clear.

Optionally, you can fine-tune the biasing of the recognition model using the model adaptation boost feature.

Improve recognition of words and phrases

To increase the probability that Speech-to-Text recognizes the word "weather" when it transcribes your audio data, you can pass the single word "weather" in the PhraseSet object in a SpeechAdaptation resource.

When you provide a multi-word phrase, Speech-to-Text is more likely to recognize those words in sequence. Providing a phrase also increases the probability of recognizing portions of the phrase, including individual words. See the content limits page for limits on the number and size of these phrases.
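For illustration, a v2 recognition request carrying both a single word and a multi-word phrase in an inline phrase set might look like the following sketch of a request body. Field names here mirror the v2 REST API; the exact shape depends on the client you use:

```python
# Sketch of a Speech-to-Text v2 recognition request body with an inline
# phrase set. Field names follow the v2 REST API (camelCase); client
# libraries express the same structure with their own types.
request_body = {
    "config": {
        "autoDecodingConfig": {},
        "languageCodes": ["en-US"],
        "model": "short",
        "adaptation": {
            "phraseSets": [
                {
                    "inlinePhraseSet": {
                        "phrases": [
                            {"value": "weather"},           # single word
                            {"value": "weather forecast"},  # multi-word phrase
                        ]
                    }
                }
            ]
        },
    },
    # In a REST request, "content" would carry the base64-encoded audio.
}
```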

Improve recognition using classes

Classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class helps you improve transcription accuracy for large groups of words that map to a common concept, but that don't always include identical words or phrases.

For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal ("one-hundred twenty-third"). However, not all people live at "123 Main Street." It's impractical to list every possible street address in a PhraseSet resource. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is. In this example, Speech-to-Text could then more accurately transcribe phrases like "123 Main Street" and "987 Grand Boulevard" because they are both recognized as address numbers.

Class tokens

To use a class in model adaptation, include a class token in the phrases field of a PhraseSet resource. Refer to the list of supported class tokens to see which tokens are available for your language. For example, to improve the transcription of address numbers from your source audio, provide the value $ADDRESSNUM within a phrase in a PhraseSet.

You can use classes as standalone items in the phrases array or embed one or more class tokens in longer multi-word phrases. For example, you can indicate an address number in a larger phrase by including the class token in a string: ["my address is $ADDRESSNUM"]. However, this phrase will not help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to additionally include the class token by itself: ["my address is $ADDRESSNUM", "$ADDRESSNUM"]. If you use an invalid or malformed class token, Speech-to-Text ignores the token without triggering an error but still uses the rest of the phrase for context.
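The pattern above can be sketched as a phrases list that includes the class token both embedded in a longer phrase and on its own:

```python
# Sketch: a phrases list combining an embedded class token with a
# standalone token entry. "$ADDRESSNUM" is a prebuilt class token;
# including it by itself as well as inside a longer phrase helps match
# similar but non-identical audio such as "I am at 123 Main Street".
phrases = [
    {"value": "my address is $ADDRESSNUM"},  # biases the full phrase
    {"value": "$ADDRESSNUM"},                # biases any address number
]
```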

Custom classes

You can also create your own CustomClass, a class composed of your own custom list of related items or values. For example, you want to transcribe audio data that is likely to include the name of any one of several hundred regional restaurants. Restaurant names are relatively rare in general speech and therefore less likely to be chosen as the "correct" answer by the recognition model. You can bias the recognition model toward correctly identifying these names when they appear in your audio using a custom class.

To use a custom class, create a CustomClass resource that includes each restaurant name as a ClassItem. Custom classes function in the same way as the pre-built class tokens. A phrase can include both prebuilt class tokens and custom classes.
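As a sketch of this structure (the class name and restaurant names below are hypothetical), a custom class holds each item as a value, and a phrase references the class by wrapping its name in `${...}`:

```python
# Sketch: a CustomClass with hypothetical restaurant names as items,
# referenced from a phrase via ${class_name}, the same way a prebuilt
# class token would be used.
custom_class = {
    "name": "restaurants",  # hypothetical class name
    "items": [
        {"value": "Casa Paella"},
        {"value": "Dockside Diner"},
    ],
}
phrases = [
    {"value": "table for two at ${restaurants}"},  # embedded in a phrase
    {"value": "${restaurants}"},                   # standalone reference
]
```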

Fine-tune transcription results using boost

By default, model adaptation provides a relatively small effect, especially for one-word phrases. The model adaptation boost feature lets you increase the recognition model bias by assigning more weight to some phrases than others. We recommend that you implement boost if all of the following are true:

  1. You have already implemented model adaptation.
  2. You would like to further adjust the strength of model adaptation effects on your transcription results. To see whether the boost feature is available for your language, see the language support page.

For example, you have many recordings of people asking about the "fare to get into the county fair," with the word "fair" occurring more frequently than "fare." In this case, you can use model adaptation to increase the probability of the model recognizing both "fair" and "fare" by adding them as phrases in a PhraseSet resource. This will tell Speech-to-Text to recognize "fair" and "fare" more often than, for example, "hare" or "lair."

However, "fair" should be recognized more often than "fare" due to its more frequent appearances in the audio. You might have already transcribed your audio using the Speech-to-Text API and found a high number of errors recognizing the correct word ("fair"). In this case, you might want to use the boost feature to assign a higher boost value to "fair" than "fare". The higher weighted value assigned to "fair" biases the Speech-to-Text API toward picking "fair" more frequently than "fare". Without boost values, the recognition model will recognize "fair" and "fare" with equal probability.
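The "fair"/"fare" weighting described above can be sketched as a phrase set in which both words appear but carry different boost values (the specific values here are illustrative; tune them against your own transcription results):

```python
# Sketch: both words are added so each is favored over unrelated
# alternatives, but "fair" gets a higher boost than "fare" because it
# occurs more often in this hypothetical audio corpus.
phrases = [
    {"value": "fair", "boost": 15},
    {"value": "fare", "boost": 5},
]
```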

Boost basics

When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.

If you assign a boost value to a multi-word phrase, boost is applied to the entire phrase and only the entire phrase. For example, you want to assign a boost value to the phrase "My favorite exhibit at the American Museum of Natural History is the blue whale". If you add that phrase to a phrase object and assign a boost value, the recognition model will be more likely to recognize that phrase in its entirety, word-for-word.

If you don't get the results you're looking for by boosting a multi-word phrase, we suggest that you add all bigrams (2-words, in order) that make up the phrase as additional phrase items and assign boost values to each. Continuing the previous example, you could investigate adding additional bigrams and n-grams (sequences of more than two words), such as "my favorite", "my favorite exhibit", "favorite exhibit", "my favorite exhibit at the American Museum of Natural History", "American Museum of Natural History", and "blue whale". The Speech-to-Text recognition model is then more likely to recognize related phrases in your audio that contain parts of the original boosted phrase but don't match it word-for-word.
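Generating the in-order bigrams of a phrase is mechanical, so a small helper like the following sketch can produce the additional phrase items to boost:

```python
def bigrams(phrase: str) -> list[str]:
    """Return the in-order two-word sequences of a phrase.

    Useful for generating additional phrase items to boost when
    boosting a long phrase alone doesn't improve recognition.
    """
    words = phrase.split()
    return [" ".join(words[i : i + 2]) for i in range(len(words) - 1)]

print(bigrams("blue whale exhibit"))  # ['blue whale', 'whale exhibit']
```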

Set boost values

A boost value must be a float greater than 0. The practical maximum limit for boost values is 20. For best results, experiment with your transcription results by adjusting your boost values up or down until you get accurate transcription results.

Higher boost values can result in fewer false negatives, which are cases where the word or phrase occurred in the audio but wasn't correctly recognized by Speech-to-Text. However, boost can also increase the likelihood of false positives; that is, cases where the word or phrase appears in the transcription even though it didn't occur in the audio.
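Because out-of-range boost values either fail or rarely help, it can be worth validating them before building phrase items. The helper below is a sketch; the bounds follow the guidance above (greater than 0, practical maximum of 20):

```python
def checked_boost(value: float, practical_max: float = 20.0) -> float:
    """Validate a boost value before adding it to a phrase item.

    Boost must be greater than 0; values above the practical maximum
    (about 20) rarely help and raise the risk of false positives.
    """
    if value <= 0:
        raise ValueError("boost must be greater than 0")
    if value > practical_max:
        raise ValueError(f"boost above {practical_max} is rarely useful")
    return float(value)

phrase_item = {"value": "fair", "boost": checked_boost(10)}
```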

Example use case using model adaptation

The following example walks you through the process of using model adaptation to transcribe an audio recording of someone saying "The word is fare". In this case, without speech adaptation, Speech-to-Text identifies the word "fair." Using speech adaptation, Speech-to-Text can identify the word "fare" instead.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Speech-to-Text APIs.

    Enable the APIs

  5. Make sure that you have the following role or roles on the project: Cloud Speech Administrator

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find the row that has your email address.

      If your email address isn't in that column, then you do not have any roles.

    4. In the Role column for the row with your email address, check whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your email address.
    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.
  6. Install the Google Cloud CLI.
  7. To initialize the gcloud CLI, run the following command:

    gcloud init
  8. Client libraries can use Application Default Credentials to easily authenticate with Google APIs and send requests to those APIs. With Application Default Credentials, you can test your application locally and deploy it without changing the underlying code. For more information, see Authenticate for using client libraries.

  9. Create local authentication credentials for your Google Account:

    gcloud auth application-default login

Also make sure that you have installed the client library.

Improve transcription using a PhraseSet

  1. The following sample builds a PhraseSet with the phrase "fare" and adds it as an inline_phrase_set in a recognition request:

Python

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def adaptation_v2_inline_phrase_set(
    project_id: str,
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    # Instantiates a client
    client = SpeechClient()

    # Reads a file as bytes
    with open(audio_file, "rb") as f:
        content = f.read()

    # Build inline phrase set to produce a more accurate transcript
    phrase_set = cloud_speech.PhraseSet(phrases=[{"value": "fare", "boost": 10}])
    adaptation = cloud_speech.SpeechAdaptation(
        phrase_sets=[
            cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                inline_phrase_set=phrase_set
            )
        ]
    )
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        adaptation=adaptation,
        language_codes=["en-US"],
        model="short",
    )

    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        content=content,
    )

    # Transcribes the audio into text
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response

  2. This sample creates a PhraseSet resource with the same phrase and then references it in a recognition request:

Python

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def adaptation_v2_phrase_set_reference(
    project_id: str,
    phrase_set_id: str,
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    """Transcribe audio files using a PhraseSet.

    Args:
        project_id: The GCP project ID.
        phrase_set_id: The ID of the PhraseSet to use.
        audio_file: The path to the audio file to transcribe.

    Returns:
        The response from the recognize call.
    """
    # Instantiates a client
    client = SpeechClient()

    # Reads a file as bytes
    with open(audio_file, "rb") as f:
        content = f.read()

    # Create a persistent PhraseSet to reference in a recognition request
    request = cloud_speech.CreatePhraseSetRequest(
        parent=f"projects/{project_id}/locations/global",
        phrase_set_id=phrase_set_id,
        phrase_set=cloud_speech.PhraseSet(phrases=[{"value": "fare", "boost": 10}]),
    )

    operation = client.create_phrase_set(request=request)
    phrase_set = operation.result()

    # Add a reference of the PhraseSet into the recognition request
    adaptation = cloud_speech.SpeechAdaptation(
        phrase_sets=[
            cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                phrase_set=phrase_set.name
            )
        ]
    )
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        adaptation=adaptation,
        language_codes=["en-US"],
        model="short",
    )

    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        content=content,
    )

    # Transcribes the audio into text
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response

Improve transcription results using a CustomClass

  1. The following sample builds a CustomClass with an item "fare" and name "fare". It then references the CustomClass within an inline_phrase_set in a recognition request:

Python

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def adaptation_v2_inline_custom_class(
    project_id: str,
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    """Transcribe audio file using inline custom class

    Args:
        project_id: The GCP project ID.
        audio_file: The audio file to transcribe.

    Returns:
        The response from the recognizer.
    """
    # Instantiates a client
    client = SpeechClient()

    # Reads a file as bytes
    with open(audio_file, "rb") as f:
        content = f.read()

    # Build inline phrase set to produce a more accurate transcript
    phrase_set = cloud_speech.PhraseSet(phrases=[{"value": "${fare}", "boost": 20}])
    custom_class = cloud_speech.CustomClass(name="fare", items=[{"value": "fare"}])
    adaptation = cloud_speech.SpeechAdaptation(
        phrase_sets=[
            cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                inline_phrase_set=phrase_set
            )
        ],
        custom_classes=[custom_class],
    )
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        adaptation=adaptation,
        language_codes=["en-US"],
        model="short",
    )

    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        content=content,
    )

    # Transcribes the audio into text
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response

  2. This sample creates a CustomClass resource with the same item. It then creates a PhraseSet resource with a phrase referencing the CustomClass resource name, and references the PhraseSet resource in a recognition request:

Python

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


def adaptation_v2_custom_class_reference(
    project_id: str,
    phrase_set_id: str,
    custom_class_id: str,
    audio_file: str,
) -> cloud_speech.RecognizeResponse:
    """Transcribe audio file using a custom class.

    Args:
        project_id: The GCP project ID.
        phrase_set_id: The ID of the phrase set to use.
        custom_class_id: The ID of the custom class to use.
        audio_file: The audio file to transcribe.

    Returns:
        The transcript of the audio file.
    """
    # Instantiates a client
    client = SpeechClient()

    # Reads a file as bytes
    with open(audio_file, "rb") as f:
        content = f.read()

    # Create a persistent CustomClass to reference in phrases
    request = cloud_speech.CreateCustomClassRequest(
        parent=f"projects/{project_id}/locations/global",
        custom_class_id=custom_class_id,
        custom_class=cloud_speech.CustomClass(items=[{"value": "fare"}]),
    )

    operation = client.create_custom_class(request=request)
    custom_class = operation.result()

    # Create a persistent PhraseSet to reference in a recognition request
    request = cloud_speech.CreatePhraseSetRequest(
        parent=f"projects/{project_id}/locations/global",
        phrase_set_id=phrase_set_id,
        phrase_set=cloud_speech.PhraseSet(
            phrases=[{"value": f"${{{custom_class.name}}}", "boost": 20}]
        ),
    )

    operation = client.create_phrase_set(request=request)
    phrase_set = operation.result()

    # Add a reference of the PhraseSet into the recognition request
    adaptation = cloud_speech.SpeechAdaptation(
        phrase_sets=[
            cloud_speech.SpeechAdaptation.AdaptationPhraseSet(
                phrase_set=phrase_set.name
            )
        ]
    )
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        adaptation=adaptation,
        language_codes=["en-US"],
        model="short",
    )

    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{project_id}/locations/global/recognizers/_",
        config=config,
        content=content,
    )

    # Transcribes the audio into text
    response = client.recognize(request=request)

    for result in response.results:
        print(f"Transcript: {result.alternatives[0].transcript}")

    return response

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

  1. Optional: Revoke the authentication credentials that you created, and delete the local credential file.

    gcloud auth application-default revoke
  2. Optional: Revoke credentials from the gcloud CLI.

    gcloud auth revoke

Delete the project

Console

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

gcloud

  Delete a Google Cloud project:

    gcloud projects delete PROJECT_ID

What's next