
Send a recognition request with model adaptation

You can improve the accuracy of the transcription results you get from Speech-to-Text by using model adaptation. The model adaptation feature lets you specify words and/or phrases that Speech-to-Text must recognize more frequently in your audio data than other alternatives that might otherwise be suggested. Model adaptation is particularly useful for improving transcription accuracy in the following use cases:

Your audio contains words or phrases that are likely to occur frequently.
Your audio is likely to contain words that are rare (such as proper names) or words that do not exist in general use.
Your audio contains noise or is otherwise not very clear.

For more information about using this feature, see Improve transcription results with model adaptation. For information about phrase and character limits per model adaptation request, see Quotas and limits. Not all models support speech adaptation. See Language Support to see which models support adaptation.

Code sample

Speech Adaptation is an optional Speech-to-Text configuration that you can use to customize your transcription results according to your needs. See the RecognitionConfig documentation for more information about configuring the recognition request body.

The following code sample shows how to improve transcription accuracy using SpeechAdaptation resources: a PhraseSet, a CustomClass, and a model adaptation boost. To reuse a PhraseSet or CustomClass in later requests, make a note of its resource name, which is returned in the response when you create the resource.
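
As a minimal sketch of the resource names mentioned above, the helpers below build the full names for a CustomClass and a PhraseSet and wrap a CustomClass name in the `${...}` syntax used inside a phrase. The helper names and the example IDs (`my-project`, `restaurants`) are hypothetical; only the `projects/.../locations/.../customClasses/...` pattern and the `${...}` reference come from the sample on this page, and the `phraseSets` pattern is assumed to follow the same shape.

```python
def custom_class_name(project_id: str, location: str, custom_class_id: str) -> str:
    """Return the full resource name of a CustomClass."""
    return f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"


def phrase_set_name(project_id: str, location: str, phrase_set_id: str) -> str:
    """Return the full resource name of a PhraseSet (assumed to mirror CustomClass)."""
    return f"projects/{project_id}/locations/{location}/phraseSets/{phrase_set_id}"


def class_reference(resource_name: str) -> str:
    """Wrap a CustomClass resource name in the ${...} syntax used inside phrases."""
    return "${" + resource_name + "}"


# Hypothetical project and class IDs, for illustration only.
name = custom_class_name("my-project", "global", "restaurants")
phrase = f"Visit restaurants like {class_reference(name)}"
print(name)
print(phrase)
```

In practice, prefer the `name` field returned by the create call when it is available; building the string by hand is only a fallback for resources that already exist.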

For a list of the pre-built classes available for your language, see Supported class tokens.
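
A pre-built class is referenced inside a phrase by its token rather than by a resource name. The sketch below builds an inline adaptation configuration as a plain dictionary, assuming `$MONTH` is among the tokens supported for your language (check the list of supported class tokens before relying on it). The helper name `inline_adaptation` is hypothetical; the `phrase_sets`, `phrases`, `value`, and `boost` keys mirror the SpeechAdaptation and PhraseSet fields.

```python
from typing import Dict, List


def inline_adaptation(phrases: List[str], boost: float) -> Dict:
    """Build an inline adaptation config with one phrase set sharing a single boost."""
    return {
        "phrase_sets": [
            {
                "phrases": [{"value": phrase} for phrase in phrases],
                "boost": boost,
            }
        ]
    }


# "$MONTH" is assumed to be a supported pre-built class token for the language.
adaptation = inline_adaptation(["My birthday is in $MONTH"], boost=10.0)
print(adaptation)
```

An inline phrase set like this avoids creating a stored resource, which can be convenient for one-off requests.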

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from google.cloud import speech_v1p1beta1 as speech


def transcribe_with_model_adaptation(
    project_id: str,
    location: str,
    storage_uri: str,
    custom_class_id: str,
    phrase_set_id: str,
) -> str:
    """Create a `PhraseSet` and a `CustomClass` that list similar items
    likely to occur in your input data, then transcribe audio using them.

    Args:
        project_id: The GCP project ID.
        location: The location in which to create the resources (for example, "global").
        storage_uri: The Cloud Storage URI of the input audio.
        custom_class_id: The ID of the custom class to create.
        phrase_set_id: The ID of the phrase set to create.

    Returns:
        The transcript of the input audio.
    """

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "sushido"},
                    {"value": "altura"},
                    {"value": "taneda"},
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )
    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 10,
                "phrases": [
                    {"value": f"Visit restaurants like ${{{custom_class_name}}}"}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(phrase_set_references=[phrase_set_name])

    # speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
    )

    # storage_uri is the Cloud Storage URI of the audio file to transcribe,
    # e.g. gs://[BUCKET]/[FILE]

    audio = speech.RecognitionAudio(uri=storage_uri)

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

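    transcripts = []
    for result in response.results:
        # The first alternative is the most likely one
        transcript = result.alternatives[0].transcript
        print(f"Transcript: {transcript}")
        transcripts.append(transcript.strip())

    # Join the per-result transcripts into the single string the signature promises
    return " ".join(transcripts)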
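
Once a PhraseSet exists, later requests can reference it by resource name without creating it again. The sketch below adds such references to a recognition configuration expressed as a plain dictionary. The helper name `with_phrase_sets` and the example resource name are hypothetical, and the sketch assumes the client library accepts a dictionary in place of a RecognitionConfig object.

```python
from typing import Dict, List


def with_phrase_sets(base_config: Dict, phrase_set_names: List[str]) -> Dict:
    """Return a copy of base_config that references the given phrase sets."""
    config = dict(base_config)  # shallow copy so the caller's dict is not modified
    config["adaptation"] = {"phrase_set_references": list(phrase_set_names)}
    return config


# Hypothetical values; set the encoding as in the sample above before sending.
base = {"language_code": "en-US", "sample_rate_hertz": 24000}
config = with_phrase_sets(
    base, ["projects/my-project/locations/global/phraseSets/my-phrase-set"]
)
print(config)
```

Keeping the base configuration separate makes it straightforward to send the same audio with and without adaptation and compare the transcripts.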