Speech-to-Text basics

Overview

This document is a guide to the basics of using Speech-to-Text. This conceptual guide covers the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. We recommend that all users of Speech-to-Text read this guide and one of the associated tutorials before diving into the API itself.

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Speech-to-Text performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Recognizers

Recognizers are the fundamental resource for recognition in V2. They define a configurable, reusable recognition configuration. Recognizers hold the model used for recognition and a list of language codes used for recognition. These values cannot be overridden when making recognition requests with the recognizer. Recognizers also have a default RecognitionConfig, which can be overridden per recognition request.

A RecognitionConfig contains the following sub-fields:

  • decoding_config: specify either auto_decoding_config to enable automatic detection of audio metadata, or explicit_decoding_config to describe the audio metadata yourself. You must set explicit_decoding_config for headerless PCM audio.
  • features: the RecognitionFeatures to set for recognition.
  • adaptation: the SpeechAdaptation to use for recognition. For more information, see the speech adaptation concepts page.

Before making any recognition requests, you first need to create a Recognizer by calling the CreateRecognizer method.

Then, when making a recognition request, specify your Recognizer's name in the request, using the following format:

projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID

Replace PROJECT_ID with your Google Cloud project ID, LOCATION with the desired location (us, global, etc.), and RECOGNIZER_ID with an identifier for your Recognizer.
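
For example, the following sketch creates a Recognizer with the Python client library for Speech-to-Text V2 (this assumes the google-cloud-speech package is installed; the project, recognizer ID, model, and language code shown are placeholders, and the field placement follows the Recognizer description above):

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

client = SpeechClient()

# CreateRecognizer returns a long-running operation; result() waits for it to finish.
operation = client.create_recognizer(
    request=cloud_speech.CreateRecognizerRequest(
        parent="projects/PROJECT_ID/locations/global",
        recognizer_id="my-recognizer",
        recognizer=cloud_speech.Recognizer(
            model="latest_long",        # cannot be overridden per recognition request
            language_codes=["en-US"],   # cannot be overridden per recognition request
            default_recognition_config=cloud_speech.RecognitionConfig(
                auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            ),
        ),
    )
)
recognizer = operation.result()
print(recognizer.name)  # projects/PROJECT_ID/locations/global/recognizers/my-recognizer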

Speech requests

Speech-to-Text has the following main methods to perform speech recognition:

  • Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

  • Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.

Requests contain configuration parameters as well as audio data. The following sections describe these types of recognition requests, the responses they generate, and how to handle those responses in more detail.

Speech-to-Text API recognition

A Speech-to-Text API synchronous recognition request is the simplest method for performing recognition on speech audio data. Speech-to-Text can process up to 1 minute of speech audio data sent in a synchronous request. After Speech-to-Text processes and recognizes all of the audio, it returns a response.

A synchronous request is blocking, meaning that Speech-to-Text must return a response before processing the next request. Speech-to-Text typically processes audio faster than real time, processing 30 seconds of audio in 15 seconds on average. In cases of poor audio quality, your recognition request can take significantly longer.

Speech-to-Text has both REST and gRPC methods for calling Speech-to-Text API synchronous requests. This document demonstrates the REST API because it is simpler for showing and explaining basic use of the API. However, the basic makeup of a REST or gRPC request is quite similar. Streaming Recognition Requests are supported only by gRPC.

Synchronous Speech Recognition Requests

A synchronous Speech-to-Text API request consists of a speech recognition configuration and audio data. A sample request is shown below:

{
    "recognizer": "projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID",
    "config": {
        "explicitDecodingConfig": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
        }
    },
    "uri": "gs://bucket-name/path_to_audio_file"
}
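
If you use the Python client library for Speech-to-Text V2 instead of REST, an equivalent request might look like the following sketch; the resource names and the Cloud Storage URI are placeholders, and the 16 kHz mono LINEAR16 values are example assumptions for headerless PCM audio:

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

client = SpeechClient()

response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer="projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID",
        config=cloud_speech.RecognitionConfig(
            explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
                encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
                sample_rate_hertz=16000,
                audio_channel_count=1,  # assumes mono audio
            ),
        ),
        uri="gs://bucket-name/path_to_audio_file",
    )
)

# Each result holds one or more alternatives; the first is the most likely.
for result in response.results:
    print(result.alternatives[0].transcript)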

Audio is supplied to Speech-to-Text through the request's audio_source field, which contains one of the following sub-fields:

  • content: Contains the audio to evaluate, embedded within the request. See Embedded audio content below for more information.
  • uri: Contains a URI pointing to the audio content. The file must not be compressed (for example, gzip). Currently, this field must contain a Google Cloud Storage URI (of format gs://bucket-name/path_to_audio_file). See Pass audio referenced by a URI below.

Audio for synchronous recognition is limited to 60 seconds in duration and 10 MB in size.

For more information about these request and response parameters, see the following sections.

Audio Metadata

Within the RecognitionConfig, you can specify either an ExplicitDecodingConfig or an AutoDetectDecodingConfig.

With an AutoDetectDecodingConfig, the service will automatically detect the audio metadata.

You can only use the ExplicitDecodingConfig for headerless PCM audio. To set the ExplicitDecodingConfig, specify the sample rate of your audio in the sampleRateHertz field; it must match the sample rate of the associated audio content or stream. Sample rates between 8000 Hz and 48000 Hz are supported in Speech-to-Text. You must also set the encoding field to one of the supported AudioEncoding values.
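
As a sketch, the two decoding options look like the following when built with the Python client library types for Speech-to-Text V2 (assumed to be installed as google-cloud-speech); the 16 kHz mono LINEAR16 values are example assumptions:

from google.cloud.speech_v2.types import cloud_speech

# Let the service detect the audio metadata automatically.
auto_config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
)

# Describe headerless PCM audio explicitly.
explicit_config = cloud_speech.RecognitionConfig(
    explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
        encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        audio_channel_count=1,
    ),
)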

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher values have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, do not resample your audio to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its native sample rate.

Languages

Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Language Support page.

Time offsets (timestamps)

Speech-to-Text can include time offset values (timestamps) for the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Time offsets are supported for both the recognize and streamingRecognize methods.

Time offset values are only included for the first alternative provided in the recognition response.

To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your RecognitionFeatures. For examples using the REST API or the client libraries, see Using Time Offsets (Timestamps). For example, you can include the enableWordTimeOffsets parameter in the request as shown here:

{
    "recognizer": "projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID",
    "config": {
      "features": {
        "enableWordTimeOffsets": true
      }
    },
    "uri":"gs://gcs-test-data/gettysburg.flac"
}

The result returned by the Speech-to-Text API contains time offset values for each recognized word, as shown in the following example:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "Four score and twenty...(etc)...",
          "confidence": 0.97186122,
          "words": [
            {
              "startTime": "1.300s",
              "endTime": "1.400s",
              "word": "Four"
            },
            {
              "startTime": "1.400s",
              "endTime": "1.600s",
              "word": "score"
            },
            {
              "startTime": "1.600s",
              "endTime": "1.600s",
              "word": "and"
            },
            {
              "startTime": "1.600s",
              "endTime": "1.900s",
              "word": "twenty"
            },
            ...
          ]
        }
      ]
    },
    {
      "alternatives": [
        {
          "transcript": "for score and plenty...(etc)...",
          "confidence": 0.9041967,
        }
      ]
    }
  ]
}
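
For example, assuming the JSON response above has been parsed into a Python dictionary named response (as in the transcription-handling example later in this document), you could list the word timings with a sketch like this:

# Iterate over word-level timing in the first (most likely) alternative of each result.
for result in response['results']:
    alternative = result['alternatives'][0]
    for word_info in alternative.get('words', []):
        print(word_info['word'], word_info['startTime'], word_info['endTime'])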

Model selection

Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources.

When you send an audio transcription request to Speech-to-Text, you can improve the results that you receive by specifying the source of the original audio. This allows the Speech-to-Text API to process your audio files using a machine learning model trained to recognize speech audio from that particular type of source.

To specify a model for speech recognition, set the model field in the Recognizer when creating it and then reference that recognizer when making a recognition request.

Speech-to-Text can use the following types of machine learning models for transcribing your audio files.

  • Latest Long (latest_long): Use this model for any kind of long-form content such as media or spontaneous speech and conversations. Consider using this model in place of the video model, especially if the video model is not available in your target language. You can also use this in place of the default model.

  • Latest Short (latest_short): Use this model for short utterances that are a few seconds in length. It is useful for trying to capture commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model.

  • Telephony (telephony): Use this model for transcribing audio from a phone call. Typically, phone audio is recorded at an 8,000 Hz sampling rate.

  • Medical dictation (medical_dictation): Use this model to transcribe notes dictated by a medical professional.

  • Medical conversation (medical_conversation): Use this model to transcribe a conversation between a medical professional and a patient.

Embedded audio content

Embedded audio is included in the speech recognition request when passing a content parameter within the request's audio_source field. For embedded audio provided as content within a gRPC request, that audio must be compatible with Proto3 serialization and provided as binary data. For embedded audio provided as content within a REST request, that audio must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.

When constructing a request using a Google Cloud client library, you generally write this binary (or Base64-encoded) data directly into the content field.
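
For example, a REST request body with embedded audio could be assembled along the following lines; the local file path is a placeholder, the recognizer name follows the format shown earlier, and the JSON field names are assumed to match the sample request above:

import base64
import json

# Read the raw audio and Base64-encode it for JSON (REST) transport.
with open("path/to/audio-file.flac", "rb") as audio_file:
    encoded_audio = base64.b64encode(audio_file.read()).decode("utf-8")

request_body = {
    "recognizer": "projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID",
    "config": {"autoDecodingConfig": {}},
    "content": encoded_audio,
}
print(json.dumps(request_body)[:120])  # preview of the serialized request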

Pass audio referenced by a URI

More typically, you will pass a uri parameter within the Speech request's audio_source field, pointing to an audio file (in binary format, not base64) located on Google Cloud Storage of the following form:

gs://bucket-name/path_to_audio_file

For example, the following part of a Speech request references the sample audio file used within the Quickstart:

...
"uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
...

You must create a service account for Speech-to-Text and give that account read access to the relevant storage object. To create a service account, in Cloud Shell, run the following command to create the account if it doesn't exist, and display it.

gcloud beta services identity create --service=speech.googleapis.com \
    --project=PROJECT_ID

If you're prompted to install the gcloud Beta Commands component, type Y. After installation, the command is automatically restarted.

The service account ID is formatted like an email address:

Service identity created: service-xxx@gcp-sa-speech.iam.gserviceaccount.com

Give this account read access to the relevant storage object on which you want to run recognition.

More information about managing access to Google Cloud Storage is available at Creating and Managing Access Control Lists in the Google Cloud Storage documentation.

Speech-to-Text API responses

As indicated previously, a synchronous Speech-to-Text API response may take some time to return results, proportional to the length of the supplied audio. Once processed, the API will return a response as shown below:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.98267895,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ],
     "resultEndOffset": "0.780s",
     "languageCode": "en-US"
    }
  ]
}

These fields are explained below:

  • results: contains the list of results (of type SpeechRecognitionResult) where each result corresponds to a segment of audio (segments of audio are separated by pauses). Each result will consist of one or more of the following fields:
    • alternatives: Contains a list of possible transcriptions, of type SpeechRecognitionAlternative. Whether more than one alternative appears depends both on whether you requested more than one alternative (by setting maxAlternatives in RecognitionFeatures to a value greater than 1) and on whether Speech-to-Text produced alternatives of high enough quality. Each alternative will consist of the following fields:
      • transcript: Contains the transcribed text. See Handling transcriptions below for more information.
      • confidence: Contains a value between 0 and 1 indicating how confident Speech-to-Text is in the transcription. See Confidence values below for more information.
    • channelTag: The channel tag corresponding to the recognized result for the audio from that channel. This is only set for multi-channel audio.
    • resultEndOffset: The time offset of the end of this result relative to the beginning of the audio.
    • languageCode: Corresponds to the language code used for recognition in this result. If multiple language codes were given for recognition, then this value corresponds to the language that is most likely to have been spoken in the audio.

If no speech from the supplied audio could be recognized, then the returned results list will contain no items. Unrecognized speech is commonly the result of very poor-quality audio, or from language code, encoding, or sample rate values that do not match the supplied audio.

The components of this response are explained in the following sections.

Each synchronous Speech-to-Text API response returns a list of results, rather than a single result containing all recognized audio. The list of recognized audio (within the transcript elements) will appear in contiguous order.

Select alternatives

Each result within a successful synchronous recognition response can contain one or more alternatives (if the maxAlternatives value in RecognitionFeatures is greater than 1). If Speech-to-Text determines that an alternative has a sufficient Confidence Value, then that alternative is included in the response. The first alternative in the response is always the best (most likely) alternative.

Setting maxAlternatives to a value greater than 1 does not imply or guarantee that multiple alternatives will be returned. In general, more than one alternative is most appropriate for providing real-time options to users getting results via a Streaming Recognition Request.
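
As a sketch with the Python client library types (assumed installed as google-cloud-speech), a configuration that asks for up to three alternatives would look like the following; whether more than one is actually returned still depends on recognition quality:

from google.cloud.speech_v2.types import cloud_speech

# Request up to three alternatives per result.
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    features=cloud_speech.RecognitionFeatures(max_alternatives=3),
)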

Handling transcriptions

Each alternative supplied within the response will contain a transcript containing the recognized text. When the response contains sequential results, you should concatenate their transcripts together.

The following Python code iterates over a result list and concatenates the transcriptions together. Note that we take the first alternative (the zeroth) in all cases.

response = service_request.execute()  # assumes service_request was built earlier for a synchronous recognize call
recognized_text = 'Transcribed Text: \n'
for result in response['results']:
    # Take the first (most likely) alternative of each result and append its transcript.
    recognized_text += result['alternatives'][0]['transcript']

Confidence values

The confidence value is an estimate between 0.0 and 1.0. It's calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates an estimated greater likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you may use the confidence value to decide whether to show alternative results to the user or ask for confirmation from the user.

Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, there are occasional cases where the top result doesn't have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result returned may have a lower confidence value than anticipated. This can occur, for example, in cases where rare words are being used. A word that's rarely used can be assigned a low "likelihood" value even if it's recognized correctly. If the model determines the rare word to be the most likely option based on context, that result is returned at the top even if the result's confidence value is lower than alternative options.

Streaming Speech-to-Text API Recognition Requests

A streaming Speech-to-Text API recognition call is designed for real-time capture and recognition of audio, within a bi-directional stream. Your application can send audio on the request stream, and receive interim and final recognition results on the response stream in real time. Interim results represent the current recognition result for a section of audio, while the final recognition result represents the last, best guess for that section of audio.

Streaming requests

Unlike synchronous recognition, in which you send both the configuration and audio within a single request, calling the streaming Speech API requires sending multiple requests. The first StreamingRecognizeRequest must contain a configuration of type StreamingRecognitionConfig without any accompanying audio. Subsequent StreamingRecognizeRequests sent over the same stream then consist of consecutive frames of raw audio bytes; a sketch of this request sequence appears after the following list.

A StreamingRecognitionConfig consists of the following fields:

  • config: An optional recognition configuration of type RecognitionConfig; it is the same configuration type used in synchronous requests.
  • config_mask: An optional list of fields that override the values in the default RecognitionConfig of the Recognizer. If no mask is provided, all non-default valued fields in config override the values in the Recognizer for this recognition request. If a mask is provided, only the fields listed in the mask override the config in the Recognizer for this recognition request. If a wildcard (*) is provided, config completely overrides and replaces the config in the recognizer for this recognition request.
  • streaming_features: Optional. Speech recognition features to enable that are specific to streaming audio recognition requests. See StreamingRecognitionFeatures for more information.
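
A minimal sketch of this request sequence with the Python client library (assumed installed as google-cloud-speech) might look like the following; the recognizer name and the source of audio chunks are placeholders, and interim results are enabled through the streaming features described above:

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

RECOGNIZER = "projects/PROJECT_ID/locations/LOCATION/recognizers/RECOGNIZER_ID"

def request_stream(audio_chunks):
    # The first request carries only the streaming configuration.
    yield cloud_speech.StreamingRecognizeRequest(
        recognizer=RECOGNIZER,
        streaming_config=cloud_speech.StreamingRecognitionConfig(
            config=cloud_speech.RecognitionConfig(
                auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            ),
            streaming_features=cloud_speech.StreamingRecognitionFeatures(
                interim_results=True,
            ),
        ),
    )
    # Subsequent requests carry consecutive frames of raw audio bytes.
    for chunk in audio_chunks:
        yield cloud_speech.StreamingRecognizeRequest(audio=chunk)

client = SpeechClient()
audio_chunks = []  # placeholder: byte strings captured from a microphone or read from a file
responses = client.streaming_recognize(requests=request_stream(audio_chunks))
# See the Streaming responses section below for handling the response stream.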

Streaming responses

Streaming speech recognition results are returned within a series of responses of type StreamingRecognizeResponse. Such a response consists of the following fields:

  • speechEventType: Contains events of type SpeechEventType. These events indicate, for example, when a single utterance has been determined to be complete. The speech events serve as markers within your stream's response.
  • results: Contains the list of results, which may be either interim or final results, of type StreamingRecognitionResult. The results list contains the following sub-fields:
    • alternatives: Contains a list of alternative transcriptions.
    • isFinal: Indicates whether the results obtained within this list entry are interim or are final. Google might return multiple isFinal=true results throughout a single stream, but the isFinal=true result is only guaranteed after the write stream is closed (half-close).
    • stability: Indicates the volatility of results obtained so far, with 0.0 indicating complete instability while 1.0 indicates complete stability. Note that unlike confidence, which estimates whether a transcription is correct, stability estimates whether the given partial result may change. If isFinal is set to true, stability will not be set.
    • resultEndOffset: The time offset of the end of this result relative to the beginning of the audio.
    • channelTag: The channel number corresponding to the recognized result for the audio from that channel. This is only set for multi-channel audio.
    • languageCode: The BCP-47 language tag of the language in this result.
  • speechEventOffset: The time offset between the beginning of the audio and the point at which the event was emitted.
  • metadata: Contains RecognitionResponseMetadata related to the number of billed audio seconds.
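
Continuing the streaming sketch from the previous section, the response stream could be consumed along these lines, using isFinal and stability to decide how to treat each hypothesis:

# "responses" is the iterator returned by the streaming_recognize call sketched earlier.
for response in responses:
    for result in response.results:
        if not result.alternatives:
            continue
        transcript = result.alternatives[0].transcript
        if result.is_final:
            print(f"final: {transcript}")
        else:
            # Interim hypothesis; stability hints at how likely this text is to change.
            print(f"interim (stability {result.stability:.2f}): {transcript}")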