Speech-to-Text basics

This document is a guide to the basics of using Speech-to-Text. This conceptual guide covers the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. We recommend that all users of Speech-to-Text read this guide and one of the associated tutorials before diving into the API itself.

Speech-to-Text recognition requests

Speech-to-Text has three main methods to perform speech recognition. These are listed below:

  • Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

  • Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.

  • Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.

Requests contain configuration parameters as well as audio data. Recognition requests can optionally contain a recognizer, a stored and reusable recognition configuration.
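
As a concrete starting point, the following is a minimal sketch of a synchronous request built with the Speech-to-Text V2 Python client library (google-cloud-speech). The project ID, audio file name, and use of the default recognizer (_) are placeholder assumptions, not values taken from this guide.

# Minimal synchronous recognition sketch using the V2 Python client.
# PROJECT_ID and the local file name are placeholders.
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

client = SpeechClient()

# Synchronous requests are limited to audio of 1 minute or less.
with open("short-audio.wav", "rb") as f:
    audio_bytes = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),  # deduce metadata from the file header
    language_codes=["en-US"],
    model="long",
)

request = cloud_speech.RecognizeRequest(
    recognizer="projects/PROJECT_ID/locations/global/recognizers/_",  # "_" is the default, ad hoc recognizer
    config=config,
    content=audio_bytes,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)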

Audio metadata

For most audio files, the Speech-to-Text API can deduce the audio metadata automatically: it parses the file's header and decodes the audio according to that information. See the encoding page for the supported file types.

For headerless audio files, the Speech-to-Text API lets you specify the audio metadata explicitly in the recognition configuration. See the encoding page for more details.

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher rates have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, do not resample it to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its native sample rate.
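
As an illustration of specifying metadata explicitly, the sketch below assumes a headerless, single-channel, 8000 Hz mu-law telephony recording and the V2 Python client's ExplicitDecodingConfig message; the encoding, sample rate, and model name are assumptions about the source material, not requirements.

# Sketch: explicit metadata for headerless audio, supplied at its native sample rate.
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
        encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.MULAW,
        sample_rate_hertz=8000,   # native telephony rate; do not resample to 16000 Hz
        audio_channel_count=1,
    ),
    language_codes=["en-US"],
    model="telephony",
)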

Languages

Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Language Support page.
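
In the V2 Python client, for example, the language is specified with the language_codes field (a repeated field taking BCP-47 identifiers); the sketch below is only an illustration.

# Sketch: specifying the language of the audio with a BCP-47 identifier.
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-GB"],  # British English; see the Language Support page for other codes
    model="long",
)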

Recognition features

Speech-to-Text API has additional recognition features, such as automatic punctuation and word-level confidence, that you enable in the recognition configuration of a request. See the sample code linked above and the languages page for the availability of these features.
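
As a sketch, assuming the V2 RecognitionFeatures message with the enable_automatic_punctuation and enable_word_confidence fields, such a configuration might look like this:

# Sketch: enabling automatic punctuation and word-level confidence.
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,  # insert punctuation into the transcript
        enable_word_confidence=True,        # return a confidence value per word
    ),
)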

Model selection

Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources. Refer to the model selection documentation to learn about the available models and how to select one in your requests.
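
For instance, the model is selected with the model field of the recognition configuration; "telephony" below is just one example of a model name and is used purely for illustration.

# Sketch: choosing a model suited to the audio source.
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="telephony",  # e.g. for phone-call audio; see the model selection page
)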

Embedded audio content

Embedded audio is included in the speech recognition request when a content parameter is passed within the request's audio_source field. Embedded audio provided as content within a gRPC request must be compatible with Proto3 serialization and provided as binary data. Embedded audio provided as content within a REST request must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.

When constructing a request using a Google Cloud client library, you generally write this binary (or base64-encoded) data directly to the content field.
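
For example, with the V2 Python client you read the file and pass the raw bytes; the library takes care of the binary or Base64 serialization for the transport in use. The file name and project ID below are placeholders.

# Sketch: embedding audio content directly in the request.
from google.cloud.speech_v2.types import cloud_speech

with open("short-audio.wav", "rb") as f:
    audio_bytes = f.read()

request = cloud_speech.RecognizeRequest(
    recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
    config=cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
    ),
    content=audio_bytes,  # raw bytes; the client library handles serialization
)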

Pass audio referenced by a URI

More typically, you pass a uri parameter within the Speech-to-Text API request's audio_source field that points to an audio file (in binary format, not base64) located in Cloud Storage, with a URI of the following form:

gs://bucket-name/path/to/audio/file
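
For example, a request that references audio in Cloud Storage might look like the following sketch; the bucket and object names are placeholders.

# Sketch: referencing audio stored in Cloud Storage by URI.
from google.cloud.speech_v2.types import cloud_speech

request = cloud_speech.RecognizeRequest(
    recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
    config=cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="long",
    ),
    uri="gs://bucket-name/path/to/audio/file",  # no Base64 encoding for Cloud Storage audio
)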

Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.

The service account email address is the following:

service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com

To transcribe Cloud Storage files located in another project, you can grant this service account the Speech-to-Text Service Agent role in that project:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/speech.serviceAgent

More information about project IAM policy is available at Manage access to projects, folders, and organizations.

You can also give the service account more granular access by granting it permission on a specific Cloud Storage bucket:

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/storage.admin

More information about managing access to Cloud Storage is available at Create and Manage access control lists in the Cloud Storage documentation.

Speech-to-Text API responses

Once the audio is processed, the Speech-to-Text API returns transcription results in SpeechRecognitionResult messages for synchronous and batch requests, and in StreamingRecognitionResult messages for streaming requests. In synchronous and batch requests, the RPC response contains a list of results, and the results for the recognized audio appear in contiguous order. For streaming responses, all results marked as is_final appear in contiguous order.
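
For example, a synchronous response can be walked as in the following sketch, which assumes the client and request objects from the earlier sketches in this guide.

# Sketch: reading transcription results from a synchronous response.
response = client.recognize(request=request)

for result in response.results:
    best = result.alternatives[0]  # alternatives are ordered best-first
    print(best.transcript, best.confidence)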

Select alternatives

Each result within a successful synchronous recognition response can contain one or more alternatives (if max_alternatives is set to a value greater than 1). If Speech-to-Text determines that an alternative has a sufficient confidence value, then that alternative is included in the response. The first alternative in the response is always the best (most likely) alternative.

Setting max_alternatives to a value higher than 1 does not imply or guarantee that multiple alternatives will be returned. In general, requesting more than one alternative is most appropriate for offering real-time options to users who receive results through a streaming recognition request.
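
As a sketch, and assuming that max_alternatives lives on the V2 RecognitionFeatures message, a request for up to three alternatives per result might be configured like this:

# Sketch: asking for up to three alternatives per result.
from google.cloud.speech_v2.types import cloud_speech

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
    features=cloud_speech.RecognitionFeatures(max_alternatives=3),
)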

Handling transcriptions

Each alternative supplied within the response contains a transcript with the recognized text. When the response contains sequential results, concatenate their transcripts to reconstruct the full transcription.
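
For example, assuming the response object from the earlier sketches:

# Sketch: rebuilding the full transcript from sequential results.
# Adjust the separator if your results' transcripts do not carry leading whitespace.
full_transcript = "".join(
    result.alternatives[0].transcript for result in response.results
)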

Confidence values

The confidence value is an estimate between 0.0 and 1.0, calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates an estimated greater likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you may use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation.
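
For example, a sketch of such a check, with an arbitrary threshold chosen purely for illustration:

# Sketch: asking the user to confirm low-confidence transcripts.
CONFIDENCE_THRESHOLD = 0.8  # arbitrary cutoff for illustration

for result in response.results:
    best = result.alternatives[0]
    if best.confidence < CONFIDENCE_THRESHOLD:
        print(f"Did you say: {best.transcript!r}?")  # ask the user to confirm
    else:
        print(best.transcript)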

Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, there are occasional cases where the top result does not have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result returned may have a lower confidence value than anticipated. This can occur, for example, when rare words are being used: a word that's rarely used can be assigned a low "likelihood" value even if it's recognized correctly. If the model determines the rare word to be the most likely option based on context, that result is returned at the top even if its confidence value is lower than that of the alternative options.

What's next