This document is a guide to the basics of using Speech-to-Text. This conceptual guide covers the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. We recommend that all users of Speech-to-Text read this guide and one of the associated tutorials before diving into the API itself.
Speech-to-Text recognition requests
Speech-to-Text has three main methods to perform speech recognition. These are listed below:
Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.
Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
Requests contain configuration parameters as well as audio data. Recognition requests can optionally contain a recognizer, a stored and reusable recognition configuration.
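As an illustration, here is a minimal synchronous request using the Python client library for the V2 API. This is a sketch, not a complete application: PROJECT_ID and the audio file name are placeholders, and `recognizers/_` selects the ad-hoc default recognizer rather than a stored one.

```python
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

client = SpeechClient()

# Synchronous requests are limited to roughly one minute of audio.
with open("audio-file.wav", "rb") as f:
    audio_bytes = f.read()

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
)

response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        # "_" is the default recognizer; a stored recognizer resource name
        # can be used here instead to reuse a saved configuration.
        recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    print(result.alternatives[0].transcript)
```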
For most audio files, the Speech-to-Text API can automatically deduce the audio metadata. Speech-to-Text parses the header of the file and decodes it according to that information. See the encoding page for which file types are supported.

For headerless audio files, the Speech-to-Text API lets you specify the audio metadata explicitly in the recognition config. See the encoding page for more details.
If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher rates have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, do not resample it to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its native sample rate.
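The two cases look like the following with the V2 Python client. This is a sketch: the mu-law telephony values are illustrative, not taken from this page.

```python
from google.cloud.speech_v2.types import cloud_speech

# Audio with a parseable header: let the service deduce the metadata.
auto_config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
)

# Headerless audio (for example, raw 8000 Hz mu-law telephony recordings):
# declare the metadata explicitly, keeping the native sample rate.
explicit_config = cloud_speech.RecognitionConfig(
    explicit_decoding_config=cloud_speech.ExplicitDecodingConfig(
        encoding=cloud_speech.ExplicitDecodingConfig.AudioEncoding.MULAW,
        sample_rate_hertz=8000,  # the recording's native rate; do not resample
        audio_channel_count=1,
    ),
    language_codes=["en-US"],
    model="telephony",
)
```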
Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages for each feature is available on the Language Support page.
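A BCP-47 tag combines a language with a national or regional variant, so, for example, British and American English are requested with different codes. A brief sketch using the V2 Python client's language_codes field:

```python
# "en-GB" = English as spoken in Great Britain; "en-US" = American English.
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-GB"],  # a BCP-47 identifier
    model="long",
)
```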
The Speech-to-Text API has additional recognition features, such as automatic punctuation and word-level confidence. These features are enabled in the request's recognition configuration. See the sample code provided in the links above and the languages page for the availability of these features.
Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources. Refer to the model selection documentation to learn about the available models and how to select one in your requests.
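As a sketch, both the optional features and the model are set on the same configuration in the V2 Python client; feature availability varies by language and model, and the "telephony" model here is just one example choice.

```python
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="telephony",  # pick the model trained for your audio source
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,  # punctuate the transcript
        enable_word_confidence=True,        # per-word confidence values
    ),
)
```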
Embedded audio content
Embedded audio is included in the speech recognition request when passing a content parameter within the request's audio_source field. For embedded audio provided as content within a gRPC request, that audio must be compatible with protocol buffer serialization and provided as binary data. For embedded audio provided as content within a REST request, that audio must be compatible with JSON serialization and first be Base64-encoded. See [Base64 Encoding Your Audio][base64-encoding] for more information.
When constructing a request using a Google Cloud client library, you generally write this binary (or Base64-encoded) data directly within the content field.
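For a raw REST request, that means Base64-encoding the bytes yourself before placing them in the JSON body. A rough sketch of building such a body in Python; the endpoint path in the comment and the camelCase field spellings follow the V2 REST mapping, and PROJECT_ID is a placeholder:

```python
import base64
import json

with open("audio-file.wav", "rb") as f:
    audio_bytes = f.read()

# JSON body for POST https://speech.googleapis.com/v2/projects/PROJECT_ID/
# locations/global/recognizers/_:recognize
body = json.dumps({
    "config": {
        "autoDecodingConfig": {},
        "languageCodes": ["en-US"],
        "model": "long",
    },
    # JSON cannot carry raw bytes, so the audio is Base64-encoded first.
    "content": base64.b64encode(audio_bytes).decode("utf-8"),
})
```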
Pass audio referenced by a URI
More typically, you will pass a uri parameter within the Speech-to-Text API request's audio_source field, pointing to an audio file (in binary format, not Base64) located in Cloud Storage, of the following form:

`gs://BUCKET_NAME/path/to/audio-file`
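Continuing the earlier Python sketch, the only change from embedded audio is swapping content for uri:

```python
response = client.recognize(
    request=cloud_speech.RecognizeRequest(
        recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
        config=config,
        # The file is read directly from Cloud Storage; no Base64 needed.
        uri="gs://BUCKET_NAME/path/to/audio-file.wav",
    )
)
```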
Speech-to-Text uses a service account to access your files in Cloud Storage. By default, the service account has access to Cloud Storage files in the same project.
The service account email address is the following:

`service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com`
In order to transcribe Cloud Storage files in another project, you can give this service account the Speech-to-Text Service Agent role in the other project:
```sh
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com \
    --role=roles/speech.serviceAgent
```
More information about project IAM policy is available at Manage access to projects, folders, and organizations.
You can also give the service account more granular access by giving it permission to a specific Cloud Storage bucket:
```sh
gsutil iam ch serviceAccount:service-PROJECT_NUMBER@gcp-sa-speech.iam.gserviceaccount.com:admin \
    gs://BUCKET_NAME
```
More information about managing access to Cloud Storage is available at Create and Manage access control lists in the Cloud Storage documentation.
Speech-to-Text API responses
Once audio is processed, the Speech-to-Text API returns the transcription results in SpeechRecognitionResult messages for synchronous and batch requests and in StreamingRecognitionResult messages for streaming requests. In synchronous and batch requests, the RPC response contains a list of results, and the recognized audio appears in contiguous order. For streaming responses, all results marked as is_final appear in contiguous order.
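A sketch of consuming a stream with the V2 Python client, with interim results enabled; `audio_chunks` stands in for your source of audio bytes (a microphone buffer or file reader), and `client` is the SpeechClient from the earlier sketch.

```python
def request_generator(audio_chunks):
    # The first message carries only the configuration.
    yield cloud_speech.StreamingRecognizeRequest(
        recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
        streaming_config=cloud_speech.StreamingRecognitionConfig(
            config=cloud_speech.RecognitionConfig(
                auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
                language_codes=["en-US"],
                model="long",
            ),
            streaming_features=cloud_speech.StreamingRecognitionFeatures(
                interim_results=True,  # receive results while still speaking
            ),
        ),
    )
    # Subsequent messages carry the audio data itself.
    for chunk in audio_chunks:
        yield cloud_speech.StreamingRecognizeRequest(audio=chunk)

for response in client.streaming_recognize(
    requests=request_generator(audio_chunks)
):
    for result in response.results:
        tag = "final" if result.is_final else "interim"
        print(f"[{tag}] {result.alternatives[0].transcript}")
```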
Each result within a successful synchronous recognition response can contain one or more alternatives (if max_alternatives in the request is greater than 1). If Speech-to-Text determines that an alternative has a sufficient confidence value, then that alternative is included in the response. The first alternative in the response is always the best (most likely) alternative.

Setting max_alternatives to a value higher than 1 does not imply or guarantee that multiple alternatives will be returned. In general, more than one alternative is more appropriate for providing real-time options to users getting results through a streaming recognition request.
Each alternative supplied within the response will contain a transcript containing the recognized text. When provided with sequential results, you should concatenate their transcripts together.
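Continuing the synchronous sketch, assembling the full transcription looks like this:

```python
# Results arrive in order, so the top transcript of each result can be
# concatenated into the complete transcription.
full_transcript = "".join(
    result.alternatives[0].transcript for result in response.results
)
print(full_transcript)
```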
The confidence value is an estimate between 0.0 and 1.0, calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates an estimated greater likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you may use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation.
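One possible shape of that check, as a sketch; the 0.8 threshold is an arbitrary illustration, not a recommended value:

```python
best = response.results[0].alternatives[0]
if best.confidence < 0.8:
    # Low estimated confidence: ask the user to confirm the transcript.
    print(f'Did you mean: "{best.transcript}"?')
else:
    print(best.transcript)
```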
Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, there are occasional cases where the top result doesn't have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result returned may have a lower confidence value than anticipated. This can occur, for example, when rare words are used. A word that's rarely used can be assigned a low "likelihood" value even if it's recognized correctly. If the model determines the rare word to be the most likely option based on context, that result is returned at the top even if its confidence value is lower than that of the alternative options.
- Use client libraries to transcribe audio using your favorite programming language.
- Practice transcribing short audio files.
- Learn how to transcribe streaming audio.
- Learn how to transcribe long audio files.