This document is a guide to the basics of using Speech-to-Text. This conceptual guide covers the types of requests you can make to Speech-to-Text, how to construct those requests, and how to handle their responses. We recommend that all users of Speech-to-Text read this guide and one of the associated tutorials before diving into the API itself.
Speech requests
Speech-to-Text has three main methods to perform speech recognition. These are listed below:
Synchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.
Asynchronous Recognition (REST and gRPC) sends audio data to the Speech-to-Text API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
Requests contain configuration parameters as well as audio data. The following sections describe these types of recognition requests, the responses they generate, and how to handle those responses in more detail.
Speech-to-Text API recognition
A Speech-to-Text API synchronous recognition request is the simplest method for performing recognition on speech audio data. Speech-to-Text can process up to 1 minute of speech audio data sent in a synchronous request. After Speech-to-Text processes and recognizes all of the audio, it returns a response.
A synchronous request is blocking, meaning that Speech-to-Text must return a response before processing the next request. Speech-to-Text typically processes audio faster than real time, processing 30 seconds of audio in 15 seconds on average. In cases of poor audio quality, your recognition request can take significantly longer.
Speech-to-Text has both REST and gRPC methods for calling Speech-to-Text API synchronous and asynchronous requests. This article demonstrates the REST API because it is simpler to show and explain basic use of the API. However, the basic makeup of a REST or gRPC request is quite similar. Streaming Recognition Requests are only supported by gRPC.
Synchronous Speech Recognition Requests
A synchronous Speech-to-Text API request consists of a speech recognition configuration and audio data. A sample request is shown below:
{ "config": { "encoding": "LINEAR16", "sampleRateHertz": 16000, "languageCode": "en-US", }, "audio": { "uri": "gs://bucket-name/path_to_audio_file" } }
All Speech-to-Text API synchronous recognition requests must include a speech recognition config field (of type RecognitionConfig). A RecognitionConfig contains the following sub-fields:
- encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). If you have a choice of codec, prefer a lossless encoding such as FLAC or LINEAR16 for best performance. (For more information, see Audio Encodings.) The encoding field is optional for FLAC and WAV files, where the encoding is included in the file header.
- sampleRateHertz - (required) specifies the sample rate (in Hertz) of the supplied audio. (For more information on sample rates, see Sample rates below.) The sampleRateHertz field is optional for FLAC and WAV files, where the sample rate is included in the file header.
- languageCode - (required) contains the language and region/locale to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier. Note that language codes typically consist of a primary language tag and a secondary region subtag to indicate dialects (for example, 'en' for English and 'US' for the United States in the above example). (For a list of supported languages, see Supported Languages.)
- maxAlternatives - (optional, defaults to 1) indicates the number of alternative transcriptions to provide in the response. By default, the Speech-to-Text API provides one primary transcription. If you wish to evaluate different alternatives, set maxAlternatives to a higher value. Note that Speech-to-Text only returns alternatives if the recognizer determines them to be of sufficient quality; in general, alternatives are more appropriate for real-time requests requiring user feedback (for example, voice commands) and are therefore better suited to streaming recognition requests.
- profanityFilter - (optional) indicates whether to filter out profane words or phrases. Filtered words contain their first letter followed by asterisks for the remaining characters (e.g. f***). The profanity filter operates on single words; it does not detect abusive or offensive speech that is a phrase or a combination of words.
- speechContext - (optional) contains additional contextual information for processing this audio. A context contains the following sub-fields:
  - boost - contains a value that assigns a weight to recognizing a given word or phrase.
  - phrases - contains a list of words and phrases that provide hints to the speech recognition task. For more information, see the information on speech adaptation.
Audio is supplied to Speech-to-Text through the audio parameter of type RecognitionAudio. The audio field contains either of the following sub-fields:

- content contains the audio to evaluate, embedded within the request. See Embedded audio content below for more information. Audio passed directly within this field is limited to 1 minute in duration.
- uri contains a URI pointing to the audio content. The file must not be compressed (for example, gzip). Currently, this field must contain a Google Cloud Storage URI (of the format gs://bucket-name/path_to_audio_file). See Pass audio referenced by a URI below.
More information on these request and response parameters appears below.
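As a point of comparison with the REST body above, a minimal synchronous request through a Google Cloud client library might look like the following sketch. It assumes the google-cloud-speech Python package (v1 API), and the Cloud Storage URI is the same placeholder used earlier:

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio_file")

# Blocks until Speech-to-Text has processed the audio (1 minute maximum).
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)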
Sample rates
You specify the sample rate of your audio in the sampleRateHertz field of the request configuration, and it must match the sample rate of the associated audio content or stream. Sample rates between 8000 Hz and 48000 Hz are supported within Speech-to-Text. You can specify the sample rate for a FLAC or WAV file in the file header instead of using the sampleRateHertz field.

A FLAC file must contain the sample rate in the FLAC header in order to be submitted to the Speech-to-Text API.
If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher sample rates have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, do not resample it to 16000 Hz. Most legacy telephony audio, for example, uses sample rates of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its native sample rate.
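If you are unsure of a WAV file's sample rate, you can read it from the file header rather than guessing. The sketch below uses Python's standard-library wave module and assumes an uncompressed PCM WAV file at a hypothetical path:

import wave

# Read the sample rate recorded in the WAV header.
with wave.open("path/to/audio.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()

# Use this value for sampleRateHertz (or omit the field for WAV/FLAC,
# since the header already carries it), and do not resample the audio.
print(sample_rate)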
Languages
Speech-to-Text's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.
A full list of supported languages for each feature is available on the Language Support page.
Time offsets (timestamps)
Speech-to-Text can include time offset values (timestamps) for the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.
Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Time offsets are supported for all our recognition methods: recognize, streamingrecognize, and longrunningrecognize.
Time offset values are only included for the first alternative provided in the recognition response.
To include time offsets in the results of your request, set the enableWordTimeOffsets parameter to true in your request configuration. For examples using the REST API or the Client Libraries, see Using Time Offsets (Timestamps). For example, you can include the enableWordTimeOffsets parameter in the request configuration as shown here:
{ "config": { "languageCode": "en-US", "enableWordTimeOffsets": true }, "audio":{ "uri":"gs://gcs-test-data/gettysburg.flac" } }
The result returned by the Speech-to-Text API will contain time offset values for each recognized word, as shown below:
{ "name": "6212202767953098955", "metadata": { "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata", "progressPercent": 100, "startTime": "2017-07-24T10:21:22.013650Z", "lastUpdateTime": "2017-07-24T10:21:45.278630Z" }, "done": true, "response": { "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse", "results": [ { "alternatives": [ { "transcript": "Four score and twenty...(etc)...", "confidence": 0.97186122, "words": [ { "startTime": "1.300s", "endTime": "1.400s", "word": "Four" }, { "startTime": "1.400s", "endTime": "1.600s", "word": "score" }, { "startTime": "1.600s", "endTime": "1.600s", "word": "and" }, { "startTime": "1.600s", "endTime": "1.900s", "word": "twenty" }, ... ] } ] }, { "alternatives": [ { "transcript": "for score and plenty...(etc)...", "confidence": 0.9041967, } ] } ] } }
Model selection
Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google has trained these speech recognition models for specific audio types and sources.
When you send an audio transcription request to Speech-to-Text, you can improve the results that you receive by specifying the source of the original audio. This allows the Speech-to-Text API to process your audio files using a machine learning model trained to recognize speech audio from that particular type of source.
To specify a model for speech recognition, include the model field in the RecognitionConfig object for your request, specifying the model that you want to use.
See the Speech-to-Text transcription models list for the available machine learning models.
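For example, a configuration that selects a model tuned for telephony audio might look like the sketch below, assuming the google-cloud-speech Python client library; "phone_call" is one published model name, and you would pick whichever model from the list matches your audio source:

from google.cloud import speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,   # typical for telephony audio
    language_code="en-US",
    model="phone_call",       # choose a model from the transcription models list
)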
Embedded audio content
Embedded audio is included in the speech recognition request when passing a content parameter within the request's audio field. For embedded audio provided as content within a gRPC request, that audio must be compatible with Proto3 serialization and provided as binary data. For embedded audio provided as content within a REST request, that audio must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.

When constructing a request using a Google Cloud client library, you generally write this binary (or Base64-encoded) data directly within the content field.
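For a REST request, the Base64 step might look like the following Python sketch; the local file path is a placeholder, and the resulting body has the same shape as the earlier samples except that it uses content instead of uri:

import base64

# Read the raw audio and Base64-encode it for JSON serialization.
with open("path/to/audio.raw", "rb") as audio_file:
    encoded_audio = base64.b64encode(audio_file.read()).decode("utf-8")

body = {
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
    },
    "audio": {"content": encoded_audio},  # embedded audio, limited to 1 minute
}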
Pass audio referenced by a URI
More typically, you will pass a uri parameter within the Speech request's audio field, pointing to an audio file (in binary format, not Base64) located on Google Cloud Storage, of the following form:

gs://bucket-name/path_to_audio_file
For example, the following part of a Speech request references the sample audio file used within the Quickstart:
... "audio": { "uri":"gs://cloud-samples-tests/speech/brooklyn.flac" } ...
You must have proper access permissions to read Google Cloud Storage files, such as one of the following:
- Publicly readable (such as our sample audio files)
- Readable by your service account, if using service account authorization.
- Readable by a user account, if using 3-legged OAuth for user account authorization.
More information about managing access to Google Cloud Storage is available at Creating and Managing Access Control Lists in the Google Cloud Storage documentation.
Speech-to-Text API responses
As indicated previously, a synchronous Speech-to-Text API response may take some time to return results, proportional to the length of the supplied audio. Once processed, the API will return a response as shown below:
{ "results": [ { "alternatives": [ { "confidence": 0.98267895, "transcript": "how old is the Brooklyn Bridge" } ] } ] }
These fields are explained below:
- results contains the list of results (of type SpeechRecognitionResult), where each result corresponds to a segment of audio (segments of audio are separated by pauses). Each result will consist of one or more of the following fields:
  - alternatives contains a list of possible transcriptions, of type SpeechRecognitionAlternative. Whether more than one alternative appears depends both on whether you requested more than one alternative (by setting maxAlternatives to a value greater than 1) and on whether Speech-to-Text produced alternatives of high enough quality. Each alternative will consist of the following fields:
    - transcript contains the transcribed text. See Handling transcriptions below.
    - confidence contains a value between 0 and 1 indicating how confident Speech-to-Text is of the given transcription. See Confidence values below.
If no speech from the supplied audio could be recognized, then the returned results list will contain no items. Unrecognized speech is commonly the result of very poor-quality audio, or of language code, encoding, or sample rate values that do not match the supplied audio.
The components of this response are explained in the following sections.

Each synchronous Speech-to-Text API response returns a list of results, rather than a single result containing all recognized audio. The list of recognized audio (within the transcript elements) will appear in contiguous order.
Select alternatives
Each result within a successful synchronous recognition response can contain one or more alternatives (if the maxAlternatives value for the request is greater than 1). If Speech-to-Text determines that an alternative has a sufficient Confidence Value, then that alternative is included in the response. The first alternative in the response is always the best (most likely) alternative.
Setting maxAlternatives to a value higher than 1 does not imply or guarantee that multiple alternatives will be returned. In general, more than one alternative is most appropriate for providing real-time options to users getting results via a Streaming Recognition Request.
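As a sketch of requesting alternatives with the google-cloud-speech Python client library (the field is max_alternatives in the client's snake_case naming; the bucket path is a placeholder), you might write:

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-US",
    max_alternatives=3,  # request up to three transcriptions per result
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio_file")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # The first alternative is always the most likely one; any additional
    # alternatives appear only if they are of sufficient quality.
    for alternative in result.alternatives:
        print(alternative.confidence, alternative.transcript)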
Handling transcriptions
Each alternative supplied within the response will contain a transcript containing the recognized text. When provided with sequential results, you should concatenate these transcriptions together.
The following Python code iterates over a result list and concatenates the transcriptions together. Note that we take the first alternative (the zeroth) in all cases.
response = service_request.execute()
recognized_text = 'Transcribed Text: \n'
for i in range(len(response['results'])):
    recognized_text += response['results'][i]['alternatives'][0]['transcript']
Confidence values
The confidence value is an estimate between 0.0 and 1.0. It's calculated by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates an estimated greater likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you may use the confidence value to decide whether to show alternative results to the user or ask for confirmation from the user.
Be aware, however, that the model determines the "best", top-ranked result based on more signals than the confidence score alone (such as sentence context). Because of this, there are occasional cases where the top result doesn't have the highest confidence score. If you haven't requested multiple alternative results, the single "best" result returned may have a lower confidence value than anticipated. This can occur, for example, in cases where rare words are being used. A word that's rarely used can be assigned a low "likelihood" value even if it's recognized correctly. If the model determines the rare word to be the most likely option based on context, that result is returned at the top even if the result's confidence value is lower than alternative options.
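One way to apply this in an application is to route low-confidence transcripts to a confirmation step instead of acting on them directly. The helper below is only a sketch; the 0.7 threshold is an arbitrary example value, not a recommendation from the API:

def route_transcription(response, threshold=0.7):
    """Return ("accept", text) or ("confirm", text) based on confidence."""
    if not response.results:
        return None  # nothing was recognized in the supplied audio
    top = response.results[0].alternatives[0]
    if top.confidence >= threshold:
        return ("accept", top.transcript)
    # Ask the user to confirm (or show alternatives) before acting.
    return ("confirm", top.transcript)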
Asynchronous Requests and Responses
An asynchronous Speech-to-Text API request to the LongRunningRecognize method is identical in form to a synchronous Speech-to-Text API request. However, instead of returning a response, the asynchronous request initiates a Long Running Operation (of type Operation) and returns this operation to the caller immediately. You can use asynchronous speech recognition with audio of any length up to 480 minutes.
A typical operation response is shown below:
{ "name": "operation_name", "metadata": { "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata" "progressPercent": 34, "startTime": "2016-08-30T23:26:29.579144Z", "lastUpdateTime": "2016-08-30T23:26:29.826903Z" } }
Note that no results are yet present. Speech-to-Text will continue to process the audio and use this operation to store the results. Results will appear in the response field of the operation returned when the LongRunningRecognize request is complete.
A full response after completion of the request appears below:
{ "name": "1268386125834704889", "metadata": { "lastUpdateTime": "2016-08-31T00:16:32.169Z", "@type": "type.googleapis.com/google.cloud.speech.v1.LongrunningRecognizeMetadata", "startTime": "2016-08-31T00:16:29.539820Z", "progressPercent": 100 } "response": { "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse", "results": [{ "alternatives": [{ "confidence": 0.98267895, "transcript": "how old is the Brooklyn Bridge" }]}] }, "done": True, }
Note that done has been set to true and that the operation's response contains a set of results of type SpeechRecognitionResult, which is the same type returned by a synchronous Speech-to-Text API recognition request.

By default, an asynchronous REST response will set done to false, its default value; however, because JSON does not require default values to be present within a field, when testing whether an operation is completed, you should test both that the done field is present and that it is set to true.
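With a Google Cloud client library, the long-running operation is wrapped for you, so you can either poll it yourself or simply block until it finishes. A minimal sketch using the google-cloud-speech Python client library, assuming audio longer than one minute stored in Cloud Storage at a placeholder URI:

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://bucket-name/path_to_audio_file")

# Starts the LongRunningRecognize operation and returns immediately.
operation = client.long_running_recognize(config=config, audio=audio)

# Blocks until done is true (or the timeout elapses), then returns the
# LongRunningRecognizeResponse stored in the operation's response field.
response = operation.result(timeout=3600)

for result in response.results:
    print(result.alternatives[0].transcript)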
Streaming Speech-to-Text API Recognition Requests
A streaming Speech-to-Text API recognition call is designed for real-time capture and recognition of audio, within a bi-directional stream. Your application can send audio on the request stream, and receive interim and final recognition results on the response stream in real time. Interim results represent the current recognition result for a section of audio, while the final recognition result represents the last, best guess for that section of audio.
Streaming requests
Unlike synchronous and asynchronous calls, in which you send both the configuration and audio within a single request, calling the streaming Speech API requires sending multiple requests. The first StreamingRecognizeRequest must contain a configuration of type StreamingRecognitionConfig without any accompanying audio. Subsequent StreamingRecognizeRequests sent over the same stream then consist of consecutive frames of raw audio bytes.
A StreamingRecognitionConfig consists of the following fields (a sketch of building one and sending audio over the stream appears after this list):

- config - (required) contains configuration information for the audio, of type RecognitionConfig, and is the same as that shown within synchronous and asynchronous requests.
- single_utterance - (optional, defaults to false) indicates whether this request should automatically end after speech is no longer detected. If set, Speech-to-Text will detect pauses, silence, or non-speech audio to determine when to end recognition. If not set, the stream will continue to listen and process audio until either the stream is closed directly or the stream's length limit has been exceeded. Setting single_utterance to true is useful for processing voice commands.
- interim_results - (optional, defaults to false) indicates that this stream request should return temporary results that may be refined at a later time (after processing more audio). Interim results will be noted within responses through the setting of is_final to false.
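The following sketch assumes the google-cloud-speech Python client library, whose streaming helper sends the StreamingRecognitionConfig as the first message and the audio-only requests after it. The audio_chunks iterable is a placeholder for your own audio source (for example, raw LINEAR16 bytes read from a microphone in small pieces):

from google.cloud import speech

client = speech.SpeechClient()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,
)

def audio_requests(audio_chunks):
    # Each subsequent request carries only raw audio bytes.
    for chunk in audio_chunks:
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

audio_chunks = []  # placeholder: supply raw LINEAR16 chunks here

responses = client.streaming_recognize(
    config=streaming_config,
    requests=audio_requests(audio_chunks),
)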
Streaming responses
Streaming speech recognition results are returned within a series of responses of type StreamingRecognitionResponse. Such a response consists of the following fields (a sketch of consuming them appears after this list):

- speechEventType contains events of type SpeechEventType. The value of these events will indicate when a single utterance has been determined to have been completed. The speech events serve as markers within your stream's response.
- results contains the list of results, which may be either interim or final results, of type StreamingRecognitionResult. The results list contains the following sub-fields:
  - alternatives contains a list of alternative transcriptions.
  - isFinal indicates whether the results obtained within this list entry are interim or are final. Google might return multiple isFinal=true results throughout a single stream, but the isFinal=true result is only guaranteed after the write stream is closed (half-close).
  - stability indicates the volatility of results obtained so far, with 0.0 indicating complete instability and 1.0 indicating complete stability. Note that unlike confidence, which estimates whether a transcription is correct, stability estimates whether the given partial result may change. If isFinal is set to true, stability will not be set.
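Continuing the streaming sketch above, consuming the response stream might look like the following; responses is the iterator returned by the streaming call, and interim results are refined until an is_final result arrives:

for response in responses:
    for result in response.results:
        if not result.alternatives:
            continue
        alternative = result.alternatives[0]
        if result.is_final:
            # Final, best guess for this section of audio.
            print("final:", alternative.transcript)
        else:
            # Interim result; stability hints at how likely it is to change.
            print(f"interim (stability {result.stability:.2f}):",
                  alternative.transcript)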