Cloud Speech API Basics

This document is a guide to the basics of using the Google Cloud Speech API. This conceptual guide covers the types of requests you can make to the Speech API, how to construct those requests, and how to handle their responses. We recommend that all users of the Speech API read this guide and one of the associated tutorials before diving into the API itself.

Speech requests

The Speech API has three main methods to perform speech recognition. These are listed below:

  • Synchronous Recognition (REST and gRPC) sends audio data to the Speech API, performs recognition on that data, and returns results after all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

  • Asynchronous Recognition (REST and gRPC) sends audio data to the Speech API and initiates a Long Running Operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 80 minutes.

  • Streaming Recognition (gRPC only) performs recognition on audio data provided within a gRPC bi-directional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.

Requests contain configuration parameters as well as audio data. The sections below cover these types of recognition requests, the responses they generate, and how to handle those responses in more detail.

Synchronous Speech API recognition

A Speech API synchronous recognition request is the simplest method for performing recognition on speech audio data. Synchronous Speech API requests may process up to 1 minute of speech audio data, and will return a response once all of the audio has been processed and recognition has been performed.

A synchronous request is blocking. The Speech API will take roughly the same amount of time to process audio data sent synchronously as the duration of the supplied audio data. That is, if you send audio data of 30 seconds in length, expect the synchronous request to take approximately 30 seconds to return results.

The Speech API has both REST and gRPC methods for calling Speech API synchronous and asynchronous requests. Because the REST API is simpler to show and explain, we will use that within this guide to illustrate basic use of the API. However, the basic makeup of a REST or gRPC request is quite similar. For more information about setting up gRPC requests, see Streaming Recognition Requests.

Synchronous Speech Recognition Requests

A synchronous Speech API request consists of a speech recognition configuration and audio data. A sample request is shown below:

{
    "config": {
        "encoding": "LINEAR16",
        "sampleRate": 16000,
        "languageCode": "en-US",
    },
    "audio": {
        "uri": "gs://bucket-name/path_to_audio_file"
    }
}
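
As a minimal sketch of sending this request body over REST, the following uses the third-party requests library. The endpoint URL reflects the v1beta1 syncrecognize method used elsewhere in this guide, and the access token is a placeholder you would obtain through your own OAuth 2.0 flow (for example, service account credentials); treat both as assumptions to adapt to your setup.

import requests

# Placeholder token; obtain a real one via your OAuth 2.0 flow
# (for example, service account credentials).
ACCESS_TOKEN = 'ya29.EXAMPLE_TOKEN'

# Assumed v1beta1 REST endpoint for synchronous recognition.
SYNC_RECOGNIZE_URL = 'https://speech.googleapis.com/v1beta1/speech:syncrecognize'

body = {
    'config': {
        'encoding': 'LINEAR16',
        'sampleRate': 16000,
        'languageCode': 'en-US',
    },
    'audio': {
        'uri': 'gs://bucket-name/path_to_audio_file',
    },
}

response = requests.post(
    SYNC_RECOGNIZE_URL,
    json=body,
    headers={'Authorization': 'Bearer ' + ACCESS_TOKEN})
print(response.json())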

All Speech API synchronous recognition requests must include a speech recognition config field (of type RecognitionConfig). A RecognitionConfig contains the following sub-fields:

  • encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). If you have a choice in codec, prefer a lossless encoding such as FLAC or LINEAR16 for best performance. (For more information, see Audio Encodings below.)
  • sampleRate - (required) specifies the sample rate (in Hertz) of the supplied audio. (For more information on sample rates, see Sample Rates below.)
  • languageCode - (required) contains the language + region/locale to use for speech recognition of the supplied audio. Note that language codes typically consist of primary language tags and secondary region subtags to indicate dialects (for example, 'en' for English and 'US' for the United States in the above example.) (For a list of supported languages, see Supported Languages.)
  • maxAlternatives - (optional, defaults to 1) indicates the number of alternative transcriptions to provide in the response. By default, the Speech API provides one primary transcription. If you wish to evaluate different alternatives, set maxAlternatives to a higher value. Note that the Speech API will only return alternatives if the recognizer determines alternatives to be of sufficient quality; in general, alternatives are more appropriate for real-time requests requiring user feedback (for example, voice commands) and therefore are more suited for streaming recognition requests.
  • profanityFilter - (optional) indicates whether to filter out profane words or phrases. Words filtered out will contain their first letter and asterisks for the remaining characters (e.g. f***).
  • speechContext - (optional) contains additional contextual information for processing this audio. A context contains the following sub-field:
    • phrases - contains a list of words and phrases that provide hints to the speech recognition task. (See Phrase Hints below.)

Audio is supplied to the Speech API through the audio parameter of type RecognitionAudio. The audio field contains either of the following sub-fields:

  • content contains the audio to evaluate, embedded within the request. See Embedding Audio Content below for more information. Audio passed directly within this field is limited to 1 minute in duration.
  • uri contains a URI pointing to the audio content. Currently, this field must contain a Google Cloud Storage URI (of format gs://bucket-name/path_to_audio_file). See Passing Audio Referenced by a URI below.

More information on these request and response parameters appears below.
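
As an illustrative sketch, a request body that also sets the optional configuration fields described above could be assembled as a Python dictionary in the same shape as the JSON shown earlier (the phrase hints and other values here are placeholders):

request_body = {
    'config': {
        'encoding': 'FLAC',
        'sampleRate': 16000,
        'languageCode': 'en-US',
        'maxAlternatives': 3,       # request up to three transcriptions
        'profanityFilter': True,    # mask profane words (e.g. f***)
        'speechContext': {
            # Hypothetical phrase hints for a voice-command application.
            'phrases': ['turn on the lights', 'turn off the lights'],
        },
    },
    'audio': {
        'uri': 'gs://bucket-name/path_to_audio_file',
    },
}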

Audio encodings

An audio file format's encoding (or codec) refers to the manner in which audio data is digitally stored and transmitted. You specify this encoding scheme within the request configuration's encoding field, and it must match the encoding of the associated audio content or stream.

The Speech API supports a number of different encodings. The following table lists supported audio codecs:

Codec      Name                             Lossless   Speech API Usage Notes
FLAC       Free Lossless Audio Codec        Yes        16-bit or 24-bit required for streams
LINEAR16   Linear PCM                       Yes        Required for Asynchronous (longer) audio
MULAW      μ-law                            No
AMR        Adaptive Multi-Rate Narrowband   No         Sample rate must be 8000 Hz
AMR_WB     Adaptive Multi-Rate Wideband     No         Sample rate must be 16000 Hz

For more information on Cloud Speech API audio codecs, consult the AudioEncoding reference documentation.

If you have a choice when encoding the source material, use a lossless encoding such as FLAC or LINEAR16 for better speech recognition. For guidelines on selecting the appropriate codec for your task, see Best Practices.

For a general overview of audio encoding, see the Audio Encoding guide.

Sample rates

You specify the sample rate of your audio within the request configuration's sampleRate field, and it must match the sample rate of the associated audio content or stream. Sample rates between 8000 Hz and 48000 Hz are supported within the Speech API.

If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Values lower than this may impair speech recognition accuracy, and higher rates have no appreciable effect on speech recognition quality.

However, if your audio data has already been recorded at a sample rate other than 16000 Hz, do not resample it to 16000 Hz. Most legacy telephony audio, for example, uses a sample rate of 8000 Hz, which may give less accurate results. If you must use such audio, provide it to the Speech API at its native sample rate.
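
For example, if your source audio is uncompressed LINEAR16 in a WAV container, a small sketch using Python's built-in wave module can read the native sample rate from the file header so that the sampleRate field matches the audio exactly (the filename is a placeholder):

import wave

# Read the native sample rate from the WAV header instead of resampling.
with wave.open('call_recording.wav', 'rb') as wav_file:
    native_rate = wav_file.getframerate()

config = {
    'encoding': 'LINEAR16',
    'sampleRate': native_rate,   # e.g. 8000 for legacy telephony audio
    'languageCode': 'en-US',
}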

Languages

The Speech API's recognition engine supports a variety of languages and dialects. You specify the language (and national or regional dialect) of your audio within the request configuration's languageCode field, using a BCP-47 identifier.

A full list of supported languages, and explanation of BCP-47 identifier tags, is available on the Language Support page.

Phrase hints

For any given recognition task, you may also pass a speechContext (of type SpeechContext) that provides information to aid in processing the given audio. Currently, a context can hold a list of phrases to act as "hints" to the recognizer; these phrases can boost the probability that such words or phrases will be recognized.

You may use these phrase hints in a few ways:

  • Improve the accuracy for specific words and phrases that may tend to be overrepresented in your audio data. For example, if specific commands are typically spoken by the user, you can provide these as phrase hints. Such additional phrases may be particularly useful if the supplied audio contains noise or the contained speech is not very clear.

  • Add additional words to the vocabulary of the recognition task. The Cloud Speech API includes a very large vocabulary. However, if proper names or domain-specific words are out-of-vocabulary, you can add them to the phrases provided to your request's speechContext.

Phrases may be provided either as small groups of words or as single words. (See Content Limits for limits on the number and size of these phrases.) When provided as multi-word phrases, hints boost the probability of recognizing those words in sequence but also, to a lesser extent, boost the probability of recognizing portions of the phrase, including individual words.

For example, this shwazil_hoful.flac file contains some made-up words. If recognition is performed without supplying these out-of-vocabulary words, the recognizer will not return the desired transcript, but instead return words that are in vocabulary, such as: "it's a swallow whole day".

{
  "config": {
    "encoding":"FLAC",
    "sample_rate": 16000,
    "language_code":"en-US"
  },
  "audio":{
    "uri":"gs://speech-demo/shwazil_hoful.flac"
  }
}

However, when these out-of-vocabulary words are supplied with the recognition request, the recognizer will return the desired transcript: "it's a shwazil hoful day".

{
  "config": {
    "encoding":"FLAC",
    "sample_rate": 16000,
    "language_code":"en-US",
    "speech_context": {
      "phrases":["hoful","shwazil"]
     }
  },
  "audio":{
    "uri":"gs://speech-demo/shwazil_hoful.flac"
  }
}

Alternatively, if certain words are typically said together in a phrase, they can be grouped together, which may further increase the confidence that they will be recognized.

{
  "config": {
    "encoding":"FLAC",
    "sample_rate": 16000,
    "language_code":"en-US",
    "speech_context": {
      "phrases":["shwazil hoful day"]
     }
  },
  "audio":{
    "uri":"gs://speech-demo/shwazil_hoful.flac"
  }
}

In general, be sparing when providing speech context hints. Better recognition accuracy can be achieved by limiting phrases to only those expected to be spoken. For example, if there are multiple dialog states or device operating modes, provide only the hints that correspond to the current state, rather than always supplying hints for all possible states.
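
A sketch of this approach might keep a small mapping of hints per dialog state and attach only the phrases for the state the user is currently in (the states and phrases below are hypothetical):

# Hypothetical per-state phrase hints; supply only those that match
# the current dialog state rather than every possible command.
HINTS_BY_STATE = {
    'home_screen': ['open settings', 'play some music'],
    'settings': ['enable bluetooth', 'disable wifi'],
}

def build_config(current_state, sample_rate=16000):
    return {
        'encoding': 'LINEAR16',
        'sampleRate': sample_rate,
        'languageCode': 'en-US',
        'speechContext': {
            'phrases': HINTS_BY_STATE.get(current_state, []),
        },
    }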

Embedding audio content

Embedded audio is included in the speech recognition request when passing a content parameter within the request's audio field. For embedded audio provided as content within a gRPC request, that audio must be compatible with Proto3 serialization and provided as binary data. For embedded audio provided as content within a REST request, that audio must be compatible with JSON serialization and first be Base64-encoded. See Base64 Encoding Your Audio for more information.

When constructing a request using a Google Cloud client library, you generally will write out this binary (or base-64 encoded) data directly within the content field.

For example, the following Python code takes a passed audio file, Base64-encodes the audio data, and then constructs a synchronous recognition request:

import base64

with open(speech_file, 'rb') as speech:
    # Base64-encode the binary audio file for inclusion in the JSON
    # request, then decode to str so it can be JSON-serialized.
    speech_content = base64.b64encode(speech.read()).decode('utf-8')

# Construct the request. get_speech_service() is assumed to be the helper
# from the setup steps that builds the Speech API client.
service = get_speech_service()
service_request = service.speech().syncrecognize(
    body={
        'config': {
            'encoding': 'LINEAR16',   # raw 16-bit signed LE samples
            'sampleRate': 16000,      # 16 kHz
            'languageCode': 'en-US',  # a BCP-47 language tag
        },
        'audio': {
            'content': speech_content
        }
    })

Passing audio referenced by a URI

More typically, you will pass a uri parameter within the Speech request's audio field, pointing to an audio file located on Google Cloud Storage of the following form:

gs://bucket-name/path_to_audio_file

For example, the following part of a Speech request references the sample audio file used within the Quickstart:

...
    'audio': {
        'uri':'gs://cloud-samples-tests/speech/brooklyn.flac'
    }
...

You must have proper access permissions to read Google Cloud Storage files, such as one of the following:

  • Publicly readable (such as our sample audio files)
  • Readable by your service account, if using service account authorization.
  • Readable by a user account, if using 3-legged OAuth for user account authorization.

More information about managing access to Google Cloud Storage is available at Creating and Managing Access Control Lists in the Google Cloud Storage documentation.

Synchronous Speech API responses

As indicated previously, a synchronous Speech API request may take some time to return results, proportional to the length of the supplied audio. Once the audio has been processed, the API will return a response as shown below:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.98267895,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

These fields are explained below:

  • results contains the list of results (of type SpeechRecognitionResult). Each result will consist of one or more of the following fields:
    • alternatives contains a list of possible transcriptions, of type SpeechRecognitionAlternative. Whether more than one alternative appears depends both on whether you requested more than one alternative (by setting maxAlternatives to a value greater than 1) and on whether the Speech API produced alternatives of high enough quality. Each alternative will consist of the following fields:
      • transcript contains the transcribed text. (See Handling Transcriptions below.)
      • confidence contains an estimate between 0.0 and 1.0 of how likely it is that the transcription is correct. (See Confidence Values below.)

The components of this response are explained in the following sections.

Each synchronous Speech API response returns a list of results, rather than a single result containing all recognized audio. The list of recognized audio (within the transcript elements) will appear in contiguous order.

Selecting alternatives

Each result within a successful synchronous recognition response will contain one or more alternatives, depending both on whether more than one alternative was requested (by setting maxAlternatives greater than 1 within the request) and on whether the Speech API considered an alternative to have a sufficient confidence value to merit inclusion in the response.

Setting maxAlternatives to a value higher than 1 does not imply or guarantee that multiple alternatives will be returned. In general, more than one alternative is most appropriate for providing real-time options to users receiving results via a Streaming Recognition Request.
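
For example, the following sketch prints whatever alternatives the API actually returned for each result, assuming response is the parsed response of a request that set maxAlternatives to a value greater than 1 (as in the Handling Transcriptions section below):

# The recognizer may return fewer alternatives than requested,
# and often only one.
for result in response['results']:
    for rank, alternative in enumerate(result['alternatives']):
        print('%d: %s' % (rank, alternative['transcript']))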

Handling transcriptions

Each alternative supplied within the response will contain a transcript containing the recognized text. When provided with sequential alternatives, you should concatenate these transcriptions together.

The following Python code iterates over a result list and concatenates the transcriptions together. Note that we take the first alternative (the zeroth) in all cases.

response = service_request.execute()
recognized_text = 'Transcribed Text: \n'
for result in response['results']:
    recognized_text += result['alternatives'][0]['transcript']

Confidence values

The confidence value is an estimate between 0.0 and 1.0, where a higher number indicates a greater estimated likelihood that the recognized words are correct. The confidence field is typically provided only for the top hypothesis, and only for results where is_final=true. You may, for example, use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation: if the confidence for the top result is high, it is likely correct; if it is lower, there is a greater chance that one of the other alternatives is more accurate. Your code should not require the confidence field, as it is not guaranteed to be accurate, or even set, in any of the results.
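
A sketch of one such use follows; the 0.8 threshold is an arbitrary example, confidence is read defensively because it may be absent, and prompt_user_to_confirm and accept_transcript are hypothetical application functions:

top = response['results'][0]['alternatives'][0]
confidence = top.get('confidence')  # may be missing; do not depend on it

if confidence is not None and confidence < 0.8:
    # Low confidence: hypothetical hook to ask the user to confirm.
    prompt_user_to_confirm(top['transcript'])
else:
    accept_transcript(top['transcript'])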

Asynchronous Speech API responses

An asynchronous Speech API request is identical in form to a synchronous Speech API request. (See Synchronous Speech Recognition Requests above.) However, instead of returning a response, the asynchronous request will initiate a Long Running Operation (of type Operation) and return this operation to the caller immediately.

A typical operation response is shown below:

{
  "name": "operation_name",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeMetadata"
    "progress_percent": 34,
    "start_time": "2016-08-30T23:26:29.579144Z",
    "last_update_time": `2016-08-30T23:26:29.826903Z"
  }
}

Note that no results are yet present. The Speech API will continue to process the supplied audio and use this operation to store the eventual results, which will appear within the operation's response field (of type AsyncRecognizeResponse) upon completion of the request.

A full response after completion of the request appears below:

{
  "name": "1268386125834704889",
  "metadata": {
    "lastUpdateTime": "2016-08-31T00:16:32.169Z",
    "@type": "type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeMetadata",
    "startTime": "2016-08-31T00:16:29.539820Z",
    "progressPercent": 100
  },
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeResponse",
    "results": [{
      "alternatives": [{
        "confidence": 0.98267895,
        "transcript": "how old is the Brooklyn Bridge"
      }]}]
  },
  "done": True,
}

Note that done has been set to true and that the operation's response contains a set of results of type SpeechRecognitionResult, which is the same type returned by a synchronous Speech API recognition request.

While an operation is still in progress, done holds its default value of false; however, because the JSON representation omits fields set to their default values, the done field may be absent entirely. When testing whether an operation has completed, you should therefore test both that the done field is present and that it is set to true.
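
For example, a sketch of this check, where operation is the parsed JSON of the operation resource retrieved by whatever polling mechanism you use:

def extract_results(operation):
    """Return the recognition results if the operation has finished."""
    # Test both that 'done' is present and that it is true; the field
    # may be omitted entirely while the operation is still in progress.
    if operation.get('done', False):
        return operation['response']['results']
    return None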

Streaming Speech API Recognition Requests

A streaming Speech API recognition call is designed for real-time capture and recognition of audio, within a bi-directional stream. Once a stream is opened, consecutive frames of audio data can be sent to the API, while interim results will be returned within the same stream.

Streaming requests

Unlike synchronous and asynchronous calls, in which you send both the configuration and audio within a single request, calling the streaming Speech API requires sending multiple requests. The first StreamingRecognizeRequest must contain a configuration of type StreamingRecognitionConfig without any accompanying audio. Subsequent StreamingRecognizeRequests sent over the same stream will then consist of consecutive frames of raw audio bytes.

A StreamingRecognitionConfig consists of the following fields:

  • config - (required) contains configuration information for the audio, of type RecognitionConfig and is the same as that shown within synchronous and asynchronous requests.
  • single_utterance - (optional, defaults to false) indicates whether this request should automatically end after speech is no longer detected. If set, the Speech API will detect pauses, silence, or non-speech audio to determine when to end recognition. If not set, the stream will continue to listen and process audio until either the stream is closed directly, or the stream's length limit has been exceeded. Setting single_utterance to true is useful for processing voice commands.
  • interim_results - (optional, defaults to false) indicates that this stream request should return temporary results that may be refined at a later time (after processing more audio). Interim results will be noted within responses through the setting of is_final to false.
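
The following sketch expresses this request sequence as a Python generator that yields the configuration first and then consecutive frames of raw audio bytes. The cloud_speech_pb2 module and message names are assumed to come from the v1beta1 generated gRPC stubs; adapt them to the stubs you actually generate.

# Assumed v1beta1 generated gRPC stubs.
from google.cloud.speech.v1beta1 import cloud_speech_pb2

def request_stream(audio_chunks, sample_rate=16000):
    # First request: configuration only, no audio.
    recognition_config = cloud_speech_pb2.RecognitionConfig(
        encoding=cloud_speech_pb2.RecognitionConfig.LINEAR16,
        sample_rate=sample_rate,
        language_code='en-US')
    streaming_config = cloud_speech_pb2.StreamingRecognitionConfig(
        config=recognition_config,
        interim_results=True)
    yield cloud_speech_pb2.StreamingRecognizeRequest(
        streaming_config=streaming_config)

    # Subsequent requests: consecutive frames of raw audio bytes.
    for chunk in audio_chunks:
        yield cloud_speech_pb2.StreamingRecognizeRequest(audio_content=chunk)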

Streaming responses

Streaming speech recognition results are returned within a series of responses of type StreamingRecognizeResponse. Such a response consists of the following fields:

  • endpointerType contains events of type EndpointerType. The value of these events will indicate when the speech stream has started, stopped, audio has ended, or a single utterance has been determined to have been completed. The endpointer events serve as markers within your stream's response.
  • results contains the list of results, which may be either interim or final results, of type
    StreamingRecognitionResult. The results list contains the following sub-fields:
    • alternatives contains a list of alternative transcriptions.
    • isFinal indicates whether the results obtained within this list entry are interim or are final.
    • stability indicates the volatility of results obtained so far, with 0.0 indicating complete instability while 1.0 indicates complete stability. Note that unlike confidence, which estimates whether a transcription is correct, stability estimates whether the given partial result may change. If isFinal is set to true, stability will not be set.
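
A sketch of consuming these responses follows, again assuming the v1beta1 generated gRPC stubs; responses stands for the iterator returned by the bi-directional StreamingRecognize call, and the field names mirror the list above:

# 'responses' is the iterator returned by the streaming call, e.g.
# service.StreamingRecognize(request_stream(...), DEADLINE_SECS).
for response in responses:
    for result in response.results:
        top = result.alternatives[0]
        if result.is_final:
            # Final transcription for this stretch of audio.
            print('final: %s' % top.transcript)
        else:
            # Interim result that may still change as more audio arrives.
            print('interim (stability %.2f): %s'
                  % (result.stability, top.transcript))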
