Troubleshooting

Learn about troubleshooting steps that you might find helpful if you run into problems using Speech-to-Text.

Cannot authenticate to Speech-to-Text

You might receive an error message indicating that your "Application Default Credentials" are unavailable or you might be wondering how to get an API key to use when calling Speech-to-Text.

Speech-to-Text uses Application Default Credentials for authentication.

You must have a service account for your project, download the key (JSON file) for your service account to your development environment, and then set an environment variable named GOOGLE_APPLICATION_CREDENTIALS to the location of that JSON file.

Furthermore, the GOOGLE_APPLICATION_CREDENTIALS environment variable must be available within the context in which you call the Speech-to-Text API. For example, if you set the variable from within a terminal session but run your code in the debugger of your IDE, the execution context of your code might not have access to the variable. In that circumstance, your request to Speech-to-Text might fail for lack of proper authentication.
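
One way to confirm that the variable is visible in the same context that runs your code is to check it from inside that process. The following is a minimal sketch using only the Python standard library; the helper name check_adc_env is hypothetical and is not part of any Google client library. It only verifies that the variable is set and points to an existing file, not that the key itself is valid.

```python
import os


def check_adc_env(environ=None):
    """Return (ok, message) describing whether the credentials variable is usable.

    Hypothetical helper: checks visibility of the variable in this process only.
    """
    env = os.environ if environ is None else environ
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set in this process"
    if not os.path.isfile(path):
        return False, f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {path}"
    return True, f"Using service account key at {path}"


if __name__ == "__main__":
    ok, message = check_adc_env()
    print(message)
```

Running this from both your terminal and your IDE debugger shows whether the two environments see the same value.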

For more information on how to set the GOOGLE_APPLICATION_CREDENTIALS environment variable, see the Speech-to-Text quickstarts or the documentation on using the Application Default Credentials.

Speech-to-Text returns an empty response

There are multiple reasons why Speech-to-Text might return an empty response. The source of the problem can be the RecognitionConfig or the audio itself.

Troubleshoot RecognitionConfig

The RecognitionConfig object (or StreamingRecognitionConfig) is part of a Speech-to-Text recognition request. There are two main categories of fields that must be set in order to correctly perform a transcription:

  • Audio configuration.
  • Model and language.

One of the most common causes of empty responses (for example, you receive an empty {} JSON response) is providing incorrect information about the audio metadata. If the audio configuration fields are not set correctly, transcription will most likely fail and the recognition model will return empty results.

Audio configuration contains the metadata of the provided audio. You can obtain the metadata for your audio file using the ffprobe command, which is part of FFmpeg.

The following example demonstrates using ffprobe to get the metadata for https://storage.googleapis.com/cloud-samples-tests/speech/commercial_mono.wav.

$ ffprobe commercial_mono.wav
[...]
Input #0, wav, from 'commercial_mono.wav':
  Duration: 00:00:35.75, bitrate: 128 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, 1 channels, s16, 128 kb/s

With the command above, we can see the file has:

  • sample_rate_hertz: 8000
  • channels: 1
  • encoding: LINEAR16 (s16)

You can use this information in your RecognitionConfig.
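
If the audio is a PCM WAV file, the same metadata can also be read programmatically. The sketch below uses Python's standard wave module to build a RecognitionConfig-style dictionary; the function name config_from_wav and the default language code "en-US" are assumptions made for illustration, and the field names follow the JSON form of RecognitionConfig (sampleRateHertz, audioChannelCount). It handles only 16-bit PCM WAV input.

```python
import wave


def config_from_wav(path, language_code="en-US"):
    """Build a RecognitionConfig-style dict from a PCM WAV header.

    Hypothetical helper; language_code is an assumed default, not read from the file.
    """
    with wave.open(path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        if sample_width != 2:
            raise ValueError(
                f"expected 16-bit samples for LINEAR16, got {8 * sample_width}-bit"
            )
        return {
            "encoding": "LINEAR16",
            "sampleRateHertz": wav.getframerate(),
            "audioChannelCount": wav.getnchannels(),
            "languageCode": language_code,
        }
```

For a file like commercial_mono.wav, this would be expected to report a sample rate of 8000 and one channel, matching the ffprobe output.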

Other causes of an empty response can be related to the audio encoding. Here are some other tools and things to try:

  1. Play the file and listen to the output. Is the audio clear and the speech intelligible?

    To play files, you can use the SoX (Sound eXchange) play command. A few examples based on different audio encodings are shown below.

    FLAC files include a header that indicates the sample rate, encoding type and number of channels, and can be played as follows:

    play audio.flac

    LINEAR16 files do not include a header. To play them you must specify the sample rate, encoding type and number of channels. The LINEAR16 encoding must be 16-bits, signed-integer, little-endian.

    play --channels=1 --bits=16 --rate=16000 --encoding=signed-integer \
    --endian=little audio.raw

    MULAW files also do not include a header and often use a lower sample rate.

    play --channels=1 --rate=8000 --encoding=u-law audio.raw
  2. Check that the audio encoding of your data matches the parameters you sent in RecognitionConfig. For example, if your request specified "encoding":"FLAC" and "sampleRateHertz":16000, the audio data parameters listed by the SoX play command should match these parameters, as follows:

    play audio.flac

    should list:

    Encoding: FLAC
    Channels: 1 @ 16-bit
    Sampleratehertz: 16000Hz
    

    If the SoX listing shows a Sampleratehertz other than 16000Hz, change the "sampleRateHertz" in RecognitionConfig to match. If the Encoding is not FLAC or Channels is not 1 @ 16-bit, you cannot use this file directly and will need to convert it to a compatible encoding (see next step).

  3. If your audio file is not in FLAC encoding, try converting it to FLAC using SoX, and repeat the steps above to play the file and verify the encoding, sampleRateHertz, and channels. Here are some examples that convert various audio file formats to FLAC encoding.

    sox audio.wav --channels=1 --bits=16 audio.flac
    sox audio.ogg --channels=1 --bits=16 audio.flac
    sox audio.au --channels=1 --bits=16 audio.flac
    sox audio.aiff --channels=1 --bits=16 audio.flac
    

    To convert a raw file to FLAC, you need to know the audio encoding of the file. For example, to convert a stereo, 16-bit, signed, little-endian file at 16000 Hz to FLAC:

    sox --channels=2 --bits=16 --rate=16000 --encoding=signed-integer \
    --endian=little audio.raw --channels=1 --bits=16 audio.flac
    
  4. Run the Quickstart example or one of the Sample Applications with the supplied sample audio file. Once the example is running successfully, replace the sample audio file with your audio file.
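
The comparison in step 2 can also be automated for WAV input. The following sketch compares a WAV header against the fields of a RecognitionConfig-style dictionary and lists any mismatches. The function name find_config_mismatches is hypothetical, and the sketch assumes PCM WAV files only, because Python's standard library cannot read FLAC headers; for FLAC, rely on the SoX listing described above.

```python
import wave


def find_config_mismatches(wav_path, config):
    """Compare a WAV header with a RecognitionConfig-style dict.

    Hypothetical helper: returns a list of human-readable problems (empty if none).
    """
    with wave.open(wav_path, "rb") as wav:
        actual = {
            "sampleRateHertz": wav.getframerate(),
            "audioChannelCount": wav.getnchannels(),
        }
        bits = 8 * wav.getsampwidth()

    problems = []
    for field, value in actual.items():
        # Only compare fields that the request actually sets.
        if field in config and config[field] != value:
            problems.append(f"{field}: request says {config[field]}, file has {value}")
    if config.get("encoding") == "LINEAR16" and bits != 16:
        problems.append(f"encoding: LINEAR16 needs 16-bit samples, file has {bits}-bit")
    return problems
```

An empty list means the checked fields agree with the file; it does not guarantee that the speech itself is clear or intelligible.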

Model and language configuration

Model selection is very important for obtaining high-quality transcription results. Speech-to-Text provides multiple models that have been tuned to different use cases, and you must choose the one that most closely matches your audio. For example, some models (such as latest_short and command_and_search) are short-form models, which means that they are more suited to short audio and prompts. These models are likely to return results as soon as they detect a period of silence. Long-form models, on the other hand (such as latest_long, phone_call, video, and default), are more suited to longer audio and are less likely to interpret silence as the end of the audio.

If your recognition ends too abruptly or doesn't return quickly, you might want to check and experiment with other models to see if you can get better transcription quality. You can experiment with multiple models using the Speech UI.
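
If you prefer to compare models from code instead of the Speech UI, one approach is to generate one request configuration per candidate model and send the same audio with each. The sketch below only builds the configurations; it does not call the API. The helper name configs_to_try is hypothetical, the grouping of model names into short-form and long-form is an assumption based on the description above (with latest_long as the long-form counterpart of latest_short), and "model" is the RecognitionConfig field that selects the model.

```python
# Assumed grouping of model names, for illustration only.
SHORT_FORM_MODELS = ("latest_short", "command_and_search")
LONG_FORM_MODELS = ("latest_long", "phone_call", "video", "default")


def configs_to_try(base_config, short_form):
    """Return one config per candidate model, leaving base_config untouched."""
    models = SHORT_FORM_MODELS if short_form else LONG_FORM_MODELS
    return [{**base_config, "model": model} for model in models]


if __name__ == "__main__":
    base = {"encoding": "LINEAR16", "sampleRateHertz": 8000, "languageCode": "en-US"}
    for config in configs_to_try(base, short_form=False):
        print(config["model"])
```

Keeping the audio fields fixed and varying only the model makes it easier to attribute differences in the transcripts to the model choice.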

Unexpected results from speech recognition

If the results returned by Speech-to-Text are not what you expected: