Transcribe short audio files

This page demonstrates how to transcribe a short audio file to text using synchronous speech recognition.

Synchronous speech recognition returns the recognized text for short audio (less than 60 seconds). To process a speech recognition request for audio longer than 60 seconds, use Asynchronous Speech Recognition.
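
As a quick sanity check before choosing an API, you can measure the duration of a local WAV file with Python's standard wave module. This is only a sketch; the file name audio.wav is an example, not a requirement:

import wave

# Assumed example path; any local PCM WAV file works here.
with wave.open("audio.wav", "rb") as wav:
    duration_seconds = wav.getnframes() / wav.getframerate()

if duration_seconds < 60:
    print("Short enough for synchronous recognition")
else:
    print("Use asynchronous speech recognition instead")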

Audio content can be sent directly to Speech-to-Text from a local file, or Speech-to-Text can process audio content stored in a Google Cloud Storage bucket. See the quotas & limits page for limits on synchronous speech recognition requests.

Perform synchronous speech recognition on a local file

Here is an example of performing synchronous speech recognition on a local audio file:

REST & CMD LINE

Refer to the speech:recognize API endpoint for complete details. See the RecognitionConfig reference documentation for more information on configuring the request body.

The audio content supplied in the request body must be base64-encoded. For more information on how to base64-encode audio, see Base64 Encoding Audio Content. For more information on the content field, see RecognitionAudio.
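
As a minimal sketch, the base64 string for a local file can be produced with Python's standard library (the file name audio.raw is only an example):

import base64

# Read the raw audio bytes and base64-encode them for the "content" field.
with open("audio.raw", "rb") as audio_file:
    input_audio = base64.b64encode(audio_file.read()).decode("utf-8")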

Before using any of the request data, make the following replacements:

  • LANGUAGE_CODE: the BCP-47 code of the language spoken in your audio clip.
  • ENCODING: the encoding of the audio you want to transcribe.
  • SAMPLE_RATE_HERTZ: sample rate in hertz of the audio you want to transcribe.
  • ENABLE_WORD_TIME_OFFSETS: enable this field if you want word start and end time offsets (timestamps) returned.
  • INPUT_AUDIO: a base64-encoded string of the audio data that you want to transcribe.

HTTP method and URL:

POST https://speech.googleapis.com/v1/speech:recognize

Request JSON body:

{
  "config":{
      "languageCode":"LANGUAGE_CODE",
      "encoding":ENCODING
      "sampleRateHertz":SAMPLE_RATE_HERTZ
      "enableTimeWordOffsets":ENABLE_TIME_WORD_OFFSETS
  },
  "audio":{
    "content":"INPUT_AUDIO"
  }
}

To send your request, use any HTTP client that can submit the JSON body above together with an OAuth 2.0 access token for your project.
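
For illustration only, the following Python sketch builds and sends the request using the requests library and Application Default Credentials; the file name, encoding, and sample rate are example values, not requirements:

import base64
import json

import requests
import google.auth
import google.auth.transport.requests

# Obtain an access token from Application Default Credentials (assumed to be
# configured, for example via `gcloud auth application-default login`).
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Example values; replace with your own audio file and its actual parameters.
with open("audio.raw", "rb") as audio_file:
    input_audio = base64.b64encode(audio_file.read()).decode("utf-8")

body = {
    "config": {
        "languageCode": "en-US",
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "enableWordTimeOffsets": False,
    },
    "audio": {"content": input_audio},
}

response = requests.post(
    "https://speech.googleapis.com/v1/speech:recognize",
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=body,
)
print(json.dumps(response.json(), indent=2))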

You should receive a JSON response similar to the following:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98267895
        }
      ]
    }
  ]
}

gcloud

Refer to the recognize command for complete details.

To perform speech recognition on a local file, use the gcloud command-line tool, passing in the local file path of the file to transcribe.

gcloud ml speech recognize PATH-TO-LOCAL-FILE --language-code='en-US'

If the request is successful, the server returns a response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

Go


func recognize(w io.Writer, file string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	data, err := ioutil.ReadFile(file)
	if err != nil {
		return err
	}

	// Send the contents of the audio file with the encoding and
	// sample rate information to be transcribed.
	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	})
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}

Node.js

// Imports the Google Cloud client library
const fs = require('fs');
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
async function main() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}
main().catch(console.error);

Python

def transcribe_file(speech_file):
    """Transcribe the given audio file."""
    from google.cloud import speech
    import io

    client = speech.SpeechClient()

    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.

Perform synchronous speech recognition on a remote file

For your convenience, the Speech-to-Text API can perform synchronous speech recognition directly on an audio file located in Google Cloud Storage, without requiring you to send the contents of the audio file in the body of your request.

Here is an example of performing synchronous speech recognition on a file located in Cloud Storage:

REST & CMD LINE

Refer to the speech:recognize API endpoint for complete details. See the RecognitionConfig reference documentation for more information on configuring the request body.

The audio content supplied in the request body must be base64-encoded. For more information on how to base64-encode audio, see Base64 Encoding Audio Content. For more information on the content field, see RecognitionAudio.

Before using any of the request data, make the following replacements:

  • LANGUAGE_CODE: the BCP-47 code of the language spoken in your audio clip.
  • ENCODING: the encoding of the audio you want to transcribe.
  • SAMPLE_RATE_HERTZ: sample rate in hertz of the audio you want to transcribe.
  • ENABLE_WORD_TIME_OFFSETS: enable this field if you want word start and end time offsets (timestamps) returned.
  • STORAGE_BUCKET: a Cloud Storage bucket.
  • INPUT_AUDIO: the audio data file that you want to transcribe.

HTTP method and URL:

POST https://speech.googleapis.com/v1/speech:recognize

Request JSON body:

{
  "config":{
      "languageCode":"LANGUAGE_CODE",
      "encoding":ENCODING
      "sampleRateHertz":SAMPLE_RATE_HERTZ
      "enableTimeWordOffsets":ENABLE_TIME_WORD_OFFSETS
  },
  "audio":{
    "uri":"gs://STORAGE_BUCKET/INPUT_AUDIO"
  }
}

To send your request, use any HTTP client that can submit the JSON body above together with an OAuth 2.0 access token, as in the Python sketch shown earlier.

You should receive a JSON response similar to the following:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98267895
        }
      ]
    }
  ]
}

gcloud

Refer to the recognize command for complete details.

To perform speech recognition on a remote file, use the gcloud command-line tool, passing in the Cloud Storage URI of the file to transcribe.

gcloud ml speech recognize 'gs://cloud-samples-tests/speech/brooklyn.flac' \
--language-code='en-US'

If the request is successful, the server returns a response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}

Go


func recognizeGCS(w io.Writer, gcsURI string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	// Send the request with the URI (gs://...)
	// and sample rate information to be transcribed.
	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	})
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}

Node.js

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};
const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
async function main() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}
main().catch(console.error);

Python

def transcribe_gcs(gcs_uri):
    """Transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.