Transcribing phone audio with enhanced models

This tutorial shows how to transcribe the audio recorded from a phone using Speech-to-Text.

Audio data can come from many different sources, such as a phone recording (like voicemail) or the soundtrack included in a video file.

Speech-to-Text can use one of several machine learning models to transcribe your audio file, each trained for a particular kind of audio source. You can get better transcription results by specifying the source of the original audio, which allows Speech-to-Text to process your audio files using a model trained on data similar to your audio file.


Objectives

This tutorial demonstrates how to:

  • Send an audio transcription request for audio recorded from a phone (like voicemail) to Speech-to-Text.
  • Specify an enhanced speech recognition model for an audio transcription request.


This tutorial uses billable components of Cloud Platform, including:

  • Speech-to-Text

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

This tutorial has several prerequisites:

Sending a request

To best transcribe audio captured on a phone, like a phone call or voicemail, you can set the model field in your RecognitionConfig payload to phone_call. The model field tells the Speech-to-Text API which speech recognition model to use for the transcription request.

You can improve the results of phone audio transcription by using an enhanced model. To use an enhanced model, you set the useEnhanced field to true in your RecognitionConfig payload.
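As a quick sketch of how these two fields fit together, the following builds the same JSON request body shown in the REST example that follows. This is an illustration only; the audio URI is the public sample file used throughout this tutorial.

```python
import json

# Sketch: request body for an enhanced phone_call transcription.
# "model" selects the recognition model; "useEnhanced" opts in to the
# enhanced variant of that model. Both live inside RecognitionConfig.
request_body = {
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "model": "phone_call",
        "useEnhanced": True,
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    },
}

print(json.dumps(request_body, indent=2))
```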

The following code samples demonstrate how to select a specific transcription model when calling Speech-to-Text.


Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "languageCode": "en-US",
        "enableWordTimeOffsets": false,
        "enableAutomaticPunctuation": true,
        "model": "phone_call",
        "useEnhanced": true
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/commercial_mono.wav"
    }
}' "https://speech.googleapis.com/v1p1beta1/speech:recognize"

See the RecognitionConfig reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

  "results": [
      "alternatives": [
          "transcript": "Hi, I'd like to buy a Chromecast. I was wondering whether you could help me with that.",
          "confidence": 0.8930228
      "resultEndTime": "5.640s"
      "alternatives": [
          "transcript": " Certainly, which color would you like? We are blue black and red.",
          "confidence": 0.9101991
      "resultEndTime": "10.220s"
      "alternatives": [
          "transcript": " Let's go with the black one.",
          "confidence": 0.8818244
      "resultEndTime": "13.870s"
      "alternatives": [
          "transcript": " Would you like the new Chromecast Ultra model or the regular Chromecast?",
          "confidence": 0.94733626
      "resultEndTime": "18.460s"
      "alternatives": [
          "transcript": " Regular Chromecast is fine. Thank you. Okay. Sure. Would you like to ship it regular or Express?",
          "confidence": 0.9519095
      "resultEndTime": "25.930s"
      "alternatives": [
          "transcript": " Express, please.",
          "confidence": 0.9101229
      "resultEndTime": "28.260s"
      "alternatives": [
          "transcript": " Terrific. It's on the way. Thank you. Thank you very much. Bye.",
          "confidence": 0.9321616
      "resultEndTime": "34.150s"


import (
	"context"
	"fmt"
	"io"
	"io/ioutil"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	speechpb "google.golang.org/genproto/googleapis/cloud/speech/v1"
)

// enhancedModel transcribes a local audio file using the enhanced
// phone_call model.
func enhancedModel(w io.Writer, path string) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %v", err)
	}
	defer client.Close()

	// path = "../testdata/commercial_mono.wav"
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return fmt.Errorf("ReadFile: %v", err)
	}

	resp, err := client.Recognize(ctx, &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 8000,
			LanguageCode:    "en-US",
			UseEnhanced:     true,
			// A model must be specified to use an enhanced model.
			Model: "phone_call",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	})
	if err != nil {
		return fmt.Errorf("Recognize: %v", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}


 * Please include the following imports to run this sample.
 * import com.google.cloud.speech.v1.RecognitionAudio;
 * import com.google.cloud.speech.v1.RecognitionConfig;
 * import com.google.cloud.speech.v1.RecognizeRequest;
 * import com.google.cloud.speech.v1.RecognizeResponse;
 * import com.google.cloud.speech.v1.SpeechClient;
 * import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
 * import com.google.cloud.speech.v1.SpeechRecognitionResult;
 * import com.google.protobuf.ByteString;
 * import java.nio.file.Files;
 * import java.nio.file.Path;
 * import java.nio.file.Paths;

public static void sampleRecognize() {
  // TODO(developer): Replace these variables before running the sample.
  String localFilePath = "resources/hello.wav";
  String model = "phone_call";
  sampleRecognize(localFilePath, model);
}

/**
 * Transcribe a short audio file using a specified transcription model
 *
 * @param localFilePath Path to local audio file, e.g. /path/audio.wav
 * @param model The transcription model to use, e.g. video, phone_call, default. For a list of
 *     available transcription models, see the RecognitionConfig reference documentation.
 */
public static void sampleRecognize(String localFilePath, String model) {
  try (SpeechClient speechClient = SpeechClient.create()) {

    // The language of the supplied audio
    String languageCode = "en-US";
    RecognitionConfig config =
        RecognitionConfig.newBuilder().setModel(model).setLanguageCode(languageCode).build();
    Path path = Paths.get(localFilePath);
    byte[] data = Files.readAllBytes(path);
    ByteString content = ByteString.copyFrom(data);
    RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(content).build();
    RecognizeRequest request =
        RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();
    RecognizeResponse response = speechClient.recognize(request);
    for (SpeechRecognitionResult result : response.getResultsList()) {
      // First alternative is the most probable result
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript: %s\n", alternative.getTranscript());
    }
  } catch (Exception exception) {
    System.err.println("Failed to create the client due to: " + exception);
  }
}


// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);


def transcribe_model_selection(speech_file, model):
    """Transcribe the given audio file synchronously with
    the selected model."""
    from google.cloud import speech

    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        model=model,
    )

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print("First alternative of result {}".format(i))
        print(u"Transcript: {}".format(alternative.transcript))

Additional languages

C#: Please follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Please follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Please follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting instances

To delete a Compute Engine instance:

  1. In the Cloud Console, go to the VM instances page.

    Go to VM instances

  2. Select the checkbox for the instance that you want to delete.
  3. To delete the instance, click More actions, click Delete, and then follow the instructions.

Deleting firewall rules for the default network

To delete a firewall rule:

  1. In the Cloud Console, go to the Firewall page.

    Go to Firewall

  2. Select the checkbox for the firewall rule that you want to delete.
  3. To delete the firewall rule, click Delete.