
Transcribing Phone Audio with Enhanced Models

This tutorial shows how to transcribe audio recorded from a phone using Cloud Speech-to-Text.

Audio data can come from many different sources, such as a recording from a phone (like voicemail) or the soundtrack of a video file.

Speech-to-Text can use one of several machine learning models to transcribe your audio file. You can get better transcription results by specifying the source of the original audio, which lets Speech-to-Text process your audio using a machine learning model trained on data similar to your audio file.

Objectives

  • Send an audio transcription request for audio recorded from a phone (like voicemail) to Speech-to-Text.
  • Request an enhanced speech recognition model for an audio transcription request.

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Cloud Speech-to-Text

Use the Pricing Calculator to generate a cost estimate based on your projected usage. New Cloud Platform users might be eligible for a free trial.

Before you begin

This tutorial assumes that you have already set up a Google Cloud project with the Speech-to-Text API enabled and can authenticate requests; the Quickstart referenced in the next section covers these setup steps.

Sending a request

To best transcribe audio captured on a phone, like a phone call or voicemail, set the model field in your RecognitionConfig payload to phone_call. The model field tells the Speech-to-Text API which speech recognition model to use for the transcription request.

You can improve the results of phone audio transcription by using an enhanced model. The enhanced models are available to customers who participate in the data logging program for their project. To use an enhanced model, you set the useEnhanced field to true in your RecognitionConfig payload.
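
For example, in the JSON request body the two fields sit side by side in the config object. The excerpt below shows only the model-selection portion of a request; the complete request appears in the Protocol section that follows:

  'config': {
    'encoding': 'LINEAR16',
    'languageCode': 'en-US',
    'model': 'phone_call',
    'useEnhanced': true
  }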

The following code samples demonstrate how to select a specific transcription model when calling Speech-to-Text.

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the access token for a service account set up for the project using the Google Cloud SDK. For instructions on installing the Cloud SDK, setting up a project with a service account, and obtaining an access token, see the Quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1p1beta1/speech:recognize \
    --data "{
  'config': {
    'encoding': 'LINEAR16',
    'languageCode': 'en-US',
    'enableWordTimeOffsets': false,
    'enableAutomaticPunctuation': true,
    'model': 'phone_call',
    'useEnhanced': true
  },
  'audio': {
    'uri':'gs://cloud-samples-tests/speech/commercial_mono.wav'
  }
}"

See the RecognitionConfig reference documentation for more information on configuring the request body.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hi I'd like to buy a Chromecast I'm always wondering whether you can help me with that certain me which color would you like blue black and red",
          "confidence": 0.94601107
        }
      ]
    },
    {
      "alternatives": [
        {
          "transcript": " let's go with the black one",
          "confidence": 0.9824439
        }
      ]
    },
    {
      "alternatives": [
        {
          "transcript": " would you like the new Chromecast Ultra model or the regular Chromecast",
          "confidence": 0.9354997
        }
      ]
    },
    {
      "alternatives": [
        {
          "transcript": " regular Chromecast is fine thank you",
          "confidence": 0.92987233
        }
      ]
    }
  ]
}
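
If you save the JSON response to a file, a few lines of Python can pull out each transcript and its confidence score. The sketch below is illustrative rather than part of the official samples, and assumes the curl output was redirected to a file named response.json:

import json

# Load the response saved from the curl example above
# ('response.json' is an illustrative filename).
with open('response.json') as f:
    response = json.load(f)

for result in response['results']:
    # Alternatives are ordered by likelihood; use the top one.
    top = result['alternatives'][0]
    print('{:.2f}  {}'.format(top['confidence'], top['transcript']))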

Java

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

import com.google.cloud.speech.v1p1beta1.RecognitionAudio;
import com.google.cloud.speech.v1p1beta1.RecognitionConfig;
import com.google.cloud.speech.v1p1beta1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1p1beta1.RecognizeResponse;
import com.google.cloud.speech.v1p1beta1.SpeechClient;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1p1beta1.SpeechRecognitionResult;
import com.google.protobuf.ByteString;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Performs transcription of the given audio file synchronously with
 * the selected model.
 * @param fileName the path to an audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure the request to use the enhanced phone call model
    RecognitionConfig recConfig = RecognitionConfig.newBuilder()
        // encoding may either be omitted or must match the value in the file header
        .setEncoding(AudioEncoding.LINEAR16)
        .setLanguageCode("en-US")
        // sample rate hertz may either be omitted or must match the value in the file header
        .setSampleRateHertz(16000)
        // Enhanced models are only available to projects that opt in to data logging
        .setUseEnhanced(true)
        // A model must be specified to use an enhanced model
        .setModel("phone_call")
        .build();

    RecognitionAudio recognitionAudio = RecognitionAudio.newBuilder()
        .setContent(ByteString.copyFrom(content))
        .build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
  // Enhanced models are only available to projects that opt in to data logging
  useEnhanced: true,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
client
  .recognize(request)
  .then(data => {
    const response = data[0];
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(`Transcription: ${transcription}`);
  })
  .catch(err => {
    console.error('ERROR:', err);
  });

Python

For more on installing and creating a Speech-to-Text client, refer to Speech-to-Text Client Libraries.

import io

from google.cloud import speech_v1p1beta1 as speech


def transcribe_file_with_enhanced_model(path):
    """Transcribe the given audio file using an enhanced model."""
    client = speech.SpeechClient()

    with io.open(path, 'rb') as audio_file:
        content = audio_file.read()

    audio = speech.types.RecognitionAudio(content=content)
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code='en-US',
        # Enhanced models are only available to projects that
        # opt in for audio data collection.
        use_enhanced=True,
        # A model must be specified to use enhanced model.
        model='phone_call')

    response = client.recognize(config, audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print('-' * 20)
        print('First alternative of result {}'.format(i))
        print('Transcript: {}'.format(alternative.transcript))
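
To try the function, call it with the path to a local phone recording that matches the LINEAR16 encoding and 8000 Hz sample rate set in the config. The path below is illustrative:

# Illustrative usage; 'resources/commercial_mono.wav' is a hypothetical local path.
transcribe_file_with_enhanced_model('resources/commercial_mono.wav')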

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

Deleting the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

