Visualize speech data with Speech Analysis Framework

This solution describes the Speech Analysis Framework, a collection of components and code from Google Cloud that you can use to transcribe audio, create a data pipeline workflow to display analytics of the transcribed audio files, and then visually represent the data.

This document contains code snippets from the Speech Analysis Framework demonstrating how to use Google Cloud APIs. For building the UI, we recommend using Google Cloud partners (for example, systems integrators). For more information about how to build the Speech Analysis Framework UI, or about systems integrators, fill in this interest form.

The solution illustrates the framework by walking you through a contact center use case, where capturing and analyzing audio recordings is a critical function to help answer operational questions. This includes both tactical questions ("Who are our best live agents?") and strategic questions ("Why are customers calling us?").

The sample code in this solution demonstrates some highlights of how to use the framework, but this solution is not a tutorial. You see how to create an application that's hosted on App Engine and how you might customize the framework for your use cases. All the components discussed in this document are Google Cloud services or are available through our partners as Contact Center AI solutions.

The problem: rising call volume

Most contact centers record customer calls for human analysts to manually review later. But with the volume of these recordings and the inherent resource constraints of a manual approach, only a small portion of these recordings are ever analyzed or even collected. As a result, huge amounts of data go untapped that could help provide better customer service, increase customer satisfaction, and decrease contact center volume.

Most contact centers implement key performance indicators (KPIs) for customer satisfaction such as average time in queue, average abandonment rate, and feedback surveys to track and improve overall contact center performance. Typically, the data for these KPIs is provided directly by telephony software without any transcription needed.

However, metrics that are not collected by the telephony provider, such as call sentiment while the call is in progress, also have a major impact on meeting the customer satisfaction goal. Questions like the following are often answered today by people listening to and scoring calls manually:

  • How did the sentiment of the call start?
  • How did the sentiment of the call progress?
  • How did the sentiment of the call end?

This approach is not only tedious and inefficient, but also prone to human bias.

Finally, contact centers that use live-agent scripts for interacting with customers must enforce compliance. You can either manually listen to recorded calls or join live sessions to determine whether the agent is following the script.

Unfortunately, due to the volume of calls and limited time slots available for quality analysis, most calls are never analyzed for compliance—which, like sentiment, is a valuable metric for scorecards.

The goal: analyze all recorded calls

Instead of taking the potentially biased and inefficient manual approach, contact centers can use Google Cloud to meet the goal of transcribing and analyzing all recorded calls to get insights in near real time. These insights can include:

  • Overall call sentiment.
  • Sentence-by-sentence sentiment.
  • Insight into which agent quality metrics to track (such as call silence, call duration, agent speaking time, user speaking time, and sentence heatmaps).
  • Insights on how to reduce call center volume by analyzing keywords in transcripts.

The framework presented in this solution uses Google Cloud machine learning (AI Platform) services such as Speech-to-Text and Cloud Natural Language API. The framework also employs Pub/Sub and Dataflow for data streaming and transformation. The resulting application is completely serverless. To implement the framework, you don't need any machine learning experience, and all data infrastructure needs, such as storage, scaling, and security, are managed by Google Cloud.

Visualizations and reporting

It's easier to gain insights from visualizations of audio recording data than it is from simply reviewing transcripts. The framework in this solution helps you by producing visualizations that put the data into the context of the user.

For example, the following transcript of a recording might be meaningful, but it would be better to see a transcript on a per-speaker basis—that is, instead of reading a paragraph, you can read based on who said what.

Agent: Hello. Thank you for calling the Google merchandise store. How can I help you?
Client: Hi. I ordered a pair for Google crew socks last week.
Agent: Okay have they arrived?
Client: Yes, they have but I have two issues.

Using Google Cloud's AI Platform services, you can produce a range of visualizations, as described in the following sections.

Call recording scorecards

The top-level scorecard produced by the framework includes common industry metrics, as well as transcript-derived metrics such as sentiment. Metrics such as call silence (calculated through differences in timestamps) provide additional insight into customer satisfaction—for example, an unusually high call silence percentage is a red flag.

Scorecard showing call silence

Volume-based call recording scorecards

With the volume-based scorecard, you can create a view of metrics, filtered by date, for the calls processed by the framework. Metrics such as average agent talk percentage provide insight into overall customer satisfaction for the selected date range. For example, you can view the average agent talk percentage for the past 24 hours, for the past week, and for the last 30 days.

Scorecard showing calls over time

Transcripts search and tag cloud

The combination of free-form search and word cloud makes it easy to find conversations involving specific categories (such as "returns") or keywords (such as product names). This capability helps analysts more deeply understand seasonal, event-driven, or other trends.

UI with free-form search and word cloud tags

Entity extraction

Understanding sentiment associated with a specific entity (such as company, product, store location, and sale) can help uncover trends. For example, you can learn whether a specific store, season, or campaign is associated with a large number of complaints.

Entity extraction that shows locations

Drill-down heatmaps

You can observe how sentiment fluctuates across the complete timeline of a call, and then drill down into a specific part of the conversation for playback if needed.

Heatmap of a call

The framework: visualize call recording insights

The framework that this solution offers is designed to let you quickly deploy and process audio files in near real time to gain insights from the recordings. This section explains how the components of the framework work together.

Architecture diagram

The following diagram illustrates the workflow and its components.

Architecture of the workflow

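The architecture includes the following Google Cloud components:

  • Cloud Storage, which stores the audio recordings.
  • Cloud Functions, which sends each uploaded file to the APIs.
  • Speech-to-Text, which transcribes the audio.
  • Natural Language API, which extracts sentiment and entities.
  • Cloud DLP, which redacts sensitive data.
  • Pub/Sub and Dataflow, which stream and transform the API responses.
  • BigQuery, which stores the results for querying.
  • App Engine, which hosts the application.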

Walking through the workflow

The following sections describe at a high level the steps you would take to use the framework.

Store audio recordings

The framework assumes that audio files are in a Cloud Storage bucket. In the framework design, you configure a Cloud Function to be triggered when you upload an audio file to the bucket. The Cloud Function sends the audio file to various APIs to be processed.

When you upload the audio file to the bucket, you can add custom metadata that can identify the recording with categories like caller ID, customer ID, and other metrics collected from the contact center. To add custom metadata to a Cloud Storage object, you pass an x-goog-meta- header with the gsutil -h flag. The following command shows an example; substitute your audio file name for [FILENAME] and your Cloud Storage bucket name for [BUCKET].

gsutil -h x-goog-meta-agentid:55551234 cp [FILENAME].flac gs://[BUCKET]

Process audio recordings

As noted, uploading audio files invokes a Cloud Function. (This Cloud Function must already exist.) For example, you might configure notifications so that when a file is uploaded, the upload triggers a function that starts the process of transcribing the audio file using the Speech-to-Text API, getting the sentiment from the Natural Language API, and redacting sensitive data with the Data Loss Prevention API.

exports.helloGCSGeneric = (event, callback) => {
  const file = event.data; // Cloud Storage object that triggered the function
  callback();
};

Gather data from the audio file

The Cloud Function sends the raw audio file to the Speech-to-Text API, which returns a response object. The function then sends the transcript output to the Natural Language API and the Data Loss Prevention API, which return additional response objects.

The response objects from the three APIs are then sent to Pub/Sub to be processed by Dataflow. In addition to the response object from the APIs, the Cloud Function can extract the custom metadata and add that field to the Pub/Sub payload using the following JavaScript code:

let strAgentID = object.metadata === undefined || object.metadata.agentid === undefined ? 'undefined' : object.metadata.agentid;

Adding the metadata to the payload lets you combine two disparate data sources: the contact center metadata and the API results.
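As a minimal sketch of this step, the following code merges the custom metadata with API results into a single payload and serializes it for Pub/Sub. The function names and the payload field names other than agentid are hypothetical, and mapping fileid to the object name is an assumption; the framework's actual payload may differ.

```javascript
// Sketch: merge custom object metadata with API results into one payload.
// buildPayload, toMessageData, and the payload field names are hypothetical.
function buildPayload(object, transcript, sentimentScore) {
  const metadata = object.metadata || {}; // absent when no custom metadata was set
  return {
    fileid: object.name, // assumption: the object name identifies the recording
    agentid: metadata.agentid === undefined ? 'undefined' : metadata.agentid,
    transcript: transcript,
    sentimentscore: sentimentScore,
  };
}

// Pub/Sub message data is a byte buffer, so the payload is serialized to JSON first.
function toMessageData(payload) {
  return Buffer.from(JSON.stringify(payload));
}

const examplePayload = buildPayload(
  {name: 'call-001.flac', metadata: {agentid: '55551234'}},
  'Hello. Thank you for calling.',
  0.4
);
console.log(toMessageData(examplePayload).toString());
```

Because the message is plain JSON, a downstream pipeline can map each key to a table column of the same name.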

Extract text from an audio recording

With Speech-to-Text, you can convert audio to text by applying neural network models in an API. The API recognizes 120 languages and variants to support your global user base. You can enable voice command-and-control, transcribe audio from call centers, and more. Speech-to-Text can process real-time streaming or prerecorded audio.

The following code snippet is an example of how you can process long audio files. It employs speaker diarization, a feature that detects when speakers change, and adds a numbered label to the individual voices detected in the audio.

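const audioConfig = {
   sampleRateHertz: 44100,
   languageCode: `en-US`,
   enableSpeakerDiarization: true,
   diarizationSpeakerCount: 2,
   enableAutomaticPunctuation: true,
   enableWordTimeOffsets: false,
   useEnhanced: true,
   model: 'phone_call'
 };

 // object is the Cloud Storage object that triggered the function.
 const audioPath = {
   uri: `gs://${object.bucket}/${object.name}`
 };

 const audioRequest = {
   audio: audioPath,
   config: audioConfig,
 };

 // Long audio files use the asynchronous longRunningRecognize method.
 return spclient
   .longRunningRecognize(audioRequest)
   .then(data => {
     const operation = data[0];
     // Wait for the long-running operation to finish.
     return operation.promise();
   });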

Redact sensitive information from transcripts

You can use Cloud DLP to better understand and manage sensitive data. It provides fast, scalable classification and redaction for sensitive data elements like credit card numbers, names, US social security numbers and selected international identifier numbers, phone numbers, and Google Cloud credentials.

Cloud DLP classifies this data using more than 90 predefined detectors to identify patterns, formats, and checksums. It even understands contextual clues. You can redact data using techniques like masking, secure hashing, bucketing, and format-preserving encryption.
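The masking technique mentioned above can be sketched as a de-identification request. The following code only builds the request object. The function name, project ID, and sample text are hypothetical, and the chosen infoTypes are just three of the predefined detectors.

```javascript
// Sketch: build a de-identification request that masks selected infoTypes.
// buildRedactRequest and the 'my-project' ID are hypothetical.
function buildRedactRequest(projectId, transcript) {
  const infoTypes = [
    {name: 'CREDIT_CARD_NUMBER'},
    {name: 'US_SOCIAL_SECURITY_NUMBER'},
    {name: 'PHONE_NUMBER'},
  ];
  return {
    parent: `projects/${projectId}/locations/global`,
    // Which detectors to run against the text.
    inspectConfig: {infoTypes},
    // How to transform each finding: replace every character with '*'.
    deidentifyConfig: {
      infoTypeTransformations: {
        transformations: [
          {
            infoTypes,
            primitiveTransformation: {
              characterMaskConfig: {maskingCharacter: '*'},
            },
          },
        ],
      },
    },
    item: {value: transcript},
  };
}

const redactRequest = buildRedactRequest(
  'my-project',
  'My card number is 4111 1111 1111 1111.'
);
console.log(JSON.stringify(redactRequest, null, 2));
```

In a deployed function, a request of this shape would typically be passed to the deidentifyContent method of the Cloud DLP client library.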

Analyze the data

After the audio file has been transcribed, the sentiment captured, and sensitive data redacted, you store the results in BigQuery to visualize and report on the insights collected.

Using the Pub/Sub-to-BigQuery template

You can use a Google-provided template to read and write the response objects. The Pub/Sub-to-BigQuery template creates a streaming pipeline. Dataflow reads JSON-formatted messages from a Pub/Sub topic and writes them to a BigQuery table.

The following code snippet shows a portion of the gcloud command that runs the Pub/Sub-to-BigQuery template.

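gcloud dataflow jobs run [JOB_NAME] \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --parameters \
inputTopic=projects/[PROJECT_ID]/topics/[TOPIC_NAME],\
outputTableSpec=[PROJECT_ID]:[DATASET].[TABLE_NAME]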

The following illustration shows the complete Speech Analysis Framework, with Dataflow reading the Pub/Sub messages, then converting them from JSON to TableRow format, and finally writing records to BigQuery.

Speech Analysis Framework

Determining the speaker, talk time, and silence time

You can use the speaker diarization and word timestamps features to determine the speaker, speaker talk time, and call silence. You can also create a sentiment heatmap for more details.

Call center leads can see the progression of the call, including how the call started and ended. In addition to the visual progression, they can also drill into each square to view the sentence sentiment.

Drill into sentence sentiment

By clicking further, they can read and listen to the sentence.

Playback for call recording

Using word timestamps

Speech-to-Text can include time offset (timestamp) values in the response text for your recognize request. Time offset values show the beginning and end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100 milliseconds.

The following response object sample includes the startTime and endTime for each word of the transcript. With these values, you can create custom metrics to see call silence time and how long each speaker speaks. You can also identify who the speaker is by using a keyword search within the transcript.

"words": [
          {
            "startTime": "1.300s",
            "endTime": "1.400s",
            "word": "Four"
          },
          {
            "startTime": "1.400s",
            "endTime": "1.600s",
            "word": "score"
          }
]

After collecting the word timestamps, you can create scorecards like the following:

Call scorecards
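The talk time and call silence metrics behind such scorecards can be sketched from the word timestamps. The following code is an illustration only: it assumes the words are ordered by start time, that offsets use the "1.300s" string form shown earlier, and that a speakerTag is present when diarization is enabled. The function names and the one-second silence threshold are hypothetical.

```javascript
// Convert a duration string such as "1.300s" into seconds.
function toSeconds(offset) {
  return parseFloat(offset.replace(/s$/, ''));
}

// Sum the time each speaker talks and the gaps between consecutive words.
// minGapSecs is a hypothetical threshold: shorter pauses are not counted as silence.
function callMetrics(words, minGapSecs = 1.0) {
  const talkSecs = {};
  let silenceSecs = 0;
  let previousEnd = null;
  for (const w of words) {
    const start = toSeconds(w.startTime);
    const end = toSeconds(w.endTime);
    const speaker = w.speakerTag === undefined ? 'unknown' : w.speakerTag;
    talkSecs[speaker] = (talkSecs[speaker] || 0) + (end - start);
    if (previousEnd !== null && start - previousEnd >= minGapSecs) {
      silenceSecs += start - previousEnd;
    }
    previousEnd = end;
  }
  return {talkSecs, silenceSecs};
}

const metrics = callMetrics([
  {word: 'Hello.', startTime: '0s', endTime: '1s', speakerTag: 1},
  {word: 'Welcome.', startTime: '1s', endTime: '2.5s', speakerTag: 1},
  {word: 'Hi.', startTime: '5s', endTime: '6s', speakerTag: 2},
]);
console.log(metrics);
```

Dividing silenceSecs by the call duration gives a call silence percentage, and each talkSecs entry divided by the duration gives a per-speaker talk percentage.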

Identifying different speakers (speaker diarization)

With speaker diarization, Speech-to-Text can recognize multiple speakers in the same audio clip. When you send an audio transcription request to Speech-to-Text, you can include a parameter telling Speech-to-Text to identify the different speakers in the audio sample.

Speech-to-Text detects when speakers change, and it adds a numbered label to the individual voices detected in the audio. A transcription result can include numbers for as many speakers as Speech-to-Text can uniquely identify in the audio sample.

The following code snippet is an example that shows how to enable speaker diarization:

const config = {
  encoding: `LINEAR16`,
  sampleRateHertz: 8000,
  languageCode: `en-US`,
  enableSpeakerDiarization: true,
  diarizationSpeakerCount: 2,
  model: `phone_call`
};

Next, you can build speaker transcripts:

Agent: Hello. Thank you for calling the Google merchandise store. How can I help you?
Client: Hi. I ordered a pair for Google crew socks last week.
Agent: Okay have they arrived?
Client: Yes, they have but I have two issues.
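One way to build such transcripts is to group consecutive words that share a speaker label. The following sketch assumes each word carries a numeric speakerTag, as diarization provides. The function name and the mapping of tag 1 to Agent and tag 2 to Client are assumptions for illustration, because the API returns only numbered labels.

```javascript
// Group consecutive words with the same speakerTag into one transcript line.
// The default label map (1 -> Agent, 2 -> Client) is a hypothetical choice.
function buildSpeakerTranscript(words, labels = {1: 'Agent', 2: 'Client'}) {
  const turns = [];
  for (const {word, speakerTag} of words) {
    const last = turns[turns.length - 1];
    if (last && last.speakerTag === speakerTag) {
      last.text += ' ' + word; // same speaker keeps talking
    } else {
      turns.push({speakerTag, text: word}); // speaker changed, start a new turn
    }
  }
  return turns.map(
    t => `${labels[t.speakerTag] || 'Speaker ' + t.speakerTag}: ${t.text}`
  );
}

const transcriptLines = buildSpeakerTranscript([
  {word: 'Okay', speakerTag: 1},
  {word: 'have', speakerTag: 1},
  {word: 'they', speakerTag: 1},
  {word: 'arrived?', speakerTag: 1},
  {word: 'Yes,', speakerTag: 2},
  {word: 'they', speakerTag: 2},
  {word: 'have', speakerTag: 2},
]);
console.log(transcriptLines.join('\n'));
```

Deciding which numbered speaker is the agent is a separate step, for example by searching for a scripted greeting in the first turn.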

Extracting sentiment from conversation

You can use the Natural Language API to understand sentiment about your product on social media, or to parse intent from customer conversations in a call center or a messaging app.

You can extract overall transcription sentiment, sentence sentiment, and entities. With this data available for the transcribed audio file, you can create heatmaps and sentiment timelines. You can also build word clouds.

The following example shows a code snippet to capture sentence sentiment:

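const sentenceSentiments = []; // hypothetical array that collects the results
nlClient // hypothetical name for a Natural Language API client instance
     .analyzeSentiment({document: document})
     .then(results => {

       const sentences = results[0].sentences;
       sentences.forEach(sentence => {
         sentenceSentiments.push({
           'sentence': sentence.text.content,
           'score': sentence.sentiment.score,
           'magnitude': sentence.sentiment.magnitude
         });
       });
     });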
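To turn sentence scores into heatmap cells, one option is to bucket each score. Sentiment scores from the Natural Language API range from -1.0 (negative) to 1.0 (positive). The function names and the 0.25 threshold in this sketch are assumptions, not values taken from the framework.

```javascript
// Map a sentiment score in [-1.0, 1.0] to a coarse bucket.
// The 0.25 threshold is a hypothetical choice for illustration.
function sentimentBucket(score, threshold = 0.25) {
  if (score >= threshold) return 'positive';
  if (score <= -threshold) return 'negative';
  return 'neutral';
}

// Produce one heatmap cell per sentence, in call order.
function toHeatmapCells(sentences) {
  return sentences.map((s, index) => ({
    index,
    sentence: s.sentence,
    bucket: sentimentBucket(s.score),
  }));
}

const cells = toHeatmapCells([
  {sentence: 'How can I help you?', score: 0.6, magnitude: 0.6},
  {sentence: 'I ordered socks last week.', score: 0.0, magnitude: 0.1},
  {sentence: 'I have two issues.', score: -0.5, magnitude: 0.5},
]);
console.log(cells.map(c => c.bucket).join(' '));
```

A UI can then assign a color to each bucket, so a call that starts positive and ends negative is visible at a glance.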

The framework described in this solution allows users to search for keywords in the transcript.

The following image shows a word cloud visualization created by the framework. The word cloud includes popular words extracted from the audio files. Users can search for these terms to mine the call data.

Word cloud

Building the sentiment heatmap

Because it can draw on rich response objects from Google Cloud APIs, the framework code can produce visualizations that let users explore by clicking. The code includes an API built with Express.js that leverages the BigQuery Node.js SDK to run SQL statements to retrieve data. The SQL commands are invoked in response to a user clicking on the visualization.

The following sample query retrieves all the words in the transcript, which are stored as a nested repeated field. The query statement is executed using the BigQuery SDK, which gets all the words from the relevant record.

const sqlQueryCallLogsWords = `SELECT
  ARRAY(SELECT AS STRUCT word, startSecs, endSecs FROM UNNEST(words)) words
  FROM \`` + bigqueryDatasetTable + `\`
  where fileid = \'` + queryFileId + `\'`
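The preceding query concatenates the file ID into the SQL text. As an alternative sketch, the file ID can be passed as a named query parameter, which avoids quoting problems when the value comes from a user request. The function name and the table identifier check are assumptions, and the code builds only the options object rather than calling BigQuery.

```javascript
// Build query options that pass the file ID as a named parameter (@fileid)
// instead of concatenating it into the SQL text.
// buildWordsQuery is a hypothetical helper name.
function buildWordsQuery(datasetTable, fileId) {
  // Table names cannot be query parameters, so check the identifier
  // before interpolating it. This pattern is an assumed, conservative rule.
  if (!/^[A-Za-z0-9_.-]+$/.test(datasetTable)) {
    throw new Error(`Unexpected table identifier: ${datasetTable}`);
  }
  return {
    query: `SELECT
      ARRAY(SELECT AS STRUCT word, startSecs, endSecs FROM UNNEST(words)) words
      FROM \`${datasetTable}\`
      WHERE fileid = @fileid`,
    params: {fileid: fileId},
  };
}

const wordsQueryOptions = buildWordsQuery('my_dataset.call_logs', 'call-001');
console.log(wordsQueryOptions.query);
```

An options object with query and params fields can be passed to the query method of the BigQuery Node.js client.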

After the SQL statement is executed, the response is sent to the UI.


The framework leverages multiple SQL statements to retrieve data from BigQuery. As an example, two objects that contain arrays are used to build the heatmap, as shown in the following table.

Object Name Contents
{sentence:sentenceString, sentiment: sentimentValue}

With these two objects, you can create a mapping of the start time of a sentence for audio playback, visually represent the sentiment of the sentence, and display the sentence.

When a user clicks on a heatmap square, the onSeekChange function is called using the following code:

onSeekChange = (value) => {
   let currentPlayerTime = this.player.getCurrentTime()
   // Assumption: the player exposes a seekTo(seconds) method.
   this.player.seekTo(value)
}

You capture the startTime value of the sentence that the user clicked, as represented in the following mapping:

{sentence:sentenceString, sentiment: sentimentValue, start: startTime}

You capture the startTime for the sentence by joining the sentences and then splitting them into an array. Next, you match the starting word in the sentence to the split array to find the startTime value. This way you can take the user to the start time of the audio file for playback.
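The matching step can be sketched as follows. This is one possible approach, not the framework's code: it assumes that splitting each sentence on whitespace yields tokens that line up with the transcript's word list, and that each word has the word and startSecs fields used in the earlier query. The function name is hypothetical.

```javascript
// Attach a start time to each sentence by matching its first token
// against the ordered list of transcript words.
function addStartTimes(sentences, words) {
  let cursor = 0; // index in words where the current sentence is expected to begin
  return sentences.map(s => {
    const tokens = s.sentence.trim().split(/\s+/);
    // Search forward from the cursor for the sentence's first token.
    let i = cursor;
    while (i < words.length && words[i].word !== tokens[0]) {
      i++;
    }
    if (i === words.length) {
      i = cursor; // no match found, fall back to the expected position
    }
    const start = i < words.length ? words[i].startSecs : null;
    cursor = Math.min(i + tokens.length, words.length);
    return {...s, start};
  });
}

const timedSentences = addStartTimes(
  [
    {sentence: 'Okay have they arrived?', sentiment: 0.1},
    {sentence: 'Yes, they have.', sentiment: 0.3},
  ],
  [
    {word: 'Okay', startSecs: 0},
    {word: 'have', startSecs: 0.4},
    {word: 'they', startSecs: 0.6},
    {word: 'arrived?', startSecs: 0.9},
    {word: 'Yes,', startSecs: 2},
    {word: 'they', startSecs: 2.3},
    {word: 'have.', startSecs: 2.5},
  ]
);
console.log(timedSentences);
```

Advancing the cursor past each matched sentence keeps a repeated word, such as "they" in this example, from matching an earlier occurrence.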

By combining the two APIs, you can produce visualizations like the following sentiment timeline:

Sentiment timeline visualization

What's next