Categorizing audio content using machine learning

This document describes the architecture for an audio categorization pipeline that uses machine learning to review audio files, transcribe them, and analyze them for sentiment. Using machine learning can make this process faster and more accurate than having people perform these tasks. The document is intended for architects or others who want to learn how to use Google Cloud products and Google Cloud machine learning APIs to help categorize audio content. An accompanying tutorial helps developers walk through the process of deploying a sample that illustrates this pattern.

Introduction

If you collect audio files as part of your business, you might want to extract text transcriptions of the audio. You might also want to categorize the content or add tags for search indexing. It takes time for people to create the transcriptions, and there's often so much audio content that it's impractical to have people transcribe or even categorize it all. In addition, when people tag content, if they supply their own tags, they might not include useful ones, or they might not tag content accurately.

By using machine learning, you can instead build an automated categorization pipeline. This document describes an approach to automating the review process for audio files by using Google Cloud machine learning APIs. This approach increases the efficiency of the categorization process and broadens its coverage by extending categorization to all audio content and by automatically generating tags. By using this approach, you can shift to proactively classifying all content and adding accurate tags for search operations.

The first step in categorizing an audio file is to transcribe the audio to text. After you've converted the audio file to text, you can use machine learning to get a summary of the content. You can then use this summary to extract common entities (proper nouns and common nouns) in the text, analyze the overall content, and then offer categorization and tags.

Challenges of building an audio processing pipeline

Building a pipeline to process user-generated audio files poses the following challenges:

  • Scalability: The number of audio file submissions can increase or decrease quickly, and the number can vary significantly over time. There might be peak upload times during an event or campaign that significantly increase the number of content submissions and therefore the processing load.
  • Performance: Processing each audio file requires an efficient pipeline. Audio files can be large and require significant time to process. The app must scale to be able to efficiently store and process each submitted audio file, store the resulting text, and then call the machine learning APIs to analyze the sentiment and perspective and store the results.
  • Intelligence: An audio format isn't conducive to traditional analysis, so the audio must first be converted to text. Converting audio to text is either a labor-intensive manual process or requires a machine learning-based approach. After the audio is converted to text, the entities, concepts, and sentiment must be reviewed for categorization and tagging.

Architecture

The following diagram shows the architecture for the audio categorization pipeline solution described in this document. The pipeline has the following fundamental characteristics that you can use for any use case that involves processing audio files:

  • An event-driven pipeline that starts automatically when audio content is uploaded to a storage location.
  • Scalable, serverless processing that's invoked automatically in response to events in the pipeline.
  • Machine learning that performs the tasks of transcribing the audio files and analyzing sentiment and entities. It uses existing machine learning models, so you don't need to create or find custom models.

Architecture of pipeline that processes audio files.

The pipeline illustrated in the diagram consists of the following processing steps:

  1. Upload audio file. An app or process uploads audio files to Cloud Storage. You could use a Dataflow pipeline or a real-time uploading process; this step is independent of the pipeline itself.
  2. Store audio file. The audio files are stored in a Cloud Storage bucket that operates as a staging bucket to hold the files before they run through the rest of the pipeline.
  3. Trigger Cloud Function. A Cloud Storage Object Finalize notification is generated whenever the audio files are uploaded to the staging bucket. The notification triggers a Cloud Function.
  4. Call the Speech-to-Text API. The Cloud Function calls the Speech-to-Text API to get a transcription of the audio file. This process is asynchronous, so the Speech-to-Text API returns a job ID to the Cloud Function.
  5. Publish Speech-to-Text job IDs. The job ID and the name of the audio file are published to a Pub/Sub topic. These published Pub/Sub messages with job IDs from different audio file submissions accumulate in the topic as more uploads occur.
  6. Speech-to-Text polling. On a scheduled frequency (for example, every 10 minutes), Cloud Scheduler publishes a message to a Pub/Sub topic, which triggers a second Cloud Function.
  7. Get Speech-to-Text API results. The second Cloud Function pulls all messages from the first Pub/Sub topic and extracts the job IDs and filenames for each message. It then calls the Speech-to-Text API to check the status of each job ID:

    • If a job is done, the transcription results for that job are written to a second Cloud Storage bucket. The Cloud Function then moves the audio file from the staging Cloud Storage bucket to a Cloud Storage bucket for processed files.
    • If the job is not done, a Pub/Sub message with the job ID and filename is added back to the Pub/Sub topic. That way, the job is rechecked the next time Cloud Scheduler triggers the Cloud Function.

    If for any reason a transcript is not returned from Speech-to-Text, the Cloud Function moves the audio file from the staging bucket to a Cloud Storage error bucket.

  8. Store Speech-to-Text API results. A text file that contains a transcript of the audio file is written to a Cloud Storage bucket.

  9. Trigger Cloud Functions. When the transcription file is uploaded to Cloud Storage, another Object Finalize notification is sent; this notification triggers further processing by two additional Cloud Functions.

  10. Call Perspective API. The third Cloud Function calls the Perspective API, which returns the probability of "toxicity" in the transcription. The Perspective API website describes this model as follows:

    The Perspective API model was trained by asking people to rate internet comments on a scale from "Very toxic" to "Very healthy" contribution. A toxic comment is defined as "a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion."

    When this analysis is complete, the Cloud Function writes the results to another Cloud Storage bucket.

  11. Call the Cloud Natural Language API. The fourth Cloud Function calls the Natural Language API to analyze the overall attitude or sentiment of the transcription, and to determine which entities are discussed. When the Cloud Function receives the results from the Natural Language API, the function writes the results to another Cloud Storage bucket.

  12. Analyze results. You get the results of the analysis by Speech-to-Text, the Natural Language API, and the Perspective API and integrate the information into your own review pipeline or app. In the preceding diagram, a web app that's hosted on App Engine provides a simple UI to view the results; the web app pulls its data from the outputs stored in Cloud Storage buckets.

Processing the audio files

After you've uploaded an audio file, the file is processed using several APIs. First, you use Speech-to-Text to convert the audio file to text. Next, the text is submitted to the Natural Language API and Perspective API to extract sentiment and entities. Processing the audio file to text can take a significant amount of time. Thus, the architecture uses Cloud Functions, Pub/Sub, and Cloud Storage notification features to implement a scalable, asynchronous event processing pipeline.

Cloud Functions

Cloud Functions provide a way to construct purpose-built, asynchronous functions without managing infrastructure. Calling the three APIs in the architecture described in this document requires an event-driven orchestrator. Cloud Functions can be triggered by Cloud Storage notifications or by Pub/Sub messages; both methods of triggering allow you to construct an event-driven architecture. Cloud Functions can dynamically scale up and down as new files are added and as processing volumes change.
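
As a minimal sketch, an event-driven, first-generation Python Cloud Function triggered by a Cloud Storage finalize event might look like the following. The function and message contents here are illustrative, not taken from the sample code.

```python
# main.py: minimal sketch of an event-driven Cloud Function (1st gen, Python),
# deployed with a google.storage.object.finalize trigger on the staging bucket.
# The function name and log message are illustrative, not from the sample code.

def handle_audio_upload(event, context):
    """Runs automatically each time an object is finalized in the trigger bucket."""
    bucket = event["bucket"]  # bucket that received the file
    name = event["name"]      # object name of the uploaded audio file
    print(f"New audio file gs://{bucket}/{name}; submitting it for transcription.")
    # The downstream steps (calling Speech-to-Text and publishing the job ID)
    # are sketched in the sections that follow.
```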

Pub/Sub messages

Pub/Sub provides a scalable messaging service that can be used to send and receive data and can also be used to trigger Cloud Functions. Pub/Sub provides the asynchronous messaging in this architecture in two ways, as illustrated in the sketch that follows this list:

  1. It provides a messaging queue for job IDs that are generated by the Speech-to-Text API. Later you can use the IDs to determine whether the jobs have completed.
  2. The Cloud Scheduler job sends a Pub/Sub message that triggers the Cloud Function that checks the results from Speech-to-Text for any messages that have not yet been processed.
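
The following sketch illustrates both roles, assuming that job IDs are queued as JSON messages; the topic name and function names are placeholders rather than names from the sample.

```python
# Sketch of the two Pub/Sub roles in this pipeline. The topic name and
# function names are placeholders, not names from the sample.
import base64
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()


def publish_job_id(project_id, job_id, audio_file):
    """Queues a Speech-to-Text job ID so the polling function can check it later."""
    topic_path = publisher.topic_path(project_id, "stt-job-ids")
    payload = json.dumps({"job_id": job_id, "audio_file": audio_file}).encode("utf-8")
    publisher.publish(topic_path, payload)


def check_jobs(event, context):
    """Pub/Sub-triggered Cloud Function invoked by the Cloud Scheduler message."""
    tick = base64.b64decode(event["data"]).decode("utf-8") if "data" in event else ""
    print(f"Scheduler tick {tick!r}: pulling queued job IDs to check their status.")
```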

Cloud Storage notifications

Cloud Storage notifications provide hooks into events that occur in Cloud Storage. Each time an object is uploaded or modified, a notification is generated. You can use this notification along with Cloud Functions to build an event-based system.

Cloud Storage notifications are used in two places in the architecture:

  • To notify a Cloud Function that a new audio file has been uploaded. This notification then starts the processing flow.
  • To notify two of the Cloud Functions that the text output of the Speech-to-Text API has been stored and to then run the text output through the Natural Language API and Perspective API.

Converting speech to text

Converting the audio recording to the text version of the content—that is, transcribing the audio recording—is a key step in this process. One way to perform this task is to implement a custom speech-to-text algorithm in one of the popular machine learning frameworks, such as TensorFlow. Another way is to use a pre-trained machine learning API such as the Speech-to-Text API. Each approach has its advantages and disadvantages.

Option 1: Build your own model in TensorFlow

Advantages

  • Allows you to train the model against specific phrases or domain-specific terms.

Disadvantages

  • Implementing speech-to-text is a solved problem, and creating a custom algorithm doesn't add value unless you have domain-specific terms.
  • Requires machine learning expertise to implement the model.
  • Requires significant effort to implement, select, and tune the model.
  • Requires you to manage and tune the model over time.

Option 2: Use Speech-to-Text

Advantages

  • Easy for developers to use because using an API is easier than developing a model and then building your own API to use it.
  • Doesn't require machine-learning expertise.
  • Provides coverage for more than 120 languages.

Disadvantages

  • Doesn't allow you to train the model against specific phrases or domain-specific terms.

Generally, using Speech-to-Text is an excellent option if your audio doesn't contain specific technical terms and doesn't contain uncommon phrases or phrases that are hard to decipher as words. Because the scenario described earlier evaluates the probability that an audio recording contains inappropriate content, the transcription from Speech-to-Text is accurate enough for that classification even if it misses some technical terms.

Given the broad range of languages covered by Speech-to-Text (more than 120) and the ease of integrating it into a Google Cloud-based solution, it's the recommended choice for the scenario described in this document.

Calling the Speech-to-Text API

The Speech-to-Text API provides both synchronous and asynchronous transcription services. To process lengthy audio files, it makes the most sense to use the asynchronous transcription mode. When you use asynchronous mode, you submit an audio file to the API, and the API returns a job ID. You can then poll with the job ID to check the job status.
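
As a sketch of the submission step, the following uses the google-cloud-speech client to start a long-running transcription job. The Cloud Storage URI and language code are placeholders, and the attribute used to read the operation name can vary by client library version.

```python
# Sketch: submit an asynchronous (long-running) transcription request.
# The Cloud Storage URI and language code are placeholders.
from google.cloud import speech_v1


def submit_transcription(gcs_uri):
    """Starts an asynchronous Speech-to-Text job and returns its operation (job) name."""
    client = speech_v1.SpeechClient()
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.FLAC,  # lossless source
        language_code="en-US",
    )
    audio = speech_v1.RecognitionAudio(uri=gcs_uri)

    operation = client.long_running_recognize(config=config, audio=audio)
    # The long-running operation name serves as the job ID that the pipeline
    # publishes to Pub/Sub for later polling.
    return operation.operation.name
```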

In the architecture described in this document, Cloud Scheduler is used to regularly check the job status for all outstanding jobs. Cloud Scheduler triggers a Cloud Function that pulls all of the job IDs from the Pub/Sub topic. The Cloud Function checks the status of each job by calling the Speech-to-Text API. The API returns the results for any jobs that are completed; these results are then stored as text files in Cloud Storage. For jobs that aren't yet complete, the job ID is sent back to the Pub/Sub topic for processing during the next iteration.
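
The polling step might look like the following sketch, which assumes job IDs were queued in Pub/Sub as JSON messages (as in the earlier sketch) and reads job status from the Speech-to-Text operations endpoint; the subscription and topic names are placeholders, and the sample's own code may organize this differently.

```python
# Sketch: check the status of outstanding Speech-to-Text jobs.
# Assumes job IDs were queued as JSON Pub/Sub messages (see the earlier sketch)
# and that job status is read from the Speech-to-Text operations endpoint.
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import pubsub_v1


def poll_jobs(project_id, subscription_id="stt-job-ids-sub", max_messages=50):
    credentials, _ = google.auth.default()
    session = AuthorizedSession(credentials)

    subscriber = pubsub_v1.SubscriberClient()
    publisher = pubsub_v1.PublisherClient()
    sub_path = subscriber.subscription_path(project_id, subscription_id)
    topic_path = publisher.topic_path(project_id, "stt-job-ids")

    response = subscriber.pull(subscription=sub_path, max_messages=max_messages)
    for received in response.received_messages:
        job = json.loads(received.message.data.decode("utf-8"))
        status = session.get(
            f"https://speech.googleapis.com/v1/operations/{job['job_id']}"
        ).json()
        if status.get("done"):
            print(f"{job['audio_file']}: transcription complete")
            # Write status["response"] to the transcript bucket here.
        else:
            # Not done yet: requeue the same message for the next scheduled run.
            publisher.publish(topic_path, received.message.data)
        subscriber.acknowledge(subscription=sub_path, ack_ids=[received.ack_id])
```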

The Speech-to-Text API accepts a list of specific audio encodings. For best results, you should work with audio sources that have been captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition might be reduced if you use lossy codecs to capture or transmit audio, particularly if there's background noise in the recording. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.

Analyzing sentiment and entities

After the audio content has been converted to text, it's possible to perform further analysis on the content to determine which entities are mentioned in the audio along with the sentiment for each. There are several ways to extract entities and sentiment from text; some approaches are more complex than others.

The first option is to build your own entity and sentiment extractor using a machine learning framework such as TensorFlow. Another option is to leverage the work already done by others and use transfer learning to customize an existing model. Using a pre-trained machine learning API such as the Natural Language API or AutoML Natural Language is an additional option. Each approach has its advantages and disadvantages.

Option 1: Build your own model in TensorFlow or use transfer learning

Advantages:

  • Allows you to train the model against specific entities or domain-specific terms.

Disadvantages:

  • Implementing entity extraction is a solved problem and doesn't add value unless you have domain-specific terms.
  • Requires machine learning expertise to implement the model.
  • Requires significant effort in model implementation, model selection, and tuning.
  • Requires you to manage and tune the model over time.

Option 2: Use the Natural Language API

Advantages:

  • Easy for developers to use because using an API is easier than developing a model and then building the API to use it.
  • Doesn't require machine learning expertise, because the model is pre-built, trained, and hosted.
  • Includes sentiment analysis, entity analysis, entity sentiment analysis, content classification, and syntax analysis.

Disadvantages:

  • Doesn't allow you to train against specific entities or domain-specific terms.

Option 3: Use AutoML Natural Language

Advantages:

  • Doesn't require machine learning expertise, because model training and hosting are managed for you.
  • Allows you to train the model against specific entities or domain-specific terms.
  • Surfaces the resulting trained model as an API.

Disadvantages:

  • Requires you to supply training data.
  • Requires you to manage and tune the model over time.

Generally, using the Natural Language API is an excellent option if your text doesn't contain technical terms and doesn't contain uncommon phrases or phrases that are hard to decipher as words. In the use case described in this document, because the sample extracts sentiment and entities from the text, the Natural Language API provides an acceptable level of performance. Given the ease of using the API as compared to developing a custom machine learning model, you should use the Natural Language API.

Calling the Natural Language API

In the architecture described in this document, the Natural Language API is used to extract the entities and perform entity sentiment analysis. This information is then used to tag the text extracted from the audio file and provide you with an understanding of the overall theme of the text.
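
A minimal sketch of that call with the google-cloud-language client might look like the following; the transcript text passed in is a placeholder.

```python
# Sketch: extract overall sentiment plus entities and entity sentiment from a
# transcript with the Natural Language API. The transcript text is a placeholder.
from google.cloud import language_v1


def analyze_transcript(text):
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )

    sentiment = client.analyze_sentiment(document=document).document_sentiment
    entities = client.analyze_entity_sentiment(document=document).entities

    print(f"Overall sentiment: score={sentiment.score:.2f}, "
          f"magnitude={sentiment.magnitude:.2f}")
    for entity in entities:
        print(f"{entity.name} ({language_v1.Entity.Type(entity.type_).name}): "
              f"sentiment score={entity.sentiment.score:.2f}")
```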

Analyzing text perspective

Another signal in the analysis is the negative impact that a remark or comment might have on a given topic. The Perspective API was created by Jigsaw and Google's Counter Abuse Technology team and launched as an open source project. It determines the probability that a given comment can be perceived as "toxic." The API uses machine learning models to score the perceived impact that a comment might have on a conversation.

The Perspective API

The Perspective API offers several models, including TOXICITY, PROFANITY, and INSULT. The sample architecture uses the TOXICITY model, which reports the probability that the supplied text is a "rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion." Because the text content includes the start and end times in the corresponding audio file, the results from the Perspective API are stored with those start and end times. This lets you associate the Perspective API results with a given section of the original audio file.
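
As a sketch, a Cloud Function might request a TOXICITY score for one transcript segment as follows; the API key and the way segments are passed in are assumptions for illustration.

```python
# Sketch: request a TOXICITY score from the Perspective API for one transcript
# segment. The API key and segment text are placeholders.
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def score_toxicity(text, api_key):
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    response.raise_for_status()
    result = response.json()
    # Probability (0-1) that the text would be perceived as toxic.
    return result["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```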

Viewing audio pipeline results

The processing pipeline is initiated when you upload an audio file to Cloud Storage. In practice, you can integrate this step into any user-upload process. The upload can happen in real time from your web app, or in a batch that uploads many audio files at once through a data pipeline.

When the processing has completed, you can read the results of the analysis that was created by the APIs used in the architecture. The architecture diagram illustrates a web UI that displays the text content, the entities and the sentiment identified in the text, and the start and end time for the related audio clip. As shown in the diagram, the web app uses App Engine and Cloud Storage.

App Engine

The web app is implemented as an App Engine app. App Engine is a platform-as-a-service (PaaS) product that supports many common web programming languages and can automatically scale up and down based on user traffic. App Engine is well integrated with Google Cloud, which simplifies the process of uploading the audio files to Cloud Storage.
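
A minimal sketch of such a web app, assuming the analysis results are stored as JSON objects in a results bucket (the bucket and object naming here are assumptions), might look like this:

```python
# Sketch: a minimal App Engine (Python) handler that reads stored analysis
# results from Cloud Storage. Bucket and object naming are assumptions.
import json

from flask import Flask, jsonify
from google.cloud import storage

app = Flask(__name__)
RESULTS_BUCKET = "audio-pipeline-results"  # placeholder bucket name


@app.route("/results/<path:audio_file>")
def results(audio_file):
    """Returns the stored transcription and analysis results for one audio file."""
    bucket = storage.Client().bucket(RESULTS_BUCKET)
    blob = bucket.blob(f"{audio_file}.json")
    return jsonify(json.loads(blob.download_as_text()))
```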

What's next