Best practices: Topic modeling

It is critical for many organizations to track trending and emerging topics. Contact Center AI Insights lets you detect and identify top conversation drivers from calls or chats without prior labeling. Follow these best practices when using the CCAI Insights topic modeling V2.1 feature. Because the modeling is unsupervised, setup is simpler, and the top drivers in analyzed conversations are identified automatically.

Assessing training data quality

For voice data, the quality of Speech-to-Text outputs is critical to the performance of the topic model.

  • Ensure that the conversation's speaker roles are assigned properly when the conversation is ingested:

    • Accurately label conversation turns as coming from customer or agent.
    • Use AGENT for human agents, AUTOMATED_AGENT for virtual agents, and END_USER or CUSTOMER for customer roles.
    • Make sure that most conversations have transcripts with customer and agent roles labeled. Conversations with only one role won't be used in training.
  • Ensure that the conversations are sufficiently long: aim for about 10 total turns, with 5 from the agent and 5 from the customer.

  • Avoid using duplicate conversations in the dataset.
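The checks in the list above can be sketched as a small pre-ingestion validation step. This is a minimal sketch, not part of the Insights API: the transcript format (a list of turns, each with a "role" string) is a hypothetical structure assumed for illustration.

```python
# Hypothetical transcript format: a list of turns, each a dict with a
# "role" key set to "AGENT", "AUTOMATED_AGENT", "END_USER", or "CUSTOMER".

AGENT_ROLES = {"AGENT", "AUTOMATED_AGENT"}
CUSTOMER_ROLES = {"END_USER", "CUSTOMER"}

def is_usable_for_training(turns, min_agent=5, min_customer=5):
    """Return True if a conversation has both roles labeled and enough turns."""
    agent_turns = sum(1 for t in turns if t["role"] in AGENT_ROLES)
    customer_turns = sum(1 for t in turns if t["role"] in CUSTOMER_ROLES)
    # Conversations with only one role labeled won't be used in training.
    if agent_turns == 0 or customer_turns == 0:
        return False
    return agent_turns >= min_agent and customer_turns >= min_customer
```

Running such a check before ingestion makes it easier to see how much of your dataset would actually contribute to training.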

For better-quality topics from the model, try using redacted conversations for topic modeling. However, if the redaction is overly aggressive and strips important information from the transcripts, it can shorten your training conversations. If applicable, check the Cloud Data Loss Prevention redaction quality.
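One rough way to spot overly aggressive redaction is to measure what fraction of a transcript has been replaced by redaction markers. This sketch assumes redacted spans are replaced by a single placeholder token such as "[REDACTED]"; the actual marker depends on your Cloud DLP configuration.

```python
def redaction_ratio(transcript_text, marker="[REDACTED]"):
    """Fraction of whitespace-separated tokens that are redaction markers."""
    tokens = transcript_text.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok == marker) / len(tokens)
```

Transcripts with an unusually high ratio are candidates for reviewing the redaction configuration before training.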

Data requirements, including smaller datasets

A minimum of 1,000 conversations with 5 back-and-forth turns between an agent and a customer is required. We also recommend using about 10,000 conversations for training.
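These dataset-size thresholds can be expressed as a trivial status check. This is only an illustrative sketch of the stated requirements; the per-conversation turn counts are assumed to have been validated separately.

```python
MIN_CONVERSATIONS = 1_000          # required minimum
RECOMMENDED_CONVERSATIONS = 10_000  # recommended for training

def dataset_status(eligible_count):
    """Classify an eligible-conversation count against the requirements."""
    if eligible_count < MIN_CONVERSATIONS:
        return "insufficient"
    if eligible_count < RECOMMENDED_CONVERSATIONS:
        return "minimum met"
    return "recommended met"
```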

Train the topic model

Start by training an unsupervised topic model without providing custom topics. Although it's optional, set a language filter to match the language used in the conversations.
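As a rough sketch, a training request through the Insights REST API's issueModels.create method might look like the following. The field names, and especially whether language_code is accepted in the conversation filter, are assumptions to verify against the Insights API reference rather than confirmed syntax:

```json
{
  "displayName": "unsupervised-topic-model-v1",
  "inputDataConfig": {
    "filter": "language_code = \"en-US\""
  }
}
```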

When training the topic model, understand your business use case and decide on the granularity of the topics that you want. The use case might include what you're looking for in the topic model or the business value the model brings. Granularity determines the model size used in training. Ask yourself: what is the rough number of topic clusters in the existing solution, if any?

If quality from the unsupervised model is low, try training another topic model with a list of custom topics.

Use the custom topics

If you want specific topics, add them during or after topic model creation.

  • Topic names: Use short, descriptive topic names of three to six words, such as "troubleshooting remote control" or "inquiring about billing policy", and avoid generic or abstract names, such as "Sales". You can use readily available custom topic names, such as "Billing", or you can add a short description to the topic name, as in "Billing Errors & Refunds". Choose the model configuration that suits the kind of results that you're looking for.

  • Descriptions: Use a sentence with a general description followed by one with a few examples. Avoid personal information like names, dates, or locations. For example: "The customer is inquiring about their landline phone service. They may want to cancel it or consult about the current billing" or "The customer is inquiring about their bill. They may want to know the amount or the due date."
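Putting the two bullets above together, a custom-topic list might look like the following. The topic entries are illustrative examples drawn from the guidance, not a required schema, and the word-count check is a hypothetical helper for enforcing the three-to-six-word naming rule.

```python
# Illustrative custom topics: short descriptive names, plus a description
# made of a general sentence followed by a sentence with a few examples.
custom_topics = [
    {
        "name": "troubleshooting remote control",
        "description": (
            "The customer is having problems with their remote control. "
            "They may report unresponsive buttons or pairing failures."
        ),
    },
    {
        "name": "inquiring about billing policy",
        "description": (
            "The customer is inquiring about their bill. "
            "They may want to know the amount or the due date."
        ),
    },
]

def name_is_descriptive(name, min_words=3, max_words=6):
    """Check that a topic name is short and descriptive, not a single generic word."""
    return min_words <= len(name.split()) <= max_words
```

A check like name_is_descriptive flags names such as "Sales" that are too generic for the model to anchor on.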

Evaluating the topic model

The evaluation of a trained topic model can be carried out in two phases: validation of the generated topics and descriptions, followed by evaluation of the analyzed conversations for the identified topics.

Step 1: Validation of generated topics and descriptions

We recommend that you remove topics that aren't relevant to your needs. For example, remove a "greetings" topic that mostly covers greeting utterances in the conversations. For details about removing a topic, see Delete a Topic.

Merging topics is useful if two or more topics have conversations with similar subject matter and you want them under a unified topic. For example, a custom topic that you provide and an identified topic might serve overlapping needs. If "billing" was one of the custom topics given and "troubleshooting billing issues" was one of the topics identified by the model, then you might merge them.

Add a new topic for any set of topics not identified by the topic model. You're most likely to see this scenario with a model trained without custom topics.

Step 2: Evaluate analyzed conversations

When step 1 is complete, evaluate the analyzed conversations for the topics identified.

Why are fewer topics learned than selected for granularity?

If there are few topics to be discovered, the final number of topics might be significantly lower than the numbers indicated by the selected granularity.

  • More coarse: up to 30 topics
  • Coarse: up to 50 topics
  • Standard: up to 150 topics
  • Fine: up to 200 topics
  • More fine: up to 350 topics

The number of topics learned by the trained topic model can also be lower due to data quality issues. Ensure that the data requirements are met in terms of training conversation count, agent or customer turns in the conversations, and STT and redaction quality.

How can I increase the topic count?

Verify that the training data quality requirements are met, as previously discussed, in terms of training conversation count, agent and customer turns per conversation, and STT and redaction quality. If conversations run for too many turns, you can experiment with removing their authentication portion.

Experimenting with a minimum turn-count filter (for example, 5, 10, 15, or 20) when selecting training conversations can help increase the number of topics learned. It also helps discard conversations that are too short to predict topics from.
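The turn-count experiment above can be sketched as a simple threshold sweep over your candidate conversations. This is an illustrative helper, not part of the Insights API; conversations are assumed to be represented as lists of turns.

```python
def sweep_min_turns(conversations, thresholds=(5, 10, 15, 20)):
    """Map each minimum-turn threshold to the number of conversations that pass it."""
    return {
        t: sum(1 for conv in conversations if len(conv) >= t)
        for t in thresholds
    }
```

Comparing the surviving counts per threshold shows how aggressively each filter shrinks the training set before you commit to one.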

Why is training count smaller than conversation count?

The training volume count is the number of conversations used to train the topic model after the provided conversation data is downsampled during the training phase.

During training, conversations that are too short, or that don't have long enough customer utterances, are filtered out. After the model is trained, the training volume shown in the CCAI Insights console represents the post-filter conversation set used for training.

Why does my model have more topics than the provided custom ones?

Depending on the model size, additional topics can appear in the model outputs. The model size caps the number of topics in the output, but the actual number depends on the topics that the model actually learns from the data.

If you train a model with the "More fine" model size and expect it to learn up to 350 topics, but you provide only 30 custom topics, you might see only 32 topics in the model outputs. This is because the model is trained on conversations that cover about 30 topics, not 350.