Topic modeling best practices

Follow these best practice guidelines to get the most out of your topic models.

Fine-tune a topic model

The best way to improve topic assignments is to fine-tune your model. Follow these guidelines to optimize your topic model when adding, editing, and removing topics.

Add or edit a topic

Avoid adding duplicate or similar topics because they will negatively impact the quality of topic inferences. When creating or changing a topic, apply the following naming and description guidelines.

Name

Use short, descriptive topic names of three to six words, such as "troubleshooting remote control" or "inquiring about billing policy".
Avoid generic or abstract names, such as Sales.

Optionally, follow these best practices:

Use readily available custom topic names, such as Billing.
Add a short description to the topic name, as in "Billing Errors and Refunds".
Choose a suitable model configuration based on the results you want.

Example

A credit card support center runs topic modeling on its archived support call logs. The modeling creates a topic from a cluster of conversations and names it "Credit card over the limit inquiries". The business shortens the name to "Credit limit inquiries".
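The naming guidelines above can be turned into a quick pre-check before a topic is added or renamed. The following is a minimal sketch, not part of any product API: the function name check_topic_name and the GENERIC_NAMES set are hypothetical, and only "Sales" comes from the guidelines; the other entries are illustrative assumptions. The check returns warnings instead of rejecting a name, because the optional practices also allow short names.

```python
# Hypothetical set of names treated as too generic. Only "sales" is taken
# from the guidelines; the rest are illustrative assumptions.
GENERIC_NAMES = {"sales", "support", "general", "other"}


def check_topic_name(name: str) -> list[str]:
    """Return a list of warnings for a proposed topic name."""
    warnings = []
    word_count = len(name.split())
    # Guideline: short, descriptive names of three to six words.
    if not 3 <= word_count <= 6:
        warnings.append(
            f"{word_count} word(s); the guideline suggests three to six"
        )
    # Guideline: avoid generic or abstract names.
    if name.strip().lower() in GENERIC_NAMES:
        warnings.append("name looks generic or abstract")
    return warnings


if __name__ == "__main__":
    for candidate in ("Credit limit inquiries", "Sales"):
        print(candidate, "->", check_topic_name(candidate))
```

A name that produces warnings is not necessarily wrong; treat the output as a prompt to review the name by hand.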

Description

Use a general description followed by a few examples.
Avoid including personal information like names, dates, or locations.
Avoid too much detail, such as "don't include X topic", because it can negatively impact topic inference.

Examples

The customer is inquiring about their landline phone service. They may want to cancel it or consult about the current billing.
The customer is inquiring about their bill. They may want to know the amount or the due date.
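Because duplicate or similar topics reduce inference quality, it can help to compare topic names against each other before adding or editing one. The sketch below uses Python's standard difflib module to flag name pairs with high character-level similarity. The function name similar_topic_pairs and the 0.8 threshold are assumptions chosen for illustration, and string similarity is only a rough proxy: two topics can overlap in meaning without sharing wording, so a manual review of descriptions is still needed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_topic_pairs(names, threshold=0.8):
    """Return (name_a, name_b, ratio) for name pairs at or above threshold.

    The threshold is an illustrative assumption, not a documented value.
    """
    pairs = []
    for a, b in combinations(names, 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for
        # strings with nothing in common.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


if __name__ == "__main__":
    topics = [
        "Billing errors and refunds",
        "Billing error and refunds",
        "Troubleshooting remote control",
    ]
    for a, b, ratio in similar_topic_pairs(topics):
        print(f"Possible duplicate: {a!r} and {b!r} (similarity {ratio})")
```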

Remove secondary topics

After you've deployed your topic model and completed an analysis, check the topic distribution on the Topic Model Deployed data page. Secondary topics might be the dominant topic in deployed results because they can be common and have stronger matches. Topics that match a high proportion (more than 30%) of your sample conversations are likely secondary topics. Carefully examine these topics and delete them if they aren't relevant.

Whether irrelevant secondary topics exist depends heavily on the input data. If all the major topics on the Deployed data page have a relatively even distribution, and each topic matches only a small proportion (less than 20%) of conversations, then there are probably no secondary topics to delete.
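The two thresholds above can be applied to a list of topic assignments to get a first pass at which topics deserve a closer look. This is a minimal sketch under stated assumptions: it takes a plain list with one dominant topic name per conversation, which you would need to collect from your own analysis results, and the function names are hypothetical. It only flags candidates; whether a flagged topic is actually irrelevant remains a manual judgment.

```python
from collections import Counter


def topic_shares(assignments):
    """Map each topic to the fraction of conversations assigned to it."""
    counts = Counter(assignments)
    total = len(assignments)
    return {topic: n / total for topic, n in counts.items()}


def likely_secondary_topics(assignments, high=0.30):
    """Topics matching more than `high` of conversations (30% guideline)."""
    shares = topic_shares(assignments)
    return sorted(topic for topic, share in shares.items() if share > high)


def looks_evenly_distributed(assignments, low=0.20):
    """True if every topic matches less than `low` of conversations."""
    return all(share < low for share in topic_shares(assignments).values())


if __name__ == "__main__":
    # Hypothetical sample: one dominant topic per conversation.
    sample = (
        ["Greeting"] * 40 + ["Billing"] * 25 + ["Cancel"] * 20 + ["Other"] * 15
    )
    print("Review these topics:", likely_secondary_topics(sample))
    print("Evenly distributed:", looks_evenly_distributed(sample))
```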

Training data

For voice data, the quality of Speech-to-Text outputs is critical to the performance of the topic model. Follow these guidelines to improve the quality of your training data.

Conversations

Avoid using duplicate conversations in the dataset.
Each conversation should contain at least 10 total turns, with 5 from the agent and 5 from the customer.
Use redacted conversations, but check the Cloud Data Loss Prevention redaction quality. Sometimes redaction removes important information from the transcripts, which can affect the length of your training conversations.
Make sure almost all the conversations are in the same language.
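The duplicate and turn-count guidelines above lend themselves to a simple filter that runs before conversations are used for training. The sketch below is illustrative only: it assumes each conversation is a list of (role, text) tuples, which is a hypothetical in-memory representation and not the ingestion format, and it reads "5 from the agent and 5 from the customer" as minimums. It also counts AUTOMATED_AGENT turns on the agent side, which is an assumption.

```python
# Role labels as used for conversation turns; grouping AUTOMATED_AGENT
# with the agent side is an assumption of this sketch.
AGENT_ROLES = {"AGENT", "AUTOMATED_AGENT"}
CUSTOMER_ROLES = {"END_USER", "CUSTOMER"}


def has_enough_turns(turns, min_total=10, min_per_side=5):
    """Check the turn-count guideline for one conversation."""
    agent = sum(1 for role, _ in turns if role in AGENT_ROLES)
    customer = sum(1 for role, _ in turns if role in CUSTOMER_ROLES)
    return (
        len(turns) >= min_total
        and agent >= min_per_side
        and customer >= min_per_side
    )


def filter_training_conversations(conversations):
    """Drop exact duplicates and conversations with too few turns."""
    seen = set()
    kept = []
    for turns in conversations:
        # Normalize whitespace and case so trivially different copies match.
        key = tuple((role, text.strip().lower()) for role, text in turns)
        if key in seen:
            continue
        seen.add(key)
        if has_enough_turns(turns):
            kept.append(turns)
    return kept


if __name__ == "__main__":
    long_call = [("AGENT", "How can I help?"), ("CUSTOMER", "My bill is wrong.")] * 5
    short_call = long_call[:6]
    kept = filter_training_conversations([long_call, list(long_call), short_call])
    print(f"Kept {len(kept)} of 3 conversations")
```

Exact-match deduplication will miss near-duplicates such as the same call transcribed twice with small differences, so a fuzzier comparison may be worth adding for voice data.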

Speaker roles

Make sure that the conversation's speaker roles are assigned properly after the conversation is ingested.

Accurately label conversation turns as coming from a customer or an agent. Conversations with only one role won't be used in training.
Use AGENT for human agents and AUTOMATED_AGENT for virtual agents.
Use END_USER or CUSTOMER for customer roles.
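A short check on role labels can catch conversations that would be skipped in training because only one side is present. The following is a minimal sketch, assuming the roles for a conversation are available as a list of strings in turn order; the function name role_problems is hypothetical. The four role values are the ones listed above.

```python
AGENT_ROLES = {"AGENT", "AUTOMATED_AGENT"}
CUSTOMER_ROLES = {"END_USER", "CUSTOMER"}
VALID_ROLES = AGENT_ROLES | CUSTOMER_ROLES


def role_problems(turn_roles):
    """Return a list of problems found in one conversation's role labels."""
    problems = []
    roles = set(turn_roles)
    # Labels are compared exactly, so "agent" is reported as unrecognized.
    unknown = sorted(roles - VALID_ROLES)
    if unknown:
        problems.append(f"unrecognized roles: {unknown}")
    # Conversations with only one role won't be used in training.
    if not roles & AGENT_ROLES:
        problems.append("no agent turns")
    if not roles & CUSTOMER_ROLES:
        problems.append("no customer turns")
    return problems


if __name__ == "__main__":
    print(role_problems(["AGENT", "CUSTOMER", "AGENT"]))
    print(role_problems(["AGENT", "AGENT"]))
```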