Quality AI best practices

This document outlines Google's recommendations for how best to use Quality AI. Following the guidelines in this document will ensure that Quality AI provides the most accurate and useful information possible for your business needs.

Scorecards

Scorecards provide access to agent performance metrics and detailed instructions for answering questions about a conversation. You must enter your conversation data, questions, and possible answer options, along with instructions for how to interpret those answers. For best results, use the Scorecard console to upload your labeled data.

Conversation data

Conversation data are transcripts of either voice or chat conversations with personally identifiable information redacted. Upload at least 2,000 conversations for each business unit or call center.

You can also upload audio recordings of voice conversations. For best results, record audio using the following specifications:

  • Two channels
  • 16,000 Hz sampling rate (or 8,000-48,000 Hz)
  • Lossless encoding: FLAC or LINEAR16
  • Encoding for WAV audio files: LINEAR16 or MULAW
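
If you want to verify recordings before upload, the following sketch uses the Python soundfile library (an assumed dependency, not part of Quality AI) to check a file against these recommendations. The file name is hypothetical.

  import soundfile as sf  # assumed dependency: pip install soundfile

  def check_recording(path):
      """Return a list of ways the file deviates from the recommended specs."""
      info = sf.info(path)
      issues = []
      if info.channels != 2:
          issues.append(f"expected 2 channels, found {info.channels}")
      if not 8000 <= info.samplerate <= 48000:
          issues.append(f"sample rate {info.samplerate} Hz is outside 8,000-48,000 Hz")
      elif info.samplerate != 16000:
          issues.append(f"sample rate is {info.samplerate} Hz; 16,000 Hz is recommended")
      if info.format not in ("FLAC", "WAV"):
          issues.append(f"format {info.format}; FLAC or LINEAR16 (PCM WAV) is recommended")
      return issues

  for issue in check_recording("example_call.wav"):  # hypothetical file name
      print(issue)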

The metadata for audio recordings of a voice call should include the following information:

  • Channel labels to identify the agent and customer
  • Agent ID, name, location, queue, and wait time
  • Audio language as a BCP-47 language tag, such as en-US
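
For illustration, the metadata for one recording might look like the following sketch. The field names are hypothetical rather than a required schema, so map them to whatever your ingestion pipeline expects.

  # Illustrative metadata for a single voice call; field names are hypothetical.
  call_metadata = {
      "channel_labels": {"1": "AGENT", "2": "CUSTOMER"},
      "agent": {
          "id": "agent-1234",
          "name": "Alex Example",
          "location": "Austin, TX",
          "queue": "billing",
          "wait_time_seconds": 45,
      },
      "language_code": "en-US",  # BCP-47 language tag
  }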

Questions

Within each scorecard, the questions and instructions for answering them provide valuable information for Quality AI to evaluate conversations and agent performance. To maximize the accuracy of automatic evaluations, write questions and instructions with the following concepts in mind:

  • Clarity: Write questions that are clear and easy for a human to understand.
  • Specificity: Add answer options and instructions that are as specific as possible.
  • Details: Include instructions that provide enough details for a human to confidently and reliably evaluate the conversations.
  • Examples: Quality AI is even more accurate if you provide examples from real conversations that illustrate each answer to your questions.
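
As a sketch of how these concepts fit together, a question entry might look like the following. The structure and field names are illustrative only and don't reflect the Scorecard console's exact schema.

  # Illustrative question definition; field names are hypothetical.
  question = {
      "question": "Did the agent offer a callback when the issue was unresolved?",
      "answer_options": ["YES", "NO", "N/A"],
      "instructions": (
          "Answer YES only if the agent explicitly offers to call the customer back. "
          "Answer NO if the issue was unresolved and no callback was offered. "
          "Answer N/A if the issue was resolved during the conversation."
      ),
      "examples": [
          {"answer": "YES", "excerpt": "I can call you back tomorrow once the refund posts."},
          {"answer": "NO", "excerpt": "There's nothing more I can do for you right now."},
      ],
  }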

Questions can take a variety of forms. The following are some useful question templates:

  1. "Did the agent…?" with a specific action. This format indicates that the evaluator must look for something the agent said.
  2. "Did the customer…?" with a specific action. This format indicates that the evaluator must look for something the customer said.
  3. Questions that begin with question words, such as what or why, encourage evaluation of the whole conversation.

Questions with multiple answers

Users often write questions with only yes and no answers. However, a question might not apply to a given conversation, which warrants an N/A answer. Alternatively, a question might reasonably be interpreted as either yes or no depending on the circumstances, which leads to inconsistent responses when only two options are available. Including questions that require other types of answers shows a greater depth of understanding of the conversation.

Acoustic analysis

Quality AI evaluates conversation transcripts and cannot perform acoustic analysis. Exclude questions that require acoustic analysis. For example, neither a person nor Quality AI can answer the question "Did the agent use a greeting with an upbeat tone of voice?" solely by reading a transcript of the conversation.

Instructions

Instructions define how each answer is interpreted, so they must be specific and leave no room for interpretation. A precise definition ensures that every evaluation of a conversation produces the same answer.

Examples

Examples are also useful for clarifying how a question should be interpreted. Quality AI learns from real conversation data, so take examples from your existing conversations. Include 1,000 example conversations for training, 1,000 conversations for testing, and 250 conversations for validation. Include a percentage of training examples for each response category for every question in a scorecard.

For example, if a question on a scorecard is "Did the agent exhibit empathy towards the customer?" and the response to that question can be yes or no, include both of the following:

  • Conversations that received a yes rating: 75% of examples.
  • Conversations that received a no rating: 25% of examples.

Question                                               Possible answers   % of example conversations
Did the agent exhibit empathy towards the customer?    Yes                75
                                                       No                 25
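
To see how that distribution translates into counts, the following sketch assumes the 1,000-conversation training split described above and computes how many training examples each answer needs.

  # Hypothetical target distribution for one question's answer options.
  training_conversations = 1000
  distribution = {"yes": 0.75, "no": 0.25}

  for answer, share in distribution.items():
      print(f"{answer}: {int(training_conversations * share)} training conversations")
  # yes: 750 training conversations
  # no: 250 training conversations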

Labeled data

Training and customizing the Quality AI LLM requires labeled data, which is derived from the information you gather from human evaluations. At a minimum, labeled data must include identifiers for each conversation, scorecard, and question, as well as the expected answer. Your labeled data can also include the weight, answer options, instructions, and an abbreviated name for each question.

The ideal format for your labeled data is a table in which each row contains the information that identifies a single answer and each column contains a separate field, as shown in the following table:

Conversation ID   Scorecard ID          Question ID           Answer
44748735396       5727080762913918243   4097398336657302301   str_value: YES
44748735396       5727080762913918243   3576133206121890384   str_value: NO
3495523396        5727080762913918243   4097398336657302301   str_value: YES
3495523396        5727080762913918243   3576133206121890384   str_value: NO
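
As a minimal sketch, assuming your evaluation tool exports answers as Python dictionaries, the following writes them to a CSV file in this one-row-per-answer layout. The column names and file name are illustrative.

  import csv

  # Hypothetical export from a human-evaluation tool: one dict per answered question.
  labels = [
      {"conversation_id": "44748735396", "scorecard_id": "5727080762913918243",
       "question_id": "4097398336657302301", "answer": "str_value: YES"},
      {"conversation_id": "44748735396", "scorecard_id": "5727080762913918243",
       "question_id": "3576133206121890384", "answer": "str_value: NO"},
  ]

  with open("labeled_data.csv", "w", newline="") as f:
      writer = csv.DictWriter(
          f, fieldnames=["conversation_id", "scorecard_id", "question_id", "answer"])
      writer.writeheader()   # one column per field, one row per answer
      writer.writerows(labels)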

Evaluating a conversation

Human labelers determine the answers to scorecard questions when evaluating a conversation. When multiple people evaluate the same conversation, they sometimes provide different answers to the same question. This inconsistency in labeling introduces noise and confusion into the machine learning process. Within a conversation, if the same or a similar question is associated with multiple different answers, Quality AI has no way to learn the mapping between questions and answers.

Any of the following can cause labeling inconsistency when multiple people answer the same questions for a single conversation:

  • Subjective questions that lead to different interpretations between labelers.
  • Rubrics with insufficient details or unclear guidelines.
  • Different versions of a question, answer options, or instructions, for example:
    • You might begin with only yes and no answer options, then later change to a more fine-grained approach with no-a, no-b, and no-c options.
    • Combining labels from the yes or no approach with the no-a, no-b, and no-c options will confuse the model.
  • A labeling task that imposes a heavy cognitive load.

Measure consistency

To measure consistency in your labeled data, ask multiple labelers to independently evaluate the same conversation. Then compute agreement between them using Cohen's kappa coefficient, as sketched after the following list. Aim for a Cohen's kappa coefficient of no less than 0.2. If consistency is low, try one of the following options:

  • Refine the question and instructions to leave less room for interpretation.
  • Have annotators communicate with each other to resolve discrepancies and agree on a single grading standard.
  • Continuously monitor consistency among annotators.
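
The following sketch computes agreement for one question using scikit-learn's cohen_kappa_score (an assumed dependency). The annotator answers are made up for illustration.

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical answers from two annotators for the same question
  # across the same ten conversations.
  annotator_a = ["YES", "YES", "NO", "N/A", "YES", "NO", "YES", "NO", "YES", "NO"]
  annotator_b = ["YES", "NO", "NO", "N/A", "YES", "NO", "YES", "YES", "YES", "NO"]

  kappa = cohen_kappa_score(annotator_a, annotator_b)
  print(f"Cohen's kappa: {kappa:.2f}")

  if kappa < 0.2:
      print("Low agreement: refine the question and instructions before training.")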