The Battle for Higher Quality Customer Service: ChatGPT vs QA Analyst

Can machine learning match the objectivity and sensibility of human QA in customer service?

Can QA be automated with ChatGPT-type tech?

Manual QA programs were invented to solve the Metrics Trust Gap.

Poll #1
Are people in your company asking this question?
Poll #2
What % of QA do you think can be automated w/ ChatGPT-type technology?

Experiment Design

ChatGPT can automate parts of the QA process, but can it match the objectivity and sensibility of human QA analysts?

In our experiment, we aimed to investigate this question. We tasked ChatGPT with grading 200 tickets based on a single question: "Did the agent demonstrate active listening?" We then compared ChatGPT's results with those of a human QA analyst.

Active listening was chosen as the primary focus for the following reasons:

It is fair and applicable across various companies.
It doesn't require knowledge of internal systems or company-specific data. GPT does not have access to these systems and thus cannot grade based on them.
Active listening remains a complex and intriguing criterion.

Data collection and cleaning

Preparing and cleaning the data is essential for accurate test results: the underlying data must be properly labeled, formatted, and anonymized.

Our data collection involved 5 key parts:

Assembled a small group of test customers.
Selected one Yes/No question from each Scorecard that had at least 200 scores.
Focused on chat and email conversations.
Anonymized and removed any sensitive information from the data.
Cleaned the data to exclude "internal" notes, chatbots, and other non-customer-agent messages.
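
The cleaning and anonymization steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline we actually ran: the message schema (dicts with "author_type" and "text"), the author type labels, and the two redaction patterns are all assumptions made for the example, and real anonymization would need to cover many more kinds of sensitive data.

```python
import re

# Hypothetical schema: each message is a dict with an "author_type"
# ("customer", "agent", "bot", "internal_note") and a "text" field.
KEEP_AUTHOR_TYPES = {"customer", "agent"}

# Deliberately simple patterns for illustration; they will miss many cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def clean_conversation(messages):
    """Keep only customer and agent messages, with sensitive data redacted."""
    return [
        {"author_type": m["author_type"], "text": redact(m["text"])}
        for m in messages
        if m["author_type"] in KEEP_AUTHOR_TYPES
    ]
```

Filtering before redacting keeps internal notes and chatbot messages out of the prompt entirely, so they can neither leak information nor influence the grade.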

Example Prompts

GPT, as a text prediction tool, relies on high-quality prompts to generate accurate answers.

Our prompt consisted of four parts:

Generic context
Question context
Specific examples of good and bad responses
The conversation to grade
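
To make the four-part structure concrete, here is a small sketch of how such a prompt could be assembled. The function name, section labels, and sample wording are hypothetical and chosen for illustration; they are not the exact prompts used in the experiment.

```python
def build_prompt(generic_context, question_context,
                 good_examples, bad_examples, conversation):
    """Assemble a grading prompt from the four parts listed above."""
    # Label each example so the model can tell good from bad ones.
    example_lines = [f"GOOD: {ex}" for ex in good_examples]
    example_lines += [f"BAD: {ex}" for ex in bad_examples]
    sections = [
        generic_context,
        question_context,
        "Examples:\n" + "\n".join(example_lines),
        "Conversation to grade:\n" + conversation,
    ]
    return "\n\n".join(sections)


# Illustrative inputs only; the wording here is an assumption.
prompt = build_prompt(
    generic_context="You are grading customer service conversations.",
    question_context="Did the agent demonstrate active listening? Answer Yes or No.",
    good_examples=["Agent restates the customer's issue before answering."],
    bad_examples=["Agent replies with an unrelated canned response."],
    conversation="Customer: My order is late.\nAgent: I see your order is late. Let me check.",
)
```

Putting the conversation last keeps the reusable parts of the prompt identical across tickets, which makes it easier to compare prompt iterations.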

We experimented with different prompts to improve GPT's answers. However, prompt size was limited, and we had to be careful about compliance and the quality of the examples we used.

Our main objective was to determine whether the agent displayed active listening skills, which allow an agent to offer personalized recommendations and identify opportunities for customer success.


Results

We assessed GPT's performance in grading 200 tickets, comparing its answers to those of a human grader.
After initial testing, we ran 10 iterations to improve the prompts.
The final results showed 58% alignment between GPT and the human grader.
Unfortunately, this level of alignment was not high enough for pilot customers to integrate GPT into their QA programs.
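
For readers who want to run a similar comparison, the sketch below shows one simple way to compute alignment. It assumes alignment means the share of tickets where GPT and the human grader gave the same Yes/No answer; the function name and the grade format are illustrative.

```python
def alignment_rate(gpt_grades, human_grades):
    """Return the fraction of tickets where both graders gave the same answer."""
    if len(gpt_grades) != len(human_grades):
        raise ValueError("Both graders must grade the same set of tickets.")
    if not gpt_grades:
        raise ValueError("At least one graded ticket is required.")
    # Count positions where the two grades match exactly.
    matches = sum(g == h for g, h in zip(gpt_grades, human_grades))
    return matches / len(gpt_grades)
```

Under this assumed definition, agreement on 116 of 200 tickets would give 0.58, which corresponds to the 58% figure.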

Potential Follow Up Ideas

The results don’t mean GPT can’t work. There is still a lot to explore:

GPT needs further prompting tests, but added prompt complexity can lead to errors.
Customization by a third-party vendor is necessary for high-quality automated QA, and the effectiveness of ChatGPT remains uncertain.
Johnny Appleseed may need more resources for AutoQA, and the effectiveness of a GPT-based approach is still unknown.
Optimal results may require a combination of narrow models and expert systems.
Data cleaning is crucial for accuracy and security.

If you're interested in learning more about ChatGPT and its potential for automating QA processes, request a demo today. You can also sign up for our CEO Series to stay updated on the latest trends in customer service technology.
