The Battle for Higher Quality Customer Service: ChatGPT vs QA Analyst

Can machine learning match the objectivity and sensibility of human QA in customer service?

Can QA be automated with ChatGPT-type tech?

Manual QA programs were invented to solve the Metrics Trust Gap.

Poll #1: Are people in your company asking this question?
Poll #2: What % of QA do you think can be automated with ChatGPT-type technology?

Experiment Design

ChatGPT, an AI-powered tool, can automate parts of the QA process, but can it match the objectivity and sensibility of human QA analysts?

In our experiment, we aimed to investigate this question. We tasked ChatGPT with grading 200 tickets based on a single question: "Did the agent demonstrate active listening?" We then compared ChatGPT's results with those of a human QA analyst.

Active listening was chosen as the primary focus for the following reasons:

It is fair and applicable across various companies.
It doesn't require knowledge of internal systems or company-specific data.
GPT does not have access to these systems, and thus cannot grade based on them.
Active listening remains a complex and intriguing criterion.

Data collection and cleaning

Accurate results depend on well-prepared data: the quality of the underlying data cannot be compromised, so every conversation needs proper labeling, formatting, and anonymization before testing.

Our data collection involved five key steps:

1. Assembled a small group of test customers.
2. Selected one Yes/No question from each Scorecard that had at least 200 scores.
3. Focused on chat and email conversations.
4. Anonymized and removed any sensitive information from the data.
5. Cleaned the data to exclude "internal" notes, chatbots, and other non-customer-agent messages (a sketch of this cleaning pass follows below).
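
To make steps 4 and 5 concrete, here is a minimal sketch of what such a cleaning pass might look like. The ticket structure, field names, and masking rules are illustrative assumptions, not taken from any specific helpdesk export; a production pipeline would use a dedicated PII-detection step.

```python
import re

def clean_ticket(messages):
    """Keep only customer-agent messages and mask obvious PII.

    `messages` is assumed to be a list of dicts with "author_type" and
    "body" keys; these names are hypothetical.
    """
    cleaned = []
    for msg in messages:
        # Drop internal notes, chatbot turns, and other system messages.
        if msg.get("author_type") not in ("customer", "agent"):
            continue
        body = msg["body"]
        # Rough anonymization: mask email addresses and long digit runs
        # (order numbers, phone numbers).
        body = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", body)
        body = re.sub(r"\d{6,}", "[NUMBER]", body)
        cleaned.append({"author": msg["author_type"], "text": body})
    return cleaned
```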

Example Prompts

GPT, as a text prediction tool, relies on high-quality prompts to generate accurate answers.

Our prompt consisted of four parts:

1. Generic context
2. Question context
3. Specific good and bad examples
4. The conversation to grade

We experimented with different prompts to get better answers out of the model. However, prompt size was limited, and we had to be careful about compliance and the quality of the examples we used.

Our main objective was to determine whether the agent displayed active listening skills, allowing them to offer personalized recommendations and identify opportunities for customer success.
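
The exact prompts varied by customer, but a minimal sketch of how the four parts above might be assembled and sent to the model could look like the following. The helper names, model choice, and wording are illustrative assumptions; the call uses the standard OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_prompt(generic_context, question_context, examples, conversation):
    """Join the four prompt parts, plus an instruction on answer format."""
    return "\n\n".join([
        generic_context,    # 1. generic context: the grader role and what QA means here
        question_context,   # 2. question context: what "active listening" means for this question
        examples,           # 3. specific good and bad examples
        "Conversation to grade:\n" + conversation,  # 4. the conversation itself
        "Answer Yes or No, then give a one-sentence justification.",
    ])

def grade_ticket(conversation, generic_context, question_context, examples):
    prompt = build_prompt(generic_context, question_context, examples, conversation)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is illustrative
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # keep grading as deterministic as possible
    )
    return response.choices[0].message.content
```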

Results

The study assessed GPT's performance in grading 200 tickets, comparing its answers to those of a human grader.
After initial testing, we ran 10 iterations to refine the prompts.
The final results revealed a 58% alignment between GPT and the human grader.
Unfortunately, this level of alignment was not high enough for pilot customers to integrate it into their QA program.
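
For reference, "alignment" here is simply the share of tickets where GPT and the human grader gave the same Yes/No answer. A small sketch of that calculation (function name is illustrative):

```python
def alignment_rate(gpt_grades, human_grades):
    """Share of tickets where GPT and the human grader gave the same Yes/No answer."""
    assert len(gpt_grades) == len(human_grades)
    matches = sum(g == h for g, h in zip(gpt_grades, human_grades))
    return matches / len(gpt_grades)

# With 200 tickets, 58% alignment corresponds to 116 matching grades:
# alignment_rate(gpt, human) == 116 / 200 == 0.58
```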

Potential Follow-Up Ideas

The results don’t mean GPT can’t work. There is still a lot to explore:

1. GPT needs further prompting tests, but added complexity can lead to errors.
2. High-quality automated QA likely requires third-party vendor customization; whether ChatGPT is effective in that role is uncertain.
3. Johnny Appleseed may need more resources for AutoQA; the effectiveness of a GPT-based approach is unknown.
4. Optimal results may require a combination of narrow models and expert systems.
5. Data cleaning is crucial for both accuracy and security.

If you're interested in learning more about ChatGPT and its potential for automating QA processes, request a demo today. You can also sign up for our CEO Series to stay updated on the latest trends in customer service technology.
