
In Part 1, I outlined common patterns of perceived hallucination in our RAG system — like missing citations or vague answers that broke user trust.
This post continues that journey, focusing on how we acted on those signals — using real user feedback, lightweight QA loops, and product–ML collaboration, all without retraining the model.
Spotting patterns is only the first half of the job. The real impact comes from translating those insights into structured, actionable feedback that product and ML teams can rally around.
As we cataloged hallucination-adjacent failures — missing citations, incomplete answers, wrong references — we realized that the system wasn’t broken. It was behaving exactly as designed. But it wasn’t meeting user expectations.
And that gap had to be bridged — not just through model updates, but through product rigor and close collaboration with our users.
We didn’t have an automated hallucination evaluator or gold-standard benchmark suite. But what we did have was usage data, internal testing, and a clear understanding of what our enterprise users considered “trustworthy.”
To turn observations into progress, we built a high-signal manual feedback loop.
This turned scattered frustration into structured insight. I set up a simple shared sheet to log issues, link the correct documents, tag themes, and review them regularly. No fancy tools — just discipline and iteration.
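For concreteness, here is a minimal sketch of the kind of record that sheet captured, written as Python rather than a spreadsheet row. The field names and the example entry are illustrative, not our actual schema.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative only: the fields approximate what our shared sheet tracked,
# not the exact columns we used.
@dataclass
class FeedbackEntry:
    logged_on: date                 # when the issue was observed
    query: str                      # the user query that triggered it
    answer_excerpt: str             # the part of the response that felt off
    expected_documents: list[str]   # the documents the answer should have cited
    theme: str                      # e.g. "missing citation", "incomplete answer"
    reported_by: str                # which early-adopter team flagged it
    status: str = "open"            # reviewed regularly, closed once addressed

issues: list[FeedbackEntry] = []
issues.append(FeedbackEntry(
    logged_on=date.today(),
    query="What are the key risk disclosures in the latest filings?",
    answer_excerpt="The company faces regulatory risks...",
    expected_documents=["annual_report_fy24.pdf"],   # hypothetical filename
    theme="missing citation",
    reported_by="compliance",
))
```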
We’ve recently integrated our MCP setup to partially automate response evaluation — adding structure to what was previously a manual loop. While we’re using it internally for QA, the same system powers how AskNeedl routes insights into reports, dashboards, and decision systems — turning grounded answers into actual outcomes.
It’s not a fully autonomous QA system — but it’s no longer just spreadsheets and gut feel either.
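To make the shape of that hybrid concrete, here is a rough sketch, not the actual MCP integration: simple automated triggers plus a small random sample, both feeding the same review queue that started life as the shared sheet. The fields and the 5% sample rate are assumptions for illustration.

```python
import random

# Sketch only: cheap automated triggers plus routine sampling, so humans stay
# in the loop even for answers that pass the automated checks.
SAMPLE_RATE = 0.05              # illustrative slice of traffic sent to review
review_queue: list[dict] = []   # stand-in for rows in the shared sheet

def triage(response: dict) -> None:
    reasons = []
    if not response.get("citations"):        # automatic trigger: answer cites nothing
        reasons.append("no cited documents")
    if random.random() < SAMPLE_RATE:        # routine human spot-check
        reasons.append("routine sample")
    if reasons:
        review_queue.append({**response, "review_reasons": reasons})
```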
We actively worked with a few early adopter teams — compliance officers, market analysts, and documentation experts — who were already using AskNeedl in production or pilot settings. Their feedback became essential in shaping how we tagged and categorized failures.
In many cases, they weren’t pointing out factual errors — they were flagging breaks in trust. And that distinction shaped how we evaluated and prioritized issues internally.
These early users didn’t just report bugs — they taught us what “truthful” means in context.
By narrowing the problem space and attaching clear, user-validated examples, we gave the ML team focused, reproducible cases to act on.
Instead of “model is wrong,” the message became:
“Here’s what this user expected, why this output felt unreliable, and what could have made it better.”
I often found myself translating between what users said — “this feels vague” — and what the ML team needed to hear, like “retrieval precision dropped due to a fuzzy match between HDFC Bank and HDFC Securities.” That translation layer became part of the product muscle.
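The HDFC example is easy to reproduce in miniature. The snippet below uses Python's difflib purely as a stand-in for whatever matching the retrieval layer actually performs; the point is that two distinct entities can score as similar enough to pollute each other's results.

```python
from difflib import SequenceMatcher

# Two distinct entities look quite similar to a naive string matcher, so
# documents about one bleed into results for the other.
a, b = "HDFC Bank", "HDFC Securities"
score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(f"similarity({a!r}, {b!r}) = {score:.2f}")  # roughly 0.42 with difflib's ratio

# With a loose matching threshold, the wrong entity slips through and the
# user simply sees a vague, slightly-off answer.
LOOSE_THRESHOLD = 0.4
print("treated as the same entity:", score > LOOSE_THRESHOLD)  # True
```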
As a PM, your job isn’t just to surface what’s broken — it’s to decide what’s worth fixing now. We focused on the failures that hit user trust hardest.
This let us direct ML and engineering effort where it mattered most — not for model elegance, but for business-critical trust restoration.
One of the most complex challenges in AI product development — especially with large language models and RAG systems — is evaluating output quality when there’s no clear right answer.
In traditional software, correctness is binary. In AI, it’s often a matter of degree: how complete an answer is, how well it’s grounded in the source documents, and how clearly it can be traced back to them.
We faced all of these challenges, often at once.
Many user queries — especially in enterprise search — were broad or open-ended:
“What are the key risk disclosures in the latest filings?”
“Has the company responded to the recent allegation?”
“What’s the company’s outlook for next quarter?”
There wasn’t one perfect answer, and no benchmark dataset to score against. What mattered was whether the answer drew on the right documents, cited them clearly, and could be traced back to its sources.
In other words, quality wasn’t just factuality — it was auditability.
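One way to make "auditability" concrete is a check like the one below: every citation should resolve to a document retrieval actually returned, and the documents a reviewer expected should all be cited. This is an illustrative sketch with assumed field and document names, not a description of our production checks.

```python
# Minimal auditability check: traceability (no dangling citations) plus
# coverage of the documents a reviewer expected the answer to use.
def audit_answer(citations: list[str],
                 retrieved_docs: set[str],
                 expected_docs: set[str]) -> dict:
    dangling = [c for c in citations if c not in retrieved_docs]  # cites nothing we retrieved
    uncited = expected_docs - set(citations)                      # expected sources never cited
    return {
        "traceable": not dangling,   # can every claim be followed back to a source?
        "complete": not uncited,     # did it draw on the documents that matter?
        "dangling_citations": dangling,
        "uncited_expected_docs": sorted(uncited),
    }

report = audit_answer(
    citations=["annual_report_fy24.pdf"],
    retrieved_docs={"annual_report_fy24.pdf", "q4_earnings_call.pdf"},
    expected_docs={"annual_report_fy24.pdf", "q4_earnings_call.pdf"},
)
# report["traceable"] is True but report["complete"] is False: the answer is
# cited, yet it ignored a document the reviewer expected it to use.
```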
In the absence of automated benchmarks, we built a multi-layered, semi-manual evaluation loop:
1. Test Sets Built Around Real Queries
We created a bank of ~200 task-specific queries, each tagged with its theme and the documents we expected a good answer to draw from.
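As an illustration, here are two entries of the kind the bank contained, using queries quoted earlier in this post; the tag names and expectations are examples rather than our exact taxonomy.

```python
# Illustrative query-bank entries; fields are assumptions for the sketch.
QUERY_BANK = [
    {
        "query": "What are the key risk disclosures in the latest filings?",
        "theme": "summary / open-ended",
        "expected_sources": ["latest annual report", "latest quarterly filing"],
        "good_answer_must": ["cite the filings it summarizes",
                             "cover all major risk sections"],
    },
    {
        "query": "Has the company responded to the recent allegation?",
        "theme": "event follow-up",
        "expected_sources": ["press releases", "exchange disclosures"],
        "good_answer_must": ["cite the specific response document",
                             "state clearly if no response exists"],
    },
]
```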
2. Live Usage Reviews
We routinely sampled real user sessions and reviewed how users interacted with the answers they received.
This usage behavior became a proxy for satisfaction and, indirectly, trust.
3. Early User Feedback as Truth Proxy
We looped in early users and asked:
“Would you trust this answer in a report?”
“Is anything missing?”
“Does this feel inferred or grounded?”
“Does it have all the documents you wanted?”
Their feedback formed the human layer of our quality assurance process.
While our setup was manual and grounded in real-world QA, we also drew perspective from how leading AI products — like OpenAI’s ChatGPT, Perplexity, and Claude — approach answer quality and trust. Observing their strengths and failure patterns gave us valuable signals for where RAG systems tend to succeed — and where they often struggle.
Some common practices and themes we noted across these systems:
1. Human Evaluation is Still Central
Even the most advanced tools rely heavily on human judgment for evaluating output quality, typically pairing human raters with in-product feedback signals such as thumbs up/down.
This reaffirmed our decision to center real-user feedback in our own QA loops, rather than depend on automation too early.
2. Multi-Dimensional Metrics Over One-Liner Scores
Rather than relying on metrics like BLEU or accuracy alone, these tools evaluate answers along multiple dimensions, such as helpfulness, faithfulness to sources, and completeness.
This inspired us to tag answers along similar dimensions rather than a single “correct/incorrect” label: was it grounded, was it complete, was it properly cited?
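A tag set along those lines might look like the sketch below. The labels are illustrative; the set we actually converged on was shaped by the early-user feedback described above rather than fixed in code.

```python
from enum import Enum

# Illustrative labels only, not our production taxonomy.
class AnswerTag(Enum):
    GROUNDED = "cites and sticks to retrieved documents"
    INFERRED = "reads plausibly but is not traceable to a source"
    INCOMPLETE = "misses documents or sections the user expected"
    MISATTRIBUTED = "cites the wrong entity or document"

# A single answer can carry several tags, e.g. INFERRED + INCOMPLETE,
# which is far more actionable for the ML team than a bare "incorrect".
```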
3. Adversarial Prompting and “Stress Tests”
Models like GPT-4 and Claude are often tested on ambiguous, multi-intent, or underspecified queries to evaluate reasoning boundaries.
This reflected our own discovery: the more ambiguous or summary-based the user query, the more fragile the RAG output became, especially when citations were missing or retrieval was partial.
In traditional product management, we focus on usability, conversion, and retention. In AI product management, we add something deeper: credibility.
AI systems — especially ones built on RAG — don’t just need to answer questions. They need to do so with transparency, humility, and traceability. Because when they don’t, even correct answers can feel wrong. And trust, once lost, is hard to win back.
As product managers, we may not train the model or write the prompt, but we shape the environment in which trust is earned: the feedback loops we run, the evaluation criteria we hold answers to, and the expectations we set with users.
In our journey, we learned that quality in AI products is a shared responsibility. And as PMs, we’re responsible not just for what the system says, but for what the user believes about it.
And if there’s one thing I’ve learned along the way, it’s that for ML engineers, it’s never a “bug” or an “issue.” It’s always an “optimization” 😉
You can also check out this post on Medium.