Citations vs hallucinations: why grounding matters
An unsourced answer is a guess wearing a tie. Here's how grounding closes the gap between confidence and correctness.
A model that cannot point to a source is a model that cannot be audited. That single property — auditability — is the line between a chatbot that ships and a chatbot that gets quietly turned off after the first refund dispute, the first compliance review, or the first time it tells a customer your refund window is 90 days when the actual answer is 30.
Grounding is the discipline of making every assertion traceable to text the team controls. It is not a feature you bolt on at the end. It is the spine of any retrieval system that intends to outlive the demo.
This post walks through what grounding actually means, the failure modes that look like grounding but are not, the architecture choices that bake it in, and the operational habits that keep it honest over time.
Why hallucinations are not random
A hallucination is rarely the model “making things up.” It is almost always a predictable consequence of an under-constrained generation step:
- The retrieval pass surfaced nothing relevant, but the model was prompted to produce a confident answer anyway. So it produced one — drawn from its pretraining corpus, which is statistically averaged across the public internet circa some cutoff date.
- The retrieval pass surfaced something adjacent but wrong. The model synthesized over the wrong evidence and wrote a plausible paragraph that contradicts your docs.
- The retrieval pass surfaced the right chunk, but the chunk was truncated at a clause boundary that flipped its meaning. “Refund within 30 days, except for digital goods.” → “Refund within 30 days.” The exception got chopped.
In each case the model behaved as designed. The system around it failed to give it a usable input. Calling the result a hallucination misplaces the blame.
What “grounded” actually means
Grounding is more than appending a URL after the answer. A grounded reply has four traits, in order of difficulty:
- Retrieval precision. The chunk surfaced is semantically close to the question, not just keyword-similar. “How do I cancel” should retrieve cancellation policy, not the cancellation FAQ for an unrelated product line.
- Conditioning, not concatenation. The model is prompted in a way that forces it to use the chunk during generation rather than treat it as decoration. This is a prompt-engineering problem and a fine-tuning problem; both matter. See the prompt sketch below.
- Per-claim attribution. The citation maps to the exact paragraph or sentence the assertion came from, not the page it lives on. A reader who hovers a citation should land on the specific text that justified the specific clause.
- Refusal when grounding is weak. When retrieval returns low-confidence chunks, the system declines to confabulate. It says it does not know, or it offers an escalation. Refusal is the most underrated grounding behavior in production.
Skip any of the first three and you have a footnote, not a citation. Skip the fourth and you have a confident liar with a hyperlinked tail.
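To make the conditioning trait concrete, here is a minimal sketch in Python: the retrieved chunks are labeled, the instructions demand a citation after every claim, and the refusal path is spelled out in the prompt itself. The function name, chunk format, and wording are illustrative, not our production prompt.

```python
# Illustrative only: labels each chunk, demands per-claim citations against
# those labels, and spells out the refusal path inside the prompt itself.
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """chunks: [{"id": "refunds.md#policy", "text": "...", "score": 0.81}, ...]"""
    evidence = "\n\n".join(
        f"[{c['id']}] (retrieval score {c['score']:.2f})\n{c['text']}" for c in chunks
    )
    return (
        "Answer the customer's question using ONLY the evidence below.\n"
        "After every sentence that makes a claim, cite the evidence id in brackets.\n"
        "If the evidence does not support a confident answer, reply exactly:\n"
        "\"I don't have a confident answer for that. I can connect you to a human.\"\n\n"
        f"EVIDENCE:\n{evidence}\n\n"
        f"QUESTION: {question}\n"
    )
```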
The two failure modes we see weekly
The plausible paraphrase
The model rewords the source so smoothly that even the writer forgets where it came from. The reply sounds like documentation. It compresses caveats, smooths edge cases, and sometimes invents the next sentence the reader was hoping for.
The fix is per-sentence attribution at evaluation time, not just at the end of the message. We score every sentence in the generated reply against the retrieved chunks. Sentences whose maximum similarity falls below a threshold get flagged. In the dashboard, those flagged sentences light up so the reviewer can decide whether to add a source, soften the claim, or reject the answer outright.
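As a rough sketch of that check, the snippet below scores each generated sentence against the retrieved chunks and flags the unsupported ones. It assumes an `embed` callable from whatever embedding model you already run; the regex sentence splitter and the 0.6 threshold are placeholders to tune on your own data.

```python
# Per-sentence attribution sketch: flag sentences whose best similarity to any
# retrieved chunk falls below a threshold. `embed` is assumed, not provided here.
import re
from math import sqrt
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)) + 1e-9)

def flag_unsupported_sentences(
    reply: str,
    chunks: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.6,
) -> list[tuple[str, float]]:
    """Return (sentence, best_score) pairs whose support falls below the threshold."""
    chunk_vecs = [embed(c) for c in chunks]
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", reply) if s.strip()]
    flagged = []
    for sentence in sentences:
        best = max((cosine(embed(sentence), cv) for cv in chunk_vecs), default=0.0)
        if best < threshold:
            flagged.append((sentence, best))
    return flagged
```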
The dropped clause
The retrieved chunk says “free for the first 30 days, then $19/mo.” The model emits “free.” The truncation is invisible unless the chunk is rendered next to the answer.
The fix is rendering. We show the retrieved chunk inline in the reviewer’s interface, with the relevant sentence highlighted. Customer-facing replies link to the full source. The point is to make truncation expensive to hide — both for the model (because the comparison is inevitable) and for the reviewer (because the missed clause is right there).
Retrieval is the hard part
Most teams overspend on the model and underspend on retrieval. They will A/B four prompt variants in a week and never look at their chunking strategy. That is backwards.
Three retrieval levers do the heavy lifting:
- Chunking policy. Chunks should respect semantic boundaries, not just byte counts. A 512-token splitter that cuts mid-sentence will silently halve your precision. We chunk along heading boundaries first, then paragraph, then sentence, with overlap to preserve context across boundaries.
- Hybrid search. Pure vector search misses exact-match queries (“error code E_5012”). Pure BM25 misses paraphrases. The right answer is hybrid — run both, fuse the results, rerank. The cost is latency; the payoff is recall.
- Reranking. A cross-encoder reranker over the top 50 lexical/vector hits routinely beats raw similarity by double digits on grounded-QA benchmarks. It costs maybe 80ms. Add it.
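Here is a minimal sketch of the hybrid step: fuse the BM25 and vector rankings with reciprocal rank fusion, then hand the fused top k to a cross-encoder for reranking. `bm25_ranked`, `vector_ranked`, `get_text`, and `cross_encoder_score` stand in for your own retrieval stack; k=60 is the conventional RRF constant.

```python
# Hybrid retrieval sketch: reciprocal rank fusion over two rankings, then a
# cross-encoder rerank of the fused candidates. All callables are assumed.
from collections import defaultdict
from typing import Callable

def rrf_fuse(bm25_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve_hybrid(
    bm25_ranked: list[str],
    vector_ranked: list[str],
    query: str,
    get_text: Callable[[str], str],
    cross_encoder_score: Callable[[str, str], float],
    top_k: int = 50,
) -> list[str]:
    candidates = rrf_fuse(bm25_ranked, vector_ranked)[:top_k]
    return sorted(candidates, key=lambda d: cross_encoder_score(query, get_text(d)), reverse=True)
```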
If you have one afternoon to improve grounding quality, spend it on chunking. If you have two, add a reranker. The model can wait.
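For that chunking afternoon, a boundary-aware splitter can be surprisingly small. The sketch below splits markdown on headings, packs paragraphs into a token budget, and carries the last paragraph forward as overlap; the 400-token budget and whitespace tokenizer are placeholders, and the sentence-level fallback is omitted for brevity.

```python
# Boundary-aware chunking sketch: heading boundaries first, then paragraphs,
# with one paragraph of overlap between chunks. Swap in your real tokenizer.
import re

def chunk_markdown(text: str, max_tokens: int = 400) -> list[str]:
    def tokens(s: str) -> int:
        return len(s.split())  # placeholder tokenizer

    chunks: list[str] = []
    for section in re.split(r"\n(?=#{1,6}\s)", text):   # split before markdown headings
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current: list[str] = []
        for para in paragraphs:
            if current and tokens("\n\n".join(current + [para])) > max_tokens:
                chunks.append("\n\n".join(current))
                current = current[-1:]                   # one-paragraph overlap
            current.append(para)
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```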
Confidence belongs on retrieval, not on generation
LLM confidence scores are notoriously uncalibrated. The token probabilities have almost nothing to do with whether the underlying claim is true. A model can produce a fluent, internally consistent paragraph at 0.95 mean token probability that contradicts the retrieved source.
Retrieval confidence is more honest. The cosine similarity between the query embedding and the top retrieved chunk is a defensible signal. If your top hit is 0.42, you do not have evidence; you have noise.
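In practice that signal becomes a band. A minimal sketch, with cutoffs that are illustrative rather than our production values:

```python
# Turn the top hit's similarity into a behavior band. Calibrate the cutoffs
# against your own labeled conversations; 0.75 / 0.55 are placeholders.
def confidence_band(top_score: float) -> str:
    if top_score >= 0.75:
        return "commit"   # answer directly and cite the chunk
    if top_score >= 0.55:
        return "hedge"    # answer, soften, surface the source prominently
    return "refuse"       # step onto the fallback ladder
```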
We expose retrieval confidence in three places:
- The reviewer dashboard shows the top retrieved chunk and its score on every conversation.
- The system prompt includes the score so the model can decide whether to commit, hedge, or refuse.
- The fallback ladder consults the score before deciding whether to escalate.
Confidence is a knob, not a number to display to end users. They do not want to see “I am 73% sure.” They want correctness or escalation.
The fallback ladder
When retrieval is shaky, the bot does not try harder by inventing more. It steps down a ladder:
- Rephrase. Sometimes the query is the problem. The bot rewrites it once and retries retrieval.
- Retrieve again with broadened scope. If the first pass was too narrow (single source, single language), widen to sibling sources and languages.
- Ask a clarifying question. “Are you asking about the API plan or the dashboard plan?” Disambiguation is grounding.
- Refuse. “I don’t have a confident answer for that. I can connect you to a human.”
- Escalate. Hand the conversation to a human with the full transcript and the retrieved (and rejected) chunks attached.
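Expressed as control flow, the ladder is short. The sketch below assumes `retrieve`, `rephrase`, `answer`, `ask_clarification`, and `escalate` are your own components, that hits expose a `.score` similarity, and it folds refusal and escalation into the final step; the 0.55 threshold mirrors the hedge cutoff above and is a placeholder.

```python
# Fallback ladder sketch. All callables and the `scope="broad"` option are
# stand-ins for your own retrieval and routing components.
def respond(query, retrieve, rephrase, answer, ask_clarification, escalate,
            threshold: float = 0.55) -> str:
    hits = retrieve(query)                                    # first pass
    if not hits or hits[0].score < threshold:
        hits = retrieve(rephrase(query))                      # 1. rewrite the query once and retry
    if not hits or hits[0].score < threshold:
        hits = retrieve(query, scope="broad")                 # 2. broaden the retrieval scope
    if hits and hits[0].score >= threshold:
        return answer(query, hits)                            # grounded answer, no ladder needed
    clarification = ask_clarification(query)                  # 3. disambiguate when possible
    if clarification:
        return clarification
    escalate(query, hits)                                     # 5. hand off transcript + rejected chunks
    return ("I don't have a confident answer for that. "      # 4. refuse honestly
            "I can connect you to a human.")
```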
Most production failures we see come from teams who skipped step four. Refusal is unsexy; it is also the difference between a bot you trust and a bot you babysit.
What we ship
Every reply in FluentBot carries:
- A chunk-level pointer to the source paragraph, not the page. Hover, click, verify.
- A retrieval confidence band displayed in the dashboard but not the widget. The band drives behavior; the user sees the result.
- An explicit fallback when grounding is weak, including a one-tap escalation path. The bot is allowed to say “I don’t know.”
- A content gap signal when retrieval missed because no good source exists. That feeds the writer’s queue, not the user’s reply.
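As a data structure, that reply contract is small. The sketch below uses illustrative field names, not our actual schema:

```python
# Illustrative per-reply payload: chunk-level sources, a dashboard-only
# confidence band, an escalation flag, and a content-gap signal.
from dataclasses import dataclass, field

@dataclass
class SourcePointer:
    doc_id: str            # e.g. "billing/refunds.md"
    paragraph_anchor: str  # chunk-level pointer into the page, not the page itself
    quoted_text: str       # the exact paragraph the claim came from

@dataclass
class GroundedReply:
    text: str
    sources: list[SourcePointer] = field(default_factory=list)
    confidence_band: str = "refuse"    # "commit" | "hedge" | "refuse"; shown only in the dashboard
    escalation_offered: bool = False   # one-tap path to a human
    content_gap: bool = False          # no good source existed; feeds the writers' queue
```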
The result is not a smarter model. It is a system whose answers are checkable — and that, in support, is the whole game.
How to evaluate grounding without lying to yourself
The most common mistake in RAG evaluation is using the same model for generation and judgment. Self-assessment overstates correctness by a wide margin. Use a different model, or — better — use human reviewers with a forced rubric:
- Did the answer make a claim?
- Does the cited chunk support that claim?
- Is the answer consistent with the rest of the chunk, or did it drop context?
Score those three on a binary scale. Aggregate weekly. Watch the trend. The absolute number is less interesting than the slope.
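A sketch of that aggregation, assuming one record per reviewed answer and treating a reply as grounded when it either made no claim or made one the chunk supports without dropping context:

```python
# Weekly rubric aggregation sketch. Field names and the "grounded" definition
# are assumptions layered on the three rubric questions above.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Review:
    week: date                   # e.g. the Monday of the review week
    made_claim: bool             # did the answer make a claim?
    chunk_supports_claim: bool   # does the cited chunk support it?
    kept_context: bool           # or did it drop a clause?

def weekly_grounding_rate(reviews: list[Review]) -> dict[date, float]:
    grouped: dict[date, list[Review]] = defaultdict(list)
    for r in reviews:
        grouped[r.week].append(r)

    def grounded(r: Review) -> bool:
        return (not r.made_claim) or (r.chunk_supports_claim and r.kept_context)

    return {
        week: sum(grounded(r) for r in rs) / len(rs)
        for week, rs in sorted(grouped.items())
    }
```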
Pair the human eval with two cheap automated checks:
- Answerable-or-not classifier. Hold out a set of queries you know your corpus cannot answer. Measure how often the bot refuses on that set (sketched in code below this list). If it answers them, your grounding is theatrical.
- Citation-clickthrough rate. What fraction of users actually click the citation? If it falls below a few percent, your citations are decorative: the reply text is doing all the persuading and you have lost the audit trail.
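The refusal-rate check fits in a dozen lines. The sketch below assumes a `bot_reply` callable and matches on refusal phrasing; both are placeholders to swap for your own bot client and refusal template.

```python
# Answerable-or-not check sketch: run held-out unanswerable queries through the
# bot and count how often it refuses. Marker strings are assumptions.
from typing import Callable

REFUSAL_MARKERS = ("i don't have a confident answer", "i can connect you to a human")

def refusal_rate(unanswerable_queries: list[str], bot_reply: Callable[[str], str]) -> float:
    refused = sum(
        any(marker in bot_reply(q).lower() for marker in REFUSAL_MARKERS)
        for q in unanswerable_queries
    )
    return refused / max(len(unanswerable_queries), 1)
```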
The cultural part
Grounding is also a posture. Teams that build trustworthy bots write differently:
- They do not mark a query “resolved” because the bot replied. They mark it resolved because the user did not come back.
- They treat the content gap queue as a real backlog, not a vanity dashboard.
- They protect the refusal threshold from product managers who want the deflection number to look better.
That last one is the test. If your team can hold the line on “the bot will say I don’t know when it doesn’t know,” you have a grounded product. If the threshold quietly creeps down every quarter to lift a metric, you have a confidence machine pointed at your customers.
Closing
Hallucinations are not the disease. They are the symptom of an unconstrained generator wired to a thin retriever and a self-graded eval. Fix the retriever, give the generator something to anchor to, force the system to refuse when the anchor is weak, and grade the whole thing on whether claims trace back to text the team can defend.
Do that, and the bot stops sounding like a confident stranger. It starts sounding like the documentation — which, after all, is what your customers came for.