“Our RAG is hallucinating” often translates to: retrieval is returning the wrong context, or the right context in the wrong shape.
A good retrieval layer makes answers boring—in the best way. It consistently returns relevant, high-signal passages with clear provenance, so the generator doesn’t have to guess.
This post focuses on retrieval design patterns you can apply immediately.
1) Start with the target: what does the model need to answer?
Don’t index everything the same way. First classify your knowledge base:
- Reference docs (policies, specs): stable, high precision required
- Tickets / conversations: messy, requires deduplication and privacy handling
- Product data (catalog, pricing): structured, best served via tools/queries
- Runbooks: procedural, often better as step snippets + citations
If the answer should come from a database query, don’t force it through embeddings. Use tool calling and return structured results.
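If pricing lives in a catalog table, for instance, the retrieval step is just a query plus a tool the model can call. A minimal sketch, where the table, columns, and get_price helper are illustrative and the tool schema follows the OpenAI-style function-calling shape:

```python
import sqlite3

def get_price(conn: sqlite3.Connection, sku: str) -> dict | None:
    """Look up the current price for a SKU directly from the catalog table."""
    row = conn.execute(
        "SELECT sku, price_cents, currency FROM catalog WHERE sku = ?", (sku,)
    ).fetchone()
    return None if row is None else {"sku": row[0], "price_cents": row[1], "currency": row[2]}

# Tool definition the generator can call instead of searching embeddings.
PRICE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_price",
        "description": "Return the current catalog price for a SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}
```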
2) Chunking is a product decision
Chunking determines what can be retrieved. Bad chunking creates impossible retrieval tasks.
Practical chunking rules
- Chunk by semantic boundaries: headings, sections, Q/A pairs.
- Keep chunks self-contained: include the definition that makes the chunk interpretable.
- Avoid giant chunks. As a starting point:
  - 200–500 tokens for dense reference docs
  - 500–1,000 tokens for procedural content
- Add overlap only when needed (e.g., 10–20%) and measure the effect.
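A minimal chunker along these lines splits on markdown headings first, then breaks oversized sections at paragraph boundaries. Token counts are approximated by whitespace splitting; swap in your real tokenizer:

```python
import re

def chunk_markdown(text: str, max_tokens: int = 400) -> list[str]:
    """Split on headings, then split oversized sections at paragraph boundaries."""
    sections = re.split(r"\n(?=#{1,6} )", text)  # keep each heading attached to its body
    chunks: list[str] = []
    for section in sections:
        current: list[str] = []
        current_len = 0
        for para in section.split("\n\n"):
            n = len(para.split())  # crude token estimate
            if current and current_len + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```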
Include “breadcrumb context”
For each chunk, store:
- Document title
- Section heading path (e.g., Refunds > Eligibility > Exceptions)
- Source URL/id
- Last updated timestamp
This improves user trust (citations) and helps the generator stay grounded.
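One way to carry that provenance around is a small per-chunk record; the field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str
    heading_path: list[str]   # e.g. ["Refunds", "Eligibility", "Exceptions"]
    source_id: str            # URL or internal doc id
    last_updated: str         # ISO date, e.g. "2025-01-15"

    def breadcrumb(self) -> str:
        """Provenance line to prepend when packing context or rendering citations."""
        return f"{self.doc_title} > {' > '.join(self.heading_path)} ({self.source_id})"
```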
3) Metadata is your strongest retrieval lever
Embedding similarity alone is rarely enough, especially in large corpora.
Add metadata fields that support filtering:
- Product area / team
- Region / locale
- Customer tier
- Doc type (policy, runbook, FAQ)
- Access level (public/internal)
Then implement hard filters before vector search whenever you can.
Example: if the user is asking about “EU refunds,” filtering to region=EU and doc_type=policy removes most ambiguity.
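In code, that means building a filter from the user's context and passing it alongside the query vector. The index.query call and filter syntax below are stand-ins for whatever your vector store exposes:

```python
def retrieve_filtered(index, query_embedding, user_ctx: dict, k: int = 50):
    """Apply hard metadata filters before (or alongside) vector scoring."""
    filters = {
        "region": user_ctx["region"],                 # e.g. "EU"
        "doc_type": {"$in": ["policy", "faq"]},
        "access_level": "internal" if user_ctx.get("is_employee") else "public",
    }
    # Chunks outside the user's region, doc types, or access level
    # never reach the candidate set, regardless of embedding similarity.
    return index.query(vector=query_embedding, top_k=k, filter=filters)
```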
4) Use hybrid search by default
Vector search is great for semantic similarity; keyword search is great for exact matches (SKUs, error codes, function names).
A robust pattern:
- Run BM25/keyword search and vector search in parallel.
- Merge results (weighted) into a single candidate set.
- Apply reranking (next section).
This reduces failures where the user includes identifiers like ERR_CONNECTION_RESET or “S3 403” that embeddings may not prioritize.
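One simple, robust merge rule is reciprocal rank fusion (RRF): it works on ranks alone, so you don't have to put BM25 and cosine scores on a comparable scale. A sketch:

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# candidates = rrf_merge([bm25_ids, vector_ids])[:100]
```

If you prefer weighted score fusion instead, normalize each score list first; otherwise the scale with the larger range silently dominates.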
5) Reranking is where quality jumps
Most retrieval stacks improve dramatically with reranking.
Recommended approach
- Retrieve top N=50–200 candidates cheaply (hybrid).
- Rerank to top k=5–10 using a stronger model:
  - Cross-encoder reranker
  - LLM-based reranker (costlier, sometimes best for complex queries)
Reranking helps when:
- many chunks are “kind of related”
- the query is long or ambiguous
- your corpus has repeated boilerplate
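A cross-encoder reranker is a few lines with sentence-transformers; the model name below is one commonly used MS MARCO reranker, and any cross-encoder checkpoint works the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 8) -> list[str]:
    """Score (query, chunk) pairs with the cross-encoder and keep the best top_k."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```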
6) Query rewriting: helpful, but constrain it
Query rewriting can increase recall, especially for messy user input. But it can also introduce drift.
If you add rewriting:
- Keep the original query.
- Log the rewrite.
- Put rewriting behind a simple policy (e.g., only rewrite when the query is short, contains pronouns, or lacks nouns).
A safe pattern is to generate multiple sub-queries:
- one literal query
- one expanded query (synonyms)
- one “structured intent” (entities + constraints)
Then retrieve with all of them and let reranking decide.
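A gated fan-out might look like this, where expand_with_llm is a stand-in for your model call and the gating heuristics are deliberately crude:

```python
def should_rewrite(query: str) -> bool:
    """Only rewrite short queries or ones that lean on pronouns."""
    words = query.lower().split()
    has_pronoun = any(w in {"it", "they", "this", "that"} for w in words)
    return len(words) < 4 or has_pronoun

def build_queries(query: str, expand_with_llm) -> list[str]:
    queries = [query]  # always keep and retrieve with the literal query
    if should_rewrite(query):
        queries.append(expand_with_llm(query, mode="synonyms"))
        queries.append(expand_with_llm(query, mode="entities_and_constraints"))
    return queries

# Retrieve with every query, merge the results (e.g. via RRF), and let the reranker decide.
```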
7) Context packing: give the generator usable inputs
Even with great retrieval, you can lose the benefit if you pack context poorly.
Context packing tips
- Provide chunks with clear separators and provenance.
- Prefer fewer, higher-quality chunks over many noisy ones.
- Include titles/headings so the model can orient.
- If chunks conflict, pass both and instruct the model to resolve by recency or policy hierarchy.
When possible, require citations: each claim must reference a chunk id.
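A packing sketch that uses the provenance fields from section 2 and asks for chunk-id citations; the separator and instruction wording are just one reasonable choice:

```python
def pack_context(chunks: list[dict]) -> str:
    """Each chunk dict carries text plus doc_title, heading_path, source_id, last_updated."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = (
            f"[chunk {i}] {c['doc_title']} > {' > '.join(c['heading_path'])} "
            f"({c['source_id']}, updated {c['last_updated']})"
        )
        blocks.append(header + "\n" + c["text"])
    instructions = (
        "Answer only using the sources above. Cite every claim with its [chunk N] id. "
        "If sources conflict, prefer the most recently updated policy document."
    )
    return "\n\n---\n\n".join(blocks) + "\n\n" + instructions
```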
8) Handle freshness and versioning explicitly
RAG often fails when docs change.
- Store last_updated and show it to the generator (and optionally the user).
- Re-embed on a schedule or on publish events.
- If your domain changes quickly, consider:
  - smaller chunks (faster re-embed)
  - separate indices per version
  - “current policy index” vs “archive index”
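Two small mechanics help here: flag stale chunks so the generator can hedge, and route queries to the right index. A sketch, where the date format and index objects are assumptions:

```python
from datetime import datetime, timedelta

def is_stale(last_updated: str, max_age_days: int = 365) -> bool:
    """True if a chunk's last_updated date (UTC, YYYY-MM-DD) is older than the refresh window."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d")
    return datetime.utcnow() - updated > timedelta(days=max_age_days)

def pick_index(wants_history: bool, current_index, archive_index):
    """Route to the archive only when the user explicitly asks about past versions."""
    return archive_index if wants_history else current_index
```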
9) Evaluate retrieval separately from generation
If you only evaluate final answers, you won’t know whether failures come from retrieval or generation.
Retrieval evaluation metrics
- Recall@k: does the correct chunk appear in the top k?
- MRR (mean reciprocal rank): how high does it appear?
- nDCG: usefulness-weighted ranking quality
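Recall@k and MRR are a few lines once you have labeled (query, relevant chunk ids) pairs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, 0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(eval_set, retrieve_fn, k: int = 10) -> dict:
    """eval_set: list of (query, relevant_ids); retrieve_fn returns a ranked list of chunk ids."""
    recalls, rr = [], []
    for query, relevant in eval_set:
        retrieved = retrieve_fn(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rr.append(mrr(retrieved, relevant))
    return {"recall@k": sum(recalls) / len(recalls), "mrr": sum(rr) / len(rr)}
```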
How to build a retrieval eval set
- Collect real queries.
- Label relevant chunks (even 1–3 per query is enough to start).
- Track performance over time as the corpus evolves.
A practical cadence:
- weekly: run retrieval evals on a fixed set
- per release: compare against baseline
- monthly: refresh the eval set with new query types
10) Common failure modes and fixes
Failure: irrelevant chunks dominate
- Add metadata filters
- Add reranking
- Reduce chunk size or remove boilerplate
Failure: model ignores the context
- Use structured citations
- Tighten instructions (“Answer only using sources”)
- Reduce context to fewer, higher-signal chunks
Failure: correct info exists but never retrieved
- Improve chunking boundaries
- Add keyword fields (titles, headings)
- Hybrid search + query expansion
Failure: conflicting sources
- Add doc hierarchy and recency rules
- Prefer canonical docs (policies/specs) over tickets
A simple baseline architecture
If you want a starting point that works in most real systems:
- Chunk by headings into ~300–800 tokens
- Store rich metadata and enforce filters
- Hybrid retrieve top 100
- Rerank to top 8
- Provide context with provenance and require citations
- Evaluate retrieval (Recall@k) + answer quality separately
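Stitched together, the baseline is a short pipeline. Each stage below is injected so you can swap stores, rerankers, and generators; the function names mirror the sketches above and are not a specific library's API:

```python
def answer(query: str, user_ctx: dict, *, bm25_search, vector_search,
           merge, rerank, pack, generate) -> str:
    keyword_hits = bm25_search(query, user_ctx, k=100)      # metadata filters applied inside
    vector_hits = vector_search(query, user_ctx, k=100)
    candidates = merge([keyword_hits, vector_hits])[:100]   # e.g. the RRF merge from section 4
    top_chunks = rerank(query, candidates, top_k=8)         # e.g. the cross-encoder from section 5
    context = pack(top_chunks)                              # provenance + citation instructions
    return generate(query, context)
```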
RAG isn’t magic. It’s search engineering with a generator attached. When you invest in retrieval fundamentals, the model becomes easier to control—and your product becomes easier to trust.
