“Our RAG is hallucinating” often translates to: retrieval is returning the wrong context, or the right context in the wrong shape.
A good retrieval layer makes answers boring—in the best way. It consistently returns relevant, high-signal passages with clear provenance, so the generator doesn’t have to guess.
This post focuses on retrieval design patterns you can apply immediately.
1) Start with the target: what does the model need to answer?
Don’t index everything the same way. First classify your knowledge base:
- Reference docs (policies, specs): stable, high precision required
- Tickets / conversations: messy, requires deduplication and privacy handling
- Product data (catalog, pricing): structured, best served via tools/queries
- Runbooks: procedural, often better as step snippets + citations
If the answer should come from a database query, don’t force it through embeddings. Use tool calling and return structured results.
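If pricing lives in a catalog table, for instance, the retrieval step is just a query plus a tool the model can call. A minimal sketch, where the table, columns, and get_price helper are illustrative and the tool schema follows the OpenAI-style function-calling shape:

```python
import sqlite3

def get_price(conn: sqlite3.Connection, sku: str) -> dict | None:
    """Look up the current price for a SKU directly from the catalog table."""
    row = conn.execute(
        "SELECT sku, price_cents, currency FROM catalog WHERE sku = ?", (sku,)
    ).fetchone()
    return None if row is None else {"sku": row[0], "price_cents": row[1], "currency": row[2]}

# Tool definition the generator can call instead of searching embeddings.
PRICE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_price",
        "description": "Return the current catalog price for a SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}
```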
2) Chunking is a product decision
Chunking determines what can be retrieved. Bad chunking creates impossible retrieval tasks.
Practical chunking rules
- Chunk by semantic boundaries: headings, sections, Q/A pairs.
- Keep chunks self-contained: include the definition that makes the chunk interpretable.
- Avoid giant chunks. As a starting point:
  - 200–500 tokens for dense reference docs
  - 500–1,000 tokens for procedural content
- Add overlap only when needed (e.g., 10–20%) and measure the effect.
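A minimal chunker along these lines splits on markdown headings first, then breaks oversized sections at paragraph boundaries. Token counts are approximated by whitespace splitting; swap in your real tokenizer:

```python
import re

def chunk_markdown(text: str, max_tokens: int = 400) -> list[str]:
    """Split on headings, then split oversized sections at paragraph boundaries."""
    sections = re.split(r"\n(?=#{1,6} )", text)  # keep each heading attached to its body
    chunks: list[str] = []
    for section in sections:
        current: list[str] = []
        current_len = 0
        for para in section.split("\n\n"):
            n = len(para.split())  # crude token estimate
            if current and current_len + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```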
Include “breadcrumb context”
For each chunk, store:
- Document title
- Section heading path (e.g., Refunds > Eligibility > Exceptions)
- Source URL/id
- Last updated timestamp
This improves user trust (citations) and helps the generator stay grounded.
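One way to carry that provenance around is a small per-chunk record; the field names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str
    heading_path: list[str]   # e.g. ["Refunds", "Eligibility", "Exceptions"]
    source_id: str            # URL or internal doc id
    last_updated: str         # ISO date, e.g. "2025-01-15"

    def breadcrumb(self) -> str:
        """Provenance line to prepend when packing context or rendering citations."""
        return f"{self.doc_title} > {' > '.join(self.heading_path)} ({self.source_id})"
```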
3) Metadata is your strongest retrieval lever
Embedding similarity alone is rarely enough, especially in large corpora.
Add metadata fields that support filtering:
- Product area / team
- Region / locale
- Customer tier
- Doc type (policy, runbook, FAQ)
- Access level (public/internal)
Then implement hard filters before vector search whenever you can.
Example: if the user is asking about “EU refunds,” filtering to region=EU and doc_type=policy removes most ambiguity.
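In code, that means building a filter from the user's context and passing it alongside the query vector. The index.query call and filter syntax below are stand-ins for whatever your vector store exposes:

```python
def retrieve_filtered(index, query_embedding, user_ctx: dict, k: int = 50):
    """Apply hard metadata filters before (or alongside) vector scoring."""
    filters = {
        "region": user_ctx["region"],                 # e.g. "EU"
        "doc_type": {"$in": ["policy", "faq"]},
        "access_level": "internal" if user_ctx.get("is_employee") else "public",
    }
    # Chunks outside the user's region, doc types, or access level
    # never reach the candidate set, regardless of embedding similarity.
    return index.query(vector=query_embedding, top_k=k, filter=filters)
```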
4) Use hybrid search by default
Vector search is great for semantic similarity; keyword search is great for exact matches (SKUs, error codes, function names).
A robust pattern:
- Run BM25/keyword search and vector search in parallel.
- Merge results (weighted) into a single candidate set.
- Apply reranking (next section).
This reduces failures where the user includes identifiers like ERR_CONNECTION_RESET or “S3 403” that embeddings may not prioritize.
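One simple, robust merge rule is reciprocal rank fusion (RRF): it works on ranks alone, so you don't have to put BM25 and cosine scores on a comparable scale. A sketch:

```python
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# candidates = rrf_merge([bm25_ids, vector_ids])[:100]
```

If you prefer weighted score fusion instead, normalize each score list first; otherwise the scale with the larger range silently dominates.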
5) Reranking is where quality jumps
Most retrieval stacks improve dramatically with reranking.
Recommended approach
- Retrieve top N=50–200 candidates cheaply (hybrid).
- Rerank to top k=5–10 using a stronger model:
  - Cross-encoder reranker
  - LLM-based reranker (costlier, sometimes best for complex queries)
Reranking helps when:
- many chunks are “kind of related”
- the query is long or ambiguous
- your corpus has repeated boilerplate
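A cross-encoder reranker is a few lines with sentence-transformers; the model name below is one commonly used MS MARCO reranker, and any cross-encoder checkpoint works the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 8) -> list[str]:
    """Score (query, chunk) pairs with the cross-encoder and keep the best top_k."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```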
6) Query rewriting: helpful, but constrain it
Query rewriting can increase recall, especially for messy user input. But it can also introduce drift.
If you add rewriting:
- Keep the original query.
- Log the rewrite.
- Put rewriting behind a simple policy (e.g., only rewrite when the query is short, contains pronouns, or lacks nouns).
A safe pattern is to generate multiple sub-queries:
- one literal query
- one expanded query (synonyms)
- one “structured intent” (entities + constraints)
Then retrieve with all of them and let reranking decide.
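A gated fan-out might look like this, where expand_with_llm is a stand-in for your model call and the gating heuristics are deliberately crude:

```python
def should_rewrite(query: str) -> bool:
    """Only rewrite short queries or ones that lean on pronouns."""
    words = query.lower().split()
    has_pronoun = any(w in {"it", "they", "this", "that"} for w in words)
    return len(words) < 4 or has_pronoun

def build_queries(query: str, expand_with_llm) -> list[str]:
    queries = [query]  # always keep and retrieve with the literal query
    if should_rewrite(query):
        queries.append(expand_with_llm(query, mode="synonyms"))
        queries.append(expand_with_llm(query, mode="entities_and_constraints"))
    return queries

# Retrieve with every query, merge the results (e.g. via RRF), and let the reranker decide.
```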
7) Context packing: give the generator usable inputs
Even with great retrieval, you can lose the benefit if you pack context poorly.
Context packing tips
- Provide chunks with clear separators and provenance.
- Prefer fewer, higher-quality chunks over many noisy ones.
- Include titles/headings so the model can orient.
- If chunks conflict, pass both and instruct the model to resolve by recency or policy hierarchy.
When possible, require citations: each claim must reference a chunk id.
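A packing sketch that uses the provenance fields from section 2 and asks for chunk-id citations; the separator and instruction wording are just one reasonable choice:

```python
def pack_context(chunks: list[dict]) -> str:
    """Each chunk dict carries text plus doc_title, heading_path, source_id, last_updated."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = (
            f"[chunk {i}] {c['doc_title']} > {' > '.join(c['heading_path'])} "
            f"({c['source_id']}, updated {c['last_updated']})"
        )
        blocks.append(header + "\n" + c["text"])
    instructions = (
        "Answer only using the sources above. Cite every claim with its [chunk N] id. "
        "If sources conflict, prefer the most recently updated policy document."
    )
    return "\n\n---\n\n".join(blocks) + "\n\n" + instructions
```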
8) Handle freshness and versioning explicitly
RAG often fails when docs change.
- Store last_updated and show it to the generator (and optionally the user).
- Re-embed on a schedule or on publish events.
- If your domain changes quickly, consider:
  - smaller chunks (faster re-embed)
  - separate indices per version
  - “current policy index” vs “archive index”
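Two small mechanics help here: flag stale chunks so the generator can hedge, and route queries to the right index. A sketch, where the date format and index objects are assumptions:

```python
from datetime import datetime, timedelta

def is_stale(last_updated: str, max_age_days: int = 365) -> bool:
    """True if a chunk's last_updated date (UTC, YYYY-MM-DD) is older than the refresh window."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d")
    return datetime.utcnow() - updated > timedelta(days=max_age_days)

def pick_index(wants_history: bool, current_index, archive_index):
    """Route to the archive only when the user explicitly asks about past versions."""
    return archive_index if wants_history else current_index
```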
9) Evaluate retrieval separately from generation
If you only evaluate final answers, you won’t know whether failures come from retrieval or generation.
Retrieval evaluation metrics
- Recall@k: does the correct chunk appear in the top k?
- MRR (mean reciprocal rank): how high does it appear?
- nDCG: usefulness-weighted ranking quality
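Recall@k and MRR are a few lines once you have labeled (query, relevant chunk ids) pairs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, 0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(eval_set, retrieve_fn, k: int = 10) -> dict:
    """eval_set: list of (query, relevant_ids); retrieve_fn returns a ranked list of chunk ids."""
    recalls, rr = [], []
    for query, relevant in eval_set:
        retrieved = retrieve_fn(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rr.append(mrr(retrieved, relevant))
    return {"recall@k": sum(recalls) / len(recalls), "mrr": sum(rr) / len(rr)}
```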
How to build a retrieval eval set
- Collect real queries.
- Label relevant chunks (even 1–3 per query is enough to start).
- Track performance over time as the corpus evolves.
A practical cadence:
- weekly: run retrieval evals on a fixed set
- per release: compare against baseline
- monthly: refresh the eval set with new query types
10) Common failure modes and fixes
Failure: irrelevant chunks dominate
- Add metadata filters
- Add reranking
- Reduce chunk size or remove boilerplate
Failure: model ignores the context
- Use structured citations
- Tighten instructions (“Answer only using sources”)
- Reduce context to fewer, higher-signal chunks
Failure: correct info exists but never retrieved
- Improve chunking boundaries
- Add keyword fields (titles, headings)
- Hybrid search + query expansion
Failure: conflicting sources
- Add doc hierarchy and recency rules
- Prefer canonical docs (policies/specs) over tickets
A simple baseline architecture
If you want a starting point that works in most real systems:
- Chunk by headings into ~300–800 tokens
- Store rich metadata and enforce filters
- Hybrid retrieve top 100
- Rerank to top 8
- Provide context with provenance and require citations
- Evaluate retrieval (Recall@k) + answer quality separately
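Stitched together, the baseline is a short pipeline. Each stage below is injected so you can swap stores, rerankers, and generators; the function names mirror the sketches above and are not a specific library's API:

```python
def answer(query: str, user_ctx: dict, *, bm25_search, vector_search,
           merge, rerank, pack, generate) -> str:
    keyword_hits = bm25_search(query, user_ctx, k=100)      # metadata filters applied inside
    vector_hits = vector_search(query, user_ctx, k=100)
    candidates = merge([keyword_hits, vector_hits])[:100]   # e.g. the RRF merge from section 4
    top_chunks = rerank(query, candidates, top_k=8)         # e.g. the cross-encoder from section 5
    context = pack(top_chunks)                              # provenance + citation instructions
    return generate(query, context)
```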
RAG isn’t magic. It’s search engineering with a generator attached. When you invest in retrieval fundamentals, the model becomes easier to control—and your product becomes easier to trust.
