RAG demos are easy. RAG in production is mostly retrieval, a tiny bit of generation, and a lot of plumbing nobody mentions. The magic, when there's any, lives in the retrieval — not the model.
Chunking is the part that determines whether RAG works at all. Splitting documents on token count is a starting point and a wrong one. We split on semantic boundaries — section headers, paragraph breaks, tables, code blocks — and we measure retrieval quality offline before any LLM call. If the right chunk isn't in the top-K, no model can recover.
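A minimal sketch of that boundary-aware splitter, assuming Markdown-ish input. The header regex, the 400-token budget, and the whitespace token count are illustrative stand-ins, not production values, and a real version would keep fenced code blocks and tables intact rather than splitting inside them:

```python
import re

MAX_TOKENS = 400  # illustrative budget; tune against your eval set

def rough_tokens(text: str) -> int:
    # Whitespace split as a cheap stand-in for a real tokenizer.
    return len(text.split())

def chunk_document(doc: str) -> list[str]:
    # Split on semantic boundaries first: section headers, code
    # fences, and blank-line paragraph breaks. (A production splitter
    # would also protect fence interiors and table rows.)
    blocks = re.split(r"(?=^#{1,6} )|(?=^`{3})|\n\s*\n", doc, flags=re.M)
    blocks = [b.strip() for b in blocks if b and b.strip()]

    # Then pack whole blocks into chunks up to the token budget,
    # never cutting a block mid-sentence.
    chunks: list[str] = []
    current: list[str] = []
    for block in blocks:
        size = sum(rough_tokens(b) for b in current) + rough_tokens(block)
        if current and size > MAX_TOKENS:
            chunks.append("\n\n".join(current))
            current = []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```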
Hybrid retrieval is non-negotiable. BM25 catches exact-match cases vector search misses (product SKUs, code identifiers, error codes). Vector search catches semantic intent. The combination, typically fused with reciprocal rank fusion, is what we ship. In our experience, pure vector retrieval leaves 20-30% of queries on the table.
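The fusion step itself is a few lines. A minimal sketch of reciprocal rank fusion, where each input list is doc IDs ordered best-first (one from BM25, one from vector search); the doc IDs below are hypothetical, and k=60 is the conventional constant from the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Standard RRF: score(d) = sum over lists of 1 / (k + rank of d).
    # Documents that rank well in either list float to the top;
    # k damps the influence of any single list's top ranks.
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query: BM25 nails the exact SKU,
# vector search surfaces the semantically related guide.
bm25_hits = ["sku-4417-spec", "faq-7", "install-guide"]
vector_hits = ["install-guide", "faq-7", "troubleshooting"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```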
The evaluation suite is the artifact that matters. We start every RAG project with 200-500 questions sourced from real users (support tickets, sales calls, anything human-generated). Each question has a known-correct answer or a known-correct citation. Every change to chunking, embeddings, retrieval, or prompt is measured against this set. No eval = no progress, just opinion.
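What that measurement loop looks like, as a minimal sketch: it assumes a JSONL file where each case carries a question and a known-correct chunk ID; the field names and the retrieve callable are placeholders for whatever stack is under test.

```python
import json
from typing import Callable

def hit_rate_at_k(path: str, retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    # One JSON object per line: {"question": ..., "expected_chunk_id": ...}
    # `retrieve` returns chunk IDs best-first for a query.
    hits = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            hits += case["expected_chunk_id"] in retrieve(case["question"])[:k]
            total += 1
    return hits / total

# Every candidate change runs against the same set before it ships:
# baseline  = hit_rate_at_k("eval.jsonl", current_stack)
# candidate = hit_rate_at_k("eval.jsonl", reranked_stack)
```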
What we don't do: pretend the embedding model picks itself, ship 'agentic' loops without latency budgets, or write multi-step plans with no fallback path. RAG that earns its place in production is boring, measured, and replaceable provider by provider.
