RAG demos are easy. RAG in production is mostly retrieval, a tiny bit of generation, and a lot of plumbing nobody mentions. The magic, when there's any, lives in the retrieval — not the model.
Chunking is the part that determines whether RAG works at all. Splitting documents on token count is a starting point and a wrong one. We split on semantic boundaries — section headers, paragraph breaks, tables, code blocks — and we measure retrieval quality offline before any LLM call. If the right chunk isn't in the top-K, no model can recover.
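A minimal sketch of that boundary-aware splitter, assuming Markdown-ish input. The header regex, the 400-token budget, and the whitespace token count are illustrative stand-ins, not production values, and a real version would keep fenced code blocks and tables intact rather than splitting inside them:

```python
import re

MAX_TOKENS = 400  # illustrative budget; tune against your eval set

def rough_tokens(text: str) -> int:
    # Whitespace split as a cheap stand-in for a real tokenizer.
    return len(text.split())

def chunk_document(doc: str) -> list[str]:
    # Split on semantic boundaries first: section headers, code
    # fences, and blank-line paragraph breaks. (A production splitter
    # would also protect fence interiors and table rows.)
    blocks = re.split(r"(?=^#{1,6} )|(?=^`{3})|\n\s*\n", doc, flags=re.M)
    blocks = [b.strip() for b in blocks if b and b.strip()]

    # Then pack whole blocks into chunks up to the token budget,
    # never cutting a block mid-sentence.
    chunks: list[str] = []
    current: list[str] = []
    for block in blocks:
        size = sum(rough_tokens(b) for b in current) + rough_tokens(block)
        if current and size > MAX_TOKENS:
            chunks.append("\n\n".join(current))
            current = []
        current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```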
Hybrid retrieval is non-negotiable. BM25 catches exact-match cases vector search misses (product SKUs, code identifiers, error codes). Vector search catches semantic intent. The combination, typically fused with reciprocal rank fusion, is what we ship. In our experience, pure vector retrieval leaves 20-30% of queries on the table.
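The fusion step itself is a few lines. A minimal sketch of reciprocal rank fusion, where each input list is doc IDs ordered best-first (one from BM25, one from vector search); the doc IDs below are hypothetical, and k=60 is the conventional constant from the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Standard RRF: score(d) = sum over lists of 1 / (k + rank of d).
    # Documents that rank well in either list float to the top;
    # k damps the influence of any single list's top ranks.
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query: BM25 nails the exact SKU,
# vector search surfaces the semantically related guide.
bm25_hits = ["sku-4417-spec", "faq-7", "install-guide"]
vector_hits = ["install-guide", "faq-7", "troubleshooting"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```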
The evaluation suite is the artifact that matters. We start every RAG project with 200-500 questions sourced from real users (support tickets, sales calls, anything human-generated). Each question has a known-correct answer or a known-correct citation. Every change to chunking, embeddings, retrieval, or prompt is measured against this set. No eval = no progress, just opinion.
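What that measurement loop looks like, as a minimal sketch: it assumes a JSONL file where each case carries a question and a known-correct chunk ID; the field names and the retrieve callable are placeholders for whatever stack is under test.

```python
import json
from typing import Callable

def hit_rate_at_k(path: str, retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    # One JSON object per line: {"question": ..., "expected_chunk_id": ...}
    # `retrieve` returns chunk IDs best-first for a query.
    hits = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            hits += case["expected_chunk_id"] in retrieve(case["question"])[:k]
            total += 1
    return hits / total

# Every candidate change runs against the same set before it ships:
# baseline  = hit_rate_at_k("eval.jsonl", current_stack)
# candidate = hit_rate_at_k("eval.jsonl", reranked_stack)
```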
What we don't do: pretend the embedding model picks itself, ship 'agentic' loops without latency budgets, or write multi-step plans with no fallback path. RAG that earns its place in production is boring, measured, and replaceable provider by provider.
