RAG in Production: Lessons From 50 Deployments
Practical lessons on chunking strategies, embedding models, and evaluation frameworks from our most successful RAG implementations.
Why Most RAG Systems Fail
We've deployed over 50 RAG systems across industries — legal, healthcare, fintech, and enterprise SaaS. The pattern we see again and again is that teams underestimate the complexity of retrieval and overestimate the capability of the language model.
The most common failure mode isn't the LLM hallucinating — it's the retrieval step returning irrelevant chunks. If your retriever doesn't surface the right context, even the best language model will produce confident-sounding garbage.
This is why we spend 70% of our RAG development time on the retrieval pipeline and only 30% on the generation layer. Get retrieval right, and the rest follows.
Chunking Strategy Matters More Than You Think
The naive approach — split documents into fixed-size chunks of 512 tokens — works for demos but breaks in production. Real documents have structure: headers, lists, tables, code blocks. Ignoring this structure means your chunks lose critical context.
Our go-to approach is semantic chunking: we parse document structure first, then create chunks that respect natural boundaries. A section heading always stays with its content. A table is never split across chunks. Code blocks are kept whole.
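A minimal sketch of this idea, splitting a markdown document on heading boundaries so each heading stays with its content and fenced code blocks are never split (illustrative only; a production parser would also handle tables and lists):

```python
import re

def semantic_chunks(markdown_text):
    """Split markdown into chunks at heading boundaries.

    Each chunk keeps its section heading with its content, and
    fenced code blocks are kept whole rather than split."""
    chunks, current = [], []
    in_code = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_code = not in_code  # track whether we are inside a code fence
        # Start a new chunk at each heading, unless inside a code fence
        if re.match(r"#{1,6} ", line) and not in_code and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

A heading line like `# not a heading` inside a code fence is correctly ignored, which is exactly the kind of structure a fixed-size splitter destroys.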
We've also found that chunk overlap is overrated. Instead of overlapping chunks by 20%, we add parent context — a summary of the surrounding section — to each chunk's metadata. This gives the retriever more signal without inflating the index.
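One way to sketch the parent-context approach. The `first_sentences` helper is a crude stand-in for a real summarizer (in practice this would be an LLM call), and the `sections` input shape is hypothetical:

```python
def first_sentences(text, n=2):
    """Crude stand-in for a real summarizer: keep the first n sentences."""
    parts = text.replace("\n", " ").split(". ")
    return ". ".join(parts[:n]).strip()

def enrich_chunks(sections):
    """Attach a parent-section summary to each chunk's metadata.

    `sections` maps a section title to its list of chunk strings
    (hypothetical shape).  Instead of overlapping chunk text, each
    chunk carries a summary of its surrounding section."""
    enriched = []
    for title, chunks in sections.items():
        parent_summary = first_sentences(" ".join(chunks))
        for i, chunk in enumerate(chunks):
            enriched.append({
                "text": chunk,
                "metadata": {
                    "section": title,
                    "parent_summary": parent_summary,
                    "chunk_index": i,
                },
            })
    return enriched
```

The index stores only the chunk text, so its size stays flat, while the retriever (or a reranker) can still see the surrounding section via metadata.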
Embedding Models: The Unsung Hero
Choosing the right embedding model has more impact on RAG quality than choosing the right LLM. We've tested dozens of models across MTEB benchmarks and real-world retrieval tasks. The gap between a mediocre and an excellent embedding model can mean the difference between 60% and 90% retrieval accuracy.
For most production use cases, we recommend domain-specific fine-tuned embeddings over general-purpose models. Fine-tuning on as few as 1,000 query-document pairs from your actual domain can boost retrieval accuracy by 15-25%.
Don't forget hybrid search. Combining dense embeddings with sparse BM25 retrieval consistently outperforms either approach alone. We run both in parallel and use reciprocal rank fusion to merge results.
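Reciprocal rank fusion itself is a few lines of code. A minimal sketch, assuming each retriever returns a ranked list of document ids (the doc ids below are made up; k=60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked doc-id lists (e.g. one from BM25, one from a
    dense retriever) by summing 1 / (k + rank) per document."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # rank is 0-based, so the top hit contributes 1 / (k + 1)
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["d3", "d1", "d7"]   # sparse (lexical) ranking
dense_hits = ["d1", "d5", "d3"]   # dense (embedding) ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem of calibrating BM25 scores against cosine similarities.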
Evaluation Is Non-Negotiable
You can't improve what you can't measure. Every RAG system we deploy includes an evaluation framework from day one. We track retrieval precision and recall, answer faithfulness (does the answer actually come from the retrieved context?), and answer relevance.
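The retrieval metrics above are simple to compute per query. A minimal sketch for precision@k and recall@k, where `relevant` is the set of doc ids an expert marked correct:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Retrieval precision@k and recall@k for one query.

    `retrieved` is the ranked list of doc ids the retriever
    returned; `relevant` is the set of ids judged correct."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these over a query set gives the headline numbers; faithfulness and answer relevance require judging generated text and are typically scored by an LLM-as-judge or a human.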
We build golden datasets — curated question-answer pairs verified by domain experts — for each deployment. These serve as regression tests: every change to the chunking strategy, embedding model, or prompt template is validated against the golden set before going to production.
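The regression gate can be sketched as a simple pass-rate check. Everything here is illustrative: the golden-set item shape, the `retrieve` callable (question → ranked doc ids), and the 90% threshold are assumptions, not our actual harness:

```python
def run_regression(golden_set, retrieve, threshold=0.9):
    """Gate a pipeline change on a golden set.

    Each golden item holds a question and the doc ids a domain
    expert verified as relevant.  A query passes if any of the
    top-5 retrieved ids is in the verified set."""
    passed = 0
    for item in golden_set:
        top = retrieve(item["question"])[:5]
        if any(doc_id in item["relevant_ids"] for doc_id in top):
            passed += 1
    score = passed / len(golden_set)
    return score >= threshold, score

golden = [
    {"question": "q1", "relevant_ids": {"d1"}},
    {"question": "q2", "relevant_ids": {"d9"}},
]
def candidate_retrieve(question):  # stand-in for the pipeline under test
    return ["d1", "d2"] if question == "q1" else ["d3"]

ok, score = run_regression(golden, candidate_retrieve)
```

Wiring this into CI means a chunking or embedding change that quietly degrades retrieval fails the build instead of reaching production.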
Automated evaluation with LLM-as-judge has gotten remarkably good. We use it for continuous monitoring, but critical decisions still go through human evaluation. The combination of automated and human eval is what gives us confidence to ship.