RAG Done Right: Lessons From Building Production Retrieval Systems

2026-03-22 · 8 min read

RAG · LLM · Architecture · Production ML

Retrieval-Augmented Generation sounds straightforward: retrieve relevant documents, stuff them into a prompt, get a better answer. The concept is simple. Getting it to work reliably in production is not.

We've built RAG systems for document analysis, knowledge bases, and intelligent search across multiple client engagements. Here's what we've learned — the architecture patterns that work, the pitfalls that waste months, and the decisions that matter most.

The Naive RAG Pipeline (And Why It Breaks)

Most RAG tutorials show this pipeline:

  1. Chunk your documents
  2. Embed the chunks with an embedding model
  3. Store embeddings in a vector database
  4. At query time, embed the query, find similar chunks, stuff them into a prompt
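The four steps above can be sketched end to end in a few lines. This is a deliberately toy version: the bag-of-words `embed` function and the in-memory list stand in for a real embedding model and vector database, just to make the pipeline's shape concrete.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a real embedding model call.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc, size=12):
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Steps 1-3: chunk, embed, store (here an in-memory list is the "vector DB").
docs = [
    "Payment terms: invoices are due within 30 days of receipt. "
    "Late payments accrue interest at 1.5 percent per month."
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# Step 4: embed the query, find similar chunks, stuff them into a prompt.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

context = "\n".join(retrieve("are payments due"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: when are payments due?"
```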

This works in demos. It breaks in production for three reasons:

Chunking destroys context. Splitting a 50-page contract into 500-token chunks loses the relationships between sections. A clause in paragraph 3 that references a definition in paragraph 47 becomes two disconnected chunks.

Semantic similarity isn't relevance. A query about "payment terms" might retrieve chunks about "payment processing" (semantically similar) instead of the actual payment terms clause (relevant but uses different language).

The LLM can't verify its sources. When you stuff multiple chunks into a prompt, the model might synthesize an answer from two chunks that shouldn't be combined, or confidently cite information that's actually from the wrong document.

Architecture That Actually Works

Hybrid Retrieval

Don't rely on vector search alone. Combine it with keyword search (BM25) in a hybrid retrieval strategy:

  • Vector search catches semantic matches — different words, same meaning
  • Keyword search catches exact matches — specific terms, names, codes
  • Reciprocal rank fusion merges the results
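Reciprocal rank fusion itself is only a few lines. A minimal sketch (the constant `k=60` comes from the original RRF paper; document IDs here are placeholders):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked result lists into one.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both vector and keyword search rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # semantic matches
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # exact-term matches
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_a and doc_b appear in both lists, so they outrank doc_c and doc_d
```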

In our experience, hybrid retrieval improves answer accuracy by 15-25% over vector-only search. The implementation is straightforward — most vector databases (Pinecone, Weaviate, Qdrant) support hybrid search natively.

Smart Chunking

Stop using fixed-size chunks. Use structure-aware chunking:

  • Documents with headings: Split on headings, preserving section hierarchy
  • Legal/regulatory documents: Split on clause boundaries
  • Code: Split on function/class boundaries
  • Conversations: Split on speaker turns

For every chunk, store metadata: source document, section hierarchy, page number, date. This metadata is critical for filtering and citation.

We add a parent-child relationship where each small retrieval chunk links to its parent section. The small chunk gets retrieved (better precision), but the larger parent section gets passed to the LLM (better context).
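A sketch of that parent-child indexing step, under the assumption that each document has already been split into (heading, text) sections; the field names are illustrative, not a fixed schema:

```python
def build_parent_child_index(sections, child_size=50):
    """Split each section into small retrieval chunks that link back to
    their parent section. Retrieve on the child; pass the parent to the LLM."""
    children = []
    for parent_id, (heading, text) in enumerate(sections):
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append({
                "text": " ".join(words[i:i + child_size]),  # retrieved unit
                "parent_id": parent_id,                     # context unit
                "metadata": {"section": heading},
            })
    return children

sections = [("Payment Terms", "Invoices are due within 30 days. " * 20)]
children = build_parent_child_index(sections)
# A match on children[i]["text"] is expanded to sections[children[i]["parent_id"]]
# before being passed to the LLM.
```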

Query Transformation

Raw user queries are often poor retrieval queries. A user asking "what's our refund policy?" might need chunks about "return policy," "cancellation terms," and "money-back guarantee."

Use a lightweight LLM call to transform the query before retrieval:

  • Expand the query with synonyms and related terms
  • Decompose complex questions into sub-queries
  • Identify the actual information need behind vague questions

This adds 200-500ms of latency but significantly improves retrieval quality.
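The transformation step can be a thin wrapper around that lightweight LLM call. In this sketch `llm` is a stand-in for any model client returning one rewrite per line; the offline fallback (with a hand-expanded variant) exists only to keep the example runnable without an API key:

```python
def transform_query(user_query, llm=None):
    """Rewrite a raw user query into better retrieval queries."""
    prompt = (
        "Rewrite the query for document retrieval:\n"
        "1. Expand it with synonyms and related terms.\n"
        "2. Decompose it into sub-questions if it is complex.\n"
        "3. State the underlying information need.\n"
        f"Query: {user_query}\n"
        "Return one rewrite per line."
    )
    if llm is None:
        # Offline stub: keep the original plus a hand-expanded variant.
        return [user_query,
                user_query + " return policy cancellation money-back guarantee"]
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

rewrites = transform_query("what's our refund policy?")
# Run retrieval once per rewrite, then fuse the result lists.
```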

Re-Ranking

After retrieval, use a cross-encoder re-ranker to reorder results by actual relevance to the query. Bi-encoder embeddings (used for initial retrieval) are fast but approximate. Cross-encoders are slower but much more accurate at judging query-document relevance.

The pipeline becomes: retrieve 20-50 candidates with hybrid search, re-rank to top 5-10 with a cross-encoder, pass the top results to the LLM.
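The re-ranking stage reduces to a sort over pairwise scores. Here `score_fn` stands in for a real cross-encoder's scoring call; the term-overlap scorer below is a crude lexical stand-in used only to keep the sketch self-contained:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Reorder retrieval candidates by query-document relevance."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def overlap_score(query, doc):
    # Toy scorer; a real cross-encoder encodes the (query, doc) pair jointly.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

candidates = [
    "payment processing fees and surcharges",
    "payment terms are net 30 days",
    "shipping and delivery policy",
]
top = rerank("payment terms", candidates, overlap_score, top_n=2)
```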

The Decisions That Matter Most

Embedding Model Selection

Your embedding model determines your retrieval ceiling. We've found:

  • For English-only, general purpose: OpenAI text-embedding-3-large or Cohere embed-v3 work well out of the box
  • For multilingual or domain-specific: Fine-tuned open-source models (e.g., fine-tuned E5 or BGE) significantly outperform general models
  • For code: Code-specific embeddings (CodeBERT, StarEncoder) outperform general embeddings by 20-30% on code retrieval

Test embedding models on your actual data before committing. We've seen 15% accuracy differences between models on the same dataset.

Chunk Size Tradeoffs

  • Smaller chunks (200-400 tokens): better retrieval precision, worse context for the LLM
  • Larger chunks (800-1200 tokens): better context, worse retrieval precision

The parent-child approach gives you both: retrieve on small chunks, pass large chunks to the LLM. If you can't implement parent-child, err on the side of larger chunks — the LLM handles extra context better than missing context.

When to Fine-Tune vs. Prompt Engineer

For most RAG applications, good retrieval + well-structured prompts is enough. Fine-tune the LLM only when:

  • You need a specific output format consistently
  • Domain terminology is causing errors
  • You need the LLM to follow domain-specific reasoning patterns

Fine-tuning the retrieval components (embedding model, re-ranker) gives more ROI than fine-tuning the LLM in most RAG systems.

Production Monitoring

RAG systems degrade silently. Build monitoring for:

  • Retrieval quality: Are the top-k retrieved chunks actually relevant? Sample and manually evaluate weekly.
  • Answer groundedness: Is the LLM's answer supported by the retrieved chunks? Use an LLM-as-judge to score groundedness.
  • Latency breakdown: Where is time being spent? Retrieval, re-ranking, or generation?
  • User feedback: Let users flag wrong answers. This is your most valuable signal.
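Of these, the latency breakdown is the easiest to instrument in-process. A minimal sketch using a context manager per pipeline stage (the stage names and the `sleep` calls are placeholders for real work):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Accumulate wall-clock time per pipeline stage for each request.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Wrap each pipeline stage; sleeps stand in for real work.
with timed("retrieval"):
    time.sleep(0.01)
with timed("rerank"):
    time.sleep(0.01)
with timed("generation"):
    time.sleep(0.02)

# Export `timings` per request to your metrics system to see where time goes.
```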

Common Pitfalls

Throwing everything into one index. Different document types need different chunking strategies and often different retrieval approaches. A FAQ and a legal contract shouldn't be processed the same way.

Ignoring stale data. If your document corpus changes, your embeddings need to be updated. Build an incremental indexing pipeline, not a batch-and-forget process.

Skipping evaluation. Build a test set of 50-100 question-answer pairs from your actual use case. Run it against every architecture change. Without this, you're optimizing blind.
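A harness for such an evaluation set can start very small. In this sketch `answer_fn` is your end-to-end pipeline, and substring containment is a deliberately crude stand-in for exact-match or LLM-as-judge scoring:

```python
def run_eval(eval_set, answer_fn):
    """Score a RAG pipeline against a fixed question/answer set."""
    failures = []
    for case in eval_set:
        answer = answer_fn(case["question"])
        if case["expected"].lower() not in answer.lower():
            failures.append(case["question"])
    return 1 - len(failures) / len(eval_set), failures

eval_set = [
    {"question": "When are invoices due?", "expected": "30 days"},
    {"question": "What interest do late payments accrue?", "expected": "1.5 percent"},
]
accuracy, failures = run_eval(
    eval_set,
    lambda q: "Invoices are due within 30 days; late payments accrue 1.5 percent monthly.",
)
```

Run this on every architecture change and track the accuracy number and the failing questions over time.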

Over-engineering on day one. Start with the simplest pipeline that works: basic chunking, single embedding model, no re-ranking. Add complexity only when you've measured where the bottleneck is.

What We'd Recommend

If you're building your first RAG system:

  1. Start simple — basic chunking, OpenAI embeddings, Pinecone or pgvector
  2. Build an evaluation set immediately
  3. Add hybrid retrieval (the highest-ROI improvement)
  4. Add re-ranking if retrieval quality is still insufficient
  5. Explore query transformation for complex queries
  6. Fine-tune embeddings only if domain-specific vocabulary is causing retrieval failures

The goal isn't a perfect system on day one. It's a system that works well enough to be useful, with the instrumentation to improve continuously.

Building a RAG system? Let's talk architecture →