RAG in Production: What Nobody Tells You

Everyone writes about RAG. The tutorials make it look easy: embed your documents, throw them into a vector database, wire up an LLM, done. Ship it.

I believed that too — until I had to build RAG systems that actually worked in production. Not demos. Not notebooks. Systems where real users asked unpredictable questions, where wrong answers had consequences, and where "it works most of the time" wasn't good enough.

Over the past two years, I've built and shipped three production RAG systems — an AI-powered documentation assistant deployed on AWS with Kubernetes, a parliamentary speech analysis platform processing 100K+ documents, and a QHSE AI assistant built with Spring AI and MCP. Each one taught me things no tutorial ever mentioned.

This is what I wish someone had told me before I started.

The Document Ingestion Trap

The first system I built was a documentation assistant — a chatbot that could answer questions about internal project documentation. The stack was Python, Weaviate as the vector store, and AWS SageMaker for the LLM. The deployment ran on EKS with Docker containers, orchestrated through a CI/CD pipeline.

The prototype took two weeks. Getting it production-ready took three months.

The first lesson hit immediately: your ingestion pipeline is your most important component. Not your LLM. Not your prompt. The way you process, chunk, and embed documents determines 80% of your system's quality.

Chunking Is Not Splitting

Every tutorial says "split your documents into chunks of ~500 tokens." This advice is technically correct and practically useless.

In the documentation assistant, project docs had nested structures — specs with subsections referencing other subsections, tables that only made sense with their headers, code blocks that lost meaning when cut in half. A naive recursive text splitter destroyed all of this context.

What actually worked:

Structure-aware chunking — parsing the document format first (Markdown headers, HTML sections, PDF layout) and chunking along semantic boundaries, not token counts
Overlapping windows with metadata — 15-20% overlap between chunks, with parent section titles injected as metadata so the retriever could score relevance against the broader context
Table preservation — tables were extracted whole and stored as separate chunks with a "table" type flag, never split across rows
Small chunks for retrieval, large chunks for generation — the "parent-child" pattern, where you embed small granular chunks but retrieve the larger parent when a match is found

Lesson Learned

If your retrieval accuracy is bad, don't tune your prompt or swap your LLM. Fix your chunking first. I've seen 20-30% accuracy improvements just from better chunking — more than any prompt engineering ever delivered.

The Retrieval Quality Problem

The second system — a parliamentary RAG platform — forced me to confront retrieval quality at scale. We had 100,000+ parliamentary speeches that citizens could query in natural language. The topic classification pipeline hit 95% accuracy, but the RAG component initially struggled.

The problem was deceptively simple: semantic similarity is not the same as relevance.

A citizen asking "What did parliament say about healthcare reform in 2023?" would get semantically similar results — speeches that mentioned healthcare — but not the most relevant ones. The actual policy debates, the key votes, the turning-point speeches were buried under dozens of tangential mentions.

Hybrid Search Saved Us

Pure vector search wasn't enough. Pure keyword search (BM25) wasn't enough either. The breakthrough came from combining both:

Vector search caught semantic intent — understanding that "healthcare reform" relates to "medical policy changes"
BM25 keyword search caught exact terminology — specific bill numbers, politician names, legal terms that embeddings often blur
Reciprocal Rank Fusion merged the two result sets, giving us significantly better recall than either approach alone

At the scale of 100K+ documents, I also learned that metadata filtering is not optional. Date ranges, document types, speaker roles, parliamentary sessions — these filters happen before the vector search, dramatically reducing the search space and improving both speed and relevance.

Re-Ranking: The Hidden Multiplier

Here's something almost no tutorial covers: your initial retrieval results are just candidates. They're not the final context you should feed to your LLM.

In the documentation assistant, I added a cross-encoder re-ranking step after retrieval. The process:

Retrieve top-20 chunks from Weaviate using hybrid search
Re-rank those 20 chunks using a cross-encoder model that scores each (query, chunk) pair
Pass only the top-5 re-ranked chunks to the LLM

The cross-encoder is slower than embedding similarity (it processes each pair individually), but it's dramatically more accurate because it sees the query and the document together, not as independent embeddings.

Retrieval gets you in the neighborhood. Re-ranking gets you to the right door.

The accuracy improvement was striking. On our internal evaluation set, re-ranking lifted answer quality from roughly 72% to 89% — measured by human evaluation of "did this answer the question correctly and completely."

The Failures Nobody Talks About

Production RAG systems fail in ways that are invisible in demos. Here are the failure modes I've encountered across all three systems:

1. The Confident Wrong Answer

The LLM retrieves somewhat-relevant context and generates a fluent, confident answer that's subtly wrong. The user has no way to know. This is worse than no answer at all.

My fix: always surface the source chunks alongside the answer. In the parliamentary system, every response included clickable references to the original speeches. In the QHSE platform, answers linked back to the specific regulation or audit document. This doesn't prevent hallucination, but it makes it verifiable.

2. The Unanswerable Question

Users ask questions your knowledge base simply cannot answer. A naive RAG system will retrieve the "least irrelevant" chunks and hallucinate an answer anyway.

The solution that worked: a relevance threshold on retrieval scores. If the best chunk scores below a calibrated threshold, the system responds with "I don't have enough information to answer this" instead of guessing. Calibrating that threshold per-domain took experimentation — too high and you reject valid queries, too low and you let garbage through.

3. The Stale Knowledge Problem

Documents get updated. Regulations change. New speeches get published. Your vector store doesn't know this unless you tell it.

In the QHSE platform (built with Spring AI, MCP, and Groq), I implemented an incremental ingestion pipeline — documents were versioned, and updates triggered re-embedding of only the changed sections. Deletes were handled through metadata-based invalidation, not full re-indexing.

4. The Multi-Hop Question

"How did the healthcare budget compare between 2022 and 2023?" This requires retrieving from two different time periods and synthesizing. Single-shot retrieval fails here.

The approach that worked: query decomposition. The LLM first breaks the question into sub-queries ("healthcare budget 2022" and "healthcare budget 2023"), retrieves separately for each, then synthesizes. It's slower but dramatically more accurate for comparative or multi-step questions.

Infrastructure Lessons from AWS

The documentation assistant ran on AWS EKS — Kubernetes-managed containers with SageMaker endpoints for the LLM. Here's what I learned about running RAG in the cloud:

Separate your embedding service from your generation service — they have completely different scaling profiles. Embedding is CPU/GPU-bound and bursty during ingestion. Generation is latency-sensitive and scales with user traffic
Vector database sizing matters more than you think — Weaviate's memory footprint grows linearly with the number of vectors. We underestimated this initially and hit OOM kills in production. Plan for 2-3x your expected index size
Cache aggressively — identical or near-identical queries are more common than you'd expect. A simple query hash cache with a 1-hour TTL cut our SageMaker inference costs significantly
S3 as the source of truth — raw documents lived in S3, processed chunks in the vector store. If the vector store corrupted or needed re-indexing, we could always rebuild from S3. This saved us twice

Architecture Tip

Your vector database is a derived store, not a primary store. Treat it like a cache that can be rebuilt. Keep your raw documents and processing metadata somewhere durable. When — not if — you need to re-index with a new embedding model or different chunking strategy, you'll be glad you did.

Evaluation: The Hardest Part

How do you know your RAG system is actually good? This is the question I still struggle with, even after three production systems.

What I've settled on is a three-layer evaluation:

Retrieval quality — given a question, are the right chunks in the top-5? Measured by recall@5 against a manually curated test set. This is the most important metric and the one most people skip
Answer quality — given the right chunks, does the LLM produce a correct answer? This isolates generation from retrieval. Measured by human evaluation on a sample
End-to-end quality — given a question, is the final answer correct? This is what users experience. Measured by a combination of automated checks and periodic human review

The critical insight: if retrieval is broken, nothing downstream can fix it. I've seen teams spend weeks tuning prompts when the real problem was that the right documents weren't being retrieved in the first place. Always debug from retrieval upward.

What I'd Do Differently

If I were starting a new RAG system tomorrow, here's what I'd change:

Start with evaluation, not with building — create your test set of 50-100 question/answer pairs before writing a single line of pipeline code. Every design decision should be measured against this set
Invest in chunking early — spend a week on chunking strategy before touching retrieval or generation. The ROI is enormous
Hybrid search from day one — don't start with pure vector search thinking you'll "add BM25 later." The architectural assumptions are different
Build the feedback loop first — instrument your system to capture which answers users found helpful and which they didn't. This data is gold for iterating
Keep your LLM swappable — I've switched LLMs mid-project twice. Abstract the generation layer so you can swap models without rewiring your pipeline

RAG is not a solved problem. It's an engineering discipline — one that requires as much attention to data quality, infrastructure, and evaluation as it does to the AI itself. The tutorials get you 20% of the way there. The remaining 80% is what happens when real users meet real documents in production.

That 80% is where the actual work — and the actual value — lives.

RAG in Production:What Nobody Tells You

The Document Ingestion Trap

Chunking Is Not Splitting

The Retrieval Quality Problem

Hybrid Search Saved Us

Re-Ranking: The Hidden Multiplier

The Failures Nobody Talks About

1. The Confident Wrong Answer

2. The Unanswerable Question

3. The Stale Knowledge Problem

4. The Multi-Hop Question

Infrastructure Lessons from AWS

Evaluation: The Hardest Part

What I'd Do Differently

RAG in Production:
What Nobody Tells You