March 2026 · 8 min read

RAG in Production: What Nobody Tells You

Beyond the basics: chunking strategies, retrieval quality, re-ranking, and the subtle failures that kill accuracy in real deployments. Drawn from three production systems.

Tags: RAG · LLMs · Production · AWS

Everyone writes about RAG. The tutorials make it look easy: embed your documents, throw them into a vector database, wire up an LLM, done. Ship it.

I believed that too — until I had to build RAG systems that actually worked in production. Not demos. Not notebooks. Systems where real users asked unpredictable questions, where wrong answers had consequences, and where "it works most of the time" wasn't good enough.

Over the past two years, I've built and shipped three production RAG systems — an AI-powered documentation assistant deployed on AWS with Kubernetes, a parliamentary speech analysis platform processing 100K+ documents, and a QHSE AI assistant built with Spring AI and MCP. Each one taught me things no tutorial ever mentioned.

This is what I wish someone had told me before I started.


The Document Ingestion Trap

The first system I built was a documentation assistant — a chatbot that could answer questions about internal project documentation. The stack was Python, Weaviate as the vector store, and AWS SageMaker for the LLM. The deployment ran on EKS with Docker containers, orchestrated through a CI/CD pipeline.

The prototype took two weeks. Getting it production-ready took three months.

The first lesson hit immediately: your ingestion pipeline is your most important component. Not your LLM. Not your prompt. The way you process, chunk, and embed documents determines 80% of your system's quality.

Chunking Is Not Splitting

Every tutorial says "split your documents into chunks of ~500 tokens." This advice is technically correct and practically useless.

In the documentation assistant, project docs had nested structures — specs with subsections referencing other subsections, tables that only made sense with their headers, code blocks that lost meaning when cut in half. A naive recursive text splitter destroyed all of this context.

What actually worked was structure-aware chunking: split on section boundaries instead of fixed token counts, keep every table attached to its header, and treat code blocks as atomic units that never get cut in half.
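Here's a minimal sketch of the idea, assuming markdown-style docs. The helper names and the whitespace-based token count are illustrative, not the production code:

```python
import re

def count_tokens(text: str) -> int:
    # Crude proxy; a real pipeline would use the embedding model's tokenizer.
    return len(text.split())

def split_blocks(section: str) -> list[str]:
    """Split a section into paragraph blocks, keeping fenced code blocks
    whole. Markdown tables have no blank lines inside them, so they
    naturally stay within a single block."""
    blocks, buf, in_code = [], [], False
    for line in section.splitlines():
        if line.startswith("```"):
            in_code = not in_code
        buf.append(line)
        if not in_code and not line.strip():
            blocks.append("\n".join(buf).strip())
            buf = []
    if buf:
        blocks.append("\n".join(buf).strip())
    return [b for b in blocks if b]

def chunk_document(doc: str, max_tokens: int = 500) -> list[dict]:
    """Chunk on section boundaries; code blocks and tables stay atomic."""
    chunks = []
    # Each markdown heading starts a new section, so no chunk straddles one.
    for section in re.split(r"(?m)^(?=#{1,6} )", doc):
        if not section.strip():
            continue
        heading = section.splitlines()[0].lstrip("#").strip()
        current, size = [], 0
        for block in split_blocks(section):
            block_size = count_tokens(block)
            if current and size + block_size > max_tokens:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                current, size = [], 0
            current.append(block)
            size += block_size
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks
```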

Lesson Learned

If your retrieval accuracy is bad, don't tune your prompt or swap your LLM. Fix your chunking first. I've seen 20-30% accuracy improvements just from better chunking — more than any prompt engineering ever delivered.

The Retrieval Quality Problem

The second system — a parliamentary RAG platform — forced me to confront retrieval quality at scale. We had 100,000+ parliamentary speeches that citizens could query in natural language. The topic classification pipeline hit 95% accuracy, but the RAG component initially struggled.

The problem was deceptively simple: semantic similarity is not the same as relevance.

A citizen asking "What did parliament say about healthcare reform in 2023?" would get semantically similar results — speeches that mentioned healthcare — but not the most relevant ones. The actual policy debates, the key votes, the turning-point speeches were buried under dozens of tangential mentions.

Hybrid Search Saved Us

Pure vector search wasn't enough. Pure keyword search (BM25) wasn't enough either. The breakthrough came from combining both: run the two searches in parallel, then fuse the ranked lists so a speech that scores well on either signal rises to the top.
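One standard fusion recipe is reciprocal rank fusion (RRF). The sketch below illustrates the idea rather than the exact fusion the system shipped with:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs. Each document earns
    1 / (k + rank) per list it appears in; k=60 is the conventional default."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse each retriever's top results into one candidate list:
# fused = reciprocal_rank_fusion([bm25_top20, vector_top20])
```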

At the scale of 100K+ documents, I also learned that metadata filtering is not optional. Date ranges, document types, speaker roles, parliamentary sessions — these filters happen before the vector search, dramatically reducing the search space and improving both speed and relevance.
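Conceptually the order of operations looks like the sketch below. The field names are illustrative, and a real vector store (Weaviate included) pushes these predicates into the index rather than scanning in Python:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filtered_search(query_vec, docs, speaker_role=None, date_from=None,
                    date_to=None, top_k=20):
    """Metadata predicates first, vector similarity second."""
    candidates = [
        d for d in docs
        if (speaker_role is None or d["speaker_role"] == speaker_role)
        and (date_from is None or d["date"] >= date_from)
        and (date_to is None or d["date"] <= date_to)
    ]
    # Only the survivors of the cheap filters get ranked by similarity.
    candidates.sort(key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return candidates[:top_k]
```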

Re-Ranking: The Hidden Multiplier

Here's something almost no tutorial covers: your initial retrieval results are just candidates. They're not the final context you should feed to your LLM.

In the documentation assistant, I added a cross-encoder re-ranking step after retrieval. The process: over-retrieve a deliberately large candidate set by vector similarity, score every (query, chunk) pair with a cross-encoder, and keep only the top few as the LLM's context.
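A compact version of that pipeline, using the sentence-transformers CrossEncoder API. The MS MARCO checkpoint is a common default rather than necessarily the production model, and vector_store in the usage comment is a hypothetical client:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder checkpoint works here; this one is a common default.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Unlike a bi-encoder, the cross-encoder scores query and chunk jointly.
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]

# Over-retrieve, then re-rank down to a small, high-precision context:
# context = rerank(query, vector_store.search(query, top_k=50), keep=5)
```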

The cross-encoder is slower than embedding similarity (it processes each pair individually), but it's dramatically more accurate because it sees the query and the document together, not as independent embeddings.

Retrieval gets you in the neighborhood. Re-ranking gets you to the right door.

The accuracy improvement was striking. On our internal evaluation set, re-ranking lifted answer quality from roughly 72% to 89% — measured by human evaluation of "did this answer the question correctly and completely."

The Failures Nobody Talks About

Production RAG systems fail in ways that are invisible in demos. Here are the failure modes I've encountered across all three systems:

1. The Confident Wrong Answer

The LLM retrieves somewhat-relevant context and generates a fluent, confident answer that's subtly wrong. The user has no way to know. This is worse than no answer at all.

My fix: always surface the source chunks alongside the answer. In the parliamentary system, every response included clickable references to the original speeches. In the QHSE platform, answers linked back to the specific regulation or audit document. This doesn't prevent hallucination, but it makes it verifiable.
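The shape of the response matters more than the exact stack. A minimal sketch, with retriever and llm as hypothetical callables:

```python
def answer_with_sources(query: str, retriever, llm) -> dict:
    """Return the answer together with the chunks that produced it,
    so users can verify instead of having to trust."""
    hits = retriever.search(query, top_k=5)  # [(chunk_text, metadata), ...]
    context = "\n\n".join(text for text, _ in hits)
    return {
        "answer": llm(f"Context:\n{context}\n\nQuestion: {query}"),
        # e.g. speech ID, speaker, date, deep link to the original document
        "sources": [meta for _, meta in hits],
    }
```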

2. The Unanswerable Question

Users ask questions your knowledge base simply cannot answer. A naive RAG system will retrieve the "least irrelevant" chunks and hallucinate an answer anyway.

The solution that worked: a relevance threshold on retrieval scores. If the best chunk scores below a calibrated threshold, the system responds with "I don't have enough information to answer this" instead of guessing. Calibrating that threshold per-domain took experimentation — too high and you reject valid queries, too low and you let garbage through.
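The gate itself is tiny. In this sketch, retriever and llm are hypothetical callables, and the threshold value is a placeholder for whatever your calibration produces:

```python
NO_ANSWER = "I don't have enough information to answer this."

def guarded_answer(query: str, retriever, llm, threshold: float = 0.35) -> str:
    hits = retriever.search(query, top_k=5)  # [(chunk_text, score), ...]
    # Refuse when even the best chunk is a weak match for the query.
    if not hits or hits[0][1] < threshold:
        return NO_ANSWER
    context = "\n\n".join(text for text, _ in hits)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```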

3. The Stale Knowledge Problem

Documents get updated. Regulations change. New speeches get published. Your vector store doesn't know this unless you tell it.

In the QHSE platform (built with Spring AI, MCP, and Groq), I implemented an incremental ingestion pipeline — documents were versioned, and updates triggered re-embedding of only the changed sections. Deletes were handled through metadata-based invalidation, not full re-indexing.
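The platform itself is Spring AI (Java), but the idea translates directly. An illustrative Python sketch, where store is a hypothetical client with get_hashes, upsert, and invalidate methods:

```python
import hashlib

def sync_document(doc_id: str, sections: dict, store) -> None:
    """Re-embed only what changed in one versioned document.
    `sections` maps section IDs to their current text."""
    old = store.get_hashes(doc_id)  # {section_id: sha256 of last ingested text}
    new = {
        sid: hashlib.sha256(text.encode()).hexdigest()
        for sid, text in sections.items()
    }
    for sid, digest in new.items():
        if old.get(sid) != digest:            # new or changed section
            store.upsert(doc_id, sid, sections[sid], content_hash=digest)
    for sid in old.keys() - new.keys():       # section deleted upstream
        store.invalidate(doc_id, sid)         # metadata flag, no full re-index
```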

4. The Multi-Hop Question

"How did the healthcare budget compare between 2022 and 2023?" This requires retrieving from two different time periods and synthesizing. Single-shot retrieval fails here.

The approach that worked: query decomposition. The LLM first breaks the question into sub-queries ("healthcare budget 2022" and "healthcare budget 2023"), retrieves separately for each, then synthesizes. It's slower but dramatically more accurate for comparative or multi-step questions.
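A minimal sketch of the loop, with llm and retrieve as hypothetical callables:

```python
def answer_multi_hop(question: str, llm, retrieve) -> str:
    # Step 1: let the LLM decompose the question into search queries.
    sub_queries = llm(
        "Break this question into independent search queries, "
        "one per line:\n" + question
    ).splitlines()
    # Step 2: retrieve evidence separately for each sub-query.
    evidence = []
    for sq in (q.strip() for q in sub_queries):
        if not sq:
            continue
        chunks = retrieve(sq, top_k=3)
        evidence.append(f"# {sq}\n" + "\n".join(chunks))
    # Step 3: synthesize a single answer from the combined evidence.
    return llm(
        "Using only the evidence below, answer the original question.\n\n"
        + "\n\n".join(evidence)
        + f"\n\nQuestion: {question}"
    )
```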

Infrastructure Lessons from AWS

The documentation assistant ran on AWS EKS: Kubernetes-managed containers with SageMaker endpoints for the LLM. Of everything I learned running RAG in the cloud, one lesson stands above the rest:

Architecture Tip

Your vector database is a derived store, not a primary store. Treat it like a cache that can be rebuilt. Keep your raw documents and processing metadata somewhere durable. When — not if — you need to re-index with a new embedding model or different chunking strategy, you'll be glad you did.
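To make "rebuildable" concrete, here is an illustrative re-index job that streams the durable raw docs from S3. It reuses the chunk_document sketch from the chunking section; embed and index are hypothetical stand-ins for your embedding function and vector-store client:

```python
import boto3

def rebuild_index(bucket: str, prefix: str, embed, index) -> None:
    """Rebuild the whole vector index from raw documents in S3.
    Because the index is derived, switching embedding models or
    chunking strategies is just a re-run of this job."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for chunk in chunk_document(body.decode("utf-8")):
                index.upsert(
                    doc_key=obj["Key"],
                    heading=chunk["heading"],
                    text=chunk["text"],
                    vector=embed(chunk["text"]),
                )
```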

Evaluation: The Hardest Part

How do you know your RAG system is actually good? This is the question I still struggle with, even after three production systems.

What I've settled on is a three-layer evaluation: retrieval first (does a known-relevant chunk land in the top-k for a fixed query set?), then generation (is the answer faithful to the retrieved context?), then end-to-end human review of whether the answer is correct and complete.
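For the retrieval layer, a small recall@k harness goes a long way. The eval_set shape and the retrieve callable below are illustrative:

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Layer one: is a known-relevant chunk in the top-k?
    eval_set items: {"query": str, "relevant_ids": set of chunk IDs};
    retrieve(query, top_k) returns ranked chunk IDs."""
    hits = sum(
        1 for item in eval_set
        if set(retrieve(item["query"], top_k=k)) & item["relevant_ids"]
    )
    return hits / len(eval_set)
```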

The critical insight: if retrieval is broken, nothing downstream can fix it. I've seen teams spend weeks tuning prompts when the real problem was that the right documents weren't being retrieved in the first place. Always debug from retrieval upward.

What I'd Do Differently

If I were starting a new RAG system tomorrow, here's what I'd change: build the evaluation set and the ingestion pipeline before touching prompts, ship hybrid search and re-ranking from day one instead of retrofitting them, put a relevance threshold in front of generation from the start, and treat the vector store as a rebuildable cache from the first commit.


RAG is not a solved problem. It's an engineering discipline — one that requires as much attention to data quality, infrastructure, and evaluation as it does to the AI itself. The tutorials get you 20% of the way there. The remaining 80% is what happens when real users meet real documents in production.

That 80% is where the actual work — and the actual value — lives.