RAG looks like magic in English demos. You drop in PDFs, retrieve, prompt the model, and it talks. I shipped one in Egyptian Arabic for an agriculture use case. On day one, retrieval came back almost empty. Users typed dialect on their phones; my docs were in Modern Standard Arabic. The English playbook didn't survive them.
Here are the six small fixes that turned almost-nothing into something I'm willing to put in front of farmers.
Arabic Isn't One Language
No two Arabic users type the same way. The same word can arrive with different alif shapes (ٱ, آ, إ, أ), with or without diacritics, and with stray RTL marks left behind by the keyboard. Dialect is a separate problem on top: the dictionary form your docs use and what a farmer types on WhatsApp are often very different. Most embedding models, even multilingual ones, can't tell that "how do I water my courgette in summer" and ازاى اروى الكوسة فى الصيف؟ mean the same thing.
Fix #1 — Normalize the Query, Not the Docs
The simplest win came first. On every incoming query I strip diacritics, the tatweel character, RTL marks, and unify the alif variants into a single ا. Ya folds into maqsura. One thing I deliberately leave alone: ta-marbuta. It carries meaning in technical Arabic, and stripping it muddied real distinctions.
The harder rule I learned: don't run the same normalization on your knowledge base. You'll quietly damage the exact words farmers search for. Normalize what comes in. Leave what's already indexed.
```python
import re

def normalize_query(text: str) -> str:
    text = re.sub(r"[\u064B-\u0652\u0670\u0640\u200E\u200F]", "", text)  # diacritics, tatweel, RTL marks
    text = re.sub(r"[ٱآإأ]", "ا", text)  # unify alif variants
    text = re.sub(r"ي", "ى", text)       # fold ya into maqsura
    return re.sub(r"\s+", " ", text).strip()
```
Fix #2 — Rewrite Dialect Into MSA, but Freeze the Nouns
Normalization fixes spelling, not dialect. So before retrieval, I send the query through a small GPT-4o call with a few-shot prompt that rewrites Egyptian colloquial into MSA.
- Original (Egyptian): ازاى اروى الكوسة فى الصيف؟ ("how do I water the courgette in summer?")
- Rewritten (MSA): كيف أروي الكوسة في فصل الصيف؟ ("how do I water the courgette in the summer season?")
The trick most people miss is in the prompt itself: tell the model what not to touch. Crop names, pesticide names, fertilizer names, varieties, numbers, dates, and units stay verbatim. If GPT "translates" a courgette name, your knowledge base will never match it.
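Concretely, the rewriter call can look like this (the exact wording, the single few-shot pair, and the model settings are illustrative, not the production prompt):

```python
from openai import OpenAI

client = OpenAI()

REWRITE_SYSTEM = """You rewrite Egyptian Arabic questions into Modern Standard Arabic.
Do NOT change crop names, pesticide names, fertilizer names, varieties,
numbers, dates, or units. Copy them verbatim.
Return only the rewritten question."""

FEW_SHOT = [
    {"role": "user", "content": "ازاى اروى الكوسة فى الصيف؟"},
    {"role": "assistant", "content": "كيف أروي الكوسة في فصل الصيف؟"},
]

def rewrite_to_msa(client: OpenAI, raw_query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "system", "content": REWRITE_SYSTEM},
                  *FEW_SHOT,
                  {"role": "user", "content": raw_query}],
    )
    return resp.choices[0].message.content.strip()
```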
A few details earned their keep. I pass the original, unnormalized query into the rewriter — register cues live in the spelling. I cache rewrites by the normalized form, so dialect variants share a slot. I cache failures too. I fail open: if the rewriter dies, retrieval falls back to the original. Then I run both versions and merge by max score.
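A sketch of that wiring, under the same caveat: normalize_query and rewrite_to_msa are the helpers above, and retrieve is whatever retrieval callable you have (Fix #3's hybrid search, in my case):

```python
from typing import Callable

_rewrite_cache: dict[str, str | None] = {}  # keyed by the NORMALIZED form: dialect variants share a slot

def rewrite_with_cache(client, raw_query: str) -> str | None:
    key = normalize_query(raw_query)
    if key not in _rewrite_cache:
        try:
            # pass the ORIGINAL, unnormalized query: register cues live in the spelling
            _rewrite_cache[key] = rewrite_to_msa(client, raw_query)
        except Exception:
            _rewrite_cache[key] = None  # cache failures too
    return _rewrite_cache[key]

def retrieve_merged(client, retrieve: Callable, raw_query: str) -> list[tuple[str, float]]:
    msa = rewrite_with_cache(client, raw_query)
    queries = [raw_query] + ([msa] if msa else [])  # fail open: no rewrite, original only
    merged: dict[str, float] = {}
    for q in queries:                               # run both versions...
        for node_id, score in retrieve(q):
            merged[node_id] = max(merged.get(node_id, 0.0), score)  # ...merge by max score
    return sorted(merged.items(), key=lambda kv: -kv[1])
```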
Fix #3 — Hybrid Search With the Arabic Tokenizer
Vector search alone misses crop names. BM25 alone misses dialect paraphrases. Together they cover both. I use pgvector for the semantic side and Postgres full-text for the keyword side, merged by node ID with the higher score winning.
The catch is small but vicious: Postgres' default full-text config tokenizes English. On Arabic it produces nonsense. You have to switch to the Arabic full-text config explicitly. That one line did more for recall than any prompt change.
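A hedged sketch of both legs, assuming psycopg and a chunks table with node_id, content, and a pgvector embedding column. In production you would precompute the tsvector into its own indexed column; this inline form just shows the config switch:

```python
import psycopg  # assuming psycopg 3; the post doesn't name a driver

VECTOR_SQL = """
    SELECT node_id, 1 - (embedding <=> %(qvec)s::vector) AS score
    FROM chunks
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT 10
"""

# The keyword leg. The one line that mattered: 'arabic', not the default config.
KEYWORD_SQL = """
    SELECT node_id,
           ts_rank(to_tsvector('arabic', content), plainto_tsquery('arabic', %(q)s)) AS score
    FROM chunks
    WHERE to_tsvector('arabic', content) @@ plainto_tsquery('arabic', %(q)s)
    LIMIT 10
"""

def hybrid_search(conn: psycopg.Connection, query: str, qvec: list[float]) -> list[tuple[str, float]]:
    vec_param = str(qvec)  # pgvector accepts the '[0.1, 0.2, ...]' literal form
    scores: dict[str, float] = {}
    with conn.cursor() as cur:
        cur.execute(VECTOR_SQL, {"qvec": vec_param})
        scores.update({nid: float(s) for nid, s in cur.fetchall()})       # semantic leg
        cur.execute(KEYWORD_SQL, {"q": query})
        for nid, s in cur.fetchall():                                     # merge by node ID,
            scores[nid] = max(scores.get(nid, 0.0), float(s))             # higher score wins
    return sorted(scores.items(), key=lambda kv: -kv[1])
```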
Fix #4 — Recalibrate Your Similarity Scores
Multilingual embeddings don't score Arabic the way they score English. The 0.7 threshold I trusted from English projects rejected almost every correct Arabic match. Real cosine values for relevant Arabic chunks were landing between 0.20 and 0.45.
So I built a small benchmark — 20 queries, half in-scope agriculture, half off-topic, across Arabic, French, and English. Then I picked an empirical floor (0.15) and ceiling (0.50) from what I was actually seeing, and remapped raw cosine into a 0–1 confidence score. Threshold sits at raw 0.30.
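The remap itself is tiny; the constants are the numbers from that benchmark:

```python
RAW_FLOOR, RAW_CEIL = 0.15, 0.50   # empirical floor and ceiling from the 20-query benchmark
RAW_THRESHOLD = 0.30               # applied on the raw cosine, before remapping

def calibrated_confidence(raw_cosine: float) -> float:
    """Linearly remap raw cosine into a 0-1 confidence score, clamped at the edges."""
    scaled = (raw_cosine - RAW_FLOOR) / (RAW_CEIL - RAW_FLOOR)
    return max(0.0, min(1.0, scaled))
```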
Fix #5 — Confidence Is Two Questions, Not One
I ask the LLM to return its own confidence inside its JSON answer — {"answer": "...", "confidence": 0.XX}. That's the primary signal. I keep max_similarity from retrieval as a safety net for when the JSON is malformed, or when the model is hallucinating its own certainty.
Three buckets, sketched in code after the list:
- ≥ 0.7 — answer directly.
- 0.5–0.7 — answer with a disclaimer.
- < 0.5 — hand it to a human expert.
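Roughly, in code. The JSON parsing details and the escalation hook are sketched, and this version uses max_similarity only when parsing fails, which sidesteps the harder question of how much it should discount an overconfident model:

```python
import json

def route(llm_raw: str, max_similarity: float) -> tuple[str, str | None]:
    try:
        parsed = json.loads(llm_raw)               # {"answer": "...", "confidence": 0.XX}
        confidence = float(parsed["confidence"])   # primary signal: the model's own estimate
        answer = parsed["answer"]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        confidence, answer = max_similarity, None  # safety net: fall back to the retrieval score
    if confidence >= 0.7 and answer:
        return "answer", answer
    if confidence >= 0.5 and answer:
        return "answer_with_disclaimer", answer
    return "escalate_to_expert", None              # the "let me get someone" path
```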
In agriculture, a wrong dose can kill a crop. The "I don't know, let me get someone" path isn't a fallback. It's a feature.
Fix #6 — Every Expert Answer Becomes Part of the Brain
When an expert resolves an escalated question, that whole Q+A is embedded back into the knowledge base — one un-chunked node, tagged expert_answer, in the user's actual language. The next time a farmer asks something similar in dialect, the closest match isn't a generic doc. It's an expert answer to a real question, in the same register.
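The ingestion step is deliberately boring. A sketch, reusing the chunks table layout assumed in the hybrid-search snippet; embed() stands in for the same embedding call the rest of the pipeline uses:

```python
import json

def ingest_expert_answer(conn, embed, question: str, answer: str, lang: str) -> None:
    """Store one resolved escalation as a single un-chunked node, tagged expert_answer."""
    doc = f"{question}\n{answer}"   # Q+A together, in the user's actual language
    vec = embed(doc)                # same embedding model as the rest of the index
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO chunks (content, embedding, metadata) VALUES (%s, %s::vector, %s)",
            (doc, str(vec), json.dumps({"type": "expert_answer", "lang": lang})),
        )
    conn.commit()
```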
Most of the dialect knowledge in my retriever wasn't there on day one. The system grew it, one escalation at a time.
What Actually Mattered
Arabic isn't English with different letters. What users type is rarely what your docs say, and that gap is where RAG quietly fails. The fixes above are small. Together they took retrieval from almost-nothing to usable.
No Arabic-specific embedding solved this for me. What worked was a small pipeline of cheap, honest steps — query normalization, dialect rewriting that respects nouns, hybrid search with the right tokenizer, recalibrated thresholds, two-source confidence, and a feedback loop. No magic model.