MONT3 - Dispatch #370 · Why Enterprise RAG Systems Are Failing: The Case for Simple, Document-Focused Architecture

The enterprise AI world has fallen in love with a seductive narrative: throw more tools at the problem until it works. Retrieval-Augmented Generation (RAG) systems have become the poster child for this approach, with vendors pushing increasingly complex architectures that promise to solve document intelligence challenges through sheer technological force. The reality? Most production RAG systems are delivering disappointing results, and the industry’s reflexive response—adding more layers, agents, and frameworks—is making things worse, not better.

A new technical analysis from Towards Data Science cuts through the hype with surgical precision, arguing that the fundamental issue isn’t infrastructure—it’s understanding. The piece dismantles the conventional wisdom around enterprise RAG deployment and makes a compelling case for returning to basics.

The Standard RAG Recipe Is Broken

The industry has converged on what seems like a simple five-step process:

Chunk the documents into manageable pieces
Push chunks into a vector store for similarity matching
Embed incoming questions as vectors
Retrieve top-k results by cosine similarity, with optional reranking
Send the retrieved content to a large language model
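
As a rough illustration, the five steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's implementation: the bag-of-words `embed` function stands in for a real embedding model, a plain list stands in for the vector store, and `build_prompt` stops short of calling an LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Step 1: split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Steps 2-3: toy 'embedding' as a bag-of-words count vector.

    Assumption: a real system would call an embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Step 4: score every chunk against the question and keep the k best."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, retrieved: list[tuple[float, str]]) -> str:
    """Step 5: assemble the text that would be sent to an LLM (call omitted)."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Note that nothing in this sketch checks whether the top-scoring chunk actually answers the question; it only checks that the vectors point in a similar direction.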

This approach became orthodoxy precisely because it looked so clean on consulting decks and conference slides. Vendors built entire platforms around it. Engineering teams implemented it religiously. Then the deployments started shipping, and reality hit hard.

The problems were consistent across organizations: users didn’t trust the answers, citations were vague or missing entirely, and retrieved passages were often tangential to the actual questions being asked. The system was technically functional but practically useless.

“Enterprises increasingly want custom models that are tailored to their internal tools and processes, without sacrificing intelligence or reliability. Often times this involves tasks that are out-of-distribution for existing models: think custom document formats that aren’t on the public web or company-specific legacy APIs, things that never appeared in pretraining.” — @appliedcompute

The Infrastructure Fallacy

When RAG systems underperform, the reflexive response is always the same: deploy a stronger model, extend the context window, add a better reranker, implement more MLOps infrastructure. The framing treats every problem as an IT challenge that better tools will eventually solve.

This mirrors a pattern we’ve seen throughout computing history. In the 1990s, when enterprise resource planning (ERP) systems failed to deliver promised efficiencies, companies didn’t question the underlying business process design—they bought bigger servers and more sophisticated middleware. When customer relationship management (CRM) systems produced garbage analytics in the 2000s, the solution was always more data integration tools, not better understanding of customer behavior patterns.

The current RAG crisis follows the same script. Teams pile on complexity:

Query-rewriter agents that transform simple questions into elaborate search strategies
Grader agents that evaluate retrieval quality using opaque scoring methods
Orchestrator frameworks that turn every question into ten separate LLM calls
Fine-tuned embedding models that nobody can verify are actually helping

Each addition makes the demo more impressive and the system more brittle. The foundation remains broken: there’s still no reliable way to determine if retrieved passages are relevant, and no clear method to explain why specific results were returned.

What Actually Works: Domain Knowledge Over Infrastructure

The analysis reveals an uncomfortable truth for the AI industry: the work that produces real improvements isn’t infrastructural. It’s engineering combined with deep domain understanding, plus enough mathematical literacy to comprehend what embeddings actually measure and what rerankers actually accomplish.

Most crucially, it requires knowing the documents the system is supposed to process. Who reads them regularly? What vocabulary do domain experts use? What questions come up repeatedly? What citation standards do users expect?

This represents a fundamental shift in thinking. Instead of treating RAG as a machine learning problem that can be solved with better models, successful implementations treat it as a document intelligence problem that requires understanding both the content and the users.

The Enterprise Context Is Different

Most companies aren’t Google. They’re not research laboratories running open-domain question answering across the entire web. Their situation has specific characteristics and constraints that make the standard RAG approach particularly ill-suited:

A limited set of core document types that follow predictable patterns
Domain experts who already understand the corpus intimately
Recurring questions that need answers with proper citations and audit trails
Compliance requirements that demand explainable results

The right architecture for this context amplifies existing expert knowledge rather than trying to replace it. It uses predictable retrieval methods where possible and focuses on structured information extraction rather than creative text generation.
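
One way to read "predictable retrieval" is a lookup that domain experts can inspect and edit. The sketch below is an assumption about what that could look like, not a method described in the analysis: a hypothetical `SECTION_ROUTES` table maps recurring question keywords to known section headings, and retrieval becomes a dictionary lookup rather than a similarity search. The headings and keywords are invented for illustration.

```python
# Hypothetical expert-maintained routing table: question keyword -> section heading.
SECTION_ROUTES = {
    "notice": "Termination",
    "terminate": "Termination",
    "governing law": "Governing Law",
    "payment": "Payment Terms",
}


def split_sections(text: str, headings: set[str]) -> dict[str, str]:
    """Split a document into {heading: body} using a known set of heading lines."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {heading: " ".join(body) for heading, body in sections.items()}


def route(question: str):
    """Return the section heading for a recurring question, or None if unrouted."""
    q = question.lower()
    for keyword, heading in SECTION_ROUTES.items():
        if keyword in q:
            return heading
    return None
```

When `route` returns None, a system could fall back to similarity search. The point of the design is that the common, recurring questions follow a path a reviewer can trace by reading the table.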

From Augmented to Grounded: A Critical Distinction

The original 2020 RAG paper made a deliberate word choice: “Augmented” rather than “Grounded” or “Conditioned.” That choice carries significant implications. Augmented generation assumes the model should blend its pre-training knowledge with retrieved passages—two memory systems working together.

Enterprise requirements invert this assumption completely. Every factual claim must be backed by a specific retrieved passage. The LLM’s parametric memory becomes a liability rather than an asset for factual content. The model should be grounded in retrieved documents, not augmented by them.

This shift demands architectural changes. Instead of one LLM call that mixes retrieval, extraction, and creative composition, successful systems use two distinct phases:

Extraction phase: Pull specific values from documents with line citations
Composition phase: Generate longer narrative content using validated extracted information

Two phases create two audit surfaces. When everything happens in a single LLM call, the audit trail collapses and trust erodes.
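
A minimal sketch of the two audit surfaces, under stated assumptions: the `Extraction` records stand in for the structured output of an extraction-phase LLM call, `validate` is a hypothetical check that each value appears verbatim on the line it cites, and `compose` uses a plain template where a real system might prompt an LLM with only the validated facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    field: str
    value: str
    line: int  # 1-based line number cited by the extractor


def validate(extractions: list[Extraction], doc_lines: list[str]):
    """Audit surface 1: keep only values found verbatim on the cited line."""
    valid, rejected = [], []
    for e in extractions:
        in_range = 1 <= e.line <= len(doc_lines)
        if in_range and e.value.lower() in doc_lines[e.line - 1].lower():
            valid.append(e)
        else:
            rejected.append(e)
    return valid, rejected


def compose(valid: list[Extraction]) -> str:
    """Audit surface 2: build the narrative from validated facts only.

    Assumption: a template stands in for the composition-phase LLM call.
    """
    return " ".join(f"{e.field}: {e.value} [line {e.line}]." for e in valid)
```

Because the rejected list is kept rather than silently dropped, a reviewer can see which claims the extractor made that the source document does not support.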

“If you’re an AI engineer who’d like to understand the nitty-gritty details of document intelligence workflows, don’t miss Angela Shi and Kezhan Shi’s new series.” — @TDataScience

The Hundred-Line Solution

Perhaps the most provocative claim in the analysis: a simple hundred-line Python script often outperforms elaborate production RAG systems. This script has no vector database, no framework orchestration, and no agent architecture. It takes a PDF and a question, parses the content, retrieves the top three pages using basic cosine similarity, sends them to an LLM with a structured schema, and returns answers with line citations and highlighted source documents.
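
The analysis does not reproduce the script here, so the following is only a sketch of the citation half of such a pipeline, with PDF parsing, page retrieval, and the LLM call left out. It assumes pages arrive as plain strings keyed by page number. The `p2:l1` citation id format, the `SCHEMA` shape, and all function names are hypothetical.

```python
import json


def number_lines(pages: dict[int, str]) -> dict[str, str]:
    """Map a citation id such as 'p2:l1' to the text of that non-empty line."""
    index = {}
    for page_no, text in pages.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if line.strip():
                index[f"p{page_no}:l{line_no}"] = line.strip()
    return index


# Hypothetical response shape the model is asked to follow.
SCHEMA = {"answer": "string", "citations": ["line id such as p2:l1"]}


def build_prompt(question: str, index: dict[str, str]) -> str:
    """Show the model every line with its id so it can cite lines by id."""
    numbered = "\n".join(f"[{cid}] {line}" for cid, line in index.items())
    return (
        "Answer using only the numbered lines below.\n"
        f"Return JSON matching this shape: {json.dumps(SCHEMA)}\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def resolve_citations(raw_response: str, index: dict[str, str]):
    """Parse the model's JSON and return the answer plus the lines to highlight.

    Raises ValueError if the model cites an id that was never sent to it.
    """
    data = json.loads(raw_response)
    unknown = [c for c in data["citations"] if c not in index]
    if unknown:
        raise ValueError(f"unknown citation ids: {unknown}")
    return data["answer"], [(c, index[c]) for c in data["citations"]]
```

The ids returned by `resolve_citations` are what a viewer would use to highlight the source lines, which is the part users tend to check before trusting an answer.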

The gap between this simple approach and complex production systems isn’t prompt engineering or advanced retrieval algorithms. It comes from three habits the industry systematically skips:

Knowing the documents: Understanding structure, vocabulary, and content patterns
Knowing the experts: Understanding how domain specialists already work with these documents
Not confusing RAG with machine learning: Treating it as an information extraction problem, not a model optimization challenge

Historical Parallels and Future Implications

This pattern echoes the evolution of database systems in the 1980s and 1990s. Early relational database implementations were often slower than hand-coded file systems for specific use cases. The temptation was always to add more indexing strategies, more query optimization layers, more caching mechanisms.

But the systems that succeeded long-term weren’t the ones with the most sophisticated internals—they were the ones that best matched how people actually thought about their data. SQL succeeded because it reflected how domain experts naturally described their information needs, not because it had the most advanced query optimizer.

Enterprise RAG systems face the same choice. They can continue adding layers of AI sophistication that impress in demos but fail in daily use. Or they can focus on matching how domain experts actually work with documents, using AI as a tool for structured extraction rather than creative generation.

The companies that choose the latter approach—simple, transparent, domain-focused architectures over complex AI showcase systems—will likely build the document intelligence tools that still matter in five years. The rest will be rebuilding from scratch when the current generation of over-engineered RAG platforms becomes unmaintainable.

The future of enterprise document intelligence isn’t about better AI models. It’s about better understanding of documents, users, and the specific problems organizations actually need to solve. Sometimes the most advanced solution is the one that does less, not more.

Published in Stream · Dispatch #370 · May 23, 2026 · 7 min read.