OPINION: Why LLMs Are Sophisticated Search, Not Intelligence
Client: Internal - XBG Solutions
Companies report 95% failure rates on GenAI pilots because they're expecting intelligence. Understanding LLMs as exceptional indexing wrapped in natural language generation—not artificial intelligence—is the key to actually getting ROI.
When we hear MIT research suggesting that 95% of generative AI pilots are not delivering ROI, we point to a fundamental misunderstanding: the framing of LLMs as “intelligent”.
The problem isn’t the technology—it’s the label. We’re calling them “artificial intelligence,” setting expectations for autonomous decision-making and genuine understanding. Then we’re shocked when organisations implement them as intelligent agents and watch their projects collapse.
LLMs aren’t intelligent. They’re spectacularly good at something else entirely: sophisticated indexing wrapped in natural language processing and generation. Once you understand that distinction, the failure patterns make perfect sense—and so does how to actually get value from these tools.
The Industry Evidence
A 2025 MIT study titled “The GenAI Divide: State of AI in Business 2025” found that 95% of generative AI pilots at companies are failing. The New York Times reported that AI business payoff continues to lag despite massive investment. Fortune covered the same research, highlighting the gap between hype and reality.
While we haven’t reviewed the individual projects and their merits, we hazard a guess that they overestimated the “intelligence” of an agent and missed some vital nuance. We advocate that clients begin by automating the administrivia aspects of processes while keeping the human central to decision points. Inform the human agent with well-summarised and contextualised data, and leave the decisioning, guidance, and creative functions to the human.
If what companies are expecting from their “AI Agent” is to understand context, make judgements, and operate autonomously, they’ll likely be disappointed. That’s because an LLM is a prediction engine optimised to generate statistically plausible text based on patterns in its training data.
The ones succeeding? They’re treating LLMs as summarisation tools, pattern finders, and draft generators—not as intelligent decision-makers.
Why the Distinction Matters
Understanding what LLMs actually are changes everything about how you use them.
Wrong mental model: “The AI understands our domain, so we can delegate decisions to it.” Result: Projects fail because the model has no causal understanding, no goals beyond next-token prediction, and no way to validate its outputs against business reality.
Correct mental model: “This tool is exceptional at finding patterns, retrieving relevant information, and generating plausible text. Humans validate, decide, and take responsibility.” Result: Measurable value from summarisation, analysis support, and content generation—with humans staying in control of decisions that matter.
The capabilities are revolutionary. The indexing is unprecedented. The natural language generation is remarkable. But it’s not intelligence per se, and pretending otherwise is likely why the projects are failing.
What They’re Spectacularly Good At
When you understand LLMs as sophisticated indexing + generation, five categories of problems become solvable:
Research & Analysis Synthesis: Feed in product disclosure statements, analyse fee structures, rank competitors, research ESG claims. What works brilliantly: gathering, structuring, comparing. What doesn’t: making value judgements about which product aligns with specific investor values. The model can retrieve and summarise but can’t weigh competing values because it has no value system.
Customer Support Context Aggregation: Summarise interaction history, identify patterns, suggest response frameworks based on similar past cases. What works: pattern recognition across thousands of interactions. What doesn’t: generating emotionally authentic responses that don’t sound forced or trite. The model has no emotional relationship to the context.
Code & Document Pattern Recognition: Find similar implementations, suggest debugging approaches for common errors, identify structural patterns across codebases. What works: recognising patterns seen during training. What doesn’t: architectural creativity for novel problems or aesthetic appreciation of elegant solutions. The model has no concept of frustration or satisfaction.
Draft Generation & Iteration: Create structured content from patterns, generate multiple variations, reformulate for different audiences. What works: recombining existing patterns in new configurations. What doesn’t: genuinely novel arguments or emotionally resonant creative content. The model can recombine but not create in the true sense—it has nothing to convey.
Cross-Domain Pattern Mapping: Identify analogous structures across different domains where those connections existed in training data. What works: surfacing correlations humans might miss. What doesn’t: proximate or cross-conceptual analysis when those specific links weren’t in the training set. The model can only retrieve what it’s indexed.
The pattern is clear: as you move closer to IP generation—engineering, architecture, design, creative work—effectiveness drops. Not because the technology is immature, but because these tasks require what LLMs fundamentally lack.
How LLMs Actually Work
If you want to understand why LLMs are indexing rather than intelligence, the mechanics reveal everything. This section walks through the architecture step-by-step.
The Octopus Thought Experiment
Before diving into technical details, consider Emily Bender and Alexander Koller’s octopus thought experiment. Imagine an octopus intercepting underwater telegraph cables between two islands. Over time, the octopus learns statistical patterns in the messages. It becomes so good at prediction that when islander A sends a message, the octopus can generate a plausible response as if it came from islander B.
The islanders think they’re communicating meaningfully. But the octopus has no idea what the messages mean. It’s just exceptionally good at statistical pattern matching.
That’s an LLM. Spectacular pattern matching. No grounding in what the patterns represent.
From Text to Tokens to Predictions
```mermaid
graph LR
  A[Text Input] --> B[Tokenisation]
  B --> C[Embedding]
  C --> D[Transformer<br/>Attention]
  D --> E[Output<br/>Prediction]
  E --> F[Next Token]
  F -.->|Loop| B
  style B fill:#647E91,stroke:#1D3A4B,color:#1A1A1A
  style C fill:#739088,stroke:#1D3A4B,color:#1A1A1A
  style D fill:#6B5E87,stroke:#1D3A4B,color:#FAFAFA
  style E fill:#29465B,stroke:#1D3A4B,color:#FAFAFA
```

Broadly, LLMs work through a four-stage pipeline: tokenisation, embedding, transformation (via attention mechanisms), and output prediction. At each stage, the goal is the same: optimise for predicting the next token.
```mermaid
graph LR
  A["Check out ngrok.ai"] --> B[Tokeniser]
  B --> C["['Check', ' out', ' ng', 'rok', '.ai']"]
  C --> D[Token IDs]
  D --> E["[4383, 842, 1657, 17690, 75584]"]
  style B fill:#647E91,stroke:#1D3A4B,color:#1A1A1A
```

Tokenisation breaks text into chunks—words, subwords, characters. “Check out ngrok.ai” becomes ["Check", " out", " ng", "rok", ".ai"], each assigned an integer ID: [4383, 842, 1657, 17690, 75584]. These token IDs are the fundamental units the model works with.
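To make that concrete, here is a minimal sketch using the open-source tiktoken tokeniser. This is our choice for illustration only; each model family ships its own vocabulary, so the exact splits and integer IDs will differ from the example above.

```python
# A minimal sketch of tokenisation using the open-source tiktoken package.
# Splits and integer IDs depend entirely on the vocabulary the model was trained with.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")          # one widely used BPE vocabulary
token_ids = enc.encode("Check out ngrok.ai")
tokens = [enc.decode([tid]) for tid in token_ids]   # what each ID maps back to

print(tokens)      # a list of word and subword chunks, similar in shape to the example above
print(token_ids)   # the integers the model actually operates on
```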
```mermaid
graph LR
  A["[4383, 842, 1657, 17690, 75584]"] --> B[Embedding<br/>Lookup]
  B --> C[High-Dimensional<br/>Vectors]
  C --> D["Each token → position in<br/>~1000-dimensional space"]
  D --> E[Optimised for<br/>prediction, not meaning]
  style B fill:#739088,stroke:#1D3A4B,color:#1A1A1A
```

Embedding converts each token ID into a position in high-dimensional space—often thousands of dimensions. Now it gets dense, but stick with us! Here’s a very heady explainer of higher-dimensional shapes, or hypercubes. For simplicity’s sake, imagine the shift in depth of fidelity when you move from a one-dimensional line, to a two-dimensional circle, to a three-dimensional sphere, and then imagine continuing that progression until you’re working in thousands of dimensions of fidelity.
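Mechanically, the lookup itself is trivial, as this rough PyTorch sketch shows. The vocabulary size and dimensionality here are made up for the example, not taken from any real model.

```python
# A sketch of an embedding lookup: token IDs become positions in high-dimensional space.
# Vocabulary size and dimensionality are illustrative, not any real model's.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100_000, 1024
embedding = nn.Embedding(vocab_size, embed_dim)   # a big table of learnable vectors, initialised randomly

token_ids = torch.tensor([4383, 842, 1657, 17690, 75584])
vectors = embedding(token_ids)

print(vectors.shape)   # torch.Size([5, 1024]): five tokens, each a point in 1024-dimensional space
```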
This is where the critical insight emerges: these positions in a theoretical thousand-dimensional pseudo-space aren’t learned to represent the semantic meaning of the word, part-word, or symbol that the vector (the number replacement for the word) represents. “Apple” and “apple” are different tokens with different vectors, and the countless uses of each throughout petabytes of training data all shape where those vectors end up. But whether the various mentions of “Apple” in the training data related to similar or different things is not what determines their pseudo-location in the thousand-dimensional space.
Rather, the embeddings are “learned” (determined) for one purpose: minimising next-token prediction loss. That is, each vector sits wherever makes the next probable token easier to predict.
If “cat” and “dog” predict similar next tokens in similar contexts, gradient descent nudges their embeddings closer together. Not because the model understands they’re both animals, but because reusing computation is efficient. Any “semantic similarity” you observe is incidental—a byproduct of prediction economics, not understanding.
This point destroys the “but embeddings capture meaning” argument. You can rotate the entire embedding space arbitrarily and the model still works perfectly, as long as you rotate all downstream weights accordingly. If positions had intrinsic semantic meaning, rotation would break it. Since rotation preserves function, there is no inherent meaning—only relative positions that optimise prediction.
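Here is a toy numerical demonstration of that rotation argument, using a single embedding table and output projection standing in for a full model. The specifics are ours, but the algebra is the point: an orthogonal rotation of the embeddings, compensated in the downstream weights, changes nothing about the model’s behaviour.

```python
# A toy demonstration that embedding positions carry no intrinsic meaning:
# rotate the embedding space, compensate in the downstream weights, and the
# outputs are identical.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16

E = rng.normal(size=(vocab, dim))   # stand-in embedding table
W = rng.normal(size=(dim, vocab))   # stand-in output projection

# A random orthogonal matrix R: an arbitrary rotation/reflection of the space.
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

logits_before = E @ W
logits_after = (E @ R) @ (R.T @ W)   # rotate embeddings, compensate the weights

print(np.allclose(logits_before, logits_after))   # True: function preserved, "meaning" untouched
```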
```mermaid
graph LR
  A[Token<br/>Embeddings] --> B[Query Q]
  A --> C[Key K]
  A --> D[Value V]
  B --> E[Attention<br/>Weights]
  C --> E
  E --> F[Weighted<br/>Retrieval]
  D --> F
  F --> G[Context-Aware<br/>Embeddings]
  style E fill:#6B5E87,stroke:#1D3A4B,color:#FAFAFA
  style F fill:#6B5E87,stroke:#1D3A4B,color:#FAFAFA
```

Attention mechanisms determine how much weight to give each token when predicting the next one. Through learned weight matrices (WQ, WK, WV), the model calculates which tokens in the prompt are most relevant for generation. But relevance here means “statistically correlated in training data,” not “causally related” or “semantically connected.”
The ngrok article on prompt caching provides an excellent technical walkthrough of how attention actually works—the Q, K, V projections, the matrix multiplications, the caching of intermediate computations. If you want the full mathematical detail, that’s the definitive resource.
What matters for our argument: attention is a mechanism for weighted retrieval based on statistical patterns. It’s sophisticated search through a high-dimensional index, not reasoning about relationships.
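For a feel of what “weighted retrieval” means mechanically, here is a minimal single-head, scaled dot-product attention sketch with toy dimensions and random weights. Real models use many heads and many layers, but the core operation is the same.

```python
# A minimal single-head scaled dot-product attention sketch with toy sizes.
# The output for each token is a weighted retrieval over the value vectors.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_head = 5, 32, 8

X = rng.normal(size=(seq_len, d_model))    # embeddings for the tokens in the prompt
WQ = rng.normal(size=(d_model, d_head))    # learned in a real model; random here
WK = rng.normal(size=(d_model, d_head))
WV = rng.normal(size=(d_model, d_head))

Q, K, V = X @ WQ, X @ WK, X @ WV

scores = Q @ K.T / np.sqrt(d_head)                              # token-to-token relevance scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # softmax over each row
weights /= weights.sum(axis=-1, keepdims=True)

context = weights @ V     # each token's output is a weighted blend of the others' values
print(weights.round(2))   # rows sum to 1: a relevance distribution learned from statistics
```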
```mermaid
graph LR
  A[Final<br/>Embeddings] --> B[Probability<br/>Distribution]
  B --> C[Sample<br/>Next Token]
  C --> D{End Token?}
  D -->|No| E[Append Token]
  D -->|Yes| F[Complete]
  E --> G[Feed Back<br/>to Input]
  style B fill:#29465B,stroke:#1D3A4B,color:#FAFAFA
  style C fill:#29465B,stroke:#1D3A4B,color:#FAFAFA
```

Output prediction takes the final attention output and generates probability distributions over the entire vocabulary. Sample from those probabilities, append the token, feed it back through the loop. Repeat until the model outputs a special “end” token.
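Strung together, the loop looks roughly like this. `model` and `end_token` are placeholders standing in for a real network and vocabulary, not any particular library’s API.

```python
# A rough sketch of the autoregressive loop: predict, sample, append, repeat.
# `model` and `end_token` are placeholders, not a real library's API.
import numpy as np

def generate(model, token_ids, end_token, max_new_tokens=256, temperature=1.0):
    rng = np.random.default_rng()
    for _ in range(max_new_tokens):
        logits = model(token_ids)                       # scores over the whole vocabulary
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                            # a probability distribution over tokens
        next_token = int(rng.choice(len(probs), p=probs))
        if next_token == end_token:                     # the special "end" token stops the loop
            break
        token_ids.append(next_token)                    # feed it back through the pipeline
    return token_ids
```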
At no point does the model:
- Build a causal model of the domain
- Ground symbols in non-linguistic experience
- Maintain goals beyond next-token prediction
- Understand what the tokens represent
It retrieves patterns, weights by relevance, generates statistically plausible continuations. That’s indexing + generation, not intelligence.
What Training Actually Optimises
When we say embeddings are “learned,” here’s what actually happens:
Each token’s embedding starts as random values. During training, gradient descent nudges those values to minimise one thing: prediction error. Not “capture meaning.” Not “group similar concepts.” Just: how well does this help predict the next token?
If two tokens behave similarly in context, treating them similarly helps prediction. Gradient descent finds it cheaper to make their embeddings functionally interchangeable than to learn completely separate behaviours. This produces geometric similarity—but doesn’t optimise for it.
The model never needs to know cats are animals. It just needs to know “cat” and “dog” predict similar tokens in similar contexts. Prediction accuracy does not equal conceptual understanding.
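Here is a compact sketch of that training signal in PyTorch. The tiny stand-in model and random data are ours for illustration; the point is that the only quantity being minimised is next-token cross-entropy.

```python
# A compact sketch of the training signal: the only loss is next-token cross-entropy.
# The stand-in model and random data are illustrative, not a real training setup.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # embeddings start as random values
    nn.Linear(embed_dim, vocab_size),      # stand-in for the transformer stack
)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()            # "how wrong was the next-token guess?"

tokens = torch.randint(0, vocab_size, (1, 33))    # a toy training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token from the current one

logits = model(inputs)                            # shape: (1, 32, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

loss.backward()     # gradients flow back into the embedding table too
optimiser.step()    # every value nudged purely to reduce prediction error
```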
This is why Searle’s Chinese Room argument still applies. The model manipulates syntax (tokens) without semantic grounding. It passes statistical tests for plausibility without understanding what the symbols mean.
What Intelligence Actually Requires
To argue that LLMs aren’t intelligent, we need to define intelligence rigorously. Fortunately, cognitive science and neuroscience have spent decades establishing what components constitute intelligent systems.
The Research Foundation
A comprehensive 2018 survey of cognitive architectures spanning 40 years of research (Kotseruba & Tsotsos) found that models of intelligence consistently require:
- Goal-directed behaviour shaped by internal value systems
- Motivation and affect mechanisms
- Causal reasoning capabilities
- Grounded representations connected to sensory experience
Leading AI researchers echo these requirements. Melanie Mitchell argues in AI: A Guide for Thinking Humans (2019) that statistical pattern matching fundamentally differs from understanding. Gary Marcus’s “The Next Decade in AI” (2020) explicitly rejects pure statistical indexing as a path to intelligence. Yoshua Bengio’s work on representation learning (2013) distinguishes between learning to predict and learning genuine representations of the world.
The philosophical foundations are equally clear. Searle’s Chinese Room argument (1980) demonstrates that syntax manipulation alone is insufficient for semantic understanding. Harnad’s symbol grounding problem (1990) establishes that symbols must be grounded in non-symbolic experience to have meaning.
Neuroscience adds another dimension: emotion and cognition are integrated in human intelligence, not separate systems. Barbey et al.’s work on the neural bases of emotional intelligence, published in PNAS, shows distributed networks linking affective and cognitive processing. Pessoa’s research on emotion-cognition integration (2019) confirms these systems don’t operate independently.
If you have no emotional relationship to a fact—your own emotional relationship with the fact, and your emotional relationship with others’ relationships to those facts—you miss extreme amounts of nuance and detail. You cannot create in the true sense because you have nothing to convey. You deal only with the thin veneer of the factual realm.
What LLMs Are Missing
Map LLM architecture against these intelligence requirements:
Goals and values: LLMs have one goal optimised during training: predict the next token. No higher-order objectives, no value systems, no way to evaluate whether outputs align with meaningful goals. When you adjust temperature or top_p parameters, you’re not giving the model goals—you’re tweaking how it samples from probability distributions (see the sketch after this list).
Motivation and affect: No mechanisms for emotional states, no valuation systems shaping decisions, no way to care about outcomes. The model doesn’t prefer elegant solutions over complex ones because it has no concept of frustration or satisfaction.
Causal reasoning: Purely correlational. The model knows “smoke” and “fire” appear together in training data, but has no representation that fire causes smoke. It can generate text about causality but cannot reason causally about novel situations.
Grounded representations: Embeddings are positions in abstract mathematical space, optimised for prediction, with no connection to physical or emotional reality. The token “red” isn’t grounded in visual experience—it’s just a vector that predicts other tokens in red-related contexts.
Understanding versus mimicry: An LLM doesn’t understand existential dread, loss, happiness, or emotional complexity. It has no concept of conflicting responses to the same thing. It can generate text about these experiences because such text appeared in training data, but it cannot experience or genuinely reason about them.
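On the earlier point about temperature and top_p: here is a rough sketch of what those parameters actually do during sampling. The function is ours, not any vendor’s API, but it mirrors the standard approach: both knobs only reshape the probability distribution the next token is drawn from.

```python
# A sketch of temperature and nucleus (top_p) sampling. Both only reshape the
# distribution the next token is drawn from; neither gives the model goals.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    logits = np.asarray(logits, dtype=float) / temperature   # sharpen or flatten the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Nucleus filtering: keep the smallest set of tokens whose mass covers top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return int(rng.choice(len(probs), p=filtered))
```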
The LLM’s intelligence is a veneer of solid wood over chipboard. It’s not timber through and through.
The Practical Implication
This isn’t just philosophical hairsplitting. The gap between correlation and comprehension explains the 95% failure rate.
When you expect intelligence, you delegate decisions. When the system lacks causal models, grounding, and value systems, those decisions fail in predictable ways. The model generates plausible-sounding outputs with no mechanism to validate them against reality.
When you understand it as indexing + generation, you use it appropriately: pattern recognition, retrieval, synthesis, drafting. Humans provide the goals, values, grounding, and causal reasoning the model lacks.
Why They’re Still Revolutionary
None of this diminishes what LLMs can do. The indexing is unprecedented. The natural language generation is remarkable. The ability to retrieve relevant patterns from billions of tokens and synthesise them into coherent text is genuinely transformative.
But the revolution isn’t intelligence. It’s accessibility.
Before LLMs, sophisticated search required structured queries, databases, and technical expertise. Now you describe what you want in natural language and get synthesis across vast corpuses. That’s revolutionary without being intelligent.
The problem is the hype. We’re using hyperbole to drive sales and adoption, inflating valuations on the promise of intelligence rather than the reality of sophisticated indexing. Thousands of technically semi-literate influencers amplify this to drive ad revenue on YouTube videos proclaiming AGI is imminent.
The mislabelling matters because it sets wrong expectations, leads to failed implementations, and obscures what these tools actually excel at. When you expect intelligence and get sophisticated search, you’re disappointed. When you expect sophisticated search and understand how to leverage it, you get measurable value.
What Comes Next
Current LLMs are not on a path to intelligence. Scaling up—more parameters, bigger training sets, longer context windows—produces better indexing and generation. It doesn’t produce goals, grounding, causality, or understanding.
World models, as explored in Bengio’s work on causal representation learning, represent one potential path toward systems with genuine understanding. These would build explicit models of how the world works, enabling causal reasoning rather than pure pattern matching. Whether that path leads to AGI remains an open research question worth exploring separately.
For now, the practical reality is clear: understand what you’re working with, use it appropriately, and you’ll join the 5% getting ROI instead of the 95% chasing an intelligence that isn’t there.
The tools are sophisticated. The indexing is revolutionary. The natural language generation is remarkable. But calling it intelligence is why your projects are failing.
Understand the tool. Use it correctly. Stop calling it AI.