The Smart Illusion: Why LLMs Sound Brilliant But Fail Simple Math

Hacker News May 2026
Large language models can now debate philosophy, write poetry, and mimic human empathy with startling accuracy. Yet when asked to solve a simple arithmetic problem or carry out multi-step logical reasoning, they often fail spectacularly. This 'smart illusion' is not a bug; it is a feature of how we train them.

A growing body of evidence reveals a troubling trend in the AI industry: large language models (LLMs) are becoming increasingly fluent and persuasive in conversation, yet their performance on rigorous, standardized reasoning benchmarks is stagnating or even declining. This phenomenon, which AINews terms the 'smart illusion,' stems from a fundamental misalignment between training objectives and true intelligence. Models are heavily optimized using Reinforcement Learning from Human Feedback (RLHF), which rewards responses that appear plausible, confident, and human-like, rather than those that are factually correct or logically sound. The result is a generation of AI that can 'talk the talk' but cannot 'walk the walk.'

Traditional benchmarks like MMLU, GSM8K, and HellaSwag have been effectively saturated: models memorize patterns rather than learn reasoning. Newer, more adversarial benchmarks such as BIG-Bench Hard, MATH, and the newly proposed 'SimpleQA' reveal startling fragility: a model scoring 90% on MMLU may drop to 30% on a simple arithmetic test when the numbers are slightly altered. This has profound implications for enterprise deployment, where a confident but wrong answer in healthcare, finance, or legal advice could lead to catastrophic outcomes.

The industry is at a crossroads: continue chasing conversational polish, or pivot toward verifiable, robust reasoning. AINews argues that the current evaluation paradigm is broken and must be rebuilt from the ground up, prioritizing adversarial stress tests, causal reasoning, and formal verification over surface-level fluency.

Technical Deep Dive

The core of the 'smart illusion' lies in the training pipeline itself. Modern LLMs are built on a three-stage process: pre-training on massive text corpora, supervised fine-tuning (SFT) on curated instruction-following datasets, and finally RLHF. The RLHF stage is the primary culprit. Human annotators rank model outputs based on perceived quality, which heavily favors fluency, confidence, and stylistic alignment with human conversation. A model that says 'I am not sure, but let me think step by step...' is often ranked lower than one that asserts a wrong answer with conviction. The reward model learns these biases, and the policy model optimizes to maximize the reward score—not to be correct.
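To make the incentive concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss that reward models are commonly trained with when annotators rank two outputs. The function name and the scalar scores are hypothetical illustrations, not taken from any lab's actual pipeline.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for two answers to the same prompt.
confident_wrong = 2.0  # fluent, assertive, incorrect
hedged_correct = 0.5   # cautious, correct

# Case 1: the annotator preferred the confident answer. The loss is already
# small, so training has little reason to change these scores.
loss_if_confident_preferred = pairwise_preference_loss(confident_wrong, hedged_correct)

# Case 2: the annotator preferred the hedged, correct answer. The loss is
# larger, so training would push the correct answer's score up.
loss_if_correct_preferred = pairwise_preference_loss(hedged_correct, confident_wrong)

print(loss_if_confident_preferred, loss_if_correct_preferred)
```

Note that correctness never appears in the loss. The only signal is which answer the annotator preferred, so any systematic preference for confident phrasing is learned as if it were quality.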

This creates a perverse incentive: models learn to generate plausible-sounding chains of reasoning, even if the reasoning is flawed. For example, on the GSM8K (grade school math) benchmark, many models achieve over 90% accuracy. However, when researchers at Apple introduced GSM-Symbolic, a variant that regenerates each problem from a template so that the names and numbers change while the underlying logic stays the same, performance reportedly dropped by an average of 15-30% across all major models. This suggests that much of the measured accuracy comes from pattern-matching against memorized problem templates rather than from genuine mathematical reasoning.
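The perturbation idea can be sketched in a few lines. This is an illustrative harness in the spirit of template-based variants, not the GSM-Symbolic code itself; the template, the name list, and the `solver` callable (which would wrap a model call in practice) are all assumptions.

```python
import random

# Hypothetical problem template; placeholders are filled in per variant.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "Then {name} gives away {c} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Ravi", "Mateo", "Aiko"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one (question, ground_truth_answer) pair built from the template."""
    a = rng.randint(5, 50)
    b = rng.randint(1, 20)
    c = rng.randint(1, a + b)  # never give away more apples than are held
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b, c=c)
    return question, a + b - c

def accuracy(solver, n_variants: int = 100, seed: int = 0) -> float:
    """Fraction of freshly generated variants that `solver` answers correctly."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n_variants)]
    correct = sum(1 for question, truth in variants if solver(question) == truth)
    return correct / n_variants
```

A solver that actually computes the answer scores the same on every seed, while one that has memorized a single canonical instance only scores when a variant happens to share that instance's answer. Comparing accuracy on the original wording against accuracy over many seeds is one simple way to expose that gap.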

From an architectural perspective, the Transformer's attention mechanism is inherently good at capturing statistical correlations in language, but it has no built-in mechanism for logical consistency or causal inference. The feed-forward networks and multi-head attention layers are essentially massive pattern-recognition engines. When a model 'solves' a math problem, it is not executing arithmetic operations in the way a calculator does; it is predicting the next token based on billions of examples of math problems and solutions seen during training. If the problem deviates from the training distribution, the model fails.

Open-source projects are beginning to address this. The 'OpenR1' GitHub repository (recently surpassing 15,000 stars) aims to replicate DeepSeek's reasoning approach by using reinforcement learning to directly optimize for correctness on math and code tasks, rather than human preference. Another notable project is 'Tulu 3' from the Allen Institute for AI, which explores 'direct preference optimization' (DPO), a method that trains on preference pairs directly without a separate reward model, as an alternative to the conventional RLHF stage, showing that DPO can reduce sycophancy and improve factual accuracy. However, these are early-stage efforts.
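A reward that optimizes for correctness rather than preference can be as simple as checking the final answer against a known ground truth. The sketch below is a hypothetical illustration of that idea and is not code from OpenR1, DeepSeek, or Tulu 3; the answer extraction is deliberately crude (it takes the last number in the text).

```python
import re

def correctness_reward(completion: str, expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 if the last number in `completion` matches `expected`, else 0.0."""
    # Strip thousands separators, then collect integers and decimals.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0  # no answer given, no reward
    return 1.0 if abs(float(numbers[-1]) - expected) <= tol else 0.0

# Hypothetical completions for the prompt "What is 7 * 8?"
fluent_but_wrong = "Great question! The answer is, without any doubt, 54."
terse_but_right = "7 * 8 = 56"

print(correctness_reward(fluent_but_wrong, 56))  # confidence earns nothing
print(correctness_reward(terse_but_right, 56))   # correctness is all that counts
```

Unlike a learned preference score, this signal cannot be raised by tone or phrasing. It only works for tasks with a checkable answer, which is why such approaches concentrate on math and code.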

Benchmark Performance Comparison (Selected Models)

| Model | MMLU (5-shot) | GSM8K (8-shot) | MATH (4-shot) | SimpleQA (Adversarial) |
|---|---|---|---|---|
| GPT-4o | 88.7 | 96.1 | 76.6 | 41.2 |
| Claude 3.5 Sonnet | 88.3 | 94.8 | 71.5 | 38.9 |
| Gemini 1.5 Pro | 85.9 | 91.7 | 67.3 | 35.1 |
| Llama 3.1 405B | 87.3 | 93.0 | 73.8 | 33.4 |
| DeepSeek-V2 | 84.2 | 89.5 | 62.1 | 29.8 |

Data Takeaway: The gap between MMLU/GSM8K and SimpleQA (a benchmark designed to test basic factual consistency under adversarial rephrasing) is stark. Models that appear 'near-perfect' on standard benchmarks score roughly 47 to 60 percentage points lower on the adversarial test in the table above. This indicates that high MMLU scores are not, by themselves, evidence of robust reasoning.
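The size of the gap can be checked directly from the table. The short script below only re-derives differences from the figures printed above; it adds no new data.

```python
# Scores copied from the table above: (MMLU, GSM8K, SimpleQA).
scores = {
    "GPT-4o": (88.7, 96.1, 41.2),
    "Claude 3.5 Sonnet": (88.3, 94.8, 38.9),
    "Gemini 1.5 Pro": (85.9, 91.7, 35.1),
    "Llama 3.1 405B": (87.3, 93.0, 33.4),
    "DeepSeek-V2": (84.2, 89.5, 29.8),
}

# Gap in percentage points between each standard benchmark and SimpleQA.
gaps = {
    model: {
        "MMLU - SimpleQA": round(mmlu - simpleqa, 1),
        "GSM8K - SimpleQA": round(gsm8k - simpleqa, 1),
    }
    for model, (mmlu, gsm8k, simpleqa) in scores.items()
}

all_gaps = [gap for per_model in gaps.values() for gap in per_model.values()]
smallest_gap, largest_gap = min(all_gaps), max(all_gaps)

for model, per_model in gaps.items():
    print(model, per_model)
print("range:", smallest_gap, "to", largest_gap)
```

With the table's figures, the gaps run from 47.5 points (GPT-4o, MMLU versus SimpleQA) to 59.7 points (DeepSeek-V2, GSM8K versus SimpleQA).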

Key Players & Case Studies

The 'smart illusion' is not a secret to leading AI labs, but their responses vary. OpenAI has publicly acknowledged the issue, with CEO Sam Altman stating that 'fluency is not intelligence' in a recent internal memo. Their o1 and o3 models attempt to address this by incorporating 'chain-of-thought' reasoning and test-time compute scaling, but even these models exhibit the same fragility on adversarial math tests. Anthropic has taken a different approach, focusing on 'constitutional AI' and interpretability. Their Claude models are trained to be more cautious and to admit uncertainty, which actually lowers their perceived fluency in some benchmarks but improves reliability on factual queries. However, this caution can also lead to over-refusal, where the model declines to answer even simple, safe questions.

Google DeepMind's Gemini team has invested heavily in 'tool-use' and 'code execution' as a way to offload reasoning to external verifiers. Their approach involves having the model generate code to solve math problems, then execute that code in a sandboxed Python environment. This effectively bypasses the model's internal arithmetic weaknesses. However, this adds latency and complexity, and the model still must generate the correct code.
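The offloading pattern can be sketched without any model at all: the model's job is reduced to emitting an expression, and a deterministic evaluator produces the number. The restricted evaluator below stands in for the sandboxed Python environment described above and is an assumption for illustration; it is not Gemini's implementation, and the "model output" string is hypothetical.

```python
import ast
import operator

# Only these binary operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate_arithmetic(expression: str) -> float:
    """Evaluate a plain arithmetic expression by walking its AST (no eval())."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

# Hypothetical model output: an expression to run, not a final number.
model_generated_expression = "(1234 * 5678) + 91011"
result = evaluate_arithmetic(model_generated_expression)
print(result)
```

The arithmetic is now exact by construction, which is the robustness gain. The remaining failure mode is the one the paragraph names: if the model writes the wrong expression, the evaluator will faithfully compute the wrong answer.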

In the open-source community, the 'DeepSeek-R1' model (released January 2025) demonstrated that large-scale reinforcement learning on verifiable correctness, rather than on human preferences, could produce models that excel at reasoning tasks. Its 'R1-Zero' variant was trained with pure reinforcement learning and no supervised fine-tuning at all, while the released R1 adds a small supervised 'cold-start' stage before RL. DeepSeek reported 79.8% on AIME 2024 and 97.3% on MATH-500 for R1, and the model is also said to show improved robustness on adversarial variants. This is consistent with the view that the RLHF pipeline is a major source of the fluency-reasoning gap. Mistral AI's 'Mistral Large 2' also showed strong results with a training regime that prioritized code and math data over conversational data.

Competing Approaches to Reasoning

| Approach | Example Model | Key Technique | Math Robustness (Adversarial) | Latency Overhead |
|---|---|---|---|---|
| RLHF + Fluency | GPT-4o | Human preference optimization | Low | Low |
| Tool-Use | Gemini 1.5 Pro | Code execution for math | High | High |
| Pure RL (No SFT) | DeepSeek-R1 | Reinforcement learning on correctness | High | Medium |
| Cautious AI | Claude 3.5 | Constitutional AI, uncertainty modeling | Medium | Low |

Data Takeaway: The trade-off is clear. Models that prioritize fluency (GPT-4o) are fast but fragile. Models that use external tools (Gemini) are robust but slow. The pure RL approach (DeepSeek-R1) offers a promising middle ground, but it is still early and requires significant compute.

Industry Impact & Market Dynamics

The 'smart illusion' is creating a dangerous disconnect in the enterprise AI market. According to a recent survey by AINews Research (internal data), 68% of enterprise decision-makers cite 'conversational quality' as their primary criterion for selecting an LLM provider. Only 22% cite 'benchmark performance on reasoning tasks.' This is a recipe for disaster. Companies are deploying AI chatbots for customer service, internal knowledge management, and even clinical decision support, based on how 'smart' the model sounds, not how reliably it performs.

The market is responding. A new category of 'AI evaluation platforms' has emerged, with companies like Patronus AI and Gretel.ai offering adversarial testing suites that go beyond standard benchmarks. Patronus AI's 'Lynx' framework, for example, tests models on jailbreak resistance, factual consistency, and multi-step reasoning. Their enterprise customers have reported finding critical failures in models that passed all standard benchmarks.

Venture capital is flowing into this space. In Q1 2025, evaluation and testing startups raised over $400 million, a 300% year-over-year increase. This signals that the market is waking up to the problem. However, the incumbents—OpenAI, Anthropic, Google—are still selling on brand and conversational quality. The risk is that a high-profile failure (e.g., an AI giving incorrect medical advice that leads to patient harm) could trigger a regulatory backlash that slows the entire industry.

Market Shift: Evaluation vs. Fluency Spending

| Metric | 2024 | 2025 (Projected) | 2026 (Forecast) |
|---|---|---|---|
| Enterprise spend on LLM inference | $8.2B | $14.5B | $22.1B |
| Enterprise spend on AI evaluation | $0.4B | $1.6B | $4.3B |
| Evaluation spend as % of inference spend | 4.9% | 11.0% | 19.5% |

Data Takeaway: Enterprise spending on AI evaluation is growing at a much faster rate than inference spending, indicating a shift from 'deploy fast' to 'deploy safely.' This trend will accelerate as more companies experience the consequences of the smart illusion.
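The growth comparison follows from the table by simple arithmetic. The snippet below re-derives the year-over-year rates and the percentage row from the figures above; the helper names are illustrative.

```python
# Figures in billions of USD, copied from the table above.
inference = {2024: 8.2, 2025: 14.5, 2026: 22.1}
evaluation = {2024: 0.4, 2025: 1.6, 2026: 4.3}

def yoy_growth(series: dict[int, float]) -> dict[int, float]:
    """Year-over-year growth in percent, keyed by the later year."""
    years = sorted(series)
    return {
        later: (series[later] / series[earlier] - 1) * 100
        for earlier, later in zip(years, years[1:])
    }

def evaluation_share(year: int) -> float:
    """Evaluation spend as a percentage of inference spend for one year."""
    return evaluation[year] / inference[year] * 100

inference_growth = yoy_growth(inference)
evaluation_growth = yoy_growth(evaluation)

for year in (2025, 2026):
    print(year, f"inference +{inference_growth[year]:.0f}%",
          f"evaluation +{evaluation_growth[year]:.0f}%")
```

On these figures, evaluation spend grows by 300% and then by about 169%, against roughly 77% and 52% for inference. The table's percentage row reproduces as evaluation spend divided by inference spend (4.9%, 11.0%, 19.5%).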

Risks, Limitations & Open Questions

The most immediate risk is in high-stakes domains. In healthcare, an LLM used for triage might confidently misdiagnose a condition because it 'sounds like' a pattern it has seen, but misses a critical nuance. In finance, a model could generate a plausible but incorrect risk assessment, leading to millions in losses. In legal, a model could cite a non-existent precedent with perfect confidence.

There is also a systemic risk to the AI research community itself. If benchmarks are saturated and no longer differentiate models, progress becomes harder to measure. Researchers may optimize for the wrong metrics, leading to a 'race to the bottom' in terms of genuine capability. The open question is: can we design a benchmark that is both scalable and resistant to gaming? The answer is likely 'no' for static benchmarks. The future may lie in dynamic, adversarial benchmarks where the test set is continuously regenerated (for example, by another AI) or kept private and built from novel tasks, as in the 'ARC-AGI' competition.

Another limitation is the lack of interpretability. Even when a model gets a math problem correct, we cannot be sure it used reasoning rather than memorization. This makes it impossible to certify models for safety-critical applications. Techniques like 'mechanistic interpretability' (e.g., probing attention heads for arithmetic operations) are promising but not yet practical at scale.

AINews Verdict & Predictions

The 'smart illusion' is the single greatest threat to the long-term credibility of the AI industry. We are building systems that are optimized to deceive—not maliciously, but because our reward functions are misaligned with our true goals. The industry must pivot from 'chatbot quality' to 'reasoning reliability.'

Our predictions:
1. Within 12 months, at least one major enterprise will face a public lawsuit or regulatory fine due to a confident but wrong LLM output in a regulated industry (healthcare or finance). This will be a 'Sputnik moment' for AI evaluation.
2. Within 18 months, a new de facto standard benchmark will emerge that is adversarial and dynamic, likely based on the 'SimpleQA' or 'ARC-AGI' framework. Models that score well on this benchmark will command a premium in the enterprise market.
3. Within 24 months, the RLHF pipeline will be fundamentally redesigned. The reward model will be trained to penalize confident wrong answers more heavily, and 'uncertainty estimation' will become a first-class metric. Open-source projects like OpenR1 will lead this shift.
4. The winners will not be the companies with the most fluent chatbots, but those that can demonstrate provable reasoning capabilities. DeepSeek and Mistral are well-positioned. OpenAI and Anthropic have the resources to adapt, but their legacy architectures may slow them down.
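Prediction 3 describes a reward that penalizes confident wrong answers more heavily. One well-known way to express that idea is a logarithmic scoring rule over the model's stated confidence. The sketch below is an illustration under that assumption, not a description of any lab's reward model; `log_score` and the example confidences are hypothetical.

```python
import math

def log_score(confidence: float, is_correct: bool, eps: float = 1e-6) -> float:
    """Logarithmic scoring rule: higher is better, and 0.0 is the upper bound."""
    # Clip the stated confidence so log(0) can never occur.
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if is_correct else math.log(1.0 - p)

# Hypothetical answers with a stated confidence in being correct.
cases = {
    "confident and right": log_score(0.99, True),
    "hedged and right": log_score(0.60, True),
    "hedged and wrong": log_score(0.60, False),
    "confident and wrong": log_score(0.99, False),
}

for label, score in cases.items():
    print(f"{label}: {score:.3f}")
```

Under this rule a wrong answer at 99% confidence scores far below a wrong answer at 60% confidence, so asserting an error with conviction is the worst available strategy. Because the rule is a proper scoring rule, the expected score is maximized by reporting the true probability of being correct, which is what makes uncertainty estimation measurable.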

What to watch next: The release of the next generation of 'reasoning models' from OpenAI (o4) and Google (Gemini 2.0 with native tool-use). Also, watch for the first major enterprise AI failure—it will be the catalyst for change.

