Anthropic Admits LLMs Are Bullshit Machines: Why AI Must Embrace Uncertainty

Hacker News May 2026
Source: Hacker News | Topics: Anthropic, large language models | Archive: May 2026
Anthropic has acknowledged what many engineers have quietly whispered: large language models are optimized to generate plausible text, not truth. This rare moment of self-scrutiny exposes the structural roots of AI hallucination and forces the industry into a painful pivot away from pretense.

In an internal video that leaked to the public, Anthropic researchers made a stark admission: large language models are fundamentally 'bullshit generators.' They are not designed to tell the truth, but to produce the most statistically probable next token given a context. This is not a bug that can be patched with more reinforcement learning from human feedback (RLHF) or better retrieval-augmented generation (RAG). It is an architectural inevitability of the autoregressive transformer paradigm. The industry has spent years papering over this reality with external validation layers, but Anthropic's honesty cuts through the hype. The implication is profound: every AI product today, from ChatGPT to Claude to Gemini, is a confidence trickster by design. The path forward is not to eliminate hallucination—an impossible goal—but to build systems that can reliably measure and communicate their own uncertainty. This means a fundamental shift from black-box answer machines to transparent, confidence-aware assistants. For enterprise adoption, this is the difference between a toy and a tool. AINews explores the technical roots, the failed remedies, the market implications, and the hard road ahead.

Technical Deep Dive

The confession from Anthropic cuts to the core of how large language models actually work. At the architectural level, a transformer-based LLM is a next-token prediction engine. During training, it is fed trillions of tokens and learns to minimize cross-entropy loss—essentially, it learns to guess the next word in a sequence as accurately as possible. Accuracy here means 'most probable given the training distribution,' not 'factually correct.' The model has no internal representation of truth, no grounding in reality, and no mechanism to distinguish between a well-formed lie and a well-formed fact.
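
To make that objective concrete, here is a minimal sketch of next-token training dynamics. The corpus counts, the four-token context, and all probabilities are invented for illustration; the point is that cross-entropy loss rewards whatever continuation was frequent in the training data, and rewards an "I don't know" continuation only as often as it actually appeared there:

```python
import math

# Toy next-token model: probabilities come purely from co-occurrence counts
# in a hypothetical training corpus -- no notion of truth anywhere.
corpus_counts = {
    ("capital", "of", "France", "is"): {"Paris": 9000, "Lyon": 50, "unknown": 10},
}

def next_token_distribution(context):
    counts = corpus_counts[context]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def cross_entropy(context, target):
    # Training loss: -log p(target | context). Minimizing it favors tokens
    # that were frequent in the data, factually correct or not.
    return -math.log(next_token_distribution(context)[target])

ctx = ("capital", "of", "France", "is")
print(cross_entropy(ctx, "Paris"))    # low loss: frequent continuation
print(cross_entropy(ctx, "unknown"))  # high loss: admissions of ignorance are rare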

This is why 'hallucination' is a misnomer. Hallucination implies a deviation from a baseline of truth. In reality, the model's baseline is plausibility. When asked a question it cannot answer factually, it does not 'hallucinate'—it simply continues its training objective: generate the most likely continuation. If the training data contains plausible-sounding but false statements about a topic, the model will reproduce them. If the training data contains no information at all, the model will still produce a response, because its loss function penalizes silence or 'I don't know' (which are rare in the training corpus).

Open-source projects like the `llama.cpp` repository (now over 80,000 stars on GitHub) have made it possible to run these models locally and inspect their internal states. Researchers using tools like the `transformer-lens` library (over 2,000 stars) have shown that models build internal representations of concepts, but these representations are not truth-grounded. They are statistical correlations. For example, a model may correctly associate 'Paris' with 'capital of France' not because it knows geography, but because that co-occurrence pattern is overwhelmingly frequent in its training data. Change the context slightly—'What is the capital of France, and why is it Lyon?'—and the model may confidently produce a false answer because the statistical pattern shifts.
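
The context-shift effect can be caricatured with a toy model. The two "patterns" and their probabilities below are invented, and real attention is far richer than this averaging, but the sketch shows how a loaded presupposition in the prompt can flip the most probable continuation without any notion of truth entering the computation:

```python
# Invented pattern-to-continuation statistics; a crude stand-in for how a
# prompt's surface features pull probability mass around.
patterns = {
    "capital of France": {"Paris": 0.95, "Lyon": 0.05},
    "why is it Lyon":    {"Paris": 0.02, "Lyon": 0.98},
}

def blended_distribution(active_patterns):
    # Average the continuation distributions of every pattern present in
    # the prompt (a drastic simplification of attention).
    dist = {}
    for p in active_patterns:
        for tok, prob in patterns[p].items():
            dist[tok] = dist.get(tok, 0.0) + prob / len(active_patterns)
    return dist

neutral = blended_distribution(["capital of France"])
loaded = blended_distribution(["capital of France", "why is it Lyon"])
print(max(neutral, key=neutral.get))  # the frequent, correct answer
print(max(loaded, key=loaded.get))    # the presupposed, false answer
```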

| Model | Parameters (est.) | MMLU (5-shot) | TruthfulQA (MC1) | Hallucination Rate (SelfCheckGPT) |
|---|---|---|---|---|
| GPT-4o | ~200B | 88.7 | 0.59 | 12.3% |
| Claude 3.5 Sonnet | — | 88.3 | 0.61 | 10.1% |
| Gemini 1.5 Pro | — | 86.4 | 0.55 | 14.7% |
| Llama 3 70B | 70B | 82.0 | 0.48 | 18.9% |
| Mistral Large 2 | 123B | 84.0 | 0.52 | 16.2% |

Data Takeaway: Even the best models hallucinate at double-digit rates on standard benchmarks. TruthfulQA MC1 scores, which measure a model's tendency to avoid false claims, hover around 0.6—meaning the models pick the truthful answer only about 60% of the time on adversarial questions. The correlation between model size and truthfulness is weak; architectural choices and training data quality matter more. No model comes close to human-level reliability.

The current industry 'fixes'—RLHF, RAG, and prompt engineering—are all post-hoc patches. RLHF fine-tunes the model to prefer certain outputs, but it does not change the underlying objective function. It can suppress some falsehoods but introduces new ones (e.g., sycophancy, where the model agrees with the user even when wrong). RAG adds an external retrieval step, but the model still generates text based on its own internal distribution, not the retrieved documents. A 2023 study from a major university showed that RAG reduces hallucination by only 30-50% in controlled settings, and can actually increase it when the retrieved documents are irrelevant or contradictory.
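
The SelfCheckGPT column in the table above rests on sampling-based self-consistency: resample the same question and measure agreement, since fabricated specifics tend to vary across samples while grounded answers tend to repeat. The real method compares samples with NLI or BERTScore models; the sketch below substitutes an exact-match heuristic and hypothetical sample strings to show the contract:

```python
from collections import Counter

def consistency_score(samples):
    """Simplified self-consistency check: score = share of samples that
    agree with the most common answer. Low agreement suggests the model is
    sampling from a flat, ungrounded distribution (i.e., hallucinating)."""
    mode, count = Counter(samples).most_common(1)[0]
    return count / len(samples)

# Hypothetical resampled answers to the same two questions.
grounded = ["Paris", "Paris", "Paris", "Paris", "Paris"]
fabricated = ["1987", "1992", "1985", "2001", "1990"]  # made-up dates drift

print(consistency_score(grounded))    # 1.0 -> likely reliable
print(consistency_score(fabricated))  # 0.2 -> flag as possible hallucination
```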

Key Players & Case Studies

Anthropic's admission is particularly significant because the company has positioned itself as the safety-first AI lab. Its Constitutional AI approach, which layers RLHF on top of a written set of principles, was supposed to align models with human values. Yet even Claude, the flagship model, is fundamentally a bullshit machine. This is not a failure of Anthropic's alignment research—it is a failure of the entire paradigm.

OpenAI, meanwhile, has taken a different approach. With GPT-4o, they have invested heavily in multimodal capabilities and real-time reasoning, but they have not publicly acknowledged the bullshit problem. Their product strategy relies on user trust: the more fluent and confident the model, the more users rely on it. This creates a dangerous asymmetry—users assume reliability where none exists. In the legal domain, lawyers have already been sanctioned for submitting briefs citing nonexistent cases generated by ChatGPT. In healthcare, models have recommended dangerous drug interactions with full confidence.

Google's Gemini has faced similar scandals, including a widely publicized incident where it generated historically inaccurate images. Google's response was to add more guardrails, but guardrails are just more RLHF—they suppress symptoms, not causes.

| Company | Approach to Hallucination | Key Product | Estimated Monthly Active Users (MAU) | Notable Failure |
|---|---|---|---|---|
| OpenAI | RLHF + RAG + system prompts | ChatGPT | 180M | Legal briefs with fake cases (2023) |
| Anthropic | Constitutional AI + RLHF | Claude | 30M | Internal admission of bullshit (2025) |
| Google | RLHF + factuality fine-tuning | Gemini | 150M | Historical image inaccuracies (2024) |
| Meta | Open-source + community RAG | Llama 3 | 50M (via third parties) | High hallucination rates in coding tasks |
| Mistral | Open-source + Mixture of Experts | Mistral Large | 10M | Factual errors in multilingual queries |

Data Takeaway: Every major player suffers from the same fundamental problem. The differences in MAU reflect market reach, not technical superiority. The 'notable failures' column shows that no company is immune, and the failures are not edge cases—they are systemic.

The most interesting development comes from a smaller player, Perplexity AI, which has built its entire product around the premise that LLMs are unreliable. Perplexity's search engine explicitly cites sources and allows users to verify claims. Its CEO has stated publicly that 'LLMs are not knowledge bases.' This philosophy has won it a loyal user base, but it also limits the product's utility—it cannot generate original analysis or creative content without risking hallucination.

Industry Impact & Market Dynamics

Anthropic's confession is a watershed moment for enterprise AI adoption. According to a 2024 survey by a major consulting firm, 67% of enterprise decision-makers cited 'lack of trust in AI outputs' as the primary barrier to deployment. The bullshit problem is not a technical curiosity—it is the single largest obstacle to a $1 trillion market.

| Year | Global Enterprise AI Spend (USD) | % Spent on Verification/Validation | % of Deployments with Human-in-the-Loop |
|---|---|---|---|
| 2022 | $42B | 5% | 40% |
| 2023 | $58B | 8% | 55% |
| 2024 | $78B | 12% | 68% |
| 2025 (est.) | $105B | 18% | 80% |

Data Takeaway: Enterprise AI spending is growing at 35% CAGR, but the proportion spent on verification and validation is growing even faster—from 5% to an estimated 18% in just three years. This is the 'bullshit tax' that companies must pay to make LLMs usable. The human-in-the-loop rate is also climbing, indicating that automation is not replacing humans but augmenting them with expensive oversight.

The market is bifurcating. On one side, consumer-facing chatbots will continue to prioritize fluency and engagement over accuracy. These products are entertainment, not tools. On the other side, enterprise-grade AI will demand uncertainty quantification, confidence scores, and source attribution. Startups like Vectara (which offers a 'Hallucination Detection API') and Galileo (which provides LLM evaluation and monitoring) are already capitalizing on this trend. Vectara's HHEM (Hughes Hallucination Evaluation Model) is an open-source tool that has gained over 1,000 GitHub stars and is used by companies like Spotify and Uber to monitor their LLM deployments.
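
A hallucination-detection service of this kind essentially scores whether a generated claim is supported by retrieved evidence. The function below is a deliberately crude lexical stand-in for that contract, with invented example strings; production systems such as HHEM use a trained entailment model rather than token overlap, but the interface is the same: claim plus evidence in, support score out.

```python
def support_score(claim, source):
    # Crude proxy for entailment: fraction of the claim's tokens that also
    # appear in the source document. Real detectors replace this with an
    # NLI model; the input/output contract is what matters here.
    claim_toks = set(claim.lower().split())
    source_toks = set(source.lower().split())
    return len(claim_toks & source_toks) / len(claim_toks)

source = "the eiffel tower was completed in 1889 in paris"
print(support_score("the eiffel tower was completed in 1889", source))  # fully supported
print(support_score("the eiffel tower was completed in 1920", source))  # partially supported
```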

The financial implications are stark. Companies that fail to address the bullshit problem will face liability, regulatory scrutiny, and brand damage. The EU AI Act, which takes full effect in 2026, requires high-risk AI systems to be 'transparent and traceable.' An LLM that cannot explain its own uncertainty will likely fail this test. In the US, the FTC has already warned that AI-generated content must be 'accurate and not misleading.' The legal landscape is shifting from 'buyer beware' to 'seller be accountable.'

Risks, Limitations & Open Questions

The most dangerous risk is that the industry will continue to paper over the problem rather than solve it. RLHF and RAG are not solutions; they are band-aids. The real solution—building models that can intrinsically measure and communicate uncertainty—is still in its infancy. There is no consensus on how to define, let alone compute, uncertainty in a neural network. Bayesian neural networks, which could provide principled uncertainty estimates, are computationally infeasible at the scale of modern LLMs.
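
For intuition on what principled uncertainty estimation would buy, here is a toy ensemble sketch. The "stochastic forward pass" is a fabricated stand-in for MC dropout or a deep ensemble, but it shows the mechanism: when repeated passes disagree, the entropy of their averaged output distribution rises, which is exactly the signal an "I don't know" could be keyed to.

```python
import math
import random

def predictive_entropy(probs):
    # Shannon entropy of a categorical distribution (in nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

random.seed(0)

def stochastic_pass(confident):
    # Stand-in for one stochastic forward pass over a 3-way classification.
    if confident:
        return [0.97, 0.02, 0.01]  # passes agree on familiar inputs
    p = [random.random() for _ in range(3)]  # passes disagree on unfamiliar ones
    s = sum(p)
    return [x / s for x in p]

def ensemble_entropy(confident, k=50):
    # Average k passes, then measure the entropy of the mean distribution.
    passes = [stochastic_pass(confident) for _ in range(k)]
    mean = [sum(col) / k for col in zip(*passes)]
    return predictive_entropy(mean)

print(ensemble_entropy(True))   # low entropy: the ensemble "knows"
print(ensemble_entropy(False))  # near log(3): the model should abstain
```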

Another risk is the 'uncertainty paradox': if a model says 'I don't know' too often, users will stop trusting it. If it says 'I don't know' too rarely, it will mislead. Finding the right calibration is a product design challenge as much as a technical one. Early experiments with confidence-scored outputs (e.g., 'I am 85% confident that Paris is the capital of France') have shown that users often ignore the confidence score and treat the answer as definitive anyway.
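
Calibration, the property those experiments probe, has a standard metric: expected calibration error (ECE). The sketch below implements a simple binned ECE over hypothetical (confidence, correct) pairs; a model that says "90% confident" but is right only half the time shows a large gap.

```python
def expected_calibration_error(preds, n_bins=5):
    """preds: list of (confidence, correct) pairs. ECE is the sample-weighted
    gap between stated confidence and observed accuracy per confidence bin --
    the standard test of whether '85% confident' answers are right 85% of
    the time."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(preds) * abs(avg_conf - acc)
    return ece

# Hypothetical answer logs: overconfident vs. well-calibrated.
overconfident = [(0.9, True), (0.9, False), (0.9, False), (0.9, True)]
well_calibrated = [(0.5, True), (0.5, False), (0.5, True), (0.5, False)]
print(expected_calibration_error(overconfident))    # large gap
print(expected_calibration_error(well_calibrated))  # near zero
```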

There is also the open question of whether uncertainty can be faked. A model trained to output confidence scores could learn to output high confidence for plausible-sounding answers and low confidence for less plausible ones—which is exactly the same bullshit mechanism, just with a veneer of honesty. This is the 'meta-bullshit' problem: the model bullshits about its own bullshit.

Finally, there is the ethical dimension. If we know that LLMs are bullshit machines, is it ethical to deploy them in high-stakes domains like medicine, law, or finance? The current answer from most companies is 'yes, with human oversight.' But as automation increases and humans become complacent, oversight will erode. The 2023 case of a lawyer using ChatGPT to write a brief with fake cases is a cautionary tale: the lawyer did not verify the output because the model was so confident. The bullshit machine worked exactly as designed.

AINews Verdict & Predictions

Anthropic's admission is the most honest thing any AI company has said in years. It should be a wake-up call, not just for the industry, but for every user who has ever asked ChatGPT for medical advice, legal analysis, or financial planning. The emperor has no clothes—or rather, the emperor is a bullshit machine wearing a very expensive suit.

Prediction 1: Within 18 months, every major LLM provider will offer a 'confidence mode' that outputs uncertainty scores alongside answers. This will be driven by enterprise demand and regulatory pressure, not altruism. OpenAI, Anthropic, and Google will all release APIs that return a confidence value (0-1) for each token or response. These will be imperfect but better than nothing.
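
What such a "confidence mode" would actually return is an open design question. One plausible aggregation, sketched below with invented log-probabilities, is the geometric mean of per-token probabilities (the exponential of the mean logprob); APIs that already expose token logprobs make this computable today.

```python
import math

def response_confidence(token_logprobs):
    # Geometric mean of per-token probabilities: exp(mean logprob).
    # A simple, length-normalized scalar in (0, 1] -- not a guarantee of
    # truth, just a summary of how "sure" the sampling distribution was.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token logprobs for two responses.
fluent_and_sure = [-0.05, -0.02, -0.10, -0.01]
hedged_and_flat = [-1.2, -2.5, -0.9, -3.1]
print(response_confidence(fluent_and_sure))  # close to 1
print(response_confidence(hedged_and_flat))  # well below 1
```

Note the caveat from the meta-bullshit discussion below the predictions: high token-level confidence measures fluency of the distribution, not factual grounding, so any such score needs external validation behind it.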

Prediction 2: A new category of 'AI verification' startups will emerge, worth $10B+ by 2028. These companies will build tools to audit, monitor, and validate LLM outputs. They will be the 'antivirus' of the AI era—essential but invisible to end users. The open-source community will play a key role, with projects like Vectara's HHEM and Galileo's evaluation suite becoming industry standards.

Prediction 3: The consumer chatbot market will bifurcate into 'entertainment' and 'utility' tiers. Entertainment chatbots (e.g., Character.AI, Replika) will continue to prioritize fluency and personality over accuracy. Utility chatbots (e.g., Perplexity, a future 'Claude with confidence') will prioritize verifiability and uncertainty. The two will not converge.

Prediction 4: The term 'hallucination' will be retired in favor of 'bullshit' or 'confabulation.' This is not just semantic—it reflects a shift in understanding. Hallucination implies a temporary glitch; bullshit implies a fundamental property. The industry will adopt more honest language as the technical reality becomes impossible to ignore.

Prediction 5: The next frontier of AI research will not be scaling, but uncertainty. The most important papers of 2026-2027 will not be about bigger models or more data, but about how to make models that know what they don't know. This will require a return to Bayesian principles, new architectures beyond the transformer, and a willingness to sacrifice some fluency for reliability.

The bullshit machine is not going away. But if we can teach it to say 'I don't know' with conviction, we might finally have a tool worth trusting.
