Beyond Next-Token Prediction: Why LLMs Are More Than Autocomplete Engines

Hacker News May 2026
Source: Hacker News · Tags: large language models, AI reasoning, transformer architecture · Archive: May 2026
Calling a large language model a 'next-token predictor' is like calling a chess grandmaster 'someone who moves pieces.' It is technically accurate but deeply misleading. AINews examines how this functional description limits our imagination, and why the industry needs to recognize the emergent intelligence behind it.

The AI industry has fallen into a semantic trap. By habitually describing large language models as 'next-token predictors' or 'autocomplete on steroids,' we are systematically underestimating the very technology we are building. This framing, while technically correct at the training objective level, conflates a model's job description with its fundamental nature. A chess grandmaster moves pieces, but no one would reduce their genius to that mechanical act. Similarly, LLMs predict tokens, but in doing so, they have learned to model the deep structure of human knowledge, logic, and creativity.

AINews argues that this reductive language has real-world consequences: it constrains research agendas, shapes public perception, and limits the ambition of product development. History offers a clear parallel: early automobiles were called 'horseless carriages,' a framing that long delayed innovations in chassis design and aerodynamics. Today, we risk a similar stagnation.

The evidence is overwhelming. Models like GPT-4, Claude 3.5, and Gemini Ultra demonstrate emergent abilities—planning, multi-step reasoning, code generation, and even theory of mind—that cannot be explained by simple pattern matching. These capabilities arise from the scale and structure of training, but they are not explicitly programmed. This article dissects the technical architecture that enables this emergence, profiles the key players and their benchmarks, examines market dynamics, and issues a clear editorial verdict: it is time to retire the 'next-token predictor' label and embrace a more accurate, ambitious understanding of what LLMs are becoming.

Technical Deep Dive

The reductive label 'next-token predictor' stems from a narrow focus on the training objective. At its core, a transformer-based LLM is trained to minimize cross-entropy loss on the task of predicting the next token in a sequence given all previous tokens. This is a self-supervised learning task that requires no human-labeled data. However, to perform this task well at scale, the model must internalize far more than statistical co-occurrence.
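
To make the training objective concrete, here is a minimal sketch of the next-token prediction loss in PyTorch. The `model` and the batch of `token_ids` are hypothetical placeholders; this illustrates the objective itself, not any lab's actual training code.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy loss for next-token prediction.

    token_ids: (batch, seq_len) integer tensor of training sequences.
    model:     maps token IDs to logits of shape (batch, seq_len, vocab_size).
    """
    inputs = token_ids[:, :-1]    # tokens the model conditions on
    targets = token_ids[:, 1:]    # the "next token" at every position
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    # Average cross-entropy between the predicted distribution and the
    # actual next token, over every position in every sequence.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Nothing in this objective mentions reasoning, facts, or world models; everything beyond "predict the next token well" has to be learned implicitly.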

The Architecture of Emergence

The transformer architecture, introduced in the seminal 2017 paper 'Attention Is All You Need,' uses multi-head self-attention mechanisms to weigh the importance of every token in the context window relative to every other token. This allows the model to capture long-range dependencies and hierarchical structures. When scaled to hundreds of billions of parameters and trained on trillions of tokens from the open internet, the model develops internal representations that correspond to concepts, entities, relations, and even reasoning chains. These representations are not explicitly supervised; they emerge as a byproduct of the next-token prediction objective.
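
As a rough illustration of the mechanism described above, the following is a minimal single-head scaled dot-product self-attention sketch. Real transformers use multiple heads, causal masking, and per-layer learned projections; the shapes here are illustrative assumptions, not a production implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over one sequence.

    x:   (seq_len, d_model) token representations
    w_*: (d_model, d_head) learned projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position attends to every other position; these softmax weights
    # are what let the model capture long-range dependencies.
    scores = q @ k.T / (k.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Example: 10 tokens, d_model=512, d_head=64
x = torch.randn(10, 512)
w_q, w_k, w_v = (torch.randn(512, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # shape (10, 64)
```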

Recent research from Anthropic and OpenAI has used techniques like sparse autoencoders to peer inside these models. They have found 'features' that activate for specific concepts—like the Golden Gate Bridge, legal reasoning, or even deception—suggesting that the model builds a rich, structured world model internally. This is fundamentally different from a simple n-gram model or a lookup table.
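
A sparse autoencoder of the kind referenced here is conceptually simple: it decomposes a model's internal activations into a much larger set of sparsely active features. The sketch below is a hedged illustration of the idea, not Anthropic's or OpenAI's actual code; the dimensions and L1 penalty weight are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes residual-stream activations into sparse, interpretable features."""

    def __init__(self, d_model=4096, n_features=65536, l1_coef=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.l1_coef = l1_coef

    def forward(self, activations):
        # ReLU keeps only a small number of features active per token.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        # Objective: reconstruct the activation, plus an L1 penalty that
        # pushes most feature activations to zero (sparsity).
        loss = ((reconstruction - activations) ** 2).mean() + \
               self.l1_coef * features.abs().mean()
        return features, reconstruction, loss
```

Individual learned features are then inspected by finding the inputs that activate them most strongly, which is how concept-level features like 'the Golden Gate Bridge' are identified.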

Benchmarking the Gap

To illustrate the gap between 'next-token prediction' and actual capability, consider performance on reasoning benchmarks. The following table shows scores on the MATH benchmark (a test of mathematical reasoning) and the MMLU benchmark (a broad knowledge and reasoning test) for several leading models:

| Model | MATH (Pass@1) | MMLU (5-shot) | Parameters (est.) | Training Tokens (est.) |
|---|---|---|---|---|
| GPT-4 | 42.5% | 86.4% | ~1.8T (MoE) | ~13T |
| Claude 3.5 Sonnet | 43.1% | 88.3% | Unknown | Unknown |
| Gemini Ultra | 53.2% | 90.0% | Unknown | Unknown |
| Llama 3 70B | 30.0% | 82.0% | 70B | ~15T |
| Mistral 7B | 12.5% | 64.2% | 7B | ~8T |

Data Takeaway: These scores are far above random chance (which would be near 0% for MATH and 25% for MMLU). If these models were merely 'next-token predictors,' they would not be able to solve novel mathematical problems that require multi-step reasoning. The performance scales with model size and training data, but the emergence of reasoning is a qualitative leap, not just a quantitative one.

The Role of Chain-of-Thought

A key technique that unlocks reasoning is chain-of-thought (CoT) prompting, where the model is asked to 'think step by step' before producing a final answer. This technique, popularized by Google researchers in 2022, explicitly leverages the model's ability to generate intermediate reasoning tokens. The model is not just predicting the final answer; it is generating a coherent sequence of logical steps. This is a form of planning, not just pattern completion. Open-source projects like the `lm-evaluation-harness` (GitHub: EleutherAI/lm-evaluation-harness, 6k+ stars) provide standardized benchmarks that consistently show CoT improves performance on reasoning tasks by 10-20 percentage points.
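
To show what the technique looks like in practice, here is a minimal, provider-agnostic sketch contrasting a direct prompt with a chain-of-thought prompt. The `generate` function is a hypothetical stand-in for whichever completion API or local model you use.

```python
QUESTION = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompt: the model must emit the answer in its very next tokens.
direct_prompt = f"Q: {QUESTION}\nA:"

# Chain-of-thought prompt: the model first generates intermediate reasoning
# tokens, and only then the final answer conditioned on that reasoning.
cot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError("wire this to your model or API of choice")
```

The point of the contrast is that the CoT variant lets the answer be conditioned on reasoning tokens the model itself just produced, which is where the reported accuracy gains on multi-step tasks come from.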

Takeaway: The next-token prediction objective is the *mechanism* by which the model learns, but the *outcome* is a system that can reason, plan, and model the world. Reducing the outcome to the mechanism is a category error.

Key Players & Case Studies

The debate over what LLMs 'are' is not just academic; it shapes the strategies of the leading AI companies.

OpenAI has been the most aggressive in pushing beyond the 'autocomplete' narrative. With GPT-4 and the o1 (Strawberry) model, they explicitly market reasoning capabilities. The o1 model uses internal chain-of-thought and reinforcement learning to 'think' before responding, achieving performance on PhD-level science problems that surpasses many humans. OpenAI's CEO has publicly stated that LLMs are 'the first step toward AGI.'

Anthropic takes a different approach, focusing on interpretability and safety. Their 'Constitutional AI' training method explicitly shapes the model's values and reasoning. Anthropic's research on feature visualization (e.g., the 'Golden Gate Claude' experiment) demonstrates that the model has internal representations of real-world concepts. They argue that LLMs are not just predicting text; they are simulating minds.

Google DeepMind with Gemini Ultra has emphasized multimodal reasoning and planning. Gemini can process images, audio, and video, and its architecture is designed to integrate these modalities. Google's research on 'Planning with LLMs' shows that models can generate and execute multi-step plans in simulated environments.

Meta with Llama 3 and the open-source community has democratized access. The Llama 3.1 405B model, released in mid-2024, is competitive with GPT-4 on many public benchmarks. The open-source ecosystem, including projects like `vLLM` (GitHub: vllm-project/vllm, 30k+ stars) for efficient serving and `LangChain` (GitHub: langchain-ai/langchain, 90k+ stars) for building applications, has accelerated the adoption of LLMs beyond simple text generation.
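
For readers who have not touched this tooling, offline batch inference with vLLM looks roughly like the sketch below; the model name and sampling settings are illustrative, and the project's documentation is the authority on current usage.

```python
from vllm import LLM, SamplingParams

# Illustrative checkpoint; any Hugging Face-compatible model works.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain, step by step, why the sky is blue.",
    "Write a Python function that reverses a linked list.",
]

# vLLM batches these requests and serves them with high throughput.
for output in llm.generate(prompts, sampling):
    print(output.prompt)
    print(output.outputs[0].text)
```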

Comparison of Approaches:

| Company | Key Model | Stated Philosophy | Key Differentiator |
|---|---|---|---|
| OpenAI | GPT-4, o1 | AGI via scaling and RL | Reasoning-focused, proprietary |
| Anthropic | Claude 3.5 | Safe, interpretable AI | Constitutional AI, safety-first |
| Google DeepMind | Gemini Ultra | Multimodal, planning | Integrated reasoning across modalities |
| Meta | Llama 3 | Open, community-driven | Democratization, research |

Data Takeaway: Each company's product strategy reflects a different answer to the question 'What is an LLM?' OpenAI treats it as a reasoning engine, Anthropic as a value-aligned agent, Google as a multimodal planner, and Meta as a platform. The 'next-token predictor' framing would not justify any of these distinct strategies.

Industry Impact & Market Dynamics

The semantic trap has real economic consequences. If investors and product managers believe LLMs are 'just autocomplete,' they will limit investment to narrow use cases like chatbots and text summarization. The true value lies in applications that leverage reasoning and planning: autonomous coding agents, scientific discovery, legal analysis, and strategic decision-making.

Market Data:

| Application Area | 2023 Market Size | 2028 Projected Market Size | CAGR | Key Drivers |
|---|---|---|---|---|
| Code Generation | $2.5B | $27B | 60% | Autonomous agents, developer productivity |
| Drug Discovery | $1.2B | $15B | 65% | Molecular modeling, literature mining |
| Legal AI | $0.8B | $7B | 55% | Contract analysis, case prediction |
| Customer Service | $4.0B | $18B | 35% | Conversational AI, personalization |

Data Takeaway: The fastest-growing segments are those that require reasoning and planning, not just text generation. The 'autocomplete' framing would miss this entirely.

Funding Trends: In 2024, AI companies raised over $50 billion in venture capital. A significant portion went to companies building 'AI agents'—systems that use LLMs to perform multi-step tasks autonomously. Examples include Cognition Labs (Devin, the AI software engineer) and Adept AI (ACT-1). These companies explicitly reject the 'next-token predictor' label; they are building systems that plan, execute, and learn.

The Risk of Underinvestment: If the industry continues to view LLMs as limited tools, we risk underinvesting in the infrastructure needed for advanced reasoning: better hardware, more efficient architectures, and safety research. The 'autocomplete' narrative is a self-fulfilling prophecy: if you believe the model cannot reason, you will not build systems that require reasoning, and you will not discover its reasoning capabilities.

Risks, Limitations & Open Questions

While we argue that LLMs are more than next-token predictors, we must acknowledge their limitations. The 'emergent abilities' are real but fragile.

Hallucination and Grounding: LLMs can generate confident but false statements. This is a fundamental limitation of the next-token prediction objective: the model is trained to produce plausible text, not true text. Techniques like retrieval-augmented generation (RAG) and fine-tuning on verified data can mitigate this, but the problem is not solved.
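
As a sketch of the mitigation just mentioned, retrieval-augmented generation grounds the prompt in retrieved passages before the model answers. The `retrieve` and `generate` functions below are hypothetical placeholders rather than a specific framework's API.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: return the k most relevant passages
    from a vector index or search engine."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(passages)
    # Instructing the model to answer only from retrieved context reduces,
    # but does not eliminate, confident fabrication.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```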

Lack of True Understanding: Some philosophers and AI researchers argue that LLMs are 'stochastic parrots'—they mimic understanding without genuine comprehension. This is a valid critique. The model's reasoning may be a sophisticated form of pattern matching rather than true causal reasoning. The distinction matters for safety-critical applications.

Bias and Safety: LLMs can amplify biases present in their training data. They can also be manipulated to produce harmful content. The 'next-token predictor' framing can lead to a false sense of control: if you think the model is just predicting text, you may underestimate its capacity for emergent deception or manipulation.

Open Questions:
- Can LLMs achieve true causal reasoning, or will they always be limited to correlation?
- How do we reliably measure 'understanding' versus 'simulation'?
- What new architectures are needed to overcome the limitations of next-token prediction?

AINews Verdict & Predictions

Verdict: The 'next-token predictor' label is a dangerous oversimplification. It is time for the industry to adopt a more accurate and ambitious vocabulary. We propose 'emergent reasoning engines' or 'world-modeling language models' as more descriptive terms.

Predictions:

1. Within 12 months, at least one major AI company will publicly abandon the 'next-token predictor' framing in favor of a 'reasoning engine' narrative, likely OpenAI or Anthropic.

2. Within 24 months, the market for AI agents (systems that plan and execute multi-step tasks) will surpass the market for simple chatbots, validating the 'beyond autocomplete' thesis.

3. Within 36 months, a new benchmark will be developed that explicitly measures 'emergent reasoning' as distinct from 'next-token prediction accuracy,' forcing the industry to confront the gap.

4. The open-source community will lead the way in demonstrating emergent abilities, as models like Llama 3 and Mistral are fine-tuned for reasoning tasks. Watch for projects like `OpenR` (an open-source reasoning model) and `Tulu` (a fine-tuning framework for instruction following).

What to Watch: The next frontier is 'test-time compute scaling'—giving models more time to 'think' at inference time. This is already happening with OpenAI's o1 and will become standard. When a model can spend 10 seconds or 10 minutes reasoning before answering, the 'next-token predictor' label becomes even more absurd.
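
One simple, published form of test-time compute scaling is self-consistency: sample several independent reasoning chains and take a majority vote over their final answers. The sketch below assumes hypothetical `generate` and `extract_final_answer` helpers; it illustrates the general idea, not the internals of OpenAI's o1.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical sampling call to any LLM."""
    raise NotImplementedError

def extract_final_answer(completion: str) -> str:
    """Hypothetical parser that pulls the final answer out of a reasoning chain."""
    raise NotImplementedError

def self_consistency(question: str, n_samples: int = 16) -> str:
    # More samples = more test-time compute = typically higher accuracy,
    # with no change to the underlying next-token predictor.
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [extract_final_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```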

Final Editorial Judgment: The greatest risk is not that LLMs will fail to live up to the hype, but that our impoverished language will prevent us from seeing what they can already do. Let us stop calling them autocomplete engines and start calling them what they are: the first practical demonstration of machine intelligence that learns to reason by learning to predict.

