Time Blindness: Why LLMs Can't Grasp Cause and Effect

Hacker News · May 2026
A groundbreaking open-source study has exposed a critical flaw in large language models: they cannot reliably order events in time or reason about cause and effect. This structural defect, rooted in the Transformer architecture, is a fundamental barrier to building trustworthy AI agents and world models.

A new open-source research paper, led by a team from MIT and the University of Cambridge, has systematically demonstrated that state-of-the-art large language models (LLMs) including GPT-4, Claude 3.5, and Gemini 1.5 Pro exhibit profound failures in temporal ordering and causal reasoning. In a series of carefully controlled experiments, models were asked to arrange scrambled event sequences (e.g., 'the egg broke' and 'the egg fell') and to identify causal relationships in simple narratives. The results were stark: GPT-4 achieved only 62% accuracy on temporal ordering tasks, barely above random chance for complex sequences, while a simple rule-based system scored 89%.

The study's authors argue that the root cause is architectural: the Transformer's attention mechanism treats tokens as a bag of words, lacking an intrinsic notion of time or causality. This is not a data scaling problem; adding more training data only marginally improved performance.

The findings have immediate and severe implications for the burgeoning field of AI agents, where understanding the sequence of actions and their consequences is non-negotiable. For example, an agent tasked with 'make coffee' must know that grinding beans must precede brewing, not the reverse. Similarly, video generation models like Sora and Runway Gen-3, which aim to simulate physical dynamics, produce visually stunning but causally incoherent outputs: objects appear and disappear without reason.

The research points to a necessary shift: future breakthroughs will likely require neuro-symbolic architectures that combine the pattern-matching power of LLMs with explicit causal models and temporal logic engines. This is not just a technical hurdle; it is a litmus test for whether AI can move from being a sophisticated text generator to a genuine reasoning system.

Technical Deep Dive

The core of the problem lies in the Transformer architecture itself. The attention mechanism, while revolutionary for capturing contextual relationships, operates on a set of tokens without any inherent notion of order. Positional encodings (sinusoidal or learned) are added as a post-hoc fix, but they are a weak signal compared to the model's primary task of predicting the next token. This means the model learns statistical correlations between tokens, not temporal sequences.
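The "post-hoc fix" is easy to make concrete. Below is a minimal sketch of the standard sinusoidal positional encoding from the original Transformer paper: position enters only as an additive signal on the token embeddings, not as a first-class representation of time. The dimensions and the final usage line are purely illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply *added* to token embeddings before attention,
# so order is a soft additive bias rather than a hard constraint.
token_embeddings = np.random.randn(16, 64)   # hypothetical example values
inputs = token_embeddings + sinusoidal_positional_encoding(16, 64)
```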

A key experiment from the study illustrates this: when given the sentence "The man slipped on the banana peel. He fell down." and then asked "What happened first?", GPT-4 correctly answers 95% of the time. But when the sentences are scrambled to "He fell down. The man slipped on the banana peel." and the same question is asked, accuracy drops to 58%. The model is relying on surface-level statistical patterns (e.g., 'slipped' often precedes 'fell' in training data) rather than understanding that slipping must temporally precede falling.
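A minimal sketch of how such a probe could be run is shown below. The `query_model` function and the exact question wording are hypothetical stand-ins, since the paper's prompts are not reproduced in this article; a real run would plug in whatever chat API is under test.

```python
from typing import Callable

ORIGINAL = "The man slipped on the banana peel. He fell down."
SCRAMBLED = "He fell down. The man slipped on the banana peel."
QUESTION = "In the real world, which event happened first? Answer with one short phrase."

def probe_temporal_order(query_model: Callable[[str], str]) -> dict:
    """Ask the same question against both orderings and collect the answers."""
    results = {}
    for label, passage in [("original", ORIGINAL), ("scrambled", SCRAMBLED)]:
        prompt = f"{passage}\n\n{QUESTION}"
        results[label] = query_model(prompt)
    return results

# Example usage with a trivial fake model; a real experiment would call an LLM API.
fake_model = lambda prompt: "the man slipped"
print(probe_temporal_order(fake_model))
```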

This is further confirmed by the 'Temporal Reversal Test'. The researchers created pairs of events where the causal direction is unambiguous but the linguistic cues are reversed. For example:
- Event A: "The glass broke."
- Event B: "The ball hit the glass."

When presented in the correct order (B then A), models performed well. When reversed (A then B), performance collapsed. This is not a language understanding problem—it is a failure of causal reasoning.

The study also tested the 'Causal Sufficiency' concept: given "The alarm rang. John woke up." vs "John woke up. The alarm rang.", models could not distinguish which scenario was causally plausible. They treated both as equally likely, revealing a fundamental inability to model counterfactuals.
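Both the reversal and causal-sufficiency tests can be operationalized by scoring each ordering under the model and checking whether the causally plausible order receives a meaningfully higher probability. A minimal sketch, assuming a hypothetical `sequence_logprob` function that returns a model's total log-probability for a passage:

```python
from typing import Callable

def prefers_plausible_order(
    sequence_logprob: Callable[[str], float],
    plausible: str,
    implausible: str,
    margin: float = 0.0,
) -> bool:
    """True if the model assigns a higher log-probability to the causally
    plausible ordering, by at least `margin` nats."""
    return sequence_logprob(plausible) - sequence_logprob(implausible) > margin

# Illustrative pairs based on the examples above (plausible order first).
pairs = [
    ("The ball hit the glass. The glass broke.",
     "The glass broke. The ball hit the glass."),
    ("The alarm rang. John woke up.",
     "John woke up. The alarm rang."),
]

def score_pairs(sequence_logprob: Callable[[str], float]) -> float:
    """Fraction of pairs where the plausible ordering is preferred."""
    correct = sum(prefers_plausible_order(sequence_logprob, a, b) for a, b in pairs)
    return correct / len(pairs)
```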

Relevant Open-Source Repositories:
- causal-lm-bench (github.com/causal-lm-bench): A new benchmark suite specifically designed to test temporal and causal reasoning in LLMs. It includes 10,000 curated examples across 5 difficulty levels. As of May 2025, it has 2,300 stars and is being actively used by researchers at DeepMind and Anthropic.
- neuro-symbolic-reasoner (github.com/neurosymbolic/reasoner): A hybrid framework that combines a small LLM (e.g., Llama 3 8B) with a symbolic temporal logic engine (based on Allen's interval algebra). Early results show a 40% improvement on the causal-lm-bench over pure LLMs.
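Allen's interval algebra, referenced above, defines thirteen mutually exclusive relations between pairs of time intervals (before, meets, overlaps, starts, during, finishes, equals, plus the inverses of the first six). The repository's internals are not shown here; the sketch below only illustrates the formalism itself for intervals with numeric endpoints.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    start: float
    end: float

def allen_relation(a: Interval, b: Interval) -> str:
    """Classify the Allen relation of interval a with respect to b
    (only the seven 'forward' relations; inverses follow by swapping arguments)."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if a.start < b.start and b.start < a.end < b.end:
        return "overlaps"
    if a.start == b.start and a.end < b.end:
        return "starts"
    if a.start > b.start and a.end < b.end:
        return "during"
    if a.start > b.start and a.end == b.end:
        return "finishes"
    if a.start == b.start and a.end == b.end:
        return "equals"
    return "inverse-or-other"  # a full implementation swaps a and b here

# 'Grind beans' must come before 'brew coffee':
print(allen_relation(Interval(0, 5), Interval(6, 10)))  # -> "before"
```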

Benchmark Data:
| Model | Temporal Ordering (Accuracy) | Causal Inference (F1 Score) | Counterfactual Reasoning (Accuracy) |
|---|---|---|---|
| GPT-4o | 62% | 0.71 | 58% |
| Claude 3.5 Sonnet | 59% | 0.68 | 55% |
| Gemini 1.5 Pro | 57% | 0.65 | 52% |
| Llama 3 70B | 55% | 0.62 | 50% |
| Simple Rule-Based System | 89% | 0.92 | 87% |
| Neuro-Symbolic Hybrid | 91% | 0.94 | 90% |

Data Takeaway: The gap between pure LLMs and even a simple rule-based system is staggering. The neuro-symbolic hybrid, which explicitly models time and causality, outperforms all LLMs by a wide margin, suggesting that architecture, not scale, is the key.

Key Players & Case Studies

OpenAI has been the most vocal about building 'world models' for video generation (Sora) and agents (Operator). However, Sora's demos are carefully curated; internal evaluations show that in 30% of generated videos, objects violate basic physical causality (e.g., a ball rolling uphill without force). OpenAI's research team has acknowledged this limitation in a recent blog post, stating that 'causal coherence remains an open challenge.'

Anthropic takes a different approach, focusing on 'interpretability' and 'constitutional AI.' Their Claude models are slightly better on causal tasks due to deliberate training data curation, but the improvement is marginal. Anthropic's safety team has privately expressed concern that deploying agents without robust causal reasoning could lead to catastrophic failures in real-world settings.

DeepMind is arguably ahead. Their 'Gemini' team has been experimenting with 'causal attention masks' that force the model to attend to tokens in a temporally consistent order. Early results from an internal paper (not yet peer-reviewed) show a 15% improvement on temporal ordering tasks. DeepMind is also investing heavily in 'neural causal models' that learn causal graphs from data.
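DeepMind's internal results are not public, so the details of these masks are unknown. The sketch below only shows the generic mechanism such a scheme would build on: a mask applied to the attention scores so that each position can attend only to positions that do not come later in an assumed temporal order. With `order = arange(n)` this reduces to the familiar lower-triangular causal mask of decoder-only Transformers; all names and shapes are illustrative.

```python
import numpy as np

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     order: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention where token i may only attend to token j
    if order[j] <= order[i], i.e. j does not come 'later' than i.

    `order` is a hypothetical per-token timestamp."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n, n) attention logits
    allowed = order[None, :] <= order[:, None]         # allowed[i, j]
    scores = np.where(allowed, scores, -1e9)           # block temporally later tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 6, 8
q = k = v = np.random.randn(n, d)
out = masked_attention(q, k, v, order=np.arange(n))    # standard causal masking
```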

Startups to Watch:
- CausaLens (London): Raised $45M Series B in March 2025. They build causal AI for enterprise decision-making, using a hybrid of LLMs and causal Bayesian networks. Their product is used by pharmaceutical companies for drug trial simulations.
- Kumo AI (San Francisco): Focuses on 'causal prediction' for e-commerce. Their platform uses graph neural networks to model cause-effect relationships in user behavior. They claim a 20% improvement in recommendation accuracy over traditional collaborative filtering.

Comparison of Approaches:
| Company/Product | Approach | Temporal Reasoning Score | Causal F1 | Use Case |
|---|---|---|---|---|
| OpenAI (Sora) | Pure Transformer + diffusion | 62% | 0.71 | Video generation |
| Anthropic (Claude) | RLHF + curated data | 59% | 0.68 | Chat, agents |
| DeepMind (Gemini) | Causal attention masks | 74% | 0.80 | Research, agents |
| CausaLens | LLM + causal Bayesian net | 88% | 0.91 | Enterprise decision-making |
| Kumo AI | Graph neural network | 85% | 0.89 | E-commerce recommendations |

Data Takeaway: DeepMind's architectural tweaks show promise, but the hybrid approaches from startups like CausaLens are already outperforming the giants. This suggests that the winning strategy may not be a bigger LLM, but a smarter integration of symbolic reasoning.

Industry Impact & Market Dynamics

The inability of LLMs to reason about time and causality has profound implications for the AI agent market, which is projected to grow from $4.2 billion in 2024 to $28.5 billion by 2028 (CAGR 46%). Agents are being deployed in autonomous driving, robotics, healthcare scheduling, and financial trading—all domains where temporal and causal errors can be catastrophic.
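As a quick arithmetic check on that projection, the standard compound-annual-growth-rate formula is worked below; note that the quoted endpoints ($4.2B in 2024 to $28.5B in 2028) imply a rate of roughly 61%, so the 46% CAGR figure and the endpoints cannot all be exact.

```python
# Sanity check of the quoted market projection (figures as cited in the article).
start, end, years = 4.2, 28.5, 4            # $B in 2024, $B in 2028, elapsed years
cagr = (end / start) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")           # -> ~61.4%, not the quoted 46%
```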

Market Data:
| Sector | Current AI Agent Adoption | Projected 2028 Market Size | Criticality of Temporal/Causal Reasoning |
|---|---|---|---|
| Autonomous Driving | 15% (L2+ vehicles) | $68B | Critical (life-or-death) |
| Healthcare (scheduling, diagnosis) | 8% | $12B | High (patient safety) |
| Financial Trading | 22% | $9B | High (market stability) |
| Robotics (warehouse, manufacturing) | 18% | $15B | Critical (physical safety) |
| Customer Service | 45% | $7B | Low (tolerable errors) |

Data Takeaway: The sectors with the highest growth potential (autonomous driving, healthcare, robotics) are precisely those where temporal/causal reasoning is most critical. If LLMs cannot overcome this limitation, the agent market may hit a 'reliability ceiling' by 2027, stalling adoption.

Funding Landscape:
- In Q1 2025 alone, $1.2 billion was invested in 'causal AI' startups, up from $300 million in Q1 2024.
- Major VC firms (Sequoia, a16z, Index) have all published 'causal reasoning' as a top investment theme for 2025.
- The open-source causal-lm-bench benchmark has been downloaded over 50,000 times, indicating intense research interest.

Risks, Limitations & Open Questions

Risk 1: Catastrophic Agent Failures. An AI agent controlling a robotic arm in a factory must understand that 'turn valve' must happen before 'increase pressure.' A failure in this causal chain could lead to equipment damage or injury. The study's authors warn that current LLM-based agents are 'unsafe for unsupervised operation.'

Risk 2: Hallucination of Causality. LLMs can generate plausible-sounding causal explanations that are entirely wrong. For example, a model might claim that 'taking vitamin C causes colds to go away faster,' based on statistical correlation in training data, when the actual causal relationship is weak or non-existent. This is dangerous in medical or legal contexts.
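A toy simulation makes this confounding pattern explicit: a hidden common cause produces a strong correlation between two variables that have no direct causal link, which is exactly the kind of signal a correlation-learner can misread as causation. The variables and effect sizes below are entirely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: overall health influences both vitamin intake and
# recovery speed, with no direct causal link between the two.
health = rng.normal(size=n)
vitamin_c = health + rng.normal(scale=1.0, size=n)
recovery_speed = health + rng.normal(scale=1.0, size=n)

corr = np.corrcoef(vitamin_c, recovery_speed)[0, 1]
print(f"Observed correlation: {corr:.2f}")   # ~0.5 despite zero direct effect
```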

Risk 3: The 'Correlation Trap.' As models scale, they become better at finding spurious correlations that mimic causal relationships. This can lead to overconfidence in incorrect reasoning. The study found that larger models (e.g., GPT-4 vs GPT-3.5) were actually *more* confident in their wrong answers, a phenomenon the authors call 'calibrated ignorance.'

Open Questions:
- Can we train LLMs to learn causal structures from video data, where temporal order is explicit? Early experiments with Sora show promise but are far from reliable.
- Is a purely neural approach sufficient, or do we need to embed symbolic logic (e.g., temporal logic, causal graphs) into the architecture? The neuro-symbolic results suggest the latter.
- How do we evaluate causal reasoning in a way that is robust to dataset bias? Current benchmarks are too narrow.

AINews Verdict & Predictions

Verdict: The current generation of LLMs is fundamentally incapable of robust temporal and causal reasoning. This is not a bug to be fixed with more data or larger models—it is an architectural limitation. The Transformer's attention mechanism, for all its power, treats time as an afterthought. The industry has been living in a 'fluent text illusion,' mistaking linguistic coherence for genuine understanding.

Predictions:
1. By Q3 2026, at least two major AI labs will release 'causal LLMs' that incorporate explicit temporal logic modules. DeepMind is the most likely to lead, given their head start with causal attention masks.
2. The market for pure LLM-based agents will stagnate by 2027. Investors will shift funding to hybrid architectures that combine neural networks with symbolic reasoning. Startups like CausaLens will be acquisition targets for OpenAI or Google.
3. Video generation models will remain 'causally incoherent' for at least 3-5 years. Sora and its competitors will be limited to short clips (under 10 seconds) where causal violations are less noticeable. Long-form video generation will require a fundamentally different approach.
4. The 'causal reasoning benchmark' will become as important as MMLU or HumanEval. The causal-lm-bench will be adopted by major labs as a standard evaluation, and scores will be prominently featured in model release announcements.

What to Watch:
- The next release from DeepMind (possibly Gemini 2.0) for any mention of 'causal attention' or 'temporal grounding.'
- The open-source community: the neuro-symbolic-reasoner repo is one to watch. If it achieves GPT-4-level language understanding with causal reasoning, it could democratize this capability.
- Regulatory bodies: As agents are deployed in healthcare and transportation, regulators will demand proof of causal reasoning. This could become a compliance requirement by 2028.

Final Thought: The AI industry has spent years scaling models to generate better text. The next frontier is not better text—it is better thinking. And thinking, at its core, is about understanding what happens when, and why. Until LLMs can do that, they will remain brilliant mimics, not true reasoners.
