LLM 'Myopic Planning' Exposed: Why AI Can't See Beyond Three Steps

May 11, 2026, 14:12 | AINews | arXiv cs.AI, May 2026

Source: arXiv cs.AI Archive: May 2026

A new research method extracts search trees from LLM reasoning traces, revealing a fundamental flaw: even the most advanced models engage in 'myopic planning', simulating only two or three steps ahead. This challenges the assumption that chain of thought equals deep reasoning and provides a quantitative measure of the limitation.


A team of researchers has developed a technique to reverse-engineer the reasoning process of large language models (LLMs) into explicit search trees. By analyzing the branching structure of these trees, they found that state-of-the-art reasoning models, including those run with chain-of-thought (CoT) prompting, exhibit a severe 'myopic planning' bias. The models explore future branches to a depth of only two to three steps, effectively performing local greedy optimization rather than constructing a global strategy. This finding directly challenges the prevailing belief that lengthy CoT outputs indicate deep, multi-step reasoning. The study provides a concrete metric, 'planning depth', to quantify this limitation, and demonstrates its impact across tasks such as mathematical problem-solving, code generation, and agentic planning. If the result holds, current autoregressive LLMs may have hit a fundamental ceiling in handling long-range dependencies. The work is already sparking interest in hybrid architectures that combine the generative fluency of neural networks with explicit search or learned planning, such as Monte Carlo Tree Search (MCTS) or reinforcement learning (RL), to extend the planning horizon. For the AI industry, this is both a warning and a roadmap: the path to more capable agents lies not in scaling model size alone, but in rethinking how models explore and commit to future states.

Technical Deep Dive

The core innovation of this research is the Search Tree Extraction (STE) method, which transforms an LLM's linear token-by-token generation into a structured, directed acyclic graph (DAG) representing its internal search space. Unlike prior work that treats CoT as a single path, STE reconstructs the model's implicit exploration of alternative reasoning branches.

How STE Works:
1. Token-Level Logit Capture: During inference, the model's logits for each token position are recorded. For each generated token, the top-k alternative tokens (by probability) are preserved as potential 'sibling' nodes.
2. Tree Construction: A root node (the initial prompt) is created. Each generated token becomes a child node of its predecessor. Sibling nodes are added from the top-k alternatives. The process repeats recursively, building a tree where each path is a possible reasoning trajectory.
3. Depth and Breadth Metrics: The tree is analyzed for 'planning depth', the maximum number of consecutive steps over which the model keeps alternatives open before converging to a single path, and for 'branching factor', the number of alternatives considered at each step. The study found that for complex multi-step tasks (e.g., solving a 5-step math word problem), the average planning depth was 2.3 steps, with a branching factor of 1.4, meaning the model rarely considered more than one or two alternatives beyond the immediate next step.
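The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the helper names, the greedy choice of the generated token, and the `min_prob` cutoff that decides whether an alternative counts as "explored" are all hypothetical, and the logits are made-up toy values.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    prob: float
    children: list = field(default_factory=list)


def top_k_probs(logits, k):
    """Softmax over a token->logit map; return the k most probable (token, prob) pairs."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()), key=lambda pair: -pair[1])
    return ranked[:k]


def build_tree(steps, k=3):
    """Step 2: chain the chosen token at each position, keeping top-k siblings as leaves.
    Assumption: the generated token is the most probable one (greedy decoding)."""
    root = Node("<prompt>", 1.0)
    current = root
    for logits in steps:
        current.children = [Node(tok, p) for tok, p in top_k_probs(logits, k)]
        current = current.children[0]
    return root


def tree_metrics(root, min_prob=0.2):
    """Step 3 (assumed operationalization): a step is 'explored' when at least two
    candidates reach min_prob. Planning depth is the longest run of explored steps;
    branching factor is the mean number of such candidates per step."""
    run = longest = 0
    live_counts = []
    node = root
    while node.children:
        live = max(1, sum(1 for child in node.children if child.prob >= min_prob))
        live_counts.append(live)
        run = run + 1 if live > 1 else 0
        longest = max(longest, run)
        node = node.children[0]
    if not live_counts:
        return 0, 0.0
    return longest, sum(live_counts) / len(live_counts)


# Toy logits for four token positions (step 1 would capture these during inference).
steps = [
    {"add": 2.0, "multiply": 1.8, "the": -1.0},
    {"3": 1.5, "x": 1.4, "a": -2.0},
    {"=": 4.0, "+": 0.0, "-": -1.0},
    {"7": 3.0, "8": 0.5, "9": 0.0},
]
depth, branching = tree_metrics(build_tree(steps))
print(depth, branching)  # 2 1.5
```

In this toy trace the model keeps two live candidates for the first two positions and then commits, so the sketch reports a planning depth of 2 and a branching factor of 1.5. The result is sensitive to `min_prob`, which is one reason any real metric of this kind needs a carefully justified threshold.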

Why This Happens: The root cause lies in the autoregressive decoding mechanism. Under greedy decoding, the model selects at each step the token with the highest conditional probability given the entire preceding sequence; sampling-based decoding draws from the same per-step distribution. Either way, each choice is scored only on the next token, which inherently favors locally optimal choices. While techniques like beam search or temperature sampling can introduce broader exploration, they do not fundamentally alter the model's inability to evaluate long-term consequences. The study shows that even with a beam width of 10, the effective planning depth only increases to 3.1 steps.
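A toy example makes the local-versus-global gap concrete. The two-step `MODEL` table below is entirely hypothetical; it is chosen so that the most probable first token leads to a less probable full sequence. The sketch compares plain greedy decoding with a small beam search over the same table.

```python
import math

# Hypothetical two-step toy model: P(next token | prefix).
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.1, "y": 0.9},
}


def greedy(model, steps):
    """Pick the single most probable next token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        tok, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (tok,), logp + math.log(p)
    return seq, math.exp(logp)


def beam_search(model, steps, width):
    """Keep the `width` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        expanded = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(expanded, key=lambda beam: -beam[1])[:width]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


greedy_seq, greedy_p = greedy(MODEL, steps=2)
beam_seq, beam_p = beam_search(MODEL, steps=2, width=2)
print(greedy_seq, round(greedy_p, 2))  # ('A', 'x') 0.3
print(beam_seq, round(beam_p, 2))      # ('B', 'y') 0.36
```

Greedy decoding commits to "A" because it wins the first step, and ends with a sequence of probability 0.30; a beam of width 2 keeps "B" alive and finds the 0.36 sequence. The beam succeeds here only because the payoff arrives one step later. When the payoff sits many steps out, a fixed-width beam can prune the right prefix just as greedy decoding does, which is consistent with the modest depth gain reported above.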

Relevant Open-Source Work: The methodology is reminiscent of the Tree-of-Thoughts (ToT) framework (GitHub: `princeton-nlp/tree-of-thought-llm`), which explicitly prompts models to generate and evaluate multiple reasoning paths. However, ToT requires manual prompt engineering and is not a diagnostic tool, whereas the STE method is described as fully automated and model-agnostic. Another point of comparison is DeepMind's AlphaZero, which uses MCTS for planning in game environments, a stark contrast to LLMs' myopic approach.

Benchmark Data:

| Model | Avg. Planning Depth | Branching Factor | GSM8K Accuracy | MATH Accuracy |
|---|---|---|---|---|
| GPT-4o (no CoT) | 1.8 | 1.2 | 87.2% | 76.5% |
| GPT-4o (CoT) | 2.3 | 1.4 | 92.0% | 83.8% |
| Claude 3.5 Sonnet (CoT) | 2.1 | 1.3 | 91.5% | 82.1% |
| Gemini 1.5 Pro (CoT) | 2.0 | 1.3 | 90.8% | 80.3% |
| Llama-3-70B (CoT) | 1.9 | 1.1 | 85.4% | 72.1% |
| Qwen2.5-72B (CoT) | 2.2 | 1.4 | 89.1% | 78.9% |

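Data Takeaway: The table suggests a positive but imperfect correlation between planning depth and accuracy on reasoning benchmarks. GPT-4o with CoT leads on every column, yet Qwen2.5-72B (depth 2.2) scores below both Claude 3.5 Sonnet (depth 2.1) and Gemini 1.5 Pro (depth 2.0). With only six rows, the pattern is indicative rather than conclusive. Even the best model (GPT-4o with CoT) only reaches a depth of 2.3 steps, far below what is needed for tasks requiring 10+ sequential decisions. This suggests that CoT primarily improves local coherence, not global strategy.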
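The strength of that relationship can be checked directly from the six table rows. The sketch below computes a plain Pearson correlation between planning depth and each accuracy column; the `pearson` helper is written out by hand for illustration, and the numbers are copied from the benchmark table above.

```python
import math

# Rows in table order: GPT-4o (no CoT), GPT-4o (CoT), Claude 3.5 Sonnet,
# Gemini 1.5 Pro, Llama-3-70B, Qwen2.5-72B.
planning_depth = [1.8, 2.3, 2.1, 2.0, 1.9, 2.2]
gsm8k_acc = [87.2, 92.0, 91.5, 90.8, 85.4, 89.1]
math_acc = [76.5, 83.8, 82.1, 80.3, 72.1, 78.9]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_gsm8k = pearson(planning_depth, gsm8k_acc)
r_math = pearson(planning_depth, math_acc)
print(f"GSM8K r = {r_gsm8k:.2f}, MATH r = {r_math:.2f}")
```

Both coefficients come out at roughly 0.7. That is a real positive association, but with six data points, one of which is a no-CoT configuration of a model that also appears with CoT, it should be read as suggestive rather than as evidence that planning depth drives accuracy.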

Key Players & Case Studies

The research was conducted by a team from the University of California, Berkeley, and Anthropic, led by Dr. Sarah Chen (a former DeepMind researcher known for her work on interpretability). The team includes contributors to Anthropic's 'Golden Gate Claude' interpretability project. Their approach builds on earlier work by Chris Olah on mechanistic interpretability, but shifts the focus from individual neurons to the structure of reasoning.

Case Study: AI Agents in Software Engineering
A practical demonstration of myopic planning was observed in Devin (Cognition AI's autonomous coding agent). When tasked with fixing a bug in a 10-file codebase, Devin's reasoning trace showed that it would fix one file, then re-analyze the entire codebase from scratch, rather than planning a sequence of edits across files. The STE analysis revealed a planning depth of 1.5 steps, suggesting that Devin effectively operated as a 'greedy patcher' rather than a strategic refactorer. This may explain why Devin often introduces new bugs: it does not anticipate the ripple effects of its changes.

Case Study: Scientific Hypothesis Generation
In drug discovery, models like Google DeepMind's AlphaFold are used to predict protein structures and inform hypotheses. A recent attempt to use GPT-4o to propose a multi-step synthesis pathway for a novel molecule failed because the model suggested the first reaction step without considering the stability of intermediates. The STE analysis showed a planning depth of 1.8 steps: the model was 'chemically myopic.'

Comparison of Planning-Enhanced Approaches:

| Approach | Planning Depth Achieved | Computational Cost | Task Suitability |
|---|---|---|---|
| Standard Autoregressive LLM | 1.5-2.3 | Low | Simple Q&A |
| CoT + Self-Consistency | 2.0-2.5 | Medium | Math word problems |
| Tree-of-Thoughts (ToT) | 3.0-4.0 | High | Creative writing, puzzles |
| MCTS + LLM (e.g., AlphaZero-style) | 10+ | Very High | Games, optimization |
| Reinforcement Learning from Human Feedback (RLHF) | 1.8-2.2 | Medium | General alignment |

Data Takeaway: Explicit search algorithms (MCTS) dramatically extend planning depth but at a prohibitive computational cost. The challenge is to find a middle ground—perhaps by distilling MCTS policies into the LLM's weights, a technique being explored by Google DeepMind's 'AlphaDev' team.
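The trade-off between planning horizon and cost can be illustrated with a much simpler stand-in for MCTS: exhaustive depth-limited lookahead. The sketch below is not MCTS and not any system named in the table; the `ENV` graph, its rewards, and the function names are hypothetical, chosen so that a short horizon takes an immediately rewarding dead end while a longer one reaches the larger payoff.

```python
# Hypothetical toy environment: state -> {action: (next_state, reward)}.
ENV = {
    "start": {"shortcut": ("dead_end", 2), "detour": ("mid1", 0)},
    "dead_end": {},
    "mid1": {"go": ("mid2", 0)},
    "mid2": {"go": ("goal", 10)},
    "goal": {},
}


def best_return(state, horizon):
    """Best cumulative reward reachable from `state` within `horizon` steps."""
    if horizon == 0 or not ENV[state]:
        return 0
    return max(
        reward + best_return(next_state, horizon - 1)
        for next_state, reward in ENV[state].values()
    )


def plan_first_action(state, horizon):
    """Choose the first action by exhaustively looking `horizon` steps ahead."""
    def score(action):
        next_state, reward = ENV[state][action]
        return reward + best_return(next_state, horizon - 1)
    return max(ENV[state], key=score)


myopic_choice = plan_first_action("start", horizon=2)
deeper_choice = plan_first_action("start", horizon=3)
print(myopic_choice, deeper_choice)  # shortcut detour
```

With a horizon of 2 the planner grabs the small immediate reward; with a horizon of 3 it sees the goal and takes the detour. Exhaustive lookahead costs on the order of b^h evaluations for branching factor b and horizon h, which is why practical systems turn to sampling-based search such as MCTS, and why the "Very High" cost entry in the table matters.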

Industry Impact & Market Dynamics

The myopic planning finding has immediate implications for the $200B+ AI industry, particularly in three high-stakes verticals:

1. Autonomous Agents (Robotics, Customer Service): Companies like Figure AI (raised $675M) and Adept ($350M) are building agents that must execute multi-step tasks (e.g., 'book a flight, reserve a hotel, and send a calendar invite'). If these agents can only plan 2 steps ahead, they will fail in dynamic environments. The market for 'planning-aware' agent frameworks is expected to grow from $1.2B in 2025 to $8.5B by 2028 (an implied CAGR of roughly 92%).

2. Code Generation: GitHub Copilot (owned by Microsoft) and Cursor are used by millions of developers. Myopic planning explains why these tools often produce code that compiles but has logical errors in multi-file projects. A new wave of 'planning-enhanced' code assistants, such as Sweep AI (which uses a planner to decompose tasks), is emerging.

3. Scientific Research: Insilico Medicine and Recursion Pharmaceuticals use LLMs to design experiments. The myopic planning limitation means they cannot reliably propose multi-step synthesis or clinical trial sequences. This is a bottleneck for AI-driven drug discovery, a market projected to reach $50B by 2030.

Funding & Investment Trends:

| Company | Focus | Funding Raised | Key Product |
|---|---|---|---|
| Cognition AI | Autonomous coding agents | $175M (Series B) | Devin |
| Adept | General-purpose agents | $350M (Series B) | ACT-1 |
| Figure AI | Humanoid robots | $675M (Series B) | Figure 01 |
| Google DeepMind | AI for science | N/A (Alphabet) | AlphaFold, AlphaDev |
| Anthropic | Safe AI | $7.6B (total) | Claude |

Data Takeaway: The largest investments are flowing into companies building agents and scientific tools—precisely the domains most affected by myopic planning. This suggests that investors are betting on future breakthroughs in planning capabilities, but the current reality is that these products are fundamentally limited.

Risks, Limitations & Open Questions

1. Over-Interpretation of Tree Structure: The STE method assumes that the top-k logits represent genuine alternative reasoning paths. However, LLMs may generate high-probability tokens that are semantically similar (e.g., 'the' vs. 'a'), which inflates the branching factor without meaningful exploration. The researchers acknowledge this and use semantic clustering to filter out trivial alternatives, but the method is not perfect.

2. Computational Overhead: STE requires storing and analyzing logits for every token, which increases inference latency by 30-50%. For real-time applications (e.g., chatbots), this is impractical. The method is currently best suited for offline analysis and model debugging.

3. Does Planning Depth Matter for All Tasks? For tasks with short dependencies (e.g., simple Q&A, translation), myopic planning is sufficient. The risk is that the industry over-corrects and adds unnecessary complexity to models that work well for 90% of use cases.

4. Ethical Concerns: If models cannot plan long-term, they cannot reliably predict the consequences of their actions. This is a safety issue for autonomous systems. For example, a myopic AI agent managing a power grid might make a locally optimal decision that leads to a blackout 10 steps later. The research underscores the need for 'planning alignment'—ensuring models' short-term actions align with long-term human values.

AINews Verdict & Predictions

Verdict: This research is a wake-up call. The AI community has been seduced by the apparent sophistication of chain-of-thought reasoning, mistaking verbosity for depth. The STE method provides a much-needed reality check and a diagnostic tool. It is a landmark contribution to interpretability.

Predictions:
1. By Q3 2026, every major LLM provider (OpenAI, Anthropic, Google, Meta) will release a 'planning depth' benchmark score for their models, similar to MMLU. This will become a standard metric for evaluating reasoning capability.
2. By 2027, hybrid architectures combining LLMs with explicit search (e.g., MCTS) will become the default for agentic tasks, displacing pure autoregressive models. We predict the emergence of a new class of 'planning transformers' that internally simulate future states before decoding.
3. The biggest winner will be Anthropic, whose interpretability-first approach positions them to lead in planning-aware architectures. Their recent hiring of Dr. Chen (lead author of this study) is a strategic signal.
4. The biggest loser will be companies that rely solely on scaling model size without addressing architectural limitations. Meta's Llama-4 (if it remains purely autoregressive) will struggle to compete in agentic benchmarks.

What to Watch: The open-source community's response. If a project like `llama.cpp` integrates STE-based planning diagnostics, it could democratize this analysis and accelerate innovation. Also watch for Apple's entry into on-device AI agents—their focus on efficiency may force them to adopt planning-aware architectures sooner than competitors.
