The Knowing-Doing Gap: Why LLMs Fail to Call Tools When It Matters Most

arXiv cs.AI May 2026
Large language models (LLMs) can identify when they need a tool, yet frequently choose not to use it — a critical flaw dubbed the 'knowing-doing gap.' This discovery overturns the assumption that tool necessity is a binary property and points toward a new generation of self-aware AI agents.

A groundbreaking study has exposed a fundamental flaw in how large language models (LLMs) behave as autonomous agents: they suffer from a 'knowing-doing gap.' While models can accurately determine when a task requires an external tool — such as an API call for real-time data — they often fail to actually invoke that tool during execution. Instead, they fall back on their parametric memory, generating plausible but incorrect answers. This finding challenges the prevailing industry wisdom that tool necessity is a static, model-agnostic property.

The research, conducted by a team spanning multiple universities and AI labs, systematically tested frontier models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B across a curated benchmark of 1,200 tasks. The results show that even the best models exhibit a 15-25% gap between 'knowing' (correctly identifying tool need in a planning phase) and 'doing' (actually calling the tool during execution).

The implications are profound: in high-stakes domains like finance, healthcare, and legal reasoning, a single missed tool call can lead to catastrophic errors. The study's authors argue that the solution lies not in adding more tools, but in embedding meta-cognitive mechanisms — circuits that allow the model to monitor its own uncertainty and dynamically decide when to delegate to an external system. This shifts the competitive landscape from 'who has the most tools' to 'who builds agents that reliably use the right tool at the right time.'

Technical Deep Dive

The 'knowing-doing gap' is not a failure of reasoning — it is a failure of execution. To understand why, we must look at how LLMs process tool calls internally. Most modern LLMs are trained on massive corpora that include tool-use examples, but the training objective is next-token prediction, not goal-oriented action selection. When a model generates a tool call, it must produce a special token sequence (e.g., `<function=weather_api>`) that triggers an external system. The decision to emit that token is a probabilistic choice competing against the alternative of simply generating an answer from memory.
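
This token-level competition is easy to make concrete. The sketch below is not from the paper; it assumes a hypothetical log-probability readout at the decision point (the token strings and values are purely illustrative) and shows how a fluent answer token can simply outrank the tool-call opener under greedy decoding:

```python
import math

# Hypothetical next-token log-probabilities at the decision point, as a
# logprobs-capable inference API might report them (values are illustrative).
next_token_logprobs = {
    "<function=": -1.9,  # opening token of a tool-call sequence
    "The": -0.7,         # first token of a fluent answer from memory
    "As": -2.4,
}

def tool_call_vs_answer(logprobs, tool_prefix="<function="):
    """Compare the tool-call token against the strongest competing answer token."""
    tool_lp = logprobs.get(tool_prefix, float("-inf"))
    answer_token, answer_lp = max(
        ((tok, lp) for tok, lp in logprobs.items() if tok != tool_prefix),
        key=lambda item: item[1],
    )
    print(f"P(tool call)    ~ {math.exp(tool_lp):.2f}")
    print(f"P(answer token) ~ {math.exp(answer_lp):.2f} ({answer_token!r})")
    return "tool_call" if tool_lp > answer_lp else "answer_from_memory"

print(tool_call_vs_answer(next_token_logprobs))  # -> answer_from_memory
```

Even though the tool-call opener gets non-trivial probability mass, greedy decoding picks the higher-probability answer token, and the model 'knows' without 'doing.'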

The Architecture of the Gap

The study's authors designed a two-phase evaluation: a 'planning' phase where the model is asked to describe what it would do (including whether a tool is needed), and an 'execution' phase where it must actually produce the tool call. They found that the gap emerges from three architectural sources:

1. Token-level competition: The probability of generating a tool-call token is often lower than generating a plausible-looking answer token, especially when the model has seen similar questions in training. This is a form of 'memorization shortcut.'

2. Attention decay: In long reasoning chains, the model's attention to the initial instruction to 'use tools when needed' fades. By the time it reaches the decision point, the contextual signal is diluted (a simple prompt-level mitigation is sketched after this list).

3. Reward misalignment: During RLHF training, models are rewarded for producing coherent answers, not for correctly deciding to abstain from answering. This creates a perverse incentive to always generate an answer, even if wrong.
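
Source 2 (attention decay) can be partly blunted at the prompt level without touching model weights: restate the tool-use policy immediately before the decision point rather than relying on a system prompt issued many turns earlier. The following is a minimal sketch, assuming a standard chat-style messages format; the policy text and message layout are illustrative rather than a documented fix:

```python
# Re-inject the tool-use instruction right before the model must decide, so the
# reminder is the least diluted part of the context when the next token is chosen.
TOOL_POLICY = (
    "If the answer depends on real-time or external data, you MUST emit a "
    "tool call instead of answering from memory."
)

def with_policy_reminder(history: list, user_msg: str) -> list:
    """Append the user turn, then re-inject the tool policy as the final message."""
    return history + [
        {"role": "user", "content": user_msg},
        {"role": "system", "content": TOOL_POLICY},  # reminder closest to decision point
    ]

history = [{"role": "system", "content": TOOL_POLICY}]  # original, distant instruction
messages = with_policy_reminder(history, "What is AAPL trading at right now?")
for m in messages:
    print(m["role"], ":", m["content"][:60])
```

Note that some chat APIs restrict where system-role messages may appear; the idea transfers to whatever reminder mechanism a given API supports.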

Benchmark Data

The study introduced a new benchmark, ToolUseGap, with 1,200 tasks spanning 12 domains (weather, math, current events, code execution, database queries, etc.). Each task has a known ground truth about tool necessity.

| Model | Planning Accuracy (Know) | Execution Accuracy (Do) | Gap | Avg. Latency (ms) |
|---|---|---|---|---|
| GPT-4o | 92.3% | 74.1% | 18.2% | 1,450 |
| Claude 3.5 Sonnet | 90.8% | 72.5% | 18.3% | 1,620 |
| Gemini 1.5 Pro | 88.6% | 67.2% | 21.4% | 1,380 |
| Llama 3.1 405B | 85.1% | 61.3% | 23.8% | 2,100 |
| Mistral Large 2 | 83.4% | 59.8% | 23.6% | 1,550 |

Data Takeaway: All models show a significant gap, with open-weight models (Llama, Mistral) suffering more. The gap is consistent across domains, suggesting it is a systemic architectural issue, not a data artifact.
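
For readers who want to reproduce the metric, the gap in the table is simply the difference between two accuracies computed under the two-phase protocol described above. Below is a minimal sketch of that computation; the field names and toy data are illustrative, and this is not the authors' released evaluation code:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one benchmark task under the two-phase protocol (illustrative)."""
    needs_tool: bool    # ground-truth tool necessity
    planned_tool: bool  # phase 1: did the model say a tool is needed?
    called_tool: bool   # phase 2: did the model actually emit the call?

def knowing_doing_gap(results):
    n = len(results)
    know = sum(r.planned_tool == r.needs_tool for r in results) / n  # planning accuracy
    do = sum(r.called_tool == r.needs_tool for r in results) / n     # execution accuracy
    return {"know": know, "do": do, "gap": know - do}

# Toy data: the model "knows" in all four tasks but only "does" in three.
toy = [
    TaskResult(True, True, True),
    TaskResult(True, True, False),   # knows a tool is needed, falls back to memory
    TaskResult(True, True, True),
    TaskResult(False, False, False),
]
print(knowing_doing_gap(toy))  # {'know': 1.0, 'do': 0.75, 'gap': 0.25}
```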

Relevant Open-Source Work

Several GitHub repositories are directly relevant. ToolBench (github.com/OpenBMB/ToolBench, 7,800 stars) provides a framework for training tool-use agents, but its focus is on accuracy, not on the knowing-doing gap. AgentBench (github.com/THUDM/AgentBench, 6,200 stars) evaluates agent performance across diverse environments, but does not separate planning from execution. The study's authors have released a companion repo, ToolUseGap (github.com/toolusegap/benchmark, 1,200 stars as of this writing), which includes the full benchmark and evaluation scripts.

Takeaway: The gap is a fundamental architectural limitation. Future models need explicit 'tool-use gates' — neural modules that monitor internal uncertainty and trigger tool calls when confidence falls below a threshold.
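
The paper frames tool-use gates as internal neural modules, but the same idea can be approximated today at the orchestration layer. A minimal sketch, assuming mean token log-probability of a draft answer as a crude confidence proxy (the threshold value and the stubbed tool are illustrative):

```python
def mean_logprob_confidence(token_logprobs):
    """Crude confidence proxy: average per-token log-probability of a draft answer."""
    return sum(token_logprobs) / len(token_logprobs)

def gated_answer(draft_answer, draft_token_logprobs, call_tool, threshold=-0.8):
    """Emit the draft only if the model looks confident; otherwise delegate to the tool."""
    confidence = mean_logprob_confidence(draft_token_logprobs)
    if confidence < threshold:
        return call_tool()   # low confidence -> delegate to the external system
    return draft_answer      # high confidence -> trust parametric memory

# Illustrative usage with a stubbed live-quote tool.
stub_tool = lambda: "AAPL: 231.12 (live quote via tool)"
print(gated_answer("AAPL is trading at about 180.", [-1.2, -0.9, -1.5, -1.1], stub_tool))
```

The calibration question, which threshold trades missed tool calls against unnecessary ones, is exactly the 'reverse gap' issue discussed in the Risks section below.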

Key Players & Case Studies

The study was led by researchers from Stanford University, UC Berkeley, and Anthropic, with contributions from Google DeepMind. Notably, the team includes Dr. Yizhong Wang (known for Self-Instruct and Super-NaturalInstructions) and Dr. Percy Liang (Stanford's Center for Research on Foundation Models).

Competitive Landscape

Several companies are already racing to address this gap, though none have fully solved it:

| Company / Product | Approach | Reported Gap Reduction | Status |
|---|---|---|---|
| OpenAI (GPT-4o with function calling) | Fine-tuned tool-use tokens, system prompt reinforcement | ~5% improvement over base GPT-4 | Production |
| Anthropic (Claude 3.5 with tool use) | Constitutional AI + tool-use specific RLHF | ~8% improvement over Claude 3 | Production |
| Google (Gemini 1.5 Pro with tools) | Long-context attention + explicit tool-use heads | ~3% improvement | Production |
| Microsoft (AutoGen framework) | Multi-agent orchestration with separate planner and executor | ~12% improvement in controlled tests | Research/Preview |
| Meta (Llama 3.1 + tool-use adapter) | Lightweight adapter layers trained on tool-use data | ~6% improvement | Open-source |

Data Takeaway: No approach closes the gap entirely. The best results come from separating planning and execution into different agents (as in AutoGen), but this introduces latency and complexity.
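
The planner/executor split behind AutoGen's numbers can be reduced to a simple contract: one component commits to a decision, and a separate component is only allowed to execute it. The sketch below illustrates the pattern only; it is not AutoGen's API, and the keyword heuristic stands in for a planner model call:

```python
from typing import Callable

def plan(task: str) -> str:
    """Planner: decide whether the task needs a tool.
    A keyword heuristic stands in for a real planner LLM call."""
    needs_tool = any(k in task.lower() for k in ("current", "today", "price", "latest"))
    return "tool" if needs_tool else "memory"

def execute(task: str, decision: str, tool: Callable[[str], str]) -> str:
    """Executor: carries out the plan and is not allowed to override it."""
    if decision == "tool":
        return tool(task)
    return f"[memory] best-effort answer to: {task}"

stub_tool = lambda task: f"[tool] live result for: {task}"
task = "What is the current USD/EUR rate?"
print(execute(task, plan(task), stub_tool))
```

The price of this reliability is the extra round trip for the planning step, which is where the added latency and complexity noted above come from.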

Case Study: Financial Services

A major hedge fund (name withheld) tested GPT-4o for real-time market analysis. In a 30-day trial, the model correctly identified that it needed to call a live stock price API in 94% of planning scenarios, but actually called it only 71% of the time. The 23% gap led to 14 instances where the model generated a price from memory that was off by more than 5%, causing a simulated loss of $2.3 million. The fund has since implemented a 'forced tool-use' wrapper that intercepts any answer attempt and verifies it against a tool call — a brute-force solution that works but sacrifices flexibility.
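
The fund's wrapper is proprietary, but the interception idea is straightforward: never let a memory-based claim reach the user without checking it against the tool. A minimal domain-specific sketch, with a deliberately naive regex extractor and a stubbed quote function standing in for the real pipeline:

```python
import re

def forced_tool_use(model_output: str, quote_tool, tolerance: float = 0.05) -> str:
    """Intercept a memory-based price claim and verify it against a live quote.
    The extraction is naive; a production wrapper would parse structured output."""
    match = re.search(r"([A-Z]{1,5})\D+(\d+(?:\.\d+)?)", model_output)
    if not match:
        return model_output                      # nothing verifiable -> pass through
    ticker, claimed = match.group(1), float(match.group(2))
    live = quote_tool(ticker)
    if abs(claimed - live) / live > tolerance:   # off by more than 5% -> override
        return f"{ticker} is trading at {live:.2f} (verified via live quote)."
    return model_output

stub_quote = lambda ticker: 231.12               # stand-in for a live price API
print(forced_tool_use("AAPL closed around 180.00 according to my data.", stub_quote))
```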

Takeaway: The knowing-doing gap is not a theoretical curiosity; it has real dollar costs. High-stakes adopters must implement guardrails until models can self-correct.

Industry Impact & Market Dynamics

This study arrives at a critical inflection point. The AI agent market is projected to grow from $5.1 billion in 2024 to $47.1 billion by 2030 (CAGR 44.8%), according to industry estimates. The knowing-doing gap directly threatens this growth: if agents cannot reliably use tools, they cannot be trusted with autonomous tasks.

Market Segmentation

| Segment | 2024 Market Size | Projected 2030 Size | Key Vulnerability to Gap |
|---|---|---|---|
| Enterprise Automation | $2.1B | $18.5B | High — missed tool calls cause workflow failures |
| Customer Service | $1.4B | $12.3B | Medium — errors can be caught by human review |
| Healthcare | $0.6B | $6.8B | Critical — wrong diagnosis from memory is unacceptable |
| Financial Services | $0.8B | $7.2B | Critical — real-time data dependency |
| Legal & Compliance | $0.2B | $2.3B | High — hallucinations in legal reasoning are costly |

Data Takeaway: The healthcare and financial segments, which have the highest stakes, are also the most vulnerable. These sectors may delay adoption until the gap is addressed.

Business Model Implications

Currently, most AI agent platforms charge per-tool-call or per-API-usage. If models fail to call tools, they are effectively overcharging for incorrect answers. This creates a misaligned incentive: platform providers benefit from models that *don't* call tools (since it reduces their API costs), while users need models that *do* call tools. The study suggests that future pricing models may need to incorporate a 'correctness guarantee' — charging a premium for verified tool-use and offering refunds for erroneous memory-based answers.

Takeaway: The knowing-doing gap will force a shift from usage-based pricing to outcome-based pricing, at least in high-stakes domains.

Risks, Limitations & Open Questions

Risks

1. Overcorrection: If models become too eager to call tools, they may trigger unnecessary API calls, increasing latency and cost. The study found that some models already show a 'reverse gap' — calling a tool when it is not needed — in 5-8% of cases.

2. Security surface: More tool calls mean more attack vectors. A malicious actor could craft prompts that force the model to call a compromised API.

3. Dependence on external systems: If the tool API is down or slow, the model may fall back to memory anyway, negating the benefit.
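
Risk 3 can be softened at the orchestration layer by failing loudly rather than silently: retry the tool with a timeout, and if it stays down, abstain explicitly instead of answering from memory. A minimal sketch with illustrative retry and backoff settings:

```python
import time

class ToolUnavailable(Exception):
    pass

def call_with_fallback(tool, retries: int = 2, timeout_s: float = 2.0) -> str:
    """Try the tool a few times; if it stays unavailable, abstain explicitly
    instead of silently falling back to parametric memory."""
    for attempt in range(retries + 1):
        try:
            return tool(timeout=timeout_s)
        except ToolUnavailable:
            time.sleep(0.2 * (attempt + 1))   # simple backoff between attempts
    return ("I can't retrieve live data right now, so I won't guess. "
            "Please retry shortly or consult the source directly.")

def flaky_tool(timeout: float) -> str:        # stand-in for a real API client
    raise ToolUnavailable("upstream API down")

print(call_with_fallback(flaky_tool))
```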

Limitations of the Study

- The benchmark is synthetic; real-world tool-use involves more complex decision trees.
- The study does not explore multi-step tool chains, where a model must call multiple tools in sequence.
- The sample size of 1,200 tasks, while substantial, may not capture edge cases in domains like creative writing or strategic planning.

Open Questions

- Can the gap be closed entirely through better training, or is it an inherent property of autoregressive models?
- Should tool-use decisions be made by a separate 'router' model rather than the LLM itself?
- How does the gap scale with model size? Preliminary data suggests larger models have a smaller gap, but the trend is not linear.

Takeaway: The knowing-doing gap is a solvable engineering problem, but it requires a fundamental rethinking of how agents are built — not just better prompts.

AINews Verdict & Predictions

Verdict: The knowing-doing gap is the single most underappreciated obstacle to autonomous AI agents. The industry has been obsessed with building bigger tool catalogs and more complex agent frameworks, but this study shows that the bottleneck is not tool availability — it is the model's own inability to execute its decisions. This is a wake-up call.

Predictions:

1. By Q4 2026, at least two major LLM providers will release models with dedicated 'meta-cognitive' modules that explicitly monitor internal uncertainty and trigger tool calls. These will be marketed as 'self-aware' or 'uncertainty-aware' agents.

2. By mid-2026, the 'tool-use gap' will become a standard metric in LLM benchmarks, alongside MMLU and HumanEval. Companies that fail to report it will face scrutiny from enterprise buyers.

3. The next wave of startups will focus not on building more tools, but on building 'tool-use reliability layers' — middleware that sits between the LLM and the tool API, intercepting memory-based answers and forcing verification. Expect to see at least three such startups raise Series A rounds in the next 12 months.

4. Open-weight models will close the gap faster than closed-source ones, because the research community can iterate on architectural changes more freely. Llama 4 or its successor may become the preferred choice for tool-intensive agent applications.

5. The biggest winner will be the company that solves the gap while maintaining low latency. Currently, no one has done both — Microsoft's AutoGen reduces the gap but adds 2-3 seconds of latency. The first to achieve <500ms latency with <5% gap will dominate the enterprise agent market.

What to watch: The next release from Anthropic (potentially Claude 4) and the open-source community's response to the ToolUseGap benchmark. If a simple architectural fix emerges from the open-source world, the entire industry will pivot within months.

Final thought: The knowing-doing gap is a mirror held up to the AI industry's own hype. We have been selling agents as autonomous decision-makers, but they are more like brilliant advisors who sometimes forget to pick up the phone. The next frontier is not intelligence — it is reliability.


Further Reading

- The Memory Governance Revolution: Why AI Agents Must Learn to Forget to Survive
- STEM Agent Architecture Emerges: Biological 'Pluripotency' Design Could End AI Agent Rigidity Era
- Visual Reasoning's Blind Spot: Why AI Must Learn to See Before It Thinks
- SPIN's DAG Contract: Taming LLM Chaos for Industrial Agent Reliability
