AI's Deductive Reasoning Put to the Test in Multi-Agent Clue Game Simulations

arXiv cs.AI March 2026
A groundbreaking study turns the classic deduction game Clue into a complex text-based multi-agent simulation, setting a new benchmark for AI reasoning. The study pits leading language models against one another in a battle of wits and reveals significant limitations in their logical reasoning.

A new study introduces a benchmark for evaluating the multi-step reasoning capabilities of large language models (LLMs). By constructing a text-based, multi-agent simulation environment based on the classic board game Clue (known in Chinese as *Miao Tan Xun Xiong*), the study moves beyond simple question answering to test models in a dynamic scenario requiring long-term memory, logical integration, and strategic adaptation.

The research deployed six agents powered by two models, GPT-4o-mini and Gemini-2.5-Flash, across 18 full simulated games. Each agent, representing a character in the murder mystery, had to make and respond to suggestions and accusations, process revealed information, and deduce the correct culprit, weapon, and location through iterative dialogue and observation. The core objective was to measure how well these models can chain together disparate pieces of information over multiple conversational turns, a fundamental skill for real-world applications.
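To make the setup concrete, here is a minimal sketch of how one such game could be dealt. The article does not describe the study's actual card set or dealing code, so the standard Clue deck (6 suspects, 6 weapons, 9 rooms) and the `deal_game` helper below are assumptions for illustration only.

```python
import random

# Standard Clue deck; whether the study used exactly these cards is an assumption.
SUSPECTS = ["Scarlett", "Mustard", "White", "Green", "Peacock", "Plum"]
WEAPONS = ["Candlestick", "Dagger", "Lead Pipe", "Revolver", "Rope", "Wrench"]
ROOMS = ["Kitchen", "Ballroom", "Conservatory", "Dining Room", "Billiard Room",
         "Library", "Lounge", "Hall", "Study"]


def deal_game(num_players=6, seed=None):
    """Hypothetical helper: pick a hidden solution, then deal the rest evenly."""
    rng = random.Random(seed)
    solution = {
        "suspect": rng.choice(SUSPECTS),
        "weapon": rng.choice(WEAPONS),
        "room": rng.choice(ROOMS),
    }
    hidden = set(solution.values())
    deck = [card for card in SUSPECTS + WEAPONS + ROOMS if card not in hidden]
    rng.shuffle(deck)
    # Round-robin deal: player i receives every num_players-th card.
    hands = [deck[i::num_players] for i in range(num_players)]
    return solution, hands


solution, hands = deal_game(seed=0)
```

With six players and the standard deck, the 18 cards left after removing the solution split into three per agent, so each agent starts with only a small share of the information it needs.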

A critical secondary investigation explored whether fine-tuning LLMs on structured logical puzzles could enhance their performance in this open-ended, game-based reasoning task. The findings systematically expose the current fragility of even advanced models when faced with scenarios demanding persistent state tracking and causal deduction. Although fine-tuning did not yield a dramatic performance leap, the study provides empirical direction for future development, emphasizing that reliable AI reasoning requires moving beyond surface-level language generation to build robust internal mechanisms for world modeling and strategic optimization.

Technical Analysis

The construction of a Clue-based multi-agent environment represents a significant methodological advancement in AI evaluation. Traditional benchmarks often test knowledge retrieval or single-step inference. This framework, however, forces models to operate in a constrained but open-ended world with explicit rules, hidden information, and multiple interacting participants. Success requires a model to maintain a dynamic "mental model" of the game state: who has what cards, what suggestions have been made and disproven, and which possibilities have been eliminated.
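The kind of "mental model" described above can be written down explicitly. The sketch below is a hypothetical per-agent notebook, not the study's implementation: the `Notebook` class and its method names are invented here to show what consistent state tracking would involve, including rejecting a new fact that contradicts an earlier one.

```python
from dataclasses import dataclass, field


@dataclass
class Notebook:
    """Hypothetical per-agent record of which player holds which card."""
    players: list
    # (player, card) -> True if the player holds it, False if not; absent = unknown
    status: dict = field(default_factory=dict)

    def _set(self, player, card, holds):
        known = self.status.get((player, card))
        if known is not None and known != holds:
            raise ValueError(f"contradiction about {player} and {card}")
        self.status[(player, card)] = holds

    def record_holds(self, player, card):
        # A card has exactly one owner, so every other player lacks it.
        self._set(player, card, True)
        for other in self.players:
            if other != player:
                self._set(other, card, False)

    def record_pass(self, player, suggestion):
        # A player who cannot disprove a suggestion holds none of its cards.
        for card in suggestion:
            self._set(player, card, False)


notebook = Notebook(players=["A", "B", "C"])
notebook.record_holds("A", "Revolver")
notebook.record_pass("B", ("Plum", "Rope", "Library"))
```

A symbolic record like this never forgets an earlier fact, which is the property the language-model agents reportedly struggle to maintain over a long conversation.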

The research findings indicate that while models like GPT-4o-mini and Gemini-2.5-Flash can parse individual turns and generate plausible-sounding dialogue, they struggle with the consistent, long-horizon logic required for winning. Key failure modes include:
* State Tracking Degradation: Models frequently lose track of previously established facts over the course of a long conversation, leading to logically inconsistent moves.
* Strategic Myopia: Agents often make suggestions that are locally coherent but do not contribute to a long-term winning strategy, such as failing to strategically test specific hypotheses to narrow down possibilities.
* Inference Chain Breakdown: The ability to combine multiple pieces of negative information (e.g., "Player A does not have the Revolver, and Player B does not have the Library") to triangulate a positive conclusion remains fragile.
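The inference chain in the last failure mode can be illustrated with two small rules. This is a sketch under simplified assumptions (three players labelled A, B, and C, and negative facts stored as a set of pairs); the function names are hypothetical and the article does not say how, or whether, the study encoded such rules.

```python
def must_be_in_envelope(card, players, lacks):
    """A card that every player is known to lack can only be in the envelope."""
    return all((player, card) in lacks for player in players)


def card_that_was_shown(disprover, suggestion, lacks):
    """If the disprover is known to lack all but one suggested card, return that card."""
    possible = [card for card in suggestion if (disprover, card) not in lacks]
    return possible[0] if len(possible) == 1 else None


players = ["A", "B", "C"]
# Negative facts gathered over several turns, as (player, card) pairs.
lacks = {
    ("A", "Revolver"), ("B", "Revolver"), ("C", "Revolver"),
    ("B", "Library"), ("B", "Plum"),
}

# No player holds the Revolver, so it must be the murder weapon.
revolver_is_weapon = must_be_in_envelope("Revolver", players, lacks)
# B disproved (Plum, Rope, Library) but lacks Plum and Library, so B showed Rope.
shown = card_that_was_shown("B", ("Plum", "Rope", "Library"), lacks)
```

Each rule is trivial on its own. The difficulty reported for the models lies in applying such rules reliably when the negative facts arrive many turns apart.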

The fine-tuning experiment on logical puzzles is particularly insightful. It probes the transfer learning capability from structured, formal logic problems to a messy, interactive narrative. Preliminary results suggest that while such training can improve performance on puzzle-like aspects, it does not automatically confer robust strategic reasoning or flawless state management in the game. This highlights a gap between mastering a *form* of reasoning and developing a general, applicable reasoning *faculty* that can be deployed flexibly.

Industry Impact

This research provides a tangible, scalable testbed with immediate implications for high-stakes industries. The core challenge—integrating fragmented evidence over time to reach a correct conclusion—is directly analogous to critical professional tasks.
* Healthcare: Diagnostic pathways involve sequential testing, ruling out hypotheses, and synthesizing patient history, lab results, and symptoms—a process mirroring the Clue deduction loop.
* Finance & Risk: Analysts must piece together market signals, company filings, and economic indicators to build a coherent investment thesis or assess credit risk.
* Legal & Compliance: Reviewing case law, evidence, and testimonies to construct a legal argument or investigate regulatory breaches requires meticulous multi-step reasoning.

By demonstrating current AI limitations in a controlled game environment, the study sets a clear performance target for developers aiming to build assistive tools for these domains. Furthermore, the multi-agent framework itself is a blueprint for future systems involving AI collaboration or negotiation, such as in supply chain optimization or collaborative design platforms.

Future Outlook

The path forward illuminated by this study is twofold: enhancing model architectures and refining evaluation paradigms.

Architecturally, the results underscore the necessity of moving beyond next-token prediction as the sole training objective. Future models may require explicit modules for persistent world-state memory, causal reasoning engines, and planning algorithms that operate over a longer horizon. Techniques like chain-of-thought prompting are a step in this direction, but the need is for these capabilities to be baked into the model's fundamental operation, not just elicited through careful prompting.

From an evaluation perspective, this work signals a broader shift towards interactive, sequential, and goal-oriented benchmarks. The era of static question-answer datasets is giving way to dynamic simulations where AI performance is measured by its ability to achieve an objective in a complex environment. We can expect a proliferation of similar benchmarks based on other rule-based games (such as Bridge, Diplomacy, or dungeon-master-style role-playing games), software environments, or simulated economic markets.

Ultimately, the research delivers a sobering but constructive message: today's most advanced language models, while proficient at pattern matching and knowledge synthesis, are not yet reliable deductive reasoners. Achieving true reasoning robustness will require a concerted effort to build AI that doesn't just talk about logic but consistently *enacts* it over time and through interaction. The Clue board has been set; the challenge for the AI community is now to solve the puzzle of building a truly reasoning machine.
