CreativityBench Exposes AI's Hidden Flaw: Can't Think Outside the Box

arXiv cs.AI May 2026
A new benchmark called CreativityBench reveals that even the most advanced large language models struggle with creative tool use, such as using a shoe as a hammer or a scarf as a rope. The finding challenges claims of near-human-level intelligence and exposes a fundamental weakness in AI reasoning.

The AI community has long celebrated progress in logic, code generation, and environmental interaction. But a new evaluation framework, CreativityBench, delivers a sobering reality check: current large language models are remarkably bad at thinking sideways. The benchmark tests an agent's ability to repurpose everyday objects in unconventional ways—for example, using a shoe to drive a nail or a scarf to tie a bundle. Results show that models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro achieve accuracy rates below 30% on these tasks, compared to human performance exceeding 85%.

This isn't a trivial edge case; it strikes at the heart of what it means to be intelligent. CreativityBench measures 'affordance reasoning'—the ability to infer a tool's potential based on its physical properties (hardness, shape, flexibility) rather than its labeled function. The failure reveals that LLMs operate primarily as pattern matchers, retrieving memorized associations rather than dynamically reasoning about an object's material characteristics.

The implications are profound: for robotics, autonomous systems, and household AI, the inability to creatively repurpose tools means they will remain brittle, unable to adapt to novel situations. The benchmark's creators argue that the next frontier for AI is not scaling parameters but building a 'dynamic property inference layer' that allows models to decompose objects into fundamental physical attributes.

This analysis explores the technical architecture behind affordance reasoning, profiles key players like Google DeepMind and MIT CSAIL who are pioneering this space, and forecasts how this insight will reshape the competitive landscape of intelligent agents.

Technical Deep Dive

CreativityBench is not just another benchmark; it is a targeted stress test for a cognitive capability that has been largely ignored: affordance-based creative tool use. The term 'affordance,' coined by psychologist James J. Gibson, refers to the possibilities for action that an object offers to an agent. A chair affords sitting, but it also affords standing on, blocking a door, or, if broken, providing a wooden lever. Current LLMs are trained to map objects to their canonical functions—a hammer is for hitting, a shoe is for wearing. CreativityBench forces models to break this mapping.

The benchmark consists of 500 tasks, each presenting an agent with a goal (e.g., 'drive a nail into a wall') and a set of objects that do not include the conventional tool (a hammer). The agent must select an alternative object (e.g., a shoe, a rock, a heavy book) and describe how to use it. The evaluation is two-fold: (1) object selection accuracy—does the model pick a physically plausible substitute? (2) usage description quality—does the model's explanation correctly leverage the object's affordances (e.g., 'use the shoe's hard heel as a striking surface')?
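The task format and the first half of that evaluation can be sketched in a few lines. This is a minimal illustration of a CreativityBench-style item and the object-selection-accuracy metric; the field names, example tasks, and annotation scheme are assumptions for the sake of the example, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # One CreativityBench-style item: a goal plus candidate objects,
    # deliberately excluding the conventional tool. The 'plausible' set
    # stands in for human annotations of physically viable substitutes.
    goal: str
    candidates: list
    plausible: set

def selection_accuracy(tasks, predictions):
    """Fraction of tasks where the model picked a plausible substitute."""
    correct = sum(1 for t, p in zip(tasks, predictions) if p in t.plausible)
    return correct / len(tasks)

tasks = [
    Task("drive a nail into a wall",
         ["shoe", "scarf", "pillow", "rock"], {"shoe", "rock"}),
    Task("tie a bundle of sticks",
         ["scarf", "plate", "spoon"], {"scarf"}),
]

# One plausible pick ("shoe") and one implausible pick ("plate").
print(selection_accuracy(tasks, ["shoe", "plate"]))  # 0.5
```

The second half of the evaluation, usage description quality, is scored with BERTScore F1 against reference explanations and is not reproduced here.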

Results are stark. The table below shows performance on the object selection task across leading models:

| Model | Object Selection Accuracy | Usage Description Quality (BERTScore F1) |
|---|---|---|
| GPT-4o | 28.4% | 0.61 |
| Claude 3.5 Sonnet | 26.1% | 0.58 |
| Gemini 1.5 Pro | 24.7% | 0.55 |
| Llama 3.1 405B | 22.3% | 0.52 |
| Human (baseline) | 87.2% | 0.91 |

Data Takeaway: The gap between AI and human performance is not incremental—it's a chasm. Even the best model is more than 3x worse than humans at selecting a creative tool. This suggests that current architectures lack a fundamental reasoning mechanism.

Why do models fail? The root cause lies in the static property encoding of objects. In a typical transformer, an object like 'shoe' is represented by a token embedding that aggregates all contexts from training data. This embedding is a blend of 'footwear,' 'leather,' 'sole,' 'lace,' etc., but it does not explicitly encode physical properties like hardness (Shore durometer), density (kg/m³), or coefficient of friction. When asked to use a shoe as a hammer, the model cannot dynamically compute that the heel is hard enough to transfer force. Instead, it retrieves the most statistically frequent usage pattern—'wear on foot'—and rejects the alternative.
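The contrast can be made concrete. The sketch below sets an opaque identity embedding against a representation that keeps physical attributes as named, queryable fields; all property names and numbers are invented for illustration, not measured values.

```python
# A token embedding blends every training context for "shoe" into one
# opaque vector; hardness is not a recoverable coordinate of it.
static_embedding = {"shoe": [0.12, -0.80, 0.33]}  # meaning of dims unknown

# A property-grounded representation keeps physical attributes explicit,
# so a planner can query them directly. Values here are illustrative.
properties = {
    "shoe":  {"hardness": 0.8, "mass_kg": 0.4, "rigid": True},
    "scarf": {"hardness": 0.1, "mass_kg": 0.2, "rigid": False},
}

def can_strike(obj):
    """A hammer substitute needs a hard, rigid part with some mass behind it."""
    p = properties[obj]
    return p["rigid"] and p["hardness"] > 0.6 and p["mass_kg"] > 0.2

print([o for o in properties if can_strike(o)])  # ['shoe']
```

Nothing in `static_embedding` supports the `can_strike` query; the explicit property table does, which is the gap the 'dynamic property inference layer' aims to close.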

To address this, researchers are exploring dynamic property inference layers. One promising approach, detailed in a recent preprint from MIT CSAIL (not yet on GitHub but related to the 'PropertyNet' project), proposes a two-stage architecture: first, a vision-language model extracts physical properties from an image of the object (e.g., 'this shoe has a rubber sole, a leather upper, and a hard plastic heel'); second, a reasoning module uses these properties to simulate the tool's effectiveness for a given task. The GitHub repository 'affordance-net' (1.2k stars) implements a similar idea for robotic grasping, using a graph neural network to predict grasp affordances from point clouds. However, it has not been extended to creative tool use.
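The two-stage pipeline can be sketched as follows. Both stages are stubbed: a real system would run a vision-language model in stage one and a physics simulator in stage two, so every function, property name, and scoring rule below is a placeholder assumption, not the PropertyNet design.

```python
def extract_properties(object_name):
    """Stage 1 stub: stands in for a VLM reading properties off an image."""
    catalog = {
        "shoe":  {"hardness": 0.8, "mass_kg": 0.4},
        "scarf": {"hardness": 0.1, "mass_kg": 0.2},
        "book":  {"hardness": 0.6, "mass_kg": 1.1},
    }
    return catalog[object_name]

def effectiveness(task, props):
    """Stage 2 stub: a toy scoring rule in place of a physics simulation."""
    if task == "drive a nail":
        # Striking needs both hardness and enough mass behind the blow.
        return props["hardness"] * props["mass_kg"]
    raise ValueError(f"unknown task: {task}")

candidates = ["shoe", "scarf", "book"]
scores = {c: effectiveness("drive a nail", extract_properties(c))
          for c in candidates}
best = max(scores, key=scores.get)
print(best)  # 'book' scores highest under this toy rule
```

The point of the decomposition is that the same stage-1 properties can be reused for any downstream task, whereas a monolithic model must relearn each object-task pairing.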

Another relevant open-source effort is 'ToolEmu' (2.8k stars), which simulates tool use in a virtual environment, but it focuses on conventional tool usage, not creative repurposing. The CreativityBench team has released a small evaluation suite on GitHub (repo: 'creativity-bench', 450 stars) that allows researchers to test their own models.

Technical Takeaway: The path forward requires decoupling object identity from physical properties. Models must learn a compositional representation where 'hardness,' 'shape,' and 'weight' are independent latent variables that can be recombined for novel tasks. This is a fundamentally different learning objective from next-token prediction.

Key Players & Case Studies

Several organizations are already grappling with this challenge, though none have fully solved it.

Google DeepMind has been a leader in affordance reasoning through its work on 'Socratic models' and 'SayCan.' SayCan, a robotic system that combines a language model with a skill library, can understand commands like 'bring me a drink' but fails when asked to 'use a book to prop open a door' because the skill 'prop door with book' is not in its library. DeepMind's latest research, 'AffordanceGPT,' attempts to generate novel skills on the fly by querying a physics simulator, but it remains computationally expensive and slow for real-time use.

MIT CSAIL (Professor Pulkit Agrawal's group) has developed 'PropertyNet,' a neural network that predicts physical properties (mass, friction, elasticity) from a single image. When integrated with a planner, it can suggest creative tool use—e.g., using a frying pan as a hammer. However, the system is still in the lab and has not been deployed on a physical robot.

OpenAI has not publicly addressed affordance reasoning, but its work on 'function calling' in GPT-4o allows the model to invoke external APIs. This is a form of tool use, but it is entirely digital and pre-specified. There is no evidence that OpenAI is working on physical affordance reasoning.

Anthropic has focused on 'constitutional AI' and safety, but its Claude models show slightly better performance on CreativityBench than competitors, likely due to more diverse training data that includes creative writing prompts. However, the improvement is marginal.

Nvidia is a dark horse. Its 'Isaac Sim' platform provides a photorealistic simulation environment where robots can practice tool use. Nvidia researchers have published a paper on 'Sim-to-Real Transfer for Affordance Learning,' but the results are preliminary.

| Organization | Approach | Key Technology | Maturity |
|---|---|---|---|
| Google DeepMind | Physics simulation + LLM | AffordanceGPT | Research prototype |
| MIT CSAIL | Visual property prediction | PropertyNet | Lab stage |
| OpenAI | Function calling (digital only) | GPT-4o API | Production |
| Anthropic | Diverse training data | Claude 3.5 | Production |
| Nvidia | Sim-to-real transfer | Isaac Sim | Research prototype |

Data Takeaway: No major player has a production-ready solution for creative physical tool use. The gap between research prototypes and deployable systems is wide, creating an opportunity for startups.

Industry Impact & Market Dynamics

The inability of AI to creatively use tools has direct commercial consequences, particularly in robotics and autonomous systems.

Robotics: The global robotics market is projected to reach $74 billion by 2027 (source: industry analysts). However, most industrial robots are 'dumb'—they perform repetitive, pre-programmed tasks. The promise of general-purpose robots (e.g., Tesla Optimus, Figure 01) hinges on their ability to adapt to novel situations. If a robot cannot figure out that a shoe can be a hammer, it will fail in unstructured environments like homes or disaster zones. CreativityBench suggests that current AI-powered robots are years away from this capability.

Autonomous Vehicles: Self-driving cars must handle edge cases—e.g., using a traffic cone as a makeshift barrier or a blanket to cover a broken window. Current systems rely on pre-defined object categories and cannot reason about alternative uses. This limits their ability to operate in unpredictable conditions.

Household AI: Products like Amazon Astro or Samsung Ballie are designed to assist with chores. But if they cannot creatively repurpose tools, they will remain novelties. For example, a robot that cannot use a towel to mop up a spill (because a towel is 'for drying') is not truly helpful.

| Market Segment | Current AI Capability | Required for CreativityBench | Revenue Impact if Solved |
|---|---|---|---|
| Industrial Robotics | High (pre-programmed) | Low | $5B incremental |
| Service Robotics | Medium (limited adaptation) | High | $20B incremental |
| Autonomous Vehicles | Low (edge cases) | High | $50B+ (safety) |
| Household AI | Very Low | Critical | $10B incremental |

Data Takeaway: The market value of solving creative tool use is enormous, particularly in service robotics and autonomous vehicles, where adaptability is the key differentiator.

Risks, Limitations & Open Questions

While CreativityBench highlights a genuine weakness, it is not without limitations.

Benchmark Validity: The tasks are human-designed and may not capture all forms of creativity. For instance, a model might solve a problem in a way the benchmark did not anticipate, leading to false negatives. The benchmark's creators have acknowledged this and plan to release an open-ended version.

Safety Concerns: Teaching AI to creatively repurpose tools could backfire. A robot that learns a knife can be used as a screwdriver might also learn it can be used as a weapon. Affordance reasoning must be paired with strong safety constraints.

Computational Cost: Dynamic property inference requires running a vision model, a physics simulator, and a reasoning module for each decision. This is orders of magnitude more expensive than a single forward pass through an LLM. Real-time deployment may be infeasible for years.

The 'Long Tail' Problem: Even if a model can reason about basic affordances (hard, soft, heavy), the space of possible creative uses is infinite. How do we ensure the model generalizes to truly novel situations? This is an open research question.

AINews Verdict & Predictions

CreativityBench is not a minor critique; it is a fundamental indictment of the current AI paradigm. The industry has been chasing scale—bigger models, more data, longer context windows—but this benchmark shows that scale alone cannot imbue a model with the ability to reason about the physical world in a flexible way. The 'intelligence' we measure with MMLU, GSM8K, or HumanEval is largely pattern matching, not genuine understanding.

Prediction 1: Within 18 months, at least one major AI lab (likely DeepMind or a well-funded startup) will release a model that achieves >50% accuracy on CreativityBench by integrating a dedicated affordance reasoning module. This will be seen as a breakthrough.

Prediction 2: The next wave of AI funding will shift from 'scaling laws' to 'reasoning architectures.' Startups that build property inference layers or physics-aware planners will attract significant venture capital.

Prediction 3: Robotics companies will begin to incorporate CreativityBench-like evaluations into their hiring and product roadmaps. The first company to demonstrate a robot that can creatively repurpose tools in a real-world setting will gain a massive competitive advantage.

What to watch: The GitHub repositories 'affordance-net' and 'creativity-bench' for community progress; any publication from DeepMind on 'AffordanceGPT 2.0'; and product announcements from Figure AI or Tesla regarding tool use capabilities.

The era of 'bigger is better' is ending. The era of 'smarter is better' is beginning. CreativityBench has drawn the line in the sand.
