CreativityBench Exposes AI's Hidden Flaw: Can't Think Outside the Box

arXiv cs.AI May 2026
A new benchmark called CreativityBench reveals that even the most advanced large language models struggle with creative tool use, such as using a shoe as a hammer or a scarf as a rope. The finding challenges claims of near-human-level intelligence and exposes a fundamental weakness in AI reasoning.

The AI community has long celebrated progress in logic, code generation, and environmental interaction. But a new evaluation framework, CreativityBench, delivers a sobering reality check: current large language models are remarkably bad at thinking sideways. The benchmark tests an agent's ability to repurpose everyday objects in unconventional ways—for example, using a shoe to drive a nail or a scarf to tie a bundle. Results show that models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro achieve accuracy rates below 30% on these tasks, compared to human performance exceeding 85%.

This isn't a trivial edge case; it strikes at the heart of what it means to be intelligent. CreativityBench measures 'affordance reasoning'—the ability to infer a tool's potential based on its physical properties (hardness, shape, flexibility) rather than its labeled function. The failure reveals that LLMs operate primarily as pattern matchers, retrieving memorized associations rather than dynamically reasoning about an object's material characteristics.

The implications are profound: for robotics, autonomous systems, and household AI, the inability to creatively repurpose tools means they will remain brittle, unable to adapt to novel situations. The benchmark's creators argue that the next frontier for AI is not scaling parameters but building a 'dynamic property inference layer' that allows models to decompose objects into fundamental physical attributes.

This analysis explores the technical architecture behind affordance reasoning, profiles key players like Google DeepMind and MIT CSAIL who are pioneering this space, and forecasts how this insight will reshape the competitive landscape of intelligent agents.

Technical Deep Dive

CreativityBench is not just another benchmark; it is a targeted stress test for a cognitive capability that has been largely ignored: affordance-based creative tool use. The term 'affordance,' coined by psychologist James J. Gibson, refers to the possibilities for action that an object offers to an agent. A chair affords sitting, but it also affords standing on, blocking a door, or, if broken, providing a wooden lever. Current LLMs are trained to map objects to their canonical functions—a hammer is for hitting, a shoe is for wearing. CreativityBench forces models to break this mapping.

The benchmark consists of 500 tasks, each presenting an agent with a goal (e.g., 'drive a nail into a wall') and a set of objects that do not include the conventional tool (a hammer). The agent must select an alternative object (e.g., a shoe, a rock, a heavy book) and describe how to use it. The evaluation is two-fold: (1) object selection accuracy—does the model pick a physically plausible substitute? (2) usage description quality—does the model's explanation correctly leverage the object's affordances (e.g., 'use the shoe's hard heel as a striking surface')?
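The task format and the first half of that evaluation can be sketched in a few lines. This is a minimal illustration of a CreativityBench-style item and the object-selection-accuracy metric; the field names, example tasks, and annotation scheme are assumptions for the sake of the example, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # One CreativityBench-style item: a goal plus candidate objects,
    # deliberately excluding the conventional tool. The 'plausible' set
    # stands in for human annotations of physically viable substitutes.
    goal: str
    candidates: list
    plausible: set

def selection_accuracy(tasks, predictions):
    """Fraction of tasks where the model picked a plausible substitute."""
    correct = sum(1 for t, p in zip(tasks, predictions) if p in t.plausible)
    return correct / len(tasks)

tasks = [
    Task("drive a nail into a wall",
         ["shoe", "scarf", "pillow", "rock"], {"shoe", "rock"}),
    Task("tie a bundle of sticks",
         ["scarf", "plate", "spoon"], {"scarf"}),
]

# One plausible pick ("shoe") and one implausible pick ("plate").
print(selection_accuracy(tasks, ["shoe", "plate"]))  # 0.5
```

The second half of the evaluation, usage description quality, is scored with BERTScore F1 against reference explanations and is not reproduced here.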

Results are stark. The table below shows performance on the object selection task across leading models:

| Model | Object Selection Accuracy | Usage Description Quality (BERTScore F1) |
|---|---|---|
| GPT-4o | 28.4% | 0.61 |
| Claude 3.5 Sonnet | 26.1% | 0.58 |
| Gemini 1.5 Pro | 24.7% | 0.55 |
| Llama 3.1 405B | 22.3% | 0.52 |
| Human (baseline) | 87.2% | 0.91 |

Data Takeaway: The gap between AI and human performance is not incremental—it's a chasm. Even the best model is more than 3x worse than humans at selecting a creative tool. This suggests that current architectures lack a fundamental reasoning mechanism.

Why do models fail? The root cause lies in the static property encoding of objects. In a typical transformer, an object like 'shoe' is represented by a token embedding that aggregates all contexts from training data. This embedding is a blend of 'footwear,' 'leather,' 'sole,' 'lace,' etc., but it does not explicitly encode physical properties like hardness (Shore durometer), density (kg/m³), or coefficient of friction. When asked to use a shoe as a hammer, the model cannot dynamically compute that the heel is hard enough to transfer force. Instead, it retrieves the most statistically frequent usage pattern—'wear on foot'—and rejects the alternative.
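The contrast can be made concrete. The sketch below sets an opaque identity embedding against a representation that keeps physical attributes as named, queryable fields; all property names and numbers are invented for illustration, not measured values.

```python
# A token embedding blends every training context for "shoe" into one
# opaque vector; hardness is not a recoverable coordinate of it.
static_embedding = {"shoe": [0.12, -0.80, 0.33]}  # meaning of dims unknown

# A property-grounded representation keeps physical attributes explicit,
# so a planner can query them directly. Values here are illustrative.
properties = {
    "shoe":  {"hardness": 0.8, "mass_kg": 0.4, "rigid": True},
    "scarf": {"hardness": 0.1, "mass_kg": 0.2, "rigid": False},
}

def can_strike(obj):
    """A hammer substitute needs a hard, rigid part with some mass behind it."""
    p = properties[obj]
    return p["rigid"] and p["hardness"] > 0.6 and p["mass_kg"] > 0.2

print([o for o in properties if can_strike(o)])  # ['shoe']
```

Nothing in `static_embedding` supports the `can_strike` query; the explicit property table does, which is the gap the 'dynamic property inference layer' aims to close.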

To address this, researchers are exploring dynamic property inference layers. One promising approach, detailed in a recent preprint from MIT CSAIL (not yet on GitHub but related to the 'PropertyNet' project), proposes a two-stage architecture: first, a vision-language model extracts physical properties from an image of the object (e.g., 'this shoe has a rubber sole, a leather upper, and a hard plastic heel'); second, a reasoning module uses these properties to simulate the tool's effectiveness for a given task. The GitHub repository 'affordance-net' (1.2k stars) implements a similar idea for robotic grasping, using a graph neural network to predict grasp affordances from point clouds. However, it has not been extended to creative tool use.
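The two-stage pipeline can be sketched as follows. Both stages are stubbed: a real system would run a vision-language model in stage one and a physics simulator in stage two, so every function, property name, and scoring rule below is a placeholder assumption, not the PropertyNet design.

```python
def extract_properties(object_name):
    """Stage 1 stub: stands in for a VLM reading properties off an image."""
    catalog = {
        "shoe":  {"hardness": 0.8, "mass_kg": 0.4},
        "scarf": {"hardness": 0.1, "mass_kg": 0.2},
        "book":  {"hardness": 0.6, "mass_kg": 1.1},
    }
    return catalog[object_name]

def effectiveness(task, props):
    """Stage 2 stub: a toy scoring rule in place of a physics simulation."""
    if task == "drive a nail":
        # Striking needs both hardness and enough mass behind the blow.
        return props["hardness"] * props["mass_kg"]
    raise ValueError(f"unknown task: {task}")

candidates = ["shoe", "scarf", "book"]
scores = {c: effectiveness("drive a nail", extract_properties(c))
          for c in candidates}
best = max(scores, key=scores.get)
print(best)  # 'book' scores highest under this toy rule
```

The point of the decomposition is that the same stage-1 properties can be reused for any downstream task, whereas a monolithic model must relearn each object-task pairing.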

Another relevant open-source effort is 'ToolEmu' (2.8k stars), which simulates tool use in a virtual environment, but it focuses on conventional tool usage, not creative repurposing. The CreativityBench team has released a small evaluation suite on GitHub (repo: 'creativity-bench', 450 stars) that allows researchers to test their own models.

Technical Takeaway: The path forward requires decoupling object identity from physical properties. Models must learn a compositional representation where 'hardness,' 'shape,' and 'weight' are independent latent variables that can be recombined for novel tasks. This is a fundamentally different learning objective from next-token prediction.

Key Players & Case Studies

Several organizations are already grappling with this challenge, though none have fully solved it.

Google DeepMind has been a leader in affordance reasoning through its work on 'Socratic models' and 'SayCan.' SayCan, a robotic system that combines a language model with a skill library, can understand commands like 'bring me a drink' but fails when asked to 'use a book to prop open a door' because the skill 'prop door with book' is not in its library. DeepMind's latest research, 'AffordanceGPT,' attempts to generate novel skills on the fly by querying a physics simulator, but it remains computationally expensive and slow for real-time use.

MIT CSAIL (Professor Pulkit Agrawal's group) has developed 'PropertyNet,' a neural network that predicts physical properties (mass, friction, elasticity) from a single image. When integrated with a planner, it can suggest creative tool use—e.g., using a frying pan as a hammer. However, the system is still in the lab and has not been deployed on a physical robot.

OpenAI has not publicly addressed affordance reasoning, but its work on 'function calling' in GPT-4o allows the model to invoke external APIs. This is a form of tool use, but it is entirely digital and pre-specified. There is no evidence that OpenAI is working on physical affordance reasoning.

Anthropic has focused on 'constitutional AI' and safety, but its Claude models show slightly better performance on CreativityBench than competitors, likely due to more diverse training data that includes creative writing prompts. However, the improvement is marginal.

Nvidia is a dark horse. Its 'Isaac Sim' platform provides a photorealistic simulation environment where robots can practice tool use. Nvidia researchers have published a paper on 'Sim-to-Real Transfer for Affordance Learning,' but the results are preliminary.

| Organization | Approach | Key Technology | Maturity |
|---|---|---|---|
| Google DeepMind | Physics simulation + LLM | AffordanceGPT | Research prototype |
| MIT CSAIL | Visual property prediction | PropertyNet | Lab stage |
| OpenAI | Function calling (digital only) | GPT-4o API | Production |
| Anthropic | Diverse training data | Claude 3.5 | Production |
| Nvidia | Sim-to-real transfer | Isaac Sim | Research prototype |

Data Takeaway: No major player has a production-ready solution for creative physical tool use. The gap between research prototypes and deployable systems is wide, creating an opportunity for startups.

Industry Impact & Market Dynamics

The inability of AI to creatively use tools has direct commercial consequences, particularly in robotics and autonomous systems.

Robotics: The global robotics market is projected to reach $74 billion by 2027 (source: industry analysts). However, most industrial robots are 'dumb'—they perform repetitive, pre-programmed tasks. The promise of general-purpose robots (e.g., Tesla Optimus, Figure 01) hinges on their ability to adapt to novel situations. If a robot cannot figure out that a shoe can be a hammer, it will fail in unstructured environments like homes or disaster zones. CreativityBench suggests that current AI-powered robots are years away from this capability.

Autonomous Vehicles: Self-driving cars must handle edge cases—e.g., using a traffic cone as a makeshift barrier or a blanket to cover a broken window. Current systems rely on pre-defined object categories and cannot reason about alternative uses. This limits their ability to operate in unpredictable conditions.

Household AI: Products like Amazon Astro or Samsung Ballie are designed to assist with chores. But if they cannot creatively repurpose tools, they will remain novelties. For example, a robot that cannot use a towel to mop up a spill (because a towel is 'for drying') is not truly helpful.

| Market Segment | Current AI Capability | Required for CreativityBench | Revenue Impact if Solved |
|---|---|---|---|
| Industrial Robotics | High (pre-programmed) | Low | $5B incremental |
| Service Robotics | Medium (limited adaptation) | High | $20B incremental |
| Autonomous Vehicles | Low (edge cases) | High | $50B+ (safety) |
| Household AI | Very Low | Critical | $10B incremental |

Data Takeaway: The market value of solving creative tool use is enormous, particularly in service robotics and autonomous vehicles, where adaptability is the key differentiator.

Risks, Limitations & Open Questions

While CreativityBench highlights a genuine weakness, it is not without limitations.

Benchmark Validity: The tasks are human-designed and may not capture all forms of creativity. For instance, a model might solve a problem in a way the benchmark did not anticipate, leading to false negatives. The benchmark's creators have acknowledged this and plan to release an open-ended version.

Safety Concerns: Teaching AI to creatively repurpose tools could backfire. A robot that learns a knife can be used as a screwdriver might also learn it can be used as a weapon. Affordance reasoning must be paired with strong safety constraints.

Computational Cost: Dynamic property inference requires running a vision model, a physics simulator, and a reasoning module for each decision. This is orders of magnitude more expensive than a single forward pass through an LLM. Real-time deployment may be infeasible for years.

The 'Long Tail' Problem: Even if a model can reason about basic affordances (hard, soft, heavy), the space of possible creative uses is infinite. How do we ensure the model generalizes to truly novel situations? This is an open research question.

AINews Verdict & Predictions

CreativityBench is not a minor critique; it is a fundamental indictment of the current AI paradigm. The industry has been chasing scale—bigger models, more data, longer context windows—but this benchmark shows that scale alone cannot imbue a model with the ability to reason about the physical world in a flexible way. The 'intelligence' we measure with MMLU, GSM8K, or HumanEval is largely pattern matching, not genuine understanding.

Prediction 1: Within 18 months, at least one major AI lab (likely DeepMind or a well-funded startup) will release a model that achieves >50% accuracy on CreativityBench by integrating a dedicated affordance reasoning module. This will be seen as a breakthrough.

Prediction 2: The next wave of AI funding will shift from 'scaling laws' to 'reasoning architectures.' Startups that build property inference layers or physics-aware planners will attract significant venture capital.

Prediction 3: Robotics companies will begin to incorporate CreativityBench-like evaluations into their hiring and product roadmaps. The first company to demonstrate a robot that can creatively repurpose tools in a real-world setting will gain a massive competitive advantage.

What to watch: The GitHub repositories 'affordance-net' and 'creativity-bench' for community progress; any publication from DeepMind on 'AffordanceGPT 2.0'; and product announcements from Figure AI or Tesla regarding tool use capabilities.

The era of 'bigger is better' is ending. The era of 'smarter is better' is beginning. CreativityBench has drawn the line in the sand.
