Technical Deep Dive
At its core, PinchBench is not merely another dataset; it's an interactive simulation framework that demands integration of three historically separate AI capabilities: natural language understanding, geometric and physical reasoning, and closed-loop control. The benchmark is built atop the OpenClaw environment, which simulates a robotic manipulator with a parallel gripper. Tasks are defined in natural language (e.g., "Stack the red block on the blue cylinder") and require the AI to generate a sequence of low-level motor commands to achieve the goal, all while dealing with physics, partial observability, and potential failure states.
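The observe-plan-act loop described above can be sketched in a few lines. OpenClaw's actual API is not documented here, so every name below (the environment class, observation fields, the greedy stand-in policy) is an illustrative assumption, not the benchmark's real interface:

```python
from dataclasses import dataclass

# Hypothetical sketch of a PinchBench-style closed loop. The real OpenClaw
# API is not shown in this article, so all names here are assumptions.

@dataclass
class Observation:
    gripper_xyz: tuple   # end-effector position
    object_poses: dict   # object name -> (x, y, z)
    instruction: str     # natural-language goal

class ToyEnv:
    """Stand-in for an OpenClaw task: move the gripper onto the red block."""
    def __init__(self):
        self.gripper = [0.0, 0.0, 0.5]
        self.target = [0.3, -0.2, 0.1]

    def observe(self) -> Observation:
        return Observation(tuple(self.gripper),
                           {"red_block": tuple(self.target)},
                           "Pick up the red block")

    def step(self, delta):
        # Apply a bounded end-effector displacement (crude "actuation limit").
        for i in range(3):
            self.gripper[i] += max(-0.05, min(0.05, delta[i]))

def greedy_policy(obs: Observation):
    # Placeholder for a learned model: move straight toward the block.
    goal = obs.object_poses["red_block"]
    return [g - p for g, p in zip(goal, obs.gripper_xyz)]

env = ToyEnv()
for _ in range(50):                  # closed loop: re-observe every step
    obs = env.observe()
    env.step(greedy_policy(obs))

final = env.observe()
dist = sum((a - b) ** 2 for a, b in
           zip(final.gripper_xyz, final.object_poses["red_block"])) ** 0.5
print(f"final distance to block: {dist:.3f}")
```

The point of the sketch is the loop structure, not the policy: PinchBench replaces `greedy_policy` with a model that must parse the instruction, estimate state from pixels, and survive physics.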
The innovation of applying diffusion models to this domain, as exemplified by Mercury 2, is architectural and philosophical. Traditional approaches in robotics often use hierarchical methods: a high-level planner (often LLM-based) breaks down the task, and a lower-level controller (using reinforcement learning or classical methods) executes it. This pipeline is brittle, as errors compound across levels.
Mercury 2 reframes action planning as a conditional generative modeling problem. Given a goal description and the current state (an observation), the model is trained to generate the optimal sequence of actions. The diffusion process works by:
1. Forward Process: Starting from a clean action trajectory, noise is iteratively added until it becomes pure Gaussian noise.
2. Reverse Process (Inference): The model learns to reverse this—starting from noise and the conditioning inputs (goal, state), it iteratively denoises to produce a coherent action plan. This iterative refinement allows the model to explore and optimize over the action space probabilistically, making it more robust to ambiguous instructions and novel situations.
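The two processes above can be demonstrated on a toy 1-D "action trajectory" (say, a gripper height profile). In place of a trained denoising network we use an oracle that knows the clean trajectory, which isolates the mechanics of the forward and reverse updates; the schedule values are illustrative, not Mercury 2's:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # diffusion steps
betas = np.linspace(1e-4, 0.2, T)        # illustrative noise schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)                # cumulative products

x0 = np.sin(np.linspace(0, np.pi, 16))   # clean 16-step action sequence

def forward(x0, t, eps):
    """q(x_t | x_0): closed-form noising of the trajectory at step t."""
    return np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps

def oracle_eps(xt, t):
    """Stand-in for the learned denoiser: recovers the noise exactly."""
    return (xt - np.sqrt(abar[t]) * x0) / np.sqrt(1 - abar[t])

# Reverse process: start from pure Gaussian noise and denoise step by step.
x = rng.standard_normal(16)
for t in reversed(range(T)):
    eps_hat = oracle_eps(x, t)
    mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(16) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

err = np.abs(x - x0).max()
print(f"max deviation from clean trajectory: {err:.4f}")
```

In a real system the oracle is replaced by a network conditioned on the goal text and scene encoding, and the recovered sequence is executed on the robot.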
This is akin to how Stable Diffusion generates an image by gradually refining noise, but here the "canvas" is a temporal sequence of robot joint angles or end-effector positions. Key technical enablers include the use of vision-language models (VLMs) to encode the scene observation into a latent representation that fuses with the text instruction, and a temporal U-Net architecture that denoises across the time dimension of the action sequence.
Relevant open-source projects are emerging to support this paradigm. The `diffusion_policy` GitHub repository from researchers at Columbia University and the Toyota Research Institute has gained significant traction (over 2.5k stars). It provides a toolkit for implementing diffusion-based visuomotor policies, demonstrating state-of-the-art results on real-world robotic manipulation tasks. Another, `OpenVLA` (Open Vision-Language-Action), is an open-source reproduction of models like RT-2, providing a foundation for building embodied models that can be fine-tuned on benchmarks like PinchBench.
| Benchmark Component | What It Tests | Challenge for LLMs/Diffusion Models |
|---|---|---|
| Instruction Parsing | Understanding spatial relationships & object properties from text. | Resolving ambiguity ("the small block" when multiple are small). |
| State Estimation | Interpreting the 3D scene from visual input. | Occlusion, lighting changes, novel objects. |
| Long-Horizon Planning | Decomposing a goal into a valid sequence of sub-goals. | Combinatorial explosion; recovering from mid-sequence failures. |
| Low-Level Control | Generating precise, dynamically feasible motor commands. | Sim-to-real gap; actuator noise and delays. |
| Closed-Loop Adaptation | Reacting to unexpected outcomes (e.g., block slips). | Requires fast re-planning; most models are open-loop. |
Data Takeaway: PinchBench's multi-component design exposes the holistic challenge of embodied AI. Success requires excellence across a chain of capabilities, where failure in any one link—like poor state estimation—dooms the entire task, highlighting why integrated architectures like Mercury 2 are essential.
Key Players & Case Studies
The push toward embodied AI is creating new alliances and competitive fronts. The players can be categorized by their core strengths: foundational model developers, robotics specialists, and integrated agent platforms.
1. Foundational Model Developers Betting on "Action" as a Modality:
* Google DeepMind: A pioneer with the RT (Robotics Transformer) series. RT-1 and RT-2 demonstrated that transformer models trained on large-scale robotic data can generalize across tasks and embodiments. Their work on Genie, a generative interactive environment model, points toward learning world models for planning. Their strategy is to leverage vast data from academic labs and their own robots to build generally capable "action-foundation models."
* OpenAI: While famously secretive about robotics, their investment in Figure AI and the multimodal reasoning capabilities of GPT-4 and o1 suggest an inevitable move into the embodied space. Their strength lies in supreme reasoning and instruction-following, which could be paired with a partner's physical control stack.
* Anthropic: Focused on safety and reliability, their approach to embodied AI would likely be exceptionally cautious. Claude's constitutional AI principles would need translation into physical action constraints, a profound research challenge.
2. Robotics-First Companies & Research Labs:
* Boston Dynamics (now part of Hyundai): Possesses unparalleled hardware and low-level control software (Atlas, Spot). Their challenge is integrating high-level AI reasoning. They are likely consumers of models like Mercury 2, not builders.
* Covariant: Focuses specifically on AI for warehouse robotics. Their RFM (Robotics Foundation Model) is a real-world, production-tested example of a vision-language-action model deployed at scale for parcel manipulation. They represent the "vertical integration" path to success.
* Toyota Research Institute (TRI): Heavily publishes on diffusion policies and large-scale robotic learning. Their Large Behavior Model (LBM) project is a direct parallel to Mercury 2, aiming to create generative models of robot behavior from diverse data.
3. The New Integrated Agent Platforms:
* Sanctuary AI: Developing general-purpose humanoid robots (Phoenix) with a proprietary AI control system, Carbon. They aim to control the full stack from silicon to software, viewing the AI mind and robotic body as inseparable.
* 1X Technologies (formerly Halodi Robotics): Backed by OpenAI, they are deploying humanoid robots (Neo) in logistics and security, serving as a potential real-world testbed for advanced AI models from partners.
| Company/Project | Core Approach | Key Differentiator | Embodiment Focus |
|---|---|---|---|
| DeepMind (RT-2) | Transformer-based VLA Model | Scale of robotic data; integration with Gemini. | Mobile manipulators, general tasks. |
| Covariant RFM | Robotics-First Foundation Model | Deployed in real warehouses; economic ROI proven. | Bin picking, parcel sorting. |
| Mercury 2 (Research) | Diffusion-based Action Planner | Probabilistic planning for robustness. | Simulated & lab robotic arms. |
| Sanctuary AI Carbon | Full-Stack Cognitive Architecture | Tight integration with proprietary humanoid hardware. | General-purpose humanoid tasks. |
| OpenAI (via Figure) | LLM Reasoning + Partner Control | World-class reasoning & language; outsourced embodiment. | Humanoid interaction & manipulation. |
Data Takeaway: The competitive landscape is bifurcating. Some, like Covariant and Sanctuary, pursue deep vertical integration for specific commercial domains. Others, like Google and OpenAI, are building horizontal "brain" platforms intended to work across many potential "bodies." The success of either strategy hinges on who first solves the reliability gap highlighted by PinchBench.
Industry Impact & Market Dynamics
The successful translation of models like Mercury 2 from PinchBench scores to real-world utility will trigger massive economic shifts. The embodied AI market is currently a high-risk R&D arena but is projected to become the primary engine of physical automation.
Immediate Applications (3-5 Year Horizon):
1. Logistics and Warehousing: The ripest sector. Tasks are structured, environments can be partially controlled, and the economic incentive is enormous. Companies like Covariant and Boston Dynamics are already here. An AI that reliably picks and places millions of unique items would revolutionize e-commerce fulfillment.
2. Advanced Manufacturing: Assembly, inspection, and quality control in electronics and automotive. This demands higher precision and greater flexibility than current industrial robots allow.
3. Laboratory Automation: Life sciences labs involve repetitive, precise liquid handling and instrumentation tasks—an ideal domain for early embodied AI.
Long-Term Vision (5-10+ Years):
1. Domestic and Service Robotics: The ultimate challenge. Unstructured homes, vast task variety, and high safety requirements make this the final frontier. Success here depends on models that can learn continuously from minimal human feedback.
2. Healthcare and Elder Care: Assistance with mobility, fetch-and-carry tasks, and monitoring. This requires not just physical competence but extraordinary social intelligence and trustworthiness.
Funding is flooding into the space. Figure AI's $675 million Series B round at a $2.6 billion valuation in early 2024 is a bellwether. Similarly, 1X's $100 million Series B, which followed an earlier round led by the OpenAI Startup Fund, signals strategic bets being placed. The market is anticipating a platform shift.
| Market Segment | Estimated Addressable Market (2030) | Key Growth Driver | Major Barrier |
|---|---|---|---|
| Logistics Robotics | $45 - $60 Billion | E-commerce growth; labor shortages. | Handling extreme item diversity & packaging. |
| Service & Domestic Robots | $30 - $40 Billion | Aging populations; demand for convenience. | Cost, safety, and generalized task ability. |
| Healthcare Assistance | $15 - $25 Billion | Rising care needs; clinician shortage. | Regulatory approval; human-robot interaction safety. |
| AI Agent Software (Platform) | $20 - $30 Billion | Demand for "brains" across robot brands. | Standardization of interfaces & benchmarks. |
Data Takeaway: The logistics sector will be the first major revenue battleground, funding the R&D needed for more complex domains. The platform play—selling the AI "driver"—could eventually dwarf the market for any single robot type, but only if true interoperability and generalization are achieved.
Risks, Limitations & Open Questions
The path from PinchBench to pervasive embodied AI is fraught with technical, ethical, and societal hurdles.
Technical Limitations:
* The Sim-to-Real Chasm: PinchBench is a simulation. The physics, visuals, and action fidelity are approximations. A model scoring 95% in OpenClaw may score 5% on a real robot due to unmodeled friction, sensor noise, or calibration errors. Bridging this gap requires massive real-world data collection or advanced domain randomization techniques, both expensive.
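Domain randomization, one of the mitigations named above, is simple to state: every training episode draws its physics and sensing parameters from broad ranges, so the policy cannot overfit one simulator configuration. The parameter names and ranges below are illustrative assumptions, not OpenClaw settings:

```python
import random

def sample_domain(rng: random.Random) -> dict:
    """Draw one randomized simulator configuration (illustrative ranges)."""
    return {
        "friction":         rng.uniform(0.3, 1.2),   # table/object friction
        "object_mass_kg":   rng.uniform(0.05, 0.5),
        "sensor_noise_std": rng.uniform(0.0, 0.02),  # camera/proprioception
        "action_delay_ms":  rng.choice([0, 10, 20, 40]),
        "light_intensity":  rng.uniform(0.5, 1.5),
    }

rng = random.Random(42)
episodes = [sample_domain(rng) for _ in range(1000)]

# A policy trained across this distribution sees, e.g., friction spanning a
# 4x range; the real robot's (unknown) friction is likely to fall inside it.
frictions = [e["friction"] for e in episodes]
print(min(frictions), max(frictions))
```

The cost mentioned in the text shows up here as breadth: wider ranges make transfer more likely but make the training problem harder, so each episode is worth less.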
* Catastrophic Forgetting & Adaptation: An agent trained on thousands of PinchBench-like tasks may fail spectacularly when faced with a simple but novel real-world requirement (e.g., opening a stiff drawer). The ability to learn online from a handful of demonstrations—few-shot adaptation—remains a core unsolved problem.
* Compositional Reasoning Limits: While Mercury 2 can plan to "stack the block," can it understand and execute "tidy the living room, putting toys in the box and books on the shelf, but leave the newspaper on the coffee table"? This level of abstract, conditional, and context-aware planning is beyond current benchmarks.
Ethical & Societal Risks:
* Unpredictable Failure Modes: A conversational AI hallucinates text. An embodied AI hallucinates actions. The physical consequences of an unexpected action sequence—a collision, breakage, or injury—are immediate and tangible. Guaranteeing safety is a control theory and verification nightmare.
* Mass Displacement of Physical Labor: The automation potential is far greater than that of clerical AI. While new jobs will be created, the transition for millions in manufacturing, warehousing, and driving could be severely disruptive, requiring unprecedented social policy.
* Autonomy and Agency: At what capability threshold does an embodied AI shift from being a "tool" to an "agent" with perceived autonomy? This blurs lines of responsibility and raises profound questions about control and oversight.
Open Research Questions:
1. World Models: Do we need explicitly learned, internal models of physics and cause-and-effect (like DeepMind's Genie) for robust planning, or can end-to-end models like Mercury 2 implicitly capture this?
2. Data Scaling Laws: Will performance in embodied tasks scale predictably with model size and training data, as seen in LLMs, or is there a different, more complex relationship?
3. Benchmark Evolution: PinchBench is a start. The community urgently needs a suite of increasingly complex benchmarks that include human interaction, long-horizon tasks, and real-world deployment metrics.
AINews Verdict & Predictions
The PinchBench benchmark and the performance of models like Mercury 2 represent the end of the beginning for embodied AI. For years, the field promised intelligent agents but delivered mostly intelligent chatterboxes. Now, the goalposts have definitively moved. The industry's focus, talent, and capital are aligning on the hard problem of action.
Our editorial judgment is that diffusion-based or similar generative approaches to action planning will become the dominant architectural paradigm for high-level robot control within two years. Their inherent robustness and capacity for multi-hypothesis reasoning are better suited to uncertainty than deterministic autoregressive chains. However, they will not operate alone; they will sit atop faster, reactive low-level controllers (often learned via reinforcement learning) in a hybrid hierarchy.
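The hybrid hierarchy described above can be sketched as two tiers running at different rates: a slow generative planner proposes waypoints, and a fast reactive controller tracks them. Here a straight-line interpolator stands in for the diffusion planner and a proportional law stands in for a learned RL controller; the rates and gain are illustrative assumptions:

```python
def plan_waypoints(start, goal, n=5):
    """Slow 'planner' tier: straight-line waypoints (diffusion stand-in)."""
    return [[s + (g - s) * k / n for s, g in zip(start, goal)]
            for k in range(1, n + 1)]

def track(pos, waypoint, gain=0.5):
    """Fast 'controller' tier: one proportional step toward the waypoint."""
    return [p + gain * (w - p) for p, w in zip(pos, waypoint)]

pos, goal = [0.0, 0.0], [1.0, 0.4]
for wp in plan_waypoints(pos, goal):   # planner output, consumed in order
    for _ in range(20):                # controller runs 20x faster per waypoint
        pos = track(pos, wp)

err = max(abs(p - g) for p, g in zip(pos, goal))
print(f"tracking error: {err:.4f}")
```

The design point is the division of labor: the slow tier can afford iterative, probabilistic refinement because the fast tier absorbs disturbances between plan updates.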
Specific Predictions:
1. By end of 2025, a major AI lab (likely DeepMind or an OpenAI-Figure collaboration) will announce an embodied model that achieves super-human performance on a broad suite of simulated manipulation benchmarks, including a next-generation PinchBench. This model will be multimodal, accepting both text and video demonstrations as input.
2. The first "killer app" for general-purpose embodied AI will emerge in logistics by 2027. It will not be a humanoid, but a mobile manipulator that can handle >80% of items in a major retailer's fulfillment center without re-programming, cutting per-unit fulfillment costs by over 40% compared to current automated systems.
3. A significant safety incident involving a learning-based embodied AI in an industrial setting will occur within 3 years, leading to calls for—and the eventual creation of—a new regulatory framework for "certifiable learning systems," akin to aviation safety but for adaptive software.
4. The open-source community will lag in embodied AI compared to LLMs. The barrier of real-world data and hardware access is too high. The most advanced models and benchmarks will remain largely within well-funded corporations and academia, though frameworks like `diffusion_policy` will enable valuable applied research.
What to Watch Next: Monitor the release of PinchBench v2 or its successors. Look for the inclusion of dynamic elements (moving objects, other agents), deformable objects (cloths, liquids), and human-in-the-loop tasks. The moment these benchmarks include a "real-world transfer" score from simulation to a standardized physical testbed, the race will enter its most critical phase. Additionally, watch for partnerships between AI labs and major manufacturers (e.g., Tesla Optimus integrating a third-party "brain") as a sign of which horizontal platform strategy is gaining traction.
The journey from a language model that can describe how to make a cup of coffee to an embodied agent that can physically do it is the defining AI challenge of this decade. PinchBench and Mercury 2 have given us the first clear map of that treacherous, transformative terrain.