Mercury 2 vs. PinchBench: How Diffusion Models Are Redefining Embodied AI's First Real Test

Hacker News | March 2026
Topics: embodied AI, robotics, AI agents
A new benchmark called PinchBench is pulling AI models out of the chat window and into a simulated 3D world to test their ability to understand, plan, and act. The performance of Mercury 2, a diffusion-based model, on this test signals an important industry turning point: the frontier of AI is now defined by embodied intelligence.

The AI evaluation landscape is undergoing its most consequential evolution since the introduction of benchmarks like MMLU or HumanEval. PinchBench represents a fundamental departure from static knowledge tests, placing large models inside a simulated 3D environment called OpenClaw where success is measured by the ability to complete physical tasks—like manipulating objects—through precise, sequential actions. This shift from passive intelligence to active agency is the central challenge for developing truly useful AI assistants.

The recent evaluation of Mercury 2, a model built on a diffusion architecture, on PinchBench is particularly revealing. Diffusion models, famous for generating coherent images and video, apply a probabilistic, iterative denoising process to the problem of action planning. Instead of predicting a single deterministic sequence of actions, they reason over multiple potential futures, refining a noisy initial plan into a robust trajectory. This approach may offer superior robustness in the messy, uncertain conditions of real-world tasks compared to traditional autoregressive models that predict one step at a time.

The results are a mixed but illuminating bag. Mercury 2 demonstrates notable capabilities in spatial reasoning and multi-step planning within PinchBench's constrained domains, yet it also starkly reveals the immense gap between simulated success and real-world deployment. The benchmark itself, while a necessary and rigorous step, is still a simplified proxy for the staggering complexity of physical interaction. This moment crystallizes the industry's collective bet: the next phase of AI value will be unlocked not by better talkers, but by competent doers, with applications spanning from domestic robotics and advanced manufacturing to autonomous logistics. The race to build reliable embodied AI agents has officially left the starting line, with PinchBench as its first true measuring stick.

Technical Deep Dive

At its core, PinchBench is not merely another dataset; it's an interactive simulation framework that demands integration of three historically separate AI capabilities: natural language understanding, geometric and physical reasoning, and closed-loop control. The benchmark is built atop the OpenClaw environment, which simulates a robotic manipulator with a parallel gripper. Tasks are defined in natural language (e.g., "Stack the red block on the blue cylinder") and require the AI to generate a sequence of low-level motor commands to achieve the goal, all while dealing with physics, partial observability, and potential failure states.
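As a concrete sketch, a PinchBench-style evaluation loop might look like the following. The environment class and its methods here are hypothetical, since the article does not describe OpenClaw's actual API; a mock environment stands in so the structure is runnable.

```python
import random

class MockOpenClaw:
    """Hypothetical stand-in for the simulated parallel-gripper environment."""
    def __init__(self, instruction):
        self.instruction = instruction
        self.t = 0

    def observe(self):
        # A real benchmark would return camera frames and proprioception;
        # here we return a minimal state dict.
        return {"t": self.t, "gripper_open": True}

    def step(self, action):
        # action: a 7-DoF command (x, y, z, roll, pitch, yaw, grip).
        assert len(action) == 7
        self.t += 1
        return self.t >= 5  # toy success condition

def run_episode(policy, instruction, max_steps=50):
    """Closed-loop episode: observe, query the policy, act, repeat."""
    env = MockOpenClaw(instruction)
    for _ in range(max_steps):
        obs = env.observe()
        action = policy(instruction, obs)
        if env.step(action):
            return True
    return False

def random_policy(instruction, obs):
    # Baseline policy; a model like Mercury 2 would replace this.
    return [random.uniform(-1.0, 1.0) for _ in range(7)]

success = run_episode(random_policy, "Stack the red block on the blue cylinder")
```

The key structural point is the closed loop: the policy is re-queried every step against a fresh observation, which is exactly what static question-answering benchmarks never exercise.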

The innovation of applying diffusion models to this domain, as exemplified by Mercury 2, is architectural and philosophical. Traditional approaches in robotics often use hierarchical methods: a high-level planner (often LLM-based) breaks down the task, and a lower-level controller (using reinforcement learning or classical methods) executes it. This pipeline is brittle, as errors compound across levels.

Mercury 2 reframes action planning as a conditional generative modeling problem. Given a goal description and the current state (an observation), the model is trained to generate the optimal sequence of actions. The diffusion process works by:
1. Forward Process: Starting from a clean action trajectory, noise is iteratively added until it becomes pure Gaussian noise.
2. Reverse Process (Inference): The model learns to reverse this—starting from noise and the conditioning inputs (goal, state), it iteratively denoises to produce a coherent action plan. This iterative refinement allows the model to explore and optimize over the action space probabilistically, making it more robust to ambiguous instructions and novel situations.
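The two processes above can be illustrated with a toy 1-D example. The noise schedule, the oracle "denoiser", and all constants below are simplified stand-ins; Mercury 2's actual implementation is not public.

```python
import math
import random

random.seed(0)
T_STEPS = 100   # number of diffusion steps
BETA = 0.02     # fixed per-step noise rate (simplified schedule)

def forward_noise(traj, t):
    """Forward process: blend a clean action trajectory with Gaussian noise."""
    alpha_bar = (1.0 - BETA) ** t  # cumulative signal retained after t steps
    return [math.sqrt(alpha_bar) * a
            + math.sqrt(1.0 - alpha_bar) * random.gauss(0, 1)
            for a in traj]

def reverse_denoise(noisy, goal):
    """Reverse process: iteratively pull a noisy plan toward the conditioning
    target. A trained model would predict the noise; an oracle stands in here."""
    x = list(noisy)
    for _ in range(T_STEPS):
        # Each step removes a small fraction of the estimated noise.
        x = [xi + BETA * (gi - xi) for xi, gi in zip(x, goal)]
    return x

clean_plan = [0.1 * i for i in range(8)]         # 8 waypoints of a joint trajectory
noisy_plan = forward_noise(clean_plan, T_STEPS)  # nearly pure noise
recovered = reverse_denoise(noisy_plan, clean_plan)
```

After 100 forward steps only about 13% of the signal survives, yet the iterative reverse loop recovers a trajectory close to the original, which is the intuition behind refining a noisy initial plan into a robust one.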

This is akin to how Stable Diffusion generates an image by gradually refining noise, but here the "canvas" is a temporal sequence of robot joint angles or end-effector positions. Key technical enablers include the use of vision-language models (VLMs) to encode the scene observation into a latent representation that fuses with the text instruction, and a temporal U-Net architecture that denoises across the time dimension of the action sequence.

Relevant open-source projects are emerging to support this paradigm. The `diffusion_policy` GitHub repository, from researchers at Columbia University and Toyota Research Institute, has gained significant traction (over 2.5k stars). It provides a toolkit for implementing diffusion-based visuomotor policies and has demonstrated state-of-the-art results on real-world robotic manipulation tasks. Another, `OpenVLA` (Open Vision-Language-Action), is an open-source vision-language-action model in the spirit of RT-2, providing a foundation for building embodied models that can be fine-tuned on benchmarks like PinchBench.

| Benchmark Component | What It Tests | Challenge for LLMs/Diffusion Models |
|---|---|---|
| Instruction Parsing | Understanding spatial relationships & object properties from text. | Resolving ambiguity ("the small block" when multiple are small). |
| State Estimation | Interpreting the 3D scene from visual input. | Occlusion, lighting changes, novel objects. |
| Long-Horizon Planning | Decomposing a goal into a valid sequence of sub-goals. | Combinatorial explosion; recovering from mid-sequence failures. |
| Low-Level Control | Generating precise, dynamically feasible motor commands. | Sim-to-real gap; actuator noise and delays. |
| Closed-Loop Adaptation | Reacting to unexpected outcomes (e.g., block slips). | Requires fast re-planning; most models are open-loop. |

Data Takeaway: PinchBench's multi-component design exposes the holistic challenge of embodied AI. Success requires excellence across a chain of capabilities, where failure in any one link—like poor state estimation—dooms the entire task, highlighting why integrated architectures like Mercury 2 are essential.

Key Players & Case Studies

The push toward embodied AI is creating new alliances and competitive fronts. The players can be categorized by their core strengths: foundational model developers, robotics specialists, and integrated agent platforms.

1. Foundational Model Developers Betting on "Action" as a Modality:
* Google's DeepMind: A pioneer with the RT (Robotics Transformer) series. RT-1 and RT-2 demonstrated that transformer models trained on large-scale robotic data can generalize across tasks and embodiments. Their work on Genie, a generative interactive environment model, points toward learning world models for planning. Their strategy is to leverage vast data from academic labs and their own robots to build generally capable "action-foundation models."
* OpenAI: While famously secretive about robotics, their investment in Figure AI and the multi-modal, reasoning capabilities of GPT-4 and o1 suggest an inevitable move into the embodied space. Their strength lies in supreme reasoning and instruction-following, which could be paired with a partner's physical control stack.
* Anthropic: Focused on safety and reliability, their approach to embodied AI would likely be exceptionally cautious. Claude's constitutional AI principles would need translation into physical action constraints, a profound research challenge.

2. Robotics-First Companies & Research Labs:
* Boston Dynamics (now part of Hyundai): Possesses unparalleled hardware and low-level control software (Atlas, Spot). Their challenge is integrating high-level AI reasoning. They are likely consumers of models like Mercury 2, not builders.
* Covariant: Focuses specifically on AI for warehouse robotics. Their RFM (Robotics Foundation Model) is a real-world, production-tested example of a vision-language-action model deployed at scale for parcel manipulation. They represent the "vertical integration" path to success.
* Toyota Research Institute (TRI): Heavily publishes on diffusion policies and large-scale robotic learning. Their Large Behavior Model (LBM) project is a direct parallel to Mercury 2, aiming to create generative models of robot behavior from diverse data.

3. The New Integrated Agent Platforms:
* Sanctuary AI: Developing general-purpose humanoid robots (Phoenix) with a proprietary AI control system, Carbon. They aim to control the full stack from silicon to software, viewing the AI mind and robotic body as inseparable.
* 1X Technologies (formerly Halodi Robotics): Backed by OpenAI, they are deploying humanoid robots (Neo) in logistics and security, serving as a potential real-world testbed for advanced AI models from partners.

| Company/Project | Core Approach | Key Differentiator | Embodiment Focus |
|---|---|---|---|
| DeepMind (RT-2) | Transformer-based VLA Model | Scale of robotic data; integration with Gemini. | Mobile manipulators, general tasks. |
| Covariant RFM | Robotics-First Foundation Model | Deployed in real warehouses; economic ROI proven. | Bin picking, parcel sorting. |
| Mercury 2 (Research) | Diffusion-based Action Planner | Probabilistic planning for robustness. | Simulated & lab robotic arms. |
| Sanctuary AI Carbon | Full-Stack Cognitive Architecture | Tight integration with proprietary humanoid hardware. | General-purpose humanoid tasks. |
| OpenAI (via Figure) | LLM Reasoning + Partner Control | World-class reasoning & language; outsourced embodiment. | Humanoid interaction & manipulation. |

Data Takeaway: The competitive landscape is bifurcating. Some, like Covariant and Sanctuary, pursue deep vertical integration for specific commercial domains. Others, like Google and OpenAI, are building horizontal "brain" platforms intended to work across many potential "bodies." The success of either strategy hinges on who first solves the reliability gap highlighted by PinchBench.

Industry Impact & Market Dynamics

The successful translation of models like Mercury 2 from PinchBench scores to real-world utility will trigger massive economic shifts. The embodied AI market is currently a high-risk R&D arena but is projected to become the primary engine of physical automation.

Immediate Applications (3-5 Year Horizon):
1. Logistics and Warehousing: The ripest sector. Tasks are structured, environments can be partially controlled, and the economic incentive is enormous. Companies like Covariant and Boston Dynamics are already here. An AI that reliably picks and places millions of unique items would revolutionize e-commerce fulfillment.
2. Advanced Manufacturing: Assembly, inspection, and quality control in electronics and automotive. This requires higher precision and handling of complexity than current industrial robots allow.
3. Laboratory Automation: Life sciences labs involve repetitive, precise liquid handling and instrumentation tasks—an ideal domain for early embodied AI.

Long-Term Vision (5-10+ Years):
1. Domestic and Service Robotics: The ultimate challenge. Unstructured homes, vast task variety, and high safety requirements make this the final frontier. Success here depends on models that can learn continuously from minimal human feedback.
2. Healthcare and Elder Care: Assistance with mobility, fetch-and-carry tasks, and monitoring. This requires not just physical competence but extraordinary social intelligence and trustworthiness.

Funding is flooding into the space. Figure AI's $675 million Series B round at a $2.6 billion valuation in early 2024 is a bellwether. Similarly, 1X's $100 million Series B (with earlier backing from OpenAI's startup fund) signals the strategic bets being placed. The market is anticipating a platform shift.

| Market Segment | Estimated Addressable Market (2030) | Key Growth Driver | Major Barrier |
|---|---|---|---|
| Logistics Robotics | $45 - $60 Billion | E-commerce growth; labor shortages. | Handling extreme item diversity & packaging. |
| Service & Domestic Robots | $30 - $40 Billion | Aging populations; demand for convenience. | Cost, safety, and generalized task ability. |
| Healthcare Assistance | $15 - $25 Billion | Rising care needs; clinician shortage. | Regulatory approval; human-robot interaction safety. |
| AI Agent Software (Platform) | $20 - $30 Billion | Demand for "brains" across robot brands. | Standardization of interfaces & benchmarks. |

Data Takeaway: The logistics sector will be the first major revenue battleground, funding the R&D needed for more complex domains. The platform play—selling the AI "driver"—could eventually dwarf the market for any single robot type, but only if true interoperability and generalization are achieved.

Risks, Limitations & Open Questions

The path from PinchBench to pervasive embodied AI is fraught with technical, ethical, and societal hurdles.

Technical Limitations:
* The Sim-to-Real Chasm: PinchBench is a simulation. The physics, visuals, and action fidelity are approximations. A model scoring 95% in OpenClaw may score 5% on a real robot due to unmodeled friction, sensor noise, or calibration errors. Bridging this gap requires massive real-world data collection or advanced domain randomization techniques, both expensive.
* Catastrophic Forgetting & Adaptation: An agent trained on thousands of PinchBench-like tasks may fail spectacularly when faced with a simple but novel real-world requirement (e.g., opening a stiff drawer). The ability to learn online from a handful of demonstrations—few-shot adaptation—remains a core unsolved problem.
* Compositional Reasoning Limits: While Mercury 2 can plan to "stack the block," can it understand and execute "tidy the living room, putting toys in the box and books on the shelf, but leave the newspaper on the coffee table"? This level of abstract, conditional, and context-aware planning is beyond current benchmarks.
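The domain-randomization idea mentioned above can be sketched simply: vary the simulator's physical parameters every training episode so the policy cannot overfit a single configuration. The parameter names and ranges below are illustrative, not OpenClaw's actual settings.

```python
import random

# Illustrative randomization ranges; real systems randomize dozens of
# parameters (textures, lighting, camera pose, dynamics, latencies).
RANDOMIZATION_RANGES = {
    "friction": (0.4, 1.2),        # surface friction coefficient
    "object_mass_kg": (0.05, 0.5),
    "sensor_noise_std": (0.0, 0.02),
    "action_delay_steps": (0, 3),  # integer latency in simulation steps
}

def sample_domain(rng=random):
    """Draw one randomized simulator configuration for a training episode."""
    cfg = {}
    for name, (lo, hi) in RANDOMIZATION_RANGES.items():
        if isinstance(lo, int) and isinstance(hi, int):
            cfg[name] = rng.randint(lo, hi)  # inclusive integer range
        else:
            cfg[name] = rng.uniform(lo, hi)
    return cfg

episode_cfg = sample_domain()
```

A policy trained across many such samples is forced to be robust to the parameter spread, which is the whole mechanism by which randomization narrows the sim-to-real gap.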

Ethical & Societal Risks:
* Unpredictable Failure Modes: A conversational AI hallucinates text. An embodied AI hallucinates actions. The physical consequences of an unexpected action sequence—a collision, breakage, or injury—are immediate and tangible. Guaranteeing safety is a control theory and verification nightmare.
* Mass Displacement of Physical Labor: The automation potential is far greater than that of clerical AI. While new jobs will be created, the transition for millions in manufacturing, warehousing, and driving could be severely disruptive, requiring unprecedented social policy.
* Autonomy and Agency: At what capability threshold does an embodied AI shift from being a "tool" to an "agent" with perceived autonomy? This blurs lines of responsibility and raises profound questions about control and oversight.

Open Research Questions:
1. World Models: Do we need explicitly learned, internal models of physics and cause-and-effect (like DeepMind's Genie) for robust planning, or can end-to-end models like Mercury 2 implicitly capture this?
2. Data Scaling Laws: Will performance in embodied tasks scale predictably with model size and training data, as seen in LLMs, or is there a different, more complex relationship?
3. Benchmark Evolution: PinchBench is a start. The community urgently needs a suite of increasingly complex benchmarks that include human interaction, long-horizon tasks, and real-world deployment metrics.

AINews Verdict & Predictions

The PinchBench benchmark and the performance of models like Mercury 2 represent the end of the beginning for embodied AI. For years, the field promised intelligent agents but delivered mostly intelligent chatterboxes. Now, the goalposts have definitively moved. The industry's focus, talent, and capital are aligning on the hard problem of action.

Our editorial judgment is that diffusion-based or similar generative approaches to action planning will become the dominant architectural paradigm for high-level robot control within two years. Their inherent robustness and capacity for multi-hypothesis reasoning are better suited to uncertainty than deterministic autoregressive chains. However, they will not operate alone; they will sit atop faster, reactive low-level controllers (often learned via reinforcement learning) in a hybrid hierarchy.

Specific Predictions:
1. By end of 2026, a major AI lab (likely DeepMind or an OpenAI-Figure collaboration) will announce an embodied model that achieves super-human performance on a broad suite of simulated manipulation benchmarks, including a next-generation PinchBench. This model will be multimodal, accepting both text and video demonstrations as input.
2. The first "killer app" for general-purpose embodied AI will emerge in logistics by 2027. It will not be a humanoid, but a mobile manipulator that can handle >80% of items in a major retailer's fulfillment center without re-programming, cutting per-unit handling costs by over 40% compared to current automated systems.
3. A significant safety incident involving a learning-based embodied AI in an industrial setting will occur within 3 years, leading to calls for—and the eventual creation of—a new regulatory framework for "certifiable learning systems," akin to aviation safety but for adaptive software.
4. The open-source community will lag in embodied AI compared to LLMs. The barrier of real-world data and hardware access is too high. The most advanced models and benchmarks will remain largely within well-funded corporations and academic labs, though frameworks like `diffusion_policy` will enable valuable applied research.

What to Watch Next: Monitor the release of PinchBench v2 or its successors. Look for the inclusion of dynamic elements (moving objects, other agents), deformable objects (cloths, liquids), and human-in-the-loop tasks. The moment these benchmarks include a "real-world transfer" score from simulation to a standardized physical testbed, the race will enter its most critical phase. Additionally, watch for partnerships between AI labs and major manufacturers (e.g., Tesla Optimus integrating a third-party "brain") as a sign of which horizontal platform strategy is gaining traction.

The journey from a language model that can describe how to make a cup of coffee to an embodied agent that can physically do it is the defining AI challenge of this decade. PinchBench and Mercury 2 have given us the first clear map of that treacherous, transformative terrain.
