Technical Deep Dive
Shen Yujun's critique cuts to the core of why current embodied AI approaches feel brittle. Let's dissect the technical limitations of VLA and world models, then examine the proposed physical native model architecture.
VLA Models: The Language Ceiling
VLA models, exemplified by Google's RT-2 and the open-source OpenVLA, concatenate vision tokens (from a frozen or fine-tuned vision encoder like SigLIP) with language tokens (from a pretrained LLM) and action tokens (discretized or continuous motor commands). The model is trained end-to-end on datasets of (image, instruction, action) tuples. The problem is that language is a lossy compression of physical reality. Consider a task like 'insert the peg into the hole with 0.3 N of force.' A human can say the number, but nobody can reproduce 0.3 N from the words alone; we learn it through haptic feedback. VLA models, by relying on language as the semantic bridge, inherit this lossiness. They can follow high-level instructions ('pick up the cup') but fail at precision assembly or compliant manipulation, where force profiles matter more than visual appearance. A 2024 study from Stanford's IRIS lab showed that RT-2's success rate dropped from 87% on pick-and-place tasks to 34% on peg-in-hole tasks requiring force sensing, a drop of 53 percentage points.
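To make the cost of discretized action tokens concrete, here is a minimal sketch of uniform binning applied to a force command. The 256-bin count, the force range of -20 N to +20 N, and the function names are illustrative assumptions, not details taken from any specific VLA implementation.

```python
def discretize(value, low, high, bins=256):
    """Map a continuous command in [low, high] to an integer token in [0, bins-1]."""
    clipped = min(max(value, low), high)
    # Scale to [0, bins-1] and round to the nearest bin index.
    return round((clipped - low) / (high - low) * (bins - 1))

def undiscretize(token, low, high, bins=256):
    """Map an integer token back to the continuous value that bin represents."""
    return low + token / (bins - 1) * (high - low)

# Hypothetical force axis spanning -20 N to +20 N (assumption for illustration).
LOW, HIGH = -20.0, 20.0
target_force = 0.3  # the 0.3 N target from the peg-in-hole example

token = discretize(target_force, LOW, HIGH)
recovered = undiscretize(token, LOW, HIGH)
resolution = (HIGH - LOW) / 255  # spacing between adjacent bin values

print(f"token={token} recovered={recovered:.3f} N resolution={resolution:.3f} N")
```

Under these assumed numbers the spacing between adjacent tokens is about 0.16 N, so a 0.3 N target sits only two bins away from zero force. A narrower range or more bins would shrink the error, at the cost of clipping large forces or enlarging the vocabulary.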
World Models: The Simulation Paradox
World models, such as those from DeepMind (DreamerV3) or UC Berkeley's DayDreamer, attempt to learn a latent dynamics model of the environment: given state and action, predict the next state. In simulation, they achieve remarkable sample efficiency. But transferring to the real world exposes the 'simulation paradox': to be useful, the world model must be accurate enough to predict the consequences of actions, but the real world is full of unmodeled physics: stiction, plastic deformation, thermal expansion, sensor noise. Closing that gap demands far more parameters and data, and a model trained mostly in simulation tends to overfit to the simulator's artifacts instead. For example, a world model trained in MuJoCo might learn that a cube always slides with a constant friction coefficient; in reality, friction varies with humidity, surface wear, and contact angle. When the real coefficient drifts away from the learned one, the model's predictions can fail catastrophically. Shen's point is that world models are trying to simulate the entire physical universe internally, which is both unnecessary and impossible for real-time control.
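The friction example can be put in numbers with a short sketch. It uses the textbook Coulomb-friction stopping distance, d = v0^2 / (2 * mu * g), as a stand-in for a learned dynamics model; the initial speed and the friction coefficients are hypothetical values chosen for illustration.

```python
G = 9.81  # gravitational acceleration, m/s^2

def stopping_distance(v0, mu):
    """Distance a block slides on a flat surface before Coulomb friction stops it."""
    return v0 ** 2 / (2.0 * mu * G)

v0 = 1.0          # initial sliding speed in m/s (illustrative)
mu_model = 0.50   # constant coefficient the model absorbed from simulation (assumed)

for mu_real in (0.50, 0.40, 0.30):  # hypothetical real-world variation
    predicted = stopping_distance(v0, mu_model)
    actual = stopping_distance(v0, mu_real)
    error_pct = 100.0 * (actual - predicted) / actual
    print(f"mu_real={mu_real:.2f} predicted={predicted:.3f} m "
          f"actual={actual:.3f} m underestimate={error_pct:.0f}%")
```

Because distance scales with 1/mu, the relative error equals 1 - mu_real/mu_model, so a surface that is 40% slipperier than the simulated one leaves the prediction 40% short.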
Physical Native Model: Architecture and Tokens
The physical native model (PNM) proposed by Shen operates on a fundamentally different token space. Instead of pixels or words, the input tokens are streams of physical quantities: 6-axis force/torque readings, joint encoder positions and velocities, inertial measurement unit (IMU) data, and proprioceptive signals. The output tokens are motor torques or position commands. The model is a transformer or state-space model (e.g., Mamba) that learns a policy directly in this physical token space, without any semantic or visual embedding. The training paradigm is 'physical self-supervised learning': the robot explores its environment through random motor babbling, and the model learns to predict the next physical state given the current state and action. This is analogous to how a human infant learns object permanence and affordances through tactile exploration, not through language labels. A key insight is that PNM does not need to 'understand' that an object is a 'cup'; it only needs to learn the force-torque signature of grasping a rigid, concave object. This makes the model inherently robust to visual appearance changes—a cup painted red or blue has the same physical signature.
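The 'motor babbling plus next-state prediction' loop described above can be sketched on a toy system. This is not Shen's architecture: a linear least-squares fit is swapped in for the transformer or state-space model, and the one-joint spring-damper plant, its constants, and all function names are assumptions made for illustration.

```python
import random

DT, MASS, STIFFNESS, DAMPING = 0.01, 1.0, 20.0, 0.5  # hypothetical 1-DoF plant

def step(x, v, u):
    """Ground-truth physics (explicit Euler); the learner only sees its outputs."""
    accel = (u - STIFFNESS * x - DAMPING * v) / MASS
    return x + DT * v, v + DT * accel

def babble(n_steps, seed=0):
    """Collect ((position, velocity, torque), next velocity) samples from random torques."""
    rng = random.Random(seed)
    x, v, data = 0.0, 0.0, []
    for _ in range(n_steps):
        u = rng.uniform(-1.0, 1.0)
        nx, nv = step(x, v, u)
        data.append(((x, v, u), nv))
        x, v = nx, nv
    return data

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * 3
    for r in (2, 1, 0):
        sol[r] = (m[r][3] - sum(m[r][c] * sol[c] for c in range(r + 1, 3))) / m[r][r]
    return sol

def fit(data):
    """Least-squares fit of next velocity as a linear function of (x, v, u)."""
    ata = [[sum(f[i] * f[j] for f, _ in data) for j in range(3)] for i in range(3)]
    aty = [sum(f[i] * y for f, y in data) for i in range(3)]
    return solve3(ata, aty)

weights = fit(babble(2000))
print([round(w, 4) for w in weights])
```

The fitted weights encode stiffness, damping, and inertia without any label ever naming them, which is the sense in which the model learns a physical signature. A real robot adds noise, contact discontinuities, and many more channels, which is where a sequence model would replace the linear fit.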
Relevant Open-Source Efforts
While Shen's team at Lingbo has not released a public repository, the closest open-source analog is the DROID dataset and policy from Google DeepMind and UC Berkeley, which focuses on large-scale robot manipulation data. However, DROID still uses vision as primary input. More relevant is the MuJoCo MPC (Model Predictive Control) framework, which uses physics simulation for real-time control but is not a learned model. A nascent project called Physion (github.com/physion/physion) attempts to learn physical dynamics from video, but it is still vision-centric. The community is watching for a potential open-source release from Lingbo, which would accelerate the field.
| Model Type | Input Modality | Output | Real-World Success Rate (Peg-in-Hole) | Training Data Requirements | Inference Latency |
|---|---|---|---|---|---|
| VLA (RT-2) | Image + Text | Discretized actions | 34% | 100k+ (image, text, action) tuples | 300 ms |
| World Model (DreamerV3) | Image | Latent state + action | 28% | 500k simulated steps | 500 ms (with planning) |
| Physical Native (Proposed) | Force/Torque/Proprioception | Continuous torques | N/A (simulation only) | 10k physical interactions | <10 ms |
Data Takeaway: The table reveals a stark trade-off. VLA and world models require large datasets and incur high latency, yet still fail on precision tasks. The physical native model promises roughly 10x less data than the VLA baseline and at least 30x lower latency, but it has only been demonstrated in simulation, so its real-world peg-in-hole success rate is unknown. The real test is whether it can scale to complex, multi-step tasks without semantic grounding.
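The latency column translates directly into how often each approach can update its command. The sketch below assumes one inference per control step with no pipelining, which is a simplification, and treats the physical native model's '<10 ms' as exactly 10 ms.

```python
# Latency figures copied from the comparison table; 10 ms is the table's upper bound.
latency_ms = {
    "VLA (RT-2)": 300.0,
    "World Model (DreamerV3)": 500.0,
    "Physical Native (Proposed)": 10.0,
}

def max_control_rate_hz(latency):
    """Upper bound on closed-loop update rate if each command waits for one inference."""
    return 1000.0 / latency

rates = {name: max_control_rate_hz(ms) for name, ms in latency_ms.items()}
for name, hz in rates.items():
    print(f"{name}: at most {hz:.1f} Hz")
```

At two or three updates per second, a policy cannot react within a single contact transient, which is one plausible reading of why the slower models struggle on force-sensitive insertion.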
Key Players & Case Studies
Shen Yujun's talk is not happening in a vacuum. Several major players are converging on similar ideas, though none have made as radical a break from vision and language.
Ant Group Lingbo Robotics is a relatively new entrant, founded in 2023 as a subsidiary of Ant Group. Their focus is on service robots for logistics and healthcare—environments where precise force control matters more than visual scene understanding. Shen himself previously led the robotics division at Alibaba's DAMO Academy, where he worked on warehouse automation. Lingbo's strategy is to build a general-purpose manipulation stack that can be licensed to third-party hardware makers, much like Android. They have not publicly disclosed funding, but industry estimates place their R&D budget at over $200 million annually.
Google DeepMind is the incumbent with RT-2 and AutoRT. Their approach is to scale up VLA models with more data and larger transformers. However, internal papers from DeepMind have acknowledged the 'sim-to-real gap' and the difficulty of force-sensitive tasks. They are investing in tactile sensing (e.g., the DIGIT sensor) but have not yet integrated it into their main policy architecture.
Tesla Optimus is arguably the most high-profile competitor. Elon Musk's team uses a vision-only approach, relying on neural networks trained on human teleoperation data. Tesla has not published technical details, but leaked information suggests they use a variant of VLA with a large language model for task planning. The lack of force feedback has been cited as a reason for Optimus's struggles with delicate tasks like egg handling.
Figure AI, backed by OpenAI, Microsoft, and NVIDIA, recently demonstrated a humanoid robot that can perform warehouse tasks. Their approach combines vision-language models for high-level reasoning with a separate low-level controller for motion. This two-tier architecture is closer to Shen's vision, but the low-level controller is still trained on visual inputs, not raw physics.
| Company | Model Approach | Tactile Sensing | Force Control | Open-Source Strategy |
|---|---|---|---|---|
| Ant Lingbo | Physical Native | Proprietary (6-axis F/T) | Native | Planned (Android-like) |
| Google DeepMind | VLA (RT-2) | DIGIT (optional) | Separate module | Partial (RT-2 weights) |
| Tesla | Vision-only | No | Implicit (via teleoperation) | Closed |
| Figure AI | Two-tier (VLM + low-level) | No | Separate module | Closed |
Data Takeaway: Ant Lingbo is the only player betting entirely on a non-visual, non-linguistic physical native model. All others still rely on vision as the primary input, with force sensing as an optional add-on. This makes Lingbo's approach high-risk, high-reward: if they succeed, they leapfrog the competition; if they fail, they have no fallback.
Industry Impact & Market Dynamics
Shen's vision of a 'robotics Android' has profound implications for the industry structure. Currently, the robotics market is fragmented: each robot model (e.g., Boston Dynamics Spot, Universal Robots UR5, Franka Emika Panda) requires its own custom control software, perception stack, and motion planner. This 'one robot, one model' paradigm limits interoperability and slows down deployment. A standardized physical native model would decouple intelligence from hardware, allowing any robot with the right sensors to run the same brain.
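Decoupling intelligence from hardware usually comes down to a narrow driver interface. The sketch below shows one way that could look. `PhysicalState`, `RobotDriver`, `FakeArm`, and `damping_policy` are hypothetical names invented here, not part of any Lingbo API, and the policy is a trivial placeholder for a learned model.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class PhysicalState:
    wrench: List[float]            # 6-axis force/torque: Fx, Fy, Fz, Tx, Ty, Tz
    joint_positions: List[float]
    joint_velocities: List[float]

class RobotDriver(Protocol):
    """Hypothetical interface each hardware vendor would implement."""
    def read_state(self) -> PhysicalState: ...
    def send_torques(self, torques: List[float]) -> None: ...

class FakeArm:
    """Stand-in driver for illustration; records commands instead of moving motors."""
    def __init__(self, n_joints: int):
        self.n_joints = n_joints
        self.sent: List[List[float]] = []

    def read_state(self) -> PhysicalState:
        return PhysicalState([0.0] * 6, [0.0] * self.n_joints, [0.1] * self.n_joints)

    def send_torques(self, torques: List[float]) -> None:
        self.sent.append(torques)

def damping_policy(state: PhysicalState, gain: float = 2.0) -> List[float]:
    """Placeholder 'brain' that opposes joint velocity; a learned model would go here."""
    return [-gain * v for v in state.joint_velocities]

def control_step(robot: RobotDriver) -> None:
    """One sense-act cycle that depends only on the driver interface."""
    robot.send_torques(damping_policy(robot.read_state()))

for arm in (FakeArm(6), FakeArm(7)):  # two different "robots", same policy code
    control_step(arm)
    print(arm.n_joints, arm.sent[-1])
```

The policy never references a specific robot model; anything that can report a wrench and joint state and accept torques can run it. The hard part in practice is that different arms have different dynamics, which an interface alone does not solve.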
Market Size and Growth
The global robotics market was valued at $45 billion in 2025 and is projected to reach $120 billion by 2030, according to industry analysts. However, software accounts for only 15% of that value today. If a standardized AI platform emerges, software's share could grow to 40-50%, similar to the smartphone industry where Android and iOS capture a disproportionate share of value. Applied to the projected 2030 market, that would be $48-60 billion in software revenue, against about $18 billion at today's 15% share. The difference represents an opportunity on the order of $30-50 billion for the platform owner.
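The market figures can be checked with a few lines of arithmetic. The inputs are the numbers quoted above; the function names are illustrative, and reading the opportunity as the uplift over a 15% software share is an interpretation rather than something the analysts' projection states.

```python
def cagr(start, end, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end / start) ** (1.0 / years) - 1.0

def software_value(market, share):
    """Software revenue in $ billions for a given market size and share."""
    return market * share

MARKET_2025, MARKET_2030 = 45.0, 120.0  # $ billions, from the text

growth = cagr(MARKET_2025, MARKET_2030, 5)
baseline = software_value(MARKET_2030, 0.15)
low = software_value(MARKET_2030, 0.40)
high = software_value(MARKET_2030, 0.50)

print(f"implied growth: {growth:.1%} per year")
print(f"software at 15%: ${baseline:.0f}B; at 40-50%: ${low:.0f}-{high:.0f}B")
print(f"uplift over the 15% baseline: ${low - baseline:.0f}-{high - baseline:.0f}B")
```

The projection implies growth of a little under 22% per year, and the uplift works out to roughly $30-42 billion, which sits at the lower end of the quoted $30-50 billion range.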
Adoption Curve
We predict three phases:
1. 2026-2028: Physical native models remain in research labs. Lingbo demonstrates a prototype on a single robot arm (e.g., a Franka Emika Panda retrofitted with force sensors). Early adopters are industrial automation companies with high-precision needs (semiconductor manufacturing, medical device assembly).
2. 2028-2030: A reference implementation is open-sourced. Multiple hardware vendors (e.g., Universal Robots, Fanuc, Yaskawa) integrate the model into their controllers. The ecosystem grows around a common API for force-torque data. Venture capital flows into startups building applications on top of the platform.
3. 2030-2035: The physical native model becomes the de facto standard for manipulation tasks. Vision-language models are relegated to high-level task planning, while the low-level control is handled by the physical native model. The 'robotics Android' captures 30-40% market share.
| Phase | Year | Key Milestone | Market Value of Platform |
|---|---|---|---|
| Research | 2026-2028 | Prototype on single arm | $0 (R&D) |
| Early Adoption | 2028-2030 | Open-source release | $2-5 billion |
| Dominance | 2030-2035 | Industry standard | $30-50 billion |
Data Takeaway: The timeline is aggressive but plausible. Android took about five years to go from launch (2008) to dominance (2013). Robotics has a slower hardware cycle, so a 7-9 year timeline, as in the 2026-2035 window above, is realistic. The key inflection point is the open-source release, which will determine whether the model gains critical mass.
Risks, Limitations & Open Questions
Shen's proposal is compelling, but it faces several formidable challenges.
1. The Semantic Grounding Problem. Can a model that never 'sees' or 'reads' learn to perform tasks like 'sort the red blocks from the blue blocks'? Color is a visual property with no direct physical correlate. Without language or vision, the model would need to learn color through a proxy—perhaps by detecting that red blocks have a different thermal signature or reflectivity. This is inefficient and may not generalize. A hybrid approach that uses vision for semantic understanding but physical tokens for control might be necessary.
2. Sensor Requirements. Physical native models demand high-quality force-torque sensors, which are expensive (a 6-axis F/T sensor costs $2,000-$5,000). This could limit adoption to high-end industrial robots, defeating the 'Android for everyone' vision. Low-cost alternatives like capacitive or piezoresistive sensors are less accurate.
3. The Exploration Problem. How does a robot learn physical dynamics without a teacher? Random motor babbling is sample-inefficient and may damage the robot. Safe exploration in the physical world is an open research problem. Simulation can help, but then we are back to the simulation paradox.
4. Ethical and Safety Concerns. A standardized robotic brain could be weaponized or used for harmful purposes. Unlike Android, which can be locked down, a physical native model that controls real-world actuators poses physical risks. Who is liable when a robot running the model causes an accident? The hardware maker? The model developer? The open-source community?
5. The 'Android' Business Model. Android is free but Google makes money through services and ads. What is the monetization strategy for a robotics Android? Licensing fees? Cloud services for model updates? A marketplace for skills? Ant Group has not clarified this, and a poorly designed business model could stifle adoption.
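The exploration problem in point 3 is often approached by wrapping random babbling in hard safety limits. The sketch below shows that pattern under stated assumptions: the torque cap, the force threshold, and the `read_force` / `apply_torque` hooks are hypothetical, and `ToyJoint` is a toy stand-in for hardware, not a model of any real robot.

```python
import random

TORQUE_LIMIT = 0.5   # N*m, hypothetical conservative cap during exploration
FORCE_ABORT = 15.0   # N, hypothetical contact-force threshold that ends an episode

def safe_babble(read_force, apply_torque, n_steps, seed=0):
    """Random exploration with a torque cap and a force-triggered stop.

    read_force and apply_torque are caller-supplied callables (hypothetical hooks).
    Returns the number of exploratory steps executed before finishing or aborting.
    """
    rng = random.Random(seed)
    for i in range(n_steps):
        if abs(read_force()) > FORCE_ABORT:
            apply_torque(0.0)  # command zero torque before stopping
            return i
        raw = rng.gauss(0.0, 1.0)
        apply_torque(max(-TORQUE_LIMIT, min(TORQUE_LIMIT, raw)))
    return n_steps

class ToyJoint:
    """Toy stand-in for hardware: contact force grows as torque commands accumulate."""
    def __init__(self):
        self.force = 0.0
        self.log = []

    def read_force(self):
        return self.force

    def apply_torque(self, u):
        self.log.append(u)
        self.force += 4.0 * abs(u)

joint = ToyJoint()
steps = safe_babble(joint.read_force, joint.apply_torque, n_steps=100)
print(steps, max(abs(u) for u in joint.log))
```

Limits like these reduce the risk of damage but also restrict which parts of the state space the robot ever visits, so they trade safety against coverage rather than resolving the sample-efficiency concern.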
AINews Verdict & Predictions
Shen Yujun's physical native model is the most intellectually honest critique of current embodied AI we have heard in years. The industry has been drunk on the success of large language models and assumed that adding vision and language to robots would solve everything. It hasn't. The fundamental issue is that language and vision are human-centric abstractions that discard the very physical details robots need to manipulate the world. Shen is right to call for a return to first principles: force, torque, inertia, contact.
Prediction 1: By 2028, at least one major robotics company (likely Figure AI or a Japanese industrial giant like Fanuc) will adopt a physical native model as their primary low-level controller. The performance gains on precision tasks are too large to ignore. VLA models will be relegated to high-level task planning and human-robot interaction, while the 'muscle memory' of the robot will be handled by a physical native model.
Prediction 2: Ant Group will open-source a reference implementation of their physical native model by Q4 2027. This is the only way to achieve the network effects needed for an Android-like ecosystem. They will monetize through a cloud platform for model fine-tuning and a marketplace for pre-trained skills (e.g., 'grasping a fragile object,' 'screwing a bolt').
Prediction 3: The biggest winner will not be a hardware company but a sensor manufacturer. The physical native model creates massive demand for high-quality, low-cost force-torque sensors. Companies like ATI Industrial Automation (current market leader) or startups like Tactile Robotics will see exponential growth. Expect a sensor startup to reach unicorn status by 2029.
What to watch next: The next major milestone is a public demonstration of a physical native model performing a complex, multi-step task (e.g., assembling a smartphone) without any visual or language input. If Lingbo achieves this within 12 months, the race is on. If not, the idea may remain a provocative but impractical thought experiment. Either way, Shen has forced the field to confront an uncomfortable truth: we have been building robots that see and talk, but not robots that feel.