Technical Deep Dive
The core technical shift is from reactive control to predictive world modeling. Traditional robotics relies on pre-programmed motion paths, sensor feedback loops, and carefully structured environments. The new paradigm, championed by researchers such as Yann LeCun (through his Joint Embedding Predictive Architecture, JEPA) and Fei-Fei Li (through her work on spatial intelligence), aims to give robots a causal understanding of physics: that a cup will fall if pushed off a table, and that a door opens by turning a handle, not by brute force.
At the architectural level, these systems combine a large vision-language model (VLM) for semantic understanding with a learned dynamics model that predicts future states. For example, Google DeepMind's 'RT-2' and its successors use a transformer-based architecture that ingests camera images together with a language instruction and emits robot actions as discrete tokens, so a single network handles both scene understanding and action selection. A key idea in this line of work is the use of 'latent action spaces': the model doesn't predict exact torques, but rather high-level intentions like 'grasp' or 'slide,' which are then refined by a low-level controller.
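The split between a high-level intention and a low-level controller can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not how RT-2 or any production system is implemented: the `ArmState` fields, the proportional `gain`, and the function names are all hypothetical, and the probabilities stand in for whatever a real model would output.

```python
from dataclasses import dataclass


@dataclass
class ArmState:
    gripper_opening: float  # metres between the fingers (hypothetical field)
    x: float                # end-effector position along one axis, metres


def choose_latent_action(probabilities):
    """Pick the most probable high-level intention from the model's output."""
    return max(probabilities, key=probabilities.get)


def low_level_step(state, action, target_x, gain=0.5):
    """Refine a latent action into one small proportional-control step."""
    if action == "grasp":
        # Close the gripper part of the way toward fully closed (0.0).
        return ArmState(state.gripper_opening * (1.0 - gain), state.x)
    if action == "slide":
        # Move the end effector part of the way toward the target position.
        return ArmState(state.gripper_opening, state.x + gain * (target_x - state.x))
    raise ValueError(f"unknown latent action: {action}")


def run(state, probabilities, target_x, steps=10):
    """Choose one intention, then let the low-level controller execute it."""
    action = choose_latent_action(probabilities)
    for _ in range(steps):
        state = low_level_step(state, action, target_x)
    return action, state


if __name__ == "__main__":
    start = ArmState(gripper_opening=0.08, x=0.0)
    print(run(start, {"grasp": 0.2, "slide": 0.8}, target_x=0.3))
```

The design point is that the learned model only has to get the intention right; the controller, which can run at a much higher rate, is responsible for the exact motion.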
A critical enabler is the rise of differentiable physics simulators. NVIDIA's Isaac Sim and the open-source MuJoCo (now maintained by Google DeepMind) have gained GPU-accelerated, differentiable physics back ends that allow gradients to flow from a task loss (e.g., 'pick up the block') back through the simulation steps to the policy parameters. Proponents credit this for recent gains in 'sim-to-real' transfer. The GitHub repository 'diffsim' by the MIT CSAIL group has gained over 4,000 stars for its differentiable rigid-body dynamics, allowing end-to-end training of control policies that transfer to real hardware with zero fine-tuning in some cases.
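To make "gradients flowing from the task loss back through the simulation" concrete, here is a minimal sketch in plain Python. It is not the API of Isaac Sim, MuJoCo, or 'diffsim': it assumes a hypothetical 1-D point mass integrated with semi-implicit Euler, and the backward pass is written by hand where a real differentiable simulator would derive it automatically.

```python
def simulate(forces, mass=1.0, dt=0.1):
    """Roll a 1-D point mass forward from rest; return its final position."""
    position, velocity = 0.0, 0.0
    for force in forces:
        velocity += (force / mass) * dt   # semi-implicit Euler: velocity first
        position += velocity * dt         # then position, using the new velocity
    return position


def task_loss(forces, target, mass=1.0, dt=0.1):
    """Squared distance between the final position and the target."""
    return (simulate(forces, mass, dt) - target) ** 2


def loss_gradient(forces, target, mass=1.0, dt=0.1):
    """Backpropagate the task loss through every simulation step, last to first."""
    grad_position = 2.0 * (simulate(forces, mass, dt) - target)
    grad_velocity = 0.0
    grads = [0.0] * len(forces)
    for t in reversed(range(len(forces))):
        grad_velocity += grad_position * dt      # through the position update
        grads[t] = grad_velocity * dt / mass     # through the velocity update
    return grads


def optimize_forces(target, steps=20, iterations=50, learning_rate=1.0):
    """Plain gradient descent on the force sequence (an open-loop 'policy')."""
    forces = [0.0] * steps
    for _ in range(iterations):
        grads = loss_gradient(forces, target)
        forces = [f - learning_rate * g for f, g in zip(forces, grads)]
    return forces


if __name__ == "__main__":
    best = optimize_forces(target=1.0)
    print("final position:", simulate(best))
```

Real simulators add contacts, friction, and many degrees of freedom, which is where gradients become hard to compute and sometimes unreliable; the chain-rule structure, however, is the same as in this toy case.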
Real-time adaptation is the next frontier. Current world models are trained offline on massive datasets and then frozen at deployment. The next generation, being explored by startups like Covariant and Physical Intelligence, uses online fine-tuning: the robot continuously updates its world model based on its own sensory stream. This is computationally expensive: a single gradient update on a 7B-parameter model takes seconds on an A100 GPU, far too slow for real-time control. Researchers are therefore exploring 'mixture of experts' architectures in which only a small subset of parameters (the 'adaptation head') is updated online, while the core world model remains static. Early results from a preprint by UC Berkeley's BAIR lab show a 40% relative improvement in success rate on novel object manipulation tasks when using online adaptation versus a frozen model.
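The idea of updating only a small 'adaptation head' while the core stays frozen can be sketched in a few lines. This is an illustrative toy, not the method from the preprint or from any named company: `frozen_core` is a hypothetical fixed feature map standing in for the world model, the head is a plain linear layer, and the update is ordinary per-sample SGD on squared error.

```python
def frozen_core(observation):
    """Stand-in for the frozen world model: a fixed feature map, never updated."""
    return [1.0, observation, observation * observation]


class AdaptationHead:
    """Small linear head whose weights are the only parameters trained online."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, features):
        return sum(w * f for w, f in zip(self.weights, features))

    def update(self, features, target):
        """One SGD step on squared error; returns the error before the update."""
        error = self.predict(features) - target
        self.weights = [
            w - self.learning_rate * error * f
            for w, f in zip(self.weights, features)
        ]
        return error


def adapt_online(head, observations, outcome_fn, passes=200):
    """Feed the head a stream of (observation, observed outcome) pairs."""
    for _ in range(passes):
        for obs in observations:
            head.update(frozen_core(obs), outcome_fn(obs))
    return head


if __name__ == "__main__":
    # Hypothetical new environment: the true outcome is 0.5 + 2 * observation.
    stream = [i / 10.0 - 1.0 for i in range(21)]
    head = adapt_online(AdaptationHead(3), stream, lambda x: 0.5 + 2.0 * x)
    print(head.weights)
```

Each update touches three numbers instead of billions, which is the whole appeal: the cost of adaptation scales with the size of the head, not the size of the core model.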
| Benchmark | Model | Success Rate (Novel Objects) | Latency (ms per inference) | Training Data (hours) |
|---|---|---|---|---|
| RLBench (10 tasks) | RT-2 (frozen) | 62.3% | 45 | 10,000 |
| RLBench (10 tasks) | RT-2 + online adapt | 87.1% | 210 | 10,000 + 2 online |
| CALVIN (long-horizon) | JEPA-based | 54.7% | 78 | 5,000 |
| CALVIN (long-horizon) | Proprioceptive VLM | 71.2% | 112 | 8,000 |
Data Takeaway: Online adaptation dramatically improves performance on novel objects (62.3% to 87.1% for RT-2 on RLBench) but at roughly a 4.7x latency cost (210 ms versus 45 ms per inference), making it unsuitable for high-speed industrial applications today. The trade-off between generalization and speed remains the central engineering challenge.
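The trade-off in the two RT-2 rows can be worked out directly from the table. The sketch below uses only the table's figures; the conversion to a decision rate assumes, as a simplification, that each inference must finish before the next one starts.

```python
# Figures copied from the RT-2 rows of the benchmark table above.
FROZEN = {"success_pct": 62.3, "latency_ms": 45.0}
ADAPTED = {"success_pct": 87.1, "latency_ms": 210.0}


def latency_ratio(adapted_ms, frozen_ms):
    """How many times slower the adapted model is per inference."""
    return adapted_ms / frozen_ms


def max_decision_rate_hz(latency_ms):
    """Upper bound on decisions per second with strictly sequential inference."""
    return 1000.0 / latency_ms


def relative_gain(before_pct, after_pct):
    """Improvement as a fraction of the starting success rate."""
    return (after_pct - before_pct) / before_pct


if __name__ == "__main__":
    print("latency ratio:", latency_ratio(ADAPTED["latency_ms"], FROZEN["latency_ms"]))
    print("frozen rate (Hz):", max_decision_rate_hz(FROZEN["latency_ms"]))
    print("adapted rate (Hz):", max_decision_rate_hz(ADAPTED["latency_ms"]))
    print("relative gain:", relative_gain(FROZEN["success_pct"], ADAPTED["success_pct"]))
```

Under that assumption the frozen model can issue about 22 decisions per second and the adapted one fewer than 5, which is why the latency column matters as much as the success column.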
Key Players & Case Studies
The field has bifurcated into two camps: 'generalists' building universal brains for any robot, and 'verticals' optimizing for specific tasks. The generalists include Covariant (founded by Pieter Abbeel, Rocky Duan, and Peter Chen), which has raised over $700M to build the 'Robotic Brain', a foundation model that can control any robot arm. Their latest model, 'RFM-2' (Robotic Foundation Model 2), is trained on data from over 100 different robot types across 20+ warehouses. Covariant's strategy is to license the brain rather than sell hardware: a pure software play.
On the vertical side, Figure AI (backed by OpenAI, Microsoft, and Jeff Bezos) is building a humanoid robot with a tightly integrated brain. Their Figure 02 robot, unveiled in August 2024, uses a custom VLM trained on egocentric video from 500 robots operating in BMW factories. The key insight: by controlling both hardware and software, Figure can optimize the brain for its specific actuator dynamics, achieving smoother motion than a generalist model on a third-party arm. However, this comes at the cost of flexibility: the Figure 02 brain cannot be easily ported to a different robot.
A third, emerging category is the 'simulation-first' approach, led by Skild AI (spun out of CMU). Skild has built a massive, 1.2 billion-parameter world model trained entirely in simulation (using NVIDIA Isaac Gym) across 10,000 virtual environments. Their claim: the model generalizes to real-world tasks without any real-world fine-tuning. In a public demo, a Skild-controlled robot arm successfully opened a child-proof medicine bottle—a task that requires precise force control and understanding of the 'push-and-turn' mechanism—after only seeing it in simulation. This is a bold claim that has yet to be independently verified at scale.
| Company | Approach | Funding Raised | Key Metric | Deployment |
|---|---|---|---|---|
| Covariant | Generalist brain, hardware-agnostic | $750M | 95% pick rate in warehouses | 500+ warehouses |
| Figure AI | Integrated humanoid, custom brain | $1.2B | 1,000 units deployed in automotive | 3 factories |
| Skild AI | Simulation-only world model | $350M | 80% zero-shot success on novel tasks | Pilot stage |
| Physical Intelligence | Online adaptation, generalist | $400M | 70% success on long-horizon tasks | Research stage |
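Data Takeaway: Covariant's 95% warehouse pick rate, combined with its 500+ warehouse footprint, is the gold standard for commercial viability, while Figure's integrated approach has put 1,000 units into three automotive factories, a far more structured environment. The key metrics are not directly comparable across companies, and Skild's zero-shot claim is the most ambitious but unproven at scale.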
Industry Impact & Market Dynamics
The shift to software-defined robotics is reshaping the entire value chain. The global robotics market is projected to reach $120B by 2028, with the 'brain' software segment growing at a 35% CAGR, outpacing hardware growth (12% CAGR). This is driving a wave of M&A: in the last 12 months, we've seen three major acquisitions of AI software startups by hardware incumbents (e.g., ABB acquiring a perception startup for $200M, Fanuc acquiring a motion planning company).
The business model shift to RaaS is accelerating. A typical industrial robot arm costs $50,000 upfront; a RaaS model with a $5,000 monthly subscription (including software updates, cloud inference, and maintenance) has a 4-year total cost of $240,000, nearly five times the purchase price, yet the lower upfront cost is attractive to SMEs. More importantly, the recurring revenue stream allows companies to invest in continuous software improvement. Figure AI reported that 70% of its 2025 revenue came from subscriptions, up from 20% in 2024.
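The subscription arithmetic above is easy to reproduce. The sketch below uses only the $50,000 purchase price and $5,000 monthly fee quoted in the text; it deliberately ignores the maintenance and software costs an outright buyer would also pay, so it overstates the gap in favor of purchasing.

```python
import math


def raas_total_cost(monthly_fee, months):
    """Cumulative subscription fees over the given number of months."""
    return monthly_fee * months


def break_even_month(purchase_price, monthly_fee):
    """First month in which cumulative fees reach or exceed the purchase price."""
    return math.ceil(purchase_price / monthly_fee)


if __name__ == "__main__":
    total = raas_total_cost(monthly_fee=5_000, months=48)
    print("4-year RaaS cost:", total)
    print("multiple of purchase price:", total / 50_000)
    print("break-even month:", break_even_month(50_000, 5_000))
```

With these figures the subscription overtakes the sticker price after ten months, so the case for RaaS rests on bundled services and lower upfront capital rather than on total cost.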
| Year | Hardware Revenue (Global) | Software/Subscription Revenue | RaaS Adoption Rate (Industrial) |
|---|---|---|---|
| 2023 | $45B | $5B | 5% |
| 2024 | $48B | $8B | 12% |
| 2025 | $50B | $14B | 22% |
| 2026 (est.) | $52B | $22B | 35% |
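Data Takeaway: Software revenue is growing far faster than hardware. In the table it rises from $5B to $22B between 2023 and 2026 (roughly 64% compounded annually), against $45B to $52B for hardware (roughly 5%), a wider gap than the 35% versus 12% CAGR projection cited above. RaaS adoption is on track to become the dominant model for industrial robotics by 2028.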
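The growth rates implied by the revenue table can be checked with the standard compound annual growth rate formula. The sketch below uses only the 2023 and 2026 (est.) endpoints from the table, in billions of dollars.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Endpoints from the table: 2023 actuals and 2026 estimates, in $B.
    hardware = cagr(45.0, 52.0, years=3)
    software = cagr(5.0, 22.0, years=3)
    print(f"hardware CAGR: {hardware:.1%}")
    print(f"software CAGR: {software:.1%}")
    print(f"ratio: {software / hardware:.1f}x")
```

Because the software base is small, a high percentage growth rate still leaves it well below hardware in absolute terms in 2026 ($22B versus $52B).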
The call for open data standards is the most surprising development. At a private CEO roundtable in May 2026, leaders from Covariant, Skild, and a major Chinese robotics firm (who asked to remain anonymous) proposed a 'Robotic Data Commons'—a shared repository of robot interaction data, labeled with a common schema. The rationale is simple: training a general-purpose world model requires billions of diverse interactions, and no single company has access to enough data. This mirrors the early days of NLP, where the creation of Common Crawl and Wikipedia enabled the transformer revolution. If successful, this could lower the barrier to entry for new players and accelerate the entire field.
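No schema for the proposed commons has been published, so the following is purely illustrative of what "labeled with a common schema" could mean in practice. Every field name here is hypothetical, as is the `schema_version` value; the only point is that a shared, versioned record format lets data from different robots be pooled and parsed the same way.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InteractionRecord:
    """One robot interaction sample (all field names are hypothetical)."""
    robot_type: str
    task_label: str
    timestamp_s: float
    joint_positions_rad: list
    action: str
    success: bool
    schema_version: str = "0.1"


def to_json(record):
    """Serialize a record to a JSON string with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return InteractionRecord(**json.loads(text))


if __name__ == "__main__":
    sample = InteractionRecord(
        robot_type="6-axis-arm",
        task_label="pick_and_place",
        timestamp_s=12.5,
        joint_positions_rad=[0.0, 0.5, -0.25, 1.0, 0.0, 0.75],
        action="grasp",
        success=True,
    )
    print(to_json(sample))
```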
Risks, Limitations & Open Questions
Despite the optimism, significant risks remain. First, the 'sim-to-real gap' is not fully closed. While simulation-based training has improved, subtle differences in friction, lighting, and material properties still cause failures. A 2025 study by Stanford's IRIS lab found that even the best sim-to-real policies fail 15-20% of the time on tasks involving deformable objects (e.g., folding towels, handling cables).
Second, the computational cost of world models is prohibitive for edge deployment. Most current systems rely on cloud inference, introducing latency and reliability issues. A robot in a factory that loses its internet connection becomes a brick. On-device inference with a 7B-parameter model requires a $10,000+ GPU module, which is not cost-effective for a $50,000 robot. The race is on to develop more efficient models, using techniques such as quantization, pruning, and distillation, that can run on embedded hardware.
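A rough sense of why quantization matters for a 7B-parameter model comes from counting bytes. The sketch below estimates weight memory at common precisions and shows a simple symmetric int8 quantizer; it counts weights only (no activations or caches), and the per-tensor scheme is one basic variant chosen for illustration, not what any particular vendor ships.

```python
BYTES_PER_PARAMETER = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_parameters, precision):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_parameters * BYTES_PER_PARAMETER[precision] / 1e9


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAMETER:
        print(precision, weight_memory_gb(7e9, precision), "GB")
    q, s = quantize_int8([0.5, -1.27, 0.013, 0.0])
    print(q, dequantize(q, s))
```

Going from fp16 to int4 cuts the weights of a 7B model from 14 GB to 3.5 GB, at the price of a rounding error of at most half a quantization step per weight; whether task accuracy survives that error has to be measured per model.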
Third, safety and alignment remain unresolved. A robot with a world model that can predict the outcome of its actions could also discover unintended, harmful actions. For example, a robot tasked with stacking boxes might learn that pushing a human out of the way is a faster path to the goal. Current reward models are too brittle to prevent such behavior. The field lacks a robust framework for 'embodied AI safety'—analogous to RLHF for language models but grounded in physical consequences.
Fourth, the open data initiative faces enormous coordination challenges. Companies are reluctant to share proprietary data, especially from high-value deployments. The proposed 'Robotic Data Commons' has no governance structure, no funding, and no clear IP framework. It may remain a talking point rather than a reality.
AINews Verdict & Predictions
The embodied AI industry is undergoing a necessary maturation. The hardware-first era was a distraction; the real breakthroughs will come from software intelligence. We make three predictions:
1. By Q1 2027, a 'world model' will be commercially deployed that can handle 90% of common pick-and-place tasks in unstructured environments without prior training. This will be achieved by a hybrid approach: a large, pre-trained VLM for semantic understanding, combined with a lightweight, online-adaptable dynamics model. The winner will be a company that masters the data flywheel—collecting real-world interaction data at scale and using it to continuously improve the model.
2. The RaaS model will become the default for new robot deployments by 2028, with software subscription revenue exceeding hardware revenue for the first time. This will force traditional hardware manufacturers to either acquire AI capabilities or become commoditized suppliers.
3. The 'Robotic Data Commons' will launch in a limited form by mid-2027, backed by a consortium of 5-10 major players, but will struggle with data quality and privacy. It will accelerate research but not immediately transform commercial products.
The most important thing to watch is the real-world deployment data: which company can demonstrate the lowest cost-per-task, the highest reliability, and the fastest adaptation to new environments. The hardware is almost a commodity now; the battle for the brain has just begun.