Beyond the Screen: Why AI's Next War Is in the Physical World

The 2026 Zhiyuan Conference delivered a sobering reality check for the AI industry. While large language models have mastered text, images, and code, their performance in the messy, unpredictable physical world remains primitive. The conference's most resonant message was that the next frontier for AI is not more tokens or pixels, but real-world interaction—robotics, autonomous driving, and industrial automation. This transition from 'digital intelligence' to 'embodied intelligence' demands a fundamental rethinking of model architectures, training paradigms, and business models. The era of API-based, per-token billing is giving way to hardware-plus-subscription models. Safety concerns escalate from content moderation to physical harm prevention. AINews examines the technical hurdles—from spatial reasoning to causal inference—and profiles the companies and research groups already racing to bridge the simulation-to-reality gap. The winners will not be those with the biggest parameter counts, but those who can close the loop between perception, planning, and action in the real world.

Technical Deep Dive

The shift from screen-based AI to physical-world AI is not a simple extension of existing capabilities; it requires a fundamentally different stack. The core challenge is bridging the gap between the discrete, symbolic world of text and the continuous, noisy, high-dimensional world of physics.

Architecture and Algorithms

Traditional LLMs operate on sequences of tokens. They have no inherent understanding of 3D space, material properties, or cause and effect. To operate in the physical world, models must integrate multiple modalities—vision, touch, proprioception—and learn to output continuous control signals, not just text.

A leading approach is the Vision-Language-Action (VLA) model, exemplified by Google DeepMind's RT-2 and the open-source OpenVLA (7B parameters). These models take a camera image and a natural language instruction (e.g., "pick up the red cup") and directly output motor commands. The key innovation is fine-tuning a pre-trained vision-language model on robot demonstration data, which transfers semantic knowledge learned from web-scale data to physical actions.
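
One way to see how a language-model backbone can emit motor commands is action tokenization: RT-2 describes discretizing each continuous action dimension into 256 bins, so an action becomes a short sequence of token ids. The sketch below shows uniform binning only. The function names, the [-1, 1] normalized range, and the 7-value action layout are assumptions made for illustration, not code from either project.

```python
from typing import List

NUM_BINS = 256          # bins per action dimension (following RT-2's description)
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: List[float]) -> List[int]:
    """Map each continuous action value to an integer bin id in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens

def tokens_to_action(tokens: List[int]) -> List[float]:
    """Map bin ids back to the centre of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]

# Hypothetical command: xyz delta, rotation delta, gripper
command = [0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0]
tokens = action_to_tokens(command)
recovered = tokens_to_action(tokens)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating control as next-token prediction.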

However, VLAs face a critical bottleneck: data scarcity. While text data is abundant, high-quality robot demonstration data is expensive and slow to collect. This has spurred research into sim-to-real transfer. NVIDIA's Isaac Sim and the open-source MuJoCo physics engine are used to generate massive synthetic training datasets. The challenge is that models trained in simulation often fail in the real world due to the 'reality gap'—differences in friction, lighting, and object dynamics. Techniques like domain randomization (varying simulation parameters during training) have become standard, but the gap remains a major hurdle.
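
Domain randomization can be sketched in a few lines: before each simulated episode, physical and visual parameters are drawn afresh, so the policy never sees the same world twice. The parameter names, ranges, and the `run_episode` placeholder below are hypothetical; they are not tied to Isaac Sim or MuJoCo, each of which has its own interface for setting such values.

```python
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float         # surface friction coefficient
    light_intensity: float  # relative scene brightness
    object_mass_kg: float   # mass of the manipulated object

# Hypothetical ranges; real projects tune these per task and simulator.
RANGES = {
    "friction": (0.4, 1.2),
    "light_intensity": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
}

def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized set of physics and rendering parameters."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})

def run_episode(params: SimParams) -> None:
    """Placeholder: a real version would reset the simulator with `params` and collect a rollout."""
    pass

rng = random.Random(0)  # fixed seed so the sampled worlds are reproducible
episode_params = [sample_params(rng) for _ in range(1000)]
for params in episode_params:
    run_episode(params)
```

The intent is that the real world ends up looking like one more sample from the training distribution, though wide ranges can also make the task harder to learn.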

Benchmarking Physical Performance

Standard LLM benchmarks like MMLU are irrelevant here. New benchmarks are emerging to evaluate physical intelligence:

| Benchmark | Focus | Key Metric | Example Task |
|---|---|---|---|
| Meta-World | Multi-task robotic manipulation | Success rate across 50 tasks | Open door, push block, pick and place |
| CALVIN | Long-horizon language-conditioned tasks | Task completion rate | Chain of consecutive tabletop instructions, e.g., "open the drawer", then "turn on the light" (5 per sequence) |
| Habitat 3.0 | Embodied navigation and interaction | Success weighted by path length | Find a person and hand them an object |
| Waymo Open Motion Dataset | Autonomous driving planning | Displacement error, collision rate | Predict vehicle trajectories in complex intersections |
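
As an illustration of how such metrics are computed, success weighted by path length (SPL) is commonly defined for embodied navigation as the average, over episodes, of the success indicator times the ratio of the shortest path length to the longer of the shortest and actual path lengths. The sketch below follows that definition; the episode tuples are made-up values and the function name is an assumption, not an API from any benchmark.

```python
from typing import List, Tuple

def spl(episodes: List[Tuple[bool, float, float]]) -> float:
    """Each episode is (success, shortest_path_length, agent_path_length).

    Assumes shortest_path_length > 0 for every episode.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

episodes = [
    (True, 10.0, 10.0),   # optimal path: contributes 1.0
    (True, 10.0, 20.0),   # path twice as long: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
score = spl(episodes)     # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

A plain success rate would score these three episodes at 0.67; SPL penalizes the wandering second run.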

Data Takeaway: Current state-of-the-art models achieve ~60-70% success on Meta-World's 50 tasks, but drop below 30% on CALVIN's long-horizon tasks. This points to a fundamental weakness in sequential reasoning and memory: a model can pick up a cup but forget that it then needs to place it in the sink.
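
Part of the long-horizon drop follows from simple compounding, before any memory failure is considered. The sketch below assumes each step in a chain succeeds independently with the same probability, which is a simplifying assumption of ours and not how either benchmark reports results.

```python
def chain_success(per_step_success: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_success ** num_steps

# Illustrative per-step rate of 70%, evaluated for chains of 1, 3, and 5 steps
for steps in (1, 3, 5):
    print(steps, round(chain_success(0.7, steps), 3))
```

Under that assumption, a 70% per-step rate leaves roughly a 17% chance of finishing five steps in a row, so long-horizon scores punish even modest single-step unreliability.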

Another critical technical area is causal inference. LLMs excel at correlation but struggle with causality. A robot trained to push a button that turns on a light may learn only to reach for the spot where the button sat during training; if the button is moved, it fails, because it never learned that pressing the button is what causes the light to turn on. Researchers at MIT CSAIL have proposed Causal World Models that explicitly learn the underlying dynamics of the environment, allowing for zero-shot adaptation to new configurations. This is an active area with repos like Causal-Imitation (1.2K stars) showing promising results in learning causal graphs from interaction data.
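
The button example can be made concrete with a toy environment. This is only an illustration of the failure mode, not an implementation of Causal World Models or of the Causal-Imitation repository; every class and function name below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Room:
    button_position: int   # where the button currently is
    light_on: bool = False

    def press(self, position: int) -> None:
        """The light turns on only if the press lands on the button."""
        if position == self.button_position:
            self.light_on = True

def memorized_policy(room: Room) -> int:
    """Replays the location seen in training (position 3), ignoring the observation."""
    return 3

def button_seeking_policy(room: Room) -> int:
    """Acts on the causal variable: presses wherever the button is observed to be."""
    return room.button_position

def succeeds(policy, button_position: int) -> bool:
    room = Room(button_position)
    room.press(policy(room))
    return room.light_on

results = {
    "memorized, training layout": succeeds(memorized_policy, 3),
    "memorized, button moved": succeeds(memorized_policy, 5),
    "button-seeking, button moved": succeeds(button_seeking_policy, 5),
}
```

Both policies look identical on the training layout; only moving the button reveals which one captured the cause rather than the coincidence.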

Key Players & Case Studies

The race to physical AI is not just a research problem; it's a competitive battleground with distinct strategic approaches.

The Big Tech Players

| Company | Approach | Key Product/Platform | Strategy |
|---|---|---|---|
| Tesla | Vertical integration | Optimus humanoid robot, Full Self-Driving | End-to-end neural networks trained on massive real-world fleet data; control over hardware and software |
| Google DeepMind | Foundation models + robotics | RT-2, AutoRT, Gemini Robotics | Leveraging massive multimodal models for generalization; open-sourcing research to attract talent |
| NVIDIA | Infrastructure provider | Isaac Sim, Jetson Orin, GR00T | Selling the 'picks and shovels'—simulation, hardware, and developer tools for the entire industry |
| OpenAI | Strategic investment | Investment in Figure AI, 1X Technologies | Betting on external robotics companies while focusing internally on model development |

Case Study: Figure AI and OpenAI

Figure AI, backed by OpenAI, Microsoft, and NVIDIA, has demonstrated the most visible progress in combining LLMs with humanoid robotics. Their Figure 01 robot, powered by an OpenAI model, can engage in full conversations while performing tasks like handing an apple to a person. The key insight: the LLM handles high-level reasoning and language understanding, while a separate low-level policy network handles the precise motor control. This modular approach is currently the most practical, but it introduces latency and coordination challenges.
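
The modular split described above can be sketched as two nested loops running at different rates: a slow planner that produces subgoals and a fast controller that emits a command on every control tick. The classes, the fixed subgoal list, and the tick counts are hypothetical stand-ins, not Figure's or OpenAI's actual architecture.

```python
from typing import List

class SlowPlanner:
    """Stand-in for the language model: maps an instruction to a list of subgoals."""
    def plan(self, instruction: str) -> List[str]:
        # Hypothetical fixed decomposition; a real system would query a model here.
        return ["locate apple", "grasp apple", "hand over apple"]

class FastController:
    """Stand-in for the low-level policy: emits one motor command per control tick."""
    def act(self, subgoal: str, tick: int) -> str:
        return f"tick {tick}: motor command for '{subgoal}'"

def run(instruction: str, ticks_per_subgoal: int = 3) -> List[str]:
    planner, controller = SlowPlanner(), FastController()
    log: List[str] = []
    tick = 0
    for subgoal in planner.plan(instruction):   # slow loop: one iteration per subgoal
        for _ in range(ticks_per_subgoal):      # fast loop: many ticks per subgoal
            log.append(controller.act(subgoal, tick))
            tick += 1
    return log

log = run("hand me the apple")
```

The coordination problem shows up at the boundary between the loops: while the planner is thinking, the controller keeps acting on a subgoal that may already be stale.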

Case Study: Covariant

Covariant, co-founded by UC Berkeley professor and former OpenAI researcher Pieter Abbeel, takes a different approach. Their Covariant Brain is a single AI model trained on data from hundreds of deployed robots in warehouses. Instead of a humanoid form factor, they focus on robotic arms for picking and sorting. This narrower focus has yielded higher reliability: their systems achieve over 99% pick success in production environments. The trade-off is lack of generality; the robot cannot walk or navigate, but it excels at its specific task.

The Open-Source Movement

The open-source community is also making strides. The LeRobot project (15K+ stars) by Hugging Face provides a standardized framework for collecting and training on robot data. The DROID dataset (roughly 76K teleoperated trajectories, about 350 hours of robot data) is one of the largest open collections of its kind, helping smaller labs compete with Big Tech.

Industry Impact & Market Dynamics

The shift to physical AI is reshaping market structures and business models.

Market Size and Growth

| Segment | 2025 Market Size (USD) | 2030 Projected Size (USD) | CAGR |
|---|---|---|---|
| Industrial Robotics | $45B | $80B | 12% |
| Service Robotics (logistics, cleaning) | $20B | $55B | 22% |
| Humanoid Robotics | $2B | $30B | 72% |
| Autonomous Driving (L4+) | $5B | $50B | 58% |

Data Takeaway: The humanoid and autonomous driving segments are projected to grow at explosive rates, but from a small base. The industrial robotics market, while slower-growing, is the most mature and profitable today.
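
The CAGR column can be checked directly from the two size columns using the standard formula, CAGR = (end / start)^(1 / years) - 1, over the five years from 2025 to 2030. The sketch below recomputes each rate from the table's own figures; the function and variable names are ours.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.12 means 12% per year)."""
    return (end / start) ** (1 / years) - 1

# (2025 size, 2030 size) in billions of USD, taken from the table above
segments = {
    "Industrial Robotics": (45, 80),
    "Service Robotics": (20, 55),
    "Humanoid Robotics": (2, 30),
    "Autonomous Driving (L4+)": (5, 50),
}

rates = {name: round(cagr(start, end, 5) * 100) for name, (start, end) in segments.items()}
```

The recomputed percentages (12, 22, 72, and 58) match the table, so the projections are at least internally consistent.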

Business Model Transformation

The dominant AI business model today is API-based, per-token pricing (e.g., OpenAI's GPT-4o, which launched at $5 per million input tokens). Physical AI demands a different approach:

- Hardware-as-a-Service (HaaS): Companies like Intrinsic (Alphabet's robotics software arm) are moving toward selling robots as a service, charging a monthly fee that includes hardware, software, and maintenance. This lowers the upfront cost for customers and creates recurring revenue.
- Outcome-based pricing: Instead of paying for compute, customers pay per task completed (e.g., per pallet moved, per defect detected). This aligns incentives but requires highly reliable systems.
- Vertical integration: Tesla's model—owning the AI, the hardware, and the deployment—allows for tighter optimization but requires massive capital expenditure.
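
The choice between a flat subscription and outcome-based pricing comes down to a break-even volume. The sketch below works through that arithmetic; every figure in it is hypothetical and chosen only to keep the numbers easy to follow, not drawn from any vendor's price list.

```python
def monthly_cost_subscription(fee_per_robot: float, robots: int) -> float:
    """Hardware-as-a-Service: flat monthly fee per robot."""
    return fee_per_robot * robots

def monthly_cost_outcome(price_per_task: float, tasks: int) -> float:
    """Outcome-based pricing: pay only for completed tasks."""
    return price_per_task * tasks

def break_even_tasks(fee_per_robot: float, robots: int, price_per_task: float) -> float:
    """Task volume at which both pricing models cost the customer the same."""
    return monthly_cost_subscription(fee_per_robot, robots) / price_per_task

# Hypothetical figures
FEE_PER_ROBOT = 3000.0    # USD per robot per month
ROBOTS = 10
PRICE_PER_PALLET = 0.50   # USD per pallet moved

threshold = break_even_tasks(FEE_PER_ROBOT, ROBOTS, PRICE_PER_PALLET)
```

With these numbers the break-even is 60,000 pallets a month: below it the customer prefers paying per pallet, above it the subscription is cheaper, which is why outcome pricing shifts volume and reliability risk onto the vendor.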

Supply Chain Shifts

The demand for specialized hardware is surging. NVIDIA's H100 GPU, once the gold standard for AI training, is being supplemented by the Jetson Orin and Thor chips designed for real-time inference in robots and vehicles. Sensor manufacturers like Luminar and Hesai are seeing record orders for LiDAR units. The entire supply chain for sensors, actuators, and compute is being reshaped by the physical AI boom.

Risks, Limitations & Open Questions

The path to physical AI is fraught with challenges that go beyond technical hurdles.

Safety and Liability

When an LLM generates harmful text, the damage is usually informational or reputational. When a robot arm swings into a human worker, the damage is physical. The current safety tools for AI, such as red-teaming and content filters, are inadequate for physical systems. Who is liable when a self-driving car causes an accident? The model developer? The hardware manufacturer? The fleet operator? Legal frameworks are years behind the technology.

The Sim-to-Real Gap

Despite advances, sim-to-real transfer remains unreliable. A robot trained to open doors in simulation might fail on a real door with a slightly different handle shape or friction coefficient. This limits the scalability of training—every new environment may require some real-world fine-tuning, which is slow and expensive.

Energy and Compute Constraints

Running a large model on a robot in real time is a major engineering challenge. The Figure 01 robot reportedly uses a custom compute module that draws over 500 watts. For a humanoid robot to operate for a full workday, battery technology needs to improve dramatically. Current energy densities limit runtime to 1-2 hours for most humanoid prototypes.
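
The runtime limit is easy to sanity-check: idealized runtime is usable battery energy divided by average power draw. In the sketch below, the 500 W compute figure comes from the text, while the actuator load and battery capacity are assumptions chosen for illustration, not reported specifications.

```python
def runtime_hours(battery_wh: float, average_draw_w: float) -> float:
    """Idealized runtime: usable battery energy divided by average power draw."""
    return battery_wh / average_draw_w

def battery_needed_wh(target_hours: float, average_draw_w: float) -> float:
    """Usable energy required to sustain `average_draw_w` for `target_hours`."""
    return target_hours * average_draw_w

COMPUTE_W = 500.0      # compute draw cited in the text
ACTUATORS_W = 500.0    # assumption, not a reported number
BATTERY_WH = 2000.0    # assumption, not a reported number

total_draw = COMPUTE_W + ACTUATORS_W
hours = runtime_hours(BATTERY_WH, total_draw)    # 2000 / 1000 = 2.0 hours
needed = battery_needed_wh(8.0, total_draw)      # 8 * 1000 = 8000.0 Wh
```

Under these assumptions an eight-hour shift needs four times the battery capacity, which is why deployments tend to rely on charging breaks or battery swaps.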

Job Displacement

The most immediate impact of physical AI will be in logistics and manufacturing—sectors that employ millions of workers globally. While AI proponents argue that new jobs will be created (robot maintenance, AI training), the transition period could be painful. The social safety nets needed to manage this transition are not in place.

AINews Verdict & Predictions

The move from screen to physical world is not optional—it is the inevitable next phase of AI. The companies that succeed will be those that embrace the complexity of the real world, not those that retreat further into the safety of digital environments.

Our Predictions:

1. By 2028, at least one major automaker will deploy a fully autonomous (L4) taxi fleet in a major city without a safety driver. The regulatory and technical hurdles are immense, but the economic incentive (eliminating the driver cost) is too large to ignore.

2. Humanoid robots will remain a niche until 2030. The cost, reliability, and energy challenges are too great. Instead, the first wave of physical AI will be in constrained environments—warehouses, factories, and hospitals—using specialized form factors (arms, carts, surgical assistants).

3. The open-source robotics ecosystem will surpass proprietary systems in research velocity. Just as Llama and Mistral democratized LLM research, projects like LeRobot and OpenVLA will enable a wave of innovation from university labs and startups, outpacing the slow-moving corporate giants.

4. A major safety incident involving a physical AI system will occur by 2027, triggering a regulatory crackdown. This will slow deployment in the short term but lead to better safety standards in the long term, much like the aviation industry after early crashes.

5. The most valuable AI company in 2030 will not be an LLM provider, but a company that successfully integrates AI with hardware at scale. Tesla, with its vertical integration and massive data pipeline, is the current frontrunner, but a startup like Figure AI or Covariant could disrupt the incumbents.

The battle for AI's future is leaving the digital realm. The winners will be those who can navigate the messy, unpredictable, and dangerous world of atoms, not just bits.
