Jim Fan Declares VLA and Teleoperation Dead: NVIDIA's World Model Revolution

May 2026
NVIDIA's top roboticist, Jim Fan, has declared Vision-Language-Action (VLA) models and teleoperation 'dead'. This is not hyperbole but a fundamental challenge to the current robot-learning paradigm. AINews analyzes the pivot toward world models and what it means for the industry.

In a sweeping and deliberately provocative statement, Jim Fan, the head of NVIDIA's Generalist Robot Embodied Agent (GEAR) lab, has declared that Vision-Language-Action (VLA) models and teleoperation are 'dead' technologies. This is not a casual opinion but a calculated signal from the company that dominates AI compute. Fan argues that the current dominant paradigm—where robots are trained by stitching together vision, language, and action into a single model, or by human operators manually guiding movements via teleoperation—is a dead end. The core problem, he contends, is that both approaches fail to scale. VLA models, while impressive in controlled demos, exhibit catastrophic generalization failures when faced with novel objects, lighting, or spatial arrangements. Teleoperation, on the other hand, creates a human-in-the-loop bottleneck that makes data collection for every new task prohibitively expensive.

Fan's alternative is a bet on 'world models'—neural networks that learn the causal physics of the environment, allowing robots to simulate outcomes, plan actions, and learn through self-play in virtual environments. This represents a paradigm shift from 'perception-to-action' mapping to 'understanding-to-reasoning-to-action.' The implications are profound: startups built entirely on VLA fine-tuning pipelines may find their approach obsolete, while NVIDIA positions its Omniverse and Cosmos simulation platforms as the indispensable infrastructure for this new era. AINews examines the technical underpinnings, the key players, and the market dynamics of this unfolding revolution.

Technical Deep Dive

Jim Fan's declaration is rooted in a fundamental critique of how modern robot learning models are architected. The VLA paradigm, popularized by models like Google's RT-2 and PaLM-E, treats robot control as a sequence-to-sequence problem. A vision encoder (e.g., ViT) processes camera input, a language model (e.g., PaLM) interprets a task instruction, and a small action decoder outputs joint angles or end-effector poses. The problem is that this is a shallow mapping. The model learns correlations between pixels, words, and motor commands, but it has no internal representation of physics—no understanding of gravity, friction, inertia, or object permanence. When a VLA model is trained on 10,000 episodes of 'pick up the red cup' and then sees a translucent cup or a cup in shadow, it fails because the pixel distribution has shifted, not because it lacks a concept of 'cupness.'
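
To make the 'shallow mapping' critique concrete, here is a minimal, purely illustrative sketch of the VLA pipeline shape: encoder features and an instruction embedding are fused and decoded linearly into a pose. Every function, shape, and weight here is invented for illustration and bears no relation to the actual RT-2 or PaLM-E internals.

```python
# Toy sketch of a VLA forward pass: fuse vision and language features,
# decode an action. The model learns pixel/word/action correlations only;
# nothing in this mapping encodes gravity, friction, or object permanence.

def encode_image(pixels):
    """Stand-in for a ViT-style encoder: collapse pixels to a feature vector."""
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels), min(pixels)]

def encode_instruction(tokens):
    """Stand-in for an LLM embedding of the task instruction."""
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]

def action_head(features, weights):
    """Linear head mapping fused features to an end-effector pose (x, y, z)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

img_feat = encode_image([0.2, 0.8, 0.5, 0.1])
txt_feat = encode_instruction(["pick", "up", "the", "red", "cup"])
features = img_feat + txt_feat
weights = [[0.1] * len(features) for _ in range(3)]  # toy, untrained weights
pose = action_head(features, weights)
print(len(pose))  # a 3-DoF end-effector target
```

Shift the pixel distribution (a translucent cup, a shadow) and `img_feat` moves, so the decoded pose moves with it; there is no intermediate concept for the correlation to fall back on.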

Teleoperation, meanwhile, is the data collection method that feeds these models. Systems like the ALOHA platform or DROID use human operators to remotely control robot arms, generating high-quality demonstration data. But this is a scaling dead end. Each new task requires a human to physically or virtually guide the robot through hundreds of trajectories. The cost per demonstration is high, and the diversity of data is limited by human patience and dexterity. The DROID dataset, for example, contains over 350 hours of teleoperated data across 80 tasks, but even this massive effort covers a tiny fraction of the possible real-world interactions.

Fan's alternative is the world model approach, most prominently embodied in NVIDIA's Cosmos platform and the Isaac Gym simulator. A world model is a neural network that learns a latent representation of the environment's state and a transition function that predicts how that state evolves under different actions. Instead of learning 'if I see a cup, move gripper to (x,y,z),' the robot learns 'if I apply force vector F to this object, it will accelerate according to its mass and friction.' This is a causal model of physics. The robot can then use this model for planning: it can simulate thousands of possible action sequences in its 'imagination,' evaluate which one leads to the goal state, and execute that plan. This is the same principle behind DeepMind's Dreamer and MuZero algorithms, but applied to physical robotics.
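
The plan-in-imagination loop can be sketched in a few lines. This is a toy 1-D 'push the block to the goal' task in which known dynamics stand in for a learned transition network; it illustrates the random-shooting idea, not the actual Cosmos or Dreamer implementation.

```python
import random

# World-model planning sketch: roll out candidate action sequences inside
# the model ("imagination"), keep the sequence whose predicted final state
# lands nearest the goal, and only then act.

def transition(state, action):
    """Predicted next state: the push is damped by a friction coefficient.
    In practice this would be a learned neural dynamics model."""
    return state + 0.9 * action

def plan(state, goal, horizon=5, n_rollouts=2000):
    """Random-shooting planner over imagined rollouts."""
    best_cost, best_actions = float("inf"), None
    for _ in range(n_rollouts):
        actions = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in actions:
            s = transition(s, a)  # simulate, don't act
        if abs(s - goal) < best_cost:
            best_cost, best_actions = abs(s - goal), actions
    return best_actions, best_cost

random.seed(0)
actions, imagined_error = plan(state=0.0, goal=2.0)
print(f"imagined final error: {imagined_error:.3f}")
```

The inference-time cost is visible even in the toy: 2,000 rollouts of 5 steps each for one decision, which is exactly the planning compute the table below attributes to the world-model column.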

The engineering challenge is immense. Building a world model that is accurate enough for real-world manipulation requires capturing complex dynamics: rigid body physics, soft body deformation, fluid dynamics, contact forces, and more. NVIDIA's approach is to use a 'neural physics' model trained on massive amounts of simulation data from Isaac Sim. The model is then fine-tuned with a small amount of real-world data to correct for the 'sim-to-real' gap. This is a fundamentally different scaling law: instead of scaling human demonstrations, you scale simulation compute.
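
The 'scale simulation, not demonstrations' recipe is typically paired with domain randomization to narrow the sim-to-real gap: physics parameters are resampled every episode so the learned model cannot overfit a single simulator configuration, and reality hopefully falls inside the sampled range. A hedged sketch with invented parameter ranges and a toy episode in place of a real Isaac Sim scene:

```python
import random

# Illustrative domain-randomization loop. The ranges and the push dynamics
# are invented for this sketch; a real pipeline would perturb parameters of
# a full physics simulator instead.

def sample_physics():
    """Resample physical parameters for each episode."""
    return {
        "friction": random.uniform(0.3, 1.2),
        "mass_kg": random.uniform(0.05, 0.5),
        "sensor_noise": random.gauss(0.0, 0.01),
    }

def simulate_push(params, force_n=1.0, dt=0.1):
    """Toy episode: velocity gained by a pushed object under these params."""
    accel = force_n / params["mass_kg"] - 9.81 * params["friction"]
    return max(0.0, accel * dt) + params["sensor_noise"]

random.seed(0)
episodes = [simulate_push(sample_physics()) for _ in range(1000)]
print(f"velocity range seen in training: {min(episodes):.2f}..{max(episodes):.2f}")
```

The small real-world dataset mentioned above then only has to correct residual bias, not teach the dynamics from scratch.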

| Approach | Data Source | Scaling Bottleneck | Generalization to Novel Scenes | Compute Cost at Inference |
|---|---|---|---|---|
| VLA (RT-2, PaLM-E) | Teleoperated demos | Human data collection | Poor (fails on distribution shift) | Low (single forward pass) |
| Teleoperation (ALOHA, DROID) | Human-guided trajectories | Human time & cost | N/A (data collection method) | N/A |
| World Model (Cosmos, Dreamer) | Simulation + small real-world | Simulation fidelity & compute | High (causal reasoning) | High (requires planning rollouts) |
| Behavior Cloning (Diffusion Policy) | Teleoperated demos | Data diversity | Moderate (smooths trajectories) | Low |

Data Takeaway: The table reveals a clear trade-off. VLA and teleoperation offer low inference cost but suffer from scaling and generalization issues. World models promise superior generalization but at a much higher computational cost during inference, requiring powerful GPUs for real-time planning. This is precisely where NVIDIA's hardware advantage becomes a moat.

A notable open-source project in this space is Genesis (github: Genesis-Embodied-AI/Genesis), a universal physics engine designed for robot learning. It has amassed over 15,000 stars on GitHub and provides a Python-native platform for generating massive amounts of simulated training data. Another is MuJoCo (github: google-deepmind/mujoco), the de facto standard for physics simulation in robotics research, which recently added support for differentiable physics, enabling gradient-based optimization of control policies. These tools are the building blocks for the world model paradigm.

Key Players & Case Studies

The battle for the future of robot learning is being fought on multiple fronts. NVIDIA, under Jim Fan's leadership, is the most aggressive proponent of the world model approach. Their strategy is to build the 'operating system' for physical AI: Omniverse for simulation, Cosmos for world model training, Isaac for robot control, and Jetson/Orin for edge deployment. This is a vertical integration play that mirrors their dominance in AI training infrastructure.

Google DeepMind, meanwhile, is hedging its bets. They pioneered the VLA approach with RT-2 and are now exploring a hybrid path. Their AutoRT system uses a large language model (LLM) to propose tasks and a VLA model to execute them, but they are also investing in Genie, a foundational world model for 2D game environments, and DreamerV3, which learns world models from pixels. DeepMind's approach is more conservative: they see VLA as a practical short-term solution and world models as a long-term research direction.

Tesla, under Elon Musk, is taking a radically different approach. Their Optimus robot relies heavily on teleoperation for data collection, with human operators wearing VR headsets to control the robot. Tesla's bet is that they can collect data at unprecedented scale by deploying thousands of robots in their factories, each generating teleoperation data. This is a brute-force scaling approach that directly contradicts Fan's thesis. The outcome of this bet will be a critical test: can teleoperation data scale if you have enough humans and robots?

A new wave of startups is also emerging. Physical Intelligence (π), founded by former Google and Berkeley researchers, is building a foundation model for robotics that combines VLA with world model components. Their π0 model is a diffusion-based policy that can perform multiple tasks, but it still relies on large-scale teleoperation data. Covariant, another prominent startup, uses a more traditional reinforcement learning approach with simulation, but their core product, the Covariant Brain, is still a VLA-like system that maps perception to action.

| Company/Entity | Core Approach | Key Product/Platform | Reliance on Teleoperation | World Model Investment |
|---|---|---|---|---|
| NVIDIA | World Model + Simulation | Omniverse, Cosmos, Isaac | Low (sim-first) | Very High |
| Google DeepMind | Hybrid VLA + World Model | RT-2, AutoRT, DreamerV3 | Medium | High |
| Tesla | Teleoperation at Scale | Optimus | Very High | Low |
| Physical Intelligence | Diffusion Policy + VLA | π0 | High | Medium |
| Covariant | RL + VLA | Covariant Brain | Medium | Low |

Data Takeaway: The competitive landscape is fragmented, but NVIDIA's bet is the most radical and highest-risk. If world models prove feasible, NVIDIA will own the infrastructure. If teleoperation scales better than expected, Tesla and the VLA startups will have a head start. The next 12-18 months will be decisive.

Industry Impact & Market Dynamics

Fan's declaration is already reshaping investment and research priorities. Venture capital funding for robotics startups reached $6.4 billion in 2024, according to data from PitchBook, with a growing share going to 'foundation model' companies. However, the market is bifurcating. Investors are now asking hard questions: is your company's approach scalable beyond the demo? If you rely on teleoperation, what is your cost per task? If you use VLA, what is your generalization rate on unseen objects?

The market for robot simulation software is projected to grow from $1.2 billion in 2024 to $4.8 billion by 2030, driven by the shift toward world models. NVIDIA's Omniverse is the dominant platform, but lower-cost options are gaining traction, including NVIDIA's own Isaac Sim (free for research) and the open-source Genesis. This creates a classic platform play: NVIDIA wants to be the AWS of robot learning, charging for compute and simulation time.

| Metric | 2024 Value | 2030 Projection | CAGR |
|---|---|---|---|
| Global Robotics VC Funding | $6.4B | $12.1B | 11.2% |
| Robot Simulation Software Market | $1.2B | $4.8B | 26.0% |
| Number of Teleoperation Startups | ~45 | ~20 (est.) | -12.5% |
| Number of World Model Startups | ~12 | ~50 (est.) | 27.0% |

Data Takeaway: The market is voting with its dollars. Simulation and world model approaches are growing at over 25% CAGR, while teleoperation-centric startups are expected to consolidate or pivot. The number of pure teleoperation startups is projected to halve by 2030.
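
The CAGR column above can be sanity-checked from the 2024 values and 2030 projections over the six-year horizon:

```python
# Verify the growth rates implied by the 2024 -> 2030 figures in the table.

def cagr(start, end, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1 / years) - 1

print(f"simulation software market: {cagr(1.2, 4.8, 6):.1%}")  # ~26.0%
print(f"global robotics VC funding: {cagr(6.4, 12.1, 6):.1%}")  # ~11.2%
```

Both match the table; the startup-count rows use rounded estimates, so their CAGRs are approximate by construction.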

Risks, Limitations & Open Questions

The world model paradigm is not without its own risks. The most significant is the sim-to-real gap. No matter how good a simulation is, it is an approximation of reality. Small discrepancies in friction, mass distribution, or lighting can cause a world model to fail catastrophically in the real world. NVIDIA's Cosmos attempts to bridge this gap by training on massive amounts of real-world video, but this is an unsolved research problem.

Second, the computational cost of world model planning is immense. Running thousands of rollouts in a neural physics model requires a GPU cluster. For a robot that needs to react in milliseconds, this is a non-starter. Real-time planning with world models is an active area of research, with techniques like 'latent planning' and 'model predictive control with learned dynamics' showing promise, but no production-ready solution exists.

Third, there is a data diversity problem for world models. While simulation can generate infinite data, it is often 'boring' data—the robot picking up the same cup in the same way. To learn robust world models, you need diverse failure modes: objects slipping, collisions, unexpected forces. Generating this diversity requires clever curriculum learning and adversarial scenario generation, which adds complexity.
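
One common way to manufacture that diversity is an automatic difficulty curriculum: track the success rate and adjust a scenario-difficulty knob so training stays at the edge of the agent's competence rather than replaying easy episodes. The sketch below is illustrative; the success model, thresholds, and step sizes are invented.

```python
import random

# Toy difficulty curriculum: raise the perturbation scale when the agent
# succeeds too often, lower it when it fails too often. In a real pipeline
# the scale would control scene randomization, not a coin flip.

def run_episode(perturbation):
    """Toy episode: success probability drops as perturbations grow."""
    return random.random() > perturbation  # True = success

def curriculum(n_episodes=500, target_success=0.7):
    """Keep measured success near the target by adapting difficulty."""
    scale, successes, history = 0.1, 0, []
    for i in range(1, n_episodes + 1):
        successes += run_episode(scale)
        if i % 50 == 0:  # adapt every 50 episodes
            rate = successes / 50
            scale += 0.05 if rate > target_success else -0.05
            scale = min(max(scale, 0.0), 1.0)
            history.append((scale, rate))
            successes = 0
    return history

random.seed(0)
history = curriculum()
print(f"final perturbation scale: {history[-1][0]:.2f}")
```

Adversarial scenario generation pushes the same idea further by searching for the specific perturbations the current model gets wrong, at the cost of yet more machinery.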

Finally, there is an ethical concern: if robots learn entirely in simulation, they may develop behaviors that are optimal in the simulator but dangerous in the real world. A robot that has never experienced a fragile human hand might plan a trajectory that causes injury. This is a safety-critical issue that demands rigorous validation before deployment.

AINews Verdict & Predictions

Jim Fan is right about the direction of travel, but he is early. VLA and teleoperation are not 'dead' today, but they are dying. The next two years will see a rapid migration toward simulation-first, world-model-centric approaches. Here are our specific predictions:

1. By 2026, at least three major VLA-focused startups will pivot to world models or be acquired. The cost of teleoperation data collection will become unsustainable as investors demand ROI. Companies like Physical Intelligence will need to show a clear path to reducing their reliance on human demonstrations.

2. NVIDIA will release a 'world model as a service' product by mid-2026. This will be a cloud API that allows any robotics company to upload a robot model and task specification, and receive a trained world model policy. This will be the 'GPT moment' for robotics.

3. Tesla's teleoperation-heavy approach will hit a scaling wall by 2027. The cost of maintaining a human-in-the-loop for millions of robots will prove prohibitive, and Tesla will be forced to invest heavily in simulation, potentially licensing NVIDIA's technology.

4. The open-source ecosystem will converge around a single world model framework. Genesis or a fork of it will become the 'PyTorch of robotics,' with a thriving community of researchers contributing physics models and pre-trained world models.

5. The first commercial robot that uses a world model for real-time manipulation will be deployed in a warehouse by Q4 2026. It will be a pick-and-place robot that can handle 95% of novel objects without prior training, outperforming all VLA-based competitors.

The bottom line: Jim Fan's provocation is a strategic move to accelerate the transition to a paradigm that benefits NVIDIA's core business. But he is also correct on the technical merits. The era of 'monkey see, monkey do' robot learning is ending. The era of 'robot think, robot act' is beginning.


Further Reading

- DeepSeek lead author joins DeepRoute to build a VLA model, boosting R&D efficiency 10x
- World model on a chip: how 500 TOPS rewrites the rules of autonomous driving
- DexWorldModel's rise signals AI's pivot from virtual prediction to physical control
- A 100,000-hour Chinese human-behavior dataset opens a new era in teaching robots common sense
