Xpeng's VLA 2.0 OKRs Reveal the Next Phase of Autonomous Driving's Evolution

March 2026
Through a set of ambitious OKRs, Xpeng Motors CEO He Xiaopeng has publicly laid out the development blueprint for the company's second-generation Vision Language Action (VLA) model. These goals fundamentally challenge the status quo of autonomous driving, pushing the industry toward a future defined by end-to-end AI.

In a recent live stream, Xpeng Motors founder and CEO He Xiaopeng outlined multiple Objectives and Key Results (OKRs) for the company's next-generation Vision Language Action (VLA) model. This public declaration is more than a product roadmap; it's a strategic manifesto for the next phase of autonomous driving competition. The core goals center on three transformative leaps: achieving robust urban navigation without reliance on high-definition (HD) maps, dramatically improving performance in complex and long-tail driving scenarios, and slashing the system cost to enable mass-market adoption of high-level intelligent driving features.

This move signals a decisive shift from rule-based, modular autonomous driving stacks toward end-to-end AI systems. The first-generation VLA, which powers Xpeng's current XNGP advanced driver-assistance system, already represents a step in this direction by unifying perception, prediction, and planning within a single neural network framework. The second-generation model's OKRs explicitly target the limitations of its predecessor and the broader industry's reliance on HD maps—an expensive and difficult-to-scale crutch.

He Xiaopeng's framing of each OKR as a "questioning of 'what could be better'" underscores a philosophy of relentless iteration. The technical ambition is clear: to create a driving AI that generalizes more like a human, using primarily visual and linguistic understanding of the environment rather than pre-mapped precision. The commercial implication is equally significant: reducing system cost is a prerequisite for moving from a premium differentiator to a standard feature, a critical battleground in the fiercely competitive Chinese EV market. This announcement provides a rare, transparent look into the priority stack of a leading automaker at the precise moment when AI-first architecture is set to redefine the industry.

Technical Deep Dive

The evolution from Xpeng's first-generation VLA to the targeted second-generation model represents a fundamental architectural shift. The first-gen VLA, as deployed in XNGP, is a large-scale multimodal model that processes raw sensor data (predominantly camera, lidar, radar) and outputs structured driving actions. It uses a transformer-based architecture to create a unified vector space representing the driving scene, fusing visual features with semantic language queries (e.g., "identify the drivable area," "predict pedestrian trajectory"). However, it still operates in conjunction with, and is often constrained by, traditional rule-based planning modules and a heavy reliance on HD map priors.
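Xpeng has not published the VLA's internals, but the "unified vector space" idea can be illustrated with a minimal, self-contained sketch: language-derived semantic queries cross-attending over flattened visual features. All shapes, names, and numbers below are assumptions for illustration, not the actual architecture.

```python
import numpy as np

def cross_attention(queries, visual_tokens):
    """Scaled dot-product cross-attention: each semantic query
    (e.g. a 'drivable area' embedding) pools information from the
    per-patch camera features into a shared vector space.

    queries:       (Q, d) language-derived query embeddings
    visual_tokens: (T, d) flattened visual features
    returns:       (Q, d) one fused scene feature per query
    """
    d = queries.shape[-1]
    scores = queries @ visual_tokens.T / np.sqrt(d)   # (Q, T) similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over tokens
    return weights @ visual_tokens                    # attention-weighted pooling

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 64))     # two hypothetical semantic queries
v = rng.normal(size=(100, 64))   # 100 hypothetical visual tokens
out = cross_attention(q, v)
print(out.shape)  # (2, 64)
```

In a production stack this would be one layer among many, with learned projections for queries, keys, and values; the sketch keeps only the core fusion mechanism.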

The second-generation VLA's OKRs point to a purer end-to-end approach. The goal of "HD-map-free city navigation" necessitates a model that builds its own persistent, online spatial understanding—a neural scene representation or a "neural map." This likely involves innovations in Bird's Eye View (BEV) transformation with temporal fusion, where consecutive camera frames are integrated to create a dynamic, ego-centric understanding of the 3D environment. Crucially, this representation must be imbued with semantic meaning (lanes, traffic lights, crosswalks) inferred on the fly, a task that leans heavily on vision-language pre-training.
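As a hedged illustration of temporal BEV fusion (not Xpeng's implementation), the sketch below warps the previous frame's ego-centric grid to compensate for ego motion, then blends it with the current frame. The grid size, cell shift, and blending weight are invented for the example.

```python
import numpy as np

def shift2d(grid, dy, dx):
    """Shift a 2-D grid by (dy, dx) cells, zero-filling newly revealed
    cells. This stands in for warping the previous BEV frame by ego
    motion (translation only; a real system also handles rotation)."""
    out = np.zeros_like(grid)
    h, w = grid.shape
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0,  dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0,  dx), min(w, w + dx))
    out[dst_y, dst_x] = grid[src_y, src_x]
    return out

def fuse_bev(current, previous, dy, dx, alpha=0.5):
    """Exponential blend of the motion-compensated history with the
    current observation: a minimal form of temporal fusion."""
    return alpha * current + (1 - alpha) * shift2d(previous, dy, dx)

prev = np.zeros((8, 8)); prev[2, 2] = 1.0   # obstacle seen last frame
curr = np.zeros((8, 8)); curr[3, 2] = 1.0   # same obstacle after ego moved 1 cell
fused = fuse_bev(curr, prev, dy=1, dx=0)
print(fused[3, 2])  # 1.0 -- past and present evidence agree after warping
```

Real systems fuse learned feature maps rather than occupancy scalars, but the principle is the same: align history into the current ego frame before accumulating evidence.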

Improving "complex scenario handling" targets the notorious long-tail problem. This requires moving beyond pattern recognition to causal reasoning and counterfactual prediction. The model must answer questions like, "If that scooter swerves, what are my safe options?" Techniques from reinforcement learning (RL), particularly offline RL and world models, are likely being integrated. Here, simulation becomes paramount. Xpeng's proprietary simulation platform, which can generate millions of corner-case scenarios, will be used to train and stress-test the VLA's decision-making policies in a safe, scalable manner.
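A toy sketch of the counterfactual-rollout idea described above: propagate a world model forward under the hypothetical "scooter swerves" event and score candidate ego plans by worst-case separation. The dynamics here are hand-coded constant velocities and every number is invented; in practice the world model is learned and the rollouts run inside simulation.

```python
import numpy as np

def rollout(ego_pos, ego_vel, scooter_pos, scooter_swerves, steps=10, dt=0.1):
    """Toy world-model rollout: propagate ego and scooter forward with
    constant velocities and return the minimum separation (metres)
    over the horizon."""
    ego = np.array(ego_pos, dtype=float)
    sc = np.array(scooter_pos, dtype=float)
    ego_v = np.array(ego_vel, dtype=float)
    # Counterfactual branch: the scooter either holds course or swerves.
    sc_v = np.array([2.0, 5.0]) if scooter_swerves else np.array([0.0, 5.0])
    min_gap = float("inf")
    for _ in range(steps):
        ego += ego_v * dt
        sc += sc_v * dt
        min_gap = min(min_gap, float(np.linalg.norm(ego - sc)))
    return min_gap

# Score candidate ego plans (velocity vectors) under "scooter swerves".
candidates = {"keep_lane": (0.0, 6.0), "brake": (0.0, 2.0), "nudge_left": (-1.5, 6.0)}
safety = {name: rollout((0.0, 0.0), v, (3.0, 10.0), scooter_swerves=True)
          for name, v in candidates.items()}
best = max(safety, key=safety.get)
print(best)  # the plan with the largest worst-case gap
```

The decision-relevant output is not a single trajectory but a ranking of plans by robustness to the counterfactual, which is what distinguishes reasoning from pure pattern matching.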

The push for lower cost is not just about cheaper chips; it's about algorithmic efficiency. A more capable, generalizable model might actually require fewer computational resources for inference than a brittle, rule-heavy stack that requires constant exception handling. Key techniques include model distillation (creating a smaller, faster student model from a large teacher), sparsity (activating only parts of the network for a given input), and novel attention mechanisms that reduce quadratic complexity.
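Of these techniques, model distillation is the most standardized. A minimal numpy sketch of the classic temperature-softened objective follows; the logits are illustrative toy values, not anything from a production driving model.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Hinton-style distillation: KL(teacher || student) computed on
    temperature-softened distributions and scaled by T^2, so gradient
    magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, T)        # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

teacher  = [4.0, 1.0, -2.0]               # large "teacher" model's logits
aligned  = [3.9, 1.1, -2.0]               # student that mimics the teacher
diverged = [0.0, 3.0, 1.0]                # student that does not
print(distillation_loss(aligned, teacher) < distillation_loss(diverged, teacher))  # True
```

The temperature exposes the teacher's "dark knowledge" (its relative confidence across wrong answers), which is exactly the signal a smaller inference-time model needs to inherit.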

| Technical Feature | VLA 1.0 (Current XNGP) | VLA 2.0 OKR Targets |
|---|---|---|
| Map Dependency | Heavy reliance on pre-built HD maps for localization and semantics. | Primarily vision-based, with HD maps as an optional fallback or for validation. |
| Architecture | Large multimodal model + traditional planning/control modules. | Truer end-to-end, with planning deeply integrated into the neural network's output. |
| Core Training Data | Supervised learning on millions of real-world miles + simulation. | Massive-scale simulation for long-tail scenarios + reinforcement learning. |
| Compute Focus | High inference compute for perception; rule-based planning. | Optimized for efficient inference of a unified model; potential for smaller, specialized models. |
| Key Innovation | Unified vector space for perception and prediction. | Online neural scene representation & causal reasoning for action. |

Data Takeaway: The transition from VLA 1.0 to 2.0 is a move from an AI-assisted, map-dependent system to a self-reliant, reasoning-centric AI driver. The technical leap is less about raw parameter count and more about architectural purity and training paradigm shifts.

Key Players & Case Studies

Xpeng is not operating in a vacuum. Its VLA 2.0 OKRs are a direct response to and an attempt to leapfrog a global competitive landscape.

Tesla's Full Self-Driving (FSD) V12 is the most prominent benchmark for an end-to-end neural network driving system. Tesla eliminated over 300,000 lines of explicit C++ code for planning and control, replacing it with a single neural network trained on millions of video clips. Its performance, particularly in complex urban settings, demonstrates the potential of the approach. However, its opacity and occasional unpredictable behaviors highlight the risks. Xpeng's strategy appears to be a more hybrid and potentially cautious path, aiming for Tesla's generalization capability while possibly retaining more verifiable safety layers.

Wayve (UK) and Waabi (Canada) are pure-play AI driving startups championing end-to-end, learned models. Wayve's "AV2.0" manifesto and its GAIA-1 world model for generative simulation align closely with the spirit of Xpeng's OKRs. Waabi's focus on a closed-loop, simulation-first training paradigm using its probabilistic world model is precisely the kind of technology needed to tackle the long-tail scenarios Xpeng has targeted.

Within China, NIO is advancing its own full-stack technology, with a strong emphasis on its proprietary NIO Adam supercomputing platform for training. Li Auto has surprised the industry with the rapid deployment and user adoption of its AD Max 3.0 system, which also utilizes an end-to-end BEV perception model. The competition is forcing rapid iteration.

| Company / Product | Core Technical Approach | Strategic Focus vs. Xpeng VLA 2.0 |
|---|---|---|
| Xpeng XNGP / VLA 2.0 | Vision-Language-Action unified model, moving to end-to-end. | Balancing AI purity with safety verification; explicit cost-reduction OKR. |
| Tesla FSD V12 | Pure vision, end-to-end neural network. | Radical simplicity; bet on massive real-world data scale over simulation. |
| Wayve AV2.0 | End-to-end driving from pixels to actions; generative world models. | Research-led, seeking fundamental breakthroughs in generalization. |
| Li Auto AD Max 3.0 | BEV transformer perception + rule-based planning. | Fast execution and user-centric feature deployment over architectural revolution. |

Data Takeaway: The competitive field is bifurcating between radical end-to-end purists (Tesla, Wayve) and pragmatic integrators (Xpeng, Li Auto). Xpeng's public OKRs place it closer to the purists in ambition but suggest a pragmatic integration timeline, with cost being a uniquely explicit and critical metric.

Industry Impact & Market Dynamics

The successful execution of Xpeng's VLA 2.0 OKRs would trigger a cascade of changes across the automotive and mobility industries.

First, it would democratize high-level autonomous driving. The single largest barrier to widespread adoption of features like Urban Navigate on Autopilot (NOA) is cost—not just the consumer price, but the OEM's Bill of Materials (BOM). Expensive lidar sensors, high-definition mapping subscriptions, and powerful, energy-hungry compute platforms are prohibitive. A vision-dominant, efficient AI model could reduce the sensor suite and compute requirements, potentially cutting the system cost by 50% or more. This makes it viable for mid-range and eventually economy vehicles.

Second, it reshapes the geographic rollout strategy. Today, deploying Urban NOA in a new city requires months of meticulous HD mapping. A capable HD-map-free system could, in theory, be activated anywhere overnight, instantly expanding the serviceable area from dozens to hundreds or thousands of cities globally. This turns autonomous driving from a geographically gated feature into a standard vehicle capability.

The business model shifts from selling hardware to monetizing software and data. The lower hardware BOM increases vehicle margin, while the software—the AI driver itself—becomes a recurring revenue stream via subscriptions (like Tesla's FSD). Furthermore, the fleet of vehicles becomes a data collection and training engine. Each intervention, disengagement, or complex scenario handled successfully becomes valuable training data to improve the central VLA model, creating a powerful network effect.

| Market Impact Dimension | Current State (2024) | Post-VLA 2.0 Adoption (2026-2028 Projection) |
|---|---|---|
| Avg. System Cost (OEM BOM) | $2,000 - $3,500 (lidar, high-end compute, map licensing) | Target: < $1,000 (vision-focused, efficient compute, minimal mapping) |
| Urban NOA City Coverage (China) | ~50-100 cities (map-dependent) | Potential: 200+ cities (map-free or light-map) |
| Penetration in New EVs | ~15-20% (mostly premium models) | Potential: 40-60% (across mid-range segments) |
| Primary Revenue Model | Bundled with vehicle sale (one-time) | Increased mix of software subscriptions (recurring) |

Data Takeaway: The economic model of autonomous driving flips with successful cost reduction and capability generalization. Market size explodes, but competition on software performance and pricing intensifies, moving the battleground from sensor specs to AI algorithm efficacy.

Risks, Limitations & Open Questions

The path outlined by these OKRs is fraught with technical and ethical challenges.

The foremost risk is the black box problem. As systems become more end-to-end, explaining *why* the AI made a specific decision becomes exponentially harder. This is a severe regulatory and liability hurdle. How does Xpeng plan to provide the necessary safety assurance certificates to authorities when the decision-making process is an inscrutable matrix multiplication? Techniques like attention visualization and causal tracing are nascent.
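To make "attention visualization" concrete, here is a deliberately simple, hypothetical probe: rank input tokens by the softmax attention weight a decision query assigns them. Real explainability tooling is far more involved; the feature vectors and labels below are invented for determinism.

```python
import numpy as np

def attention_attribution(decision_query, tokens, labels):
    """Rank input tokens by the softmax attention weight the decision
    query assigns them -- a crude 'why did it look there?' probe."""
    d = decision_query.shape[-1]
    scores = tokens @ decision_query / np.sqrt(d)
    scores -= scores.max()                # numerical stability
    w = np.exp(scores)
    w /= w.sum()
    order = np.argsort(w)[::-1]           # most-attended first
    return [(labels[i], float(w[i])) for i in order]

# Hand-built features so the example is deterministic: the decision
# query is deliberately aligned with the "pedestrian" token.
tokens = np.array([
    [1.0, 0.0, 0.0, 2.0],   # pedestrian
    [0.0, 1.0, 0.0, 0.0],   # traffic_light
    [0.0, 0.0, 1.0, 0.0],   # lane_marking
])
query = np.array([1.0, 0.1, 0.1, 2.0])
ranking = attention_attribution(query, tokens,
                                labels=["pedestrian", "traffic_light", "lane_marking"])
print(ranking[0][0])  # pedestrian
```

A key caveat, well documented in the XAI literature: high attention weight correlates with, but does not prove, causal influence on the decision, which is why causal tracing is listed alongside it.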

The simulation-to-reality gap is a fundamental limitation. While training on billions of simulated miles is necessary to cover long-tail scenarios, performance in the real world is not guaranteed. The generative models used to build those simulations must themselves be highly accurate and diverse; a flaw in the simulation's world model could produce catastrophic blind spots in the real-world VLA.

Data dependency and bias present another risk. The model's performance will reflect its training data. If the data is predominantly from Chinese road environments and driving styles, will it generalize safely to Europe or North America? Furthermore, edge cases involving rare vehicle types, extreme weather conditions, or ambiguous traffic agent behavior (e.g., a traffic director's hand signals) may remain problematic.

Finally, the cost reduction OKR could backfire if pursued too aggressively. Sacrificing a redundant sensor like lidar for pure vision may improve economics but could reduce robustness in edge-case weather (heavy fog, blinding sun). The quest for lower compute might lead to over-compressed models that lose critical reasoning capabilities. The balance between cost, performance, and safety is perilous.

Open questions remain: Will regulators accept neural network outputs as sufficient evidence of safety? Can a unified VLA model ever be as predictable and verifiable as a rule-based system in critical scenarios? How will insurance models adapt to AI drivers whose decision logic cannot be fully audited?

AINews Verdict & Predictions

He Xiaopeng's public VLA 2.0 OKRs are a masterstroke of strategic communication. They signal deep technical confidence, set clear expectations for investors and customers, and apply public pressure on the R&D organization. More importantly, they correctly identify the three pillars—capability, generalization, and cost—that will define the winner in the next phase of autonomous driving.

Our editorial judgment is that while the goals are exceptionally ambitious, they point to the correct north star. The era of stitching together perception, prediction, and planning modules is ending. The future belongs to holistic, learned driving models. Xpeng's pragmatic approach—evolving its architecture while maintaining a focus on verifiable deployment—may give it an advantage over more radical, but potentially reckless, competitors.

Specific Predictions for the Next 12-18 Months:

1. We will see the first limited, beta deployment of an HD-map-free Urban NOA feature from Xpeng in at least one major Chinese city by Q1 2025. It will initially operate in a tightly geofenced area and require extensive driver monitoring, but its mere existence will be a watershed moment.
2. The cost of Xpeng's XNGP system will drop by at least 30% in the next vehicle refresh cycle, achieved through a combination of sensor rationalization (e.g., reducing lidar count) and more efficient compute chip selection, directly enabled by a more capable VLA model.
3. A public incident involving ambiguous decision-making by an end-to-end system (from any major player) will trigger a regulatory pause and intense scrutiny on AI explainability in late 2024 or 2025. This will force companies, including Xpeng, to invest heavily in explainable AI (XAI) tools for their VLAs, potentially slowing rollout speed but improving long-term trust.
4. The competition will shift from "who has the most cities with HD maps" to "who has the lowest disengagement rate in un-mapped, complex urban environments." Benchmarking datasets focused on long-tail scenario handling will become the new currency of technical marketing.

The key metric to watch is not the number of OKRs achieved, but the rate of closure between the performance of VLA 2.0 in mapped versus unmapped territories. When that gap becomes negligible, the revolution He Xiaopeng is betting on will have truly begun.


Further Reading

Yuanrong Qixing Proposes Foundation Model to Overhaul Autonomous Driving at GTC. At NVIDIA's GTC conference, autonomous driving company Yuanrong Qixing unveiled a vision to rebuild the entire autonomous driving system around foundation AI models.

Beyond the Hype: How Foundation Models Are Reshaping the Core of Autonomous Driving. The industry is entering a more pragmatic phase; as large models and "world models" draw the headlines, the real battleground is shifting to the underlying architecture, a strategic pivot highlighted by a key demonstration at NVIDIA GTC.

Japan's Four-Giant AI Alliance: Can SoftBank, Honda, Sony, and NEC Overcome Their Historical Baggage? The four Japanese industrial giants have upended conventional partnership models by forming an AI alliance with equal stakes and no designated leader, a strategic bid to reclaim Japan's position in the global AI race.

The Agentic AI Revolution Demands New Chips, Billions in Capital, and Operational Resilience. The AI industry is undergoing a fundamental shift from passive models to proactive, goal-driven agents, a transition that is triggering billions of dollars of investment in next-generation chips and infrastructure while exposing critical operational vulnerabilities.
