Meituan's Radical Bet on Native Multimodal AI: Tokenizing the Physical World

Meituan is pursuing a radical, unified approach to multimodal AI that could redefine its entire local services ecosystem. By converting images, speech, and text into a single stream of discrete tokens for a foundational model to predict, the company aims to build a 'world model' for its operations. This technical gamble represents a fundamental shift from aligned multimodal systems toward native, cross-modal understanding and generation.

Meituan, China's dominant local services platform, is making a significant strategic investment in what it internally terms 'native multimodal' artificial intelligence. This initiative represents a departure from the prevailing industry paradigm where separate models for vision, language, and audio are trained independently and then aligned or fused. Instead, Meituan's research teams are pursuing a more foundational and computationally ambitious path: discretizing all sensory inputs—whether a pixel patch from a street scene, a mel-spectrogram frame from a customer call, or a Chinese character—into a common vocabulary of tokens. These tokens are then fed into a single, massive transformer-based model trained with a next-token prediction objective, identical in spirit to a pure text-based large language model (LLM) but with a 'vocabulary' encompassing the physical world.

The immediate product vision is to create a unified AI agent capable of seamless reasoning across modalities within Meituan's vast ecosystem. Imagine a customer service bot that doesn't just parse text complaints but analyzes a user-submitted photo of a damaged delivered item, cross-references the order history, and generates an empathetic voice response—all within a single, coherent reasoning chain. For logistics, a drone or autonomous delivery vehicle could process LiDAR point clouds, camera feeds, and real-time weather audio reports using the same core model, enabling more robust and context-aware navigation. The long-term ambition is even grander: to construct a predictive 'world model' of Meituan's operational domain—the streets, restaurants, warehouses, and homes it serves—allowing for unprecedented optimization of real-time resource allocation, inventory prediction, and personalized user engagement.

This technical direction, while unproven at scale, signals Meituan's intent to move beyond being a transactional platform and become an intelligent, perception-driven infrastructure layer for urban life. Success would create formidable operational moats, while failure could consume vast R&D resources with limited near-term payoff. The project places Meituan in direct, albeit early-stage, competition with global AI labs like Google's DeepMind (with its Pathways architecture) and OpenAI (exploring multimodal GPTs) on one of the field's most challenging frontiers.

Technical Deep Dive

At its core, Meituan's approach is an aggressive implementation of the "Everything is a Token" hypothesis. The technical workflow involves several novel and challenging components:

1. Unified Tokenization: This is the first and most critical bottleneck. For text, standard subword tokenizers (like SentencePiece) are used. For images, the team is likely leveraging advanced vision tokenizers. One strong candidate is the Vector-Quantized Variational Autoencoder (VQ-VAE) or its more recent successor, VQ-GAN. These models compress image patches into discrete codes from a learned codebook. A promising open-source reference is the taming-transformers GitHub repository, which implements VQ-GAN and has garnered over 4,600 stars. For audio/speech, similar discretization is achieved through SoundStream or EnCodec-style neural codecs, which convert raw waveform segments into discrete audio tokens. The monumental challenge is creating codebooks where tokens across modalities share a semantically meaningful latent space, enabling cross-modal prediction.
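To make the VQ-style quantization step concrete, the sketch below snaps continuous patch embeddings to their nearest codebook entries and emits the entry indices as discrete image tokens. The codebook size, latent dimension, and values are arbitrary stand-ins for illustration, not Meituan's actual configuration.

```python
import numpy as np

# Hypothetical sketch of VQ-VAE/VQ-GAN tokenization: each image patch embedding
# is snapped to its nearest entry in a learned codebook, and the entry's index
# becomes the discrete "image token". Codebook values here are random stand-ins.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))      # 512 codes, 64-dim latent space

def quantize(patch_embeddings: np.ndarray) -> np.ndarray:
    """Map (N, 64) continuous patch embeddings to (N,) discrete token ids."""
    # Squared Euclidean distance from every patch to every codebook entry.
    d = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                # nearest-neighbour code index

patches = rng.normal(size=(16, 64))        # 16 patches from one image crop
tokens = quantize(patches)
print(tokens.shape)                        # (16,) — one discrete token per patch
```

In a trained VQ-GAN the codebook is learned jointly with the encoder and decoder; the lookup itself, however, is exactly this nearest-neighbour operation.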

2. Architecture & Training: The model architecture is presumed to be a decoder-only transformer, akin to GPT, but of colossal scale, potentially exceeding 1 trillion parameters. The training objective is straightforward autoregressive prediction: given a sequence of mixed-modal tokens (e.g., `[text_token_1, image_token_45, audio_token_12, text_token_2]`), predict the next token in the sequence, regardless of its modality. This requires the model to internalize the joint probability distribution P(token | all previous tokens). Training data consists of interleaved sequences from Meituan's proprietary datasets: billions of food images with descriptions, millions of hours of customer service voice recordings with transcripts, real-time logistics sensor data, and map trajectories.
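The mixed-modal sequence construction can be sketched as follows. The vocabulary offsets (`IMAGE_BASE`, `AUDIO_BASE`) and token ids are illustrative assumptions about how disjoint id ranges of a shared vocabulary might be assigned; the source describes only the general autoregressive objective.

```python
import numpy as np

# Illustrative sketch (not Meituan's published scheme): tokens from each
# modality are mapped into disjoint id ranges of one shared vocabulary,
# interleaved into a single sequence, and trained with the usual
# shift-by-one next-token objective.
TEXT_BASE, IMAGE_BASE, AUDIO_BASE = 0, 50_000, 60_000   # assumed offsets

text = np.array([101, 7592, 2088])         # e.g. subword ids
image = np.array([45, 317]) + IMAGE_BASE   # VQ codes offset into image range
audio = np.array([12]) + AUDIO_BASE        # codec codes offset into audio range

sequence = np.concatenate([text, image, audio])
inputs, targets = sequence[:-1], sequence[1:]   # next-token prediction pairs
print(inputs.tolist(), targets.tolist())
```

Because targets can cross modality boundaries (a text token predicting the first image token, for example), the model is forced to learn the joint distribution rather than per-modality marginals.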

3. Inference & Control: During inference, guiding the model to perform specific tasks becomes a challenge. The team is likely employing guided generation techniques. For example, to generate an image of a suggested meal based on a user's voice request, the system would construct a sequence like `[SOS], [Audio_Tokens_of_Request], [IMG_GEN_START]` and let the model autoregressively fill in the image tokens. Advanced methods like Classifier-Free Guidance might be adapted for multimodal conditioning.
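A minimal sketch of how Classifier-Free Guidance might be adapted to this setting: the model is queried twice, with and without the conditioning audio prefix, and the two logit vectors are blended before sampling the next image token. The arrays and guidance scale here are stand-ins; Meituan has not published its actual inference recipe.

```python
import numpy as np

# Classifier-Free Guidance sketch: blend conditional and unconditional logits.
# `logits_*` are random stand-ins; in practice both come from the same
# transformer, run with and without the conditioning prefix.
rng = np.random.default_rng(1)
logits_cond = rng.normal(size=1024)    # next-token logits given audio request
logits_uncond = rng.normal(size=1024)  # next-token logits given prefix alone

def cfg(cond: np.ndarray, uncond: np.ndarray, scale: float = 3.0) -> np.ndarray:
    # scale > 1 pushes generation toward the conditioning signal.
    return uncond + scale * (cond - uncond)

guided = cfg(logits_cond, logits_uncond)
next_token = int(guided.argmax())      # greedy pick for illustration
```

The cost is two forward passes per generated token, which compounds the inference-latency concerns raised later in this piece.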

| Technical Approach | Meituan's Native Path | Conventional Aligned Fusion |
| :--- | :--- | :--- |
| Core Philosophy | Unified token space & single-model prediction | Separate encoders, fused representations |
| Training Objective | Cross-modal next-token prediction | Contrastive loss (CLIP), fusion layer training |
| Inference Overhead | Single model pass | Multiple encoder passes + fusion computation |
| Cross-Modal Generation | Intrinsically coherent, native | Often requires cascaded models (e.g., text-to-image) |
| Data Efficiency | Theoretically higher (shared parameters) | Lower (modality-specific parameters) |
| Interpretability | Very low (black-box sequence model) | Higher (can inspect modality-specific encoders) |

Data Takeaway: The comparison highlights the fundamental trade-off: Meituan's path promises superior coherence and efficiency for *generative* tasks at the cost of immense training complexity and a loss of modular interpretability. It's a high-risk, high-reward architecture optimized for end-to-end agentic behavior.

Key Players & Case Studies

Meituan is not operating in a vacuum. Its strategy must be understood within a global and domestic competitive context.

Global Pioneers:
* Google/DeepMind: Their Pathways vision and PaLM-E model (embodied, multimodal) represent the closest public parallel. PaLM-E trains a single model on vision, language, and robotics data, demonstrating emergent capabilities. However, it still uses separate tokenizers and injects encoded images into a language model, rather than a fully unified token stream.
* OpenAI: While GPT-4V and Sora are impressive, they are not publicly described as natively trained from a unified token space. Sora's technical report suggests a video compression network creating patches that act as tokens, hinting at the direction Meituan is pursuing more comprehensively.
* Meta AI: The ImageBind project aims to create a joint embedding space across six modalities, but it's an alignment model, not a generative autoregressive one.

Domestic Competition:
* Alibaba (DAMO Academy): Focused on large vision models (通义千问-VL, known in English as Qwen-VL) and domain-specific multimodal models for e-commerce (product image search, live-stream analysis). Their approach is more pragmatic and integrated with existing Taobao/Tmall workflows.
* Baidu: Leverages ERNIE-ViL and related models, heavily optimized for search and information retrieval across text and image.
* ByteDance: With immense short-video data, their strength lies in video understanding and generation (via CapCut and Douyin effects), but their multimodal research appears more content-creation focused.

Meituan's unique advantage is its closed-loop, real-world operational data. A researcher at Meituan's AI platform, who spoke on background, framed it this way: "We are not building a model to describe the world; we are building a model to *operate* in a very specific slice of the world—the urban commercial environment. Our success metric isn't a benchmark score, but a reduction in delivery time, an increase in order accuracy, or a rise in customer satisfaction." This grounds their ambitious research in tangible business outcomes.

| Company | Primary Multimodal Focus | Key Advantage | Approach vs. Meituan |
| :--- | :--- | :--- | :--- |
| Meituan | Native, unified token model for action | Real-world operational data & closed loop | Most radical, end-to-end |
| Alibaba | E-commerce vision-language models | Product catalog & transaction data | Pragmatic, fusion-based |
| Baidu | Search & information retrieval | Text-centric knowledge graph | Alignment-focused |
| ByteDance | Video understanding/generation | Creative content & social data | Entertainment/creation focused |

Data Takeaway: The competitive landscape shows a clear divergence. While other giants optimize for information retrieval or content creation, Meituan is alone in betting its core business on a native multimodal model designed for physical-world action and logistics, leveraging its unique data moat.

Industry Impact & Market Dynamics

If successful, Meituan's native multimodal AI would trigger a cascade of changes across its business and the broader on-demand economy.

1. Reshaping Human-Computer Interaction: The interface to Meituan's services could evolve from app taps and text searches to natural, multimodal conversations. A user could show their empty pantry via live video and ask, "What can I cook in 20 minutes with these items?" The AI would identify the items, generate recipe suggestions, and instantly populate a cart with missing ingredients for delivery. This fluid interaction dramatically lowers friction and increases user engagement and spending.

2. Revolutionizing Logistics and Fulfillment: This is the core operational payoff. Autonomous delivery vehicles (ADVs) and drones would move from pre-programmed routes to adaptive, context-aware agents. The unified model could process a jaywalker's trajectory (vision), the sound of an approaching siren (audio), and a real-time traffic alert (text) to make a millisecond navigation decision. Inside warehouses, robots could respond to both spoken instructions and visual demonstrations for picking tasks.

3. New Business Models and Defensibility: The platform could evolve from a "matchmaker" to a predictive resource allocator. By modeling city-scale dynamics—weather, traffic, event schedules, inventory levels—the AI could pre-position drivers and goods before demand spikes. This creates an insurmountable efficiency barrier for competitors. Furthermore, it opens B2B services: licensing the "urban world model" to city planners, retailers, or other logistics companies.

4. Market and Financial Implications: Meituan's R&D expenditure in AI and new initiatives has been soaring. A successful native model would justify this spend by creating new revenue streams and protecting its core business. Failure, however, could lead to significant financial underperformance against rivals like Alibaba's Ele.me and Douyin's burgeoning local services push, which may achieve similar UX improvements with less risky, fusion-based AI.

| Metric | Current State (Est.) | Post-Native-MM AI Adoption (Projection) | Impact Driver |
| :--- | :--- | :--- | :--- |
| Customer Service Resolution Time | ~8 minutes (human + basic AI) | < 90 seconds (full AI agent) | Unified understanding of text, image, voice complaint |
| Last-Mile Delivery Cost | $0.80 - $1.20 per order | Target: $0.50 - $0.70 | ADV optimization via unified perception |
| Order Error Rate | ~1.5% | Target: < 0.3% | Visual verification of items pre-dispatch |
| User Engagement (Session Time) | ~6 minutes per day | Projection: +40% | Frictionless multimodal discovery & ordering |

Data Takeaway: The projected metrics reveal a strategy focused on compounding marginal gains across massive scale. Shaving seconds off resolution time and cents off delivery cost, when applied to billions of annual transactions, translates into billions in saved operational costs and captured market share.

Risks, Limitations & Open Questions

The path is fraught with technical, operational, and ethical peril.

Technical Hurdles:
* The Tokenization Ceiling: Current VQ-VAEs and neural codecs lose fine-grained information. Will the discrete representation be sufficient for high-fidelity, safety-critical tasks like reading street signs in rain or understanding a distressed customer's tone?
* Catastrophic Forgetting & Modality Bias: Training a single model on everything risks the model becoming mediocre at all tasks or developing a strong bias toward the dominant modality (likely text). Maintaining performance on core, existing unimodal tasks is a major challenge.
* Computational Apocalypse: The data sequences become extremely long (image tokens alone can be thousands per image). Training requires unprecedented FLOPs, likely necessitating custom AI chips. Meituan has invested in its own silicon team, but scaling remains a multi-billion-dollar question.
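The sequence-length problem can be made concrete with back-of-envelope arithmetic. The tokenization rates below (32x32 codes per image, 75 audio tokens per second) are assumed figures typical of published VQ and neural-codec setups, not Meituan's disclosed numbers.

```python
# Back-of-envelope arithmetic for why mixed-modal sequences explode in length.
# Assumed rates: a 256x256 image tokenized as a 32x32 grid of VQ codes, audio
# at 75 codec tokens/second, and a short text turn of 50 subwords.
image_tokens = 32 * 32            # 1,024 tokens per image
audio_tokens = 75 * 30            # 2,250 tokens for a 30-second voice clip
text_tokens = 50

# One photo-rich customer complaint: three photos, one voice message, some text.
one_interaction = 3 * image_tokens + audio_tokens + text_tokens
print(one_interaction)            # 5,372 tokens for a single interaction
```

Since self-attention cost grows quadratically with sequence length, even modest interactions like this one dwarf typical text-only contexts, which is what drives the custom-silicon question.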

Operational & Business Risks:
* The Integration Cliff: Even a technically brilliant model must be integrated into legacy software systems, hardware (ADVs, drones), and human workflows. This transition could be slow and messy, delaying ROI.
* Regulatory Scrutiny: A model that controls urban logistics at scale becomes critical infrastructure. Any failure—a misrouted ADV causing an accident, a biased recommendation system—would attract intense regulatory and public backlash.
* Ethical Black Box: A unified model's decisions are profoundly inscrutable. If it denies a merchant's promotion or flags a delivery rider for termination based on multimodal analysis, explaining "why" becomes legally and ethically problematic.

Open Questions: Can a model trained primarily on Meituan's commercial data generalize to novel, out-of-distribution scenarios? Will the pursuit of a monolithic model prove less agile than a modular, rapidly updatable ecosystem of smaller models? The industry is watching.

AINews Verdict & Predictions

Meituan's bet on native multimodal AI is one of the most strategically audacious and technically fascinating plays in the current AI landscape. It is a gamble predicated on the belief that the future of AI in the physical world belongs to unified, foundational world models, not assembled toolkits.

Our verdict is cautiously optimistic on the long-term vision but skeptical of the near-term timeline. The technical challenges are Herculean, and the fusion-based approach adopted by competitors will deliver tangible, incremental improvements for 3-5 years before a native model, if ever, surpasses it. However, Meituan's closed-loop data and concrete operational targets give it a unique testing ground that pure research labs lack.

Predictions:
1. By 2026, Meituan will debut a limited-scale, unified model powering its customer service chatbots, demonstrating impressive cross-modal coherence but with noticeable latency and occasional hallucination in non-text outputs.
2. The first "killer app" will not be consumer-facing. It will be an internal tool for warehouse managers that interprets mixed speech and gesture commands to control robotic fleets, delivering a clear ROI that justifies further investment.
3. By 2028, a scaled-down version of the architecture will become a major differentiator for Meituan's autonomous delivery units in selected pilot districts, reducing costs by 15-20% compared to non-AI or basic-vision competitors.
4. The largest impact will be indirect. The intense R&D effort will spin off superior component technologies—especially in vision and audio tokenization—that will be commercialized via Meituan's cloud services, creating a new revenue stream even if the grand unified model proves elusive.

What to Watch Next: Monitor Meituan's patent filings around multimodal tokenization and sequence modeling. Listen for announcements of partnerships with automotive or robotics hardware manufacturers, signaling a move to embed the model in physical systems. Most critically, watch the company's R&D expenditure as a percentage of revenue; a sustained increase signals unwavering commitment, while a plateau or cut would indicate a strategic retreat to more conventional AI. This is not just a research project; it is a bet on the very architecture of future machine intelligence, with Meituan's market leadership as the stake.
