Technical Deep Dive
The token bubble is rooted in a fundamental misunderstanding of what tokens represent. Tokens are not intelligence; they are units of computation. The industry's fixation on scale, whether measured in model parameters or in tokens consumed during inference, has created a perverse incentive: build bigger models that generate more tokens, regardless of whether those tokens produce useful output.
The Architecture of Waste
Modern large language models (LLMs) like GPT-4, Claude, and Baidu's ERNIE series operate on a transformer architecture. The core mechanism is attention, which computes relationships between all pairs of tokens in a sequence. The computational cost of standard self-attention therefore scales quadratically with sequence length (O(n²)): doubling the context window quadruples the attention compute, while the rest of the network scales roughly linearly. Yet many applications, such as simple document summarization or customer service queries, use only a fraction of that context.
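The quadratic scaling can be checked with a back-of-the-envelope count. The sketch below is a rough estimate rather than a profile of any real model: the function names are illustrative, the hidden width of 4096 is an assumed value, and constant factors such as softmax and multi-head bookkeeping are ignored.

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    """Rough FLOPs for the sequence-mixing part of one self-attention layer."""
    # Q @ K^T: n*n dot products of length d -> about 2 * n^2 * d FLOPs.
    # weights @ V: the same amount of work again.
    return 4 * seq_len * seq_len * d_model


def projection_flops(seq_len: int, d_model: int) -> int:
    """Rough FLOPs for the Q, K, V and output projections of the same layer."""
    # Four (n x d) @ (d x d) matmuls -> about 8 * n * d^2 FLOPs.
    return 8 * seq_len * d_model * d_model


D_MODEL = 4096  # assumed hidden width, for illustration only

for n in (4096, 8192, 16384):
    print(n, attention_flops(n, D_MODEL), projection_flops(n, D_MODEL))
```

Under these assumptions, each doubling of the sequence length multiplies the attention term by four and the projection term by two, which is why long contexts are dominated by attention cost.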
Li Yanhong's critique targets this inefficiency. He argues that the industry should optimize for 'token efficiency'—the ratio of useful output tokens to total input tokens. This is analogous to the concept of 'bits per word' in information theory, but applied to economic value.
The Efficiency Frontier
Several technical approaches are emerging to break the token addiction:
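1. Speculative Decoding: Instead of having the large model generate one token at a time, this technique uses a smaller 'draft' model to propose several tokens ahead, which the main model then verifies in a single parallel pass. Because the verification step preserves the main model's output distribution, latency can drop by 2-3x without sacrificing quality. Google Research's work on speculative decoding and Meta's research on this are notable; Medusa, a related open-source project, replaces the separate draft model with extra decoding heads on the main model.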
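2. KV-Cache Optimization: Key-value caching is standard for autoregressive generation, but the cache grows with sequence length and batch size and can consume massive memory. Multi-query attention (MQA), used in Falcon, and grouped-query attention (GQA), used in Llama 2 70B, shrink the cache by sharing keys and values across query heads. The saving is proportional to the reduction in key-value heads: Llama 2 70B, with 8 key-value heads for 64 query heads, stores a cache one-eighth the size of the equivalent multi-head layout.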
3. Quantization and Pruning: Reducing weight precision from FP16 to INT8 halves model size, and going to INT4 shrinks it by roughly 4x; on compatible hardware this can speed up inference by 2-3x. Open-source tools like llama.cpp and AutoGPTQ have made this accessible. The GitHub repository for llama.cpp has over 70,000 stars and is a go-to for running models on consumer hardware.
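4. Mixture of Experts (MoE): Models like Mixtral 8x7B route each token to a subset of expert sub-networks (two of eight per layer in Mixtral), so only a fraction of the parameters is active per token. This achieves high performance at lower per-token compute cost, although all experts must still be held in memory. It is a direct architectural response to the 'bigger is better' fallacy.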
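The memory arithmetic behind items 2 and 3 can be sketched in a few lines. The layer count, head counts, head dimension, and sequence length below are assumed values roughly in the shape of a 70B-class model, and the function names are illustrative. Real quantization formats also store scales and other metadata, so actual files come out somewhat larger than this estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Bytes for the raw weights alone, ignoring quantization metadata."""
    return n_params * bits_per_weight // 8


# Assumed model shape, for illustration only.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 80, 64, 128, 4096

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)            # 8 shared KV heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)            # a single shared KV head

for label, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{label}: {size / 2**30:.3f} GiB of FP16 cache per sequence")

N_PARAMS = 70_000_000_000  # assumed parameter count
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_bytes(N_PARAMS, bits) / 1e9:.0f} GB")
```

With these numbers the cache shrinks in direct proportion to the number of key-value heads, and the weights shrink in direct proportion to the bits per weight.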
Benchmarking the New Ruler
To evaluate models under the new 'value per token' framework, we need metrics that measure efficiency, not just raw capability. The following table compares leading models on traditional benchmarks and on a proposed 'efficiency score', computed as the MMLU score divided by the cost per 1M tokens in USD:
| Model | Parameters | MMLU Score | Latency (ms/token) | Cost per 1M tokens (USD) | Efficiency Score (MMLU points per $1) |
|---|---|---|---|---|---|
| GPT-4o | ~200B (est.) | 88.7 | 15 | $5.00 | 17.7 |
| Claude 3.5 Sonnet | — | 88.3 | 12 | $3.00 | 29.4 |
| Gemini 1.5 Pro | — | 86.5 | 10 | $3.50 | 24.7 |
| ERNIE 4.0 Turbo | ~100B (est.) | 82.1 | 8 | $1.20 | 68.4 |
| Llama 3 70B (open) | 70B | 82.0 | 20 (on A100) | $0.59 (via Groq) | 139.0 |
| Mixtral 8x7B (open) | 46.7B (active 12.9B) | 70.6 | 9 | $0.20 | 353.0 |
Data Takeaway: The table reveals a stark truth: smaller, more efficient models trade a modest amount of benchmark performance for large cost savings. Llama 3 70B comes within about 7 MMLU points of GPT-4o at roughly one-eighth of the price, while Mixtral 8x7B trails by 18 points but costs 25x less. ERNIE 4.0 Turbo, while not top-tier on raw benchmarks, offers the best efficiency score among closed-source models. The 'value per token' lens completely reorders the leaderboard.
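The efficiency column can be reproduced directly from the table's own figures. The sketch below copies the MMLU scores and prices from the table and assumes the score is simply MMLU divided by the cost per 1M tokens; the function and variable names are illustrative.

```python
# (model, MMLU score, cost per 1M tokens in USD), copied from the table.
MODELS = [
    ("GPT-4o", 88.7, 5.00),
    ("Claude 3.5 Sonnet", 88.3, 3.00),
    ("Gemini 1.5 Pro", 86.5, 3.50),
    ("ERNIE 4.0 Turbo", 82.1, 1.20),
    ("Llama 3 70B", 82.0, 0.59),
    ("Mixtral 8x7B", 70.6, 0.20),
]


def efficiency_score(mmlu: float, cost_per_million: float) -> float:
    """MMLU points per dollar of cost per 1M tokens."""
    return mmlu / cost_per_million


by_capability = sorted(MODELS, key=lambda m: m[1], reverse=True)
by_efficiency = sorted(MODELS, key=lambda m: efficiency_score(m[1], m[2]), reverse=True)

for name, mmlu, cost in by_efficiency:
    print(f"{name:<20} {efficiency_score(mmlu, cost):7.1f}")
```

Sorting by this score puts Mixtral 8x7B first and GPT-4o last, close to a reversal of the ranking by raw MMLU. The score says nothing about whether a given MMLU level is sufficient for a particular task.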
Key Players & Case Studies
Baidu: Leading the Pivot
Li Yanhong's 'new ruler' is not just rhetoric; it is embedded in Baidu's product strategy. ERNIE Bot, Baidu's flagship LLM, has been aggressively optimized for inference speed and cost. Baidu claims that ERNIE 4.0 Turbo achieves a 50% reduction in inference cost compared to its predecessor, while maintaining 95% of the accuracy on key tasks. This is achieved through a combination of model pruning, quantization, and a custom inference stack running on Baidu's Kunlun chips.
Baidu's approach is to target specific verticals—search, cloud, autonomous driving—where token efficiency directly translates to lower operational costs and faster response times. For example, in Baidu Search, using a smaller, distilled model for query understanding rather than the full ERNIE 4.0 saves millions of dollars in compute costs per year.
OpenAI and Anthropic: The Scale Incumbents
OpenAI and Anthropic have historically championed the 'scale is all you need' philosophy. GPT-4 and Claude 3 were built on massive compute clusters, and their pricing reflects that. However, even these leaders are pivoting. OpenAI's GPT-4o mini and Anthropic's Claude 3 Haiku are smaller, cheaper models designed for high-volume, cost-sensitive applications. This is tacit acknowledgment that the token bubble is unsustainable.
The Open-Source Rebellion
Open-source models are the most aggressive proponents of the new efficiency paradigm. The Llama 3 series from Meta, especially the 8B and 70B variants, offers near-frontier performance at a fraction of the cost. The GitHub repository for the Llama recipes project (over 15,000 stars) provides fine-tuning scripts that allow developers to adapt these models for specific tasks, further improving efficiency.
Mistral AI's Mixtral 8x7B is another standout. Its MoE architecture activates only 12.9B of its 46.7B parameters per token, yet it rivals larger models on benchmarks. The company's pricing of $0.20 per million tokens is one twenty-fifth of GPT-4o's $5.00. This has made it a favorite for startups building AI applications on a budget.
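The routing idea behind sparse activation can be sketched in plain Python. This is a toy illustration, not Mistral's implementation: the experts are stand-in scalar functions, the gate logits are made-up numbers, and the names `top_k_route` and `moe_forward` are hypothetical. It assumes top-2 routing over eight experts, with a softmax taken over the selected logits only.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, gate_logits, experts, k=2):
    """Run only the routed experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Eight toy experts; expert i simply multiplies its input by (i + 1).
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]  # made-up router scores

y, used = moe_forward(1.0, gate_logits, experts)
print("experts used:", used, "output:", round(y, 3))
```

Only two of the eight expert functions are ever called for this token, which is the source of the lower per-token compute; all eight still have to exist in memory.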
Comparison of Model Strategies
| Company | Flagship Model | Strategy | Token Efficiency Focus | Pricing Model |
|---|---|---|---|---|
| Baidu | ERNIE 4.0 Turbo | Vertical optimization, custom silicon | High (targeted use cases) | Per-token, volume discounts |
| OpenAI | GPT-4o | General intelligence, scale | Medium (introduced mini models) | Tiered by model size |
| Anthropic | Claude 3.5 Sonnet | Safety + capability | Medium (Haiku for cost) | Tiered by model size |
| Meta (open) | Llama 3 70B | Open-source, community-driven | High (can be distilled/quantized) | Free (self-hosted) |
| Mistral (open) | Mixtral 8x7B | MoE efficiency | Very high (active params only) | Free (self-hosted) |
Data Takeaway: Open-source models are winning on token efficiency by design. Their architectures (MoE, smaller active parameters) and distribution models (free, customizable) force proprietary vendors to compete on value, not just scale.
Industry Impact & Market Dynamics
The Cost Crisis
The token bubble has created a cost crisis for AI adoption. According to industry estimates, enterprise AI spending on inference could reach $50 billion by 2026, up from $10 billion in 2023. Much of this spending is wasted on unnecessary tokens. A study by a major cloud provider found that 60% of tokens generated by LLMs in enterprise applications are never used—they are either redundant, irrelevant, or part of failed generations.
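The effect of unused tokens on real cost is simple arithmetic. The sketch below takes the 60% figure quoted above at face value and applies it to an assumed list price of $5.00 per 1M tokens; the function name is illustrative.

```python
def cost_per_useful_million(list_price_per_million: float, unused_fraction: float) -> float:
    """Effective price of 1M tokens that are actually used downstream."""
    if not 0.0 <= unused_fraction < 1.0:
        raise ValueError("unused_fraction must be in [0, 1)")
    # Only (1 - unused_fraction) of every paid-for token delivers value.
    return list_price_per_million / (1.0 - unused_fraction)


effective = cost_per_useful_million(5.00, 0.60)
print(f"${effective:.2f} per 1M useful tokens")
```

At a 60% unused rate, each useful token costs two and a half times its list price.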
Li Yanhong's 'new ruler' directly addresses this. If the industry adopts 'value per token' as a key performance indicator, it will force a reallocation of compute resources. Companies will invest in smaller, task-specific models rather than one-size-fits-all behemoths.
Market Share Shifts
The following table shows projected market share changes based on efficiency adoption:
| Segment | 2023 Market Share | 2026 Projected Share | Key Driver |
|---|---|---|---|
| Large general models (GPT-4 class) | 60% | 30% | Cost pressure, efficiency backlash |
| Medium task-specific models | 25% | 45% | Vertical optimization, fine-tuning |
| Small edge models (on-device) | 15% | 25% | Privacy, latency, cost |
Data Takeaway: The market is fragmenting. The era of the single 'god model' is ending. The future is a portfolio of specialized models, each optimized for token efficiency in its domain.
The Hardware Angle
NVIDIA's dominance is also under threat. If the industry shifts to smaller models, demand for H100/B200 GPUs may soften, while demand for inference-optimized chips (like Groq's LPU, Cerebras, or Baidu's Kunlun) will rise. This could reshape the semiconductor landscape.
Risks, Limitations & Open Questions
The Accuracy-Efficiency Trade-off
The most significant risk is that pushing for token efficiency too aggressively could degrade model quality. A model that generates fewer tokens might be cheaper, but if it produces inaccurate or incomplete answers, the cost savings are illusory. The challenge is finding the Pareto frontier where efficiency gains do not compromise utility.
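One way to make the Pareto frontier concrete is to compute it over the cost and MMLU figures from the benchmark table. In the sketch below, a model counts as dominated when another model is at least as cheap and at least as accurate, and strictly better on one of the two. Using MMLU as the only measure of utility is a simplifying assumption, and the names are illustrative.

```python
# (model, MMLU score, cost per 1M tokens in USD), copied from the benchmark table.
MODELS = [
    ("GPT-4o", 88.7, 5.00),
    ("Claude 3.5 Sonnet", 88.3, 3.00),
    ("Gemini 1.5 Pro", 86.5, 3.50),
    ("ERNIE 4.0 Turbo", 82.1, 1.20),
    ("Llama 3 70B", 82.0, 0.59),
    ("Mixtral 8x7B", 70.6, 0.20),
]


def pareto_frontier(models):
    """Names of models that no other model beats on both score and cost."""
    frontier = []
    for name, score, cost in models:
        dominated = any(
            other_score >= score and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for _, other_score, other_cost in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(MODELS))
```

On these figures only Gemini 1.5 Pro falls off the frontier, because Claude 3.5 Sonnet is both cheaper and higher-scoring. Every other model represents a distinct trade-off between cost and accuracy.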
The 'Goodhart's Law' Problem
Once 'value per token' becomes a target, it will be gamed. Developers might optimize for short, vague answers that technically score well on efficiency but fail to solve the user's problem. The metric must be carefully defined to include output quality, not just quantity.
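A toy comparison shows how a token-only metric rewards the wrong behavior. Everything here is hypothetical: the quality scores, the token counts, the 0.7 quality floor, and both function names are made up for illustration, and no standard metric of this form exists.

```python
def naive_efficiency(output_tokens: int) -> float:
    """Token-only metric: shorter answers always score higher."""
    return 1.0 / output_tokens


def quality_gated_value(quality: float, output_tokens: int, min_quality: float = 0.7) -> float:
    """Value per token that counts only answers above a quality floor."""
    if quality < min_quality:
        return 0.0  # an answer that fails the user earns nothing, however short
    return quality / output_tokens


# Hypothetical answers: (quality in [0, 1], output tokens).
vague = (0.3, 20)
complete = (0.9, 120)

print("naive:", naive_efficiency(vague[1]), naive_efficiency(complete[1]))
print("gated:", quality_gated_value(*vague), quality_gated_value(*complete))
```

The naive metric prefers the vague answer, while the gated version gives it zero and ranks the complete answer first. The hard part in practice is measuring quality reliably, which this sketch takes as given.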
Open Questions
- Can the industry agree on a standardized 'value per token' metric? Currently, there is no consensus.
- Will regulators step in to mandate efficiency standards, especially for energy-hungry models?
- How will this shift affect AI safety research, which often relies on large-scale red-teaming?
AINews Verdict & Predictions
Li Yanhong is right. The token bubble is bursting, and the industry is overdue for a value reckoning. The obsession with scale has led to a misallocation of capital, compute, and talent. The new ruler—value per token—is not just a business metric; it is an engineering philosophy.
Our Predictions:
1. By 2026, the term 'token efficiency' will be as common as 'accuracy' in model evaluation. Every major model card will include a cost-per-useful-output metric.
2. Open-source models will capture 50%+ of the enterprise inference market by 2027, driven by their superior efficiency and customizability.
3. Baidu will become a reference case for how to pivot from scale to efficiency, potentially influencing Chinese and global AI strategy.
4. NVIDIA will face its first serious competitive threat from inference-optimized chips, as the market shifts from training to inference workloads.
5. The 'one model to rule them all' era is over. The future is a federated ecosystem of specialized, efficient models, each optimized for a specific domain and cost profile.
What to watch next: Baidu's ERNIE Bot usage metrics, the adoption of MoE architectures in new models, and the pricing wars between OpenAI, Anthropic, and open-source alternatives. The token bubble is deflating, and the air is rushing toward practical value.