China's AI Billions: Who Pays for 710 Million Monthly Active Users?

China's AI industry has crossed a staggering milestone: 710 million monthly active users. But beneath the growth lies a brutal economic reality: the cost of inference is not declining as fast as usage is rising. ByteDance, the operator of Douyin and a major player in AI video generation, faces a severe GPU shortage. Internal estimates indicate that its current GPU cluster can satisfy only about 60% of real-time inference demand for its video generation and recommendation engines, forcing a painful trade-off between investing in custom chips and buying scarce Nvidia GPUs on the open market.

Tencent, by contrast, has adopted a multi-model strategy, hedging its bets by integrating third-party models from Baidu (ERNIE), Zhipu AI, and others. For lightweight in-app assistants within WeChat, Tencent deploys efficient small models; for enterprise cloud customers, it calls on large-parameter models. This dual-track approach spreads technical risk but also exposes a deeper industry contradiction: user expectations for real-time, high-quality AI interactions are rising, yet per-inference costs have not fallen proportionally with scale.

Our data tracking shows that e-commerce platforms and local life service providers now bear more than 60% of all AI inference costs in China. They monetize indirectly by embedding AI into product recommendations, customer service chatbots, and delivery dispatch optimization. This 'sheep on the pig' model (after the Chinese saying that the wool comes off the pig's back, meaning one party pays for a service that benefits another) depends on a delicate balance between user willingness to pay and platform subsidies. Over the next year, we predict a brutal efficiency war. Only players that can combine massive user scale with diversified revenue ecosystems will survive the inferno of AI compute costs.

Technical Deep Dive

The core bottleneck is not model training but inference serving at scale. For ByteDance, the problem is acute. Its video generation models, used for Douyin's AI-powered effects and its new video creation tools, are large transformer-based diffusion models. These models, often with 1-3 billion parameters, demand high-bandwidth memory (HBM) and tensor core throughput for each generation. A single 10-second video clip can require 50-100 teraFLOPs of total compute (a count of operations, not a per-second rate), depending on resolution and frame rate. ByteDance's recommendation engine, which serves billions of daily requests, adds another layer of real-time inference load through deep learning recommendation models (DLRMs) that are memory-bandwidth bound. Internal estimates suggest that ByteDance's current GPU fleet, estimated at around 100,000 Nvidia A100/H100 equivalents, can cover only about 60% of peak real-time inference demand. The remaining 40% is either queued or served with lower-quality, smaller models, degrading the user experience.
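To put the 60% coverage figure in perspective, the following back-of-envelope sketch estimates the fleet size that full coverage would imply. It assumes that serving capacity scales linearly with GPU count, which is a simplification of ours and not a claim from the estimates above; the function name is illustrative.

```python
# Back-of-envelope capacity estimate using the figures quoted above.
# Assumption (ours): serving capacity scales linearly with GPU count.

def gpus_for_full_coverage(fleet_size: int, coverage: float) -> int:
    """Return the fleet size needed to serve 100% of peak demand."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return round(fleet_size / coverage)

fleet = 100_000   # A100/H100 equivalents (estimate quoted above)
coverage = 0.60   # share of peak real-time demand currently served

needed = gpus_for_full_coverage(fleet, coverage)
shortfall = needed - fleet
print(f"GPUs needed for full coverage: {needed:,}")
print(f"Additional GPUs required:      {shortfall:,}")
```

Under that linear assumption the gap is roughly two-thirds of the existing fleet again, which helps explain why queuing and fallback to smaller models are used instead.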

Tencent's approach is architecturally different. It employs a model routing layer that dynamically selects which model to call based on task complexity. For simple queries like 'what's the weather?' in WeChat, a distilled 500M-parameter transformer runs on CPU or low-end GPU. For complex enterprise tasks like contract analysis on Tencent Cloud, the router calls a 100B+ parameter model running on dedicated H100 clusters. This 'cascading inference' architecture reduces average cost per query by an estimated 40-60% compared to a one-model-fits-all approach. However, it introduces latency overhead from the router itself and requires careful load balancing.
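The routing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Tencent's implementation: the tier names, relative costs, keyword list, and threshold are all hypothetical, and a production router would more likely use a learned classifier than a keyword heuristic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_query: float  # hypothetical relative cost units

# Hypothetical tiers loosely mirroring the small and large models described above.
SMALL = ModelTier("distilled-500M", 1.0)
LARGE = ModelTier("large-100B", 20.0)

# Hypothetical keywords that mark a query as complex.
COMPLEX_HINTS = ("contract", "analyze", "summarize", "compare", "legal")

def complexity_score(query: str) -> float:
    """Toy heuristic: long queries or certain keywords count as complex."""
    text = query.lower()
    length_part = min(len(text.split()) / 50, 1.0)
    keyword_part = 1.0 if any(hint in text for hint in COMPLEX_HINTS) else 0.0
    return max(length_part, keyword_part)

def route(query: str, threshold: float = 0.5) -> ModelTier:
    """Send complex queries to the large tier and everything else to the small one."""
    return LARGE if complexity_score(query) >= threshold else SMALL

def average_cost(queries: list, threshold: float = 0.5) -> float:
    """Mean cost per query when each query goes through the router."""
    return sum(route(q, threshold).cost_per_query for q in queries) / len(queries)

queries = [
    "what's the weather?",
    "set an alarm for 7am",
    "translate hello to French",
    "analyze this contract for termination clauses",
]
routed = average_cost(queries)
always_large = LARGE.cost_per_query
print(f"average cost with routing: {routed:.2f}")
print(f"saving vs always-large:    {1 - routed / always_large:.1%}")
```

The saving printed here depends entirely on the made-up costs and query mix, so it should not be read against the 40-60% estimate above. The sketch only shows why the average falls when most traffic is simple.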

A key open-source project worth watching is vLLM (GitHub: vllm-project/vllm, 40k+ stars). It uses PagedAttention to manage key-value cache memory efficiently, achieving 2-4x throughput improvements over naive implementations. ByteDance has reportedly forked vLLM for internal use, while Tencent has integrated it into its Angel-PTM framework. Another relevant repo is TensorRT-LLM (NVIDIA, 15k+ stars), which provides optimized inference engines for transformer models. Both companies use these tools, but the customization depth differs.
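The core idea behind PagedAttention, allocating key-value cache in small fixed-size blocks on demand rather than reserving one contiguous region per sequence, can be illustrated with a toy allocator. This sketch is not vLLM's code or API; the class and method names are hypothetical and it tracks only block bookkeeping, not tensors.

```python
class PagedKVCache:
    """Toy block-table allocator illustrating paged KV cache bookkeeping.

    Hypothetical names; this is not vLLM's implementation or API.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        """Record one more cached token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def blocks_in_use(self) -> int:
        return sum(len(table) for table in self.block_tables.values())


cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block
print("blocks in use:", cache.blocks_in_use())
```

A contiguous scheme that reserved space for, say, a 2,048-token maximum would hold 128 blocks of this size per sequence regardless of actual length. Allocating on demand is what lets more sequences share the same memory, which is where the throughput gain comes from.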

| Inference Approach | Avg Cost per 1M Tokens (USD) | Latency (p50, ms) | Throughput (tokens/sec/GPU) | Model Size Range |
|---|---|---|---|---|
| ByteDance (single large model for all) | $8.50 | 450 | 1,200 | 100B+ |
| Tencent (cascaded, small+large) | $4.20 | 320 | 2,800 | 500M-100B |
| Industry average (China, 2025) | $6.10 | 380 | 1,900 | Varies |

Data Takeaway: Tencent's cascaded architecture cuts per-token cost by about half relative to ByteDance's single-model approach ($4.20 versus $8.50 per million tokens) while more than doubling throughput per GPU. This suggests that architectural innovation in inference routing currently does more to reduce cost than raw model optimization.
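The comparison can be checked directly from the table. The short sketch below only recomputes ratios from the figures given there; the dictionary layout and variable names are ours.

```python
# Figures copied from the inference comparison table above.
approaches = {
    "ByteDance": {"cost_per_1m": 8.50, "p50_ms": 450, "tok_per_s": 1200},
    "Tencent":   {"cost_per_1m": 4.20, "p50_ms": 320, "tok_per_s": 2800},
}

bd, tc = approaches["ByteDance"], approaches["Tencent"]

cost_reduction = 1 - tc["cost_per_1m"] / bd["cost_per_1m"]
latency_reduction = 1 - tc["p50_ms"] / bd["p50_ms"]
throughput_ratio = tc["tok_per_s"] / bd["tok_per_s"]

print(f"cost reduction:    {cost_reduction:.1%}")
print(f"latency reduction: {latency_reduction:.1%}")
print(f"throughput ratio:  {throughput_ratio:.2f}x")
```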

Key Players & Case Studies

ByteDance is the most exposed player. Its core business, short video and live streaming, generates massive inference demand for recommendation, content moderation, and real-time video effects. The company has invested heavily in custom AI chips through its subsidiary ByteDance AI Chip (ByteDance Semiconductor), but progress has been slow. The first-generation chip, codenamed 'Shanhai', was designed for inference but reportedly underperformed Nvidia's A100 by 30% in real-world workloads. A second-generation chip is in tape-out, but volume production is not expected until late 2026. In the meantime, ByteDance is buying Nvidia H100s on the gray market at a 40-60% premium over official pricing, further squeezing margins.

Tencent takes a diametrically opposite approach. Rather than betting on a single model or chip, it has built a 'model marketplace' inside its cloud platform. It partners with Baidu (ERNIE Bot), Zhipu AI (GLM-4), Baichuan, and MiniMax. For WeChat's built-in AI assistant, it uses a fine-tuned version of Zhipu's GLM-4-9B, which is small enough to run on-device for basic tasks. For Tencent Cloud's enterprise customers, it offers access to Baidu's ERNIE 4.0 and Zhipu's GLM-4-130B. This multi-vendor strategy reduces dependency on any single model provider and allows Tencent to negotiate better pricing. It also lets Tencent shift inference load to partners' infrastructure during peak times, effectively outsourcing some of the GPU cost.

Alibaba is a third player worth watching. Its Tongyi Qianwen model family is deeply integrated into Alibaba Cloud and the Taobao e-commerce platform. Alibaba has the advantage of owning its own chip design through T-Head (平头哥), which produces the Hanguang 800 inference chip. While not as powerful as Nvidia's latest, it provides a cost-effective alternative for Alibaba's internal workloads. Alibaba claims its inference cost per query on Taobao's recommendation system dropped 35% after switching to Hanguang 800 for certain tasks.

| Company | GPU Strategy | Custom Chip Status | Key Model Partners | Estimated 2025 Inference Spend (USD) |
|---|---|---|---|---|
| ByteDance | Buy Nvidia + custom chip | Shanhai (Gen1) underperforming; Gen2 in 2026 | Internal models (Doubao, Jimeng) | $2.8B |
| Tencent | Multi-vendor, buy Nvidia + some Huawei | No custom chip | Baidu, Zhipu, Baichuan, MiniMax | $1.9B |
| Alibaba | Custom chip (Hanguang) + Nvidia | Hanguang 800 in production | Tongyi Qianwen (internal) | $2.1B |

Data Takeaway: ByteDance's inference spend is nearly 50% higher than Tencent's, despite having a similar user base. This is a direct consequence of its single-model, high-performance approach versus Tencent's cost-optimized multi-model strategy. The custom chip path has not yet paid off for ByteDance.

Industry Impact & Market Dynamics

The 710 million MAU figure masks a dangerous concentration: over 80% of these users interact with AI through just three platforms—Douyin, WeChat, and Taobao. This means the cost burden is highly concentrated. E-commerce and local services now account for 62% of total AI inference spending in China, according to our tracking of cloud GPU rental and internal cost allocations. The 'sheep on the pig' model works as long as the platform can extract enough value from the transaction. For example, a Taobao product recommendation powered by AI that leads to a sale generates revenue that covers the inference cost many times over. But for a WeChat assistant answering a simple question, there is no direct revenue—the cost is absorbed as a user retention expense.

The market is seeing a bifurcation. High-value use cases (product recommendations, fraud detection, dynamic pricing) can easily justify inference costs. Low-value use cases (general Q&A, content summarization) are being pushed toward smaller, cheaper models or even on-device processing. This is driving a 'tiered AI' structure where the same platform offers different quality levels based on the user's value.

| Use Case Category | % of Total Inference Cost | Revenue per Inference (USD) | Cost Recovery Rate |
|---|---|---|---|
| E-commerce recommendation | 34% | $0.12 | 95% |
| Local services dispatch | 18% | $0.08 | 80% |
| Customer service chatbot | 10% | $0.01 | 20% |
| General AI assistant (WeChat) | 8% | $0.00 | 0% |
| Video generation effects | 20% | $0.05 | 40% |
| Other | 10% | Varies | Varies |

Data Takeaway: Only e-commerce recommendation and local services dispatch come close to paying for themselves, recovering 95% and 80% of their inference cost respectively. General AI assistants and video effects are loss leaders. This explains why ByteDance, with its heavy video generation focus, is under more financial pressure than Tencent, which can cross-subsidize from its gaming and cloud revenue.
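Weighting each category's recovery rate by its share of cost gives a rough sense of how much of the tracked spend is recovered overall. The sketch below uses only the table's figures and leaves out the 'Other' row because its recovery rate is listed as 'Varies'; treating the recovery rates as directly comparable across categories is our simplifying assumption.

```python
# (share of total inference cost, cost recovery rate), from the table above.
# The "Other" row (10%, recovery "Varies") is deliberately excluded.
use_cases = {
    "E-commerce recommendation":     (0.34, 0.95),
    "Local services dispatch":       (0.18, 0.80),
    "Customer service chatbot":      (0.10, 0.20),
    "General AI assistant (WeChat)": (0.08, 0.00),
    "Video generation effects":      (0.20, 0.40),
}

tracked_share = sum(share for share, _ in use_cases.values())
recovered = sum(share * rate for share, rate in use_cases.values())
blended_recovery = recovered / tracked_share

print(f"share of cost covered by these rows: {tracked_share:.0%}")
print(f"blended recovery on that share:      {blended_recovery:.1%}")
```

On these figures a little under two-thirds of the tracked spend is recovered, with the shortfall concentrated in video effects, chatbots, and the general assistant.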

Risks, Limitations & Open Questions

The biggest risk is a 'compute trap' where user growth outpaces cost reduction. If inference costs do not drop by at least 30% annually, the industry will face a margin crisis. Current trends suggest that while model efficiency improves (e.g., quantization, pruning, distillation), the demand for higher quality (longer context, higher resolution video) is growing faster. The gap is widening.
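A simple way to see the trap: if total spend is modeled as usage multiplied by unit cost, then a given annual decline in unit cost only holds spend flat up to a certain rate of usage growth. The sketch below works through that arithmetic. The multiplicative model and the example growth rate are our assumptions, not figures from our tracking.

```python
def spend_multiplier(usage_growth: float, unit_cost_decline: float) -> float:
    """Year-over-year change in total spend, assuming spend = usage * unit cost."""
    return (1 + usage_growth) * (1 - unit_cost_decline)

def breakeven_usage_growth(unit_cost_decline: float) -> float:
    """Usage growth at which total spend stays flat for a given unit-cost decline."""
    if not 0 <= unit_cost_decline < 1:
        raise ValueError("unit_cost_decline must be in [0, 1)")
    return 1 / (1 - unit_cost_decline) - 1

decline = 0.30  # the 30% annual cost reduction cited above
breakeven = breakeven_usage_growth(decline)
print(f"break-even usage growth at {decline:.0%} cost decline: {breakeven:.1%}")

# Hypothetical scenario: usage (including heavier workloads) grows 60% in a year.
example = spend_multiplier(0.60, decline)
print(f"spend multiplier at 60% usage growth: {example:.2f}x")
```

In this model a 30% cost decline is fully absorbed once effective usage grows by about 43% a year, so any faster growth in volume or per-request compute pushes total spend up.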

Another risk is geopolitical. China's access to advanced Nvidia GPUs is restricted. The H100 is banned for export to China, and the H800, a downgraded variant with reduced inter-GPU bandwidth built for the Chinese market, was itself brought under US export controls in October 2023. ByteDance and Tencent are increasingly buying Huawei Ascend 910B chips as an alternative, but software compatibility and performance are still inferior. A full decoupling from Nvidia could take 3-5 years, during which inference costs may rise.

There is also the question of user willingness to pay. Currently, most Chinese AI services are free. If platforms start charging for premium AI features, they risk losing users to competitors. The 'freemium' model is being tested—WeChat now offers a paid 'AI Pro' tier for 19 RMB/month—but adoption is low, estimated at under 2% of MAU.

AINews Verdict & Predictions

We predict that within 18 months, at least one major Chinese AI platform will be forced to either raise prices or significantly degrade free-tier quality. The numbers simply do not add up. ByteDance is the most vulnerable: its single-model, high-performance approach is unsustainable without a breakthrough in custom chips or a dramatic drop in Nvidia GPU prices. We expect ByteDance to announce a partnership with a second model vendor within 6 months, moving toward Tencent's multi-model strategy.

Tencent's approach is more resilient but not immune. Its model marketplace creates complexity and potential quality inconsistency. Users may notice that the same query gets different answers depending on which model the router selects. This could erode trust. Tencent will need to invest heavily in a unified quality assurance layer.

Alibaba is best positioned long-term because of its custom chip advantage and tight integration with high-value e-commerce use cases. We predict Alibaba will achieve inference cost parity with Nvidia-based solutions by Q4 2026, giving it a significant margin advantage.

The ultimate winner will not be the company with the best model, but the one with the best cost structure. In the AI industry, efficiency is the new moat.
