Technical Deep Dive
The headline hardware numbers from the Four Dragons' 2025 reports are genuinely impressive. The latest generation of domestic chips—Huawei's Ascend 910C, Cambricon's MLU590, Biren's BR100, and Moore Threads' MTT S4000—now delivers FP16 performance in the range of 300-450 TFLOPS, with memory bandwidth approaching or exceeding 2 TB/s. These figures place them within striking distance of NVIDIA's H100 (989 TFLOPS FP16, 3.35 TB/s HBM3) and, for inference-optimized configurations, even the newer B200.
However, the real story lies in the software stack. NVIDIA's dominance is built not just on CUDA, but on a vast ecosystem of libraries (cuDNN, cuBLAS, TensorRT), profiling tools (Nsight), and framework integrations that have been refined over 15 years. The Four Dragons are attempting to replicate this with their own stacks: Huawei's CANN (Compute Architecture for Neural Networks), Cambricon's BangWare, Biren's BIREN-SDK, and Moore Threads' MUSA (Moore Threads Unified System Architecture).
A critical technical challenge is the compiler and runtime layer. NVIDIA's NVCC compiler and its Triton Inference Server have set a high bar for automatic kernel optimization and dynamic batching. Domestic alternatives are catching up—Huawei's MindSpore framework and its graph compiler, for instance, have shown competitive performance on standard vision models—but they lag in supporting the latest model architectures, such as Mixture-of-Experts (MoE) and state-space models like Mamba.
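The dynamic-batching half of that bar is easy to illustrate with a toy cost model. The fixed and per-item costs below are invented for illustration—they are not Triton benchmarks—but the shape of the effect is general: every launch pays a fixed overhead, so amortizing it over a larger batch multiplies throughput.

```python
# Toy cost model for dynamic batching. The 5 ms fixed cost and 0.5 ms
# per-item cost are hypothetical, chosen only to illustrate the effect.

def batch_latency_ms(batch_size: int, fixed_ms: float = 5.0, per_item_ms: float = 0.5) -> float:
    """Each kernel launch pays a fixed overhead plus a small per-item cost."""
    return fixed_ms + per_item_ms * batch_size

def throughput(batch_size: int) -> float:
    """Requests served per second at a given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

if __name__ == "__main__":
    for b in (1, 8, 32):
        print(f"batch={b:>2}  throughput={throughput(b):8.0f} req/s")
```

On these assumptions, batching 32 requests yields nearly an order of magnitude more throughput than serving them one at a time—which is why a runtime that batches automatically, rather than relying on the application to do it, matters so much.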
| Metric | NVIDIA H100 | Huawei Ascend 910C | Cambricon MLU590 | Biren BR100 | Moore Threads MTT S4000 |
|---|---|---|---|---|---|
| FP16 TFLOPS | 989 | 450 | 350 | 400 | 320 |
| Memory Bandwidth | 3.35 TB/s | 2.0 TB/s | 1.8 TB/s | 2.2 TB/s | 1.6 TB/s |
| HBM Capacity | 80 GB | 64 GB | 48 GB | 64 GB | 48 GB |
| Framework Support | PyTorch, JAX, TF | MindSpore, PyTorch (partial) | PyTorch (custom fork) | PyTorch (via adapter) | PyTorch, TF (partial) |
| Cluster Utilization (est.) | 65-80% | 40-55% | 35-50% | 30-45% | 35-50% |
Data Takeaway: While peak FP16 performance gaps have narrowed to 2-3x, the real divergence is in cluster utilization—domestic chips achieve only 40-55% of theoretical peak in production, versus 65-80% for NVIDIA clusters. This gap represents not just wasted compute, but higher effective cost per trained model.
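The takeaway's arithmetic can be made concrete. The peak TFLOPS below come from the table above; the utilization midpoints (72% and 47%) and the 1,000-exaFLOP training run are illustrative assumptions, not measured data.

```python
# Back-of-envelope arithmetic behind the utilization takeaway.
# Peak figures are from the spec table; utilizations and run size are
# illustrative assumptions.

def effective_tflops(peak_tflops: float, utilization: float) -> float:
    """Sustained throughput actually delivered in production."""
    return peak_tflops * utilization

def chip_hours_for_run(total_exaflop: float, peak_tflops: float, utilization: float) -> float:
    """Chip-hours needed for a training run of `total_exaflop` (10^18 FLOPs)."""
    flops = total_exaflop * 1e18
    return flops / (effective_tflops(peak_tflops, utilization) * 1e12) / 3600

h100 = chip_hours_for_run(1000, 989, 0.72)    # H100 at ~72% utilization
ascend = chip_hours_for_run(1000, 450, 0.47)  # Ascend 910C at ~47%
print(f"H100: {h100:,.0f} chip-hours, Ascend 910C: {ascend:,.0f} chip-hours "
      f"(~{ascend / h100:.1f}x)")
```

On these assumptions, a roughly 2.2x gap in peak FP16 widens to about a 3.4x gap in chip-hours per training run—that widening is the "higher effective cost per trained model."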
A promising open-source effort is the CANN Community Edition on GitHub, which has drawn over 5,000 stars and provides a lower-level interface for custom kernel development. Similarly, Biren's BIREN-SDK has released reference implementations for LLaMA and Stable Diffusion, but users report that debugging distributed training jobs remains significantly harder than with NVIDIA's Nsight Systems.
Key Players & Case Studies
The Four Dragons are pursuing distinct strategies, reflecting their different origins and strengths.
Huawei (Ascend) is the clear leader, leveraging its deep integration with the Kunpeng CPU ecosystem and its own cloud services (Huawei Cloud). Its strategy is to offer a complete, vertically integrated stack—from chip to server to cloud to framework (MindSpore). This approach has won major contracts from state-owned enterprises and telecom operators. However, the closed nature of the ecosystem has drawn criticism from developers who prefer the flexibility of PyTorch.
Cambricon has positioned itself as the most 'pure-play' AI chip company, with a strong focus on the developer experience. Its BangWare software stack includes a PyTorch-compatible backend that claims to require minimal code changes. Cambricon has also been aggressive in publishing benchmark results, showing competitive performance on ResNet-50 and BERT inference. Yet, its smaller scale means less community support and fewer third-party libraries.
Biren Technology has taken a differentiated approach by targeting the high-end training market with its BR100 architecture, which features a unique 'MIMD' (Multiple Instruction, Multiple Data) design for better utilization on sparse models. Biren has partnered with several AI startups to optimize training for diffusion models and MoE architectures. Its GitHub repository for model examples has gained traction, but the complexity of its hardware architecture makes software optimization challenging.
Moore Threads is the newest entrant and has used the gaming and graphics market as a beachhead. Its MUSA architecture is designed to be source-compatible with CUDA, allowing existing CUDA code to be ported with minimal changes. This 'drop-in replacement' strategy has attracted some interest from smaller AI labs, but performance overheads of 20-30% on translated code remain a barrier.
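That overhead compounds with the peak-performance gap, which is worth making explicit. A quick sketch, using the S4000's figure from the spec table and the overhead range reported above:

```python
# Translation overhead compounds with the native peak gap.
# 320 TFLOPS is the MTT S4000 figure from the spec table; the 20-30%
# overhead range is as reported for translated CUDA code.

def translated_throughput(peak_tflops: float, overhead: float) -> float:
    """Sustained peak after losing `overhead` of performance to translation."""
    return peak_tflops * (1.0 - overhead)

s4000_peak = 320  # MTT S4000 FP16 TFLOPS
for overhead in (0.20, 0.30):
    eff = translated_throughput(s4000_peak, overhead)
    print(f"{overhead:.0%} overhead -> {eff:.0f} TFLOPS "
          f"({eff / 989:.1%} of an H100's FP16 peak)")
```

On these figures, translated code lands at roughly a quarter of an H100's peak—before any utilization losses—which is the economic core of the barrier.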
| Company | Strategy | Key Customer Base | Software Stack | GitHub Stars (SDK/Examples) |
|---|---|---|---|---|
| Huawei (Ascend) | Vertical integration | State-owned, telecom | CANN, MindSpore | ~5,000 (CANN CE) |
| Cambricon | Developer-friendly | Cloud providers, research | BangWare, PyTorch backend | ~3,500 |
| Biren Technology | High-end training | AI startups, labs | BIREN-SDK, custom kernels | ~2,000 |
| Moore Threads | CUDA compatibility | Gaming, small AI labs | MUSA, CUDA translator | ~4,000 |
Data Takeaway: The table reveals a fragmented software landscape. No single company has achieved the network effects of CUDA, and each is essentially building its own island. This fragmentation is a major barrier to widespread adoption, as developers must choose a vendor and lock into its ecosystem.
Industry Impact & Market Dynamics
The 50 billion yuan revenue milestone is not just a number—it represents a structural shift in China's AI infrastructure procurement. Government mandates requiring the use of domestic chips in 'new infrastructure' projects, combined with export controls on advanced NVIDIA GPUs, have created a captive market. However, this demand is not uniform. Cloud providers like Alibaba Cloud and Tencent Cloud are the largest buyers, but they are also the most demanding in terms of software maturity and performance.
A key dynamic is the emergence of 'AI-as-a-Service' platforms that abstract away the underlying hardware. Companies like SenseTime and Megvii are building training platforms that can schedule jobs across heterogeneous clusters of NVIDIA and domestic chips. This reduces vendor lock-in but also puts pressure on the Four Dragons to ensure their chips are compatible with these orchestration layers.
| Metric | 2024 | 2025 | 2026 (Projected) |
|---|---|---|---|
| Combined Revenue (Billion RMB) | 32 | 52 | 75-85 |
| Combined China Market Share (vs. NVIDIA) | 15% | 25% | 35-40% |
| Average Cluster Utilization | 35% | 45% | 55-60% |
| Number of Production Models Supported | 50 | 120 | 250+ |
Data Takeaway: The projected 35-40% market share by 2026 suggests that domestic chips will become a significant force, but they will still be a secondary option for the most demanding workloads. The key inflection point will be when cluster utilization crosses 60%, making domestic chips cost-competitive on a total-cost-of-ownership basis.
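The claimed 60% inflection point can be checked with simple cost-per-effective-TFLOP-hour arithmetic. The assumption that a domestic chip rents at 40% of an H100's hourly price is illustrative, not a quoted market rate.

```python
# Break-even utilization for TCO parity with an H100.
# The 40% relative price is an illustrative assumption.

def cost_per_effective_tflop_hour(price_per_hour: float, peak_tflops: float,
                                  utilization: float) -> float:
    return price_per_hour / (peak_tflops * utilization)

# Normalize the H100's hourly price to 1.0 at ~72% utilization.
h100_cost = cost_per_effective_tflop_hour(1.00, 989, 0.72)

def breakeven_utilization(price_ratio: float, peak_tflops: float) -> float:
    """Utilization at which the domestic chip matches the H100's
    cost per effective TFLOP-hour."""
    return price_ratio / (peak_tflops * h100_cost)

u = breakeven_utilization(0.40, 450)  # Ascend 910C peak from the spec table
print(f"Break-even utilization: ~{u:.0%}")
```

On these assumptions the break-even lands at roughly 63% utilization—consistent with the takeaway's claim that crossing 60% is the point where domestic chips become cost-competitive on total cost of ownership.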
Risks, Limitations & Open Questions
The most significant risk is the software ecosystem gap. While hardware performance is closing, the developer experience—from installation to debugging to optimization—remains years behind CUDA. This creates a 'chicken-and-egg' problem: without a large user base, the software won't improve; without good software, users won't switch.
Another risk is the dependency on government mandates. If the political winds shift or if export controls are relaxed, the Four Dragons could face a sudden loss of their protected market. The industry must prove it can compete on merit, not just policy.
There are also technical limitations. Current domestic chips still struggle with large-scale distributed training (10,000+ chip clusters) because of immature interconnects (domestic NVLink equivalents) and collective-communication libraries. Without a mature alternative to NVIDIA's NVSwitch fabric or InfiniBand networking, scaling efficiency drops sharply beyond a few hundred chips.
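A toy model of ring all-reduce shows why the interconnect dominates at scale. The gradient size, step time, link latency, and bandwidths below are illustrative assumptions, not measurements of any specific cluster.

```python
# Toy scaling model: ring all-reduce takes 2*(n-1) communication steps,
# each paying link latency, and moves 2*(n-1)/n of the gradient bytes in
# total. All numeric inputs are illustrative assumptions.

def ring_allreduce_s(grad_gb: float, n: int, bw_gbs: float,
                     alpha_s: float = 20e-6) -> float:
    """Seconds per all-reduce: latency term plus bandwidth term."""
    return 2 * (n - 1) * alpha_s + (2 * (n - 1) / n) * grad_gb / bw_gbs

def scaling_efficiency(compute_s: float, comm_s: float) -> float:
    """Fraction of wall time spent computing if compute and communication
    do not overlap."""
    return compute_s / (compute_s + comm_s)

grad_gb, compute_s = 28.0, 0.5  # e.g. ~14B params in FP16; assumed step time
for n, bw in ((256, 900), (256, 200), (10_000, 200)):
    eff = scaling_efficiency(compute_s, ring_allreduce_s(grad_gb, n, bw))
    print(f"{n:>6} chips @ {bw} GB/s: {eff:.0%} scaling efficiency")
```

Under these assumptions, cutting per-link bandwidth from NVSwitch-class to commodity-class numbers roughly halves scaling efficiency at 256 chips, and the per-step latency term erodes it further at 10,000 chips—mirroring the sharp drop-off described above.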
Finally, the rise of new AI architectures—particularly state-space models and MoE—requires rapid software adaptation. The Four Dragons' software teams are already stretched thin supporting existing models, and keeping pace with the breakneck speed of AI research is a monumental challenge.
AINews Verdict & Predictions
The Four Dragons have successfully crossed the 'viability' threshold. Their hardware is good enough for a wide range of inference and training tasks, and their 2025 revenues prove there is real market demand. However, the transition to 'indispensable' will be won or lost in the software layer.
Our predictions for 2026:
1. Consolidation is inevitable. The fragmented software ecosystem is unsustainable. We predict at least one major acquisition or strategic alliance among the Four Dragons to unify their software stacks, possibly around a common open-source compiler like MLIR or Triton.
2. Cluster utilization will become the key metric. Companies that can demonstrate 60%+ utilization in production will win the next wave of cloud contracts. This will drive investment in profiling tools, automated optimization, and better distributed training libraries.
3. The 'CUDA compatibility' strategy will fail. Moore Threads' approach of translating CUDA code will prove to be a dead end, as performance overheads and the rapid evolution of CUDA features make it a losing game. The winners will be those who invest in native, optimized stacks for PyTorch and JAX.
4. A 'dark horse' will emerge from the open-source community. A community-driven project, possibly based on the OpenAI Triton compiler, could provide a hardware-agnostic layer that reduces the lock-in of any single vendor. This would be a game-changer for the entire ecosystem.
The next 18 months will be the most critical in the history of China's domestic AI chip industry. The hardware is ready. The question is whether the software can catch up.