Technical Deep Dive
Cambricon's architectural DNA is distinct from the mainstream GPU approach. Its Siyuan series, particularly the MLU370 and the upcoming MLU590, are built around a Cambricon Instruction Set Architecture (ISA) that emphasizes sparse tensor processing and near-memory computing. The sparse computation engine is designed to exploit the inherent sparsity in neural network weights and activations, potentially delivering significant performance-per-watt advantages for models like Transformers, where attention heads can be pruned. The near-memory computing logic attempts to reduce the von Neumann bottleneck by integrating compute logic closer to the memory cells, a technique that can dramatically lower data movement energy — often the dominant cost in AI inference.
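To make the sparsity argument concrete, here is a minimal software sketch of the idea, not a description of Cambricon's hardware or ISA. It prunes the smallest-magnitude weights and then skips the zeros during a dot product, counting how many multiply-accumulate (MAC) operations remain. The function names, the 75% sparsity level, and the random data are illustrative assumptions.

```python
import random


def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed."""
    k = int(len(weights) * sparsity)
    # Indices of the k weights with the smallest absolute value.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def sparse_dot(weights, activations):
    """Dot product that skips zero weights; returns (result, macs_performed)."""
    total, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0.0:  # a sparse engine would never fetch or multiply this pair
            total += w * a
            macs += 1
    return total, macs


random.seed(0)  # fixed seed so the illustration is reproducible
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
activations = [random.gauss(0.0, 1.0) for _ in range(1000)]

pruned = prune_by_magnitude(weights, sparsity=0.75)  # hypothetical sparsity level
result, macs = sparse_dot(pruned, activations)
print(f"MACs performed: {macs} of {len(weights)}")
```

With 75% of the weights pruned, the loop performs 250 of the 1,000 multiply-accumulates. Real hardware gains depend on how well the sparsity pattern maps onto the compute units and on whether accuracy survives the pruning, neither of which this sketch models.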
However, the core engineering challenge is the Cambricon Neuware (CNware) software stack. Nvidia's CUDA has over 15 years of optimization, a vast library ecosystem (cuDNN, cuBLAS, TensorRT), and a global community of developers who have built their workflows around it. CNware, while functional, lags in several critical areas:
- Operator coverage: Many niche but important operations (e.g., specific attention variants, custom activation functions) are missing or unoptimized.
- Debugging and profiling tools: The tooling is less mature, making it harder for developers to diagnose performance bottlenecks.
- Distributed training support: Frameworks like PyTorch's DDP and FSDP have deep CUDA integrations. Porting these to CNware requires significant engineering effort and often leads to suboptimal scaling efficiency.
A recent benchmark comparison of the Cambricon MLU370-S4 against the Nvidia A100 (80GB) on a standard LLM training task (a 1.3B-parameter GPT-3-style model) illustrates the gap:
| Metric | Nvidia A100 (80GB) | Cambricon MLU370-S4 | Gap (relative to A100) |
|---|---|---|---|
| Training Throughput (tokens/sec) | 12,500 | 7,800 | -37.6% |
| Memory Bandwidth Utilization | 89% | 72% | -19.1% |
| Time to Converge (hours) | 48 | 72 | +50% |
| Power Consumption (W) | 400 | 250 | -37.5% |
| Cost per Token (relative) | 1.0x | 0.65x | -35% |
Data Takeaway: While Cambricon offers a lower relative cost per token, the 50% longer training time is a deal-breaker for most large model developers. Time-to-market for new models is critical; a 50% slowdown can mean losing competitive advantage. The lower power draw does not translate into an efficiency edge either: by the table's own figures, both chips deliver roughly 31 tokens per second per watt, so the cost advantage would have to come mainly from chip pricing rather than from energy savings.
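A short Python sketch, using only the figures from the table above, makes the efficiency comparison concrete. It derives tokens per joule (throughput divided by power, since one watt is one joule per second) and the training time that the throughput gap alone would imply. The `implied_hours` value assumes training time scales inversely with throughput for a fixed token budget, which is an assumption and not something the benchmark states.

```python
# Figures copied from the benchmark table.
a100 = {"tokens_per_sec": 12_500, "power_w": 400, "hours": 48}
mlu370 = {"tokens_per_sec": 7_800, "power_w": 250, "hours": 72}


def tokens_per_joule(chip):
    """Throughput per watt; tokens/sec divided by joules/sec gives tokens/joule."""
    return chip["tokens_per_sec"] / chip["power_w"]


def relative_gap_pct(candidate, baseline):
    """Percentage difference of `candidate` relative to `baseline`."""
    return (candidate - baseline) / baseline * 100


a100_eff = tokens_per_joule(a100)
mlu_eff = tokens_per_joule(mlu370)
throughput_gap = relative_gap_pct(mlu370["tokens_per_sec"], a100["tokens_per_sec"])

# Assumption: same token budget, so time scales inversely with throughput.
implied_hours = a100["hours"] * a100["tokens_per_sec"] / mlu370["tokens_per_sec"]

print(f"A100:   {a100_eff:.2f} tokens/J")
print(f"MLU370: {mlu_eff:.2f} tokens/J")
print(f"Throughput gap: {throughput_gap:.1f}%")
print(f"Implied MLU370 training time: {implied_hours:.1f} h (table reports {mlu370['hours']} h)")
```

On these numbers the two chips land within about 0.2% of each other in tokens per joule (31.25 versus 31.20). The throughput gap alone would imply roughly 77 hours rather than the reported 72, so the convergence figure presumably reflects factors beyond raw throughput.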
On the open-source front, the Cambricon PyTorch backend (available on GitHub) has seen moderate community activity, with around 1,200 stars and periodic updates. However, the repository's issue tracker reveals persistent problems with operator coverage and memory management for large models. A notable project is the CNDEV repository, which provides low-level driver and runtime interfaces, but its complexity limits its use to a small number of system-level engineers.
Key Players & Case Studies
The domestic AI chip landscape is no longer a two-horse race. Here is a comparative analysis of the major contenders:
| Company | Focus Area | Key Product | Training Performance (vs. A100) | Ecosystem Maturity | Primary Customers |
|---|---|---|---|---|---|
| Huawei (Ascend) | Large-scale training & inference | Ascend 910B | ~80-90% | High (MindSpore, CANN) | Major cloud providers, state-owned enterprises |
| Biren Technology | HPC & AI training | BR100 | ~70-80% | Medium (BIREN-SDK) | Research institutions, HPC centers |
| Cambricon | Full-stack (training + inference) | MLU590 (upcoming) | ~60-70% (est.) | Low-Medium (CNware) | Select LLM startups, smart city projects |
| Enflame | Inference & edge | T20 | N/A (inference only) | Medium (TopsRider) | Cloud gaming, video analytics |
| Moore Threads | Consumer & datacenter GPU | MTT S4000 | ~50-60% | Low (MUSA) | Gaming, content creation, small AI workloads |
Data Takeaway: Huawei has emerged as the clear leader in ecosystem maturity and training performance, leveraging its massive internal AI usage and government relationships. Biren has carved a niche in HPC, but its commercial traction remains limited. Cambricon sits in a precarious middle ground: it has the most ambitious full-stack vision but lacks the ecosystem pull of Huawei and the specialized focus of Enflame or Moore Threads.
A case study in Cambricon's struggles is its partnership with a major Chinese LLM developer, Baichuan Intelligence. Early reports indicated that Baichuan used Cambricon chips for some inference workloads but opted for Nvidia and Huawei Ascend for its primary training clusters. The reason cited was the difficulty of scaling Cambricon's chips for distributed training across hundreds of nodes, a problem rooted in both the hardware interconnect (Cambricon relies on its own MLU-Link interconnect, which has far less deployment history than Nvidia's NVLink) and the immaturity of the software stack. This pattern is repeated across the industry: Cambricon is often used for secondary or backup workloads, not as the primary compute engine.
Industry Impact & Market Dynamics
The Chinese AI chip market is projected to grow from $8 billion in 2024 to $25 billion by 2028, driven by government mandates for domestic AI infrastructure and the explosive demand for LLM inference. However, this growth is not a rising tide that lifts all boats. The market is bifurcating into two tiers:
1. High-performance training: Dominated by Huawei Ascend and Nvidia (via gray market channels).
2. Cost-sensitive inference: A fragmented market where Cambricon, Enflame, and startups compete on price-per-query.
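As a quick check on the market projection cited above, growth from $8 billion in 2024 to $25 billion in 2028 can be converted into an implied compound annual growth rate (CAGR). The sketch below uses only those two figures; the variable names are illustrative.

```python
# Market size figures from the projection, in billions of USD.
start_usd_bn = 8.0   # 2024
end_usd_bn = 25.0    # 2028
years = 2028 - 2024  # four compounding periods

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_usd_bn / start_usd_bn) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")
```

The projection works out to roughly 33% annual growth, which is the rate each tier's vendors would need to match just to hold market share.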
Cambricon's revenue figures paint a worrying picture:
| Year | Revenue (USD) | Net Income (USD) | R&D Spend (% of Revenue) | Peak Market Cap (USD) |
|---|---|---|---|---|
| 2022 | $120M | -$200M | 150% | $15B |
| 2023 | $95M | -$250M | 180% | $8B |
| 2024 (est.) | $110M | -$220M | 160% | $6B |
Data Takeaway: Cambricon has never been profitable and is burning cash at an alarming rate. R&D spending of 150% to 180% of revenue is unsustainable: the company spends between one and a half and nearly two times its revenue on R&D. A ratio like this is typical of an early-stage startup, but Cambricon has been public for years. The declining market cap reflects investor skepticism about its path to profitability. The company's cash runway, based on its latest financials, is approximately 12-18 months, assuming no additional fundraising.
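The R&D percentages in the table can be turned into dollar amounts and set against the reported losses. The sketch below uses only the table's figures; the dictionary layout and field names are illustrative.

```python
# Figures from the revenue table, in millions of USD (R&D as % of revenue).
financials = {
    "2022": {"revenue_m": 120, "net_income_m": -200, "rd_pct": 150},
    "2023": {"revenue_m": 95, "net_income_m": -250, "rd_pct": 180},
    "2024 (est.)": {"revenue_m": 110, "net_income_m": -220, "rd_pct": 160},
}


def rd_spend_m(row):
    """Absolute R&D spend in millions of USD, derived from the percentage."""
    return row["revenue_m"] * row["rd_pct"] / 100


rd_by_year = {year: rd_spend_m(row) for year, row in financials.items()}

for year, row in financials.items():
    loss = -row["net_income_m"]
    print(f"{year}: R&D ${rd_by_year[year]:.0f}M vs. net loss ${loss}M")
```

R&D spend comes out at roughly $170M to $180M in each year, so the swings in the ratio are driven mostly by revenue rather than by changes in the R&D budget. In every year the net loss exceeds R&D spend, which means that even with no R&D at all the table's figures would still show a loss.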
The government's push for 'indigenous innovation' has provided a lifeline. Cambricon has won several contracts for smart city and public security projects, but these are low-margin and not scalable. The real prize — powering the next generation of Chinese LLMs — remains elusive.
Risks, Limitations & Open Questions
1. The CUDA Lock-In is Real: Even if Cambricon achieves parity on paper, the switching cost for developers is enormous. A team that has spent years optimizing its PyTorch code for CUDA will not easily migrate to CNware, especially when the performance gain is marginal. The network effects of CUDA are a moat that no single chip company has breached.
2. Huawei's Shadow: Huawei is not just a competitor; it is an ecosystem. Its MindSpore framework, while not as popular as PyTorch, is deeply integrated into China's state-owned enterprises and research labs. Huawei can afford to subsidize chip sales for strategic wins, a luxury Cambricon does not have.
3. The 'World Model' Trap: Cambricon's marketing has increasingly focused on 'world models' and video generation, but these workloads are even more demanding than LLMs. They require massive memory bandwidth and complex attention mechanisms. If Cambricon's chips cannot demonstrate competitive performance on Sora-like models, this narrative will backfire.
4. Geopolitical Headwinds: US export controls have paradoxically helped domestic chip companies by limiting Nvidia's sales. However, they also restrict Cambricon's access to advanced manufacturing nodes (e.g., 5nm from TSMC). The company is likely stuck on 7nm-class processes, which limits its ability to compete on raw performance.
5. The Talent Drain: Top AI chip engineers in China are being poached by Huawei, Alibaba's Pingtouge, and well-funded startups. Cambricon's ability to retain its best architects is an open question.
AINews Verdict & Predictions
Our editorial judgment is clear: Cambricon will not become 'China's Nvidia.' The company is fighting a multi-front war it cannot win. Huawei has already won the training market for domestic AI. Biren has a stronger position in HPC. The inference market is too fragmented and low-margin to support Cambricon's valuation.
Predictions for the next 24 months:
- Acquisition or Strategic Alliance: Cambricon will likely be acquired by a larger state-owned enterprise or a cloud provider (e.g., China Mobile, Alibaba) that needs in-house chip capability. A standalone future is untenable.
- Niche Specialization: If it survives independently, Cambricon will pivot to a specific vertical — likely autonomous driving or edge AI — where its sparse computing and low-power advantages matter more than ecosystem breadth.
- Stock Decline: The market will continue to re-rate Cambricon downward as revenue growth disappoints. A 50% decline from current levels is plausible within 12 months.
- The 'China Nvidia' Narrative Dies: The title will pass to Huawei, which has the scale, ecosystem, and government backing to dominate. Cambricon's legacy will be as a pioneer that could not execute on its vision.
What to watch: The MLU590 launch in late 2025. If it cannot demonstrate competitive performance against the Ascend 910C on standard LLM workloads, measured by training and inference throughput rather than by model-quality benchmarks such as MMLU or HumanEval, the last hope for a turnaround will be gone. Also, watch for any major customer win: a public commitment from a top-tier LLM developer (e.g., Baidu, Alibaba) would be the only signal that could change our bearish view.