Technical Deep Dive
The core of Zhipu's breakthrough lies in a multi-layered software stack optimization targeting the unique architecture of Chinese AI chips, particularly those from companies like Huawei (Ascend series), Cambricon, and Biren Technology. Unlike NVIDIA's CUDA ecosystem, which provides a mature, unified programming model, domestic chips have historically suffered from fragmented, poorly documented, and often proprietary software tools. Zhipu's approach tackles this at three levels:
1. Kernel-Level Fusion and Memory Management: Zhipu's engineers developed custom compute kernels that fuse multiple operations (e.g., attention, feed-forward, layer normalization) into single, hardware-optimized passes. Fusion cuts kernel-launch overhead and reduces memory bandwidth consumption, both of which matter for inference, where latency is paramount. For example, on Huawei's Ascend 910B, Zhipu achieved a 2.3x speedup in token generation throughput by fusing the multi-head attention and feed-forward network operations into a single kernel, compared to running them separately. The team also implemented a novel memory pooling strategy that reduces peak memory usage by 40% for a 70B-parameter model, enabling deployment on fewer chips.
2. Automatic Operator Mapping and Compilation: Zhipu built a custom compiler layer that automatically maps PyTorch and JAX model graphs to the instruction sets of different domestic chips. This is analogous to what OpenAI's Triton does for NVIDIA GPUs, but generalized for heterogeneous architectures. The compiler uses a cost model based on chip-specific characteristics (e.g., number of tensor cores, memory hierarchy, interconnect bandwidth) to select the optimal operator implementation. On Cambricon's MLU370, this reduced the need for manual operator rewriting by 80%, cutting adaptation time for a standard LLaMA-2 13B model from roughly 4 months (the figure in the benchmark table below) to 2 weeks.
3. Quantization and Sparsity Exploitation: Zhipu integrated advanced quantization techniques (INT4, INT8, and mixed-precision) that account for the numerical behavior of the underlying chip. Where NVIDIA's TensorRT is tuned to the precision formats that NVIDIA GPUs support natively, Zhipu's stack dynamically selects quantization schemes based on each chip's hardware support. For Biren Technology's BR100 chip, which lacks native INT4 support, the stack uses a software-emulated INT4 scheme that achieves a 1.8x throughput improvement over FP16 with less than 1% accuracy loss on the MMLU benchmark.
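The software-emulated INT4 idea in the third point can be illustrated with a minimal sketch. It assumes symmetric weight-only quantization with a single scale, two 4-bit values packed into each byte, and dequantization back to floats before compute. The function names and layout are hypothetical and are not taken from Zhipu's stack.

```python
# Illustrative sketch only: symmetric INT4 weight quantization emulated in
# software. Names and layout are assumptions, not Zhipu's implementation.

def quantize_int4(weights):
    """Map floats to integers in [-7, 7] using a single symmetric scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def pack_int4(q):
    """Pack two signed 4-bit values into each byte (low nibble first)."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count without mutating the input
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))

def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit values from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]

def dequantize(q, scale):
    """Convert quantized integers back to floats for compute."""
    return [v * scale for v in q]

weights = [0.12, -0.50, 0.33, 0.07, -0.21]
q, scale = quantize_int4(weights)
packed = pack_int4(q)
restored = dequantize(unpack_int4(packed, len(weights)), scale)
```

Packing stores two weights per byte, so the weights take half the space they would as INT8. On hardware without native INT4 arithmetic, the unpack and dequantize steps run in software before the matrix multiply, which is the trade-off a scheme like this accepts.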
Benchmark Performance Data:
| Chip | Model | Throughput (tokens/s) | Latency (ms/token) | Memory Usage (GB) | Adaptation Time |
|---|---|---|---|---|---|
| NVIDIA A100 (baseline) | LLaMA-2 13B | 120 | 8.3 | 26 | N/A |
| Huawei Ascend 910B (before) | LLaMA-2 13B | 45 | 22.2 | 32 | 3 months |
| Huawei Ascend 910B (after) | LLaMA-2 13B | 98 | 10.2 | 19 | 2 weeks |
| Cambricon MLU370 (before) | LLaMA-2 13B | 30 | 33.3 | 38 | 4 months |
| Cambricon MLU370 (after) | LLaMA-2 13B | 72 | 13.9 | 22 | 2 weeks |
| Biren BR100 (before) | LLaMA-2 13B | 25 | 40.0 | 40 | 5 months |
| Biren BR100 (after) | LLaMA-2 13B | 60 | 16.7 | 24 | 3 weeks |
Data Takeaway: Zhipu's stack brings domestic chip inference throughput to roughly 50-82% of NVIDIA A100 levels (60 to 98 tokens/s against 120), while slashing adaptation time by 80-90%. The memory savings of about 40% are particularly significant, enabling deployment on fewer chips and reducing total cost of ownership.
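The ratios behind this takeaway can be recomputed from the table. The sketch below uses only the LLaMA-2 13B figures listed above. It assumes per-token latency is the reciprocal of throughput, which matches the table's latency column.

```python
# Recompute derived metrics from the benchmark table (LLaMA-2 13B rows).
A100_TOKENS_PER_S = 120

# chip: (throughput before, throughput after, memory GB before, memory GB after)
rows = {
    "Huawei Ascend 910B": (45, 98, 32, 19),
    "Cambricon MLU370": (30, 72, 38, 22),
    "Biren BR100": (25, 60, 40, 24),
}

def latency_ms(tokens_per_s):
    """Per-token latency in milliseconds, assuming latency = 1 / throughput."""
    return 1000 / tokens_per_s

summary = {}
for chip, (tp_before, tp_after, mem_before, mem_after) in rows.items():
    summary[chip] = {
        "speedup": tp_after / tp_before,
        "share_of_a100": tp_after / A100_TOKENS_PER_S,
        "memory_saving": 1 - mem_after / mem_before,
        "latency_ms": latency_ms(tp_after),
    }

for chip, s in summary.items():
    print(f"{chip}: {s['speedup']:.2f}x faster, "
          f"{s['share_of_a100']:.0%} of A100 throughput, "
          f"{s['memory_saving']:.0%} less memory, "
          f"{s['latency_ms']:.1f} ms/token")
```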
For developers, the key GitHub repository to watch is Zhipu's open-source inference engine, `zhipu-infer` (currently 12,000 stars). It now includes pre-built support for Ascend, Cambricon, and Biren chips, with a plugin architecture for adding new hardware. The repo's recent commits show active development on dynamic batching and continuous batching optimizations, which are critical for production inference serving.
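Continuous batching, mentioned above, can be sketched as a small scheduler simulation. This is not code from `zhipu-infer`: the `Request` class, the `max_batch_size` parameter, and the one-token-per-step decode loop are illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    tokens_remaining: int  # tokens still to generate

def continuous_batching(requests, max_batch_size):
    """Simulate decode steps; return {request_id: step at which it finished}.

    Unlike static batching, a finished request frees its slot immediately,
    and a waiting request is admitted at the next step.
    """
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One decode step produces one token for every running request.
        for req in running:
            req.tokens_remaining -= 1
            if req.tokens_remaining == 0:
                finished_at[req.request_id] = step
        # Drop completed requests so their slots can be reused.
        running = [r for r in running if r.tokens_remaining > 0]
    return finished_at

schedule = continuous_batching(
    [Request("a", 2), Request("b", 5), Request("c", 3)], max_batch_size=2
)
```

In this toy run, request `c` takes the slot freed by `a` and finishes at step 5. A static batch of two would not start `c` until step 6, so it would finish at step 8.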
Key Players & Case Studies
Zhipu's breakthrough is not happening in isolation. It is part of a broader movement by Chinese AI model companies to reduce dependence on NVIDIA hardware, driven by export controls and supply chain uncertainty. The key players and their strategies:
- Zhipu AI: As the model provider, Zhipu has the most incentive to optimize for domestic chips because it directly affects its cloud inference pricing and customer acquisition. Zhipu's GLM series models are now available for inference on Huawei Cloud's Ascend-powered instances at a 40% discount compared to NVIDIA A100 instances. This has already attracted several mid-sized enterprises in finance and healthcare that are sensitive to cost.
- Huawei (Ascend): Huawei has been the most aggressive in building its software ecosystem, with the CANN (Compute Architecture for Neural Networks) toolkit. However, CANN has been criticized for being complex and poorly documented. Zhipu's stack effectively acts as a higher-level abstraction over CANN, making it more accessible. Huawei has responded by contributing to Zhipu's open-source kernel library, signaling a cooperative rather than competitive stance.
- Cambricon: Cambricon's MLU series has struggled with software maturity. Zhipu's compiler layer directly addresses this by automatically generating optimized code, bypassing Cambricon's own SDK. This could be a double-edged sword: it makes Cambricon chips more usable, but it also reduces Cambricon's control over its software narrative.
- Biren Technology: Biren's BR100 chip was designed for training, but inference performance has been poor. Zhipu's quantization techniques have made it viable for inference, but Biren's lack of INT4 hardware support limits the gains. Biren is reportedly working on a new chip (BR200) with native INT4 support, expected in Q4 2026.
Comparison of Software Stack Approaches:
| Company | Approach | Key Advantage | Key Limitation |
|---|---|---|---|
| NVIDIA | CUDA + TensorRT | Mature ecosystem, broad model support | High cost, export control risks |
| Huawei | CANN (proprietary) | Tight hardware integration | Steep learning curve, vendor lock-in |
| Zhipu | zhipu-infer (open-source) | Hardware-agnostic, fast adaptation | Relies on chip vendors for low-level drivers |
| Intel | OpenVINO | Cross-platform, good for edge | Limited support for large models |
Data Takeaway: Zhipu's open-source, hardware-agnostic approach is unique. It lowers the barrier for all domestic chip vendors simultaneously, creating a rising tide that lifts all boats. However, it also means Zhipu becomes a critical dependency—if Zhipu stops supporting a chip, that chip's inference ecosystem collapses.
Industry Impact & Market Dynamics
The immediate impact is on the economics of AI inference in China. Currently, the cost of running a 70B-parameter model on NVIDIA A100 instances is approximately $0.50 per million tokens. With Zhipu's optimization on Ascend 910B, the cost drops to $0.28 per million tokens, a 44% reduction. At $0.22 saved per million tokens, a company processing 10 billion tokens per day saves roughly $800,000 a year; at 10 million tokens per day, the saving is only about $800 a year.
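The per-token prices above turn into annual savings as follows. The two prices come from the paragraph; the daily volumes, the flat pricing, and the 365-day year are illustrative assumptions.

```python
# Annual savings from the per-million-token prices quoted above.
A100_COST_PER_M = 0.50    # USD per million tokens on NVIDIA A100 instances
ASCEND_COST_PER_M = 0.28  # USD per million tokens on Ascend 910B

def annual_savings(tokens_per_day, days_per_year=365):
    """USD saved per year at a given daily token volume (assumes flat pricing)."""
    millions_per_day = tokens_per_day / 1_000_000
    return millions_per_day * (A100_COST_PER_M - ASCEND_COST_PER_M) * days_per_year

# Relative price reduction from moving to the cheaper instances.
reduction = 1 - ASCEND_COST_PER_M / A100_COST_PER_M

for volume in (10_000_000, 10_000_000_000):
    print(f"{volume:,} tokens/day -> ${annual_savings(volume):,.0f} per year")
```

Savings scale linearly with volume, so the business case depends heavily on how many tokens a deployment actually serves.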
This cost advantage is driving a wave of adoption. Several Chinese cloud providers, including Alibaba Cloud and Tencent Cloud, have announced support for Zhipu's optimized inference on their domestic chip instances. The total addressable market for domestic chip inference in China is projected to grow from $2.1 billion in 2025 to $8.5 billion by 2028, according to industry estimates.
Market Growth Projections:
| Year | Domestic Chip Inference Market (China, $B) | Zhipu's Estimated Market Share | NVIDIA's Market Share (China) |
|---|---|---|---|
| 2025 | 2.1 | 5% | 75% |
| 2026 | 3.8 | 15% | 60% |
| 2027 | 5.9 | 25% | 45% |
| 2028 | 8.5 | 35% | 30% |
Data Takeaway: Zhipu is positioned to capture a significant share of the domestic chip inference market, potentially becoming the de facto middleware layer. This would give Zhipu pricing power and a moat that extends beyond its model capabilities.
Longer-term, this shift has second-order effects on the AI supply chain. As domestic chips become viable for inference, demand for NVIDIA chips in China will decline, potentially affecting NVIDIA's revenue by $3-5 billion annually by 2028. It also reduces the strategic leverage of export controls, as Chinese companies can now build competitive inference infrastructure without access to the latest NVIDIA hardware.
Risks, Limitations & Open Questions
Despite the promise, significant risks remain:
1. Performance Gap Persists: Even with Zhipu's optimizations, domestic chips still lag NVIDIA in peak throughput and latency, especially for very large models (100B+ parameters). For real-time applications like chatbots, the 10-20ms latency difference can be noticeable.
2. Training Not Addressed: Zhipu's breakthrough is focused on inference. Training large models on domestic chips remains prohibitively slow due to immature distributed training frameworks. This limits the scope of the breakthrough to deployment, not development.
3. Vendor Lock-in Risk: While Zhipu's stack is open-source, it is controlled by Zhipu. If Zhipu changes its business model or prioritizes certain chips over others, developers could be left stranded. The lack of a truly independent, community-driven alternative is a concern.
4. Hardware Reliability: Domestic chips have historically had higher failure rates and lower mean time between failures (MTBF) compared to NVIDIA. In a production inference environment, this translates to higher operational overhead and potential downtime.
5. Regulatory Uncertainty: The Chinese government may impose requirements on which chips can be used for certain AI applications, potentially favoring state-backed chip vendors over others. This could fragment the market and undermine Zhipu's hardware-agnostic approach.
AINews Verdict & Predictions
Zhipu's software stack breakthrough is a genuine inflection point for China's AI infrastructure. It transforms domestic chips from a last-resort option to a viable, cost-effective alternative for inference workloads. The 30% stock surge is not irrational exuberance but a rational repricing of Zhipu's strategic value.
Our predictions:
1. Within 12 months, Zhipu will announce a partnership with at least two major Chinese cloud providers to offer domestic chip inference as a standard service, undercutting NVIDIA-based pricing by 30-50%. This will trigger a price war in China's cloud AI market.
2. Within 18 months, the open-source community will fork Zhipu's inference engine to create an independent, community-maintained version, reducing the vendor lock-in risk. This fork will gain significant traction among smaller developers.
3. Within 24 months, at least one major Chinese AI chip vendor (likely Huawei) will acquire or license Zhipu's software stack to create a proprietary, optimized version, signaling a shift from hardware-only to software-defined chip competition.
4. The biggest loser: NVIDIA's Chinese inference revenue will decline by 25% within two years, accelerating the company's pivot to training and high-end markets.
What to watch next: The adoption rate of Zhipu's stack among non-Chinese developers. If it gains traction in Southeast Asia or Africa, where cost sensitivity is high, it could become a global alternative to CUDA for inference. The next major test will be the release of Zhipu's GLM-5 model, which is expected to be optimized exclusively for domestic chips—a clear signal of Zhipu's long-term bet.