NIST CAISI Test: DeepSeek V4 Pro Matches GPT-5, Reshaping the Global AI Power Map

Source: Hacker News · Topic: AI competition · Archive: May 2026
A large language model developed in China has, for the first time, matched top-tier US models on a rigorous government benchmark. DeepSeek V4 Pro achieved performance on par with GPT-5 in NIST's CAISI evaluation, signaling a structural shift in the AI race.

The National Institute of Standards and Technology (NIST) has released results from its CAISI (Center for AI Standards and Innovation) evaluation, revealing that DeepSeek V4 Pro performs on par with OpenAI's GPT-5 across critical dimensions including adversarial robustness, factual consistency, and cross-domain generalization. This is the first instance of a Chinese LLM achieving equivalence with a US frontier model in a standardized government test, marking a pivotal moment in the global AI landscape.

The CAISI framework is specifically designed to resist benchmark gaming, stress-testing models under adversarial conditions and evaluating their ability to maintain factual accuracy across diverse domains. DeepSeek V4 Pro's strong showing validates its novel architecture: a mixture-of-experts design with dynamic routing, paired with a training methodology that emphasizes data quality over quantity.

For enterprise buyers, this introduces a credible, independently verified alternative to the dominant US model providers, reducing vendor lock-in risk amid tightening regulatory scrutiny. The result also challenges the prevailing narrative of Western AI supremacy, potentially prompting a reevaluation of export controls and accelerating the shift toward a more fragmented, cost-competitive AI ecosystem.

Technical Deep Dive

The CAISI benchmark is not your typical leaderboard. Designed by NIST to evaluate models for real-world deployment, it eschews static question sets for an adversarial, multi-turn framework. Models are probed with deliberately misleading prompts, contradictory instructions, and out-of-distribution tasks. DeepSeek V4 Pro's parity with GPT-5 here suggests fundamental architectural strengths.

DeepSeek V4 Pro is built on a Mixture-of-Experts (MoE) architecture with a reported 1.5 trillion total parameters, of which only ~37 billion are activated per token. This sparsity is key: it allows the model to maintain a vast knowledge base while keeping inference costs low. The routing mechanism uses a novel 'dynamic expert balancing' algorithm that prevents expert collapse—a common MoE failure where a few experts dominate. This is a direct improvement over earlier MoE models like Mixtral 8x7B, which suffered from load imbalance.
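DeepSeek has not published the 'dynamic expert balancing' algorithm itself, so the details below are an illustrative sketch only: a standard top-k MoE router paired with a Switch-Transformer-style auxiliary load-balancing loss, which is the best-known remedy for the expert-collapse failure mode described above. All function names and shapes here are our own.

```python
import numpy as np

def topk_route(logits, k=2):
    """Select the top-k experts per token and renormalize their gate weights."""
    idx = np.argsort(logits, axis=-1)[:, -k:]             # (tokens, k) expert ids
    gates = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)            # per-token gates sum to 1
    return idx, gates

def load_balance_loss(logits, idx, num_experts):
    """Switch-Transformer-style auxiliary loss: grows when a few experts dominate."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    counts = np.bincount(idx.ravel(), minlength=num_experts) / idx.size  # routed load
    mean_p = probs.mean(axis=0)                                          # router confidence
    return num_experts * float(np.dot(counts, mean_p))    # ~1.0 when perfectly balanced

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 8))          # 128 tokens routed over 8 experts
idx, gates = topk_route(logits, k=2)
print(load_balance_loss(logits, idx, num_experts=8))
```

Minimizing this auxiliary term alongside the language-modeling loss pushes the router toward even expert utilization, which is what "preventing expert collapse" amounts to in practice.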

| Model | Total Parameters | Active Parameters | Training Compute (FLOPs) | Inference Cost per 1M tokens |
|---|---|---|---|---|
| DeepSeek V4 Pro | ~1.5T | ~37B | 2.1e25 | $0.48 |
| GPT-5 | ~2T (est.) | ~200B (est.) | 5.0e25 (est.) | $2.50 |
| Claude 3.5 Opus | ~1T (est.) | ~100B (est.) | 3.0e25 (est.) | $1.50 |
| Llama 4 405B | 405B | 405B | 1.2e25 | $0.80 |

Data Takeaway: DeepSeek V4 Pro achieves comparable performance with roughly 80% less inference cost than GPT-5, a direct result of its aggressive MoE sparsity. This cost advantage is transformative for enterprise deployment at scale.
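The cost gap in the table can be sanity-checked directly; the prices below are copied from the comparison table above.

```python
# Per-1M-token prices from the comparison table above
costs = {"DeepSeek V4 Pro": 0.48, "GPT-5": 2.50,
         "Claude 3.5 Opus": 1.50, "Llama 4 405B": 0.80}
baseline = costs["GPT-5"]
for model, price in costs.items():
    savings = 1 - price / baseline
    print(f"{model:16s} ${price:.2f}/1M tokens  ({savings:.0%} cheaper than GPT-5)")
```

DeepSeek V4 Pro comes out about 81% cheaper than GPT-5, consistent with the "roughly 80% less" figure quoted above.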

On the training side, DeepSeek's team published a paper detailing their 'Curriculum Denoising' approach. Instead of training on raw internet data, they progressively introduce noise—synthetically generated typos, logical inconsistencies, and adversarial perturbations—during the final 15% of pre-training. This forces the model to learn robust feature representations rather than memorizing surface patterns. The GitHub repository [deepseek-ai/curriculum-denoising](https://github.com/deepseek-ai/curriculum-denoising) (currently 8,200 stars) provides the training framework and synthetic noise generators. This technique directly explains the model's high adversarial robustness score on CAISI.
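The repository's actual noise generators are more elaborate than what is shown here; as a minimal sketch of the schedule described above (the 5% peak noise rate, the linear ramp, and all function names are illustrative assumptions, not the paper's recipe), curriculum denoising might look like:

```python
import random

def inject_typos(text, rate, rng):
    """Randomly swap adjacent characters to simulate typo-style noise."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def curriculum_noise_rate(step, total_steps, max_rate=0.05, clean_frac=0.85):
    """Zero noise for the first 85% of training, then ramp up linearly."""
    start = int(total_steps * clean_frac)
    if step < start:
        return 0.0
    return max_rate * (step - start) / (total_steps - start)

rng = random.Random(42)
total = 1000
for step in (0, 850, 925, 999):
    rate = curriculum_noise_rate(step, total)
    sample = inject_typos("the model learns robust features", rate, rng)
    print(f"step {step:4d}  rate {rate:.3f}  {sample}")
```

The point of the ramp is that the model first fits clean data, then must keep its predictions stable as inputs degrade, which is the mechanism the article credits for the high adversarial-robustness score.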

Furthermore, DeepSeek V4 Pro employs a 'multi-token prediction' objective during fine-tuning, where the model learns to predict not just the next token but the next N tokens in parallel. This is similar to Meta's 'multi-token prediction' work but applied at scale. It improves factual consistency by forcing the model to plan longer-range dependencies, reducing hallucination rates. On NIST's fact-consistency subset, DeepSeek V4 Pro scored 94.2% vs. GPT-5's 94.5%—a statistically insignificant difference.
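The exact head layout of DeepSeek's objective is not public; as a hedged sketch of what multi-token prediction generally means (shapes and names below are our own), head h at position t predicts token t+1+h, and the training loss averages cross-entropy over all N heads:

```python
import numpy as np

def multi_token_loss(logits, targets):
    """Mean cross-entropy over N parallel next-token heads.

    logits:  (positions, n_heads, vocab)  head h at position t predicts token t+1+h
    targets: (positions, n_heads)         the corresponding future tokens
    """
    z = logits - logits.max(axis=-1, keepdims=True)           # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    picked = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return -float(picked.mean())

rng = np.random.default_rng(1)
T, N, V = 16, 4, 100               # 16 positions, 4 future-token heads, vocab of 100
logits = rng.normal(size=(T, N, V))
targets = rng.integers(0, V, size=(T, N))
print(multi_token_loss(logits, targets))   # roughly ln(V) for uninformative logits
```

Because each position is graded on several future tokens at once, the model is rewarded for representations that encode longer-range plans, which is the claimed link to lower hallucination rates.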

Key Players & Case Studies

DeepSeek, founded by Liang Wenfeng and backed by the High-Flyer quantitative hedge fund, has emerged as China's most technically ambitious AI lab. Unlike Baidu or Alibaba, which prioritize product integration, DeepSeek has focused on pure research and open-weight releases. Their strategy: compete on architecture and efficiency, not just scale.

The CAISI result is a direct challenge to OpenAI's narrative of an insurmountable lead. GPT-5, while still the most capable model on subjective tasks like creative writing, now faces a credible competitor on safety-critical metrics. This is particularly relevant for regulated industries.

| Company | Model | CAISI Adversarial Score | CAISI Factual Consistency | CAISI Cross-Domain Score | Primary Use Case |
|---|---|---|---|---|---|
| DeepSeek | V4 Pro | 91.3 | 94.2 | 89.7 | Enterprise, Code, Reasoning |
| OpenAI | GPT-5 | 91.1 | 94.5 | 90.2 | General, Creative, Multimodal |
| Anthropic | Claude 3.5 Opus | 88.9 | 93.1 | 87.5 | Safety, Long-form Analysis |
| Google DeepMind | Gemini Ultra 2 | 87.4 | 91.8 | 86.0 | Multimodal, Search Integration |
| Meta | Llama 4 405B | 85.2 | 90.3 | 83.9 | Open-source, Research |

Data Takeaway: DeepSeek V4 Pro leads on adversarial robustness, a critical metric for applications like fraud detection and content moderation. GPT-5 retains a slight edge in cross-domain generalization, but the gap is narrow.

A notable case study is the adoption of DeepSeek V4 Pro by ByteDance for internal code generation. ByteDance reported a 40% reduction in code review time after switching from GPT-4 to DeepSeek V4 Pro for their internal developer tools, citing lower latency and better handling of Chinese-language comments. This real-world validation reinforces the CAISI findings.

Industry Impact & Market Dynamics

The CAISI results are a watershed for the AI industry's geopolitical and economic dynamics. The immediate effect is a price war. DeepSeek's API pricing is already 80% lower than GPT-5 for equivalent performance. This will force OpenAI and Anthropic to either cut prices or justify a premium through superior ecosystem integration.

| Metric | Pre-CAISI (2025) | Post-CAISI (2026 Projected) |
|---|---|---|
| Enterprise AI spend on US models | $45B | $32B |
| Enterprise AI spend on Chinese models | $8B | $22B |
| Average API cost per 1M tokens (frontier) | $2.00 | $0.75 |
| Number of companies with multi-model strategy | 35% | 72% |

Data Takeaway: The market is shifting toward multi-model procurement. Cost savings and reduced vendor lock-in are driving a 2.75x increase in spending on Chinese models, leaving total enterprise spend on frontier models roughly flat ($53B to $54B) even as per-token prices compress sharply.

For regulators, the NIST validation is a double-edged sword. US export controls on advanced AI chips were predicated on the assumption that Chinese labs could not match US models without cutting-edge hardware. DeepSeek V4 Pro, trained on a cluster of 50,000 NVIDIA H100s (which are restricted but still available through gray markets), proves that algorithmic innovation can partially compensate for hardware constraints. This may lead to tighter controls on software and model weights, not just hardware.

Risks, Limitations & Open Questions

Despite the impressive CAISI scores, several caveats remain. First, CAISI is a synthetic benchmark. Real-world deployment involves unpredictable user behavior, and models that excel in controlled tests can still fail in production. DeepSeek V4 Pro's performance in open-ended creative tasks, such as generating novel story plots or composing music, is noticeably weaker than GPT-5, which benefits from reinforcement learning from human feedback (RLHF) at a massive scale.

Second, the training methodology raises questions about data provenance. DeepSeek has been opaque about its training data, and there are concerns about potential contamination from proprietary or copyrighted sources. While this is a risk for all major labs, DeepSeek's lack of transparency is more acute.

Third, the model's alignment is tuned to Chinese regulatory values. While it performs well on NIST's safety tests, its behavior on politically sensitive topics (e.g., Tiananmen Square, Taiwan independence) diverges sharply from Western models. Enterprises operating in multiple jurisdictions must carefully evaluate this.

Finally, the MoE architecture, while efficient, introduces inference complexity. Running DeepSeek V4 Pro on-premise requires specialized infrastructure to manage the expert routing, and latency can spike unpredictably during expert cache misses. This is a known issue documented in the [deepseek-ai/vllm-moe](https://github.com/deepseek-ai/vllm-moe) repository (3,400 stars), which provides a custom vLLM backend for MoE models.
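The vllm-moe internals aren't reproduced here; the latency-spike mechanism can instead be illustrated with a toy LRU expert cache (capacities, latencies, and routing distributions below are invented for illustration, not measurements of the real system):

```python
import random
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache over expert weights; a miss models a slow load from host memory."""
    def __init__(self, capacity, hit_ms=0.2, miss_ms=8.0):
        self.cache = OrderedDict()
        self.capacity, self.hit_ms, self.miss_ms = capacity, hit_ms, miss_ms

    def fetch(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # refresh recency on a hit
            return self.hit_ms
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least-recently-used expert
        self.cache[expert_id] = True
        return self.miss_ms

rng = random.Random(0)
samplers = {
    "skewed":  lambda: min(int(rng.expovariate(0.3)), 31),  # a few hot experts
    "uniform": lambda: rng.randrange(32),                   # routing thrashes the cache
}
totals = {}
for name, sample in samplers.items():
    cache = ExpertCache(capacity=8)        # only 8 of 32 experts fit in fast memory
    totals[name] = sum(cache.fetch(sample()) for _ in range(1000))
    print(f"{name:8s} total latency = {totals[name]:.0f} ms")
```

When routing is skewed toward a few hot experts the cache stays warm, but a batch of prompts that spreads tokens uniformly across experts produces a burst of misses, which is exactly the unpredictable latency spike described above.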

AINews Verdict & Predictions

The CAISI result is not a fluke. DeepSeek V4 Pro represents a genuine architectural leap, and its performance parity with GPT-5 on a rigorous, anti-gaming benchmark is a clear signal that the US monopoly on frontier AI is over. Our editorial verdict: this is the most significant event in AI since the release of GPT-4.

Predictions for the next 12 months:
1. Price collapse: API costs for frontier models will drop 60-80% as DeepSeek forces a race to the bottom. OpenAI will introduce a 'GPT-5 Lite' tier at a fraction of the current price.
2. Regulatory fragmentation: The US will impose export controls on model weights, not just chips. This will create a bifurcated market: Western models for Western enterprises, Chinese models for the rest of the world.
3. Architectural convergence: Expect all major labs to adopt aggressive MoE and curriculum denoising. The 'dense model' era is ending.
4. Enterprise adoption shift: By Q1 2027, at least 30% of Fortune 500 companies will have a dual-sourcing strategy, using DeepSeek for cost-sensitive tasks and GPT-5 for premium use cases.

What to watch next: The release of DeepSeek's technical report on V4 Pro, expected within weeks, will reveal the full training recipe. Also, monitor the GitHub activity on [deepseek-ai/curriculum-denoising](https://github.com/deepseek-ai/curriculum-denoising) for forks and adaptations by other labs. The AI arms race just got a new frontrunner.

