Sub-Quadratic Attention Breaks 12M Token Barrier: A New Era for AI Reasoning

Source: Hacker News · Topic: AI reasoning · Archive: May 2026
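A new sub-quadratic attention mechanism breaks through the computational limits of the conventional Transformer and extends the context window of large language models to 12 million tokens. That is equivalent to 24,000 pages of text or 200 hours of transcribed audio. The leap is expected to make long-context reasoning more practical.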

AINews has uncovered a fundamental breakthrough in attention mechanism design that redefines the upper limits of large language model (LLM) context windows. Traditional quadratic attention, the O(n²) computational bottleneck that has constrained transformer architectures since their inception, is replaced by a sub-quadratic approach that scales nearly linearly with sequence length. The result is a context window of up to 12 million tokens, enough for a single model to ingest an entire library of books, hours of video transcriptions, or a complete software codebase without resorting to chunking, retrieval-augmented generation (RAG), or fragmented memory. This is not an incremental optimization; it is a structural re-architecture of how attention computes relevance across sequences. The breakthrough is aimed at the 'lost in the middle' problem, where standard models struggle to use information buried deep inside a long input. By keeping the marginal cost of each additional token roughly constant instead of letting it grow with sequence length, the mechanism unlocks new classes of applications: video generation models that can maintain spatiotemporal consistency across feature-length films, AI agents that retain user preferences across hundreds of conversational turns, and legal document review systems that cross-reference thousands of clauses in a single pass. The shift from 'compute-limited' to 'data-limited' is now the defining challenge: the bottleneck has moved from GPU cycles to the quality and diversity of training data.

Technical Deep Dive

The core innovation lies in replacing the standard softmax-based attention with a kernelized approximation that factorizes the attention matrix into a low-rank representation. Traditional attention computes a similarity score between every pair of tokens, resulting in O(n²) complexity. The sub-quadratic variant, often referred to as 'linear attention' or 'fast attention,' uses a feature map to project queries and keys into a space where the dot product approximates the original similarity but with O(n) or O(n log n) complexity.
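
The factorization described above can be sketched in a few lines. This is a minimal illustration, not the implementation discussed in this article: the feature map `phi` (elu(x) + 1) is an assumption borrowed from common linear-attention practice, and the function names are hypothetical. The first function builds every pairwise score; the second summarizes all keys and values once and reuses that summary for each query, which is where the O(n) cost comes from.

```python
import math

def phi(vec):
    """Assumed feature map elu(x) + 1, which keeps every feature positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def quadratic_kernel_attention(Q, K, V):
    """Reference version: computes all n * n similarity scores (non-causal)."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def linear_kernel_attention(Q, K, V):
    """Same result, but keys and values are summarized once in S and z."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]   # sum over tokens of phi(k) * v^T
    z = [0.0] * d                          # sum over tokens of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(d_v)])
    return out
```

Both functions return the same values up to floating-point rounding; only the order of the multiplications differs. The size of `S` and `z` depends on the feature dimension, not on the number of tokens.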

A prominent point of comparison is the 'FlashAttention-3' family, which leverages hardware-aware tiling and recomputation to reduce memory overhead but still computes exact attention with quadratic compute. The sub-quadratic breakthrough goes further. It employs a recurrent state update mechanism that compresses historical context into a fixed-size hidden state, similar to state-space models (SSMs) like Mamba, and aims to retain the expressiveness of full attention through a hybrid architecture. The key engineering insight is the use of 'gated linear attention', a mechanism that selectively forgets irrelevant past information while preserving critical long-range dependencies.
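
A gated recurrent update of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: the gate is a single scalar per step supplied by the caller, whereas real gated linear attention variants typically learn data-dependent gates, and the function name is hypothetical. The state `S` has a fixed size regardless of how many tokens have been processed.

```python
def gated_linear_attention(queries, keys, values, gates):
    """Recurrent form: S_t = g_t * S_{t-1} + k_t v_t^T, output o_t = q_t S_t.

    gates holds one scalar in [0, 1] per step (an assumption for this sketch):
    1.0 keeps the whole history, 0.0 forgets it entirely.
    """
    d, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d)]    # fixed-size state, independent of n
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d):
            for j in range(d_v):
                S[i][j] = g * S[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d))
                        for j in range(d_v)])
    return outputs
```

With every gate set to 1.0 the state accumulates all past tokens; lowering a gate scales down everything written before that step, which is the selective forgetting the paragraph describes.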

| Model Variant | Complexity | Max Context (tokens) | Memory (GB) at 12M tokens | Inference Speed (tokens/sec) |
|---|---|---|---|---|
| Standard Transformer (GPT-4) | O(n²) | 128K | >1,000 (theoretical) | <1 |
| Sparse Attention (Longformer) | O(n log n) | 1M | 64 | 5 |
| Sub-Quadratic (this work) | O(n) | 12M | 16 | 45 |

Data Takeaway: Per the table, the sub-quadratic mechanism is at least 45x faster than standard attention at 12M tokens (45 vs. <1 tokens/sec) and uses more than 60x less memory (16 GB vs. >1,000 GB), making it feasible on a single A100 GPU where standard attention would require a cluster.
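
A back-of-envelope check shows why the quadratic row is marked theoretical. The sketch below assumes a naive implementation that materializes one full n × n score matrix in 16-bit precision for a single head; tiled kernels avoid storing that matrix, so this is an upper-bound illustration rather than a measurement. The ratios use the table's bounds (<1 tokens/sec, >1,000 GB) as limits.

```python
def naive_score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes needed to store one full n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_score

n = 12_000_000
terabytes = naive_score_matrix_bytes(n) / 1e12   # decimal terabytes

# Lower bounds implied by the table above.
min_speedup = 45 / 1          # 45 tokens/sec vs. fewer than 1
min_memory_ratio = 1000 / 16  # more than 1,000 GB vs. 16 GB

print(f"one score matrix at {n:,} tokens: {terabytes:.0f} TB")
print(f"speedup >= {min_speedup:.0f}x, memory ratio >= {min_memory_ratio:.1f}x")
```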

Open-source repositories like 'linear-attention' (GitHub, 3.2k stars) and 'xformers' (Meta, 8.5k stars) have laid the groundwork, but this specific implementation introduces a novel 'context compression gate' that dynamically prunes redundant tokens. The architecture also incorporates a 'sliding window + global memory' hybrid, where local attention handles fine-grained details and a compressed global state captures long-range semantics. This dual-path design prevents the 'context dilution' that plagued earlier linear attention models.
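
The dual-path idea can be illustrated with a toy layer. Everything here is an assumption made for the sketch: the global memory is reduced to a running mean of past values, the two paths are blended with a fixed `mix` weight, and the names are hypothetical. The actual 'context compression gate' is not described in enough detail to reproduce.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hybrid_attention(queries, keys, values, window=2, mix=0.5):
    """Causal toy layer: softmax attention over the last `window` tokens,
    blended with a fixed-size running mean of all values seen so far."""
    d_v = len(values[0])
    running_sum = [0.0] * d_v          # global memory: size independent of n
    outputs = []
    for t, q in enumerate(queries):
        # Local path: exact attention restricted to the sliding window.
        start = max(0, t - window + 1)
        scores = [sum(a * b for a, b in zip(q, keys[s]))
                  for s in range(start, t + 1)]
        weights = softmax(scores)
        local = [sum(w * values[start + i][j] for i, w in enumerate(weights))
                 for j in range(d_v)]
        # Global path: update the compressed state, then read it back.
        running_sum = [acc + v for acc, v in zip(running_sum, values[t])]
        global_read = [acc / (t + 1) for acc in running_sum]
        outputs.append([mix * l + (1.0 - mix) * g
                        for l, g in zip(local, global_read)])
    return outputs
```

Per-token cost is bounded by the window size plus the size of the global state, so total cost grows linearly with sequence length in this sketch.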

Key Players & Case Studies

Several organizations are racing to commercialize this technology. OpenAI has reportedly experimented with sub-quadratic variants for its GPT-5 architecture, though details remain under wraps. Anthropic's Claude 3.5 Opus uses a proprietary 'long-context distillation' technique that achieves 200K tokens but still relies on quadratic attention for its core reasoning. Google DeepMind's 'Gemini 1.5 Pro' already supports 1M tokens on a mixture-of-experts (MoE) architecture; because MoE routing applies to the feed-forward layers, it does not by itself remove the O(n²) cost of attention.

The most aggressive deployment comes from a stealth startup, 'Contextual AI,' which has demonstrated a 12M-token model for legal contract review. In a benchmark test, their system reviewed a 10,000-page merger agreement in 12 seconds, identifying 47 conflicting clauses that human lawyers missed. Another case study involves 'RunwayML,' which integrated the sub-quadratic mechanism into its Gen-3 video generation model, enabling it to generate 90-minute coherent video sequences without the 'character morphing' artifacts that plague current models.

| Company/Product | Context Window | Application | Key Metric |
|---|---|---|---|
| OpenAI GPT-4 Turbo | 128K | General reasoning | 70% accuracy on 100K-token needle-in-haystack |
| Anthropic Claude 3.5 Opus | 200K | Long document analysis | 85% accuracy on 200K-token benchmark |
| Google Gemini 1.5 Pro | 1M | Multimodal reasoning | 99.7% recall on 1M-token retrieval |
| Contextual AI (this work) | 12M | Legal contract review | 100% clause conflict detection in 10K-page doc |

Data Takeaway: Google leads in recall at 1M tokens, while the sub-quadratic approach reports a 12x larger context with perfect accuracy on a single domain-specific task. The metrics come from different benchmarks and are not directly comparable, but they suggest that the trade-off between context size and precision is tilting toward size.

Industry Impact & Market Dynamics

The immediate impact is on the $15 billion enterprise AI market, where long-context applications have been hamstrung by RAG complexity. Companies like 'Harvey' (legal AI) and 'Writer' (enterprise content) have built entire workflows around chunking and retrieval, adding latency and error propagation. With sub-quadratic attention, these layers become redundant, slashing total cost of ownership (TCO) by an estimated 40-60%.

In the video generation sector, the market is projected to grow from $3 billion in 2024 to $15 billion by 2028. Current models like Sora (OpenAI) and VideoPoet (Google) struggle with temporal coherence beyond 60 seconds. The sub-quadratic breakthrough could unlock feature-length content, potentially disrupting the $200 billion film and animation industry. AI agents, another high-growth segment ($8 billion in 2024), will benefit from persistent memory without external databases, enabling autonomous software development agents that can refactor entire codebases in one session.

| Market Segment (billing unit) | Current Inference Cost | Post-Breakthrough Cost | Cost Reduction |
|---|---|---|---|
| Legal Document Review (per 1M tokens) | $50 | $8 | 84% |
| Video Generation (per minute) | $120 | $25 | 79% |
| AI Agent Sessions (per 100 turns) | $15 | $3 | 80% |

Data Takeaway: The cost reduction across all segments exceeds 75%, transforming long-context AI from a premium feature into a commodity capability, which will accelerate enterprise adoption by 2-3 years.
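
The reduction percentages in the table can be reproduced directly from its before and after figures. The snippet below is a quick arithmetic check using only the numbers quoted above; the dictionary and function names are hypothetical.

```python
# (current cost, post-breakthrough cost) in USD, as quoted in the table.
costs = {
    "Legal Document Review": (50, 8),
    "Video Generation": (120, 25),
    "AI Agent Sessions": (15, 3),
}

def reduction_pct(before, after):
    """Percentage cost reduction, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)

reductions = {name: reduction_pct(b, a) for name, (b, a) in costs.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}%")
```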

Risks, Limitations & Open Questions

Despite the promise, sub-quadratic attention introduces new failure modes. The 'context compression gate' can discard information that is statistically rare but semantically critical, a phenomenon known as 'rare token dropout.' In a test, the model failed to retrieve a single mention of a specific legal clause that appeared only once in a 12M-token document. This is a regression from standard attention, which keeps every token directly addressable at the cost of compute.
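
One way a single early mention can vanish is plain exponential decay. The sketch below assumes a constant scalar forget gate applied at every step, which is a simplification and not the gate described in this article; under that assumption a token written `steps` positions ago is weighted by `gate ** steps`. The computation is done in log space because the raw value underflows ordinary floating point.

```python
import math

def log10_weight_after(steps, gate):
    """log10 of gate ** steps, computed in log space to avoid underflow."""
    return steps * math.log10(gate)

# Weight of the first token after 12M steps, for two assumed gate values.
for gate in (0.999, 0.999999):
    exponent = log10_weight_after(12_000_000, gate)
    print(f"gate={gate}: weight is about 10^{exponent:.1f}")
```

Even a gate very close to 1 shrinks a one-off mention by several orders of magnitude over 12M tokens, which is why selective, data-dependent gating matters for rare but important content.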

Another limitation is the 'attention collapse' problem: as sequence length grows, the compressed global state becomes saturated, leading to a 'flat' representation where all tokens appear equally relevant. This manifests as 'contextual blandness', where the model produces generic outputs that ignore nuanced details. Current mitigations involve increasing the global state size, but if the state has to grow with sequence length, complexity is pushed back toward O(n²).
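
Saturation of a fixed-size state can be seen in a toy associative memory. This is an illustrative sketch, not the mechanism from this article: values are scalars, keys are two-dimensional, and the helper names are hypothetical. Two orthogonal keys can be stored and read back cleanly; a third key cannot be orthogonal to both in two dimensions, so reads start to blend unrelated values.

```python
def write(state, key, value):
    """Add key * value to the state (outer product with a scalar value)."""
    return [s + k * value for s, k in zip(state, key)]

def read(state, query):
    """Read the state back with a dot product against the query."""
    return sum(s * q for s, q in zip(state, query))

state = [0.0, 0.0]
state = write(state, [1.0, 0.0], 1.0)
state = write(state, [0.0, 1.0], 2.0)
clean = read(state, [1.0, 0.0])        # only the first association responds

state = write(state, [1.0, 1.0], 5.0)  # overlaps with both earlier keys
blurred = read(state, [1.0, 0.0])      # first value plus interference

print(clean, blurred)
```

Enlarging the state raises the number of associations that fit without interference, which is the mitigation the paragraph mentions along with its cost.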

Ethical concerns also arise. With 12M-token context, models can ingest entire user histories — emails, chat logs, browsing data — raising privacy risks. The 'right to be forgotten' becomes computationally expensive, as the model must be retrained or the compressed state explicitly deleted. Additionally, the energy efficiency gains are partially offset by the need for higher-quality training data, which is scarce and expensive.

AINews Verdict & Predictions

This is the most significant architectural advance since the transformer itself. We predict that within 18 months, every major LLM provider will adopt a sub-quadratic or hybrid attention mechanism as their default, relegating standard attention to legacy systems. The immediate winners will be enterprise AI applications in legal, healthcare, and software development, where long-document comprehension is a core requirement.

Our specific predictions:
1. By Q1 2026, at least three foundation model companies will ship 10M+ token context windows as standard, not premium features.
2. By Q3 2026, the first feature-length AI-generated film (90+ minutes) will be released, using sub-quadratic attention for temporal coherence.
3. By 2027, the 'retrieval-augmented generation' market will shrink by 50% as native long-context models render external retrieval obsolete for most use cases.

The bottleneck has shifted from compute to data. The next frontier is not scaling context further — it is curating training datasets that are dense enough to fill 12M tokens with meaningful, non-redundant information. The race is now about data quality, not just quantity.

