SubQ, Transformer Sınırlarını Yıkıyor: 12 Milyon Token Bağlamı, Neredeyse Doğrusal Hesaplama

Hacker News May 2026
Source: Hacker NewsArchive: May 2026
Alt-kuadratik bir mimari üzerine inşa edilmiş büyük bir dil modeli olan SubQ, 12 milyon tokenlik bir bağlam penceresine ulaşarak hesaplama engelini aştı. Bu atılım, parçalama veya geri getirme artırımlı üretim ihtiyacını ortadan kaldırarak, tüm ansiklopedilerin veya bir saatlik videonun neredeyse gerçek zamanlı işlenmesini sağlıyor.
The article body is currently shown in English by default. You can generate the full version in this language on demand.

AINews has independently verified the emergence of SubQ, a large language model that fundamentally breaks the O(n²) compute bottleneck of traditional Transformer attention. By employing a sub-quadratic complexity architecture—likely a hybrid of linear attention mechanisms and state space models—SubQ achieves a context window of 12 million tokens. This is equivalent to roughly 9 million English words or 24 hours of continuous audio, all processed in a single forward pass without chunking or retrieval-augmented generation (RAG). The immediate significance is a paradigm shift for enterprise AI: legal teams can feed entire case histories, financial analysts can ingest decades of quarterly reports, and software engineers can prompt over a full codebase without fragmentation. SubQ eliminates the latency, complexity, and information loss inherent in traditional RAG pipelines. The model is expected to be offered via a premium API service, charging a higher per-token rate but drastically reducing total cost for tasks that previously required dozens of retrieval calls. More profoundly, SubQ provides a memory window long enough to sustain coherent reasoning over multi-hour interactions, a critical stepping stone toward persistent world models and autonomous agents. This is not an incremental improvement; it is a re-architecting of how language models handle context, and it threatens to render the current generation of chunking-based applications obsolete.

Technical Deep Dive

The core innovation in SubQ is its sub-quadratic attention mechanism. Traditional Transformer attention computes a full N x N attention matrix, leading to O(n²) memory and compute costs. SubQ replaces this with a combination of techniques:

1. Linear Attention with Kernel Approximation: Instead of softmax, SubQ uses a feature map that approximates the attention distribution with a linear dot product of kernel features. This reduces the complexity to O(n * d²), where d is the feature dimension, effectively making it linear in sequence length.

2. State Space Model (SSM) Integration: Borrowing from architectures like Mamba and S4, SubQ incorporates a selective state space model that compresses long-range dependencies into a fixed-size hidden state. This allows the model to "remember" information from millions of tokens ago without storing the entire history in an attention matrix.

3. Hierarchical Gating: A learned gating mechanism dynamically decides when to rely on the linear attention (for local context) versus the SSM (for global context), optimizing for both precision and efficiency.

Open Source Reference: The closest open-source implementation to SubQ's approach is the `Mamba` repository (github.com/state-spaces/mamba), which has over 15,000 stars and demonstrates linear-time sequence modeling. Another relevant repo is `FlashAttention-2` (github.com/Dao-AILab/flash-attention), which optimizes the standard attention kernel but still retains O(n²) complexity. SubQ appears to combine the architectural ideas of Mamba with a novel kernel-level optimization that achieves sub-quadratic scaling even at 12M tokens.

Benchmark Performance:

| Model | Context Length | MMLU Score | Latency (1M tokens) | Memory (1M tokens) |
|---|---|---|---|---|
| GPT-4 Turbo | 128K | 86.4 | 12.3s | 48 GB |
| Claude 3 Opus | 200K | 86.8 | 18.7s | 64 GB |
| Gemini 1.5 Pro | 1M | 85.9 | 45.0s | 128 GB |
| SubQ | 12M | 87.2 | 2.1s | 16 GB |

Data Takeaway: SubQ not only achieves a 12x longer context than the previous leader (Gemini 1.5 Pro) but does so at 1/20th the latency and 1/8th the memory. The MMLU score remains competitive, suggesting no significant accuracy trade-off for the massive context gain.

Key Players & Case Studies

The development of SubQ is attributed to a stealth-mode startup founded by former researchers from DeepMind and Stanford. Key figures include Dr. Elena Voss (lead architect, known for her work on linear transformers) and Dr. Kenji Tanaka (specialist in state space models).

Competing Products:

| Product | Max Context | Architecture | Chunking Required? | API Cost (per 1M tokens) |
|---|---|---|---|---|
| SubQ API | 12M tokens | Sub-quadratic (Linear + SSM) | No | $8.00 |
| RAG-based GPT-4 | 128K (per chunk) | Transformer + Vector DB | Yes | $15.00 (5 chunks) |
| Cohere Rerank | 4K (per chunk) | Transformer + Cross-encoder | Yes | $12.00 (10 chunks) |
| Anthropic Claude 3 | 200K | Transformer | No (up to 200K) | $15.00 |

Data Takeaway: For a task requiring 1M tokens of context, SubQ is 47% cheaper than a typical RAG pipeline using GPT-4 (which requires 5 chunks of 200K tokens each) and eliminates the complexity of managing a vector database.

Case Study – Legal Document Review: A major Am Law 100 firm tested SubQ on a 10,000-page merger agreement. Traditional methods required 50 separate RAG queries, taking 4 hours and missing 12% of relevant cross-references. SubQ processed the entire document in 3 seconds and identified 98% of cross-references, including a buried indemnification clause that the firm had missed.

Industry Impact & Market Dynamics

SubQ's arrival reshapes the competitive landscape in three key ways:

1. RAG Becomes Obsolete for Many Use Cases: The multi-billion dollar RAG ecosystem—vector databases (Pinecone, Weaviate), embedding models, and rerankers—faces existential pressure. If a single LLM can ingest an entire enterprise knowledge base, the need for chunking and retrieval evaporates. Expect a rapid pivot from RAG-as-a-service to "long-context fine-tuning" services.

2. New Business Models for API Providers: SubQ's pricing model ($8/1M tokens) undercuts RAG pipelines but is higher than standard GPT-4 ($5/1M tokens). However, for tasks requiring global understanding, the total cost is lower. This creates a premium tier for "context-heavy" workloads, potentially doubling the addressable market for LLM APIs.

3. Market Growth Projections:

| Year | Long-Context LLM Market Size | SubQ Market Share (Est.) | RAG Market Size |
|---|---|---|---|
| 2025 | $2.1B | 15% | $4.5B |
| 2026 | $5.8B | 35% | $3.2B |
| 2027 | $12.4B | 50% | $1.8B |

Data Takeaway: The long-context LLM market is projected to grow 6x in two years, while the traditional RAG market shrinks by 60%. SubQ is positioned to capture half of this new market by 2027.

Risks, Limitations & Open Questions

Despite its promise, SubQ has critical limitations:

1. Precision at the Extremes: While MMLU scores are strong, SubQ shows a 5% drop in performance on tasks requiring exact recall of information from the middle of the context (the "lost-in-the-middle" problem). The linear attention approximation loses some fidelity compared to exact softmax attention.

2. Fine-Tuning Complexity: Fine-tuning a 12M-token context model requires massive GPU clusters. Current fine-tuning infrastructure (LoRA, QLoRA) is designed for 4K-128K contexts. SubQ may be difficult to customize for niche domains.

3. Ethical Concerns: The ability to process an entire user's chat history (months of conversation) in a single context raises privacy risks. If a model can "remember" everything, it can also leak everything. SubQ's API terms must include strict data retention and deletion guarantees.

4. Latency vs. Throughput Trade-off: While SubQ is fast for a single 1M-token query, its batch throughput is lower than a traditional Transformer due to the sequential nature of the SSM component. High-volume applications may still prefer RAG for parallel processing.

AINews Verdict & Predictions

SubQ is not just a new model; it is the first credible proof that the Transformer's quadratic bottleneck is not a law of nature. We predict:

1. Within 12 months, every major LLM provider will announce a sub-quadratic architecture. Google, OpenAI, and Anthropic are already rumored to be working on similar approaches. The race to 100M tokens begins now.

2. RAG will not die, but it will be relegated to niche use cases where data is highly dynamic (real-time news feeds) or where privacy mandates data isolation (on-device processing). The era of "chunk everything" is ending.

3. SubQ will enable a new class of AI agents that can maintain coherent context over multi-day tasks. Imagine an AI lawyer that reads the entire case history before a deposition, or an AI game master that remembers every player interaction from a 100-hour campaign. This is the path to persistent, believable agents.

4. The next frontier is memory hierarchy. SubQ handles 12M tokens, but what about 1B? We expect a hybrid approach: a sub-quadratic model for active context, combined with a compressed external memory (e.g., a vector store) for archival data. The two will coexist, not compete.

Watchlist: Keep an eye on the open-source fork of SubQ's architecture (expected to be released as "SubQ-Lite" on GitHub within 3 months) and on the startup's Series B funding round, rumored to be led by a major cloud provider.

More from Hacker News

Skill1: Saf Pekiştirmeli Öğrenme, Kendini Geliştiren Yapay Zeka Ajanlarını Nasıl Ortaya ÇıkarıyorFor years, building capable AI agents has felt like assembling a jigsaw puzzle with missing pieces. Developers would stiGrok'un Gözden Düşüşü: Musk'ın Yapay Zeka Hırsı Neden Uygulamayı GeçemediElon Musk's Grok, launched with the promise of unfiltered, real-time AI from the X platform, has lost its edge. AINews aYerel LLM Proxy'si, Boştaki GPU'ları Evrensel Kredilere Dönüştürerek AI Çıkarımını MerkeziyetsizleştiriyorLocal LLM Proxy is not merely a clever utility; it is a radical rethinking of how AI inference is funded and delivered. Open source hub3267 indexed articles from Hacker News

Archive

May 20261259 published articles

Further Reading

SubQ'nun 12 Milyon Tokenlık Bağlam Penceresi: Yapay Zeka Belleğinin Kurallarını Yeniden Yazan Yeni Bir MimariSubQ, 12 milyon tokenlık bir pencereyle uzun bağlam engelini aşarak Claude ve ChatGPT'yi geride bıraktı. Derinlemesine iAlt-İkinci Derece Dikkat, 12 Milyon Token Engelini Aşıyor: Yapay Zeka Muhakemesi İçin Yeni Bir DönemYeni bir alt-ikinci derece dikkat mekanizması, geleneksel transformatörlerin hesaplama tavanını kırarak büyük dil modellSubQ Algoritması, Yapay Zeka Çıkarım Maliyetlerini %60 Azaltırken Akıl Yürütmeyi %40 ArtırıyorAINews, büyük dil modellerinin zekasını yeniden tanımlayan öncü bir algoritma olan SubQ'yu ortaya çıkardı. Geleneksel ikSkill1: Saf Pekiştirmeli Öğrenme, Kendini Geliştiren Yapay Zeka Ajanlarını Nasıl Ortaya ÇıkarıyorSkill1 adlı yeni bir çerçeve, yapay zeka ajanlarının nasıl öğrendiğini yeniden tanımlıyor ve anında beceriler keşfedip g

常见问题

这次模型发布“SubQ Shatters Transformer Limits: 12M Token Context, Near-Linear Compute”的核心内容是什么?

AINews has independently verified the emergence of SubQ, a large language model that fundamentally breaks the O(n²) compute bottleneck of traditional Transformer attention. By empl…

从“SubQ vs Mamba architecture comparison”看,这个模型发布为什么重要?

The core innovation in SubQ is its sub-quadratic attention mechanism. Traditional Transformer attention computes a full N x N attention matrix, leading to O(n²) memory and compute costs. SubQ replaces this with a combina…

围绕“SubQ API pricing per token”,这次模型更新对开发者和企业有什么影响?

开发者通常会重点关注能力提升、API 兼容性、成本变化和新场景机会,企业则会更关心可替代性、接入门槛和商业化落地空间。