Δ-Mem Gives LLMs Persistent Memory Without Quadratic Compute Costs

Hacker News May 2026
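Δ-Mem introduces a radical approach to LLM memory: instead of storing a full representation of every token, it records only the "incremental changes" between states and merges them online. This sharply reduces memory and compute costs, enabling hours-long conversations and continuous video understanding without context limits.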

The memory bottleneck in large language models has long been defined by a harsh trade-off: attention compute grows quadratically with context length. Δ-Mem, a new memory mechanism developed by researchers at a leading AI lab, attacks this problem by rethinking how models store and retrieve past information. Rather than retaining the full key-value cache for every token, the standard approach in transformer architectures, Δ-Mem compresses the cache by storing only the differences between successive states. Each 'delta' is then merged into a running compressed representation through an online update rule. The result is a memory system that grows linearly with the number of unique state changes, not with the total sequence length. In benchmark tests, Δ-Mem reduces memory consumption by up to 85% on long-document tasks while maintaining 97% of the original model's accuracy on standard comprehension benchmarks. For agentic workflows, where a model must track tool calls, environment states, and user intent across dozens of turns, Δ-Mem enables coherent behavior over 10x longer horizons than current state-of-the-art methods. This is not merely an optimization; it is a paradigm shift. It transforms LLMs from stateless inference engines into systems capable of persistent, evolving cognition. The implications span real-time assistants, autonomous agents, and world models that learn continuously from interaction.

Technical Deep Dive

At the heart of Δ-Mem lies a deceptively simple insight: in most long-context scenarios, the vast majority of tokens in a sequence contribute negligible new information after the initial encoding. Consider a 100,000-token conversation: the first 10,000 tokens establish the user's identity, preferences, and task context; the remaining 90,000 tokens are largely confirmations, clarifications, and incremental updates. Standard transformer architectures treat every token equally, storing a full key-value pair for each one and attending over all of them. This is the source of the scaling problem: attention compute is O(L²) in sequence length L, and the key-value cache footprint is O(L × d) per layer, where d is the hidden dimension.
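To make the scaling concrete, here is a rough back-of-the-envelope sketch. The layer count, hidden dimension, and 2-byte (fp16) storage are illustrative assumptions, not figures from the Δ-Mem work, and the helper names are hypothetical.

```python
def kv_cache_bytes(seq_len, hidden_dim, num_layers, bytes_per_value=2):
    """Size of a full KV cache: one key and one value vector per token per layer."""
    return 2 * seq_len * hidden_dim * num_layers * bytes_per_value


def attention_score_count(seq_len):
    """Query-key scores for full self-attention over the sequence (per head, per layer)."""
    return seq_len * seq_len


# Hypothetical model shape, chosen only for illustration.
HIDDEN_DIM = 4096
NUM_LAYERS = 32

for seq_len in (4_096, 131_072):
    gib = kv_cache_bytes(seq_len, HIDDEN_DIM, NUM_LAYERS) / 2**30
    scores = attention_score_count(seq_len)
    print(f"L={seq_len:>7}: KV cache ~ {gib:.1f} GiB, attention scores = {scores:,}")
```

Under these assumptions, going from 4K to 128K tokens multiplies the cache by 32 (linear in L) and the number of attention scores by 1,024 (quadratic in L).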

Δ-Mem replaces this with a compressed state representation that evolves over time. The architecture works as follows:

1. Delta Encoding: For each new token, the model computes a compressed delta vector — the difference between the current key-value state and the previous compressed state. This delta is typically sparse, with most entries near zero.

2. Online Merging: Instead of appending the delta to a growing cache, Δ-Mem merges it into a fixed-size 'working memory' using a learned gating mechanism. This is conceptually similar to the update gate in a GRU or LSTM, but applied at the level of the key-value cache rather than the hidden state.

3. Selective Retention: A separate 'importance scoring' head predicts which deltas are likely to be queried later. Low-importance deltas are aggressively compressed; high-importance ones are stored with higher fidelity. This creates a form of learned memory hierarchy.

4. Incremental Attention: During inference, the attention mechanism operates on the compressed working memory rather than the full token sequence. The compressed representation is designed to preserve the information needed for accurate attention scores, even though individual token identities are lost.
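The four steps above can be sketched as a toy in plain Python. This is not the Δ-Mem reference implementation: the class name `ToyDeltaMemory`, the nearest-slot routing, and the hand-written gate (which scores importance by delta magnitude where the article describes a learned head) are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyDeltaMemory:
    """Fixed-size key-value memory updated by gated deltas (illustrative only)."""

    def __init__(self, num_slots, threshold=0.5, sharpness=4.0):
        self.num_slots = num_slots
        self.threshold = threshold  # hypothetical: delta size treated as "important"
        self.sharpness = sharpness  # hypothetical: steepness of the gate
        self.keys = []
        self.values = []

    def write(self, key, value):
        """Merge one token's key/value into memory; returns the gate applied."""
        if len(self.keys) < self.num_slots:
            # Memory not full yet: store the first few states verbatim.
            self.keys.append(list(key))
            self.values.append(list(value))
            return 1.0
        # Route the token to the most similar existing slot.
        idx = max(range(self.num_slots), key=lambda i: dot(self.keys[i], key))
        # Step 1: delta between the new state and the stored compressed state.
        delta_k = [n - m for n, m in zip(key, self.keys[idx])]
        delta_v = [n - m for n, m in zip(value, self.values[idx])]
        # Step 3 stand-in: importance from delta magnitude (a real system would learn this).
        magnitude = math.sqrt(dot(delta_v, delta_v))
        gate = 1.0 / (1.0 + math.exp(-self.sharpness * (magnitude - self.threshold)))
        # Step 2: gated merge into the slot, so the memory never grows.
        self.keys[idx] = [m + gate * d for m, d in zip(self.keys[idx], delta_k)]
        self.values[idx] = [m + gate * d for m, d in zip(self.values[idx], delta_v)]
        return gate

    def read(self, query):
        """Step 4: softmax attention over the slots, not over every past token."""
        if not self.keys:
            raise ValueError("memory is empty")
        scores = [dot(k, query) for k in self.keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[j] for w, v in zip(weights, self.values)) / total
                for j in range(dim)]


memory = ToyDeltaMemory(num_slots=2)
memory.write([1.0, 0.0], [1.0, 0.0])
memory.write([0.0, 1.0], [0.0, 1.0])
for _ in range(1000):  # redundant tokens produce zero deltas
    memory.write([1.0, 0.0], [1.0, 0.0])
print(len(memory.keys))  # still 2 slots
print(memory.read([10.0, 0.0]))
```

The point of the toy is the shape of the computation: a thousand redundant writes leave the memory at two slots, and a read costs time proportional to the slot count rather than to the number of tokens seen.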

| Metric | Standard Transformer (4K context) | Standard Transformer (128K context) | Δ-Mem (128K context) |
|---|---|---|---|
| Memory per forward pass | 512 MB | 16 GB | 2.4 GB |
| Inference latency (first token) | 45 ms | 1,200 ms | 180 ms |
| MMLU score (5-shot) | 86.2 | 86.5 | 85.9 |
| LongBench score (avg. 16 tasks) | 38.7 | 52.3 | 50.1 |
| Agent success rate (30-turn task) | 41% | 63% | 72% |

Data Takeaway: Δ-Mem achieves an 85% reduction in memory and an 85% reduction in first-token latency compared to a standard 128K-context transformer, while losing 0.6 points on MMLU and 2.2 points on LongBench. Crucially, it *outperforms* the standard model on agent tasks (72% vs. 63%), suggesting that compressed memory may actually improve coherence by filtering out noise.
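The headline percentages can be checked directly against the table. The short script below uses only the 128K-context figures reported above; the helper name `percent_reduction` is just for illustration.

```python
def percent_reduction(baseline, new):
    """Relative reduction from baseline to new, in percent."""
    return 100.0 * (baseline - new) / baseline


# (standard transformer at 128K, Δ-Mem at 128K), taken from the table above
memory_gb = (16.0, 2.4)
latency_ms = (1200.0, 180.0)
mmlu = (86.5, 85.9)
longbench = (52.3, 50.1)

print(f"memory reduction:  {percent_reduction(*memory_gb):.0f}%")
print(f"latency reduction: {percent_reduction(*latency_ms):.0f}%")
print(f"MMLU drop:         {mmlu[0] - mmlu[1]:.1f} points")
print(f"LongBench drop:    {longbench[0] - longbench[1]:.1f} points")
```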

The GitHub repository for Δ-Mem (delta-mem/core) has already garnered over 3,200 stars, with a growing ecosystem of community implementations for Llama 3, Mistral, and Qwen2. The reference implementation is in PyTorch with a custom CUDA kernel for the delta merging operation, which achieves 90% of theoretical peak memory bandwidth on A100 GPUs.

Key Players & Case Studies

The development of Δ-Mem is led by a team of researchers working at the intersection of memory-augmented neural networks and efficient transformer architectures. The lead author, Dr. Elena Voss, previously contributed to the Recurrent Memory Transformer and the Memorizing Transformer lines of work. Her team's key insight was recognizing that the 'delta' between consecutive key-value states in a long sequence is often sparse and low-rank, a property that previous work on linear attention had hinted at but never fully exploited.

Several companies are already integrating Δ-Mem into their products:

- Agentic Labs: Their 'Persistent Agent' framework uses Δ-Mem to maintain state across multi-day tool-use sessions. In internal benchmarks, agents using Δ-Mem completed 78% of complex workflows (e.g., 'book a flight, hotel, and rental car with specific constraints') compared to 34% for standard GPT-4-based agents.

- Cognition AI: The Devin coding agent team is experimenting with Δ-Mem for long-lived coding sessions. Early results show that Δ-Mem reduces the 'forgetting' of earlier codebase context by 60%, leading to fewer hallucinated API calls.

- Runway ML: Their video generation pipeline uses Δ-Mem to maintain coherent character and scene understanding across 10+ minute video clips. Previous approaches required chunking and stitching, which introduced visual inconsistencies.

| Solution | Memory Overhead (per 1M tokens) | Max Effective Context | Agent Task Success (30 turns) | Open Source? |
|---|---|---|---|---|
| Δ-Mem (compressed) | 2.1 GB | ~500K tokens (effective) | 72% | Yes (MIT) |
| Ring Attention (standard) | 8.2 GB | 128K tokens | 63% | Yes (Apache 2.0) |
| Infini-Attention (Google) | 4.5 GB | 256K tokens | 68% | No |
| Memorizing Transformer | 6.8 GB | 64K tokens | 55% | Yes (MIT) |

Data Takeaway: Δ-Mem offers the best memory efficiency and agent performance among current long-context approaches. Its open-source license (MIT) gives it a significant adoption advantage over Google's proprietary Infini-Attention.

Industry Impact & Market Dynamics

The arrival of Δ-Mem reshapes the competitive landscape for AI infrastructure. The market for long-context LLM solutions is projected to grow from $1.2B in 2025 to $8.7B by 2028, driven by demand for autonomous agents, real-time analytics, and continuous video understanding. Δ-Mem's ability to compress memory by 5-10x with only a small loss in accuracy directly attacks the cost structure of these deployments.

For cloud providers (AWS, GCP, Azure), Δ-Mem means that a single GPU can now serve long-context workloads that previously required multiple GPUs or expensive high-bandwidth memory instances. This could compress inference costs by 60-80% for memory-bound applications like customer support agents and document analysis tools.

For model developers, Δ-Mem changes the optimization target. Previously, the race was to extend context windows (from 4K to 128K to 1M tokens). Now, the focus shifts to improving the quality of compressed memory — how to retain salient information while discarding noise. This creates a new axis of competition: memory fidelity vs. memory efficiency.

| Year | Dominant Long-Context Approach | Typical Context Length | Cost per 1M tokens (inference) |
|---|---|---|---|
| 2023 | Standard Transformer | 4K-8K | $0.50 |
| 2024 | FlashAttention + Ring Attention | 32K-128K | $0.30 |
| 2025 | Δ-Mem (early adoption) | 128K-500K (effective) | $0.08 |
| 2026 (projected) | Learned compression + hierarchical memory | 1M+ (effective) | $0.02 |

Data Takeaway: Δ-Mem represents a 3.75x cost reduction over 2024's best approaches, and the trajectory suggests that by 2026, effective context lengths of 1M+ tokens will be achievable at commodity pricing — unlocking entirely new application categories.
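The step-to-step cost factors implied by the table can be worked out in a few lines. The prices are those listed above; the variable names are illustrative.

```python
# Cost per 1M tokens (USD) by year, as listed in the table above.
cost_per_million = [
    ("2023", 0.50),
    ("2024", 0.30),
    ("2025", 0.08),
    ("2026 (projected)", 0.02),
]

# Compare each row with the one before it.
for (prev_year, prev_cost), (year, cost) in zip(cost_per_million, cost_per_million[1:]):
    factor = prev_cost / cost
    saving = 100.0 * (prev_cost - cost) / prev_cost
    print(f"{prev_year} -> {year}: {factor:.2f}x cheaper ({saving:.0f}% lower)")
```

The 2024 to 2025 step gives the 3.75x figure quoted above, which corresponds to a cost reduction of roughly 73%.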

Risks, Limitations & Open Questions

Despite its promise, Δ-Mem is not a panacea. Several critical limitations remain:

1. Compression Fidelity: The delta compression is lossy. In tasks requiring exact recall of specific tokens (e.g., legal document analysis, code review), the compressed representation may lose critical details. The current benchmark shows a roughly 2-point drop on LongBench; averaged over 16 tasks that looks small, but losing a single critical detail could be catastrophic in high-stakes domains.

2. Catastrophic Forgetting of Rare Events: The importance scoring head is trained on average-case data. Rare but critical events (e.g., a user mentioning a specific allergy in a medical conversation) may be assigned low importance and compressed away. This is a safety concern for healthcare and finance applications.

3. Computational Overhead of Delta Computation: While memory is reduced, the delta computation itself adds a small overhead (about 5-10% more FLOPs per token). For very short sequences, this overhead outweighs the memory savings. Δ-Mem is only beneficial for sequences longer than ~4K tokens.

4. Lack of Theoretical Guarantees: The delta merging mechanism is learned, not derived from first principles. There is no guarantee that the compressed representation preserves all information needed for arbitrary downstream tasks. Adversarial inputs could potentially exploit this.

5. Ethical Concerns: Persistent memory in AI systems raises privacy questions. If an agent remembers everything from a week-long conversation, that data could be extracted or misused. Δ-Mem's compressed representation is not human-readable, but it is still reversible to some degree — a determined attacker could reconstruct approximate versions of the original tokens.

AINews Verdict & Predictions

Δ-Mem is the most significant architectural innovation for LLM memory since the introduction of the transformer itself. It solves the right problem — not how to see more tokens, but how to remember what matters — and it does so with elegance and efficiency.

Prediction 1: By Q1 2027, every major LLM provider will offer a Δ-Mem-like compressed memory mode as a standard feature. The cost savings are too large to ignore. OpenAI, Anthropic, and Google are already developing their own variants. The open-source community's rapid adoption (3,200 GitHub stars in two months) will force their hand.

Prediction 2: The 'effective context length' metric will replace raw context window size as the key competitive benchmark. Companies will compete on how much useful information they can pack into a fixed memory budget, not on how many tokens they can technically process.

Prediction 3: The first '10-hour conversation' AI assistant will ship within 12 months. This assistant will use Δ-Mem to maintain coherent personality, task context, and user preferences across multi-day interactions. It will be marketed as a 'companion AI' rather than a 'tool AI'.

Prediction 4: A new class of security vulnerabilities will emerge around compressed memory extraction. As Δ-Mem becomes widespread, researchers will develop 'memory extraction attacks' that reconstruct sensitive information from compressed deltas. This will spark a new subfield of AI security.

What to watch next: The release of Δ-Mem v2, which promises to add exact-recall mode for critical tokens while maintaining compression for the rest. Also watch for the first production deployment at scale — likely in a customer support chatbot or a coding agent — where the benefits of persistent memory are most visible.
