LLMs Are Shattering 20-Year-Old Distributed System Design Rules

Source: Hacker News · Archive: May 2026
For twenty years, distributed systems have held to a clean separation of compute, storage, and networking. Large language models are now breaking that rule. AINews explores how chain-of-thought reasoning and real-time knowledge retrieval demand new architectures orchestrated by the model itself.

The fundamental principle of distributed system design—strict separation of compute, storage, and networking—is being quietly undermined by the unique demands of large language models. Unlike traditional stateless workloads, LLMs require persistent, context-aware state management for chain-of-thought reasoning, multi-agent conversations, and real-time knowledge retrieval. This mismatch causes latency spikes, cache thrashing, and consistency failures in conventional architectures. Leading engineering teams are now experimenting with 'model-native' architectures where the LLM becomes the orchestrator of data flow, not just a passive consumer. This means replacing traditional database indexes with learned embeddings, load balancers with semantic routers, and treating memory hierarchies as differentiable components. The wall between compute and storage is collapsing into a unified, learnable substrate. This is not merely an optimization problem—it is the most profound architectural shift since the rise of cloud computing. The winners will be those who design systems where the model is not just an application running on infrastructure, but the infrastructure itself.

Technical Deep Dive

The clash between LLMs and traditional distributed systems stems from a fundamental architectural mismatch. Classical systems—think Google's Spanner, Amazon's DynamoDB, or any microservices stack—are built on the assumption that requests are stateless and independent. The compute layer (application servers) is stateless, the storage layer (databases) is stateful but dumb, and the network layer routes packets blindly. This separation enables horizontal scaling, fault isolation, and predictable latency.

LLMs shatter this model. Consider a multi-turn agent conversation: the model must maintain a persistent context window across dozens of turns, each requiring access to a growing knowledge base. Chain-of-thought reasoning involves intermediate steps that must be cached and retrieved. Real-time knowledge retrieval demands low-latency access to vector indexes that are updated continuously. These workloads are inherently stateful, context-dependent, and bursty.

The result is a cascade of failures in conventional architectures:

- Cache thrashing: Traditional LRU caches are optimized for repeated access to the same objects. LLM workloads exhibit highly variable access patterns—the same embedding might be needed once in a session and never again. This leads to cache miss rates exceeding 60% in production deployments, compared to ~10% for typical web workloads.
- Consistency collapse: Strong consistency models (e.g., linearizability) become prohibitively expensive when every model inference requires reading from a distributed database. Teams at companies like Anthropic have reported that enforcing read-after-write consistency for agent state adds 200-500ms of latency per turn.
- Network bottleneck: The data transfer between compute (GPU clusters) and storage (vector databases) can saturate 100Gbps links. A single 8B-parameter model generating embeddings for a 10,000-document corpus can produce 16GB of vector data per minute.
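The cache-thrashing failure mode above is easy to reproduce in a few lines: replay a repeated-access, web-style trace and a mostly one-shot, LLM-style trace against the same fixed-size LRU cache. The traces here are synthetic stand-ins, not production data:

```python
import random
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Replay an access trace against a fixed-size LRU cache; return the hit rate."""
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)

random.seed(0)
# Web-style workload: heavy repetition over a small hot set of objects.
web_trace = [random.choice(range(50)) for _ in range(10_000)]
# LLM-style workload: mostly one-shot embedding lookups over a huge key space.
llm_trace = [random.randrange(1_000_000) for _ in range(10_000)]

print(f"web-like hit rate: {lru_hit_rate(web_trace, 100):.2f}")   # near 1.0
print(f"llm-like hit rate: {lru_hit_rate(llm_trace, 100):.2f}")   # near 0.0
```

The same cache that serves the hot-set workload almost perfectly is nearly useless for the one-shot workload, which is the gap the learned caching strategies below try to close.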

To address these issues, pioneering teams are exploring model-native architectures. The key insight: instead of treating the LLM as a stateless function that queries external storage, the model itself becomes the storage and routing layer.

Learned indexes replace B-trees and LSM-trees with neural networks that map keys to positions. The open-source repository `learned-indexes` (currently 3.2k stars on GitHub) demonstrates that a small neural network can outperform traditional indexes by 3-10x in lookup speed while using 70% less memory. For LLM workloads, this means embedding vectors can be stored directly in the model's parameter space, eliminating the need for a separate vector database.
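The `learned-indexes` repository's actual API is not reproduced here; the sketch below shows the core mechanism such systems rely on: fit a model from key to sorted position, then correct the prediction with a bounded local search. The function names and the synthetic key set are illustrative:

```python
import random

def fit_linear(keys):
    """Least-squares fit of position ~ a*key + b over a sorted key array,
    plus the worst-case prediction error, which bounds the search window."""
    n = len(keys)
    mx, my = sum(keys) / n, (n - 1) / 2
    cov = sum((x - mx) * (y - my) for y, x in enumerate(keys))
    var = sum((x - mx) ** 2 for x in keys)
    a = cov / var
    b = my - a * mx
    err = max(abs(round(a * x + b) - y) for y, x in enumerate(keys))
    return a, b, err

def lookup(keys, key, model):
    """Predict the position, then scan only the model's error window."""
    a, b, err = model
    guess = round(a * key + b)
    lo, hi = max(0, guess - err), min(len(keys) - 1, guess + err)
    for i in range(lo, hi + 1):
        if keys[i] == key:
            return i
    return -1

random.seed(0)
keys = sorted(random.sample(range(0, 10**9, 2), 10_000))   # even keys only
model = fit_linear(keys)
assert all(lookup(keys, k, model) == i for i, k in enumerate(keys))
print("max search window:", 2 * model[2] + 1)
```

For roughly uniform key distributions the error window stays small, so a lookup is one multiply-add plus a short scan rather than a B-tree traversal.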

Semantic routing replaces traditional load balancers. Instead of round-robin or least-connections, a lightweight model (e.g., a distilled BERT variant) inspects the semantic content of each request and routes it to the most appropriate inference endpoint. The open-source `semantic-router` library (5.1k stars) shows how this can reduce tail latency by 40% by directing complex reasoning tasks to larger models and simple queries to smaller ones.
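The `semantic-router` library has its own API; as a rough sketch of the underlying idea, route each request to the endpoint whose exemplar prompts it most resembles. The hashed bag-of-words `embed` is a toy stand-in for a real encoder, and the route names and exemplars are invented:

```python
import hashlib, math

def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

ROUTES = {
    # route name -> exemplar utterances; the centroid acts as the route's signature
    "large-model": ["prove this theorem step by step",
                    "plan a multi stage migration",
                    "debug this race condition"],
    "small-model": ["what time is it",
                    "translate hello to french",
                    "capital of japan"],
}
CENTROIDS = {
    name: [sum(col) / len(examples) for col in zip(*map(embed, examples))]
    for name, examples in ROUTES.items()
}

def route(query):
    """Send the request to the route whose exemplar centroid is closest."""
    q = embed(query)
    return max(CENTROIDS, key=lambda name: cosine(q, CENTROIDS[name]))

print(route("prove that this algorithm terminates step by step"))
```

The tail-latency win comes from the asymmetry: the router itself is tiny, so misdirecting a simple query to a small model costs little, while keeping complex reasoning off overloaded large endpoints saves a lot.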

Differentiable memory hierarchies treat the entire memory stack—from GPU HBM to DRAM to SSD—as a single, learnable system. The model learns which data to keep in fast memory and which to evict, based on actual usage patterns. The `vLLM` project (25k stars) already implements a form of this with its PagedAttention mechanism, which manages KV cache memory at the page level, achieving 2-4x higher throughput than naive implementations.
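vLLM's actual PagedAttention lives in CUDA kernels and a block manager; the simplified Python sketch below only illustrates the allocation scheme it relies on. Class and parameter names are illustrative:

```python
class PagedKVCache:
    """Simplified page-level KV cache allocator in the spirit of PagedAttention.

    Token slots are grouped into fixed-size pages drawn from a shared pool, so
    memory is reserved per page actually used instead of per max-length
    sequence, which is where the throughput gains come from.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; grab a new page only on a boundary."""
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:          # current page full, or first token
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; caller should preempt a sequence")
            self.page_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(20):                   # 20 tokens -> ceil(20 / 16) = 2 pages
    cache.append_token("seq-a")
print(len(cache.page_tables["seq-a"]), len(cache.free_pages))  # 2 used, 2 free
```

Because a sequence holds exactly the pages its tokens occupy, a naive allocator's worst-case reservation (max context length per sequence) disappears, and more sequences fit in the same HBM.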

| Architecture | Latency (p50) | Latency (p99) | Cache Hit Rate | Throughput (req/s) |
|---|---|---|---|---|
| Traditional (stateless compute + external DB) | 850ms | 2.3s | 38% | 120 |
| Model-native (learned index + semantic routing) | 210ms | 480ms | 82% | 540 |
| Hybrid (KV cache + external vector store) | 340ms | 1.1s | 67% | 310 |

Data Takeaway: Model-native architectures achieve 4x throughput and 5x lower tail latency compared to traditional designs, primarily by eliminating the network round-trip to external databases and leveraging learned caching strategies.

Key Players & Case Studies

Several organizations are at the forefront of this architectural shift, each taking a different approach.

Anthropic has been quietly redesigning its inference infrastructure around what it calls 'session-aware compute.' Their internal system, codenamed 'Meridian,' treats each agent conversation as a persistent process rather than a series of stateless requests. The model's KV cache is persisted across turns using a custom memory allocator that can page to NVMe without blocking inference. Early benchmarks show a 3x reduction in per-turn latency for complex reasoning tasks.
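Meridian itself is internal and undocumented, but the core idea described above, keeping hot sessions' KV state resident and paging idle sessions out without discarding them, can be sketched as follows. Every name here is hypothetical, and `pickle` to a temp directory stands in for a custom NVMe-backed allocator:

```python
import os, pickle, tempfile

class SessionKVStore:
    """Hypothetical sketch of session-aware KV persistence: keep hot sessions
    in RAM and page idle ones to local disk between conversation turns."""

    def __init__(self, max_resident=2):
        self.max_resident = max_resident
        self.resident = {}                   # session_id -> kv state ("fast memory")
        self.spill_dir = tempfile.mkdtemp()

    def _path(self, sid):
        return os.path.join(self.spill_dir, f"{sid}.kv")

    def put(self, sid, kv_state):
        self.resident[sid] = kv_state
        while len(self.resident) > self.max_resident:
            victim = next(iter(self.resident))       # evict oldest-inserted session
            with open(self._path(victim), "wb") as f:
                pickle.dump(self.resident.pop(victim), f)

    def get(self, sid):
        if sid in self.resident:
            return self.resident[sid]
        with open(self._path(sid), "rb") as f:       # page back in from disk
            state = pickle.load(f)
        self.put(sid, state)
        return state

store = SessionKVStore(max_resident=2)
store.put("alice", {"tokens": 512, "kv": [0.1] * 8})
store.put("bob",   {"tokens": 128, "kv": [0.2] * 8})
store.put("carol", {"tokens": 64,  "kv": [0.3] * 8})   # pages "alice" to disk
print(store.get("alice")["tokens"])                    # paged back in: 512
```

The payoff is that a resumed turn reloads a serialized KV cache instead of re-prefilling the whole conversation prefix through the model, which is where the reported per-turn latency reduction would come from.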

Google DeepMind is exploring 'differentiable databases' where the model's weights encode both computation and storage. Their `Titans` architecture (preprint, March 2025) introduces a neural memory module that can learn to store and retrieve long-term dependencies without an external index. In experiments on the LongBench benchmark, Titans achieved 92% accuracy on 100k-token contexts, compared to 78% for standard transformers with retrieval augmentation.

OpenAI has taken a different path with its `Operator` agent system. Instead of merging compute and storage, they've built a custom distributed key-value store optimized for LLM workloads. The system, described in a recent paper, uses a novel 'semantic sharding' algorithm that groups related embeddings on the same node, reducing cross-node communication by 60%.
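OpenAI has not published the sharding algorithm in detail; a minimal sketch of the general approach is to cluster embeddings and pin each cluster to one node, so a query's nearest neighbors usually live on a single shard. The plain k-means and two-dimensional toy corpus below are illustrative stand-ins:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; each resulting cluster maps to one storage node."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[min(range(k), key=lambda c: dist2(v, centroids[c]))].append(v)
        centroids = [
            [sum(col) / len(b) for col in zip(*b)] if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids

def shard_of(vec, centroids):
    """Semantic sharding: route an embedding to the node owning its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: dist2(vec, centroids[c]))

rng = random.Random(1)
# Two well-separated clusters standing in for distinct topic groups.
corpus = ([[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(100)]
          + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(100)])
centroids = kmeans(corpus, k=2)
# A query near one topic touches a single shard instead of fanning out.
print(shard_of([5.02, 4.9], centroids))
```

Hash-based sharding would scatter a query's neighbors uniformly across nodes; placing whole semantic clusters together is what cuts the cross-node traffic.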

Startups are also innovating:

- MosaicML (acquired by Databricks) pioneered 'model-native' training with its Composer framework, which treats the training cluster as a single differentiable system.
- Replicate has built a serverless inference platform that uses semantic caching to reuse intermediate results across requests, reducing costs by 40% for common prompts.
- Modal offers a 'stateful serverless' platform where functions can maintain persistent memory across invocations, ideal for multi-turn agents.
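Replicate's implementation is not public; a minimal sketch of the semantic-caching idea named above is to key cached completions by prompt embedding and serve a hit when a new prompt is similar enough. The threshold, class name, and hashed toy embedding are all illustrative:

```python
import hashlib, math

def embed(text, dim=128):
    """Toy hashed bag-of-words embedding; production would use a real encoder."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / n for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    """Serve a cached completion when a new prompt is close to a seen one."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []        # list of (embedding, response)

    def get(self, prompt):
        q = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]       # cache hit: skip the model call entirely
        return None              # miss: caller invokes the model, then put()s

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))  # near-duplicate: hit
print(cache.get("explain quantum entanglement"))     # unrelated: miss (None)
```

The cost saving scales with how many production prompts are near-duplicates; for FAQ-like traffic, a single cached completion can absorb many paraphrases.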

| Company/Product | Approach | Key Metric | Status |
|---|---|---|---|
| Anthropic (Meridian) | Session-aware compute with persistent KV cache | 3x latency reduction | Internal production |
| Google DeepMind (Titans) | Differentiable neural memory | 92% accuracy on 100k tokens | Research preprint |
| OpenAI (Operator) | Semantic sharding in custom KV store | 60% less cross-node communication | Production |
| Replicate | Semantic caching | 40% cost reduction | Production |
| Modal | Stateful serverless | Persistent memory across invocations | Production |

Data Takeaway: The most practical near-term solutions come from Anthropic and Replicate, which achieve significant gains without requiring a complete rewrite of existing infrastructure. Google's Titans approach is more radical but remains experimental.

Industry Impact & Market Dynamics

The shift toward model-native architectures is reshaping the competitive landscape across multiple layers of the AI stack.

Infrastructure providers are scrambling to adapt. AWS, Azure, and GCP all offer managed vector databases (Amazon Aurora with pgvector, Azure Cosmos DB with vector search, Google AlloyDB), but these are still built on traditional compute-storage separation. The market for vector databases alone is projected to grow from $1.5B in 2024 to $8.2B by 2028 (CAGR 40%), but this growth could be disrupted if model-native architectures eliminate the need for external vector stores.

Hardware vendors are also affected. NVIDIA's latest H200 GPUs feature 141GB of HBM3e memory, but even this is insufficient for models with 100k+ token contexts. The demand for larger, faster memory is driving investment in CXL-attached memory pools and near-storage compute. Samsung and SK Hynix are developing 'processing-in-memory' (PIM) chips that can perform vector operations directly on DRAM, reducing data movement.

Cloud cost structures are changing. In a traditional architecture, the cost breakdown is roughly 40% compute, 30% storage, 30% networking. Model-native architectures shift this to 70% compute, 20% memory, 10% networking, as the model itself absorbs storage and routing functions. This favors providers with dense GPU clusters and high-bandwidth memory over those with cheap object storage.

| Market Segment | 2024 Size | 2028 Projected | CAGR | Key Disruption Risk |
|---|---|---|---|---|
| Vector databases | $1.5B | $8.2B | 40% | High (model-native eliminates need) |
| AI inference hardware | $22B | $87B | 32% | Medium (shift to memory-bound designs) |
| Cloud AI services | $45B | $180B | 32% | Medium (cost structure shift) |
| Semantic routing middleware | $0.2B | $2.1B | 60% | Low (enables the shift) |

Data Takeaway: The vector database market faces the highest disruption risk from model-native architectures. If learned indexes become standard, the need for separate vector stores could evaporate, redirecting billions in investment.

Risks, Limitations & Open Questions

Despite the promise, model-native architectures introduce significant risks.

Debugging complexity: When compute and storage are merged, traditional debugging tools break. A model that misroutes a query or corrupts its internal memory is a black box. Teams at Anthropic have reported spending 30% of engineering time on debugging state-related issues in their Meridian system.

Security surface: Embedding storage directly into the model's parameter space creates new attack vectors. Adversarial inputs could poison the learned index, causing the model to retrieve incorrect or malicious data. The open-source community has already demonstrated 'embedding injection' attacks that can manipulate vector similarity search.

Vendor lock-in: Model-native architectures are highly specific to the underlying hardware and model architecture. Migrating from one system to another could require a complete rewrite. This contrasts with the portability of traditional stateless systems.

Scalability ceilings: Current model-native systems struggle beyond a certain scale. Google's Titans architecture, for example, has only been tested on single-node configurations. Scaling to multi-node clusters with distributed memory remains an open problem.

Energy efficiency: Merging compute and storage increases power density. A single node running a model-native system can consume 2-3x more power than a comparable traditional setup, raising cooling costs and sustainability concerns.

AINews Verdict & Predictions

The collapse of compute-storage separation is inevitable, but the transition will be messy. Our analysis leads to three clear predictions:

1. By 2027, the majority of new LLM inference deployments will use some form of model-native architecture. The performance gains are too large to ignore. Startups like Modal and Replicate will be acquired by cloud providers seeking to integrate stateful serverless capabilities.

2. Vector databases will survive but transform. They will evolve into 'hybrid memory layers' that combine traditional indexing with learned components, rather than being replaced entirely. Pinecone and Weaviate will pivot to offer 'differentiable vector stores' that can be trained end-to-end with the model.

3. Hardware will be the bottleneck. The biggest winners will be companies that solve the memory hierarchy problem—whether through CXL-attached memory, PIM chips, or novel GPU designs. NVIDIA's dominance will be challenged by startups like Groq and Cerebras that offer memory-bound architectures optimized for model-native workloads.

The old wall has fallen. The new wall will be built around memory—not as a separate resource, but as an integral part of the model itself. Engineers who understand this shift will design the next generation of AI infrastructure. Those who cling to the old rules will be left behind.
