AI Inference: Why Silicon Valley's Old Rules No Longer Apply to the New Battlefield

Hacker News May 2026
For years, the AI industry assumed that inference would follow the same cost curve as training. Our analysis reveals a fundamentally different reality: inference is latency-sensitive, constrained by memory bandwidth, and demands an entirely new software-hardware stack. This shift is reshaping chip design and the cloud.

The long-held assumption that running a large model is as cheap as training it is collapsing under the weight of real-world deployment. AI inference—the moment a model actually responds to a user—obeys a distinct economic and technical logic. Unlike training, which thrives on batch processing and tolerates high latency, inference is a real-time, interactive task. Every query must return in milliseconds, forcing systems to prioritize memory bandwidth and low-latency computation over raw FLOPS. This has triggered a fundamental divergence in the hardware market: chips like the H100, optimized for training, are suboptimal for inference. New players—Groq, Cerebras, and custom ASIC designers—are gaining ground precisely because they understand that inference needs a new architecture. Simultaneously, cloud providers are quietly shifting pricing from per-GPU-hour to per-token or per-query models—a move that signals the recognition of inference as an independent workload. The winners of the next AI phase will not be the companies that train the largest models, but those that can serve them fastest and cheapest. The rules have changed, and the industry is still learning the new game.

Technical Deep Dive

The core of the inference challenge lies in the memory wall. During training, massive batches of data flow through the GPU, keeping compute units saturated. The bottleneck is compute throughput. In inference, especially for autoregressive models like GPT-4 or Llama 3, the process is sequential: generate one token at a time, using the previous token's output as input. This serial dependency means the GPU spends most of its time waiting for data to be fetched from memory (HBM or GDDR) rather than computing. The key metric shifts from FLOPS to memory bandwidth and memory capacity.

The Memory Bandwidth Bottleneck:

For a single inference request, the model weights must be loaded from memory into the compute units for each token generation step. For a 70B-parameter model in FP16, that's 140 GB of weights. Even with HBM3e offering ~3.35 TB/s bandwidth, the theoretical minimum time to load the weights is 140 GB / 3.35 TB/s ≈ 42 milliseconds. Add attention computation, KV-cache reads/writes, and other overhead, and latency quickly climbs above 100ms—unacceptable for real-time applications. This is why techniques like quantization (INT8, FP8, FP4) and speculative decoding exist: they reduce the effective memory load per token.
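The arithmetic above generalizes into a one-line calculation. A sketch using the article's figures (70B parameters, HBM3e at 3.35 TB/s), with lower-precision variants showing why quantization attacks the latency floor directly:

```python
# Back-of-the-envelope per-token latency floor: at batch size 1, every
# decode step must stream all weights from memory once. Figures are the
# article's (70B params, 3.35 TB/s HBM3e); precisions are illustrative.

def per_token_floor_ms(params_b: float, bytes_per_param: float,
                       bandwidth_tbps: float) -> float:
    # GB divided by TB/s comes out directly in milliseconds
    return params_b * bytes_per_param / bandwidth_tbps

fp16 = per_token_floor_ms(70, 2.0, 3.35)  # ~41.8 ms, matching the text
fp8  = per_token_floor_ms(70, 1.0, 3.35)  # halving precision halves the floor
fp4  = per_token_floor_ms(70, 0.5, 3.35)  # ~10.4 ms before any overhead
```

Note this is a floor, not an estimate: attention, KV-cache traffic, and kernel launch overhead all add on top, which is how real latency exceeds 100 ms.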

Hardware Divergence:

Traditional GPUs are designed as general-purpose parallel processors. Their massive SIMT cores and high-bandwidth memory are great for training but overkill for inference. New architectures are emerging to address this:

- Groq's LPU (Language Processing Unit): Groq eliminates the memory bottleneck by using a deterministic, software-defined architecture with SRAM instead of DRAM. SRAM has 10-20x lower latency than HBM but much lower density. Groq's LPU achieves single-digit millisecond latency for large models by streaming weights from SRAM in a highly pipelined fashion. The trade-off is cost: SRAM is expensive, and scaling to very large models requires multiple LPUs working in parallel.

- Cerebras Wafer-Scale Engine (WSE): Cerebras places an entire wafer of silicon (without dicing) into a single processor. The WSE-3 has 4 trillion transistors and 44 GB of on-chip SRAM, allowing the entire model to reside on-chip. This eliminates off-chip memory access entirely, dramatically reducing latency. The challenge is thermal management and software compatibility; Cerebras has built its own compiler and runtime.

- Custom ASICs (e.g., Google TPU, Amazon Trainium/Inferentia): These are purpose-built for specific workloads. Google's TPU v5p, for example, pairs a dedicated MXU (Matrix Multiply Unit) with high-bandwidth memory; its inference efficiency comes largely from aggressive batching and model partitioning. Amazon's Inferentia2 uses a custom NeuronCore architecture with embedded SRAM for local weight storage, optimized for low-latency inference at scale.

Software Stack Evolution:

Hardware is only half the battle. The software stack must also be rethought. Key open-source projects driving this include:

- vLLM (GitHub: vllm-project/vllm, ~35k stars): Implements PagedAttention, which manages the KV-cache in non-contiguous memory blocks, reducing memory fragmentation and enabling higher throughput. It has become the de facto inference engine for many deployments.

- TensorRT-LLM (GitHub: NVIDIA/TensorRT-LLM, ~10k stars): NVIDIA's own inference optimization library, providing graph optimization, kernel fusion, and in-flight batching. It is tightly coupled with NVIDIA hardware.

- llama.cpp (GitHub: ggerganov/llama.cpp, ~70k stars): Focused on CPU and low-resource inference, using integer quantization (Q4_0, Q5_1, etc.) and efficient memory mapping. It has enabled running large models on consumer hardware.
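The idea behind vLLM's PagedAttention can be illustrated with a toy block-table allocator. This is a sketch of the concept only, not vLLM's actual implementation (`BlockPool`, `BLOCK_SIZE`, and the method names are invented here): KV-cache blocks live anywhere in a shared pool, a per-sequence table maps logical positions to physical blocks, and freed blocks are reusable immediately with no compaction.

```python
# Toy block-table allocator in the spirit of PagedAttention: the KV-cache
# for a sequence occupies fixed-size blocks scattered anywhere in a shared
# pool, so memory is never fragmented by variable-length sequences.
# Illustrative sketch only; names and structure are not vLLM's.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative size)

class BlockPool:
    """Pool of fixed-size KV-cache blocks shared across sequences."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # seq id -> list of block ids
        self.lengths = {}                    # seq id -> tokens stored

    def append_token(self, seq: str) -> int:
        """Reserve cache space for one more token of `seq`."""
        n = self.lengths.get(seq, 0)
        table = self.tables.setdefault(seq, [])
        if n % BLOCK_SIZE == 0:            # current block full (or none yet)
            table.append(self.free.pop())  # any free block will do: no contiguity
        self.lengths[seq] = n + 1
        return table[-1]                   # physical block holding this token

    def release(self, seq: str) -> None:
        """Sequence finished: its blocks return to the pool, no compaction."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

pool = BlockPool(num_blocks=8)
for _ in range(20):  # a 20-token sequence needs ceil(20/16) = 2 blocks
    pool.append_token("req-a")
```

Because blocks are fixed-size and position-independent, waste is bounded at one partial block per sequence, which is what lets vLLM pack many more concurrent requests into the same GPU memory.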

Benchmark Data:

| Model | Hardware | Batch Size | Latency (ms/token) | Throughput (tokens/s) | Cost ($/1M tokens) |
|---|---|---|---|---|---|
| Llama 3 70B | NVIDIA H100 (8x) | 1 | 45 | 22 | $1.20 |
| Llama 3 70B | Groq LPU (1x) | 1 | 8 | 125 | $0.80 |
| Llama 3 70B | Cerebras WSE-3 | 1 | 12 | 83 | $0.65 |
| Llama 3 70B | AWS Inferentia2 | 1 | 30 | 33 | $0.90 |

Data Takeaway: Groq and Cerebras achieve roughly 4-6x lower latency than an 8x H100 setup for single requests, with cost reductions of 33-46%. This is a direct result of their memory-centric architectures. For batch inference, the H100's compute advantage narrows the gap, but for real-time applications, the new architectures win decisively.
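At batch size 1, the table's latency and throughput columns are two views of the same number: a single stream generates one token per latency interval, so tokens/s is just the reciprocal of ms/token. A quick check against the table:

```python
# Sanity check on the benchmark table: at batch size 1, a single stream's
# throughput is the reciprocal of its per-token latency.

latency_ms = {  # hardware -> ms/token, from the table above
    "H100 x8":        45,
    "Groq LPU":        8,
    "Cerebras WSE-3": 12,
    "Inferentia2":    30,
}

throughput = {hw: round(1000 / ms) for hw, ms in latency_ms.items()}
# reproduces the table's tokens/s column: 22, 125, 83, 33
```

This identity only holds at batch size 1; with batching, aggregate tokens/s rises far above 1000/latency, which is precisely the regime where the H100 recovers ground.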

Key Players & Case Studies

Groq: Founded by former Google TPU engineers, Groq has positioned itself as the low-latency champion. Its LPU architecture is now available through GroqCloud, offering API access with sub-10ms latency for models like Mixtral 8x7B and Llama 3 70B. The company has raised over $1B and is reportedly working on a next-generation LPU with higher SRAM capacity. Their strategy is clear: own the real-time inference market for applications like chatbots, code completion, and voice assistants.

Cerebras: With its wafer-scale approach, Cerebras targets both training and inference. The WSE-3's massive on-chip memory eliminates the memory bandwidth bottleneck entirely. Cerebras has partnered with G42 to build large-scale AI infrastructure in the Middle East. Their CS-3 system can serve Llama 3 70B with a single chip, while H100 requires 8 GPUs. This simplicity is a strong selling point for enterprises.

NVIDIA: The incumbent is fighting back. The upcoming Blackwell architecture (B200) introduces a new Transformer Engine and a dedicated inference optimization called "Inference Micro-Tensor Core" that can perform FP4 computations. NVIDIA is also pushing its NIM (NVIDIA Inference Microservices) platform to lock customers into its ecosystem. However, the fundamental architecture remains GPU-centric, and the memory bandwidth bottleneck persists.

AMD: The MI300X offers 192 GB of HBM3 memory and 5.2 TB/s bandwidth, making it competitive for inference. AMD's ROCm software stack is maturing, but still lags CUDA in ecosystem support. The Instinct platform is gaining traction in cloud deployments, particularly for cost-sensitive inference workloads.

Comparison Table:

| Company | Architecture | Key Metric | Target Use Case | Funding/Revenue |
|---|---|---|---|---|
| NVIDIA | GPU (H100, B200) | FLOPS, HBM bandwidth | Training + Inference | $60B+ annual revenue |
| Groq | LPU (SRAM-based) | Latency, deterministic | Real-time inference | $1B+ raised |
| Cerebras | WSE (Wafer-scale) | On-chip memory | Training + Inference | $1.5B+ raised |
| Amazon | Inferentia2 (ASIC) | Cost/query, throughput | Cloud inference | Internal use + AWS |
| Google | TPU v5p (ASIC) | Matrix compute | Training + Inference | Internal use + GCP |

Data Takeaway: NVIDIA dominates revenue but is vulnerable in the low-latency inference niche. Groq and Cerebras are well-funded and growing, but their market share remains tiny. The real battle will be won in the cloud, where pricing and latency directly impact user experience.

Industry Impact & Market Dynamics

The shift from training-centric to inference-centric thinking is reshaping the entire AI stack:

1. Cloud Pricing Revolution:

Traditional cloud GPU pricing ($2-4 per H100 hour) is being replaced by token-based pricing. OpenAI charges $0.01 per 1K tokens for GPT-4o, while Groq charges $0.80 per 1M tokens for Llama 3 70B. This aligns cost with actual usage, not hardware reservation. The market for inference-as-a-service is projected to grow from $6B in 2024 to $50B by 2028 (a CAGR of roughly 70%), according to industry estimates.
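To see why the pricing models diverge, it helps to convert reserved GPU pricing into per-token terms. The sketch below uses illustrative numbers: a $3 per-H100-hour rate (within the $2-4 range above) and the 22 tokens/s single-stream figure from the benchmark table; the function name and parameters are invented for this example.

```python
# Converting per-GPU-hour pricing into per-token terms. Assumed figures:
# $3/H100-hour, 8 GPUs per node, 22 tokens/s for a single Llama 3 70B
# stream (from the benchmark table above).

def reserved_cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                                     tokens_per_second: float,
                                     utilization: float) -> float:
    tokens_per_hour = tokens_per_second * 3600 * utilization
    node_hourly_usd = gpu_hourly_usd * num_gpus
    return node_hourly_usd / tokens_per_hour * 1_000_000

# A single low-latency stream leaves the reserved node almost idle:
single_stream = reserved_cost_per_million_tokens(3.0, 8, 22, 1.0)
# -> ~$303 per 1M tokens; only heavy batching (hundreds of concurrent
#    streams) drives the effective rate down toward the $1.20 figure.
```

The gap between ~$303 for a dedicated stream and ~$1 at full batch utilization is the economic argument for per-token pricing: the provider absorbs the batching problem, and the customer pays only for tokens delivered.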

2. Hardware Market Fragmentation:

The inference market is not a single market. It segments into:
- Real-time inference (chat, voice): requires <50ms latency, favors Groq/Cerebras
- Batch inference (data processing, summarization): can tolerate seconds, favors H100/TPU
- Edge inference (on-device): requires low power, favors Qualcomm, Apple Neural Engine, or custom NPUs

Each segment demands different hardware. The era of one-size-fits-all GPUs is ending.

3. Business Model Shift:

Companies that own the inference stack—from hardware to API—capture the most value. This is why OpenAI is reportedly designing its own inference chip, and why Microsoft is investing in custom silicon. The margin on inference is higher than training because it's recurring and usage-driven.

Market Data Table:

| Segment | 2024 Market Size | 2028 Projected Size | Key Drivers |
|---|---|---|---|
| Cloud Inference | $6B | $50B | Real-time AI apps, LLM adoption |
| Edge Inference | $4B | $25B | On-device AI, privacy concerns |
| Training Hardware | $40B | $80B | Model scaling, foundation models |

Data Takeaway: Inference is the fastest-growing segment, far outpacing training hardware growth. On these projections, the combined cloud and edge inference market (~$75B by 2028) nearly matches training hardware (~$80B), putting inference on track to become the primary battleground.

Risks, Limitations & Open Questions

1. The SRAM Scalability Problem: Groq and Cerebras rely on SRAM, which is expensive and has limited density. Scaling to trillion-parameter models will require either massive multi-chip configurations (increasing latency) or a breakthrough in memory technology. If NVIDIA's HBM4 (expected 2026) offers 6 TB/s bandwidth, the gap may narrow.
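The chip-count math behind this concern is straightforward. The sketch below assumes FP8 weights (1 byte/parameter); the WSE-3's ~44 GB comes from the article, while the ~230 MB per Groq chip is an outside, approximate public figure and should be treated as illustrative.

```python
# Rough chip-count math behind the SRAM scaling concern. On-chip SRAM
# capacities are approximate (Cerebras WSE-3 ~44 GB from the article;
# Groq ~230 MB/chip is an assumed outside figure). FP8 = 1 byte/param.

def chips_needed(params_billions: float, bytes_per_param: float,
                 sram_gb_per_chip: float) -> int:
    model_gb = params_billions * bytes_per_param
    return int(-(-model_gb // sram_gb_per_chip))  # ceiling division

wse3_for_1t = chips_needed(1000, 1.0, 44)   # 1T params -> ~23 wafers
lpu_for_70b = chips_needed(70, 1.0, 0.23)   # 70B params -> ~305 LPUs
```

Hundreds of chips per model is workable for a deterministic pipelined design, but every added hop adds latency, which is why a density breakthrough (or HBM4 closing the bandwidth gap) matters so much here.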

2. Software Lock-In: Each new architecture requires its own compiler, runtime, and optimization toolkit. Developers are reluctant to abandon CUDA's mature ecosystem. Groq's and Cerebras's software stacks are improving but still lack the breadth of NVIDIA's.

3. The Quantization Trade-Off: INT4 and FP4 quantization reduce memory load but degrade model quality. For safety-critical applications (medical, legal), this trade-off may be unacceptable. The industry needs better quantization-aware training techniques.
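The trade-off can be seen in miniature with symmetric round-to-nearest quantization. This is a deliberately simplified sketch; production INT4/FP4 schemes use per-group scales, calibration data, and often quantization-aware training, but the bits-versus-error relationship is the same.

```python
# The quantization trade-off in miniature: symmetric round-to-nearest
# quantization of a weight vector at 8 vs 4 bits. Fewer bits means a
# coarser grid and larger reconstruction error. Simplified sketch only;
# real schemes use per-group scales and calibration.

def quantize_dequantize(weights: list[float], bits: int) -> list[float]:
    qmax = 2 ** (bits - 1) - 1                   # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax  # one scale per tensor
    return [round(w / scale) * scale for w in weights]

weights = [0.731, -0.442, 0.215, -0.098, 0.007, 0.963]
err8 = max(abs(w - q) for w, q in zip(weights, quantize_dequantize(weights, 8)))
err4 = max(abs(w - q) for w, q in zip(weights, quantize_dequantize(weights, 4)))
# worst-case error is bounded by scale/2, so dropping from 8 to 4 bits
# inflates it by roughly the ratio of the scales (~18x here)
```

The per-weight error bound of scale/2 is why outlier weights are so damaging: a single large value inflates the scale for the whole group, which is what per-group scaling and outlier-aware methods exist to contain.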

4. Energy Efficiency: While new architectures offer lower latency, their energy per token is not always better. Groq's LPU consumes ~100W per chip, but a full system with multiple chips can draw 1-2 kW. Data center operators are increasingly focused on total cost of ownership, including power.
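Energy per token is simply power divided by throughput. The sketch below uses illustrative figures: a ~1.5 kW multi-LPU system at the benchmark table's 125 tokens/s, against an 8x H100 node at an assumed ~700 W per GPU serving a single 22 tokens/s stream; the conclusion flips with batching, which is the point.

```python
# Energy per token = power / throughput. Illustrative figures only:
# ~1.5 kW LPU system at 125 tok/s (from the text and table above),
# 8 GPUs at an assumed ~700 W each serving one 22 tok/s stream.

def joules_per_token(system_watts: float, tokens_per_second: float) -> float:
    # watts are joules per second, so W / (tokens/s) = J/token
    return system_watts / tokens_per_second

lpu_system = joules_per_token(1500, 125)  # multi-chip LPU system, one stream
gpu_node   = joules_per_token(5600, 22)   # 8x H100 node, one stream
# At batch size 1 the LPU system wins comfortably; heavy batching
# multiplies the GPU node's tokens/s without proportionally raising
# its power draw, which can flip the comparison.
```

This is why "lower latency" and "lower energy per token" are not the same claim, and why operators evaluate both against their actual batching profile.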

5. The Open Question: Will the market consolidate around a few dominant inference architectures, or will it remain fragmented? History suggests that standardization wins (e.g., x86, ARM, CUDA), but the inference market's diversity may prevent a single winner.

AINews Verdict & Predictions

Our Verdict: The inference revolution is real, and it is the most underappreciated trend in AI today. The assumption that training dominance translates to inference dominance is false. NVIDIA is the incumbent, but its GPU architecture is not optimal for the dominant use case of the future: real-time, low-latency inference. Groq and Cerebras have a genuine architectural advantage that will not be easily replicated.

Predictions:

1. By 2027, at least one of Groq or Cerebras will be acquired by a major cloud provider (AWS, Google, Microsoft) for $10B+. The technology is too valuable to leave independent, and the cloud providers need to reduce dependence on NVIDIA.

2. NVIDIA will introduce a dedicated inference chip (not a GPU) by 2026, likely based on a simplified architecture with massive SRAM. The Blackwell B200's inference optimizations are a step, but not enough.

3. Token-based pricing will become the standard for all AI cloud services within 18 months. The per-GPU-hour model will be relegated to training workloads only.

4. The next frontier will be inference at the edge. Apple, Qualcomm, and Google will battle for on-device inference supremacy, driven by privacy and latency requirements for AR/VR and real-time translation.

5. The company that achieves the lowest cost per token at scale will become the de facto infrastructure layer for AI applications. This is a winner-take-most market, and the race is on.

What to Watch: The next generation of Groq's LPU (expected 2025), Cerebras's WSE-4, and NVIDIA's dedicated inference chip. Also, watch for any major model provider (OpenAI, Anthropic, Meta) announcing their own inference hardware—that would be the ultimate signal that the old rules are dead.
