AI Inference: Why Silicon Valley's Old Rules No Longer Apply to the New Battlefield

Source: Hacker News | Topic: AI inference | Archive: May 2026
For years, the AI industry assumed that inference would follow the same cost curve as training. Our analysis reveals a fundamentally different reality: inference is latency-sensitive, memory-bandwidth-bound, and demands an entirely new hardware and software stack. This shift is redefining chip design and the cloud.

The long-held assumption that serving a large model follows the same economics as training it is collapsing under the weight of real-world deployment. AI inference—the moment a model actually responds to a user—obeys a distinct economic and technical logic. Unlike training, which thrives on batch processing and tolerates high latency, inference is a real-time, interactive task. Every query must return in milliseconds, forcing systems to prioritize memory bandwidth and low-latency computation over raw FLOPS. This has triggered a fundamental divergence in the hardware market: chips like the H100, optimized for training, are suboptimal for inference. New players—Groq, Cerebras, and custom ASIC designers—are gaining ground precisely because they understand that inference needs a new architecture. Simultaneously, cloud providers are quietly shifting pricing from per-GPU-hour to per-token or per-query models—a move that signals the recognition of inference as an independent workload. The winners of the next AI phase will not be the companies that train the largest models, but those that can serve them fastest and cheapest. The rules have changed, and the industry is still learning the new game.

Technical Deep Dive

The core of the inference challenge lies in the memory wall. During training, massive batches of data flow through the GPU, keeping compute units saturated. The bottleneck is compute throughput. In inference, especially for autoregressive models like GPT-4 or Llama 3, the process is sequential: generate one token at a time, using the previous token's output as input. This serial dependency means the GPU spends most of its time waiting for data to be fetched from memory (HBM or GDDR) rather than computing. The key metric shifts from FLOPS to memory bandwidth and memory capacity.

The Memory Bandwidth Bottleneck:

For a single inference request, the model weights must be loaded from memory into the compute units for each token generation step. For a 70B-parameter model in FP16, that's 140 GB of weights. Even with the H100's HBM3 offering ~3.35 TB/s of bandwidth, the theoretical minimum time to load the weights is 140 GB / 3.35 TB/s ≈ 42 milliseconds per token. Add attention computation, KV-cache reads/writes, and other overhead, and latency quickly climbs above 100 ms—unacceptable for real-time applications. This is why techniques like quantization (INT8, FP8, FP4) and speculative decoding exist: they reduce the effective memory load per token.
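To make the arithmetic concrete, the sketch below computes this bandwidth-bound latency floor for a few precisions, using the bandwidth figures quoted in this article. It deliberately ignores KV-cache traffic and attention compute, so treat the results as lower bounds only.

```python
# Back-of-the-envelope lower bound on per-token decode latency for a
# memory-bandwidth-bound autoregressive model (single request, batch size 1).
# Assumes every weight is read from memory once per generated token and
# ignores KV-cache traffic, attention compute, and kernel launch overhead.

PARAMS = 70e9  # Llama 3 70B

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4/INT4": 0.5}
BANDWIDTH_TBPS = {"H100 HBM3": 3.35, "MI300X HBM3": 5.2}  # figures quoted in this article

for hw, bw_tbps in BANDWIDTH_TBPS.items():
    for dtype, bytes_per_param in BYTES_PER_PARAM.items():
        latency_s = PARAMS * bytes_per_param / (bw_tbps * 1e12)
        print(f"{hw:12s} {dtype:8s} >= {latency_s * 1e3:5.1f} ms/token "
              f"(<= {1 / latency_s:4.0f} tokens/s)")
```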

Hardware Divergence:

Traditional GPUs are designed as general-purpose parallel processors. Their massive SIMT cores and high-bandwidth memory are great for training but overkill for inference. New architectures are emerging to address this:

- Groq's LPU (Language Processing Unit): Groq eliminates the memory bottleneck by using a deterministic, software-defined architecture with SRAM instead of DRAM. SRAM has 10-20x lower latency than HBM but much lower density. Groq's LPU achieves single-digit millisecond latency for large models by streaming weights from SRAM in a highly pipelined fashion. The trade-off is cost: SRAM is expensive, and scaling to very large models requires multiple LPUs working in parallel.

- Cerebras Wafer-Scale Engine (WSE): Cerebras places an entire wafer of silicon (without dicing) into a single processor. The WSE-3 has 4 trillion transistors and 44 GB of on-chip SRAM, allowing models that fit within that budget to reside entirely on-chip. This eliminates off-chip memory access for those models, dramatically reducing latency. The challenge is thermal management and software compatibility; Cerebras has built its own compiler and runtime.

- Custom ASICs (e.g., Google TPU, Amazon Trainium/Inferentia): These are purpose-built for specific workloads. Google's TPU v5p, for example, has a dedicated MXU (Matrix Multiply Unit) and high-bandwidth memory, but its inference efficiency is improved through batching and model partitioning. Amazon's Inferentia2 uses a custom NeuronCore architecture with embedded SRAM for local weight storage, optimized for low-latency inference at scale.

Software Stack Evolution:

Hardware is only half the battle. The software stack must also be rethought. Key open-source projects driving this include:

- vLLM (GitHub: vllm-project/vllm, ~35k stars): Implements PagedAttention, which manages the KV-cache in non-contiguous memory blocks, reducing memory fragmentation and enabling higher throughput. It has become the de facto inference engine for many deployments (a minimal usage sketch follows this list).

- TensorRT-LLM (GitHub: NVIDIA/TensorRT-LLM, ~10k stars): NVIDIA's own inference optimization library, providing graph optimization, kernel fusion, and in-flight batching. It is tightly coupled with NVIDIA hardware.

- llama.cpp (GitHub: ggerganov/llama.cpp, ~70k stars): Focused on CPU and low-resource inference, using integer quantization (Q4_0, Q5_1, etc.) and efficient memory mapping. It has enabled running large models on consumer hardware.
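For orientation, here is a minimal offline-inference sketch using vLLM's public Python API (`LLM`, `SamplingParams`, `generate`). The model checkpoint and sampling settings are illustrative placeholders rather than recommendations.

```python
# Minimal offline batch inference with vLLM's PagedAttention-backed engine.
# The model ID and sampling settings are illustrative; any HF-compatible
# checkpoint that fits in available GPU memory will do.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain why autoregressive decoding is memory-bandwidth bound.",
    "Summarize how inference workloads differ from training workloads.",
]

# vLLM schedules these requests together; PagedAttention keeps each sequence's
# KV-cache in non-contiguous blocks so concurrent requests do not fragment memory.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```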

Benchmark Data:

| Model | Hardware | Batch Size | Latency (ms/token) | Throughput (tokens/s) | Cost ($/1M tokens) |
|---|---|---|---|---|---|
| Llama 3 70B | NVIDIA H100 (8x) | 1 | 45 | 22 | $1.20 |
| Llama 3 70B | Groq LPU (1x) | 1 | 8 | 125 | $0.80 |
| Llama 3 70B | Cerebras WSE-3 | 1 | 12 | 83 | $0.65 |
| Llama 3 70B | AWS Inferentia2 | 1 | 30 | 33 | $0.90 |

Data Takeaway: Groq and Cerebras achieve roughly 4-6x lower latency than an 8x H100 setup for single requests, with a 33-46% cost reduction. This is a direct result of their memory-centric architectures. For batch inference, H100's compute advantage narrows the gap, but for real-time applications, the new architectures win decisively.
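For readers who want to reproduce the logic behind the cost column, the snippet below converts an hourly instance price and a sustained throughput into a $/1M-token figure. The hourly rate and throughput shown are hypothetical placeholders, not numbers taken from the benchmark.

```python
# How a $/1M-token price relates to instance cost and sustained throughput.
# The hourly rate and throughput below are hypothetical placeholders.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# e.g. a hypothetical $20/hour multi-GPU node sustaining 4,600 tokens/s across
# many concurrent requests works out to roughly $1.21 per million tokens.
print(f"${cost_per_million_tokens(20.0, 4_600):.2f} per 1M tokens")
```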

Key Players & Case Studies

Groq: Founded by former Google TPU engineers, Groq has positioned itself as the low-latency champion. Its LPU architecture is now available through GroqCloud, offering API access with sub-10ms per-token latency for models like Mixtral 8x7B and Llama 3 70B. The company has raised over $1B and is reportedly working on a next-generation LPU with higher SRAM capacity. Their strategy is clear: own the real-time inference market for applications like chatbots, code completion, and voice assistants.

Cerebras: With its wafer-scale approach, Cerebras targets both training and inference. The WSE-3's massive on-chip memory eliminates the memory bandwidth bottleneck entirely. Cerebras has partnered with G42 to build large-scale AI infrastructure in the Middle East. Their CS-3 system can serve Llama 3 70B with a single chip, while H100 requires 8 GPUs. This simplicity is a strong selling point for enterprises.

NVIDIA: The incumbent is fighting back. The upcoming Blackwell architecture (B200) introduces a second-generation Transformer Engine with micro-tensor scaling that enables FP4 computation for inference. NVIDIA is also pushing its NIM (NVIDIA Inference Microservices) platform to lock customers into its ecosystem. However, the fundamental architecture remains GPU-centric, and the memory bandwidth bottleneck persists.

AMD: The MI300X offers 192 GB of HBM3 memory and 5.2 TB/s bandwidth, making it competitive for inference. AMD's ROCm software stack is maturing, but still lags CUDA in ecosystem support. The Instinct platform is gaining traction in cloud deployments, particularly for cost-sensitive inference workloads.

Comparison Table:

| Company | Architecture | Key Metric | Target Use Case | Funding/Revenue |
|---|---|---|---|---|
| NVIDIA | GPU (H100, B200) | FLOPS, HBM bandwidth | Training + Inference | $60B+ annual revenue |
| Groq | LPU (SRAM-based) | Latency, deterministic | Real-time inference | $1B+ raised |
| Cerebras | WSE (Wafer-scale) | On-chip memory | Training + Inference | $1.5B+ raised |
| Amazon | Inferentia2 (ASIC) | Cost/query, throughput | Cloud inference | Internal use + AWS |
| Google | TPU v5p (ASIC) | Matrix compute | Training + Inference | Internal use + GCP |

Data Takeaway: NVIDIA dominates revenue but is vulnerable in the low-latency inference niche. Groq and Cerebras are well-funded and growing, but their market share remains tiny. The real battle will be won in the cloud, where pricing and latency directly impact user experience.

Industry Impact & Market Dynamics

The shift from training-centric to inference-centric thinking is reshaping the entire AI stack:

1. Cloud Pricing Revolution:

Traditional cloud GPU pricing ($2-4 per H100 hour) is being replaced by token-based pricing. OpenAI charges $0.01 per 1K tokens for GPT-4o, while Groq charges $0.80 per 1M tokens for Llama 3 70B. This aligns cost with actual usage, not hardware reservation. The market for inference-as-a-service is projected to grow from $6B in 2024 to $50B by 2028 (a CAGR of roughly 70%), according to industry estimates.
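As a quick consistency check on that projection, compounding $6B (2024) into $50B (2028) over the four intervening years implies the growth rate computed below.

```python
# Consistency check on the quoted projection: $6B (2024) -> $50B (2028),
# compounded over the four intervening years.
start_b, end_b, years = 6.0, 50.0, 2028 - 2024
cagr = (end_b / start_b) ** (1 / years) - 1
print(f"CAGR ~ {cagr:.0%}")  # prints: CAGR ~ 70%
```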

2. Hardware Market Fragmentation:

The inference market is not a single market. It segments into:
- Real-time inference (chat, voice): requires <50ms latency, favors Groq/Cerebras
- Batch inference (data processing, summarization): can tolerate seconds, favors H100/TPU
- Edge inference (on-device): requires low power, favors Qualcomm, Apple Neural Engine, or custom NPUs

Each segment demands different hardware. The era of one-size-fits-all GPUs is ending.

3. Business Model Shift:

Companies that own the inference stack—from hardware to API—capture the most value. This is why OpenAI is reportedly designing its own inference chip, and why Microsoft is investing in custom silicon. The margin on inference is higher than training because it's recurring and usage-driven.

Market Data Table:

| Segment | 2024 Market Size | 2028 Projected Size | Key Drivers |
|---|---|---|---|
| Cloud Inference | $6B | $50B | Real-time AI apps, LLM adoption |
| Edge Inference | $4B | $25B | On-device AI, privacy concerns |
| Training Hardware | $40B | $80B | Model scaling, foundation models |

Data Takeaway: Inference is the fastest-growing segment, outpacing training hardware growth. The combined cloud and edge inference market is on track to rival the training-hardware market by 2028, making inference the primary battleground.

Risks, Limitations & Open Questions

1. The SRAM Scalability Problem: Groq and Cerebras rely on SRAM, which is expensive and has limited density. Scaling to trillion-parameter models will require either massive multi-chip configurations (increasing latency) or a breakthrough in memory technology. If NVIDIA's HBM4 (expected 2026) offers 6 TB/s bandwidth, the gap may narrow.

2. Software Lock-In: Each new architecture requires its own compiler, runtime, and optimization toolkit. Developers are reluctant to abandon CUDA's mature ecosystem. Groq's and Cerebras's software stacks are improving but still lack the breadth of NVIDIA's.

3. The Quantization Trade-Off: INT4 and FP4 quantization reduce memory load but degrade model quality. For safety-critical applications (medical, legal), this trade-off may be unacceptable. The industry needs better quantization-aware training techniques.
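To illustrate where that quality loss comes from, here is a toy sketch of post-training symmetric quantization applied to a random weight matrix. Production engines use per-channel or group-wise scales and, increasingly, quantization-aware training, so this only demonstrates the rounding error involved.

```python
# Toy post-training quantization of a weight matrix, showing the rounding
# error behind the quality/latency trade-off discussed above. Per-tensor
# symmetric scaling only; real engines use per-channel or group-wise scales.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # stand-in FP32 weights

def quantize_symmetric(weights: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = np.abs(weights).max() / qmax    # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

for bits in (8, 4):
    q, scale = quantize_symmetric(w, bits)
    mean_abs_err = np.abs(w - q * scale).mean()
    print(f"INT{bits}: mean abs error {mean_abs_err:.2e}, "
          f"weights at {bits / 32:.0%} of FP32 size")
```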

4. Energy Efficiency: While new architectures offer lower latency, their energy per token is not always better. Groq's LPU consumes ~100W per chip, but a full system with multiple chips can draw 1-2 kW. Data center operators are increasingly focused on total cost of ownership, including power.
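The relevant comparison metric is energy per token: sustained system power divided by sustained throughput. The figures in the sketch below are hypothetical and serve only to show the calculation.

```python
# Energy per generated token: sustained system power divided by throughput.
# Both figures below are hypothetical, for illustration only.

def joules_per_token(system_watts: float, tokens_per_second: float) -> float:
    return system_watts / tokens_per_second

# A hypothetical 1.5 kW multi-chip system at 500 tokens/s spends 3.0 J/token;
# a hypothetical 700 W GPU server at 350 tokens/s spends 2.0 J/token.
print(joules_per_token(1500, 500), joules_per_token(700, 350))
```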

5. The Open Question: Will the market consolidate around a few dominant inference architectures, or will it remain fragmented? History suggests that standardization wins (e.g., x86, ARM, CUDA), but the inference market's diversity may prevent a single winner.

AINews Verdict & Predictions

Our Verdict: The inference revolution is real, and it is the most underappreciated trend in AI today. The assumption that training dominance translates to inference dominance is false. NVIDIA is the incumbent, but its GPU architecture is not optimal for the dominant use case of the future: real-time, low-latency inference. Groq and Cerebras have a genuine architectural advantage that will not be easily replicated.

Predictions:

1. By 2027, at least one of Groq or Cerebras will be acquired by a major cloud provider (AWS, Google, Microsoft) for $10B+. The technology is too valuable to leave independent, and the cloud providers need to reduce dependence on NVIDIA.

2. NVIDIA will introduce a dedicated inference chip (not a GPU) by 2026, likely based on a simplified architecture with massive SRAM. The Blackwell B200's inference optimizations are a step, but not enough.

3. Token-based pricing will become the standard for all AI cloud services within 18 months. The per-GPU-hour model will be relegated to training workloads only.

4. The next frontier will be inference at the edge. Apple, Qualcomm, and Google will battle for on-device inference supremacy, driven by privacy and latency requirements for AR/VR and real-time translation.

5. The company that achieves the lowest cost per token at scale will become the de facto infrastructure layer for AI applications. This is a winner-take-most market, and the race is on.

What to Watch: The next generation of Groq's LPU (expected 2025), Cerebras's WSE-4, and NVIDIA's dedicated inference chip. Also, watch for any major model provider (OpenAI, Anthropic, Meta) announcing their own inference hardware—that would be the ultimate signal that the old rules are dead.


