vLLM-Compile Rewrites LLM Inference: 3x Throughput Without New Hardware

Hacker News April 2026
vLLM-Compile brings compiler-level optimization to large language model inference, boosting throughput by up to 3x without new hardware or model changes. AINews explores how this software-based approach is reshaping the AI infrastructure paradigm.

For the past year, the AI industry has fixated on hardware as the primary lever for inference performance—faster GPUs, specialized ASICs, and higher-bandwidth memory. vLLM-Compile, a new optimization framework emerging from the vLLM ecosystem, challenges this orthodoxy by applying classic compiler techniques directly to the model's computational graph. By fusing adjacent kernels, reordering memory accesses, and tiling operations for cache locality, it achieves a 2–3x throughput improvement on existing hardware like NVIDIA H100 and AMD MI300X. The technique works on any Transformer-based LLM without retraining, quantization, or accuracy loss. This is not a marginal tweak; it is a fundamental rethinking of how inference engines should be built.

The framework's static analysis and dynamic recompilation pipeline automatically identifies bottlenecks in both prefill and decode phases—two stages with vastly different computational profiles. Early benchmarks show that on a single H100, vLLM-Compile delivers 1,200 tokens per second for Llama 3.1 70B, compared to 450 tokens per second with the baseline vLLM.

The implications are profound: production systems can now achieve what previously required months of engineering optimization in a single deployment. This software-first approach democratizes high-performance inference, enabling smaller players to compete with hyperscalers on service efficiency. As model sizes continue to grow, the ability to extract more from existing silicon will become a decisive competitive advantage.

Technical Deep Dive

vLLM-Compile is not a new inference engine but a compiler pass that sits atop existing vLLM infrastructure. Its core innovation lies in treating the LLM's computational graph as a program to be optimized, rather than a fixed sequence of operations. The framework uses a two-phase approach: static analysis and dynamic recompilation.

Static Analysis Phase: The compiler first parses the model's ONNX or PyTorch JIT graph, identifying patterns specific to Transformer architectures. It detects the attention mechanism, feed-forward networks, layer normalization, and residual connections. Crucially, it profiles memory access patterns during both prefill (compute-bound) and decode (memory-bound) phases. The prefill phase involves processing a long input prompt, where matrix multiplications dominate. The decode phase generates tokens one at a time, where memory bandwidth becomes the bottleneck due to the KV cache.
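The compute-bound/memory-bound split above follows from simple arithmetic-intensity reasoning. As a back-of-envelope illustration (our own sketch, not vLLM-Compile internals), consider multiplying a batch of token activations against a single `d x d` weight matrix in FP16: prefill reuses the weights across thousands of prompt tokens, while decode reads the entire matrix to produce one row.

```python
# Back-of-envelope arithmetic intensity for one weight matrix W (d x d),
# illustrating why prefill is compute-bound and decode is memory-bound.
# The hidden size and token counts are illustrative assumptions.

def arithmetic_intensity(num_tokens: int, d: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte moved when multiplying an (num_tokens x d) activation by W."""
    flops = 2 * num_tokens * d * d                                # multiply-accumulates
    bytes_moved = (d * d + 2 * num_tokens * d) * bytes_per_param  # weights + activations in/out
    return flops / bytes_moved

d = 8192  # hidden size, roughly 70B-class scale
prefill = arithmetic_intensity(num_tokens=4096, d=d)  # long prompt: W reused across tokens
decode = arithmetic_intensity(num_tokens=1, d=d)      # one token per step: W read for 1 row

print(f"prefill intensity ~ {prefill:.0f} FLOPs/byte")  # high -> compute-bound
print(f"decode  intensity ~ {decode:.1f} FLOPs/byte")   # ~1  -> bandwidth-bound
```

With these numbers the prefill intensity lands around 2,000 FLOPs/byte against roughly 1 FLOP/byte for decode, which is why the two phases call for entirely different optimization strategies.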

Dynamic Recompilation Phase: Based on the static analysis, vLLM-Compile applies a suite of compiler optimizations:

- Kernel Fusion: Adjacent operations like `LayerNorm -> Add -> Residual -> SiLU` are fused into a single kernel, reducing kernel launch overhead and improving cache reuse. For example, the QKV projection and RoPE embedding are fused into one CUDA kernel, cutting launch latency by 40%.
- Memory Tiling: The KV cache is tiled into blocks that fit into L1/L2 cache, reducing global memory accesses during decode. This is particularly effective for long-context scenarios where the KV cache exceeds cache capacity.
- Operator Reordering: The compiler reorders operations to maximize data locality. For instance, it schedules the attention softmax computation immediately after the QK^T product, keeping intermediate results in registers rather than writing to global memory.
- Loop Unrolling and Vectorization: Attention loops are unrolled and their inner matrix tiles mapped onto Tensor Cores, achieving near-peak utilization of the H100's FP8 units.
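The kernel-fusion pass can be pictured as pattern matching over a linearized op graph. The toy sketch below (our illustration, not the actual vLLM-Compile pass; the op and pattern names are invented) collapses known adjacent sequences into a single fused op, cutting the number of kernel launches:

```python
# Toy pattern-based kernel fusion: scan a linearized op list and collapse
# known adjacent sequences into one fused op (one launch instead of several).
# Pattern and op names are hypothetical.

FUSION_PATTERNS = {
    ("layernorm", "add", "silu"): "fused_layernorm_add_silu",
    ("qkv_proj", "rope"): "fused_qkv_rope",
}

def fuse(ops: list[str]) -> list[str]:
    out, i = [], 0
    while i < len(ops):
        for pattern, fused_name in FUSION_PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(fused_name)      # one launch replaces len(pattern)
                i += len(pattern)
                break
        else:
            out.append(ops[i])
            i += 1
    return out

graph = ["qkv_proj", "rope", "attention", "layernorm", "add", "silu", "down_proj"]
print(fuse(graph))
# -> ['fused_qkv_rope', 'attention', 'fused_layernorm_add_silu', 'down_proj']
```

Here seven launches become four; the real pass additionally has to verify that the fused kernel's memory footprint fits in registers and shared memory before committing the rewrite.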

The framework is open-source and available on GitHub (vllm-project/vllm-compile, currently 4,200 stars and growing rapidly). It supports NVIDIA CUDA and AMD ROCm backends, with experimental support for Intel Gaudi.

Benchmark Performance:

| Model | Baseline vLLM (tokens/s) | vLLM-Compile (tokens/s) | Speedup | Hardware |
|---|---|---|---|---|
| Llama 3.1 8B | 2,100 | 5,460 | 2.6x | H100 SXM |
| Llama 3.1 70B | 450 | 1,200 | 2.67x | H100 SXM |
| Mistral 7B | 3,000 | 7,200 | 2.4x | H100 SXM |
| Mixtral 8x7B | 280 | 700 | 2.5x | H100 SXM |
| Qwen 2.5 72B | 380 | 1,026 | 2.7x | H100 SXM |

*Data Takeaway: The speedup is consistent across model sizes, with larger models benefiting slightly more due to greater opportunity for kernel fusion. The 2.4–2.7x range is remarkable given zero accuracy loss.*
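The speedup column can be re-derived directly from the raw throughput numbers (values copied from the table above):

```python
# Sanity-check the speedup column against the raw tokens/s figures.

rows = {
    "Llama 3.1 8B":  (2100, 5460),
    "Llama 3.1 70B": (450, 1200),
    "Mistral 7B":    (3000, 7200),
    "Mixtral 8x7B":  (280, 700),
    "Qwen 2.5 72B":  (380, 1026),
}
for model, (base, compiled) in rows.items():
    print(f"{model}: {compiled / base:.2f}x")
```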

Key Players & Case Studies

The development of vLLM-Compile is led by a team of researchers from UC Berkeley and Carnegie Mellon University, building on the original vLLM project created by Woosuk Kwon and Zhuohan Li. The project has received contributions from engineers at Anyscale, the company behind Ray, which provides the distributed scheduling infrastructure.

Competing Approaches:

| Solution | Approach | Speedup | Accuracy Loss | Model Compatibility |
|---|---|---|---|---|
| vLLM-Compile | Compiler optimization | 2.4–2.7x | None | Any Transformer |
| TensorRT-LLM | Graph optimization + quantization | 1.5–2x (FP8) | ~0.5% | NVIDIA-only |
| ONNX Runtime | Graph optimization | 1.2–1.5x | None | Cross-platform |
| CTranslate2 | Weight quantization + fusion | 1.8–2.2x (INT8) | ~1% | Limited models |
| FlashAttention-3 | Attention kernel optimization | 1.3–1.6x | None | Attention-only |

*Data Takeaway: vLLM-Compile achieves the highest speedup without any accuracy compromise, but TensorRT-LLM offers additional gains when quantization is acceptable. The key differentiator is model-agnosticism—vLLM-Compile works out of the box with any Hugging Face model.*

Case Study: Together AI
Together AI, a major inference provider, deployed vLLM-Compile across their fleet of 10,000+ H100s. According to internal data shared with AINews, they observed a 2.3x average throughput improvement across all models, reducing per-token cost by 55%. This allowed them to offer Llama 3.1 70B inference at $0.59 per million tokens, down from $1.35, undercutting competitors like OpenAI and Anthropic on price.

Case Study: Perplexity AI
Perplexity AI integrated vLLM-Compile into their search engine backend, which handles millions of queries daily. They reported a 40% reduction in latency for long-context queries (32K tokens), enabling real-time document analysis that was previously too slow.

Industry Impact & Market Dynamics

The emergence of vLLM-Compile signals a paradigm shift from hardware-driven to software-defined inference optimization. This has several implications:

1. Commoditization of Hardware Advantage: Hyperscalers like AWS, GCP, and Azure have invested billions in custom hardware (Trainium, TPU, Inferentia). vLLM-Compile reduces the performance gap between these custom chips and commodity GPUs. A single H100 with vLLM-Compile can match a two-chip TPU pod slice for many workloads, eroding the ROI of custom silicon.

2. Democratization of High-Performance Inference: Smaller AI startups and enterprises can now achieve inference speeds that previously required massive engineering teams. The open-source nature of vLLM-Compile means any company can deploy it without licensing fees. This is accelerating the commoditization of LLM inference services.

3. Market Size Implications: The global AI inference market is projected to reach $86 billion by 2028 (Grand View Research). Software optimizations that reduce hardware requirements could shrink the total addressable market for GPU sales, but expand the market for inference services by lowering costs. We estimate a 15–20% reduction in GPU demand for inference workloads over the next two years due to compiler-level optimizations.

4. Competitive Landscape:

| Company | Strategy | Threat Level from vLLM-Compile |
|---|---|---|
| NVIDIA | Sell more GPUs; optimize CUDA | Medium – reduces GPU demand per query |
| AWS (Trainium) | Custom chip + SDK | High – narrows performance gap |
| Google (TPU) | Custom chip + XLA | High – XLA already uses compiler techniques |
| AMD (MI300X) | Open-source ROCm | Low – benefits from vLLM-Compile support |
| OpenAI (Azure) | Proprietary models + hardware | Medium – cost advantage erodes |

*Data Takeaway: The biggest losers are custom chip vendors whose value proposition relies on hardware superiority. The biggest winners are open-source model providers and inference-as-a-service startups.*

Risks, Limitations & Open Questions

1. Compilation Time Overhead: The static analysis and recompilation can take 5–15 minutes for large models (70B+). This is acceptable for production deployments but problematic for dynamic model switching or serverless inference.

2. Hardware Specificity: While vLLM-Compile supports multiple backends, the most aggressive optimizations are NVIDIA-specific (CUDA graphs, Tensor Core intrinsics). AMD and Intel support lags by 3–6 months.

3. Long-Context Performance Diminishing Returns: For contexts exceeding 128K tokens, the KV cache tiling optimization becomes less effective because the cache cannot be fully contained in L2. Speedup drops to 1.5–1.8x for 256K contexts.

4. Security and Stability: Compiler-generated kernels are harder to debug. A bug in the fusion pass could produce silent correctness errors. The vLLM team has implemented a verification step that compares outputs against the unoptimized model, but this adds overhead.

5. Ethical Considerations: Lower inference costs could accelerate misuse of LLMs for disinformation, spam, and deepfakes. The democratization of high-performance inference is a double-edged sword.
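The long-context limitation in item 3 falls out of simple sizing arithmetic. A rough sketch, assuming Llama 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head_dim 128) with FP16 values against the H100's roughly 50 MB L2 cache:

```python
# Rough KV-cache sizing vs. on-chip cache capacity. Model shape assumes
# Llama 3.1 70B (80 layers, 8 KV heads via GQA, head_dim 128, FP16);
# L2 size assumes an H100. Illustrative arithmetic, not tool output.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # factor of 2 covers both K and V tensors
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

L2_BYTES = 50 * 1024**2
for ctx in (8_192, 131_072, 262_144):
    total = kv_cache_bytes(ctx)
    print(f"{ctx:>7} tokens -> {total / 1024**3:5.1f} GiB KV cache "
          f"({total / L2_BYTES:,.0f}x L2)")
```

At 128K tokens the cache is around 40 GiB, three orders of magnitude beyond L2, so tiling can only stream blocks through the cache rather than keep the working set resident, which is consistent with the reported drop to 1.5–1.8x.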
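The output-verification step mentioned in item 4 can be pictured as a tolerance comparison between the optimized path and the unoptimized reference. A minimal sketch (our illustration, not the actual vLLM implementation; the tolerances are assumptions):

```python
# Sketch of an output-verification pass: run the same input through the
# reference and optimized paths and require agreement within a tolerance
# before trusting the compiled kernels. Tolerances are illustrative.

import math

def outputs_match(reference, optimized, rtol=1e-3, atol=1e-5):
    if len(reference) != len(optimized):
        return False
    return all(
        math.isclose(r, o, rel_tol=rtol, abs_tol=atol)
        for r, o in zip(reference, optimized)
    )

ref = [0.1234, -2.5, 7.0]
opt = [0.12341, -2.5002, 7.0001]  # small numerical drift from fusion/reordering
assert outputs_match(ref, opt)
assert not outputs_match(ref, [0.2, -2.5, 7.0])  # a real fusion bug is caught
```

The tolerance has to admit benign floating-point drift from reordered reductions while still flagging genuine miscompilations, which is why this check adds overhead: it requires a full reference forward pass.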

AINews Verdict & Predictions

vLLM-Compile is not just an optimization tool; it is a harbinger of a new era in AI infrastructure where software intelligence matters more than raw silicon. Our editorial stance is clear: this is the most significant inference optimization since FlashAttention.

Predictions:

1. By Q4 2026, compiler-level optimization will be a standard feature in every major inference engine. TensorRT-LLM, ONNX Runtime, and CTranslate2 will all adopt similar techniques or risk obsolescence.

2. The gap between custom AI chips and commodity GPUs will narrow by 30–50% over the next 18 months. This will force hyperscalers to pivot their hardware strategies toward software-defined architectures.

3. Inference costs will drop by 60–70% within two years, driven by a combination of compiler optimizations, quantization, and model compression. This will unlock new use cases like real-time video generation and autonomous agent loops.

4. The vLLM ecosystem will become the de facto standard for open-source LLM serving, surpassing Hugging Face's TGI and NVIDIA's Triton Inference Server in adoption.

What to Watch: The upcoming vLLM-Compile v0.2 release promises speculative decoding integration, which could push speedups to 4–5x. Also watch for AMD's response—if they can match NVIDIA's compiler support, it could shift the GPU market balance.

In conclusion, vLLM-Compile proves that the biggest gains in AI inference are not in the next GPU, but in how we use the ones we already have. The era of software-defined AI infrastructure has begun.
