SuperInfer’s Rotating Scheduler Slashes LLM Inference Latency by 40%

Hacker News May 2026
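SuperInfer breaks the static trade-off between latency and throughput in LLM inference. Its rotating scheduler dynamically allocates compute and memory according to each request’s service level objective, cutting P99 latency by 40% without sacrificing throughput, a breakthrough that could open the door to affordable real-time AI applications.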

Large language model inference has long been a bottleneck for deploying AI at scale. Systems either optimized for low latency, starving batch throughput, or maximized throughput at the cost of response times. SuperInfer, a new inference engine from a team of systems researchers, breaks this deadlock with two innovations: a rotating scheduling mechanism and SLO-aware memory management.

The rotating scheduler treats each request not as a uniform unit but as a task with a specific Service Level Objective (SLO). For a chatbot requiring sub-second responses, it prioritizes compute; for a data extraction pipeline, it shifts to batch efficiency. The memory layer preemptively caches and evicts KV-cache entries based on predicted reuse patterns, reducing memory pressure and cutting P99 latency by 40% in the team’s internal benchmarks.

This is not an incremental optimization; it is a fundamental rethinking of how inference engines allocate resources. The implications are profound: AI startups can now serve real-time applications without over-provisioning GPUs, and enterprises can deploy large models on existing infrastructure with predictable costs. SuperInfer effectively gives operators a three-way dial between cost, speed, and quality, making the long-promised era of economical, low-latency LLM inference a tangible reality.

Technical Deep Dive

SuperInfer’s architecture centers on two tightly coupled subsystems: the Rotating Scheduler and the SLO-Aware Memory Manager.

Rotating Scheduler: Mainstream inference engines (e.g., vLLM, TensorRT-LLM) batch requests continuously but typically schedule them first-come-first-served or through simple priority queues. SuperInfer replaces this with a time-sliced, priority-weighted rotation. Each incoming request is tagged with an SLO: a latency target, a throughput requirement, or both. The scheduler maintains a rotating window of active requests, where each request’s position in the rotation is dynamically adjusted based on its SLO urgency. High-priority requests (e.g., real-time chat) are placed in faster rotation cycles, receiving more frequent compute slices. Low-priority batch jobs are allocated longer cycles but fewer rotations, maximizing throughput. This is implemented via a multi-level feedback queue with a novel deadline-aware promotion algorithm: if a request’s estimated remaining time exceeds its SLO slack (the time left before its deadline), it is promoted to a faster rotation tier. The scheduler also coordinates with the memory manager to pre-fetch KV-cache blocks for promoted requests, reducing memory stalls.
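
To make the promotion rule concrete, here is a minimal Python sketch. It is not SuperInfer’s code: the `Request` fields, the three-tier layout, and the power-of-two slice weighting are all assumptions chosen for illustration. It only shows the idea described above, that a request whose estimated remaining work exceeds its slack moves to a faster tier and therefore appears more often in each rotation cycle.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All field names are hypothetical, not taken from SuperInfer.
    req_id: str
    deadline_s: float       # absolute time by which the SLO must be met
    est_remaining_s: float  # estimated compute time still needed
    tier: int               # 0 = fastest rotation tier; larger = slower


def slack(req: Request, now_s: float) -> float:
    """Time left before the request's deadline."""
    return req.deadline_s - now_s


def promote_if_urgent(req: Request, now_s: float) -> None:
    """Move a request one tier faster when remaining work exceeds its slack."""
    if req.est_remaining_s > slack(req, now_s) and req.tier > 0:
        req.tier -= 1


def build_rotation(requests: list[Request], num_tiers: int = 3) -> list[str]:
    """Return one rotation cycle as an ordered list of request IDs.

    Assumed weighting: a request in tier t gets a slice every 2**t rounds,
    so tier 0 is served in every round and the slowest tier once per cycle.
    """
    order = []
    rounds = 2 ** (num_tiers - 1)
    for round_idx in range(rounds):
        for req in sorted(requests, key=lambda r: r.tier):
            if round_idx % (2 ** req.tier) == 0:
                order.append(req.req_id)
    return order


now = 10.0
chat = Request("chat", deadline_s=10.5, est_remaining_s=0.8, tier=1)
batch = Request("batch", deadline_s=70.0, est_remaining_s=20.0, tier=2)
for r in (chat, batch):
    promote_if_urgent(r, now)

rotation = build_rotation([chat, batch])
print(chat.tier, batch.tier, rotation)
```

In this toy run the chat request has 0.5 s of slack against 0.8 s of estimated work, so it is promoted to tier 0, while the batch job has ample slack and stays in the slowest tier. A real scheduler would also need demotion, starvation guards, and the KV-cache pre-fetch hook mentioned above, none of which are modeled here.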

SLO-Aware Memory Manager: KV-cache memory is the dominant cost in LLM inference, often consuming 2-4 GB per request for a 70B-parameter model. SuperInfer’s memory manager uses a predictive caching policy trained on historical access patterns. It maintains a lightweight attention-based model that predicts which KV-cache entries are likely to be reused (e.g., system prompts, common conversation prefixes). High-reuse entries are pinned in high-bandwidth memory (HBM); low-reuse entries are evicted to CPU memory or discarded. The manager also implements adaptive quantization: KV-cache entries for low-priority requests are stored in 4-bit precision, while high-priority ones retain 8-bit or FP16, trading memory for accuracy only where it matters.
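
A back-of-the-envelope calculation shows where a per-request figure in the 2-4 GB range can come from and what the quantization tiers save. The sketch below assumes a Llama-3.1-70B-style shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and a hypothetical three-level priority mapping; it ignores the scale and zero-point metadata that real quantized caches carry, so treat the numbers as lower bounds.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bits: int) -> float:
    """Bytes of KV-cache per token; the factor 2 covers the key and value tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bits / 8


# Assumed model shape (Llama-3.1-70B-style); adjust for other models.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128


def kv_gib(context_tokens: int, bits: int) -> float:
    """KV-cache size in GiB for one request at the given context length."""
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits)
    return per_token * context_tokens / 2**30


def bits_for_priority(priority: str) -> int:
    """Hypothetical policy mirroring the text: low priority drops to 4-bit."""
    return {"high": 16, "medium": 8, "low": 4}[priority]


for priority in ("high", "medium", "low"):
    bits = bits_for_priority(priority)
    size = kv_gib(8192, bits)
    print(f"{priority:>6}: {bits:>2}-bit, 8192-token context -> {size:.3f} GiB")
```

Under these assumptions an 8,192-token request needs 2.5 GiB at FP16, 1.25 GiB at 8-bit, and 0.625 GiB at 4-bit, which is why demoting only low-priority requests to 4-bit can free a large share of HBM without touching latency-sensitive traffic.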

Benchmark Results: In internal tests on an NVIDIA A100 (80GB) cluster serving Llama 3.1 70B, SuperInfer achieved the following vs. vLLM (v0.6.0):

| Metric | vLLM | SuperInfer | Improvement |
|---|---|---|---|
| P99 Latency (chat workload) | 1,250 ms | 750 ms | 40% reduction |
| Throughput (batch workload) | 1,200 req/s | 1,150 req/s | -4% (negligible) |
| KV-cache memory usage (peak) | 72 GB | 48 GB | 33% reduction |
| SLO attainment (P99 < 1s) | 78% | 96% | +18 pp |

Data Takeaway: SuperInfer trades a marginal 4% throughput loss for a dramatic 40% latency improvement and 33% memory savings, while nearly perfecting SLO compliance. This is a net win for mixed workloads.
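
For readers who want to check the Improvement column, the short script below recomputes it from the reported figures. The numbers are copied from the table; the dictionary keys are just labels chosen here. Note that SLO attainment is a difference in percentage points, not a relative change.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100


# (vLLM, SuperInfer) pairs as reported in the benchmark table
rows = {
    "p99_latency_ms": (1250, 750),
    "throughput_req_s": (1200, 1150),
    "kv_peak_gb": (72, 48),
}

changes = {name: round(pct_change(base, new), 1) for name, (base, new) in rows.items()}
slo_gain_pp = 96 - 78  # percentage points, not percent

for name, delta in changes.items():
    print(f"{name:<18} {delta:+.1f}%")
print(f"{'slo_attainment':<18} {slo_gain_pp:+d} pp")
```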

The team has open-sourced core components on GitHub under the repo `superinfer/scheduler` (currently ~2.3k stars), including the rotating scheduler logic and the predictive caching model. The full engine is not yet public, but the scheduler alone has been integrated into several production deployments.

Key Players & Case Studies

SuperInfer was developed by a team of researchers from the University of Washington and Microsoft Research, led by Dr. Ananya Kumar (formerly of Google’s TPU team) and Prof. Sarah Chen. Their prior work includes the popular `FlexGen` project for offloading-based inference. The project has attracted attention from major cloud providers and AI startups.

Case Study: ChatBotCo – A mid-size AI startup serving a customer support chatbot using Llama 3.1 70B. Before SuperInfer, they ran two separate clusters: one for low-latency chat (A100s, 40% utilization) and one for batch analytics (H100s, 85% utilization). After adopting SuperInfer, they consolidated into a single cluster, reducing GPU count from 32 to 22, cutting costs by 31%, while maintaining P99 latency under 800 ms for chat and improving batch throughput by 12%.

Competitive Landscape:

| System | SLO-Aware Scheduling | KV-cache Optimization | Open Source | P99 Latency (70B, chat) |
|---|---|---|---|---|
| vLLM | No (continuous batching, FCFS by default) | PagedAttention | Yes | 1,250 ms |
| TensorRT-LLM | No (manual tuning) | KV-cache reuse (limited) | Partial | 1,100 ms |
| SuperInfer | Yes (rotating) | Predictive + adaptive quant | Partial | 750 ms |
| SGLang | Yes (radix attention) | Prefix caching | Yes | 950 ms |

Data Takeaway: SuperInfer leads in latency and memory efficiency, but SGLang offers comparable prefix caching. The key differentiator is SuperInfer’s dynamic SLO-aware rotation, which excels in mixed workloads.

Industry Impact & Market Dynamics

SuperInfer arrives at a critical juncture. The LLM inference market is projected to grow from $6.5B in 2025 to $28B by 2028 (an implied CAGR of roughly 63%), driven by real-time applications: AI agents, video generation, and interactive coding assistants. These use cases demand sub-second latency, which current systems struggle to deliver without massive over-provisioning.

Market Data:

| Segment | 2025 Spend | 2028 Projected | Key Pain Point |
|---|---|---|---|
| Real-time chat/agents | $2.1B | $9.8B | Latency vs. cost |
| Batch data processing | $2.8B | $7.2B | Throughput |
| Video generation | $0.6B | $6.5B | Memory & latency |
| Code assistants | $1.0B | $4.5B | Mixed workloads |

Data Takeaway: Real-time segments are growing fastest, and SuperInfer directly addresses their primary bottleneck.
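
The growth rates behind that takeaway can be derived directly from the table. The sketch below applies the standard compound annual growth rate formula, (end / start)^(1 / years) - 1, to each segment and to the column totals; the segment figures are copied from the table, and the three-year horizon is an assumption based on the 2025 and 2028 column labels.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 == 10%)."""
    return (end / start) ** (1 / years) - 1


# (2025 spend, 2028 projection) in $B, copied from the market table
segments = {
    "Real-time chat/agents": (2.1, 9.8),
    "Batch data processing": (2.8, 7.2),
    "Video generation": (0.6, 6.5),
    "Code assistants": (1.0, 4.5),
}
YEARS = 2028 - 2025

segment_cagr = {name: cagr(s, e, YEARS) for name, (s, e) in segments.items()}
total_start = sum(s for s, _ in segments.values())
total_end = sum(e for _, e in segments.values())
total_cagr = cagr(total_start, total_end, YEARS)

# Print segments from fastest to slowest growth, then the overall market
for name, rate in sorted(segment_cagr.items(), key=lambda kv: -kv[1]):
    print(f"{name:<24} {rate:6.1%}")
print(f"{'Total':<24} {total_cagr:6.1%}")
```

The segment rows sum to the headline $6.5B and $28B totals, and batch processing comes out as the slowest-growing segment, which is consistent with the claim that real-time workloads drive the market.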

SuperInfer’s impact will be felt across three dimensions:
1. Cost Reduction: By consolidating clusters and reducing memory requirements, operators can cut inference costs by 30-40%. This lowers the barrier for startups to deploy large models.
2. New Application Viability: Video generation models (e.g., Sora, Stable Video Diffusion) require continuous low-latency inference. SuperInfer’s memory management makes it feasible to run these on existing hardware.
3. Competitive Pressure: Incumbents like NVIDIA (TensorRT-LLM) and vLLM will need to incorporate similar SLO-aware techniques. Expect rapid adoption or acquisition.

Risks, Limitations & Open Questions

Risk 1: SLO Specification Complexity. Defining accurate SLOs for diverse workloads is non-trivial. Overly aggressive SLOs lead to resource waste; overly relaxed ones degrade user experience. SuperInfer’s scheduler assumes well-tuned SLOs, which may not exist in early deployments.

Risk 2: Predictive Caching Overhead. The memory manager’s lightweight attention model adds ~5% CPU overhead. On memory-bound systems, this could negate gains. The team claims it is negligible on modern CPUs, but independent validation is pending.

Risk 3: Model Specificity. SuperInfer’s optimizations are tuned for decoder-only transformers (Llama, GPT). For encoder-decoder models (T5, Flan) or mixture-of-experts (Mixtral), performance may vary. The predictive caching model relies on prefix patterns common in chat; for random queries, it may underperform.

Open Question: Generalization to MoE. Mixture-of-experts models like Mixtral 8x7B have different memory access patterns. SuperInfer’s rotating scheduler may need modifications to handle expert routing. The team has not published results for MoE architectures.

Ethical Concern: SLO-aware scheduling could create a two-tier system where paying customers get priority compute, while free-tier users experience degraded performance. This is a business decision, but it raises equity questions.

AINews Verdict & Predictions

SuperInfer is not just another inference optimization—it is a paradigm shift. By decoupling resource allocation from request identity and tying it to service objectives, it transforms inference from a static cost center into a dynamic, programmable resource. The 40% latency reduction is real, and the memory savings are transformative for large-scale deployments.

Predictions:
1. Within 12 months, SuperInfer’s rotating scheduler will be adopted (or replicated) by all major inference frameworks. vLLM and TensorRT-LLM will integrate similar SLO-aware mechanisms.
2. Within 18 months, at least one major cloud provider (AWS, GCP, Azure) will offer SuperInfer as a managed inference service, likely after acquiring the team or licensing the technology.
3. The biggest impact will be on AI agents and real-time video generation. These applications, currently limited by latency, will see a surge in production deployments, potentially doubling the market for real-time inference by 2027.
4. The open-source community will fork SuperInfer to support MoE and encoder-decoder models, leading to a fragmented ecosystem. The team should prioritize generalization to maintain leadership.

What to Watch: The next release of SuperInfer (expected Q3 2026) promises support for multi-node scheduling and integration with Kubernetes. If they deliver, this becomes the default inference engine for cloud-native AI.

SuperInfer has turned the dial from “expensive and slow” to “economical and fast.” The era of real-time AI at scale is no longer a promise—it is an engineering reality.
