WhichLLM: The Open-Source Tool That Matches AI Models to Your Hardware

Source: Hacker News | Archive: May 2026
WhichLLM is an open-source tool that recommends the best local large language model for a specific hardware configuration. By mapping real benchmark scores to GPU, RAM, and CPU specifications, it addresses a critical problem in edge AI deployment: model selection.

The open-source project WhichLLM has emerged as a practical solution to a growing pain point: how to choose the best local large language model for a given hardware setup. As AI inference shifts from the cloud to edge devices—driven by privacy concerns, latency requirements, and rising API costs—developers and enterprises face a bewildering array of models from Llama to Mistral to Qwen. WhichLLM addresses this by aggregating authoritative benchmarks like MMLU and HumanEval, then filtering and ranking models based on a user's specific GPU, memory, and CPU. The tool's ranking algorithm weights not just raw accuracy but also memory footprint and inference speed, making it uniquely useful for real-world deployment in sectors like healthcare, finance, and education where data sovereignty is paramount. WhichLLM represents a shift from black-box model evaluation to transparent, hardware-aware, community-driven assessment. It suggests a future where selecting an AI model becomes as standardized as choosing a CPU or GPU—a necessary evolution for the edge AI revolution.

Technical Deep Dive

WhichLLM's core innovation lies in its hardware-aware recommendation engine. Unlike traditional model hubs that list models by parameter count or abstract benchmark scores, WhichLLM directly maps model performance to specific hardware configurations. The tool scrapes and normalizes benchmark data from sources like the Open LLM Leaderboard (which uses MMLU, ARC, HellaSwag, and TruthfulQA) and HumanEval for code generation. It then cross-references these scores with a database of hardware profiles—covering GPUs from NVIDIA (RTX 3060 to A100), AMD (RX 7900 XTX), and Apple Silicon (M1 to M3 Max), as well as CPU-only setups.

The recommendation algorithm uses a weighted scoring system. The primary factors are:
- Benchmark Score (40%): Average of MMLU (knowledge), HumanEval (code), and MT-Bench (conversation) scores.
- Memory Efficiency (30%): Model size in GB relative to available VRAM/RAM. Models that fit with 20% headroom score higher.
- Inference Speed (30%): Tokens per second on the target hardware, estimated from community-reported data and quantization levels (e.g., 4-bit vs 8-bit).

For example, a user with an RTX 3090 (24GB VRAM) would see Llama 3 8B (4-bit quantized, ~5GB) ranked higher than Llama 3 70B (4-bit, ~35GB) because the latter doesn't fit. But on an A100 (80GB), the 70B model would dominate.
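The project's exact implementation is not published in the article, but the 40/30/30 weighting above maps naturally onto a small scoring function. The sketch below is a hypothetical Python illustration: the `ModelProfile` fields, the hard fit check, and the headroom normalization are assumptions layered on the factors described above, not WhichLLM's actual code.

```python
from dataclasses import dataclass

# Hypothetical sketch of WhichLLM-style weighted scoring; field names and the
# exact normalization are assumptions, not the project's implementation.

@dataclass
class ModelProfile:
    name: str
    benchmark_avg: float   # 0-100 average of MMLU / HumanEval / MT-Bench (MT-Bench rescaled)
    size_gb: float         # resident size at the chosen quantization level
    tokens_per_sec: float  # estimated throughput on the target hardware

def score(model: ModelProfile, vram_gb: float, max_tps: float = 200.0) -> float:
    """Return a 0-100 composite score; models that do not fit score zero."""
    # Hard constraint: the model must fit in available VRAM/RAM at all.
    if model.size_gb > vram_gb:
        return 0.0
    # Memory efficiency: reward at least ~20% headroom, as described above.
    headroom = (vram_gb - model.size_gb) / vram_gb
    memory_score = 100.0 if headroom >= 0.2 else 100.0 * headroom / 0.2
    # Speed normalized against an assumed ceiling for this hardware class.
    speed_score = 100.0 * min(model.tokens_per_sec, max_tps) / max_tps
    # 40/30/30 weighting from the factor list above.
    return 0.4 * model.benchmark_avg + 0.3 * memory_score + 0.3 * speed_score

# Example: RTX 3090 (24 GB). A ~35 GB 4-bit 70B model scores 0 because it doesn't fit.
models = [
    ModelProfile("Llama 3 8B (4-bit)", benchmark_avg=70.0, size_gb=5.2, tokens_per_sec=120),
    ModelProfile("Llama 3 70B (4-bit)", benchmark_avg=80.0, size_gb=35.0, tokens_per_sec=15),
]
for m in sorted(models, key=lambda m: score(m, vram_gb=24.0), reverse=True):
    print(f"{m.name}: {score(m, vram_gb=24.0):.1f}")
```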

The project is hosted on GitHub under the repo `whichllm/whichllm`, which has garnered over 4,500 stars in its first month. The codebase is written in Python and uses a SQLite database to store benchmark results and hardware profiles. It also includes a CLI tool and a basic web interface. The team has published a detailed methodology document explaining how they normalize scores across different benchmarks to avoid overfitting to any single test.
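The article names SQLite as the storage layer but does not publish the schema. Purely as an illustration, the sketch below shows one way benchmark results and hardware profiles could be stored and joined to produce a fit-filtered list; every table and column name here is assumed, not taken from the `whichllm/whichllm` codebase.

```python
import sqlite3

# Hypothetical schema; the actual whichllm/whichllm database layout is not documented here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hardware   (id INTEGER PRIMARY KEY, name TEXT, vram_gb REAL);
CREATE TABLE models     (id INTEGER PRIMARY KEY, name TEXT, quant TEXT, size_gb REAL,
                         mmlu REAL, humaneval REAL);
CREATE TABLE throughput (model_id INTEGER, hardware_id INTEGER, tokens_per_sec REAL);
""")
conn.executemany("INSERT INTO hardware VALUES (?, ?, ?)", [(1, "RTX 4090", 24.0)])
conn.executemany("INSERT INTO models VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, "Llama 3 8B", "4-bit", 5.2, 68.4, 72.3),
                  (2, "Phi-3 Mini 3.8B", "4-bit", 2.8, 69.0, 48.5)])
conn.executemany("INSERT INTO throughput VALUES (?, ?, ?)", [(1, 1, 120.0), (2, 1, 200.0)])

# Filter to models that fit the selected GPU, ordered by a simple accuracy average.
rows = conn.execute("""
SELECT m.name, m.quant, t.tokens_per_sec, (m.mmlu + m.humaneval) / 2 AS accuracy
FROM models m
JOIN throughput t ON t.model_id = m.id
JOIN hardware h   ON h.id = t.hardware_id
WHERE h.name = ? AND m.size_gb <= h.vram_gb
ORDER BY accuracy DESC
""", ("RTX 4090",)).fetchall()
for name, quant, tps, accuracy in rows:
    print(name, quant, f"{tps:.0f} tok/s", f"accuracy={accuracy:.1f}")
```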

Data Table: Benchmark Scores for Top Models on Consumer Hardware

| Model | Quantization | MMLU | HumanEval | Memory (GB) | Tokens/sec (RTX 4090) |
|---|---|---|---|---|---|
| Llama 3 8B | 4-bit | 68.4 | 72.3 | 5.2 | 120 |
| Mistral 7B v0.3 | 4-bit | 62.5 | 40.2 | 4.5 | 140 |
| Qwen2 7B | 4-bit | 70.1 | 65.8 | 4.8 | 110 |
| Phi-3 Mini 3.8B | 4-bit | 69.0 | 48.5 | 2.8 | 200 |
| Gemma 2 9B | 4-bit | 71.5 | 51.0 | 5.8 | 95 |

Data Takeaway: For consumer GPUs like the RTX 4090, Phi-3 Mini offers the best speed-to-accuracy trade-off for general tasks, while Llama 3 8B leads in code generation. This table shows that parameter count alone is a poor predictor of real-world performance—memory efficiency and quantization level matter just as much.

Key Players & Case Studies

The WhichLLM project was created by a team of independent researchers and engineers who previously contributed to llama.cpp and Ollama. While they remain anonymous, their work builds on the ecosystem of open-source model serving tools. Key players in the broader local LLM space include:

- Ollama: A popular tool for running local models with a simple CLI. It supports dozens of models but lacks hardware-specific recommendations. WhichLLM complements Ollama by telling users which model to download.
- LM Studio: A GUI-based tool for running local models. It includes basic hardware detection but doesn't provide ranked recommendations across models.
- llama.cpp: The foundational C++ library for running quantized LLMs on CPU and GPU. WhichLLM relies on llama.cpp's quantization schemes (Q4_K_M, Q5_K_M, etc.) for its memory estimates; a rough footprint calculation is sketched after this list.
- Hugging Face: The primary source of model weights and benchmark data. WhichLLM pulls metadata from Hugging Face but adds the hardware mapping layer.
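Since the memory figures above lean on llama.cpp quantization levels, a rough rule of thumb helps sanity-check them: resident size scales with parameter count times effective bits per weight, plus runtime overhead. The sketch below is an approximation under stated assumptions, not llama.cpp's exact allocation math; KV cache, context length, and the mixed per-tensor quantization in formats like Q4_K_M all shift the real number.

```python
# Rough rule-of-thumb estimate of a quantized model's resident size. The 15% overhead
# factor is an assumption, not a measured constant.

def estimated_size_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A nominal "4-bit" quantization typically lands between 4 and 5 effective bits per weight;
# at ~4.5 bits an 8B model comes out near the ~5 GB figure cited in the table above.
print(f"{estimated_size_gb(8, 4.5):.1f} GB")   # ~5.2 GB
print(f"{estimated_size_gb(8, 8.0):.1f} GB")   # ~9.2 GB at 8-bit
```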

Case Study: Healthcare Startup MediSecure
A mid-sized healthcare startup used WhichLLM to deploy a local medical Q&A model. They had a budget for a single RTX 4090 and needed HIPAA-compliant inference. Without WhichLLM, they would have tried Llama 3 70B (too large) or Mistral 7B (underperforming). WhichLLM recommended Qwen2 7B 4-bit, which achieved 68% on medical MMLU subsets and ran at 100 tokens/sec. The deployment cost was $2,000 for hardware vs. $50,000/year for cloud API calls.

Comparison Table: Local LLM Deployment Tools

| Tool | Hardware Detection | Model Ranking | Quantization Support | Open Source |
|---|---|---|---|---|
| WhichLLM | Yes (GPU, RAM, CPU) | Yes (weighted score) | Yes (all llama.cpp formats) | Yes |
| Ollama | No (manual selection) | No | Yes (limited) | Yes |
| LM Studio | Basic (GPU only) | No | Yes (GUI-based) | No |
| GPT4All | No | No | Yes (limited) | Yes |

Data Takeaway: WhichLLM is the only tool that combines hardware detection with ranked model recommendations. Its open-source nature and integration with llama.cpp give it a significant advantage in flexibility and community trust.

Industry Impact & Market Dynamics

The emergence of WhichLLM signals a maturation of the edge AI market. According to industry estimates, the global edge AI market was valued at $15 billion in 2024 and is projected to grow to $65 billion by 2030, at a CAGR of 28%. The key drivers are:
- Data privacy regulations: GDPR, HIPAA, and China's Personal Information Protection Law (PIPL) are pushing enterprises to process data locally.
- API cost reduction: Running a 7B model locally costs ~$0.001 per query (electricity + hardware amortization) vs. $0.01–$0.05 for cloud APIs; a breakeven sketch follows this list.
- Latency requirements: Real-time applications like voice assistants and autonomous systems need sub-100ms inference, which is only feasible on local hardware.
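Taking the per-query figures above as given (they are the article's estimates, not measured data), a quick breakeven calculation shows why local inference becomes attractive at even moderate volume:

```python
# Back-of-the-envelope breakeven using the per-query figures cited above (assumptions,
# not measurements): local ~$0.001/query vs cloud $0.01-$0.05/query, plus hardware cost.

hardware_cost = 2000.0          # e.g. a single RTX 4090 workstation, as in the case study
local_per_query = 0.001
cloud_per_query_low, cloud_per_query_high = 0.01, 0.05

for cloud in (cloud_per_query_low, cloud_per_query_high):
    breakeven = hardware_cost / (cloud - local_per_query)
    print(f"cloud at ${cloud:.2f}/query -> breakeven after ~{breakeven:,.0f} queries")
# ~222,000 queries at the low end, ~41,000 at the high end; a team running a few
# thousand queries a day crosses that within weeks to months.
```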

WhichLLM directly addresses the "model selection paralysis" that has slowed edge AI adoption. By providing a standardized, transparent ranking, it reduces the risk of deploying a suboptimal model. This is particularly important for small and medium enterprises (SMEs) that lack dedicated ML teams.

Market Data Table: Edge AI Adoption by Sector

| Sector | 2024 Edge AI Penetration | 2028 Projected | Key Use Case |
|---|---|---|---|
| Healthcare | 12% | 35% | Local diagnostic LLMs |
| Finance | 8% | 28% | Fraud detection on-premise |
| Manufacturing | 18% | 45% | Quality control vision models |
| Retail | 22% | 50% | Personalized recommendations in-store |
| Automotive | 5% | 20% | In-car voice assistants |

Data Takeaway: The sectors with the highest projected growth—manufacturing and retail—are also those where hardware diversity is greatest. WhichLLM's hardware-aware approach is perfectly suited for these environments, where a single model must run on everything from edge servers to Raspberry Pis.

Risks, Limitations & Open Questions

Despite its promise, WhichLLM faces several challenges:

1. Benchmark Gaming: As WhichLLM's rankings become influential, model creators may optimize specifically for its weighted scoring system, potentially inflating scores without improving real-world performance. The team must regularly update the benchmark suite to prevent overfitting.

2. Hardware Database Staleness: New GPUs and accelerators (e.g., Intel Arc, AMD MI300X) are released frequently. WhichLLM relies on community contributions to update its hardware profiles, which can lag behind product launches.

3. Quantization Variability: The same model with different quantization methods (e.g., GPTQ vs. AWQ vs. llama.cpp's Q4_K_M) can have vastly different performance. WhichLLM currently assumes a default quantization, which may not be optimal for all users.

4. Task Specificity: A model that excels at MMLU may fail at creative writing or code generation. WhichLLM's current ranking is a general-purpose score, but users may need task-specific recommendations.

5. Ethical Concerns: By recommending models based on hardware constraints, WhichLLM could inadvertently steer users toward smaller, less capable models for sensitive applications (e.g., medical diagnosis), where a larger model might be safer despite requiring cloud inference.

AINews Verdict & Predictions

WhichLLM is a necessary tool for the edge AI era. Its transparent, hardware-aware approach is a significant improvement over the current model selection process, which relies on word-of-mouth, trial-and-error, or outdated leaderboards. We predict the following:

1. Acquisition or Integration: Within 12 months, WhichLLM will be acquired by or deeply integrated into a major platform like Hugging Face or Ollama. The technology is too valuable to remain standalone.

2. Standardization of Model Selection: By the end of 2026, hardware-aware model ranking will become a standard feature of all major LLM deployment tools, much like how PC benchmark scores (e.g., 3DMark) are standard for gaming hardware.

3. Expansion to Multimodal Models: WhichLLM will expand beyond text LLMs to include vision-language models (e.g., LLaVA, Qwen-VL) and speech models, as edge devices increasingly handle multimodal tasks.

4. Enterprise Adoption: Large enterprises will deploy internal versions of WhichLLM to standardize model selection across thousands of edge devices, reducing IT overhead and ensuring compliance.

5. Community Governance: The WhichLLM team will establish a formal governance model to prevent benchmark gaming and ensure transparency, similar to the MLPerf benchmark consortium.

Bottom Line: WhichLLM is not just a tool—it's a signal that the AI industry is maturing from a cloud-first to an edge-first mindset. Developers and enterprises that ignore hardware-aware model selection will find themselves at a competitive disadvantage in cost, privacy, and latency. The future of AI is local, and WhichLLM is the map.


