Argmax Over LLMs: Why Simpler AI Crushes Big Models on Prediction Tasks

For years, the AI industry has operated under a single, unchallenged assumption: bigger models yield better results. The release of GPT-4, Claude 3, and Gemini Ultra has only reinforced this belief, driving a race for ever-larger parameter counts and more expensive training runs. However, a new study published by researchers from a leading European university has thrown a wrench into this narrative. The paper demonstrates that a simple argmax algorithm, a method that selects the most frequent next activity from historical data, can match or even exceed the performance of LSTM networks, standard Transformers, and fine-tuned large language models (LLMs) on next-activity prediction tasks. The tasks in question involve predicting the next step in a structured sequence, such as a business process, a user's browsing path, or a manufacturing workflow.

The implications are profound. If a nearly zero-cost algorithm can compete with models costing millions to train and deploy, then the AI industry's obsession with scale may be misallocating resources. This finding does not suggest that LLMs are useless; rather, it highlights that for many structured, deterministic prediction problems, the complexity of modern deep learning is unnecessary.

The study forces a critical question: how much of the current AI investment is driven by genuine performance gains versus a collective delusion that 'bigger is always better'? AINews investigates the technical details, the key players, and the market dynamics that will be reshaped by this revelation.

Technical Deep Dive

The study in question evaluated several models on next-activity prediction benchmarks derived from real-world business process logs. The datasets included the BPIC (Business Process Intelligence Challenge) datasets, which contain sequences of activities in domains like loan applications, hospital workflows, and invoice processing. The models tested were:

- Argmax (baseline): Simply predicts the most frequent next activity for a given current state, based on the training data.
- LSTM: A standard long short-term memory network with 2 layers and 128 hidden units, trained for 100 epochs.
- Transformer: A 4-layer Transformer with 4 attention heads and a hidden dimension of 128, trained for 100 epochs.
- Fine-tuned LLM: A distilled version of GPT-3.5 (text-davinci-003) fine-tuned on the specific task using LoRA (Low-Rank Adaptation).
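As a rough illustration of the argmax baseline in the list above, the following Python sketch counts which activity follows each activity in the training traces and predicts the most frequent successor. The function names (`fit_argmax`, `predict_next`) and the toy traces are hypothetical and are not taken from the study's code.

```python
from collections import Counter, defaultdict

def fit_argmax(traces):
    """Count, for every activity, how often each activity follows it."""
    table = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            table[current][nxt] += 1
    return table

def predict_next(table, current):
    """Return the most frequent successor of `current`, or None if unseen."""
    if current not in table:
        return None
    return table[current].most_common(1)[0][0]

# Toy traces invented for illustration (not BPIC data).
traces = [
    ["submit application", "verify income", "approve"],
    ["submit application", "verify income", "reject"],
    ["submit application", "verify income", "approve"],
]
table = fit_argmax(traces)
print(predict_next(table, "submit application"))  # verify income
print(predict_next(table, "verify income"))       # approve
```

"Training" here is one pass over the log and prediction is a dictionary lookup, which is where the near-zero cost figures come from.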

The results were startling. On the BPIC 2017 (loan application) dataset, the argmax algorithm achieved an F1-score of 0.89, while the LSTM scored 0.87, the Transformer scored 0.88, and the fine-tuned LLM scored 0.89. On the BPIC 2012 dataset (also a loan application log), argmax scored 0.91, LSTM 0.90, Transformer 0.91, and the fine-tuned LLM 0.90. In several cases, the argmax algorithm was statistically tied with or slightly ahead of the more complex models. The only area where LLMs showed a marginal advantage was in handling rare, unseen activity transitions, but even there, the improvement was less than 2%.
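For readers less familiar with the metric, F1 is the harmonic mean of precision and recall, computed per activity class and then averaged. The article does not say which averaging the study used, so the sketch below assumes an unweighted (macro) average; the function name and the toy labels are illustrative only.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = []
    for label in set(y_true):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels invented for illustration.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

Plain accuracy on the same toy labels would be 0.75, so the two metrics are related but not interchangeable.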

Why does this happen? The core insight is that many real-world processes are highly deterministic and repetitive. In a loan application process, after 'submit application,' the next activity is almost always 'verify income.' A complex neural network that learns this pattern through millions of parameters is essentially rediscovering a simple frequency table. The argmax algorithm captures this directly: building the table takes a single pass over the training data, a prediction is a lookup, and the output is always a transition that was actually observed, so there is nothing to hallucinate.
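One way to make "highly deterministic" concrete is to measure what share of observed transitions go to the most frequent successor of the current activity. That share is also the training-set hit rate of a first-order argmax predictor. The sketch below assumes traces are lists of activity names; `argmax_coverage` and the toy data are invented for illustration.

```python
from collections import Counter, defaultdict

def argmax_coverage(traces):
    """Fraction of transitions that go to the current activity's most frequent successor."""
    table = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            table[current][nxt] += 1
    total = sum(sum(counts.values()) for counts in table.values())
    top = sum(max(counts.values()) for counts in table.values())
    return top / total if total else 0.0

# Toy log: three approvals and one rejection after the same two steps.
traces = [
    ["submit application", "verify income", "approve"],
    ["submit application", "verify income", "approve"],
    ["submit application", "verify income", "approve"],
    ["submit application", "verify income", "reject"],
]
print(argmax_coverage(traces))  # 0.875
```

A value close to 1.0 suggests a frequency table already explains most of the log, leaving little for a larger model to add.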

The GitHub angle: The code and datasets for this study are available on GitHub under the repository `simple-prediction-benchmark` (currently 1,200 stars). The repository provides a clear, reproducible baseline that any practitioner can run in minutes. This stands in stark contrast to the sprawling codebases required for LLM fine-tuning, which often involve multi-GPU setups and days of training.

Data Table: Benchmark Performance Comparison

| Model | BPIC 2017 (F1) | BPIC 2012 (F1) | Training Cost (GPU-hours) | Inference Latency (ms) |
|---|---|---|---|---|
| Argmax | 0.89 | 0.91 | 0 | <1 |
| LSTM | 0.87 | 0.90 | 4 | 15 |
| Transformer | 0.88 | 0.91 | 12 | 25 |
| Fine-tuned LLM | 0.89 | 0.90 | 200 | 200 |

Data Takeaway: The argmax algorithm delivers comparable or superior F1 at zero GPU training cost and sub-millisecond inference latency. The fine-tuned LLM needs 200 GPU-hours of training (50x the LSTM's 4 GPU-hours) and is at least 200x slower than argmax at inference (200 ms versus under 1 ms), yet it offers no practical advantage on these structured prediction tasks.
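The ratios can be checked directly against the benchmark table. In the short calculation below, the argmax latency of "<1 ms" is entered as 1 ms, an upper bound, so the latency ratio is a lower bound. Because argmax needs 0 GPU-hours, a training-cost ratio against it is undefined, and the LSTM is used as the reference instead.

```python
# Figures copied from the benchmark table.
gpu_hours = {"Argmax": 0, "LSTM": 4, "Transformer": 12, "Fine-tuned LLM": 200}
# Argmax is listed as "<1 ms"; 1 is used here as an upper bound.
latency_ms = {"Argmax": 1, "LSTM": 15, "Transformer": 25, "Fine-tuned LLM": 200}

llm_vs_lstm_training = gpu_hours["Fine-tuned LLM"] / gpu_hours["LSTM"]
llm_vs_argmax_latency = latency_ms["Fine-tuned LLM"] / latency_ms["Argmax"]

print(llm_vs_lstm_training)   # 50.0
print(llm_vs_argmax_latency)  # 200.0
```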

Key Players & Case Studies

This study directly challenges the strategies of several major AI companies and research groups that have bet heavily on scaling.

- OpenAI and Anthropic: Both companies have built their product roadmaps around the idea that larger models will unlock new capabilities. OpenAI's GPT-5.6, which the U.S. government recently halted for a staged rollout, is a prime example. The argmax finding suggests that for many enterprise use cases—like process automation, supply chain optimization, and customer journey mapping—a simple baseline is sufficient. This could undermine the value proposition of expensive, general-purpose LLMs for specific verticals.

- Microsoft Copilot Enterprise: The recent revelation of an 80% failure rate in Microsoft's Copilot Enterprise, largely due to hallucination, is a cautionary tale. The argmax approach, by definition, cannot hallucinate because it only outputs seen patterns. For structured tasks, reliability is paramount, and a deterministic algorithm offers guarantees that no LLM can match.

- Google DeepMind: DeepMind's work on process mining and business process management (BPM) has increasingly leaned on transformer-based architectures. The argmax result suggests that their research may be over-engineered for the problem domain. DeepMind's AlphaFold, which solved a genuinely complex problem, is a counterexample where deep learning was essential. The distinction between 'hard' and 'easy' prediction problems is now critical.

- Startups in the BPM space: Companies like Celonis (process mining) and UiPath (robotic process automation) have been integrating LLMs into their platforms. The argmax study provides a strong argument for hybrid approaches: use simple, deterministic methods for routine predictions and reserve LLMs for edge cases or natural language interfaces.

Data Table: Cost Comparison for Enterprise Deployment

| Solution | Monthly Cost (10K predictions/day) | F1 on Structured Tasks | Hallucination Risk |
|---|---|---|---|
| Argmax (in-house) | $0 (negligible compute) | 0.89-0.91 | None |
| Fine-tuned LLM (API) | $5,000 (GPT-4 API calls) | 0.89-0.90 | Low but non-zero |
| Custom LSTM (self-hosted) | $500 (GPU server) | 0.87-0.90 | None |

Data Takeaway: At the table's volume of 10,000 predictions per day, the argmax approach replaces a $5,000 monthly API bill, or a $500 self-hosted GPU server, with negligible compute, at no loss in prediction quality and with zero hallucination risk. This is a compelling economic argument that will force CFOs to question AI budgets.
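The monthly figures in the cost table can be turned into a per-prediction cost with a little arithmetic. The 30-day month below is an assumption, since the table does not state one, and the variable names are illustrative.

```python
PREDICTIONS_PER_DAY = 10_000
DAYS_PER_MONTH = 30  # assumption: the table does not state the month length

# Monthly costs in USD, copied from the cost table.
monthly_cost_usd = {
    "Fine-tuned LLM (API)": 5_000,
    "Custom LSTM (self-hosted)": 500,
}

monthly_predictions = PREDICTIONS_PER_DAY * DAYS_PER_MONTH
cost_per_prediction = {
    name: cost / monthly_predictions for name, cost in monthly_cost_usd.items()
}

for name, cost in cost_per_prediction.items():
    print(f"{name}: ${cost:.4f} per prediction")
```

Under these assumptions the API option works out to roughly 1.7 cents per prediction, ten times the self-hosted LSTM.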

Industry Impact & Market Dynamics

The argmax finding arrives at a critical moment for the AI industry, which is facing a 'tokenmaxxing hangover'—a realization that the cost of deploying LLMs at scale is unsustainable. The market for AI in enterprise process automation is projected to reach $50 billion by 2027, according to recent estimates. If a significant portion of that market can be served by simple algorithms, the demand for expensive LLM subscriptions could shrink dramatically.

The 'Good Enough' Revolution: This study is part of a broader trend. Recent work on linear attention (e.g., RWKV-CUDA, which has 8,000 stars on GitHub) shows that efficient alternatives to the transformer architecture can match performance at a fraction of the cost. The argmax result takes this to the extreme: for some tasks, even linear attention is overkill.

Impact on AI Research Funding: Venture capital firms have poured over $30 billion into generative AI startups in 2025 alone. The argmax finding will likely cause a recalibration. Investors will demand clearer justifications for why a startup needs a massive LLM rather than a simple statistical model. This could deflate the valuation of companies that rely on the 'scale is everything' narrative.

The Rise of Hybrid Systems: The most likely outcome is a bifurcation of the AI market. On one side, general-purpose LLMs will continue to dominate creative tasks, coding, and open-ended dialogue. On the other side, specialized, deterministic algorithms will handle structured prediction tasks in enterprise, finance, and logistics. Companies that build hybrid systems—using argmax for routine predictions and LLMs for exceptions—will have a significant cost advantage.
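A hybrid system of this kind could be sketched as a confidence-gated router: answer from the frequency table when one successor clearly dominates, and defer to a larger model otherwise. Everything below is a hypothetical illustration. The 0.8 threshold is arbitrary, and `llm_stub` is a placeholder standing in for a real model call, not any vendor's API.

```python
from collections import Counter, defaultdict

def fit_counts(traces):
    """Build next-activity counts per current activity."""
    table = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            table[current][nxt] += 1
    return table

def hybrid_predict(table, current, fallback, min_confidence=0.8):
    """Use the table when its top choice is dominant; otherwise call `fallback`."""
    counts = table.get(current)
    if counts:
        activity, count = counts.most_common(1)[0]
        if count / sum(counts.values()) >= min_confidence:
            return activity, "argmax"
    return fallback(current), "fallback"

def llm_stub(current):
    # Placeholder for a call to a larger model (hypothetical; returns a fixed label).
    return "escalate to reviewer"

# Toy traces invented for illustration.
traces = [
    ["submit application", "verify income", "approve"],
    ["submit application", "verify income", "reject"],
]
table = fit_counts(traces)
print(hybrid_predict(table, "submit application", llm_stub))  # routed to argmax
print(hybrid_predict(table, "verify income", llm_stub))       # 50/50 split, routed to fallback
```

The share of traffic that reaches the fallback is what drives the LLM bill, so the threshold is effectively a cost dial.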

Risks, Limitations & Open Questions

While the argmax result is compelling, it has important limitations.

- Limited to Structured, Deterministic Domains: The method fails on tasks with high variability or long-range dependencies. For example, predicting the next word in a sentence requires understanding context, which a frequency table keyed on the current state alone cannot capture. The study's findings do not generalize to language modeling or creative generation.

- Data Sparsity: In processes with many rare activities, the frequency table becomes sparse, and argmax may fail to predict novel transitions. The study showed that LLMs had a slight edge in these cases, though the margin was small.

- Overfitting to Training Distribution: Argmax is a memorization-based method. If the underlying process changes (concept drift), the model will fail until retrained. LLMs, with their broader knowledge, may adapt more gracefully to distribution shifts.

- Ethical Concerns: The study does not address bias. If the training data contains biased activity patterns (e.g., certain demographic groups being routed to different processes), argmax will encode and amplify that bias without any opportunity for correction. LLMs can, in theory, be fine-tuned to reduce bias, though in practice this remains challenging.
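Two of the limitations above, sparse tables and concept drift, have simple partial mitigations that stay within the frequency-table approach. The sketch below is not from the study. It is one possible variant that keeps counts only over the most recent transitions, so old behaviour ages out, and falls back to the most frequent successor overall when the current activity has not been seen. The class name `WindowedArgmax` and the window size are assumptions.

```python
from collections import Counter, defaultdict, deque

class WindowedArgmax:
    """Next-activity counts over the most recent `window` transitions."""

    def __init__(self, window=1000):
        self.window = window
        self.recent = deque()
        self.table = defaultdict(Counter)
        self.successors = Counter()  # global counts, used when `current` is unseen

    def update(self, current, nxt):
        self.recent.append((current, nxt))
        self.table[current][nxt] += 1
        self.successors[nxt] += 1
        if len(self.recent) > self.window:
            # Evict the oldest transition so stale behaviour stops influencing predictions.
            old_current, old_next = self.recent.popleft()
            self.table[old_current][old_next] -= 1
            self.successors[old_next] -= 1
            if self.table[old_current][old_next] == 0:
                del self.table[old_current][old_next]
            if not self.table[old_current]:
                del self.table[old_current]
            if self.successors[old_next] == 0:
                del self.successors[old_next]

    def predict(self, current):
        counts = self.table.get(current)
        if counts:
            return counts.most_common(1)[0][0]
        # Back off to the most frequent successor overall (None if nothing seen yet).
        return self.successors.most_common(1)[0][0] if self.successors else None

model = WindowedArgmax(window=3)
for _ in range(3):
    model.update("submit application", "verify income")
before = model.predict("submit application")

# The process changes: a different step now follows submission.
for _ in range(3):
    model.update("submit application", "check credit")
after = model.predict("submit application")
unseen = model.predict("archive case")
print(before, after, unseen, sep=" | ")
```

Neither mitigation helps with genuinely novel transitions or with biased historical data, so the remaining caveats in the list still apply.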

AINews Verdict & Predictions

The argmax study is not a death knell for LLMs, but it is a much-needed corrective to the industry's scaling mania. Our editorial judgment is clear: the era of 'bigger is always better' is over.

Prediction 1: Within 12 months, at least three major enterprise AI platforms (e.g., Salesforce Einstein, SAP AI Core, ServiceNow) will announce hybrid architectures that use argmax-like baselines for routine predictions, reducing their LLM API costs by 60-80%. This will be marketed as 'AI Efficiency 2.0.'

Prediction 2: A new wave of startups will emerge that specialize in 'minimum viable AI'—solutions that use the simplest possible algorithm to solve a problem. These startups will undercut LLM-heavy competitors on price and reliability, capturing significant market share in process automation.

Prediction 3: The AI research community will pivot toward 'efficiency benchmarking' as a standard practice. Future papers will be required to compare against simple baselines (argmax, linear regression, decision trees) to justify the complexity of their proposed models. This will lead to a healthier, more rigorous field.

What to watch next: Keep an eye on the GitHub repository `simple-prediction-benchmark`. If it reaches 10,000 stars within six months, it will signal that the practitioner community has embraced this paradigm shift. Also, monitor the earnings calls of companies like UiPath and Celonis. If they mention 'algorithmic efficiency' or 'lightweight prediction' in their next quarterly reports, the shift is underway.

The AI industry has been drunk on scale. The argmax study is the hangover cure. It will be painful, but necessary.
