The Thinking Tax: Why AI Models Waste Compute on Circular Reasoning

arXiv cs.AI May 2026
A landmark study has for the first time formalized and measured 'reasoning redundancy' in large language models, revealing that up to 40% of chain-of-thought tokens are wasted on cyclic self-reflection and redundant verification. The findings challenge the industry's prevailing 'think longer, think better' dogma and point toward a future of adaptive reasoning.

The AI industry has been locked in an arms race to build models that 'think' longer and harder. OpenAI's o1, DeepSeek-R1, and Anthropic's Claude Opus have all demonstrated that extended chain-of-thought (CoT) reasoning can unlock superior performance on complex math, coding, and scientific reasoning tasks. But at what cost?

A new research paper, led by a team from Stanford and MIT, provides the first rigorous, formal quantification of how much of that thinking is actually wasted. By defining a metric called 'Reasoning Redundancy Ratio' (RRR), the researchers analyzed thousands of reasoning traces from models including o1-preview, DeepSeek-R1, and Qwen2.5-72B-Instruct. Their findings are sobering. Across a suite of benchmark tasks, roughly 25% to 40% of all tokens generated during the reasoning process are 'redundant': they either reiterate an already established conclusion, engage in circular self-correction loops that return to the same point, or perform unnecessary verification steps that do not change the final answer. The most redundant models were those explicitly trained to 'think step by step' without constraints on the depth of the chain.

The study's significance extends far beyond academic curiosity. For companies deploying reasoning models at scale, from GitHub Copilot to customer service chatbots, this wasted computation translates directly into higher GPU costs, longer latency, and increased energy consumption. The paper estimates that eliminating just the top 20% of redundant tokens could reduce inference costs by 30-50% for many real-world applications, with no measurable drop in accuracy. This finding has already sparked intense debate within the AI research community about whether current training paradigms, which reward models for producing longer, more verbose reasoning traces, are inadvertently encouraging inefficiency.

The study's authors propose a new training objective that penalizes redundancy while maintaining or improving accuracy, a technique they call 'Minimum Viable Reasoning' (MVR). Early experiments show that models fine-tuned with MVR achieve comparable or better benchmark scores while generating 35% fewer tokens on average. This could be the key that unlocks real-time, cost-effective deployment of reasoning models in latency-sensitive applications like autonomous driving, financial trading, and interactive coding assistants.

Technical Deep Dive

The core contribution of the study is the formal definition of Reasoning Redundancy Ratio (RRR). The researchers break down redundancy into three distinct categories:

1. Cyclic Self-Reflection (CSR): The model revisits a previously resolved sub-problem and re-derives the same conclusion using different phrasing. Think of it as the AI equivalent of double-checking a math problem you already solved correctly, but without adding new information.
2. Repetitive Statement (RS): The model rephrases the same logical step multiple times in slightly different words, inflating the token count without advancing the argument.
3. Over-Verification (OV): The model performs redundant checks on intermediate results that are already guaranteed correct by the preceding logic, akin to verifying that 2+2=4 after already confirming the addition operation.
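
The article does not reproduce the paper's formula, so the following is a minimal sketch under one natural assumption: RRR is the share of reasoning tokens that fall in steps labeled CSR, RS, or OV. The `Step` class, the hand-assigned labels, and the whitespace tokenizer are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three category names come from the article; the rest is illustrative.
REDUNDANT_LABELS = {"CSR", "RS", "OV"}


@dataclass
class Step:
    text: str
    label: Optional[str] = None  # None means the step is not redundant


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def reasoning_redundancy_ratio(steps: List[Step]) -> float:
    """Assumed definition: redundant tokens divided by all reasoning tokens."""
    total = sum(count_tokens(s.text) for s in steps)
    if total == 0:
        return 0.0
    redundant = sum(
        count_tokens(s.text) for s in steps if s.label in REDUNDANT_LABELS
    )
    return redundant / total


# A toy trace with hand-assigned labels (hypothetical data).
trace = [
    Step("x + 3 = 7 so x = 4"),
    Step("so x equals 4", "RS"),
    Step("check: 4 + 3 = 7", "OV"),
    Step("answer: 4"),
]
rrr = reasoning_redundancy_ratio(trace)
print(f"RRR = {rrr:.1%}")  # 10 of 21 tokens are labeled redundant
```

In practice the labels would come from an automatic detector, not from hand annotation.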

To detect these patterns, the team used a combination of semantic similarity scoring (using Sentence-BERT embeddings) and a novel Entailment Graph approach. They constructed a directed graph of all statements in a reasoning trace, where edges represent logical entailment. Redundancy is flagged when a node (statement) is entailed by a previous node in the same path, but does not lead to any new, non-entailed nodes downstream. This graph-based method is more robust than simple n-gram overlap detection, as it captures semantic redundancy even when the surface form differs.
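
The flagging rule can be sketched as follows. This is one possible reading of the description, not the authors' implementation. It assumes the edges are direct entailment judgments (for example from an NLI model), not a transitive closure, and that the final statement is the answer and is never flagged. The function name and the hand-picked edges are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def redundant_nodes(n: int, entails: Set[Tuple[int, int]]) -> List[int]:
    """Flag statements that are entailed by an earlier statement and do not
    lead to any statement that was not already entailed before them.

    Statements are numbered 0..n-1 in trace order; (i, j) means i entails j.
    """
    preds: Dict[int, Set[int]] = defaultdict(set)
    succs: Dict[int, Set[int]] = defaultdict(set)
    for i, j in entails:
        preds[j].add(i)
        succs[i].add(j)

    def already_entailed(node: int, before: int) -> bool:
        # True if some statement earlier than `before` entails `node`.
        return any(i < before for i in preds[node])

    flagged = []
    for j in range(n - 1):  # assumption: the last statement is the answer
        if not already_entailed(j, j):
            continue
        leads_to_new = any(not already_entailed(k, j) for k in succs[j])
        if not leads_to_new:
            flagged.append(j)
    return flagged


# Toy trace: 0 "x + 3 = 7", 1 "x = 4", 2 "so x is four", 3 "2x = 8"
edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)}
print(redundant_nodes(4, edges))  # [2]: the restatement adds nothing new
```

Statement 1 survives because it is the first to support statement 3, while statement 2 only repeats what statement 1 already established.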

Benchmark Results: The study evaluated models on five reasoning benchmarks: GSM8K (math), MATH, HumanEval (coding), HotpotQA (multi-hop QA), and a custom 'Logical Deduction' dataset. The key findings are summarized below:

| Model | Avg. RRR (%) | Avg. Tokens/Task | Accuracy (%) | Est. Cost/1M Tokens |
|---|---|---|---|---|
| o1-preview | 38.2 | 4,200 | 92.1 | $15.00 |
| DeepSeek-R1 | 41.5 | 5,100 | 90.8 | $2.19 |
| Qwen2.5-72B-Instruct | 29.1 | 2,800 | 85.4 | $1.20 |
| Claude Opus (thinking mode) | 33.7 | 3,900 | 91.5 | $15.00 |
| GPT-4o (standard) | 12.4 | 1,100 | 88.3 | $5.00 |

Data Takeaway: Models explicitly trained for extended reasoning (o1, DeepSeek-R1, Claude Opus) show RRR values roughly 3x higher (2.7x to 3.3x) than GPT-4o, the one standard non-reasoning baseline in the table; Qwen2.5-72B-Instruct sits in between at 29.1%. DeepSeek-R1, despite being far cheaper per token than o1-preview or Claude Opus, has the highest redundancy ratio, suggesting its open-source training pipeline may inadvertently reward verbosity. The accuracy gains from extended reasoning are real but diminishing: o1 and Claude Opus score only about 3-4 percentage points higher than GPT-4o while consuming 3.5-3.8x more tokens.
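
The ratios behind this takeaway can be recomputed directly from the table. The sketch below copies the table's figures and derives each model's RRR and token count relative to GPT-4o, its accuracy gap in percentage points, and an estimate of redundant tokens per task (average tokens multiplied by RRR, which assumes the two averages can be combined this way).

```python
# (avg RRR %, avg tokens per task, accuracy %) copied from the table above
rows = {
    "o1-preview": (38.2, 4200, 92.1),
    "DeepSeek-R1": (41.5, 5100, 90.8),
    "Qwen2.5-72B-Instruct": (29.1, 2800, 85.4),
    "Claude Opus (thinking mode)": (33.7, 3900, 91.5),
    "GPT-4o (standard)": (12.4, 1100, 88.3),
}

base_rrr, base_tokens, base_acc = rows["GPT-4o (standard)"]

summary = {}
for model, (rrr, tokens, acc) in rows.items():
    summary[model] = {
        "rrr_vs_gpt4o": round(rrr / base_rrr, 2),
        "tokens_vs_gpt4o": round(tokens / base_tokens, 2),
        "accuracy_gap_pts": round(acc - base_acc, 1),
        # Rough estimate: assumes avg tokens x avg RRR is meaningful.
        "redundant_tokens_per_task": round(tokens * rrr / 100),
    }

for model, stats in summary.items():
    print(model, stats)
```

On these numbers the three reasoning models land between 2.7x and 3.3x GPT-4o's redundancy ratio, and o1-preview spends about 1,600 tokens per task on steps flagged as redundant.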

The MVR Training Approach: The researchers fine-tuned DeepSeek-R1 using a modified loss function that adds a penalty term proportional to the RRR of the generated reasoning trace. The penalty is weighted by a hyperparameter λ, which they tuned to balance accuracy and efficiency. The resulting model, dubbed DeepSeek-R1-MVR, achieved a 35% reduction in average tokens per task while maintaining 90.2% accuracy (down only 0.6 percentage points from the original 90.8%). Because accuracy does fall slightly, this is not strictly a Pareto improvement, but it is a favorable trade-off: substantially less compute for nearly identical performance.
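
As described, the objective has the shape loss = cross-entropy + λ × RRR. The sketch below shows only that shape in plain Python. The function names, the probabilities, the λ value, and the second RRR value are hypothetical, and the article does not say how the penalty is made trainable.

```python
import math
from typing import List


def cross_entropy(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mvr_loss(token_probs: List[float], rrr: float, lam: float = 0.5) -> float:
    """Cross-entropy plus a lambda-weighted redundancy penalty.

    `rrr` is the redundancy ratio (0..1) of the generated trace.
    `lam` is the trade-off hyperparameter; 0.5 is an arbitrary placeholder.
    """
    return cross_entropy(token_probs) + lam * rrr


# Illustrative values only: same likelihoods, two different redundancy levels.
probs = [0.9, 0.8, 0.95]
loss_verbose = mvr_loss(probs, rrr=0.415)  # RRR reported for DeepSeek-R1
loss_concise = mvr_loss(probs, rrr=0.27)   # hypothetical lower-RRR trace
print(round(loss_verbose - loss_concise, 4))
```

An RRR measured on a sampled trace is not differentiable with respect to the model's parameters, so a real training setup would presumably need a policy-gradient style estimator or a differentiable proxy for the penalty.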

A related open-source project worth watching is ReasoningEfficiency (GitHub, ~2.3k stars), which provides tools for visualizing and pruning redundant reasoning steps from CoT traces. The repo's latest release includes a plugin for Hugging Face's Transformers library that can be applied at inference time to truncate redundant loops.
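
To make the idea of pruning redundant steps concrete, here is a generic sketch that drops any step too similar to one already kept. It is not the ReasoningEfficiency API, which the article does not describe. The bag-of-words cosine is a crude stand-in for the Sentence-BERT embeddings mentioned earlier, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import List


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def prune_steps(steps: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a step only if it is not a near-duplicate of any kept step."""
    kept: List[str] = []
    for step in steps:
        if all(cosine(step, k) < threshold for k in kept):
            kept.append(step)
    return kept


steps = [
    "x + 3 = 7 so x = 4",
    "so x = 4 since x + 3 = 7",  # same content, reordered
    "therefore 2 x = 8",
]
pruned = prune_steps(steps)
print(pruned)
```

Word overlap misses paraphrases that share few words, which is the gap the embedding and entailment methods described above are meant to close.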

Key Players & Case Studies

The study's findings have immediate implications for several major players in the AI ecosystem:

- OpenAI: o1-preview, their flagship reasoning model, is tied with Claude Opus as the most expensive model in the study per token ($15.00 per 1M tokens). The high RRR suggests that a significant portion of user spend on o1 is going toward wasted computation. OpenAI's upcoming 'o3' model, rumored to include adaptive depth control, may be a direct response to this inefficiency.
- DeepSeek: As an open-source alternative, DeepSeek-R1 is popular among cost-conscious developers. However, its 41.5% RRR is the highest in the study. The DeepSeek team has already acknowledged this issue in their technical report and is exploring 'token budget' training. The MVR fine-tuning approach could be integrated into their next release.
- Anthropic: Claude Opus's 'thinking mode' is designed for complex reasoning, but the study shows it still wastes ~34% of its tokens. Anthropic's constitutional AI approach may need to be extended to include a 'constitution of efficiency' that discourages circular reasoning.
- Google DeepMind: Gemini 1.5 Pro, while not included in this study, uses a Mixture-of-Experts architecture that could be adapted to allocate reasoning depth dynamically. Google's internal research on 'chain-of-thought pruning' aligns closely with these findings.

Comparison of Training Paradigms:

| Approach | Training Objective | Redundancy Penalty | Avg. Tokens Saved | Accuracy Delta |
|---|---|---|---|---|
| Standard SFT | Cross-entropy on correct traces | None | 0% | Baseline |
| RL (PPO) | Reward for correct answer | Implicit (length penalty) | ~10% | +0.5% |
| MVR (this study) | Cross-entropy + RRR penalty | Explicit (λ-weighted) | 35% | -0.6% |
| Token Budget RL | Reward for correct answer with max token limit | Hard constraint | 40% | -2.1% |

Data Takeaway: Hard token budget constraints (e.g., 'generate at most 500 tokens') reduce redundancy but also significantly hurt accuracy. The MVR approach achieves a much better trade-off by penalizing redundancy only when it doesn't contribute to correctness. This suggests that future training methods should focus on *quality* of reasoning, not just length.
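
One way to put a number on the trade-off is tokens saved per percentage point of accuracy lost. The sketch below uses the figures from the comparison table; the ratio itself is an illustrative choice, not a metric from the study.

```python
from typing import Optional

# (avg tokens saved %, accuracy delta in points) copied from the table above
approaches = {
    "RL (PPO)": (10, 0.5),
    "MVR (this study)": (35, -0.6),
    "Token Budget RL": (40, -2.1),
}


def tokens_saved_per_point_lost(saved_pct: float, acc_delta: float) -> Optional[float]:
    """Percent of tokens saved for each accuracy point given up.

    Returns None when accuracy did not drop, since there is no cost to divide by.
    """
    if acc_delta >= 0:
        return None
    return saved_pct / -acc_delta


tradeoff = {
    name: tokens_saved_per_point_lost(saved, delta)
    for name, (saved, delta) in approaches.items()
}
for name, value in tradeoff.items():
    print(name, "no accuracy cost" if value is None else round(value, 1))
```

By this measure MVR saves about 58% of tokens per accuracy point lost, against about 19% for the hard token budget.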

Industry Impact & Market Dynamics

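The study arrives at a critical inflection point for the AI industry. The market for inference compute is projected to grow from $20 billion in 2025 to over $80 billion by 2028, according to industry estimates. If spend scaled linearly with tokens and the savings applied across the whole market, a 30-50% reduction in per-task token consumption would be worth roughly $6-10 billion a year at 2025 levels and $24-40 billion a year by 2028. Since reasoning models account for only part of inference spend, realized savings would likely be smaller.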
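
The size of the savings depends on which market figure the percentage is applied to. A quick calculation, assuming spend scales linearly with tokens and the reduction applies to the entire market (both simplifications):

```python
from typing import Tuple


def savings_range_billion(
    market_billion: float, low_pct: int = 30, high_pct: int = 50
) -> Tuple[float, float]:
    """Annual savings in $ billions for a low/high percentage reduction."""
    return market_billion * low_pct / 100, market_billion * high_pct / 100


savings_2025 = savings_range_billion(20)  # $20B market quoted for 2025
savings_2028 = savings_range_billion(80)  # $80B market quoted for 2028
print(savings_2025, savings_2028)  # (6.0, 10.0) (24.0, 40.0)
```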

Immediate Impact Areas:

1. Real-time Applications: Customer service chatbots, code completion tools (GitHub Copilot, Cursor), and voice assistants currently avoid using reasoning models due to latency. If latency scales with generated tokens, a 35% token reduction would cut average response times from 5-8 seconds to roughly 3.3-5.2 seconds. Getting under 2 seconds, the level needed for interactive use, would require token reductions of 60-75% or additional speedups in serving.
2. Edge Deployment: Smaller, more efficient reasoning models could run on-device for applications like autonomous driving and AR glasses. If redundancy penalties transfer to small models, MVR-style training might help a 7B-parameter model close some of the gap to a 70B model and cut hardware requirements, though the experiments described here fine-tune only DeepSeek-R1.
3. Energy Consumption: Data centers already consume electricity on the scale of entire countries, and AI inference is a growing share of that load. If energy use scales with tokens generated, cutting reasoning tokens by 30% could reduce the carbon footprint of reasoning workloads by a similar margin, aligning with ESG goals.
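
Latency figures like those in item 1 can be sanity-checked with simple arithmetic. The sketch assumes response time scales linearly with generated tokens, which ignores fixed costs such as prompt processing and network overhead.

```python
def scaled_latency(seconds: float, token_reduction: float) -> float:
    """Latency after cutting generated tokens by `token_reduction` (0..1)."""
    return seconds * (1 - token_reduction)


def reduction_needed(seconds: float, target_seconds: float) -> float:
    """Fraction of tokens that must be cut to reach `target_seconds`."""
    return 1 - target_seconds / seconds


after_low = scaled_latency(5, 0.35)   # 5 s baseline with 35% fewer tokens
after_high = scaled_latency(8, 0.35)  # 8 s baseline with 35% fewer tokens
need_low = reduction_needed(5, 2)     # cut needed to get 5 s down to 2 s
need_high = reduction_needed(8, 2)    # cut needed to get 8 s down to 2 s
print(round(after_low, 2), round(after_high, 2))
print(round(need_low, 2), round(need_high, 2))
```

Under this assumption a 35% cut leaves responses at about 3.3 to 5.2 seconds, and reaching 2 seconds takes a 60% to 75% cut.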

Market Adoption Curve:

| Year | Adoption Scenario | Key Driver |
|---|---|---|
| 2025 (Baseline) | Niche: Research labs, high-value coding tasks | High accuracy, high cost |
| 2026 | Early Mainstream: Enterprise customer service, legal document analysis | Adaptive reasoning models (MVR-like) |
| 2027 | Broad: Real-time coding assistants, financial trading, education | Token cost drops 50%+, latency <1s |
| 2028 | Ubiquitous: On-device reasoning for all consumer apps | Sub-cent per query cost |

Data Takeaway: The adoption of reasoning models has been held back by cost and latency, not accuracy. The MVR approach directly addresses both barriers. If major providers (OpenAI, Google, Anthropic) integrate similar techniques into their next-generation models, we could see reasoning models become the default for all but the simplest queries within 18-24 months.

Risks, Limitations & Open Questions

While the study is groundbreaking, several caveats and risks deserve attention:

1. Benchmark Narrowness: The study only evaluated five benchmarks. It's unclear whether RRR generalizes to open-ended tasks like creative writing, strategic planning, or scientific hypothesis generation. In some domains, 'redundant' self-reflection might actually be beneficial for exploring alternative solutions.
2. The 'Good Redundancy' Problem: Not all redundancy is bad. In safety-critical applications (medical diagnosis, autonomous driving), redundant verification can catch errors. The study's RRR metric does not distinguish between harmful redundancy and beneficial double-checking. A clinician would prefer a model that verifies a diagnosis twice, even if it costs more tokens.
3. Gaming the Metric: If models are trained to minimize RRR, they might learn to produce shorter but less robust reasoning traces that fail on edge cases. The MVR approach showed only a 0.6-point accuracy drop, but this could widen on more diverse or adversarial datasets.
4. Interpretability Trade-off: Longer reasoning traces are often easier for humans to audit. Reducing redundancy could make model outputs more opaque, undermining trust in high-stakes applications.
5. Ethical Concerns: The push for efficiency could exacerbate existing biases. If models are trained to cut 'redundant' reasoning, they might skip important fairness checks or ethical considerations that are currently embedded in longer traces.

AINews Verdict & Predictions

This study is a wake-up call for the AI industry. The assumption that 'more thinking equals better thinking' has driven a wasteful arms race. The evidence is clear: current reasoning models are generating massive amounts of computational noise that adds little to no value. The editorial board at AINews believes this marks the beginning of the end for the 'long chain-of-thought' era.

Our Predictions:

1. By Q3 2026, every major LLM provider will ship an 'adaptive reasoning' mode that dynamically adjusts the depth of CoT based on task complexity. OpenAI's o3, Google's Gemini Ultra 2, and Anthropic's Claude 4 will all include this as a default setting, with an option to 'maximize reasoning' for critical tasks.
2. The MVR training approach will be adopted by at least two open-source model families (DeepSeek, Qwen) within 6 months. The code and methodology are already available, and the accuracy-efficiency trade-off is compelling enough for rapid adoption.
3. Inference costs for reasoning models will drop by 40-50% year-over-year for the next two years, driven by a combination of training improvements (MVR-like objectives), hardware acceleration (custom chips for sparse reasoning), and better token management.
4. A new category of 'efficiency-first' AI startups will emerge, offering lightweight reasoning models optimized for specific verticals (legal, medical, finance) that undercut general-purpose models on cost by 60-70%.
5. The biggest loser will be any company that continues to train models purely on 'correct answer' reward without penalizing redundancy. They will be priced out of the market by more efficient competitors.

What to Watch: The next major release from DeepSeek (likely DeepSeek-R2) and OpenAI's o3 will be the first real-world tests of these ideas. If they show significant efficiency gains without accuracy loss, the industry will pivot hard. If they don't, the study's findings may remain academic for another cycle. We're betting on the pivot.

The bottom line: Thinking is good. Wasted thinking is expensive. The smartest models of the future won't be the ones that think the longest — they'll be the ones that know when to stop.
