The Hard Mode Revolution: How a New Open-Source Framework Redefines AI's True Reasoning Capabilities

arXiv cs.AI April 2026
Source: arXiv cs.AI · Topic: formal verification · Archive: April 2026
A paradigm-shifting open-source framework is exposing a critical flaw in how we measure AI's reasoning power. By making AI agents discover *what* to prove before tackling *how* to prove it, the 'Hard Mode' benchmark shows that current evaluations reflect true capability like a distorted mirror.

The field of Automated Theorem Proving (ATP) is undergoing a fundamental reassessment driven by the release of a novel open-source agent framework built on Lean 4. This framework introduces a rigorous 'Hard Mode' benchmark that directly challenges the prevailing 'Easy Mode' evaluation paradigm. In Easy Mode, common in benchmarks like MiniF2F and MATH, the theorem to be proven is explicitly stated within the problem prompt. This allows systems, particularly large language models (LLMs), to function primarily as proof formalizers—translating a given conclusion into a valid chain of reasoning within a formal system like Lean, Coq, or Isabelle. While impressive, this tests syntactic manipulation and retrieval more than genuine deductive discovery.

The new framework eliminates this shortcut. In Hard Mode, the AI agent is presented with a mathematical context, definitions, and possibly lemmas, but not the final theorem statement. The agent must first explore the problem space, hypothesize what might be true, and only then attempt to construct a formal proof of its discovered conjecture. This mirrors human mathematical reasoning far more closely: the creative leap of formulating the right conjecture is often harder than the subsequent verification. The framework's creators argue that Easy Mode has led to an overestimation of AI's true reasoning abilities, producing systems adept at proof search but deficient in conceptual discovery.
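
To make the contrast concrete, here is a minimal Lean 4 sketch (the names `double`, `given_statement`, and `discovered` are illustrative, not taken from the framework; it assumes a recent toolchain where the `omega` tactic is available):

```lean
-- Easy Mode: the statement is handed to the agent, which only searches
-- for a proof of the stated goal.
theorem given_statement (a b : Nat) : a + b = b + a := by
  omega

-- Hard Mode: only the context is given. The agent sees a definition like
-- the one below and must itself author the `theorem` line -- e.g. propose
-- that `double n = 2 * n` -- before any proof search begins.
def double (n : Nat) : Nat := n + n

theorem discovered (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```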

The implications are profound. For mathematical research, it shifts the potential role of AI from a verification assistant to a collaborative partner capable of suggesting novel avenues and conjectures. In software verification and program synthesis, it demands a deeper, semantic understanding of code behavior rather than just pattern-matching against known specifications. By open-sourcing this framework and its benchmark, the initiative aims to catalyze a community-wide recalibration, steering research toward building AI that doesn't just memorize and rearrange but genuinely reasons and innovates.

Technical Deep Dive

At its core, the Hard Mode framework is an agent architecture built atop the Lean 4 theorem prover and programming language. Lean's metaprogramming capabilities and efficient kernel make it ideal for orchestrating the two-phase process of conjecture discovery followed by proof formalization. The framework typically implements a search-based agent that interacts with Lean's `Tactic` state.

Phase 1: Conjecture Discovery. The agent begins in an environment defined by imported theories (e.g., basic number theory, group definitions). It does not have a `theorem ... := by ...` goal. Instead, it employs strategic exploration:
1. Forward Reasoning: Applying existing lemmas and definitions to generate new facts from the givens.
2. Backward Chaining with Metavariables: Proposing potential theorem statements with placeholders (e.g., `∀ (a b : ℕ), a + b = ?x`), then attempting to solve for `?x` through unification and constraint solving.
3. LLM-Guided Heuristics: A tightly integrated LLM (like GPT-4 or Claude 3) acts as a heuristic generator. Given the current proof state and context, it suggests plausible conjectures or productive directions. Crucially, the LLM's suggestions are not taken as ground truth but as hypotheses to be tested by the formal system.
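
A minimal Python sketch of this discovery loop, with the LLM proposer and the Lean-side refutation check stubbed out as plain callables (all names here are hypothetical, not the framework's API; Python integers stand in for `Nat`):

```python
def refute(conjecture, bound=20):
    """Cheap symbolic gate: brute-force counterexample search over small
    values (in the real framework this would be a Lean-backed check)."""
    _, pred = conjecture
    return any(not pred(a, b) for a in range(bound) for b in range(bound))

def discovery_loop(proposals, max_keep=5):
    """Phase 1: keep only proposed conjectures that survive the gate;
    survivors are handed to Phase 2 for full proof construction."""
    surviving = []
    for conjecture in proposals:
        if refute(conjecture):
            continue  # counterexample found: the suggestion is discarded
        surviving.append(conjecture)
        if len(surviving) >= max_keep:
            break
    return surviving

# Stand-ins for LLM-suggested conjectures over natural numbers.
proposals = [
    ("add_comm", lambda a, b: a + b == b + a),  # true: survives the gate
    ("sub_comm", lambda a, b: a - b == b - a),  # false for a != b: refuted
    ("add_zero", lambda a, b: a + 0 == a),      # true: survives the gate
]
kept = discovery_loop(proposals)
print([name for name, _ in kept])  # ['add_comm', 'add_zero']
```

The key design point, mirrored from the framework's architecture, is that the proposer's output is never trusted: every candidate must pass the refutation gate before any proof effort is spent on it.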

Phase 2: Proof Construction. Once a candidate conjecture is generated and deemed interesting (e.g., non-trivial, not immediately refuted by a counterexample search), the agent switches to a more traditional ATP mode. It now has a concrete goal and can use tactics like `simp`, `ring`, `omega`, and its own learned proof-search strategies to build a verifiable proof.
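
Once the goal is fixed, the proof step itself is often mechanical. A toy Lean 4 instance (illustrative only; `omega` decides linear-arithmetic goals of this shape):

```lean
-- Phase 2 sketch: with a concrete, discovered goal in hand, a stock
-- decision procedure can frequently close it without further search.
theorem candidate_conjecture (a b : Nat) : (a + b) * 2 = 2 * a + 2 * b := by
  omega
```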

Key to the framework is the separation of concerns between a heuristic, often neural, conjecture proposer and a sound, symbolic verifier. This aligns with the "neuro-symbolic" paradigm but with a strict gate: no conjecture proceeds without passing the symbolic filter. A relevant open-source repository demonstrating early principles is `lean-step` (GitHub: `lean-step`), a toolkit for training reinforcement learning agents to interact with Lean. While not implementing full Hard Mode, it provides the foundational infrastructure for agents to learn proof-search policies, which can be extended to conjecture search.

The performance gap between Easy and Hard Mode is stark. Preliminary results from the framework's benchmark suite show a dramatic drop in success rates for current state-of-the-art agents.

| Agent / Model | Easy Mode (MiniF2F) Success Rate | Hard Mode (Proposed Benchmark) Success Rate | Notes |
|---|---|---|---|
| GPT-4 + Lean Copilot | ~42% | <5% | Relies heavily on being given the theorem statement. |
| Claude 3 Opus + Proof Search | ~38% | ~3% | Similar pattern; strong formalization, weak discovery. |
| Specialized ATP (Vampire, E) | High on eligible problems | ~0% | Not designed for open-ended conjecture generation. |
| Hard Mode Framework (v0.1) | N/A | ~12% | Baseline performance on curated discovery problems. |

Data Takeaway: The table reveals a catastrophic drop in performance when the answer is not embedded in the question. Even the most advanced LLMs, which approach a 40-50% pass rate in Easy Mode, fall to near-zero in genuine discovery tasks. The specialized Hard Mode framework, while starting at a low absolute rate, establishes a non-zero baseline for a capability that was previously almost unmeasured.

Key Players & Case Studies

The push for Hard Mode evaluation is being driven by a coalition of academic researchers and open-source developers focused on the intersection of LLMs and formal methods. Key figures include Christian Szegedy at Google, whose work on formal mathematics and the `LeanDojo` project has highlighted the dataset contamination problem in Easy Mode benchmarks. Stanislas Polu and Katherine Crowson from OpenAI's former mathematics team have contributed to understanding LLM limitations in formal reasoning. The `ProofNet` benchmark, created by researchers including Albert Q. Jiang and Sean Welleck, was an early attempt to create a cleaner, contamination-resistant dataset, though it still largely operates in an Easy Mode paradigm.

The primary case study is the development ecosystem around Lean 4 and the Lean Community. Projects like `mathlib4`, the monumental collaborative formalization of mathematics, provide the essential library against which any reasoning agent must be tested. The Hard Mode framework is, in many ways, a direct response to the needs of `mathlib4` contributors, who spend most of their time figuring out *what* to formalize next, not just how.

A competing but complementary approach comes from Meta's `Code Llama` and related models fine-tuned on code and mathematics. While powerful for in-context learning and code generation, their evaluation has largely been on HumanEval or MATH, which are effectively Easy Mode for their domains. The release of the Hard Mode framework creates pressure for these teams to demonstrate their models' capabilities in this more rigorous setting.

| Entity | Primary Contribution | Stance on Hard Mode | Key Product/Project |
|---|---|---|---|
| Lean Community / OSS Devs | Created `Lean 4`, `mathlib4`, `lean-step` | Driving force; sees it as essential for true assistive AI in math. | Hard Mode Framework (reference implementation) |
| Google Research (Brain) | `LeanDojo`, `MiniF2F`, foundational LLM-math research | Acknowledges the problem; actively working on discovery-capable agents. | Potential integration with Gemini models |
| OpenAI | GPT-4's strong performance on MATH, investment in formal reasoning | Historically focused on benchmark performance; may need to adapt evaluations. | ChatGPT, potential future "reasoning" models |
| Meta AI | `Code Llama`, `LLaMA` models, open-weight approach | Could leverage open framework to fine-tune and benchmark LLaMA models. | Code Llama, LLaMA 3 |
| Academic ATP Community | Tools like Vampire, E, Isabelle, Coq | Traditionally focused on proof search; Hard Mode presents a new, harder challenge. | Specialized theorem provers |

Data Takeaway: The landscape shows a split between traditional ATP (focused on verification), LLM developers (focused on broad benchmarks), and the formal math community (focused on practical utility). The Hard Mode framework, born from the latter, is now challenging the evaluation strategies of the former two groups, potentially forcing a convergence on more meaningful metrics.

Industry Impact & Market Dynamics

The Hard Mode shift is not merely academic; it has concrete implications for the burgeoning market of AI in STEM, software development, and verification.

1. Mathematical Research & Education: Tools like Wolfram Alpha and emerging AI tutors have been limited to answering well-posed questions. A Hard Mode-capable AI could become a research collaborator, suggesting lemmas or potential proof strategies in interactive theorem provers. This could accelerate fields like number theory or combinatorics where conjecture generation is key. The market for advanced research tools, currently niche, could expand significantly.

2. Software Verification & DevOps: Companies like Synopsys, Coverity, and GitHub (with Copilot and Advanced Security) offer static analysis and code-suggestion tools. These tools identify bugs against known patterns or specifications. Hard Mode reasoning would enable AI to *infer* correct behavior or security properties from code alone, moving towards automatic specification generation and proving that code adheres to inferred invariants. This represents a multi-billion dollar leap in reliability for critical software in aerospace, finance, and infrastructure.

3. AI Safety & Alignment Research: A core challenge in aligning advanced AI is specifying human intent in a rigorous, unambiguous way. Hard Mode reasoning is essentially the process of inferring correct, general rules ("theorems") from specific examples and constraints ("the problem context"). Progress here directly feeds into creating AI that can robustly infer and adhere to complex human values. Funding in this area, from both philanthropic (Open Philanthropy, FTX Future Fund legacy) and corporate (Anthropic's focus on constitutional AI) sources, is likely to flow towards teams demonstrating Hard Mode capabilities.
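
The specification-inference idea in point 2 can be shown in miniature: observe a function on sample inputs and keep only the candidate invariants that hold on every trace (a Daikon-style toy sketch, not any vendor's actual tooling):

```python
def infer_invariants(fn, inputs, candidates):
    """Retain the named candidate invariants that are never violated on
    the observed (input, output) traces of `fn`."""
    traces = [(x, fn(x)) for x in inputs]
    return [name for name, holds in candidates
            if all(holds(x, y) for x, y in traces)]

def abs_like(x):
    """Function under analysis: behaves like absolute value."""
    return x if x >= 0 else -x

candidates = [
    ("output >= 0",         lambda x, y: y >= 0),
    ("output == input",     lambda x, y: y == x),         # violated at x < 0
    ("output >= input",     lambda x, y: y >= x),
    ("|output| == |input|", lambda x, y: abs(y) == abs(x)),
]

inferred = infer_invariants(abs_like, range(-5, 6), candidates)
print(inferred)  # ['output >= 0', 'output >= input', '|output| == |input|']
```

A Hard Mode-capable verifier would go one step further than this sketch: after inferring such invariants, it would attempt to formally prove them rather than merely failing to refute them on finite traces.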

| Application Sector | Current AI Approach (Easy Mode) | Future with Hard Mode Capability | Potential Market Impact (5-Yr Projection) |
|---|---|---|---|
| Mathematical Research | Proof automation for stated theorems. | Conjecture generation, research direction suggestion. | $500M - $1B in tooling and subscription services. |
| Software Verification | Pattern-based bug detection (SAST). | Automatic specification inference and proof. | Could capture 20-30% of the $15B+ application security market. |
| AI Safety & Alignment | Fine-tuning on human feedback datasets. | Learning and reasoning about abstract principles from interaction. | Not directly monetizable but critical for enabling trillion-dollar AGI deployments. |
| STEM Education | Step-by-step solvers for textbook problems. | Interactive exploration partners that ask the *student* to conjecture. | Enhanced share of the $10B+ EdTech AI market. |

Data Takeaway: The economic value shifts from automation of defined tasks (Easy Mode) to augmentation of creative and discovery processes (Hard Mode). The software verification market represents the most immediate and high-value commercial application, where moving from detecting known bugs to proving the absence of unknown ones is a paradigm shift worth billions.

Risks, Limitations & Open Questions

1. The Scalability Wall: The combinatorial explosion in the space of possible conjectures is far vaster than the space of proofs for a given conjecture. While LLMs provide heuristic guidance, their exploration is still slow and computationally expensive. Scaling Hard Mode agents to the complexity of research-level mathematics remains a monumental engineering and algorithmic challenge.

2. Evaluation Subjectivity: What constitutes a "good" or "interesting" conjecture? The framework needs metrics beyond mere provability. Is a trivial generalization of an existing lemma a success? This requires embedding mathematical taste into evaluation, a famously subjective area.

3. Overfitting to the Framework: As the Hard Mode benchmark gains adoption, there is a risk that teams will over-optimize agents for its specific problem distribution or Lean 4's particular tactics, rather than developing general discovery reasoning. This would repeat the cycle of benchmark distortion it aims to break.

4. Neglect of Other Reasoning Forms: The focus on formal mathematics, while rigorous, is narrow. Genuine human reasoning includes physical intuition, analogical thinking, and abductive reasoning under uncertainty. An overemphasis on deductive theorem discovery could lead to AI that is brittle outside formal domains.

5. The "Oracle" Problem: If an LLM is used as the conjecture proposer, the system's discovery capability is ultimately bounded by the latent knowledge and biases of that LLM, which is trained on existing human data. This may limit truly novel, revolutionary discoveries that lie outside the training distribution.

AINews Verdict & Predictions

The introduction of the Hard Mode framework is the most important corrective in AI reasoning evaluation since the creation of the MATH dataset. It successfully identifies and attacks a critical blind spot in our collective assessment of AI intelligence. By demanding that AI not just verify but propose, it moves the goalposts from syntactic competence to semantic understanding.

Our predictions are as follows:

1. Benchmark Supremacy Shift (12-18 months): Within the next year, no serious paper claiming advances in AI reasoning will be able to rely solely on Easy Mode benchmarks like MATH or GSM8K. Hard Mode or similarly rigorous discovery-based evaluations will become the new standard for top-tier conferences (NeurIPS, ICLR).
2. First "Hard Mode" Startup Acquisition (24 months): A small team that demonstrates a significant breakthrough on the Hard Mode benchmark—say, achieving a 30% success rate on a non-trivial subset—will be acquired by a major cloud or software vendor (Microsoft/GitHub, Google, or a cybersecurity giant like Palo Alto Networks) for a sum between $50M and $200M. The acquirer will be buying the capability to move towards self-verifying software.
3. Lean 4 Becomes the De Facto Platform: The momentum around `mathlib4` and the agent frameworks being built on Lean 4 will cement its position as the leading platform for AI-driven formal reasoning research, outpacing Coq and Isabelle in this niche due to its modern design and developer-friendly tooling.
4. The Emergence of the "Discovery Score": We will see the development of a standardized, composite metric—a "Discovery Score"—that weights an AI's ability to generate novel, non-trivial, and provable conjectures. This metric will become a key differentiator in marketing for foundation model companies by 2026.
5. Limited Near-Term Commercial Impact, Massive Long-Term Leverage: The direct commercial products from Hard Mode AI will be limited for the next 2-3 years, primarily serving advanced research labs and verification teams. However, the techniques developed will be the foundational scaffolding for the next generation of AI systems that can genuinely reason about the world, plan complex actions, and innovate. The companies that master this paradigm first will have a decisive, long-term advantage in the race toward robust artificial general intelligence.

The Hard Mode revolution is ultimately a call for intellectual honesty. It forces the field to stop conflating proof search with proof discovery, and in doing so, it charts a harder but truer path toward machines that can genuinely think.
