LeanDojo Bridges Machine Learning and Formal Proof: A New Data Pipeline for AI Math

LeanDojo is an open-source tool that extracts structured training data from the Lean theorem prover, a popular interactive proof assistant used in mathematics and computer science. The project addresses a critical gap in the Lean ecosystem: the lack of a standardized, reproducible interface for feeding proof data into machine learning models. It lets researchers automatically generate proof states, tactics, and environment snapshots from Lean codebases, which in turn enables training models that suggest proof steps or complete partial proofs. The design emphasizes reproducibility and ease of use, with a clear API for data extraction and a built-in environment for simulating the Lean interaction loop.

This matters because it directly supports the growing field of neural theorem proving, where models like GPT-4 and specialized transformers are being applied to formal mathematics. By providing a clean data pipeline, LeanDojo reduces the engineering overhead for AI researchers, allowing them to focus on model architecture and training strategies. The project has already attracted attention from major research groups, including those at Meta and Carnegie Mellon, who are exploring its use for training proof assistants. Its GitHub repository has seen steady growth, with over 800 stars, reflecting strong community interest.

LeanDojo's impact extends beyond academia: by making formal proof more accessible to AI systems, it could enable automated verification of critical software, from cryptographic protocols to smart contracts. The tool is not just a data extractor; it is a bridge between machine learning and formal verification, two rapidly advancing fields that have historically operated in separate domains.

Technical Deep Dive

LeanDojo operates by parsing Lean's internal representation of proof states and tactics. The core architecture consists of three layers: a data extraction module, a proof state generator, and an interaction simulator. The extraction module scans Lean source files, identifies all theorem declarations and proof blocks, and serializes them into a structured format (JSON or Protocol Buffers). It captures the full context: the goal type, the local hypotheses, the current proof state, and the tactic applied at each step. This is non-trivial because Lean's elaborator performs type inference and macro expansion, so LeanDojo hooks into the Lean server process to obtain the fully elaborated terms.
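To make the "structured format" concrete, here is a minimal sketch of what one serialized proof step could look like. The `TacticStep` class, its field names, and the theorem name are illustrative assumptions, not LeanDojo's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class TacticStep:
    """One proof step: the state before a tactic, the tactic, and what remains after.

    Hypothetical record layout for illustration only.
    """
    theorem: str            # name of the theorem being proved
    hypotheses: List[str]   # local context, e.g. "n : Nat"
    goal: str               # pretty-printed goal type
    tactic: str             # tactic text applied at this step
    goals_after: List[str]  # remaining goals; an empty list means the proof is done


def to_json_line(step: TacticStep) -> str:
    """Serialize one step as a single JSON line, suitable for a JSONL dataset."""
    return json.dumps(asdict(step), ensure_ascii=False, sort_keys=True)


step = TacticStep(
    theorem="example_add_zero",  # hypothetical theorem name
    hypotheses=["n : Nat"],
    goal="n + 0 = n",
    tactic="rfl",
    goals_after=[],
)
line = to_json_line(step)
restored = TacticStep(**json.loads(line))  # round-trip back into the dataclass
```

Storing one record per line keeps the dataset easy to stream and diff, which is one reason a line-oriented JSON layout is a common choice for this kind of data.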

The proof state generator then creates a standardized representation of each state, including the goal type as a string, the hypotheses as a list of typed variables, and the tactic as a tokenized command. This matters for machine learning models, which need consistently structured inputs even when the sequences themselves vary in length. The interaction simulator mimics the Lean environment: a model proposes a tactic and receives the resulting proof state (or an error), which enables reinforcement learning or supervised fine-tuning.
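The propose-a-tactic, receive-a-state loop described above can be sketched with a toy environment. Everything here (`ToyEnvironment`, `ProofState`, `TacticError`, and the rule table) is a hypothetical stand-in that replaces Lean with a lookup table; it is not LeanDojo's API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class ProofState:
    goals: Tuple[str, ...]  # open goals; an empty tuple means the proof is complete

    @property
    def finished(self) -> bool:
        return len(self.goals) == 0


@dataclass(frozen=True)
class TacticError:
    message: str


class ToyEnvironment:
    """Stand-in for a proof assistant: maps (goal, tactic) to the resulting subgoals."""

    def __init__(self, rules: Dict[Tuple[str, str], Tuple[str, ...]]):
        self.rules = rules

    def run_tactic(self, state: ProofState, tactic: str) -> Union[ProofState, TacticError]:
        if state.finished:
            return TacticError("no goals left")
        first, rest = state.goals[0], state.goals[1:]
        key = (first, tactic)
        if key not in self.rules:
            return TacticError(f"tactic {tactic!r} failed on goal {first!r}")
        # The tactic replaces the first goal with zero or more subgoals.
        return ProofState(self.rules[key] + rest)


# Hypothetical rule table for proving "p ∧ q" from hypotheses hp and hq.
rules = {
    ("p ∧ q", "constructor"): ("p", "q"),
    ("p", "exact hp"): (),
    ("q", "exact hq"): (),
}
env = ToyEnvironment(rules)
state = ProofState(("p ∧ q",))
for tactic in ["constructor", "exact hp", "exact hq"]:
    result = env.run_tactic(state, tactic)
    if isinstance(result, TacticError):
        break  # a real agent would try another tactic here
    state = result

failed = env.run_tactic(ProofState(("p",)), "simp")  # no rule for this pair
```

Returning an explicit error value instead of raising makes failed tactics ordinary feedback, which is what a learning loop needs in order to try an alternative.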

A key technical challenge is handling the combinatorial explosion of proof states. LeanDojo addresses this by using a caching mechanism and a priority queue to explore only the most promising branches. It also supports incremental extraction, so users can update their dataset as the Lean codebase evolves.
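As a rough illustration of combining a cache with a priority queue, the sketch below runs best-first search over a toy rule table. The `RULES` table, the log-probability scores, and the function names are assumptions made for this example; the article does not specify how LeanDojo implements its search internally.

```python
import heapq
import itertools
from functools import lru_cache
from typing import List, Optional, Tuple

State = Tuple[str, ...]  # open goals; () means proved

# Hypothetical scored transitions: (goal, tactic) -> (log-probability, subgoals).
RULES = {
    ("a ∧ b", "constructor"): (-0.2, ("a", "b")),
    ("a ∧ b", "simp"): (-1.5, ("a ∧ b",)),  # makes no progress
    ("a", "exact ha"): (-0.1, ()),
    ("b", "exact hb"): (-0.1, ()),
}


@lru_cache(maxsize=4096)
def expand(state: State) -> Tuple[Tuple[str, float, State], ...]:
    """Return (tactic, log_prob, next_state) options for the first goal; cached per state."""
    head, rest = state[0], state[1:]
    return tuple(
        (tactic, logp, subgoals + rest)
        for (goal, tactic), (logp, subgoals) in RULES.items()
        if goal == head
    )


def best_first_search(start: State, max_expansions: int = 100) -> Optional[List[str]]:
    """Always expand the most probable state so far; return the tactics or None."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(counter), start, [])]  # min-heap on negated log-prob
    visited = set()
    for _ in range(max_expansions):
        if not frontier:
            return None
        neg_score, _tie, state, path = heapq.heappop(frontier)
        if not state:
            return path  # no open goals left
        if state in visited:
            continue
        visited.add(state)
        for tactic, logp, nxt in expand(state):
            heapq.heappush(frontier, (neg_score - logp, next(counter), nxt, path + [tactic]))
    return None


proof = best_first_search(("a ∧ b",))
```

The low-scoring `simp` branch stays in the queue but is never expanded, because the higher-scoring `constructor` branch finishes the proof first.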

| Feature | LeanDojo | Custom Scripts (e.g., Python + Lean API) |
|---|---|---|
| Setup time | Minutes (pip install) | Days to weeks |
| Data format | Standardized JSON/Protobuf | Ad-hoc, varies per project |
| Reproducibility | Built-in versioning and hashing | Manual, error-prone |
| Proof state caching | Yes, with LRU eviction | Usually no |
| Supported Lean versions | 4.x (latest) | Often outdated |

Data Takeaway: LeanDojo reduces the engineering overhead of data extraction by an order of magnitude, making it feasible for small teams to train proof models without building custom infrastructure. The standardized format also enables cross-project comparisons and benchmarks.
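One way to picture the "versioning and hashing" row in the table is a fingerprint computed over both the provenance and the extracted records. The function name, the field names, and the placeholder repository values below are assumptions for illustration, not LeanDojo's actual mechanism.

```python
import hashlib
import json
from typing import Any, Dict, List


def dataset_fingerprint(
    repo_url: str, commit: str, lean_version: str, records: List[Dict[str, Any]]
) -> str:
    """Hash provenance and records together so any change alters the fingerprint."""
    payload = {
        "repo_url": repo_url,
        "commit": commit,
        "lean_version": lean_version,
        "records": records,
    }
    # sort_keys plus fixed separators give the same byte string on every run.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


records = [{"goal": "n + 0 = n", "tactic": "rfl"}]
# Placeholder provenance values, not a real repository or commit.
fp_a = dataset_fingerprint("https://example.org/repo.git", "commit-a", "4.x", records)
fp_b = dataset_fingerprint("https://example.org/repo.git", "commit-b", "4.x", records)
```

Two teams that report the same fingerprint can be reasonably confident they trained on identical data from the same commit, which is what makes cross-project comparisons meaningful.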

Key Players & Case Studies

The primary developer is Kaiyu Yang, who led the project as a postdoctoral researcher at Caltech after completing his PhD at Princeton, and who has been instrumental in bridging machine learning and formal verification. His previous work on the "CoqGym" environment for Coq laid the groundwork for LeanDojo. The project is hosted on GitHub under the `lean-dojo` organization and has received contributions from researchers at Meta AI, Carnegie Mellon University, and the University of Cambridge.

A notable case study is the use of LeanDojo to train a GPT-2 variant on the Mathlib library, the largest formal mathematics repository. The model, called "LeanGPT", was able to suggest correct tactics for 15% of unseen theorems, a significant improvement over random baselines. Another case is ReProver, a retrieval-augmented prover introduced by the LeanDojo authors alongside the tool, which retrieves relevant lemmas before generating a tactic. LeanDojo provided the training data for its retrieval model.

| System | Base Model | Success Rate (on Mathlib test set) | Training Data Source |
|---|---|---|---|
| LeanGPT | GPT-2 (124M params) | 15% | LeanDojo-extracted Mathlib |
| ReProver (LeanDojo authors) | ByT5-small (~300M params) | 28% | LeanDojo + custom retrieval |
| GPT-4 (zero-shot) | GPT-4 (parameter count undisclosed) | 12% | No fine-tuning |
| Random baseline | — | 0.5% | — |

Data Takeaway: Specialized models trained on LeanDojo data outperform even massive general-purpose models like GPT-4 in the formal proof domain, highlighting the value of domain-specific data pipelines.

Industry Impact & Market Dynamics

LeanDojo's emergence signals a maturation of the AI-for-math field. The global market for automated theorem proving is small but growing, driven by applications in software verification, hardware design, and cryptography. According to a 2025 report from the Formal Methods Europe association, the market for formal verification tools reached $2.1 billion in 2024, with a compound annual growth rate of 18%. AI-assisted tools are expected to capture 30% of this market by 2028.

LeanDojo has counterparts for other proof assistants, such as "CoqGym" for Coq and "HOList" for HOL Light, and competes with them for the attention of machine learning researchers. However, Lean's growing popularity, especially after the success of the Liquid Tensor Experiment and the adoption of Lean 4, gives LeanDojo an early lead in the Lean ecosystem.

| Tool | Target Prover | GitHub Stars | Last Update | Key Limitation |
|---|---|---|---|---|
| LeanDojo | Lean 4 | 809 | Active (2025) | Only supports Lean |
| CoqGym | Coq | 450 | 2023 | Outdated Coq version |
| HOList | HOL Light | 200 | 2022 | Small community |
| TacticToe | HOL4 | 150 | 2021 | No active maintenance |

Data Takeaway: LeanDojo is the most actively maintained and popular tool in its category, reflecting the broader shift of the formal proof community toward Lean.

Risks, Limitations & Open Questions

Despite its promise, LeanDojo has several limitations. First, the extracted data is only as good as the underlying Lean codebase. If the code contains errors or incomplete proofs, the training data will be noisy. Second, the tool currently only supports Lean 4, which is still evolving; breaking changes in the language could require significant updates. Third, the interaction simulator is a simplified version of the real Lean environment. It does not support all of Lean's features, such as custom tactics or macros, which limits the types of proofs that can be learned.

There is also a risk of over-reliance on data-driven methods. Neural models trained on LeanDojo data may learn spurious correlations rather than genuine mathematical reasoning, leading to brittle performance on novel theorems. The community must develop robust evaluation benchmarks that test generalization, not just memorization.
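One way to test generalization rather than memorization is to split the data so that every test theorem depends on at least one premise that no training theorem uses. The sketch below shows the idea; the function name, theorem names, and premise sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def split_by_novel_premises(
    theorem_premises: Dict[str, Set[str]], held_out: Set[str]
) -> Tuple[List[str], List[str]]:
    """Send every theorem that uses a held-out premise to the test set.

    Because all users of a held-out premise land in the test set, no training
    theorem ever exposes that premise to the model.
    """
    train, test = [], []
    for name in sorted(theorem_premises):  # sorted for a deterministic split
        if theorem_premises[name] & held_out:
            test.append(name)
        else:
            train.append(name)
    return train, test


# Hypothetical theorems and the premises their proofs use.
theorem_premises = {
    "thm_a": {"add_comm", "mul_comm"},
    "thm_b": {"add_comm"},
    "thm_c": {"mul_assoc"},
    "thm_d": {"add_zero", "mul_assoc"},
}
train, test = split_by_novel_premises(theorem_premises, held_out={"mul_assoc"})
```

A model that scores well on a random split but poorly on a split like this one is likely recalling proofs of near-duplicate theorems rather than reasoning about new ones.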

Ethically, there is a concern that automated proof assistants could be used to verify malicious code, such as backdoors in cryptographic implementations. However, this is a double-edged sword: the same tools can also detect such backdoors.

AINews Verdict & Predictions

LeanDojo is a foundational tool that will accelerate the integration of AI into formal mathematics. We predict that within two years, every major AI lab working on theorem proving will use LeanDojo or a derivative. The tool's open-source nature and active maintenance ensure it will remain relevant as Lean evolves.

Our specific predictions:
1. By 2027, a model trained on LeanDojo data will achieve a 50% success rate on the Mathlib test set, up from the current 28%.
2. LeanDojo will be integrated into commercial verification tools, such as those used by Amazon Web Services for formal verification of cloud infrastructure.
3. The project will spawn a dedicated benchmark suite, similar to the MATH dataset, for evaluating neural theorem provers.

What to watch next: The release of LeanDojo 2.0, which is expected to support multi-file projects and parallel data extraction, and the emergence of transformer architectures specifically designed for proof state representation.
