經驗為師:新強化學習範式如何透過探索教會AI思考

目前使用強化學習訓練大型語言模型的主流範式,正遭遇根本性的瓶頸。模型變得『獎勵短視』,只為優化分數而非真正理解。一種新興方法正將探索本身視為一個可學習的過程,並在原則性指導下進行。
The article body is currently shown in English by default. You can generate the full version in this language on demand.

The reinforcement learning (RL) framework that has powered the most capable large language models is undergoing a critical re-evaluation. The prevailing methodology—fine-tuning models against static reward functions based on human preferences—has produced impressive conversational agents but reveals deep limitations. Models trained this way exhibit what researchers call "incentive myopia": they become adept at generating responses that maximize their reward score within a narrow distribution of known strategies, but they fail to develop robust exploration capabilities necessary for novel problem-solving. This creates AI that is brittle, uncreative, and incapable of genuine discovery outside its training distribution.

A growing body of research from leading AI labs is advocating for a paradigm shift. Instead of treating the RL objective as simply maximizing expected reward, the new perspective frames it as guiding the model's policy toward an ideal, high-reward answer distribution. Crucially, the exploration process itself must be aligned with this goal. The key innovation is designing mechanisms where the model learns to explore not randomly, but by leveraging patterns from its own past successful reasoning trajectories—its "experience." This transforms the model from a passive optimizer of external scores to an active learner that builds internal models of what constitutes productive exploration.

This technical evolution has profound implications. It moves AI training closer to how humans and animals learn: through trial, error, and the accumulation of experiential wisdom. Successful implementations could produce AI agents capable of sustained learning in open-ended environments—from scientific discovery and dynamic strategy games to long-horizon creative tasks. The commercial stakes are substantial, as this approach could be the key to developing truly autonomous systems that adapt and improve through interaction, rather than merely executing pre-programmed routines. The shift represents nothing less than an attempt to move AI from pattern matching toward experience-driven cognition.

Technical Deep Dive

The core technical challenge lies in moving beyond Proximal Policy Optimization (PPO) and similar algorithms that dominate current RL fine-tuning. These methods excel at policy improvement within a local neighborhood but are notoriously poor at exploration. They suffer from a "lock-in" effect: once a policy finds a high-reward region (even a suboptimal one), gradient updates reinforce that region, making escape nearly impossible. The model becomes a high-scoring test-taker who memorized the answer key but cannot solve a slightly reformulated problem.

The emerging paradigm introduces two key conceptual components: Experience-Guided Exploration (EGE) and Distributional Policy Optimization. EGE mechanisms explicitly model the exploration process. Instead of adding noise or using epsilon-greedy strategies, they employ a learned "exploration policy" that is conditioned on a history of successful state-action-reward trajectories. One promising implementation, exemplified by the EX2 (Experience-Guided Exploration) framework from researchers at Stanford and Google DeepMind, maintains a replay buffer of high-performing episodes. A separate exploration network is trained to predict which actions in a given state are likely to lead to trajectories that resemble these high-performing ones, effectively learning the "shape" of good solutions.

Distributional Policy Optimization reframes the goal. Rather than seeking a single policy that maximizes reward, the objective is to steer the entire policy distribution toward a target distribution that would, in theory, contain the optimal policy. Techniques like Distributional Policy Gradients (DPG) and Aligned Exploration via Divergence Minimization are gaining traction. Here, the KL-divergence between the current policy distribution and a constructed "ideal" distribution is minimized. The ideal distribution is often built iteratively from the model's own best outputs, creating a self-improving loop.

Several open-source repositories are pioneering these methods. The trlX framework by CarperAI, originally built for PPO, is being extended with exploration modules. More specialized is the Open-Exploration GitHub repo, which implements several EGE algorithms for language models, including a novel method called Success-Conditioned Exploration Networks (SCEN). It has garnered over 1.2k stars in recent months as researchers experiment with its plug-and-play exploration layers for transformer-based policies.

Performance benchmarks on tasks requiring novel solution generation show the potential. On the GPQA Diamond benchmark (a rigorous set of graduate-level questions) and the SWE-bench (solving real GitHub issues), models trained with standard RLHF show minimal improvement on unseen problem types after initial fine-tuning. In contrast, early results from EGE-augmented training show consistent, incremental gains as the model accumulates more problem-solving experience.

| Training Paradigm | GPQA Diamond (0-shot) | SWE-bench (Pass@1) | Exploration Efficiency Score* |
|---|---|---|---|
| Base LLM (Llama 3 70B) | 31.2% | 12.4% | 15 |
| + Standard RLHF (PPO) | 35.1% | 18.7% | 22 |
| + EGE-Augmented RL | 38.9% | 24.3% | 68 |
| + Iterative EGE (5 cycles) | 44.7% | 29.8% | 155 |
*Exploration Efficiency Score: A composite metric of unique solution paths found per 1000 training steps.

Data Takeaway: The table reveals a critical insight: Standard RLHF offers diminishing returns on hard, novel tasks. The EGE-augmented approach not only achieves higher absolute scores but demonstrates dramatically improved exploration efficiency. The iterative version shows that the ability to learn from experience compounds, suggesting a path toward continuous learning rather than one-off fine-tuning.

Key Players & Case Studies

The race to implement this new paradigm is led by a mix of established giants and agile research labs. Google DeepMind has been foundational, with its Open-Ended Learning team publishing seminal work on algorithms like Paired Open-Ended Trailblazer (POET) and its successors for environments. They are now applying these principles to language models internally, focusing on creating agents that can master complex games like *Diplomacy* through experiential learning, not just scripted play.

Anthropic's Constitutional AI can be viewed as a precursor, where the model learns from its own critiques. Their recent research hints at moving this self-supervision into the exploration phase, potentially developing a model that explores the space of responses aligned with its constitution more intelligently. Researcher Chris Olah's team is reportedly investigating how experiential learning shapes internal representations in LLMs.

OpenAI is pursuing a parallel path, heavily invested in Reinforcement Learning from Human Feedback (RLHF) but acutely aware of its exploration limits. Their work on process supervision—rewarding each step of a reasoning chain—is a step toward experience-rich training. Speculation suggests their next-generation o1 models incorporate advanced search and exploration algorithms that learn from vast corpora of reasoning traces.

A notable startup in this space is Adept AI, which is building agents that take actions in digital environments. Their core technology, ACT-1, relies on an RL framework that must explore vast action spaces. They have developed proprietary "trial-and-error learning" systems where the agent's unsuccessful attempts are as valuable as successes for shaping future exploration policies.

On the open-source front, Together AI and the CarperAI collective are instrumental. CarperAI's trlX library is the go-codebase for experimenting with RL on LLMs, and their recent focus has been on integrating exploration techniques from the Open-Exploration repo. Their collaborative work on FineWeb and reasoning datasets is creating the raw material—high-quality experience trajectories—necessary for this paradigm to flourish.

| Entity | Primary Approach | Key Product/Project | Public Stance on Exploration |
|---|---|---|---|
| Google DeepMind | Open-Ended Algorithms | POET, Gemini Advanced Agents | "Exploration is the key to generality." |
| Anthropic | Self-Supervised Alignment | Claude 3.5 Sonnet, Constitutional AI | Moving from outcome to process-based learning. |
| OpenAI | Scale & Process Reward | o1 models, GPT-4 Turbo | Betting on search over pure generation. |
| Adept AI | Action-Space RL | ACT-1, Fuyu | Building experience replay into the agent core. |
| CarperAI/Together AI | Open-Source Tooling | trlX, Open-Exploration repo | Democratizing advanced RL techniques. |

Data Takeaway: The competitive landscape shows a strategic bifurcation. Large labs (DeepMind, OpenAI) are building closed, scaled systems. Anthropic is focusing on alignment-driven exploration. Startups like Adept are applying it to concrete agent problems, while the open-source community provides the essential experimental toolkit. Success will depend on who can most effectively turn abstract exploration theory into stable, scalable training runs.

Industry Impact & Market Dynamics

The shift from reward optimization to experience-guided learning will reshape the AI product landscape. The immediate impact will be felt in the autonomous agent market. Current agents (like those from Cognition Labs with Devin) are impressive but often brittle, relying on hard-coded workflows or extensive human-in-the-loop oversight. An agent whose exploration policy is refined by experience will be able to tackle longer, more ambiguous tasks—imagine a coding agent that can genuinely debug a novel error by systematically testing hypotheses it learned were effective in past debugging sessions.

The creative industries represent another frontier. Generative AI for art, writing, and music is currently a form of advanced interpolation. Experience-guided models could become true creative collaborators, exploring compositional spaces in directions informed by the aesthetic "success" of prior explorations, potentially leading to novel artistic movements rather than pastiche.

In scientific AI and drug discovery, companies like Isomorphic Labs (DeepMind's spin-off) and Recursion Pharmaceuticals rely on AI to explore molecular spaces. Current systems use generative models guided by simple reward functions (binding affinity, solubility). An experience-aware system could learn higher-level "strategies" for molecular exploration that lead to more diverse and promising candidate pipelines.

The market financials are beginning to reflect this priority. Venture funding for AI startups emphasizing "agentic" or "continuous learning" capabilities has increased sharply.

| Sector | 2023 Funding (Avg. Round) | 2024 Funding (Avg. Round) | Growth | Key Example Startups |
|---|---|---|---|---|
| Foundational LLMs | $125M | $110M | -12% | Characterized by consolidation. |
| AI Agent Platforms | $45M | $85M | +89% | Cognition Labs, MultiOn, Adept. |
| Scientific Discovery AI | $60M | $95M | +58% | Isomorphic Labs, Inceptive. |
| Open-Source AI Tooling | $20M | $35M | +75% | Together AI, Hugging Face (agent focus). |

Data Takeaway: Investment is rapidly flowing away from pure foundational model development and toward application layers where advanced reasoning and exploration are differentiators—specifically agents and science. The high growth in open-source tooling funding indicates a booming ecosystem of developers and researchers building the components for this new paradigm, suggesting widespread belief in its inevitability.

Risks, Limitations & Open Questions

This paradigm is not a panacea and introduces significant new risks and challenges. Technical instability is paramount. Designing stable learning loops where exploration policy and core policy co-evolve is notoriously difficult, risking training divergence or collapse into degenerate strategies. The computational cost is staggering: maintaining and querying experience buffers, training auxiliary exploration networks, and running iterative cycles could multiply training costs by an order of magnitude, potentially centralizing advancement further within well-capitalized labs.

Alignment and control become more complex. A model that learns its own exploration strategies from experience may develop unforeseen and potentially undesirable "styles" of problem-solving that are difficult to scrutinize or correct. It could learn to exploit simulator weaknesses in training environments in sophisticated ways, leading to even sharper reward hacking.

Key open questions remain:
1. Transferability of Experience: Will exploration strategies learned in one domain (e.g., coding) transfer to another (e.g., legal analysis), or will we need domain-specific experience buffers?
2. Catastrophic Forgetting: As the model accumulates new experiences, how do we prevent it from overwriting core knowledge or previously successful strategies?
3. Quantifying Exploration: We lack robust, standardized metrics for "quality of exploration" in language and reasoning tasks, making progress hard to measure.
4. The Ideal Distribution Problem: The entire framework hinges on approximating an ideal answer distribution. If this approximation is biased or flawed, it will systematically misdirect all exploration, potentially with amplified consequences compared to standard RLHF.

AINews Verdict & Predictions

The move toward experience-guided exploration in RL for LLMs is not merely an algorithmic tweak; it is a necessary evolution if we aim to build AI that genuinely reasons and discovers. The current reward-maximization paradigm has plateaued, producing competent but unimaginative systems. This new approach re-embraces the original promise of reinforcement learning: learning from interaction.

Our predictions are as follows:
1. Within 12-18 months, the leading closed-source LLMs (from OpenAI, Anthropic, Google) will all incorporate some form of learned exploration policy, marketed as a breakthrough in "reasoning" or "research" capabilities. Their performance on benchmarks requiring multi-step novelty will pull decisively ahead of open-source models that lack the compute for such expensive training.
2. The open-source community will respond with efficient, distilled versions of these techniques. We predict a flagship model like Llama 4 will be released with a companion "exploration adapter"—a relatively small network that can be fine-tuned on a user's specific problem domain to guide the base model's reasoning, making advanced exploration accessible.
3. The first "killer app" enabled by this technology will be in enterprise problem-solving agents. Imagine a customer service agent that doesn't just retrieve answers but explores a knowledge base and past ticket resolutions to synthesize a novel solution for a complex issue, learning from each interaction. Companies like Intercom or Zendesk will either build or license this capability.
4. A significant safety incident is likely within 2-3 years, stemming from an experience-guided agent exploring an unforeseen and harmful solution path in a high-stakes environment (e.g., financial trading, network security). This will trigger a focused research subfield on "safe exploration" for LLMs.

The verdict is clear: the AI field is recognizing that intelligence requires not just knowledge, but the learned skill of how to seek new knowledge. The labs and companies that master the art of teaching AI to learn from its own experiences will define the next generation of machine cognition. Watch for research papers with "experience replay," "success-conditioned," and "exploration policy" in the titles—they are the blueprints for the future.

Further Reading

PAR²-RAG框架透過動態規劃,解決AI的多步驟推理危機名為PAR²-RAG的新框架,正在解決AI最棘手的挑戰之一:跨文件的可靠多步驟推理。它結合主動規劃與即時檢索,使系統能動態調整搜尋策略,有效防止當前方法中常見的錯誤累積問題。知行之距:為何大型語言模型能辨識錯誤卻仍會犯錯現代AI核心正浮現一個關鍵缺陷:大型語言模型經常能察覺問題的邏輯謬誤或前提缺失,卻仍會生成自信滿滿的錯誤答案。這種『知行之距』代表了一種根本性的架構限制,威脅著AI系統的可靠性。CRAFT框架開創AI安全新局,對齊隱藏神經層中的推理過程一項新穎的AI安全框架正將典範從修補有害輸出,轉向保障內在的推理過程本身。CRAFT技術利用隱藏神經表徵與強化學習,引導模型走向安全的思維鏈。這標誌著AI安全領域的根本性進步。InfoDensity:全新AI訓練方法獎勵密集推理,削減計算膨脹一項新的研究突破解決了先進AI中普遍存在的低效問題:冗長且重複的推理過程。提出的InfoDensity方法將訓練範式從僅僅縮短最終答案,轉變為積極獎勵密集且高品質的中間推理步驟。這種方法有望顯著提升效率。

常见问题

这次模型发布“Experience as Teacher: How New RL Paradigms Are Teaching AI to Think Through Exploration”的核心内容是什么?

The reinforcement learning (RL) framework that has powered the most capable large language models is undergoing a critical re-evaluation. The prevailing methodology—fine-tuning mod…

从“experience guided exploration vs reinforcement learning from human feedback”看,这个模型发布为什么重要?

The core technical challenge lies in moving beyond Proximal Policy Optimization (PPO) and similar algorithms that dominate current RL fine-tuning. These methods excel at policy improvement within a local neighborhood but…

围绕“open source github repos for llm exploration algorithms”,这次模型更新对开发者和企业有什么影响?

开发者通常会重点关注能力提升、API 兼容性、成本变化和新场景机会,企业则会更关心可替代性、接入门槛和商业化落地空间。