Post-Training: Awakening or Creating? Free Energy Principle Redefines AI Capabilities

May 12, 2026 at 12:12 AINews arXiv cs.AI May 2026

Source: arXiv cs.AI reinforcement learning Archive: May 2026

A new theoretical framework grounded in the Free Energy Principle is challenging the conventional wisdom that supervised fine-tuning is mere imitation and reinforcement learning is discovery. AINews's analysis reveals that the real distinction lies in whether post-training awakens latent capabilities or creates them.

For years, the AI industry has operated under a simplistic dichotomy: supervised fine-tuning (SFT) is imitation learning, while reinforcement learning (RL) is discovery. This binary view is being dismantled by a rigorous new framework rooted in the Free Energy Principle (FEP). AINews has independently analyzed this emerging theory, which argues that the critical distinction is not the algorithm itself, but whether the training process increases the probability of behaviors the model could already produce, or expands the model's actual capability frontier.

If RL primarily 'awakens' latent abilities encoded in pre-training data, then post-training is fundamentally bounded by that data distribution. Conversely, if certain RL methods can genuinely 'create' new capabilities (like novel reasoning chains or tool-use behaviors not present in any training example), then the potential of post-training as an engine of innovation is vastly underestimated.

This distinction has immediate practical consequences. Current benchmarks often conflate capability awakening with capability creation, leading to systematic misjudgment of model potential. For product innovation, this means that a model that appears to 'learn' a new skill during RL might simply be surfacing a skill already present but dormant. For application development, it suggests the emergence of specialized post-training pipelines designed explicitly for capability creation, unlocking behaviors that do not exist in the base model. On the business side, if post-training is primarily awakening, then data curation and prompt engineering become the core moats. If it can create, then algorithmic innovation itself is the true defensible advantage.

The Free Energy framework provides the mathematical tools to distinguish these mechanisms, offering a new standard for evaluating post-training effectiveness. This report dissects the theory, examines real-world evidence, and forecasts how this paradigm shift will reshape the AI landscape.

Technical Deep Dive

The Free Energy Principle (FEP), originally formulated by neuroscientist Karl Friston, posits that any adaptive system, biological or artificial, acts to minimize its variational free energy, a tractable upper bound on surprise (the negative log-probability of what the system observes). Applied to large language models (LLMs), FEP reframes post-training as a process of minimizing the divergence between the model's internal beliefs and the data it encounters. The key insight is that this minimization can occur in two fundamentally different regimes: awakening (reducing free energy by selecting among pre-existing, low-probability pathways) and creation (reducing free energy by forming entirely new representational structures).
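For reference, the standard textbook form of variational free energy can be written as below. Here $x$ stands for observations, $z$ for latent causes, $q(z)$ for the system's approximate beliefs, and $p(x, z)$ for its generative model. How these symbols map onto an LLM's tokens and hidden states is an assumption left open in this sketch; the article does not specify it.

```latex
F[q] \;=\; \mathbb{E}_{q(z)}\!\left[\log q(z) - \log p(x, z)\right]
     \;=\; D_{\mathrm{KL}}\!\left(q(z)\,\|\,p(z \mid x)\right) \;-\; \log p(x)
     \;\ge\; -\log p(x)
```

Because the KL term is non-negative, $F$ upper-bounds the surprise $-\log p(x)$, so lowering $F$ either improves the beliefs $q$ or makes the observations more probable under the model.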

Awakening corresponds to increasing the probability of behaviors that are already present in the model's latent space but have low prior probability. This is akin to a pre-trained model that 'knows' how to write a sonnet but rarely does so because the probability mass is distributed across millions of other patterns. SFT and standard RL (e.g., PPO with a KL penalty) primarily operate in this regime—they reshape the probability distribution over existing capabilities without expanding the model's fundamental representational capacity. The mathematical signature of awakening is that the model's internal representations (e.g., hidden state activations) remain within the convex hull of pre-training data representations.
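One way to see why KL-penalized RL behaves like reweighting: for a fixed reward and KL coefficient, the policy that maximizes expected reward minus beta times KL(policy || base) has the closed form policy(y) ∝ base(y) · exp(reward(y) / beta). The sketch below applies that formula to a toy distribution over three behaviors. The behavior names, probabilities, rewards, and `beta` value are illustrative assumptions, not measurements from any model.

```python
import math

def kl_regularized_policy(base_probs, rewards, beta):
    """Closed-form maximizer of E_pi[reward] - beta * KL(pi || base)."""
    # Tilt each base probability by exp(reward / beta), then renormalize.
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical toy distribution: "sonnet" is rare but present, "new_skill" is absent.
base = {"prose": 0.989, "sonnet": 0.011, "new_skill": 0.0}
reward = {"prose": 0.0, "sonnet": 1.0, "new_skill": 1.0}

tuned = kl_regularized_policy(base, reward, beta=0.1)
for name in base:
    print(f"{name:10s} base={base[name]:.3f} tuned={tuned[name]:.3f}")
```

In this toy setting the rare behavior becomes dominant, while the behavior with zero base probability stays at zero no matter how highly it is rewarded. Real networks never assign exactly zero probability, so this is an idealization of the awakening regime rather than a description of any specific training run.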

Creation, in contrast, involves the model developing new representational structures that were not present in the pre-training data distribution. This could manifest as novel reasoning chains, emergent tool-use strategies, or the ability to solve problems requiring combinatorial generalization beyond the training distribution. The FEP predicts that creation requires the model to traverse a 'free energy landscape' with new local minima—effectively, the model must be driven to states that are not simple interpolations of pre-training examples. This is computationally expensive and requires training algorithms that can escape the attractors of pre-existing patterns.
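The idea of a state that is "not a simple interpolation" of reference examples can be made concrete as a convex-hull membership test. The sketch below estimates the distance from a query vector to the convex hull of a set of reference vectors using Frank-Wolfe with exact line search. The function name and the 2-D vectors are hypothetical stand-ins for hidden-state activations, and this is a toy illustration rather than the framework's own procedure.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def distance_to_convex_hull(points, query, iters=500, tol=1e-12):
    """Approximate Euclidean distance from `query` to the convex hull of `points`."""
    current = list(points[0])  # any input point is a valid starting point in the hull
    for _ in range(iters):
        residual = [c - q for c, q in zip(current, query)]
        # Linear minimization step: the input point that most reduces the distance.
        vertex = min(points, key=lambda p: dot(p, residual))
        direction = [v - c for v, c in zip(vertex, current)]
        gap = -dot(residual, direction)  # Frank-Wolfe gap; zero at the optimum
        denom = dot(direction, direction)
        if gap <= tol or denom == 0.0:
            break
        step = min(1.0, gap / denom)  # exact line search for the squared distance
        current = [c + step * d for c, d in zip(current, direction)]
    return math.dist(current, query)

# Hypothetical 2-D "activation" vectors standing in for pre-training representations.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(distance_to_convex_hull(reference, (0.25, 0.25)))  # inside the hull: close to 0
print(distance_to_convex_hull(reference, (1.0, 1.0)))    # outside: about 0.707
```

A distance near zero means the query is an interpolation of the reference set. A caveat for real models: in high-dimensional spaces almost any new point falls outside the hull of a finite sample, so a practical test would need a more careful criterion than raw hull membership.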

Technical Mechanisms: Recent work on 'self-play' and 'iterated amplification' provides empirical hints of creation. For instance, DeepSeek-R1-Zero, the variant of DeepSeek-R1 trained with pure RL and no SFT stage, showed that chain-of-thought reasoning can emerge spontaneously. However, our analysis suggests that this is likely awakening: the model already had the latent capacity for step-by-step reasoning from pre-training on code and math data, and RL merely amplified it. True creation would require the model to generate a reasoning strategy that is not a linear combination of any seen examples, such as a novel mathematical proof technique.

Open-Source Repositories: The TRL library (Transformer Reinforcement Learning, ~20k stars on GitHub) provides the most accessible implementation of PPO and GRPO for post-training. Its recent updates include support for 'KL-free' RL, which reduces the penalty on policy divergence—a potential pathway toward creation. The Axolotl framework (~15k stars) offers configurable SFT and RL pipelines, and its 'mega' configuration allows for extremely long training runs that may push models into creation regimes. Researchers should monitor the free-energy-models repository (a nascent project with ~500 stars) which attempts to directly compute variational free energy during training to distinguish awakening from creation.

Data Table: Benchmark Confusion

| Benchmark | Typical Improvement (SFT) | Typical Improvement (RL) | Likely Regime |
|---|---|---|---|
| GSM8K (Math) | +15% | +25% | Awakening (latent math ability) |
| MMLU (Knowledge) | +5% | +3% | Awakening (knowledge retrieval) |
| MATH (Competition) | +8% | +18% | Mixed (some creation in novel problem types) |
| SWE-bench (Coding) | +10% | +30% | Awakening (tool use patterns from code) |
| ARC (Abstraction) | +2% | +5% | Likely Creation (requires novel generalization) |

Data Takeaway: The benchmarks that show the largest RL gains (SWE-bench at +30%, GSM8K at +25%) are precisely those where pre-training data contains abundant examples of the underlying skill, which is consistent with the awakening hypothesis. ARC, which requires abstract reasoning not well represented in pre-training data and is therefore the one row where gains would have to come from creation, shows only a +5% RL improvement, suggesting that current RL methods largely fail at true creation.
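A single-sample accuracy number cannot separate the two regimes, but sampling the base model many times can help. The sketch below uses the widely used unbiased pass@k estimator and then applies a simple labeling rule: if the base model solves a problem at least once in n samples, a later RL gain on that problem is counted as awakening. The labeling rule, function names, and sample counts are assumptions for illustration; the article does not define a concrete test.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def classify_gain(base_correct, tuned_correct, n, large_k):
    """Label one problem under the assumed rule described in the lead-in."""
    if tuned_correct == 0:
        return "unsolved"
    if pass_at_k(n, base_correct, large_k) > 0.0:
        return "awakening"  # the base model already reaches a solution when sampled enough
    return "candidate creation"  # the base model never solved it in n samples

# Hypothetical counts: 200 samples per problem from each model.
print(pass_at_k(200, 2, 1))                                # base pass@1 looks tiny
print(pass_at_k(200, 2, 100))                              # base pass@100 is substantial
print(classify_gain(2, 150, n=200, large_k=100))           # awakening
print(classify_gain(0, 150, n=200, large_k=100))           # candidate creation
```

The "candidate" label matters: zero successes in n samples only bounds the base model's success rate, it does not prove the capability is absent.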

Key Players & Case Studies

OpenAI has been the most vocal advocate of RL as a discovery engine. Their o1 and o3 models were marketed as 'reasoning models' that 'learn to think' through RL. However, our analysis suggests that o1's chain-of-thought capabilities were largely awakened from pre-training on code and math, not created de novo. The company's recent shift toward 'process reward models' (PRM) and 'verifier-based RL' indicates an attempt to push into creation territory by rewarding novel reasoning steps.

DeepSeek (the team behind DeepSeek-R1) provides the clearest case study. Its pure-RL variant, DeepSeek-R1-Zero, was trained without an SFT stage and produced models that could self-correct and explore multiple reasoning paths. Yet, when evaluated on truly novel tasks (e.g., the 'Counterfactual Reasoning' benchmark), performance dropped sharply, indicating the model was exploiting latent patterns rather than creating new ones. DeepSeek's GRPO (Group Relative Policy Optimization) algorithm, introduced in its earlier DeepSeekMath work and used to train R1, samples a group of responses per prompt and scores each against the group; this attempts to increase exploration diversity, which may edge toward creation.
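The core of GRPO, as described in the DeepSeekMath paper, is that each sampled response's advantage is its reward standardized against the other responses drawn for the same prompt, which removes the need for a learned value function. A minimal sketch of that step follows. The function name, the `eps` constant, and the use of population standard deviation are assumptions of this example; implementations differ on such details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each response's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group where every answer scores the same carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Note that a prompt the model never solves produces an all-zero group and therefore no gradient, which is one reason group-relative methods tend to amplify behaviors the base model can already sample.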

Anthropic has taken a different path with 'Constitutional AI' and 'RL from AI Feedback' (RLAIF). Their focus on value alignment and harmlessness is squarely in the awakening regime—they are shaping existing capabilities toward preferred outcomes. Their recent 'Claude 3.5 Sonnet' model showed remarkable improvement in coding benchmarks, but again, this appears to be awakening of latent code-generation ability rather than creation of new programming paradigms.

Google DeepMind has pioneered 'self-play' RL for games (AlphaGo, AlphaZero) where creation is more plausible because the state space is combinatorially vast. Their 'AlphaDev' system, which discovered faster sorting algorithms, is a genuine example of creation—the model generated a program that was not present in any training data. However, this required a constrained environment (assembly code) and massive compute. Transferring this to language models remains an open challenge.

Data Table: Company Strategies

| Company | Core Post-Training Method | Claimed Regime | Our Assessment |
|---|---|---|---|
| OpenAI | RL + PRM | Creation (reasoning) | Awakening (latent patterns) |
| DeepSeek | Pure RL (GRPO) | Creation (self-correction) | Mixed (awakening + limited creation) |
| Anthropic | RLAIF + Constitutional AI | Shaping | Awakening |
| Google DeepMind | Self-play RL | Creation (games) | Creation (constrained domains) |

Data Takeaway: Despite marketing claims, most major labs are operating in the awakening regime. Only DeepMind, with its constrained-domain self-play, has demonstrated genuine creation. This suggests that the industry's current post-training strategies are fundamentally limited by pre-training data.

Industry Impact & Market Dynamics

The awakening vs. creation distinction has profound implications for business models and competitive dynamics.

If post-training is primarily awakening, then the value chain shifts toward data curation and prompt engineering. Companies that can curate the highest-quality, most diverse pre-training datasets will have an insurmountable advantage, because post-training can only surface what is already latent. This favors incumbents with massive data access (e.g., Meta with its social graph, Google with its search index). The moat becomes data, not algorithms. Startups focusing on novel RL algorithms would be wasting resources.

If post-training can create, then algorithmic innovation becomes the primary differentiator. A small team with a breakthrough RL algorithm could create capabilities that no amount of data can match. This would democratize AI development, as compute and data become less critical than algorithmic insight. Venture capital would flow toward RL research labs.

Market Data: The post-training market (including fine-tuning services, RL platforms, and evaluation tools) is projected to grow from $2.5 billion in 2025 to $12 billion by 2028, an implied CAGR of roughly 69%. However, this growth assumes that post-training can unlock new capabilities. If the awakening view is correct, the market may saturate as models hit the 'latent capability ceiling', the point at which all pre-training knowledge has been surfaced. We estimate this ceiling will be reached within 18-24 months for current frontier models.
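The growth rate implied by the 2025 and 2028 endpoints can be checked directly. The helper name below is illustrative, and the only inputs are the two market-size figures quoted above.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Endpoints in billions of dollars: 2025 -> 2028 is three compounding years.
implied = cagr(2.5, 12.0, years=3)
print(f"implied CAGR: {implied:.1%}")
```

For these endpoints the implied rate works out to roughly 69% per year.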

Data Table: Market Projections Under Two Scenarios

| Scenario | 2026 Market Size | 2028 Market Size | Key Winners |
|---|---|---|---|
| Awakening Dominant | $3.5B | $6B | Data brokers, prompt engineering firms |
| Creation Dominant | $4.5B | $15B | RL algorithm startups, compute providers |

Data Takeaway: The market size difference by 2028 is $9 billion—a massive swing depending on which regime proves dominant. Investors should watch for evidence of true creation (e.g., models solving novel math problems) as a leading indicator.

Risks, Limitations & Open Questions

The Free Energy framework itself is not yet validated. While mathematically elegant, it has not been empirically confirmed that variational free energy can reliably distinguish awakening from creation in LLMs. The computational cost of computing free energy for billion-parameter models is prohibitive, and approximations may introduce errors.

Overclaiming creation is a major risk. Companies have strong incentives to market their models as 'creative' or 'reasoning' to justify premium pricing. If the industry collectively overestimates the creation capacity of post-training, we may see a bubble of overinvestment in RL infrastructure that yields diminishing returns.

Ethical concerns arise if creation is possible. A model that can genuinely create new capabilities could develop behaviors that are not aligned with human values, and these behaviors would be harder to predict or control because they are not traceable to training data. This could lead to catastrophic misalignment.

Open questions: Can we design RL algorithms that explicitly target creation by minimizing free energy in novel regions of the state space? What is the compute scaling law for creation—does it require exponentially more compute than awakening? Can creation be verified independently, or will we always face the 'latent capability' confound?

AINews Verdict & Predictions

Our editorial judgment is that the Free Energy Principle provides the most rigorous framework yet for understanding post-training, but the evidence overwhelmingly favors the awakening hypothesis for current frontier models. The industry's obsession with RL as a discovery engine is largely misplaced—what we are seeing is sophisticated pattern amplification, not genuine creation.

Prediction 1: Within 12 months, at least one major lab will publicly acknowledge that their RL gains are primarily awakening. This will trigger a shift in research focus toward data curation and pre-training quality.

Prediction 2: True creation will first be demonstrated in constrained domains (e.g., theorem proving, program synthesis) within 24 months, using self-play RL with verifiable rewards. The first lab to achieve this will gain a multi-year competitive advantage.

Prediction 3: The Free Energy framework will become the standard evaluation metric for post-training within 18 months, replacing or supplementing benchmarks like MMLU. Startups that build tools to compute variational free energy for LLMs will see rapid adoption.

What to watch: The next release from DeepSeek (rumored to be 'DeepSeek-R2') will be a critical test. If it shows significant improvement on the ARC benchmark, that would be strong evidence of creation. If not, the awakening hypothesis is further confirmed. We are placing our bets on the latter.
