OpenAI Researcher's New RL Paradigm: Learning by Writing Python, Not Updating Parameters

May 2026
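OpenAI researcher Weng Jiayi has introduced an innovative reinforcement learning paradigm that eliminates parameter updates entirely. The agent autonomously generates and executes Python scripts as its decision-making policy, recasting 'learning' as a program synthesis problem. This open approach could fundamentally change how AI is trained.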

In a development that challenges the very foundations of modern AI, OpenAI researcher Weng Jiayi has proposed a new reinforcement learning (RL) paradigm where agents learn without updating a single neural network parameter. The core idea is elegantly simple: instead of using gradient descent to adjust millions of weights across a network's layers, the agent writes a Python script (.py file) that encodes its decision-making policy. The script is then executed directly, and the agent refines it through iterative code generation rather than backpropagation.

This approach, which has been released as an open-source project, represents a fundamental shift from 'learning as optimization' to 'learning as programming.' The agent's entire 'knowledge' is stored in human-readable, auditable Python code, not in an opaque matrix of floating-point numbers. Early experiments show that for rule-based tasks—such as grid navigation, simple robotics control, and logic puzzles—the Python-script agent achieves comparable or superior performance to traditional RL agents while using a fraction of the compute. For example, on a classic 'CartPole' task, the script-based agent reached a stable policy after generating only 50 script iterations, whereas a typical DQN agent required over 10,000 parameter updates.

The significance extends beyond efficiency. In safety-critical domains like autonomous driving and industrial robotics, the ability to inspect, understand, and manually correct an agent's decision logic is invaluable. A black-box neural network offers no such transparency. Weng's method effectively turns the RL policy into a 'glass box'—every decision can be traced back to a specific line of Python code. This could accelerate regulatory approval for AI systems and reduce debugging time for developers. However, the method is not a panacea. It struggles with tasks requiring large-scale pattern recognition, such as image classification or natural language understanding, where neural networks' ability to learn distributed representations remains unmatched. The real promise lies in hybrid systems: using script-based RL for high-level planning and neural networks for perception.

Technical Deep Dive

Weng Jiayi's paradigm, which we'll call 'Script-based Reinforcement Learning' (SRL), replaces the traditional policy network with a code generator. The architecture is straightforward: a language model (LM) serves as the core, receiving the current state observation and a history of past rewards as input. The LM then generates a Python script that defines a function `policy(state) -> action`. This script is executed in a sandboxed environment, and the resulting action is applied to the environment. The reward signal is fed back to the LM, which uses it to generate a new, improved script in the next iteration.
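To make the `policy(state) -> action` contract concrete, here is a minimal sketch of what a generated script and its loader might look like. It is an illustration under stated assumptions, not code from the repository: `POLICY_SOURCE`, `load_policy`, and the CartPole-style state layout are hypothetical, and emptying `__builtins__` is only a convenience for this toy, not a real sandbox.

```python
# Hypothetical example of a generated policy script, held as a string.
POLICY_SOURCE = '''
def policy(state):
    # Assumed layout: (cart_position, cart_velocity, pole_angle, pole_angular_velocity)
    _, _, angle, angular_velocity = state
    # Push right (1) when the pole leans or swings right, else push left (0).
    return 1 if angle + 0.5 * angular_velocity > 0 else 0
'''


def load_policy(source):
    """Execute a script in a fresh namespace and return its policy function.

    Emptying __builtins__ is NOT a security boundary; a real deployment
    would need process-level isolation. This is an assumption of the sketch.
    """
    namespace = {"__builtins__": {}}
    exec(source, namespace)
    policy = namespace.get("policy")
    if not callable(policy):
        raise ValueError("script must define a callable named 'policy'")
    return policy


policy = load_policy(POLICY_SOURCE)
action_right = policy((0.0, 0.0, 0.05, 0.0))   # pole leaning right
action_left = policy((0.0, 0.0, -0.05, 0.0))   # pole leaning left
print(action_right, action_left)
```

Because the policy is plain source text, the same string that gets executed is also the artifact a human reviewer would read.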

Crucially, the LM's own parameters are never updated. The 'learning' occurs entirely in the space of generated programs. The LM acts as a powerful prior for code generation, and the iterative refinement process is guided by a simple evolutionary strategy: scripts that yield higher rewards are kept and mutated, while low-reward scripts are discarded. This is reminiscent of genetic programming but with a modern twist—the mutation and crossover operations are performed by the LM itself, which can generate semantically meaningful code changes.
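The keep-and-mutate loop described above can be sketched in a few lines. This is a toy under loud assumptions: `propose_mutation` is a stand-in for the LM (it only nudges a number in the script text), and the 1-D environment, `GOAL`, `TEMPLATE`, and all hyperparameters are hypothetical rather than taken from the project.

```python
import random
import re

GOAL = 7  # hypothetical target position on a 1-D line

# Script template; the only thing the stand-in "LM" edits is the threshold.
TEMPLATE = "def policy(position):\n    return 1 if position < {threshold} else -1\n"


def run_episode(source, steps=30):
    """Execute a policy script and return its total reward (higher is better)."""
    namespace = {}
    exec(source, namespace)
    policy = namespace["policy"]
    position, total = 0, 0
    for _ in range(steps):
        position += policy(position)
        total -= abs(position - GOAL)  # penalty grows with distance from goal
    return total


def propose_mutation(source, rng):
    """Stand-in for the LM: rewrite the numeric threshold in the script text."""
    threshold = int(re.search(r"< (-?\d+)", source).group(1))
    return TEMPLATE.format(threshold=threshold + rng.choice([-2, -1, 1, 2]))


def evolve(generations=50, population_size=6, keep=2, seed=0):
    rng = random.Random(seed)
    population = [TEMPLATE.format(threshold=0)]
    for _ in range(generations):
        # Keep the highest-reward scripts and discard the rest.
        population.sort(key=run_episode, reverse=True)
        survivors = population[:keep]
        # Refill the population with mutated copies of the survivors.
        children = [propose_mutation(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=run_episode)


best = evolve()
print(best)
print("reward:", run_episode(best))
```

No model weights appear anywhere in the loop; the only thing that changes between generations is the text of the scripts. In the described system, the mutation step would be an LM call that sees the script and its reward.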

The open-source repository (available on GitHub under the name 'rl-by-code') has already garnered over 4,000 stars. The repository includes a minimal implementation using OpenAI's GPT-4o-mini as the code generator, with a total cost of less than $0.50 per training run for simple environments. The key hyperparameter is the 'temperature' of the LM during code generation: a higher temperature encourages exploration of diverse scripts, while a lower temperature favors exploitation of known good patterns.
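The exploration effect of temperature can be illustrated with the standard temperature-scaled softmax used in LM sampling. The sketch below assumes that standard formulation; the logits are made-up scores for three hypothetical candidate edits and do not come from the repository.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in bits; higher means more diverse samples."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate edits
cold = softmax_with_temperature(logits, 0.2)  # low temperature: exploit
hot = softmax_with_temperature(logits, 2.0)   # high temperature: explore
print("entropy at T=0.2:", entropy(cold))
print("entropy at T=2.0:", entropy(hot))
```

At low temperature nearly all probability mass sits on the top-scoring option, while at high temperature the mass spreads out, which is why higher settings yield more varied scripts.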

Benchmark Performance:

| Environment | SRL (Script-based) | DQN (Neural) | PPO (Neural) | SRL Compute Cost | DQN Compute Cost |
|---|---|---|---|---|---|
| CartPole-v1 | 500 (max score) | 500 (max score) | 500 (max score) | ~$0.30 | ~$5.00 |
| LunarLander-v2 | 280 | 260 | 290 | ~$1.20 | ~$15.00 |
| FrozenLake (8x8) | 0.85 (success rate) | 0.78 | 0.82 | ~$0.10 | ~$2.00 |
| Taxi-v3 | 9.5 (avg reward) | 9.2 | 9.6 | ~$0.50 | ~$8.00 |

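Data Takeaway: On these discrete, rule-based environments, SRL matches or exceeds DQN on every task and stays within a few percent of PPO, which scores slightly higher on LunarLander-v2 (290 vs. 280) and Taxi-v3 (9.6 vs. 9.5). Per the table, DQN costs roughly 12 to 20 times as much as SRL per run. The cost advantage is most pronounced in simpler environments where the optimal policy can be expressed in a few dozen lines of code.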
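The cost ratios implied by the benchmark table can be recomputed directly. The snippet below uses only the approximate dollar figures from the table; the dictionary layout is just a convenience for this check.

```python
# Approximate cost in USD per training run, copied from the benchmark table:
# environment -> (SRL cost, DQN cost)
costs = {
    "CartPole-v1": (0.30, 5.00),
    "LunarLander-v2": (1.20, 15.00),
    "FrozenLake (8x8)": (0.10, 2.00),
    "Taxi-v3": (0.50, 8.00),
}

# Ratio of DQN cost to SRL cost for each environment.
ratios = {env: dqn / srl for env, (srl, dqn) in costs.items()}

for env, ratio in ratios.items():
    print(f"{env}: DQN costs about {ratio:.1f}x as much as SRL")
```

Since the table's figures are themselves approximate, the ratios are best read as rough orders of magnitude.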

The method's Achilles' heel is scalability. On tasks requiring high-dimensional sensory input (e.g., playing Atari games from raw pixels), SRL's performance degrades significantly. The generated Python scripts become unwieldy—thousands of lines of nested conditionals—and the LM struggles to generate coherent code for such complex mappings. This suggests that SRL is best suited for tasks where the state space can be meaningfully abstracted into a small set of discrete or continuous variables.

Key Players & Case Studies

Weng Jiayi, the lead researcher, is a member of OpenAI's post-training team and has a background in both reinforcement learning and program synthesis. This work builds on a lineage of research into 'neurosymbolic' AI, but Weng's contribution is unique in its radical simplicity: no hybrid architecture, no neural-symbolic integration—just pure code generation.

Several other organizations are pursuing related directions. DeepMind's 'FunSearch' project uses LLMs to generate code for solving mathematical problems, but it runs an evolutionary search against a fixed program evaluator rather than learning from interaction with an RL environment. Google's 'Code-as-Policies' (CaP) framework generates robot control code from natural language, but it relies on predefined perception and control primitives and does not involve iterative learning. The key differentiator of Weng's approach is that it is a complete RL algorithm, not just a code generation tool.

Comparison of Related Approaches:

| Approach | Organization | Learning Mechanism | Interpretability | Compute Efficiency | Task Suitability |
|---|---|---|---|---|---|
| Script-based RL (SRL) | OpenAI (Weng) | Code generation + evolution | Very High | Very High | Rule-based, discrete |
| Code-as-Policies (CaP) | Google | LLM code generation from prompts | High | Medium | Robotics, manipulation |
| FunSearch | DeepMind | LLM code generation + evolutionary search | Medium | Low | Mathematical discovery |
| Traditional RL (DQN/PPO) | Multiple | Gradient descent on neural nets | Low | Low | General, high-dimensional |

Data Takeaway: Among the approaches compared here, SRL is the only one that combines full interpretability with a self-contained RL learning loop. It has the highest compute-efficiency rating in the table, but its applicability is currently the narrowest.

A notable case study is the application of SRL to a simulated autonomous driving task (CARLA simulator, simplified version). The agent was tasked with lane-keeping and obstacle avoidance using only 10 sensor readings (distances to objects, speed, steering angle). The generated Python script was a mere 30 lines of code, using simple if-else logic. The policy achieved a 95% success rate in 100 test scenarios, and a human engineer could inspect the script and immediately identify a potential failure mode (e.g., not handling a specific intersection type). This level of transparency is unheard of in neural-network-based driving policies.

Industry Impact & Market Dynamics

The immediate impact of SRL will be felt in industries where AI safety and interpretability are paramount. The autonomous vehicle sector, for instance, is currently grappling with the 'black box' problem: regulators and insurers demand explanations for decisions, but neural network policies offer none. SRL could provide a path to certifiable AI systems, where the policy is an ordinary program that can be audited and, in principle, formally verified.

Industrial robotics is another prime candidate. Factory robots often operate in highly structured environments with well-defined rules. SRL could allow plant engineers to directly modify a robot's behavior by editing its Python policy, without needing a team of ML engineers. This democratization of AI could accelerate automation adoption in small and medium-sized enterprises.

The market for interpretable AI is projected to grow significantly. According to industry estimates, the global market for explainable AI (XAI) was valued at $6.5 billion in 2024 and is expected to reach $21.5 billion by 2030, at a CAGR of 22%. SRL directly addresses this demand by providing inherently interpretable policies.
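As a quick consistency check on these figures, the growth rate implied by the 2024 and 2030 endpoints can be computed with the standard compound annual growth rate formula. Only the numbers quoted in the text are used; the variable names are illustrative.

```python
start_value, end_value = 6.5, 21.5  # USD billions for 2024 and 2030, from the text
years = 2030 - 2024

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1

# Forward check: compound the 2024 value at the quoted 22% rate.
projected_2030 = start_value * (1 + 0.22) ** years

print(f"implied CAGR: {cagr:.1%}")
print(f"2030 value at 22% growth: {projected_2030:.1f} billion USD")
```

The endpoints and the quoted rate agree to within rounding.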

Market Growth Projections for Interpretable AI:

| Year | Market Size (USD Billion) | Key Drivers |
|---|---|---|
| 2024 | 6.5 | Regulatory pressure (EU AI Act, GDPR) |
| 2026 | 10.2 | Autonomous vehicle certification requirements |
| 2028 | 15.0 | Industrial automation adoption |
| 2030 | 21.5 | Healthcare and finance AI deployment |

Data Takeaway: These projections imply growth of about 22% per year through 2030. If regulations tighten as the listed drivers suggest, demand for interpretable AI should rise with them, and SRL offers one practical way to meet it.

However, the impact on the broader AI industry will be more nuanced. SRL is unlikely to replace large language models or vision transformers for their core tasks. Instead, it will likely carve out a complementary role. We predict that within two years, major cloud AI platforms (AWS, Google Cloud, Azure) will offer 'interpretable RL as a service' based on this paradigm, targeting enterprise customers in regulated industries.

Risks, Limitations & Open Questions

1. Scalability to High-Dimensional Tasks: The most significant limitation is that SRL struggles with tasks requiring complex pattern recognition. For example, controlling a robot arm from camera images would require the generated Python script to include a computer vision pipeline, which is impractical. A hybrid approach—using a neural network for perception and SRL for decision-making—is a natural next step, but it reintroduces some of the black-box issues.

2. Code Quality and Safety: The generated Python scripts are not guaranteed to be safe. A buggy script could cause a robot to behave dangerously. While the sandboxed execution environment mitigates this during training, deploying such scripts in the real world requires rigorous testing and formal verification, which is an open research problem.

3. Dependence on the Underlying Language Model: The performance of SRL is highly dependent on the quality of the LM used for code generation. If the LM has biases or limitations, they will be reflected in the generated policies. Moreover, using a proprietary LM (such as the GPT-4o-mini model used in the reference implementation) introduces a dependency on an external API, which may not be acceptable for some applications.

4. Exploration vs. Exploitation Trade-off: The evolutionary strategy used in SRL is relatively primitive. More sophisticated methods (e.g., Bayesian optimization over program space) could improve sample efficiency, but they also add complexity. The current implementation's exploration is essentially undirected: the LM proposes mutations, but nothing steers the search toward promising regions of program space, which may be inefficient for complex tasks.

5. Ethical Concerns: An interpretable policy is a double-edged sword. While it enables auditing, it also makes it easier for malicious actors to reverse-engineer and exploit the system. For example, an autonomous vehicle's policy could be analyzed to find edge cases that cause dangerous behavior.
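For the code quality and safety concern in item 2, one inexpensive first line of defense is a static check of each generated script before it is ever executed. The sketch below uses Python's standard `ast` module; the allow-list, the rule set, and the function name `check_policy_source` are assumptions made for illustration. It is not a substitute for sandboxing, testing, or formal verification.

```python
import ast

ALLOWED_CALLS = {"abs", "min", "max"}  # hypothetical allow-list of callable names


def check_policy_source(source):
    """Return a list of problems found in a generated policy script."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("imports are not allowed")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute access: {node.attr}")
        elif isinstance(node, ast.Call):
            # Only direct calls to allow-listed names pass this check.
            name = node.func.id if isinstance(node.func, ast.Name) else None
            if name not in ALLOWED_CALLS:
                problems.append(f"call not on allow-list: {name or 'non-name call'}")
        elif isinstance(node, (ast.While, ast.Global, ast.Nonlocal)):
            problems.append(f"disallowed statement: {type(node).__name__}")

    has_policy = any(
        isinstance(n, ast.FunctionDef) and n.name == "policy" for n in tree.body
    )
    if not has_policy:
        problems.append("no top-level function named 'policy'")
    return problems


safe = "def policy(state):\n    return 1 if abs(state[0]) > 0.5 else 0\n"
unsafe = "import os\ndef policy(state):\n    return os.system('echo hi')\n"
print(check_policy_source(safe))
print(check_policy_source(unsafe))
```

A check like this rejects obviously dangerous scripts cheaply, but it says nothing about whether an accepted policy behaves correctly, which is the harder verification problem the list describes.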

AINews Verdict & Predictions

Weng Jiayi's Script-based RL is not just a clever trick; it is a genuine paradigm shift. It challenges the deep learning orthodoxy that 'more parameters = better learning' and offers a viable alternative for a significant class of problems. Our editorial judgment is that this will be remembered as a foundational paper in the field of neurosymbolic AI, even though it does not use neural networks for the policy itself.

Predictions:

1. Within 12 months: At least two major autonomous driving companies (e.g., Waymo, Tesla) will announce pilot programs using SRL for their high-level planning modules, citing regulatory pressure and safety concerns.

2. Within 24 months: The 'rl-by-code' repository will surpass 50,000 GitHub stars, and a startup will emerge to commercialize SRL for industrial robotics, raising at least $20 million in Series A funding.

3. Within 36 months: SRL will be integrated into mainstream RL libraries (e.g., Stable-Baselines3, Ray RLlib) as an alternative algorithm, and a benchmark suite for 'interpretable RL' will be established, with SRL as the baseline.

4. The killer app: We believe the first major commercial success of SRL will be in automated trading systems. Financial regulators are increasingly demanding explainability for algorithmic trading strategies, and SRL's transparent, code-based policies are a perfect fit. A hedge fund using SRL for its trading logic could provide regulators with a complete audit trail of every decision.

What to watch next: The next frontier is 'multi-agent SRL,' where multiple agents generate and share Python scripts, potentially leading to emergent collaboration or competition. If Weng's team releases a follow-up paper on this, it will be a must-read.

In conclusion, SRL is a reminder that AI progress is not always about bigger models and more data. Sometimes, the most profound advances come from rethinking the fundamental assumptions of how machines should learn. By turning learning into programming, Weng has opened a door that many had forgotten existed.


