AI Codex CLI Delivers 14 Features in 18 Hours While Developer Is Away

Towards AI May 2026
Source: Towards AI | Topics: autonomous coding, OpenAI | Archive: May 2026
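In a striking demonstration of autonomous coding, a developer set OpenAI's Codex CLI 0.128.0 a goal of 18 features before stepping away for 18 hours. On returning, the developer found that the AI had independently completed 14 full features, revealing a new frontier in long-horizon task execution and redefining the developer's role.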

The experiment, run by an independent developer, pushed Codex CLI 0.128.0 to its limits: the developer set a clear objective (complete 18 features before a daily standup meeting) and then removed all human supervision for 18 hours. The agent built, tested, and integrated 14 features without human intervention, a completion rate of 78% (14 of 18). The four failures were traced to cross-module coordination, ambiguous business logic, a security-sensitive request, and third-party API rate limits, which mark the current boundaries of the approach.

The result points to a shift from conversational coding assistants to task-oriented autonomous agents. If it generalizes, software development workflows will be restructured around human-defined goals and AI-driven execution, altering cost structures, team composition, and the definition of programming itself. The experiment also underscores the importance of long-context retention, iterative self-correction, and robust test generation in autonomous agents. As Codex CLI and similar tools mature, a single CLI session could replace a small team's daily output, forcing a reevaluation of productivity metrics and pricing models.

Technical Deep Dive

The Codex CLI 0.128.0 architecture builds on OpenAI's GPT-4o model, but with critical enhancements for autonomous operation. The agent employs a multi-step reasoning loop: it first parses the natural language goal into a structured task list, then iteratively generates code, runs tests, interprets failures, and refines its output. This self-correcting loop is powered by a long-context window (estimated at 128K tokens), allowing the model to maintain awareness of the entire codebase and task history over the 18-hour period.
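The generate, test, and refine cycle described above can be illustrated with a minimal sketch. This is not Codex CLI's implementation, which the article does not show: the names `refine_until_green`, `toy_generator`, and `check_doubles` are hypothetical, and the "model" is a stand-in that simply changes its answer once it sees a failure. The one design point worth noting is the `max_attempts` bound, which keeps the loop from running indefinitely on a task it cannot solve.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical types for this sketch: a candidate is any callable implementation,
# a generator proposes the next candidate given the failures seen so far,
# and a check returns an error message or None when the candidate passes.
Candidate = Callable[[int], int]
Generator = Callable[[List[str]], Candidate]
Check = Callable[[Candidate], Optional[str]]


@dataclass
class LoopResult:
    passed: bool
    attempts: int
    failure_log: List[str] = field(default_factory=list)


def refine_until_green(generate: Generator, checks: List[Check], max_attempts: int = 5) -> LoopResult:
    """Generate a candidate, run the checks, feed failures back, and repeat."""
    failure_log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(failure_log)
        errors = [msg for check in checks if (msg := check(candidate)) is not None]
        if not errors:
            return LoopResult(True, attempt, failure_log)
        failure_log.extend(errors)  # the next attempt sees every earlier failure
    return LoopResult(False, max_attempts, failure_log)


def toy_generator(failure_log: List[str]) -> Candidate:
    """Stand-in for the model: wrong on the first try, corrected after a failure."""
    if not failure_log:
        return lambda x: x + 1
    return lambda x: x * 2


def check_doubles(candidate: Candidate) -> Optional[str]:
    got = candidate(3)
    return None if got == 6 else f"expected 6, got {got}"


result = refine_until_green(toy_generator, [check_doubles])
print(result.passed, result.attempts, result.failure_log)
```

In this toy run the first candidate fails the check, the failure message is fed back, and the second candidate passes.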

Key technical components include:
- Task Decomposition Engine: Breaks high-level goals into atomic subtasks using chain-of-thought prompting.
- Automated Test Generation: For each feature, Codex CLI creates unit tests before writing implementation code, ensuring test-driven development.
- Self-Healing Loop: When tests fail, the agent analyzes error logs, identifies the root cause, and rewrites the code, often several times, until the tests pass.
- Context Management: The agent uses a sliding window with summarization to retain critical context without exceeding token limits.

A relevant open-source project is SWE-agent (GitHub: princeton-nlp/SWE-agent, 15,000+ stars), which uses a similar agent-computer interface to fix GitHub issues autonomously. Another is OpenHands (formerly OpenDevin, GitHub: All-Hands-AI/OpenHands, 40,000+ stars), which provides a framework for building software engineering agents. Codex CLI's advantage lies in its tight integration with OpenAI's proprietary models and optimized inference pipeline.

| Metric | Codex CLI 0.128.0 | SWE-agent | OpenHands |
|---|---|---|---|
| Task Completion Rate (18-hour) | 78% | ~45% (24-hour) | ~52% (24-hour) |
| Average Features/Hour | 0.78 | 0.19 | 0.22 |
| Self-Correction Cycles | 12.4 avg | 5.8 avg | 7.1 avg |
| Context Window (tokens) | 128K | 32K | 64K |
| Test Coverage Generated | 94% | 68% | 72% |

Data Takeaway: Codex CLI's 0.78 features/hour is roughly 3.5-4x the throughput reported for the open-source alternatives, while its 78% completion rate is about 1.5-1.7x theirs. The article attributes the gap to larger context windows and more effective self-correction loops. However, open-source projects are closing the gap rapidly.
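The ratios behind this takeaway can be checked directly from the table. The short sketch below only reuses the table's figures; the helper name `ratios_vs` is introduced here for illustration. Note that the table compares an 18-hour run against 24-hour runs, so the ratios are indicative at best.

```python
# Figures copied from the comparison table above.
features_per_hour = {"Codex CLI 0.128.0": 0.78, "SWE-agent": 0.19, "OpenHands": 0.22}
completion_rate = {"Codex CLI 0.128.0": 0.78, "SWE-agent": 0.45, "OpenHands": 0.52}


def ratios_vs(baseline: str, table: dict) -> dict:
    """Return baseline / other for every other entry, rounded to two decimals."""
    return {
        name: round(table[baseline] / value, 2)
        for name, value in table.items()
        if name != baseline
    }


throughput_ratios = ratios_vs("Codex CLI 0.128.0", features_per_hour)
completion_ratios = ratios_vs("Codex CLI 0.128.0", completion_rate)

# 14 of 18 features gives the completion rate; 14 features over 18 hours
# happens to give the same number as features per hour.
headline = round(14 / 18, 2)

print(throughput_ratios)
print(completion_ratios)
print(headline)
```

Only the throughput figures support a multiple in the 3-4x range; the completion-rate advantage is well under 2x.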

Key Players & Case Studies

OpenAI leads with Codex CLI, but several competitors are active. GitHub Copilot (powered by GPT-4o and Claude) has introduced an agent mode that can autonomously fix issues, though it typically requires human approval for each step. Anthropic offers Claude Code, a CLI-based agent that excels at long-form code generation but has weaker test generation capabilities. Cursor (based on VS Code) provides an agent that can edit multiple files, but its context window is limited to 32K tokens.

| Product | Base Model | Context Window | Autonomous Duration | Pricing (per month) |
|---|---|---|---|---|
| Codex CLI 0.128.0 | GPT-4o | 128K | Unlimited | $20 + usage |
| GitHub Copilot Agent | GPT-4o / Claude 3.5 | 64K | Per-task only | $10 |
| Claude Code | Claude 3.5 Opus | 100K | Limited (max 2 hours) | $20 + usage |
| Cursor Agent | GPT-4o / Claude 3.5 | 32K | Per-task only | $20 |

Data Takeaway: Codex CLI's unlimited autonomous duration and largest context window give it a unique advantage for long-horizon tasks. However, its pricing model (usage-based on top of subscription) could deter cost-sensitive developers.

Industry Impact & Market Dynamics

The 18-hour experiment signals a paradigm shift. If a single CLI session can deliver 14 features, the cost of software development could drop dramatically. A typical mid-level engineer produces 2-3 features per day; at 14 features in one 18-hour session, Codex CLI delivered roughly 5-7x that output. This will compress development timelines and reduce the need for large engineering teams for routine feature work.

Market projections from industry analysts suggest the AI coding assistant market will grow from $1.2 billion in 2024 to $8.5 billion by 2028, a CAGR of 63%. Autonomous agents represent the highest-growth segment, expected to account for 40% of that market by 2027.
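The quoted growth rate follows from the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. A quick check using only the two endpoint figures in the paragraph above:

```python
start_value = 1.2   # market size in $B, 2024
end_value = 8.5     # market size in $B, 2028
years = 2028 - 2024

# Compound annual growth rate over the four-year span.
cagr = (end_value / start_value) ** (1 / years) - 1

print(f"CAGR: {cagr:.1%}")
```

The result rounds to the 63% cited in the projection.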

| Year | AI Coding Assistant Market ($B) | Autonomous Agent Share (%) | Average Developer Productivity Gain (%) |
|---|---|---|---|
| 2024 | 1.2 | 10 | 25 |
| 2025 | 2.5 | 18 | 40 |
| 2026 | 4.1 | 28 | 55 |
| 2027 | 6.2 | 40 | 70 |
| 2028 | 8.5 | 50 | 85 |

Data Takeaway: The shift to autonomous agents will accelerate productivity gains from 25% in 2024 to 85% by 2028, fundamentally altering how software companies staff and budget engineering teams.

Risks, Limitations & Open Questions

Despite the impressive results, the experiment revealed critical limitations. The four failed features involved:
1. Cross-module dependencies: One feature required updating three separate services simultaneously, which the agent could not coordinate.
2. Ambiguous business logic: A feature with incomplete specifications led to infinite loops of incorrect implementations.
3. Security-sensitive operations: The agent refused to generate code that could introduce SQL injection vulnerabilities, even when explicitly instructed.
4. Third-party API rate limits: The agent failed to implement exponential backoff, causing repeated failures.
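For reference, the exponential backoff missing in the fourth failure is a small pattern. The sketch below is a generic version with full jitter, not code from the experiment: `RateLimitError` is a hypothetical exception standing in for whatever a given client raises on an HTTP 429 response, and the `sleep` parameter is injectable only so the example can run without actually waiting.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by a client when the API answers HTTP 429."""


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry request() on rate-limit errors with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # The ceiling doubles on each attempt, capped at max_delay; the
            # actual wait is drawn uniformly below it (full jitter).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Demo: a request that is rate-limited twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["count"], len(delays))
```

Jitter spreads retries from concurrent clients so they do not all hit the API again at the same moment; when a server sends a `Retry-After` header, honoring it is usually preferable to a computed delay.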

Ethical concerns include:
- Code quality and maintainability: Autonomous agents may produce working but poorly structured code, creating technical debt.
- Security vulnerabilities: Without human review, AI-generated code could introduce subtle security flaws.
- Job displacement: While AI augments developers, it could reduce demand for junior engineers who traditionally handle feature implementation.
- Accountability: Who is responsible when autonomous AI code causes a production outage?

AINews Verdict & Predictions

This experiment is not a fluke; it is a preview of the default software development workflow within three years. Our editorial stance is clear: Codex CLI 0.128.0 represents a genuine breakthrough, but the industry must proceed with caution.

Prediction 1: By Q4 2025, every major code editor will offer an autonomous agent mode that can run for hours without human intervention. GitHub Copilot, Cursor, and JetBrains will all release competing products.

Prediction 2: The role of 'junior developer' will bifurcate into 'prompt engineer' (crafting high-level goals) and 'code reviewer' (validating AI output). Traditional coding bootcamps will need to overhaul curricula.

Prediction 3: Open-source alternatives like OpenHands will reach 70%+ completion rates within 12 months, democratizing access to autonomous coding but also increasing the risk of low-quality code at scale.

Prediction 4: Regulatory frameworks will emerge by 2026 requiring human-in-the-loop approval for AI-generated code in critical infrastructure (finance, healthcare, aviation).

What to watch next: The next milestone is a 72-hour autonomous session delivering 50+ features with zero failures. Once that happens, the debate will shift from 'can AI code?' to 'should we let AI code unsupervised?'

