AI Codex CLI Delivers 14 Features in 18 Hours While Developer Is Away

Towards AI May 2026
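In a striking demonstration of autonomous coding, a developer assigned OpenAI's Codex CLI 0.128.0 a goal of 18 features and stepped away for 18 hours. On returning, the developer found that the AI had independently delivered 14 complete features, revealing a new frontier in long-horizon task execution and redefining the developer's role.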

The experiment, conducted by an independent developer, pushed Codex CLI 0.128.0 to its limits: the developer set a clear objective (complete 18 features before a daily standup meeting) and then removed all human supervision for 18 hours. The AI agent built, tested, and integrated 14 features without any human intervention, a completion rate of about 78% (14 of 18). The four failures involved cross-module coordination, ambiguous business logic, a security-sensitive operation, and third-party API rate limits, highlighting the agent's current boundaries.

This event marks a pivotal shift from conversational coding assistants to task-oriented autonomous agents. The implications are profound: software development workflows will be restructured around human-defined goals and AI-driven execution, altering cost structures, team composition, and the very definition of programming. The experiment also underscores the importance of long-context retention, iterative self-correction, and robust test generation in autonomous agents.

As Codex CLI and similar tools mature, the industry faces a future where a single CLI session can replace a small team's daily output, forcing a reevaluation of productivity metrics and pricing models.

Technical Deep Dive

The Codex CLI 0.128.0 architecture builds on OpenAI's GPT-4o model, but with critical enhancements for autonomous operation. The agent employs a multi-step reasoning loop: it first parses the natural language goal into a structured task list, then iteratively generates code, runs tests, interprets failures, and refines its output. This self-correcting loop is powered by a long-context window (estimated at 128K tokens), allowing the model to maintain awareness of the entire codebase and task history over the 18-hour period.
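
The loop described above can be sketched in a few lines. This is a minimal illustration, not Codex CLI's actual implementation: `run_task`, `run_goal`, `TaskResult`, and the `fake_*` stand-ins are hypothetical names, and the model call and test runner are replaced by toy functions so the control flow can be run as-is.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskResult:
    task: str
    passed: bool
    attempts: int
    history: List[str] = field(default_factory=list)  # failure messages seen


def run_task(
    task: str,
    generate: Callable[[str, List[str]], str],
    run_tests: Callable[[str], str],
    max_attempts: int = 5,
) -> TaskResult:
    """Generate code, test it, and feed failures back until tests pass.

    `generate` stands in for a model call (assumption); `run_tests` returns
    an empty string on success or an error message on failure.
    """
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        code = generate(task, failures)
        error = run_tests(code)
        if not error:
            return TaskResult(task, True, attempt, failures)
        failures.append(error)  # the next attempt sees every prior failure
    return TaskResult(task, False, max_attempts, failures)


def run_goal(goal, decompose, generate, run_tests) -> List[TaskResult]:
    """Split a goal into subtasks, then run the loop on each one in order."""
    return [run_task(t, generate, run_tests) for t in decompose(goal)]


# Toy stand-ins so the sketch runs without a model or a real test suite.
def fake_decompose(goal: str) -> List[str]:
    return [part.strip() for part in goal.split(";") if part.strip()]


def fake_generate(task: str, failures: List[str]) -> str:
    # Pretends to "fix" the code after seeing one failure.
    return f"{task}:v{len(failures) + 1}"


def fake_run_tests(code: str) -> str:
    return "" if code.endswith("v2") else "AssertionError: expected v2"


results = run_goal("add login; add logout", fake_decompose, fake_generate, fake_run_tests)
for r in results:
    print(r.task, r.passed, r.attempts)
```

The `max_attempts` cap is a design choice of this sketch: without one, a task with an unsatisfiable test would retry for the whole session.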

Key technical components include:
- Task Decomposition Engine: Breaks high-level goals into atomic subtasks using chain-of-thought prompting.
- Automated Test Generation: For each feature, Codex CLI creates unit tests before writing implementation code, following a test-driven development workflow.
- Self-Healing Loop: When tests fail, the agent analyzes error logs, identifies the root cause, and rewrites the code, often multiple times, until tests pass.
- Context Management: The agent uses a sliding window with summarization to retain critical context without exceeding token limits.
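
One way to picture the context-management component is the sketch below, which keeps the newest messages verbatim and folds everything older into a single summary. It is an assumption-laden illustration rather than Codex CLI's mechanism: `compact_context` is a hypothetical helper, token counting is approximated by whitespace splitting, and `summarize` would be a model call in practice.

```python
from typing import Callable, List


def compact_context(
    messages: List[str],
    max_tokens: int,
    summarize: Callable[[List[str]], str],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep the newest messages verbatim and fold older ones into a summary."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens:
        return list(messages)  # everything fits, nothing to compact

    kept: List[str] = []
    used = 0
    # Walk backwards so the most recent messages are retained first.
    for message in reversed(messages):
        cost = count_tokens(message)
        # Reserve half the budget for the summary (arbitrary split).
        if used + cost > max_tokens // 2:
            break
        kept.append(message)
        used += cost
    kept.reverse()

    older = messages[: len(messages) - len(kept)]
    return [summarize(older)] + kept


log = ["a b c d", "e f g h", "i j k l", "m n o p"]
compacted = compact_context(
    log,
    max_tokens=10,
    summarize=lambda old: f"[summary of {len(old)} earlier messages]",
)
print(compacted)
```

The sketch does not check that the summary itself fits in the reserved half of the budget; a real implementation would need to enforce that.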

A relevant open-source project is SWE-agent (GitHub: princeton-nlp/SWE-agent, 15,000+ stars), which uses a similar agent-computer interface to fix GitHub issues autonomously. Another is OpenHands (formerly OpenDevin, GitHub: All-Hands-AI/OpenHands, 40,000+ stars), which provides a framework for building software engineering agents. Codex CLI's advantage lies in its tight integration with OpenAI's proprietary models and optimized inference pipeline.

| Metric | Codex CLI 0.128.0 | SWE-agent | OpenHands |
|---|---|---|---|
| Task Completion Rate | 78% (18-hour run) | ~45% (24-hour run) | ~52% (24-hour run) |
| Features per Hour (avg) | 0.78 | 0.19 | 0.22 |
| Self-Correction Cycles (avg) | 12.4 | 5.8 | 7.1 |
| Context Window (tokens) | 128K | 32K | 64K |
| Test Coverage Generated | 94% | 68% | 72% |

Data Takeaway: Codex CLI's 0.78 features/hour is roughly 3.5-4x the throughput of the open-source alternatives, and its 78% completion rate is about 1.5-1.7x theirs (over runs of different length: 18 versus 24 hours), driven by larger context windows and more effective self-correction loops. However, open-source projects are closing the gap rapidly.
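
The headline figures and the ratios against the open-source baselines can be checked with a few lines of arithmetic. The inputs are taken directly from the table above; the variable names are illustrative only.

```python
features_delivered = 14
features_targeted = 18
hours = 18

completion_rate = features_delivered / features_targeted
features_per_hour = features_delivered / hours

print(f"completion rate: {completion_rate:.1%}")     # 77.8%, reported as 78%
print(f"features per hour: {features_per_hour:.2f}")  # 0.78

# (completion rate, features per hour) as reported in the table
baselines = {"SWE-agent": (0.45, 0.19), "OpenHands": (0.52, 0.22)}

ratios = {}
for name, (rate, per_hour) in baselines.items():
    ratios[name] = (round(0.78 / rate, 1), round(0.78 / per_hour, 1))
    print(name, ratios[name])
```

The ratios come out at 1.7x and 4.1x for SWE-agent and 1.5x and 3.5x for OpenHands, so the "3-4x" figure applies to throughput rather than to completion rate.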

Key Players & Case Studies

OpenAI leads with Codex CLI, but several competitors are active. GitHub Copilot (powered by GPT-4o and Claude) has introduced an agent mode that can autonomously fix issues, though it typically requires human approval for each step. Anthropic offers Claude Code, a CLI-based agent that excels at long-form code generation but has weaker test generation capabilities. Cursor (based on VS Code) provides an agent that can edit multiple files, but its context window is limited to 32K tokens.

| Product | Base Model | Context Window | Autonomous Duration | Pricing (per month) |
|---|---|---|---|---|
| Codex CLI 0.128.0 | GPT-4o | 128K | Unlimited | $20 + usage |
| GitHub Copilot Agent | GPT-4o / Claude 3.5 | 64K | Per-task only | $10 |
| Claude Code | Claude 3.5 Opus | 100K | Limited (max 2 hours) | $20 + usage |
| Cursor Agent | GPT-4o / Claude 3.5 | 32K | Per-task only | $20 |

Data Takeaway: Codex CLI's unlimited autonomous duration and largest context window give it a unique advantage for long-horizon tasks. However, its pricing model (usage-based on top of subscription) could deter cost-sensitive developers.

Industry Impact & Market Dynamics

The 18-hour experiment signals a paradigm shift. If a single CLI session can deliver 14 features, the cost of software development drops dramatically. A typical mid-level engineer produces 2-3 features per day; at 14 features in one 18-hour session, Codex CLI delivered roughly 5-7x that output. This will compress development timelines and reduce the need for large engineering teams for routine feature work.

Market projections from industry analysts suggest the AI coding assistant market will grow from $1.2 billion in 2024 to $8.5 billion by 2028, a CAGR of 63%. Autonomous agents represent the highest-growth segment, expected to account for 40% of that market by 2027.
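
The growth rate quoted above follows from the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The short check below applies it to the 2024 and 2028 endpoints and, for comparison, computes the year-over-year growth implied by the yearly market figures in this section's table; `cagr` is just an illustrative helper name.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


rate = cagr(1.2, 8.5, 2028 - 2024)
print(f"CAGR 2024-2028: {rate:.0%}")  # about 63%

# Market size in $B by year, as projected in this section
market = {2024: 1.2, 2025: 2.5, 2026: 4.1, 2027: 6.2, 2028: 8.5}
yoy = {year: market[year] / market[year - 1] - 1 for year in list(market)[1:]}
for year, growth in yoy.items():
    print(year, f"{growth:.0%}")
```

The year-over-year figures are front-loaded (the market roughly doubles in the first year and grows more slowly afterwards), so the 63% CAGR describes the endpoints, not a constant annual rate.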

| Year | AI Coding Assistant Market ($B) | Autonomous Agent Share (%) | Average Developer Productivity Gain (%) |
|---|---|---|---|
| 2024 | 1.2 | 10 | 25 |
| 2025 | 2.5 | 18 | 40 |
| 2026 | 4.1 | 28 | 55 |
| 2027 | 6.2 | 40 | 70 |
| 2028 | 8.5 | 50 | 85 |

Data Takeaway: The shift to autonomous agents will accelerate productivity gains from 25% in 2024 to 85% by 2028, fundamentally altering how software companies staff and budget engineering teams.

Risks, Limitations & Open Questions

Despite the impressive results, the experiment revealed critical limitations. The four failed features involved:
1. Cross-module dependencies: One feature required updating three separate services simultaneously, which the agent could not coordinate.
2. Ambiguous business logic: A feature with incomplete specifications led to repeated cycles of incorrect implementations that never converged.
3. Security-sensitive operations: The agent refused to generate code that could introduce SQL injection vulnerabilities, even when explicitly instructed.
4. Third-party API rate limits: The agent failed to implement exponential backoff, causing repeated failures.
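
For reference, the exponential backoff missing from the fourth failure can be as small as the sketch below. It is a generic illustration, not code from the experiment: `RateLimitError` and `call_with_backoff` are hypothetical names, and the delay parameters are arbitrary defaults. The `sleep` parameter is injectable so the example runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error a client raises when the API returns HTTP 429."""


def call_with_backoff(
    request: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `request` on rate-limit errors, waiting longer after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            # Double the ceiling each attempt, cap it, then pick a random
            # delay below it ("full jitter") so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Demo: a request that is rate-limited twice and then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, len(delays))  # ok 2
```

When the API returns a `Retry-After` header, honoring that value is generally preferable to a computed delay.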

Ethical concerns include:
- Code quality and maintainability: Autonomous agents may produce working but poorly structured code, creating technical debt.
- Security vulnerabilities: Without human review, AI-generated code could introduce subtle security flaws.
- Job displacement: While AI augments developers, it could reduce demand for junior engineers who traditionally handle feature implementation.
- Accountability: Who is responsible when autonomous AI code causes a production outage?

AINews Verdict & Predictions

This experiment is not a fluke—it is a preview of the default software development workflow within three years. Our editorial stance is clear: Codex CLI 0.128.0 represents a genuine breakthrough, but the industry must proceed with caution.

Prediction 1: By Q4 2025, every major code editor will offer an autonomous agent mode that can run for hours without human intervention. GitHub Copilot, Cursor, and JetBrains will all release competing products.

Prediction 2: The role of 'junior developer' will bifurcate into 'prompt engineer' (crafting high-level goals) and 'code reviewer' (validating AI output). Traditional coding bootcamps will need to overhaul curricula.

Prediction 3: Open-source alternatives like OpenHands will reach 70%+ completion rates within 12 months, democratizing access to autonomous coding but also increasing the risk of low-quality code at scale.

Prediction 4: Regulatory frameworks will emerge by 2026 requiring human-in-the-loop approval for AI-generated code in critical infrastructure (finance, healthcare, aviation).

What to watch next: The next milestone is a 72-hour autonomous session delivering 50+ features with zero failures. Once that happens, the debate will shift from 'can AI code?' to 'should we let AI code unsupervised?'
