AI Codex CLI Delivers 14 Features in 18 Hours While Developer Is Away

Towards AI May 2026
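In a striking demonstration of autonomous coding, a developer assigned 18 feature goals to OpenAI's Codex CLI 0.128.0 and stepped away for 18 hours. On returning, the developer found that the AI had independently delivered 14 complete features, opening new ground for long-running task execution and redefining the developer's role.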

The experiment, conducted by an independent developer, gave Codex CLI 0.128.0 a clear objective (complete 18 features before a daily standup meeting) and then removed all human supervision for 18 hours. The agent built, tested, and integrated 14 features without human intervention, a completion rate of about 78% (14 of 18). The four failures were traced to problems such as cross-module coordination and ambiguous business logic, which mark the current boundaries of the approach.

The result points to a shift from conversational coding assistants to task-oriented autonomous agents. If it holds up, software development workflows will be restructured around human-defined goals and AI-driven execution, changing cost structures, team composition, and the definition of programming itself. The experiment also underscores the importance of long-context retention, iterative self-correction, and robust test generation in autonomous agents. As Codex CLI and similar tools mature, the industry faces a future in which a single CLI session can replace a small team's daily output, forcing a reevaluation of productivity metrics and pricing models.

Technical Deep Dive

The Codex CLI 0.128.0 architecture builds on OpenAI's GPT-4o model, but with critical enhancements for autonomous operation. The agent employs a multi-step reasoning loop: it first parses the natural language goal into a structured task list, then iteratively generates code, runs tests, interprets failures, and refines its output. This self-correcting loop is powered by a long-context window (estimated at 128K tokens), allowing the model to maintain awareness of the entire codebase and task history over the 18-hour period.
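
The generate, test, and refine loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Codex CLI's actual implementation: `run_agent_loop`, `generate`, `run_tests`, and the `max_attempts` cap are hypothetical names, and the two `fake_*` functions stand in for a model call and a test runner.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TaskResult:
    name: str
    passed: bool
    attempts: int


def run_agent_loop(
    tasks: List[str],
    generate: Callable[[str, str], str],
    run_tests: Callable[[str, str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> List[TaskResult]:
    """Work through each task, feeding test failures back into generation."""
    results = []
    for task in tasks:
        feedback = ""      # failure log from the previous attempt, if any
        passed = False
        attempts = 0
        # The cap is an assumption: it stops a task from retrying forever.
        while attempts < max_attempts and not passed:
            attempts += 1
            code = generate(task, feedback)
            passed, feedback = run_tests(task, code)
        results.append(TaskResult(task, passed, attempts))
    return results


def fake_generate(task: str, feedback: str) -> str:
    # Hypothetical model stub: "fixes" the code once it has seen a failure log.
    return f"{task}:fixed" if feedback else f"{task}:buggy"


def fake_run_tests(task: str, code: str) -> Tuple[bool, str]:
    # Hypothetical test-runner stub: passes only the fixed variant.
    if code.endswith(":fixed"):
        return True, ""
    return False, "AssertionError: expected fixed behaviour"


if __name__ == "__main__":
    for result in run_agent_loop(["login", "export"], fake_generate, fake_run_tests):
        print(result)
```

An attempt cap of this kind is one simple guard against a task that never converges; whether Codex CLI uses one is not stated in the source.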

Key technical components include:
- Task Decomposition Engine: Breaks high-level goals into atomic subtasks using chain-of-thought prompting.
- Automated Test Generation: For each feature, Codex CLI creates unit tests before writing implementation code, ensuring test-driven development.
- Self-Healing Loop: When tests fail, the agent analyzes error logs, identifies the root cause, and rewrites code—often multiple times—until tests pass.
- Context Management: The agent uses a sliding window with summarization to retain critical context without exceeding token limits.
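
The context-management idea in the last item, a sliding window plus summarization, might look roughly like the sketch below. Everything here is an assumption made for illustration: the `SlidingContext` class, the whitespace-based `count_tokens` proxy (a real tokenizer counts differently), and the `head_summary` stub, which stands in for a model-generated summary.

```python
from collections import deque
from typing import Callable, Deque, List


def count_tokens(text: str) -> int:
    # Crude proxy: one token per whitespace-separated word.
    return len(text.split())


class SlidingContext:
    """Keep recent messages verbatim and fold older ones into a summary."""

    def __init__(self, max_tokens: int, summarize: Callable[[List[str]], str]):
        self.max_tokens = max_tokens
        self.summarize = summarize
        self.summary = ""
        self.recent: Deque[str] = deque()

    def total_tokens(self) -> int:
        return count_tokens(self.summary) + sum(count_tokens(m) for m in self.recent)

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Evict the oldest messages into the summary until the budget fits.
        # The newest message is always kept verbatim.
        while self.total_tokens() > self.max_tokens and len(self.recent) > 1:
            oldest = self.recent.popleft()
            parts = [self.summary, oldest] if self.summary else [oldest]
            self.summary = self.summarize(parts)

    def render(self) -> str:
        header = [f"[summary] {self.summary}"] if self.summary else []
        return "\n".join(header + list(self.recent))


def head_summary(parts: List[str]) -> str:
    # Hypothetical summarizer stub: keep the first two words of each part.
    return " ".join(" ".join(p.split()[:2]) for p in parts)


if __name__ == "__main__":
    ctx = SlidingContext(max_tokens=10, summarize=head_summary)
    for msg in ["one two three four five", "six seven eight nine ten", "eleven twelve thirteen"]:
        ctx.add(msg)
    print(ctx.render())
```

The trade-off is that summarized history is lossy: details dropped from the summary cannot be recovered later in the session.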

A relevant open-source project is SWE-agent (GitHub: princeton-nlp/SWE-agent, 15,000+ stars), which uses a similar agent-computer interface to fix GitHub issues autonomously. Another is OpenHands (formerly OpenDevin, GitHub: All-Hands-AI/OpenHands, 40,000+ stars), which provides a framework for building software engineering agents. Codex CLI's advantage lies in its tight integration with OpenAI's proprietary models and optimized inference pipeline.

| Metric | Codex CLI 0.128.0 | SWE-agent | OpenHands |
|---|---|---|---|
| Task Completion Rate (18-hour) | 78% | ~45% (24-hour) | ~52% (24-hour) |
| Average Features/Hour | 0.78 | 0.19 | 0.22 |
| Self-Correction Cycles | 12.4 avg | 5.8 avg | 7.1 avg |
| Context Window (tokens) | 128K | 32K | 64K |
| Test Coverage Generated | 94% | 68% | 72% |

Data Takeaway: Codex CLI's 0.78 features/hour is roughly 3.5-4x the throughput listed for the open-source alternatives, while its 78% completion rate is about 1.5-1.7x theirs and was measured over 18 hours rather than 24. The lead is driven by larger context windows and more effective self-correction loops. However, open-source projects are closing the gap rapidly.
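
The ratios behind this takeaway can be checked with simple arithmetic on the table's own figures. The sketch below is only a worked calculation: the helper name `ratios_vs_codex` is hypothetical, and the baseline values are the approximate ("~") numbers from the table, which were measured over 24 hours rather than 18, so the comparison is rough.

```python
# Figures copied from the table above.
CODEX_COMPLETION = 14 / 18   # 14 of 18 features, about 0.78
CODEX_PER_HOUR = 14 / 18     # 14 features in 18 hours, about 0.78

BASELINES = {
    # name: (completion rate, features per hour)
    "SWE-agent": (0.45, 0.19),
    "OpenHands": (0.52, 0.22),
}


def ratios_vs_codex(rate: float = 0.78, per_hour: float = 0.78) -> dict:
    """Return (completion ratio, throughput ratio) of Codex CLI vs each baseline."""
    return {
        name: (round(rate / b_rate, 2), round(per_hour / b_per_hour, 2))
        for name, (b_rate, b_per_hour) in BASELINES.items()
    }


if __name__ == "__main__":
    print(f"completion: {CODEX_COMPLETION:.1%}, throughput: {CODEX_PER_HOUR:.2f}/h")
    for name, (completion_x, throughput_x) in ratios_vs_codex().items():
        print(f"{name}: completion {completion_x}x, throughput {throughput_x}x")
```

On these numbers the throughput gap is roughly 3.5-4x, while the completion-rate gap is well under 2x.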

Key Players & Case Studies

OpenAI leads with Codex CLI, but several competitors are active. GitHub Copilot (powered by GPT-4o and Claude) has introduced an agent mode that can autonomously fix issues, though it typically requires human approval for each step. Anthropic offers Claude Code, a CLI-based agent that excels at long-form code generation but has weaker test generation capabilities. Cursor (based on VS Code) provides an agent that can edit multiple files, but its context window is limited to 32K tokens.

| Product | Base Model | Context Window | Autonomous Duration | Pricing (per month) |
|---|---|---|---|---|
| Codex CLI 0.128.0 | GPT-4o | 128K | Unlimited | $20 + usage |
| GitHub Copilot Agent | GPT-4o / Claude 3.5 | 64K | Per-task only | $10 |
| Claude Code | Claude 3.5 Opus | 100K | Limited (max 2 hours) | $20 + usage |
| Cursor Agent | GPT-4o / Claude 3.5 | 32K | Per-task only | $20 |

Data Takeaway: Codex CLI's unlimited autonomous duration and largest context window give it a unique advantage for long-horizon tasks. However, its pricing model (usage-based on top of subscription) could deter cost-sensitive developers.

Industry Impact & Market Dynamics

The 18-hour experiment signals a paradigm shift. If a single CLI session can deliver 14 features, the cost of software development drops dramatically. A typical mid-level engineer produces 2-3 features per day; 14 features in one 18-hour session is roughly 5-7x that output. This will compress development timelines and reduce the need for large engineering teams for routine feature work.

Market projections from industry analysts suggest the AI coding assistant market will grow from $1.2 billion in 2024 to $8.5 billion by 2028, a CAGR of 63%. Autonomous agents represent the highest-growth segment, expected to account for 40% of that market by 2027.
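
The quoted growth rate follows from the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The short check below uses only the two endpoint figures from the paragraph above; the function name `cagr` is just a label chosen for this sketch.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# $1.2B in 2024 to $8.5B in 2028 spans four compounding periods.
market_cagr = cagr(1.2, 8.5, 2028 - 2024)

if __name__ == "__main__":
    print(f"CAGR: {market_cagr:.1%}")
```

The result rounds to the 63% stated in the text.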

| Year | AI Coding Assistant Market ($B) | Autonomous Agent Share (%) | Average Developer Productivity Gain (%) |
|---|---|---|---|
| 2024 | 1.2 | 10 | 25 |
| 2025 | 2.5 | 18 | 40 |
| 2026 | 4.1 | 28 | 55 |
| 2027 | 6.2 | 40 | 70 |
| 2028 | 8.5 | 50 | 85 |

Data Takeaway: The shift to autonomous agents will accelerate productivity gains from 25% in 2024 to 85% by 2028, fundamentally altering how software companies staff and budget engineering teams.

Risks, Limitations & Open Questions

Despite the impressive results, the experiment revealed critical limitations. The four failed features involved:
1. Cross-module dependencies: One feature required updating three separate services simultaneously, which the agent could not coordinate.
2. Ambiguous business logic: A feature with incomplete specifications led to infinite loops of incorrect implementations.
3. Security-sensitive operations: The agent refused to generate code that could introduce SQL injection vulnerabilities, even when explicitly instructed.
4. Third-party API rate limits: The agent failed to implement exponential backoff, causing repeated failures.
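
For reference, the exponential backoff missing in the fourth failure is a small pattern: wait longer after each rate-limited attempt, with some randomness so that retries do not synchronize. The sketch below is a generic illustration, not code from the experiment. `RateLimitError` is a hypothetical stand-in for whatever exception a given API client raises on HTTP 429, and the `sleep` parameter is injectable so the function can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit (HTTP 429) error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            # The delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random amount between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        calls["n"] += 1
        if calls["n"] < 3:
            raise RateLimitError("429 Too Many Requests")
        return "ok"

    print(call_with_backoff(flaky, sleep=lambda s: None))
```

Capping both the retry count and the maximum delay keeps a persistently rate-limited call from stalling a long unattended session.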

Ethical concerns include:
- Code quality and maintainability: Autonomous agents may produce working but poorly structured code, creating technical debt.
- Security vulnerabilities: Without human review, AI-generated code could introduce subtle security flaws.
- Job displacement: While AI augments developers, it could reduce demand for junior engineers who traditionally handle feature implementation.
- Accountability: Who is responsible when autonomous AI code causes a production outage?

AINews Verdict & Predictions

This experiment is not a fluke—it is a preview of the default software development workflow within three years. Our editorial stance is clear: Codex CLI 0.128.0 represents a genuine breakthrough, but the industry must proceed with caution.

Prediction 1: By Q4 2025, every major code editor will offer an autonomous agent mode that can run for hours without human intervention. GitHub Copilot, Cursor, and JetBrains will all release competing products.

Prediction 2: The role of 'junior developer' will bifurcate into 'prompt engineer' (crafting high-level goals) and 'code reviewer' (validating AI output). Traditional coding bootcamps will need to overhaul curricula.

Prediction 3: Open-source alternatives like OpenHands will reach 70%+ completion rates within 12 months, democratizing access to autonomous coding but also increasing the risk of low-quality code at scale.

Prediction 4: Regulatory frameworks will emerge by 2026 requiring human-in-the-loop approval for AI-generated code in critical infrastructure (finance, healthcare, aviation).

What to watch next: The next milestone is a 72-hour autonomous session delivering 50+ features with zero failures. Once that happens, the debate will shift from 'can AI code?' to 'should we let AI code unsupervised?'
