AI's Great Shift: From Predicting Words to Completing Tasks, Codex Shows the Way

June 26, 2026, 22:32 AINews Hacker News June 2026

Source: Hacker News AI agents Archive: June 2026

OpenAI researchers have published a landmark paper detailing the evolution of Codex from a code completion tool into a full-fledged autonomous agent. This signals a profound industry pivot from 'next-word prediction' to 'next-task completion,' redefining how AI systems interact with the world.


A new paper from OpenAI, titled 'The Agentic Turn in AI: Evidence from Codex,' provides the clearest evidence yet that the AI industry is undergoing a fundamental paradigm shift. The paper traces the transformation of Codex—originally a simple code autocomplete tool—into an autonomous agent capable of setting subgoals, calling external tools, and self-correcting after errors. This is not merely a scaling-up of parameters or data; it is a deep architectural and training-methodology change that redefines what an AI system is supposed to do. The new paradigm replaces the dominance of 'next-token prediction' with a framework centered on 'task completion rate' and 'autonomy.' The implications are vast: future AI products will be judged not by the fluency of their text, but by their reliability in completing user-assigned tasks. This shift will reshape pricing models (from per-token to per-task), service design (from conversational chatbots to autonomous workers), and the very nature of user trust. The paper offers a technical blueprint for this transition, showing how reinforcement learning from task completion, hierarchical planning modules, and tool-use APIs enable a system to move from passive generation to active execution. The research community is already taking notice: the evaluation metrics that once dominated leaderboards—perplexity, BLEU scores—are giving way to new benchmarks like SWE-bench and AgentBench that measure real-world task success. This is not a future possibility; it is happening now, and Codex is its harbinger.

Technical Deep Dive

The paper's core insight is that the transition from language model to agent requires a fundamental rethinking of the training objective. Traditional LLMs are optimized to minimize cross-entropy loss on next-token prediction. This produces fluent text but does not guarantee task completion. The Codex team introduced a multi-stage training pipeline that explicitly optimizes for task success.

Architecture Changes: The agentic Codex retains the transformer backbone but adds several critical components:
- Hierarchical Planner: A separate module that decomposes a high-level user request (e.g., 'build a web scraper') into a sequence of subgoals (e.g., 'fetch HTML', 'parse links', 'save to CSV'). This planner is trained via imitation learning on human-annotated task decompositions.
- Tool-Use Interface: The model is given access to a set of APIs—file system, shell commands, web search, code interpreter—via function-calling tokens. The paper shows that the model learns to invoke these tools autonomously, even chaining multiple calls.
- Self-Correction Loop: After each action, the system checks for errors (e.g., compilation errors, runtime exceptions) and, if found, enters a 'debugging mode' where it re-plans and retries. This is implemented via a separate 'critic' model that scores the output of each step and triggers a replan if the score is below a threshold.
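The three components above can be sketched as a single control loop. This is a minimal illustration, not the paper's implementation: the lookup-table `plan` function, the toy tools, the string-matching `critic`, and the `threshold` and `max_retries` parameters are all hypothetical stand-ins, and the "re-plan" step is simplified to a plain retry of the same subgoal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    subgoal: str
    output: str
    score: float
    attempts: int


def plan(request: str) -> List[str]:
    """Hypothetical planner: a lookup table standing in for a learned module."""
    plans = {"build a web scraper": ["fetch HTML", "parse links", "save to CSV"]}
    return plans.get(request, [request])


def run_agent(
    request: str,
    tools: Dict[str, Callable[[int], str]],
    critic: Callable[[str], float],
    threshold: float = 0.5,
    max_retries: int = 2,
) -> List[StepResult]:
    """Execute each subgoal with its tool, retrying while the critic score is low."""
    results = []
    for subgoal in plan(request):
        tool = tools[subgoal]
        for attempt in range(1, max_retries + 2):
            try:
                output = tool(attempt)
            except Exception as exc:  # surface tool failures to the critic
                output = f"error: {exc}"
            score = critic(output)
            if score >= threshold:
                break  # step accepted, move on to the next subgoal
        results.append(StepResult(subgoal, output, score, attempt))
    return results


def flaky_parse(attempt: int) -> str:
    """Toy tool that fails on its first call to exercise the retry path."""
    if attempt == 1:
        raise ValueError("malformed HTML")
    return "3 links"


TOOLS = {
    "fetch HTML": lambda attempt: "<html>...</html>",
    "parse links": flaky_parse,
    "save to CSV": lambda attempt: "links.csv written",
}


def critic(output: str) -> float:
    """Toy critic: any output that starts with 'error' scores zero."""
    return 0.0 if output.startswith("error") else 1.0


trace = run_agent("build a web scraper", TOOLS, critic)
for step in trace:
    print(step.subgoal, step.attempts, step.score)
```

In this toy run the second subgoal fails once and succeeds on the retry, which is the behavior the self-correction bullet describes. A real system would replace each stand-in with a learned model or an actual tool call.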

Training Methodology: The training data was curated to include not just correct code, but also traces of failed attempts and subsequent corrections. The model was fine-tuned using a variant of reinforcement learning from human feedback (RLHF), but the reward signal was based on task completion (e.g., 'did the code run without errors?', 'did it produce the correct output?') rather than text quality. The paper reports that this 'task-completion RL' was critical: models trained on next-token prediction alone achieved only 34% task completion on a held-out set of coding tasks, while the agentic version achieved 78%.
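A reward of this kind can be sketched in a few lines. The function below is only an illustration of the two checks quoted above ('did the code run without errors?', 'did it produce the correct output?'). The name `task_completion_reward`, the 0.5 partial-credit value, and the timeout are assumptions, not details from the paper.

```python
import subprocess
import sys


def task_completion_reward(code: str, expected_stdout: str, timeout_s: float = 10.0) -> float:
    """Score a candidate program by execution outcome rather than text quality.

    Assumed scale: 1.0 = runs cleanly with the expected output,
    0.5 = runs cleanly with the wrong output, 0.0 = crash or timeout.
    """
    try:
        # NOTE: this executes the candidate directly; a real pipeline would sandbox it.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if proc.returncode != 0:
        return 0.0
    return 1.0 if proc.stdout.strip() == expected_stdout.strip() else 0.5


print(task_completion_reward("print(2 + 2)", "4"))
print(task_completion_reward("print(5)", "4"))
print(task_completion_reward("1 / 0", "4"))
```

The point of the design is that the signal comes from running the program, so fluent but broken code earns nothing.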

Relevant Open-Source Work: The paper's approach aligns closely with several open-source projects that readers can explore:
- OpenDevin (GitHub: OpenDevin/OpenDevin, ~35k stars): An open platform for AI software agents that uses a similar planner-executor architecture. It has shown strong results on SWE-bench.
- SWE-agent (GitHub: princeton-nlp/SWE-agent, ~15k stars): A framework for turning LLMs into software engineering agents that can fix bugs in real GitHub repositories. It uses a similar tool-use and self-correction loop.
- CrewAI (GitHub: joaomdmoura/crewAI, ~25k stars): A framework for orchestrating multiple AI agents to collaborate on tasks. While not directly cited, its multi-agent planning mirrors the hierarchical approach in the paper.

Benchmark Performance: The paper includes a comparison of the agentic Codex against prior models on task-completion benchmarks:

| Model | SWE-bench Lite (Pass@1) | AgentBench (Avg Score) | HumanEval (Pass@1) | Task Completion Rate (Coding) |
|---|---|---|---|---|
| GPT-3.5 (Codex base) | 12.4% | 28.1 | 48.1% | 34% |
| GPT-4 (Codex base) | 18.2% | 35.6 | 67.0% | 52% |
| Agentic Codex (paper) | 41.7% | 62.3 | 82.5% | 78% |
| Claude 3.5 Sonnet (agentic) | 33.6% | 55.4 | 76.2% | 69% |

Data Takeaway: On task completion rate, the agentic Codex achieves a 1.5x improvement over its non-agentic GPT-4 counterpart (78% vs. 52%) and a 2.3x improvement over GPT-3.5 (78% vs. 34%). This is not a marginal gain; it represents a qualitative leap in capability. The gap on SWE-bench Lite is particularly striking: at 41.7%, the agentic version is about 2.3x as effective as GPT-4 base (18.2%) and 3.4x as effective as GPT-3.5 (12.4%) at fixing real-world software bugs.
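The improvement ratios can be recomputed directly from the benchmark table. The snippet below copies the SWE-bench Lite and task-completion columns from the table above; the dictionary layout and the `improvement` helper are just for illustration.

```python
# Values copied from the benchmark table (percentages).
scores = {
    "GPT-3.5 (Codex base)": {"swe_bench_lite": 12.4, "task_completion": 34.0},
    "GPT-4 (Codex base)": {"swe_bench_lite": 18.2, "task_completion": 52.0},
    "Agentic Codex": {"swe_bench_lite": 41.7, "task_completion": 78.0},
}


def improvement(metric: str, baseline: str) -> float:
    """Ratio of the agentic Codex score to a baseline model's score."""
    return scores["Agentic Codex"][metric] / scores[baseline][metric]


for metric in ("swe_bench_lite", "task_completion"):
    for baseline in ("GPT-4 (Codex base)", "GPT-3.5 (Codex base)"):
        print(f"{metric} vs {baseline}: {improvement(metric, baseline):.1f}x")
# swe_bench_lite vs GPT-4 (Codex base): 2.3x
# swe_bench_lite vs GPT-3.5 (Codex base): 3.4x
# task_completion vs GPT-4 (Codex base): 1.5x
# task_completion vs GPT-3.5 (Codex base): 2.3x
```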

Key Players & Case Studies

While the paper is from OpenAI, the shift it documents is industry-wide. Several key players are racing to operationalize this paradigm:

OpenAI: The paper is clearly a strategic document, signaling that OpenAI's future product roadmap is agent-centric. The evolution of Codex into an agent is likely the foundation for future versions of ChatGPT's 'Code Interpreter' and 'Advanced Data Analysis' features. The company has also been hiring heavily for 'agentic AI' roles, and internal sources suggest a new product—codenamed 'Operator'—is in development, which will allow users to delegate complex multi-step tasks.

Anthropic: The Claude family of models has shown strong agentic capabilities, particularly in tool use. Anthropic's 'Computer Use' feature, which allows Claude to control a virtual desktop, is a direct competitor. The paper's findings validate Anthropic's bet on 'constitutional AI' combined with tool-use training. However, Anthropic has been more cautious about full autonomy, emphasizing 'human-in-the-loop' safeguards.

Google DeepMind: DeepMind's Gemini models have been benchmarked on AgentBench and SWE-bench, but their performance has lagged behind OpenAI and Anthropic. The paper highlights a key weakness: Google's models are still optimized for next-token prediction on massive text corpora, and the company has been slower to adopt task-completion RL. However, DeepMind's work on 'AlphaCode' and 'AlphaDev' shows they have deep expertise in code generation, and a pivot to agentic training is likely imminent.

Startups: A new wave of startups is building entirely on the agent paradigm:
- Cognition Labs (Devin): The first 'AI software engineer' that can autonomously plan, code, and deploy. Devin has shown impressive results on SWE-bench but has also faced criticism for reliability issues.
- Factory AI: Building 'AI workers' for enterprise workflows, focusing on reliability and auditability.
- Temporal: While not an AI company, its workflow orchestration platform is being adopted by AI agent developers to manage long-running, multi-step tasks with error handling.

Comparison of Agentic Platforms:

| Platform | Core Model | Task Completion (SWE-bench) | Autonomy Level | Pricing Model |
|---|---|---|---|---|
| OpenAI Agent (Codex-based) | GPT-4 (agentic) | 41.7% | High (self-correcting) | Per-task (est.) |
| Anthropic Claude (Computer Use) | Claude 3.5 Sonnet | 33.6% | Medium (human approval for critical actions) | Per-token + tool usage |
| Devin (Cognition) | Custom fine-tuned | 48.9% (claimed) | Very High (full autonomy) | Subscription ($500/mo) |
| OpenDevin (open-source) | Various (GPT-4, Claude) | 33.0% | Configurable | Free |

Data Takeaway: The proprietary systems (OpenAI, Devin) lead on SWE-bench, though Devin's 48.9% is self-reported, and the open-source OpenDevin (33.0%) is already roughly level with Claude 3.5 Sonnet (33.6%). The pricing models are diverging: OpenAI appears to be moving toward per-task pricing (an estimate, not a published price), while Anthropic retains per-token pricing. This reflects a fundamental disagreement about whether AI should be sold as a tool or a worker.

Industry Impact & Market Dynamics

The shift from language model to agent will reshape the AI industry across multiple dimensions:

Product Design: The dominant product paradigm—the conversational chatbot—is being disrupted. Users don't want to chat; they want tasks done. This means future AI products will look more like 'digital assistants' that can manage calendars, write reports, book travel, and fix code, all without step-by-step user guidance. The paper's emphasis on 'task completion rate' as the key metric will force product teams to redesign their interfaces: instead of a text box, users will see a task dashboard with progress bars, error logs, and completion confirmations.

Business Models: The pricing logic is shifting from 'cost of compute' to 'value of task completed.' This is a dramatic change. A per-token model charges users for the AI's thinking time; a per-task model charges for the outcome. This aligns incentives: the AI provider profits only when the user's task is successfully completed. The paper's data suggests that task-completion RL leads to higher reliability, which justifies higher per-task prices. We predict that the market will bifurcate: low-stakes tasks (e.g., drafting emails) will remain cheap or free, while high-stakes tasks (e.g., legal document review, medical diagnosis) will command premium per-task fees.
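The incentive argument can be made concrete with a small expected-cost calculation. Only the 78% success rate comes from the paper as reported here. The token count, token price, and per-task fee are made-up illustrative numbers, and the model assumes independent attempts repeated until one succeeds (so the expected number of attempts is 1 / success rate).

```python
def expected_compute_cost(tokens_per_attempt: int, price_per_1k_tokens: float,
                          success_rate: float) -> float:
    """Expected token spend to obtain one successful completion.

    Assumes independent retries until success (geometric distribution).
    """
    cost_per_attempt = tokens_per_attempt / 1000 * price_per_1k_tokens
    return cost_per_attempt / success_rate


# Hypothetical inputs, for illustration only.
TOKENS_PER_ATTEMPT = 50_000
PRICE_PER_1K_TOKENS = 0.01   # dollars
PER_TASK_FEE = 5.00          # dollars, charged only on success
SUCCESS_RATE = 0.78          # task completion rate reported in the paper

cost = expected_compute_cost(TOKENS_PER_ATTEMPT, PRICE_PER_1K_TOKENS, SUCCESS_RATE)

# Per-token pricing: the user pays for failed attempts as well as the successful one.
print(f"user's expected spend under per-token pricing: ${cost:.2f}")
# Per-task pricing: the provider absorbs failed attempts out of its margin.
print(f"provider's expected margin under per-task pricing: ${PER_TASK_FEE - cost:.2f}")
```

Under these assumptions the cost of failures is the same in both models. What changes is who carries it, which is why a higher completion rate directly improves the provider's margin under per-task pricing.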

Market Size and Growth: The global market for AI agents is projected to grow from $5.1 billion in 2024 to $47.1 billion by 2030, according to industry estimates. The paper's findings could accelerate this timeline. Key growth drivers include:
- Enterprise automation (replacing manual workflows in HR, finance, IT)
- Software development (AI agents that write, test, and deploy code)
- Customer service (autonomous agents that resolve complex issues)
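For scale, the projection above implies a compound annual growth rate of roughly 45%. The calculation below uses only the two figures quoted in the text; the `cagr` helper name is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1 / years) - 1


# $5.1B in 2024 to $47.1B in 2030, as quoted above.
rate = cagr(5.1, 47.1, 2030 - 2024)
print(f"implied CAGR: {rate:.1%}")  # about 45% per year
```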

Competitive Dynamics: The paper points to a clear 'moat' for companies that have invested in task-completion RL. OpenAI and Anthropic have a 12-18 month lead over Google and Meta, who are still optimizing for text generation. However, the open-source ecosystem (OpenDevin, SWE-agent) is commoditizing the agent architecture, which could erode this lead. The real competitive advantage will come from proprietary data on task-completion traces, meaning the logs of how agents succeed and fail. Companies that collect the most high-quality task-completion data will have the best models.

Funding Landscape: Venture capital is pouring into agentic AI. In 2025, funding for agent-focused startups reached $8.2 billion, up from $2.1 billion in 2023. The paper's validation of the paradigm will likely trigger another wave of investment. Key areas of interest: enterprise agents, developer tools, and safety/alignment for autonomous systems.

Risks, Limitations & Open Questions

The paper is optimistic, but the shift to agentic AI raises serious concerns:

Reliability at Scale: The paper reports 78% task completion on coding tasks, but a 22% failure rate is catastrophic for many applications. If an AI agent managing a company's payroll failed 22% of the time, the consequences would be severe. The paper does not address how to achieve the 99.9%+ reliability expected of enterprise software.
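A back-of-the-envelope calculation shows how far 78% is from 99.9%. The sketch below rests on two strong assumptions that are not from the paper: attempts are statistically independent, and failures are always detected so a retry can be triggered. Real agent failures are often correlated, so treat the result as an optimistic bound.

```python
import math


def attempts_for_target(success_rate: float, target: float) -> int:
    """Smallest n with 1 - (1 - success_rate)**n >= target, assuming independent attempts."""
    failure_rate = 1 - success_rate
    return math.ceil(math.log(1 - target) / math.log(failure_rate))


def chained_success(success_rate: float, steps: int) -> float:
    """Probability that every one of `steps` sequential tasks succeeds on the first try."""
    return success_rate ** steps


print(attempts_for_target(0.78, 0.999))    # 5 attempts (0.22**5 is about 0.0005)
print(round(chained_success(0.78, 5), 3))  # 0.289 for a five-task chain
```

The second number is the more worrying one: without retries, a workflow that chains five such tasks would finish cleanly less than a third of the time.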

Safety and Alignment: Autonomous agents that can take actions in the world (e.g., deleting files, sending emails, making purchases) pose new risks. The paper mentions a 'critic' model that checks for errors, but this is focused on technical correctness, not ethical alignment. An agent that is technically correct but ethically wrong (e.g., scraping copyrighted data, generating biased code) is a major concern. The industry lacks robust frameworks for 'agentic alignment.'

Economic Displacement: The paper's vision of AI as 'employees' rather than 'tools' raises the specter of job displacement. If agents can reliably complete complex tasks, the demand for human workers in knowledge-intensive roles (software development, data analysis, legal research) could plummet. The paper does not address this, but it is the elephant in the room.

Evaluation Gaps: The paper uses SWE-bench and AgentBench, but these benchmarks are narrow. They measure coding and simple tool use, not the full range of human tasks. There is no benchmark for 'long-horizon planning' (tasks that take hours or days) or 'multi-agent coordination' (tasks requiring multiple agents to collaborate). The field needs new evaluation frameworks.

Open Questions:
- How do we prevent agents from 'gaming' the task-completion metric (e.g., producing a technically correct but useless output)?
- What happens when two agents are given conflicting tasks?
- How do we audit an agent's decision-making process after the fact?

AINews Verdict & Predictions

The paper is a landmark—not because it reveals a new technology, but because it formalizes a shift that has been underway for two years. The era of 'next-token prediction' as the dominant AI paradigm is ending. The era of 'task completion' is beginning.

Our Predictions:
1. By Q2 2027, every major LLM provider will offer an 'agent mode' as a standard product tier. The conversational chatbot will become a legacy interface, replaced by task-oriented dashboards.
2. Per-task pricing will become the norm for enterprise AI by 2028. Companies will pay $0.50 for a completed expense report, $5.00 for a bug fix, and $50.00 for a legal brief. This will create a 'task economy' where AI agents compete on reliability and price.
3. The open-source agent ecosystem will surpass proprietary models on SWE-bench by Q1 2028. The commoditization of agent architectures, combined with community-generated task-completion data, will erode the proprietary advantage.
4. A major safety incident involving an autonomous agent will occur within 18 months. An agent will make a decision that causes financial or reputational harm, triggering regulatory scrutiny. This will slow adoption but ultimately lead to better safety standards.

What to Watch:
- The next release of GPT-5: Will it include native agent capabilities?
- The evolution of SWE-bench: Will it expand to cover non-coding tasks?
- The regulatory response: Will governments classify autonomous agents as 'workers' subject to labor laws?

The paper from OpenAI is a signal flare. The industry is pivoting. The question is no longer whether AI can talk—it's whether it can do. And the answer, increasingly, is yes.
