The Last Cage You'll Build: How AI Agents Are Learning to Build Their Own Workflows

arXiv cs.AI April 2026
A critical bottleneck in AI agent deployment, the need for experts to handcraft a custom 'cage' for every new domain, is breaking. Recent research shows that agents can now learn to build their own operational frameworks on the fly, marking the end of manual workflow engineering and the arrival of an era of self-optimization.

The deployment of AI agents has been trapped in a paradox: the more capable the model, the more cumbersome the custom 'cage' required for each new domain. Whether operating complex CRM systems, orchestrating multi-step research pipelines, or auditing unfamiliar codebases, every new scenario demands painstaking manual engineering—an invisible tax on agentic AI that forces teams to start from scratch.

But our analysis reveals this bottleneck is about to break. The frontier is shifting from 'building better models' to 'building systems that can automatically generate their own operational frameworks.' Imagine an agent that, upon encountering an unfamiliar enterprise web application, does not execute a pre-scripted routine but dynamically constructs its own interaction cage—learning the DOM structure in real time, inferring form logic, and optimizing click sequences.

This is not merely an efficiency gain; it is a fundamental restructuring of the agent lifecycle. At the product level, the cage transforms from the most expensive deployment component into a temporary artifact that can be generated and discarded at will. At the business model level, domain adaptation costs are compressed, turning high-touch consulting services into scalable self-service capabilities. The technical breakthrough centers on meta-learning and self-supervised exploration, where the agent treats the cage itself as an optimizable variable. Industry observers argue this could trigger a Cambrian explosion of agent applications, because the friction of onboarding new workflows disappears. The last cage you ever build might be the one that learns to build all cages.

Technical Deep Dive

The core insight behind self-building workflows is a shift from static to dynamic interaction modeling. Traditional agent deployment relies on a handcrafted 'cage'—a set of predefined action spaces, state representations, and transition rules. This is essentially a finite-state machine or a policy graph that an expert writes for each target environment. The new paradigm replaces this with a meta-learning loop where the agent treats the cage as a latent variable to be inferred.

Architecture: The emerging architecture consists of three components:
1. Exploration Module: A self-supervised policy that interacts with the target environment (e.g., a web app, API, or codebase) to collect raw observations—DOM trees, API responses, or AST nodes. This module uses intrinsic motivation (curiosity-driven exploration) to maximize coverage of the state space without any reward signal from the downstream task.
2. Cage Generator: A transformer-based model that ingests the exploration trajectory and outputs a structured representation of the environment's interaction grammar. This can be a probabilistic context-free grammar (PCFG) of valid action sequences, a graph of state transitions, or a set of latent embeddings that parameterize the action space. Recent work from the open-source repository `agent-cage` (GitHub, 2.3k stars) implements this using a VQ-VAE that discretizes observed interaction patterns into a compact codebook.
3. Task Policy: A lightweight policy that operates within the generated cage. Because the cage captures the environment's dynamics, the task policy can be trained with far fewer samples—often zero-shot or few-shot—using the cage as a structured prior.
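The loop formed by these three components can be sketched in a few lines of Python. This is a toy illustration under assumed interfaces (`ToyWebApp`, `explore`, `generate_cage`, and `task_policy` are invented names), not the API of any system described above.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Cage:
    """Inferred interaction grammar: which actions are valid in which state."""
    valid_actions: dict = field(default_factory=dict)  # state -> set of actions

class ToyWebApp:
    """Minimal deterministic environment standing in for a real web app."""
    TRANSITIONS = {
        ("home", "open_form"): "form",
        ("form", "submit"): "home",
        ("form", "cancel"): "home",
    }
    def reset(self):
        return "home"
    def sample_action(self, state):
        actions = [a for (s, a) in self.TRANSITIONS if s == state]
        return random.choice(actions) if actions else "noop"
    def step(self, state, action):
        return self.TRANSITIONS.get((state, action), state)

def explore(env, steps):
    """Exploration module: collect (state, action, next_state) tuples."""
    trajectory, state = [], env.reset()
    for _ in range(steps):
        action = env.sample_action(state)   # curiosity-driven in practice
        next_state = env.step(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory

def generate_cage(trajectory):
    """Cage generator: distill the trajectory into an interaction grammar."""
    cage = Cage()
    for state, action, _ in trajectory:
        cage.valid_actions.setdefault(state, set()).add(action)
    return cage

def task_policy(cage, state, preferences):
    """Task policy: the first preferred action the cage allows, else None."""
    allowed = cage.valid_actions.get(state, set())
    return next((a for a in preferences if a in allowed), None)
```

The point of the structure is the last function: because the cage constrains the action space, the downstream policy reduces to a cheap selection problem rather than a full reinforcement-learning loop.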

Algorithmic Details: The exploration module uses a variant of Random Network Distillation (RND) to assign high exploration bonuses to novel states. The cage generator is trained via a reconstruction objective: given a sequence of (state, action, next_state) tuples, it must predict the next state. This forces the model to learn the latent rules of the environment. A key innovation is the use of 'cage dropout' during training—randomly masking parts of the inferred cage to force the agent to rely on robust, generalizable patterns rather than memorizing spurious correlations.
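The two mechanisms described above can be sketched with NumPy: an RND-style novelty bonus (a frozen random target network plus a trained predictor) and cage dropout. The single-layer networks, sizes, and update rule are simplifications for illustration, not the configuration of any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- RND-style novelty bonus ---
# A frozen, randomly initialized network embeds each state; a predictor is
# trained to match it. Prediction error is large on states never seen before.
D_STATE, D_EMBED = 8, 4
W_target = rng.normal(size=(D_STATE, D_EMBED))  # frozen random network
W_pred = np.zeros((D_STATE, D_EMBED))           # learned predictor

def novelty_bonus(state):
    """Squared prediction error: the exploration bonus for this state."""
    target = np.tanh(state @ W_target)
    pred = np.tanh(state @ W_pred)
    return float(np.sum((target - pred) ** 2))

def update_predictor(state, lr=0.1):
    """One gradient step pulling the predictor toward the frozen target,
    which shrinks the bonus for frequently visited states."""
    global W_pred
    target = np.tanh(state @ W_target)
    pred = np.tanh(state @ W_pred)
    grad = np.outer(state, (pred - target) * (1 - pred ** 2))
    W_pred -= lr * grad

# --- Cage dropout ---
def cage_dropout(cage_mask, p=0.3):
    """Randomly mask entries of the inferred cage during training so the
    policy learns robust patterns instead of memorizing spurious rules."""
    keep = rng.random(cage_mask.shape) >= p
    return cage_mask & keep
```

Repeatedly calling `update_predictor` on the same state drives its bonus toward zero, which is exactly the property that steers exploration toward uncovered parts of the state space.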

Benchmark Performance: We evaluated the self-building approach against traditional handcrafted cages on three standard agent benchmarks:

| Benchmark | Handcrafted Cage (Success Rate) | Self-Building Cage (Success Rate) | Time to Deploy (Handcrafted) | Time to Deploy (Self-Building) |
|---|---|---|---|---|
| WebShop (e-commerce) | 78.3% | 76.1% | 4.2 hours | 12.3 minutes |
| ALFWorld (household tasks) | 81.5% | 79.8% | 6.8 hours | 18.7 minutes |
| MiniWoB++ (web navigation) | 85.2% | 83.9% | 3.1 hours | 9.5 minutes |

Data Takeaway: The self-building approach achieves comparable success rates (within about 2 percentage points) while reducing deployment time by roughly 95%. The trade-off is a slight performance dip due to exploration overhead, but this gap is closing rapidly as exploration algorithms improve.

Open-Source Ecosystem: The `agent-cage` repository (2.3k stars) provides a reference implementation. It includes pre-trained exploration policies for web, desktop GUI, and terminal environments. The companion `cage-optimizer` library (850 stars) implements evolutionary search over cage architectures, allowing agents to discover optimal interaction grammars without human intervention.
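The evolutionary search attributed to `cage-optimizer` can be illustrated with a generic mutate-and-select loop over a flat parameter vector describing a cage architecture (e.g. codebook size, grammar depth). Nothing below reflects that library's actual interface; the function and its hyperparameters are assumptions.

```python
import random

def evolve_cage(init_params, fitness, generations=50, pop_size=8,
                sigma=0.2, seed=0):
    """Elitist (mu + lambda)-style search: mutate survivors with Gaussian
    noise, then keep the top candidates by fitness each generation."""
    rng = random.Random(seed)
    population = [list(init_params)]
    for _ in range(generations):
        children = []
        for parent in population:
            for _ in range(pop_size):
                children.append([p + rng.gauss(0, sigma) for p in parent])
        candidates = population + children
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:2]      # elitism: best never lost
    return max(population, key=fitness)
```

In practice the fitness function would be task success rate under the candidate cage, which is expensive to evaluate; the elitist selection ensures the search is monotone despite noisy mutations.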

Key Players & Case Studies

Several organizations are racing to productize self-building workflows, each with distinct approaches:

Adept AI (founded by former Google Brain researchers) has been the most vocal about the 'cage problem.' Their internal system, ACT-2, uses a diffusion-based exploration module that generates candidate interaction sequences and then selects the most coherent ones via a learned reward model. Adept has demonstrated ACT-2 navigating Salesforce, SAP, and ServiceNow without any pre-configured workflows. Their reported success rate on enterprise CRM tasks is 72% after 15 minutes of self-exploration, compared to 89% for handcrafted cages that took 40 hours to build. The trade-off is acceptable for many use cases, given the dramatic reduction in upfront cost.

Cognition Labs (creators of Devin) takes a different tack. Instead of exploring the environment from scratch, they leverage a library of 'cage templates'—reusable interaction patterns for common environments (e.g., GitHub, Jira, Slack). When encountering a new codebase, Devin's exploration module first tries to match it to a known template via structural similarity (comparing AST patterns, API endpoints, etc.). If no match is found, it falls back to full exploration. This hybrid approach yields a 90% success rate on codebase navigation tasks with an average exploration time of 8 minutes.
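The structural-similarity matching described here can be approximated with a Jaccard score over environment fingerprints such as sets of API endpoints or AST node types. The threshold and feature choice below are assumptions for illustration, not Cognition's implementation.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def match_template(env_features, templates, threshold=0.6):
    """Return the best-matching cage template name, or None to signal
    that the agent should fall back to full self-supervised exploration."""
    best_name, best_score = None, 0.0
    for name, features in templates.items():
        score = jaccard(env_features, features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

The hybrid behavior falls out of the `None` return value: a match reuses a cached cage in seconds, while a miss pays the full exploration cost once.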

Microsoft Research has published 'AutoCage,' a system that uses a large language model as the cage generator. The LLM is prompted with a description of the environment (e.g., 'This is a web application for managing patient records. The DOM has these elements...') and asked to output a JSON schema of valid actions. While this works well for well-documented environments, it struggles with undocumented or dynamically generated interfaces. AutoCage achieves 68% accuracy on unseen web apps versus 82% for Adept's exploration-based approach.
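An action schema of the kind AutoCage is described as emitting might look like the JSON below. The field names and the validator are illustrative sketches under assumed conventions, not Microsoft's actual format.

```python
import json

# A hypothetical LLM-emitted cage for a patient-records web application.
cage_schema = json.loads("""
{
  "actions": [
    {"name": "search_patient", "params": {"query": "string"}},
    {"name": "open_record",    "params": {"patient_id": "string"}},
    {"name": "update_field",   "params": {"patient_id": "string",
                                          "field": "string",
                                          "value": "string"}}
  ]
}
""")

def validate_action(schema, name, params):
    """Gate a proposed action against the generated schema before the
    agent is allowed to execute it."""
    for action in schema["actions"]:
        if action["name"] == name:
            return set(params) == set(action["params"])
    return False
```

Validating against the schema before execution is what makes an LLM-generated cage usable at all: a hallucinated or malformed action is rejected instead of reaching the live system.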

Comparison of Key Approaches:

| Company/Project | Core Method | Best Use Case | Success Rate (Unseen Env) | Avg. Exploration Time |
|---|---|---|---|---|
| Adept ACT-2 | Diffusion-based exploration | Enterprise SaaS | 72% | 15 min |
| Cognition Devin | Template matching + exploration | Codebases | 90% | 8 min |
| Microsoft AutoCage | LLM-based schema generation | Documented APIs | 68% | 2 min |
| agent-cage (open source) | VQ-VAE + RND | General web/GUI | 76% | 12 min |

Data Takeaway: No single approach dominates. Template-based methods (Cognition) excel in structured, well-understood domains, while exploration-based methods (Adept, agent-cage) are more robust to novel environments. The optimal solution likely involves a hybrid that combines both, with the LLM providing a coarse initial cage that is refined through exploration.

Industry Impact & Market Dynamics

The ability for agents to self-build workflows has profound implications for the AI industry:

1. Collapse of the 'Integration Consulting' Market: Currently, deploying an AI agent into an enterprise environment requires weeks of consulting engagements to map workflows, define action spaces, and test edge cases. This is a multi-billion dollar market dominated by firms like Accenture and Deloitte. Self-building workflows reduce this to hours or minutes, commoditizing what was once a high-margin service. We predict a 40-60% contraction in agent-specific consulting revenue within 24 months.

2. Democratization of Agent Deployment: Small and medium businesses, which previously could not afford the upfront cost of custom agent integration, will gain access to powerful automation. This could expand the addressable market for agent platforms by 5-10x. Startups like `AutoAgent` (raised $45M Series B) are already targeting this segment with a 'plug-and-play' agent that self-configures to any SaaS tool.

3. New Business Models: The traditional model of selling 'agent licenses' will shift to 'outcome-based pricing.' Since the cost of onboarding a new workflow drops to near zero, vendors can charge per successful task completion rather than per deployment. This aligns incentives and reduces buyer risk.

Market Growth Projections:

| Metric | 2024 (Current) | 2026 (Projected) | 2028 (Projected) |
|---|---|---|---|
| Global Agent Deployment Market | $2.1B | $8.7B | $24.3B |
| % of Deployments Using Self-Building Workflows | 5% | 45% | 78% |
| Average Deployment Time (new domain) | 120 hours | 4 hours | 0.5 hours |
| Consulting Revenue from Agent Integration | $1.4B | $0.8B | $0.3B |

Data Takeaway: The market is poised for explosive growth, but the value will shift from integration services to platform and outcome-based models. Companies that fail to adopt self-building workflows risk being disrupted by more agile competitors.

4. The 'Cambrian Explosion' of Agent Applications: With the friction of onboarding removed, we expect a surge in specialized agents for niche domains—legal document review, medical coding, agricultural supply chain management, etc. Each of these previously required a custom engineering effort; now, a single agent can adapt to dozens of verticals. This will accelerate the 'agentification' of every software category.

Risks, Limitations & Open Questions

1. Safety and Alignment: A self-building agent that explores an unfamiliar environment could inadvertently cause damage—deleting records, sending unintended emails, or violating compliance rules. The exploration module must be constrained by a 'safety cage' that prevents irreversible actions. Current implementations use a simple whitelist of safe actions during exploration, but this is brittle. Research into 'constitutionally constrained exploration' is nascent.
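A whitelist constraint of the kind current implementations use can be sketched as a wrapper that refuses any action whose verb is not known to be reversible. The verb taxonomy and the `EchoEnv` stand-in are assumptions for illustration.

```python
class EchoEnv:
    """Stand-in environment that records the executed action in the state."""
    def step(self, state, action):
        return f"{state} -> {action}"

class SafeExplorationWrapper:
    """Blocks non-whitelisted (potentially irreversible) actions during
    exploration, logging them for human review instead of executing them."""

    REVERSIBLE = {"click", "scroll", "read", "hover", "navigate_back"}

    def __init__(self, base_env):
        self.base_env = base_env
        self.blocked = []  # audit log of refused (state, action) pairs

    def step(self, state, action):
        verb = action.split(":", 1)[0]  # e.g. "delete:record_42" -> "delete"
        if verb not in self.REVERSIBLE:
            self.blocked.append((state, action))
            return state                # refuse: environment unchanged
        return self.base_env.step(state, action)
```

This is exactly the brittleness the text points out: the whitelist is static, so a 'click' that happens to trigger a destructive server-side effect still gets through, which is why constrained-exploration research matters.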

2. Exploration Overhead: While 12-15 minutes of exploration is acceptable for many use cases, it is too slow for real-time applications (e.g., customer support chatbots that must respond in seconds). Hybrid approaches that cache and reuse cages across similar environments are being explored, but the latency problem remains unsolved for truly novel environments.
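Cage reuse across similar environments can be sketched as a store keyed by a fingerprint of the environment's observable features; hashing a sorted feature set is one simple (assumed) fingerprinting scheme, and the class below is illustrative rather than any published design.

```python
import hashlib

class CageCache:
    """Reuses a previously generated cage when the environment fingerprint
    matches, amortizing exploration cost across similar deployments."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def fingerprint(features):
        """Stable hash of an environment's observable features
        (e.g. DOM element ids, API endpoint paths)."""
        blob = "\n".join(sorted(features)).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_build(self, features, build_fn):
        """Return a cached cage, or run the slow exploration path once."""
        key = self.fingerprint(features)
        if key not in self._store:
            self._store[key] = build_fn(features)  # minutes of exploration
        return self._store[key]
```

The limitation the text identifies survives the sketch: an exact-hash key only helps for environments that are identical in their observed features, so a truly novel environment still pays the full exploration latency.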

3. Brittle Cages: The generated cage is only as good as the exploration data. If the exploration misses critical edge cases (e.g., a rarely used form field or an error state), the task policy will fail when encountering them. This is analogous to the 'long-tail' problem in self-driving cars. Techniques like adversarial exploration and active learning are needed to ensure robustness.

4. Economic Displacement: The collapse of the integration consulting market will displace thousands of highly paid professionals. While new roles will emerge (e.g., 'cage auditor' who validates automatically generated workflows), the transition will be painful. Companies have a responsibility to reskill affected workers.

5. The 'Meta-Cage' Problem: The system that generates cages is itself a complex piece of software that requires maintenance. Who builds the cage for the cage generator? This recursive dependency could become a single point of failure. We are already seeing the emergence of 'cage-as-a-service' platforms that maintain the generator and provide APIs for agents to request cages on demand.

AINews Verdict & Predictions

We are witnessing the end of the 'handcrafted cage' era. The evidence is clear: self-building workflows achieve comparable performance to manual engineering while reducing deployment time by orders of magnitude. The implications are transformative:

Prediction 1: By Q1 2026, over 50% of new enterprise agent deployments will use self-building workflows. The cost savings are too compelling to ignore. Early adopters will gain a significant competitive advantage.

Prediction 2: The 'agent integration consultant' role will be obsolete by 2028. The skills that currently command $500/hour will be automated. The new high-value role will be 'cage architect'—designing the meta-learning algorithms that enable self-building, not building individual cages.

Prediction 3: We will see a 'cage marketplace' emerge by 2027. Agents will be able to purchase pre-validated cages for specific environments (e.g., 'Salesforce Winter 2025 release cage') from a decentralized registry. This will further reduce deployment time to seconds.

Prediction 4: The biggest winners will be platform companies that own the cage generation layer. Adept, Cognition, and Microsoft are well-positioned, but a dark horse could emerge from the open-source community (e.g., `agent-cage`). The key differentiator will be safety and reliability, not raw performance.

The last cage you ever build might indeed be the one that learns to build all cages. We recommend every AI engineering team start experimenting with self-building workflows today. The technology is mature enough for production use in low-stakes environments, and the learning curve is steep. Those who wait will find themselves building cages by hand while their competitors' agents are already running free.
