AI Agents Are Not a Scam, But the Hype Is Dangerous: A Deep Dive

Hacker News May 2026
The AI industry is shifting from chatbots to autonomous agents, but a growing number of critics are calling the boom an elaborately packaged scam. AINews investigated the technical reality behind these claims and found brittle systems that fail easily in real-world environments, along with business models that may be eroding user trust.

The shift from conversational AI to autonomous agents has been heralded as the next great leap, promising systems that can plan, execute multi-step tasks, and operate independently. Yet, a sobering reality is emerging: most current products are little more than brittle chains of API calls wrapped in a thin layer of LLM orchestration. They lack genuine world models, causal reasoning, and robust memory, collapsing at the first sign of an unexpected input. This article dissects the core technical limitations—from the absence of true planning to the failure of long-horizon tasks—and examines the business incentives driving the hype. We profile key players like OpenAI, Anthropic, and startups such as Adept and Imbue, comparing their approaches and actual track records. Market data reveals a frenzy of investment—over $8 billion in agent-focused startups in 2024 alone—yet user satisfaction surveys show that over 60% of deployed agents require human intervention within the first five steps. The conclusion is clear: the agent revolution is real, but it is years away. The current wave is a dangerous overpromise that risks a major credibility crash for the entire AI industry.

Technical Deep Dive

The core of the AI agent problem lies in a fundamental architectural mismatch. Current agents are built by wrapping a Large Language Model (LLM) in a loop: observe the environment (e.g., a desktop screen or API response), reason about the next action, execute it, and observe the result. This is the ReAct (Reasoning + Acting) pattern, introduced in a 2022 paper by researchers at Princeton and Google. While elegant in theory, it is a pattern-matching system, not a reasoning engine.
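The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `llm` and `tool` callables stand in for a real model API and tool runtime, and the key point is that all "state" lives in a growing text transcript.

```python
def react_loop(goal: str, llm, tool, max_steps: int = 10) -> str:
    """Run a bare ReAct loop. `llm` maps the transcript so far to the next
    proposed action (or a final answer); `tool` executes an action string
    and returns an observation."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # Reason: the model proposes the next action or a final answer.
        proposal = llm(transcript)
        if proposal.startswith("FINAL:"):
            return proposal.removeprefix("FINAL:").strip()
        # Act, then observe: the result is appended to the transcript.
        # There is no world model or structured memory beyond this text.
        observation = tool(proposal)
        transcript += f"Action: {proposal}\nObservation: {observation}\n"
    return "step budget exhausted"
```

Everything downstream of this section — planning, memory, reliability — is a consequence of how little machinery this loop actually contains.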

The Planning Mirage: True autonomous agents require hierarchical planning—breaking a complex goal into sub-goals, executing them, and backtracking when a sub-goal fails. Current LLMs cannot do this reliably. They generate a plan, but it is a single-shot, linear sequence. When step 3 fails, the agent cannot re-plan; it either retries the same failed action or collapses. A 2024 study from Princeton showed that GPT-4-based agents failed on 78% of tasks requiring more than 5 sequential steps with branching dependencies. The agents simply lost track of the overall objective.
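For contrast, here is a hedged sketch of what the missing capability looks like in classical terms: decompose a goal into sub-goals, and when one fails, re-decompose the parent rather than blindly retrying the same linear sequence. The `decompose` and `attempt` callables are hypothetical placeholders for a planner and an executor.

```python
def solve(goal, decompose, attempt, max_replans: int = 3) -> bool:
    """Hierarchical execution with backtracking. `decompose(g)` proposes an
    ordered list of sub-goals ([] if g is primitive); `attempt(g)` tries a
    primitive goal and returns True/False. On failure the parent is
    re-decomposed -- the re-planning step single-shot LLM plans lack."""
    for _ in range(max_replans):
        subgoals = decompose(goal)
        if not subgoals:
            # Primitive goal: try it directly, retry on failure.
            if attempt(goal):
                return True
            continue
        if all(solve(g, decompose, attempt, max_replans) for g in subgoals):
            return True
        # A sub-goal failed: fall through and re-plan from this node.
    return False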

The Memory Hole: Another critical failure is memory. Agents need to remember what they did, what they learned, and the state of the world. Most implementations use a simple sliding window of the last N interactions. This is insufficient for tasks like managing a software project or conducting a multi-day research assignment. Open-source projects like AutoGPT (now with over 165,000 GitHub stars) and BabyAGI (over 22,000 stars) attempted to solve this with vector databases for long-term memory, but they remain experimental. The fundamental issue is that LLMs have no inherent mechanism for episodic memory—they cannot distinguish between a fact they just learned and a hallucination.
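The sliding-window pattern criticized above is trivially easy to implement, which is part of why it is so common. A minimal sketch, assuming a fixed window of N turns: anything older than the window, including facts the agent itself established early in a multi-day task, is silently dropped.

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the last `n` interaction turns; older turns vanish."""

    def __init__(self, n: int):
        # deque with maxlen silently evicts the oldest entry when full.
        self.turns = deque(maxlen=n)

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def context(self) -> str:
        # This string is all the "memory" the LLM sees on the next step.
        return "\n".join(self.turns)
```

Vector-database approaches (as in AutoGPT and BabyAGI) bolt retrieval onto this, but retrieval over embeddings still cannot tell the model whether a retrieved "fact" was observed or hallucinated.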

Benchmark Performance vs. Real-World Reliability:

| Benchmark | Task Type | GPT-4 Agent (ReAct) | Claude 3.5 Agent (ReAct) | Human Baseline |
|---|---|---|---|---|
| WebArena (Web Tasks) | E-commerce checkout, flight booking | 14.2% success | 12.8% success | 78.3% success |
| SWE-bench (Software Engineering) | Fix bugs, implement features | 3.2% resolved | 4.5% resolved | 45.0% resolved |
| AgentBench (Multi-domain) | OS, database, web, games | 27.1% score | 29.8% score | 85.0% score |

Data Takeaway: The gap between agent performance and human performance is not incremental—it is a chasm. On the most realistic benchmarks (WebArena, SWE-bench), the best agents succeed less than 15% of the time. This is not a product; it is a prototype.

The GitHub Reality: A scan of the most popular agent repositories reveals the truth. LangChain (over 95,000 stars) provides the tooling to build agents, but its own documentation warns that agents are "experimental" and "not production-ready." CrewAI (over 25,000 stars) offers multi-agent orchestration, yet its issue tracker is filled with reports of agents getting stuck in infinite loops or misinterpreting tool outputs. The open-source community is honest about the limitations; the commercial sector is not.
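The infinite-loop failure mode reported in these issue trackers has a cheap partial mitigation that illustrates how shallow the problem is: watch the action history for a short repeating cycle and bail out. This is an illustrative sketch, not code from any of the named frameworks.

```python
def detect_loop(history: list[str], window: int = 2, repeats: int = 3) -> bool:
    """Return True if the last `window` actions have repeated `repeats`
    times in a row -- a heuristic guard against a stuck agent."""
    needed = window * repeats
    if len(history) < needed:
        return False
    tail = history[-needed:]
    cycle = tail[-window:]
    # Check every window-sized chunk of the tail against the last cycle.
    return all(tail[i:i + window] == cycle for i in range(0, needed, window))
```

That such guards are needed at all, rather than the agent noticing it is making no progress, underlines the absence of genuine self-monitoring.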

Key Players & Case Studies

The agent space is crowded, but a few players define the narrative.

OpenAI: The company that started the agent hype with its Code Interpreter (now Advanced Data Analysis) and the GPT-4 function calling API. Their approach is the most pragmatic: they provide the building blocks (LLM, tools, memory) but leave the agent orchestration to developers. Their recent work on "deep research" agents shows promise but is limited to information synthesis, not real-world action. The strategy is to own the platform, not the application.

Anthropic: With Claude 3.5, they introduced "computer use"—an agent that can control a desktop cursor. It was a bold demo, but early users report it is painfully slow (minutes per action) and often clicks the wrong button. Anthropic's strength is safety, but their agent is too cautious to be useful. They are betting on a future where agents are safe by design, but that future is not here.

Adept AI: Founded by former Google researchers, Adept raised $350 million to build an agent that can use any software. Their demo of "ACT-1" was impressive, but the product has not shipped at scale. The challenge is generalization: the agent works well on the 50 apps it was trained on, but fails on the millions it wasn't. Adept is now pivoting to enterprise custom agents, admitting that a universal agent is a decade away.

Imbue (formerly Generally Intelligent): This startup raised $200 million to build agents that can reason. Their approach is to train foundation models specifically for agentic tasks, not just language. They have published research on causal reasoning in agents, but have no public product. Their thesis is that the current LLM architecture is fundamentally wrong for agency.

Comparison of Commercial Agent Platforms:

| Platform | Core Approach | Strengths | Weaknesses | Pricing Model |
|---|---|---|---|---|
| OpenAI Assistants API | LLM + tool use | Easy to start, strong models | No long-term planning, high latency | Per-token + tool usage |
| Anthropic Claude (Computer Use) | Desktop control | Novel interface, safety-first | Extremely slow, high error rate | Per-token + compute time |
| Microsoft Copilot (Agents) | Graph-based orchestration | Enterprise integration, data grounding | Rigid, requires extensive configuration | Per-seat subscription |
| Salesforce Agentforce | Pre-built workflows | CRM-specific, low-code | Limited to Salesforce ecosystem | Per-conversation pricing |

Data Takeaway: No platform offers a general-purpose, reliable agent. Each is optimized for a narrow use case and requires significant human oversight. The "autonomy" is an illusion.

Industry Impact & Market Dynamics

The disconnect between technical reality and market hype is creating a dangerous bubble. According to PitchBook, venture capital investment in AI agent startups reached $8.2 billion in 2024, up 340% year-over-year. This includes rounds for companies like Cognition AI (makers of Devin, the "AI software engineer") which raised $175 million at a $2 billion valuation despite Devin's widely documented failures on real-world tasks.

The Enterprise Adoption Trap: Enterprises are being sold a vision of autonomous operations. A Gartner survey from Q1 2025 found that 42% of organizations had deployed an AI agent in production, but 67% reported that the agent required more human oversight than the manual process it replaced. The net productivity gain is negative. This is creating a backlash: several Fortune 500 companies have publicly paused agent deployments after embarrassing failures, including one retailer whose agent accidentally ordered $10,000 worth of office supplies.

Market Growth vs. Satisfaction:

| Metric | 2023 | 2024 | 2025 (Projected) |
|---|---|---|---|
| Global AI Agent Market Size | $4.1B | $8.7B | $18.5B |
| % of Enterprises Deploying Agents | 12% | 42% | 65% |
| User Satisfaction (Very Satisfied) | 34% | 22% | 18% |
| Average Human Interventions per Task | 1.2 | 3.4 | 5.1 |

Data Takeaway: The market is growing, but user satisfaction is plummeting. The more agents are deployed, the more their limitations become apparent. This is the classic hype cycle peak of inflated expectations, and the trough of disillusionment is imminent.

Risks, Limitations & Open Questions

The most immediate risk is a trust collapse. When users pay for "autonomous" agents that require constant babysitting, they feel scammed. This could poison the well for future, more capable systems.

Technical Risks:
- Brittleness: Agents fail catastrophically on edge cases. A minor UI change in a website can break an agent that was working perfectly.
- Cost: Long-running agents can rack up enormous API bills. A single failed research task can cost hundreds of dollars in compute.
- Security: Agents with access to tools (email, databases, payment systems) are a massive attack surface. A prompt injection attack could turn an agent into a malicious insider.
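Two of the risks above, runaway cost and prompt injection, share a cheap first line of defense: an explicit tool allow-list and a hard spending cap checked before every action. A minimal sketch with hypothetical tool names; real deployments would also sandbox execution and log every call.

```python
ALLOWED_TOOLS = {"search", "read_file"}  # deliberately excludes email, payments

def guarded_execute(action: str, spent_usd: float, budget_usd: float = 5.0) -> str:
    """Gate an agent action: hard-stop on budget overrun, and refuse any
    tool not on the allow-list to limit prompt-injection blast radius."""
    if spent_usd >= budget_usd:
        raise RuntimeError("budget exhausted; human review required")
    # Parse the tool name from an action like "search(query)".
    tool = action.split("(", 1)[0].strip()
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} not on the allow-list")
    return tool  # a real agent would dispatch to the sandboxed tool here
```

Neither guard makes an agent trustworthy; they merely cap the damage when, not if, it misbehaves.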

Open Questions:
1. Is the LLM architecture sufficient for agency? Or do we need a new paradigm, like a neural-symbolic system that combines deep learning with classical planning?
2. How do we evaluate agents? Current benchmarks are too narrow. We need long-horizon, open-ended evaluations that measure robustness, not just accuracy.
3. Who is liable when an agent makes a mistake? If an agent deletes a company's database, is it the user, the developer, or the LLM provider?

AINews Verdict & Predictions

Verdict: AI agents are not a scam in the malicious sense, but the current hype is a dangerous overpromise. The technology is real and will eventually transform industries, but it is at least 3–5 years away from being reliable enough for unsupervised use. The companies selling "autonomous agents" today are selling a prototype as a finished product. That is a business model built on deception, even if unintentional.

Predictions:
1. The trough of disillusionment will hit in late 2025. Major enterprise deployments will be scaled back, and several high-profile agent startups will fail or be acquired for pennies on the dollar.
2. The survivors will be those who focus on narrow, high-value use cases (e.g., automated testing, data entry, customer support triage) rather than general-purpose autonomy.
3. The next breakthrough will come from new architectures, not bigger LLMs. Look for research on "world models" and "causal reasoning" from labs like DeepMind and Imbue. The agent that works will not be a chatbot with tools; it will be a fundamentally different system.
4. Regulation will accelerate. Expect the EU and US to propose rules requiring disclosure when an AI agent is acting autonomously, and for companies to be held liable for agent failures.

What to watch: The open-source community. Projects like CrewAI and AutoGPT are iterating faster than commercial labs. If a breakthrough in agent reliability happens, it will likely come from a GitHub repository, not a press release.
