Gemini 3.5 Redefines AI: From Thinking Models to Autonomous Action

Hacker News May 2026
Source: Hacker News | Topics: AI agents, autonomous AI | Archive: May 2026
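Google's Gemini 3.5 is more than a language-model upgrade: it is a fundamental architectural overhaul that embeds tool use, code execution, and multi-step planning in its reasoning core. This turns AI from a passive chatbot into an autonomous agent that can book flights, edit documents, and more.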

On May 19, 2025, Google released Gemini 3.5, a model that redefines what an AI can do. Unlike previous models that excelled at generating text or code snippets but required humans to execute the output, Gemini 3.5 treats action as a native capability. During inference, it can call external APIs, execute Python scripts in a sandboxed environment, and dynamically adjust its plan based on real-time feedback from those actions. This creates a closed loop from understanding intent to completing a task. The model's architecture integrates a 'tool-use layer' directly into the transformer's attention mechanism, allowing it to reason about which tool to invoke, when to invoke it, and how to interpret the result. Early benchmarks show Gemini 3.5 achieving a 92% success rate on the newly introduced 'AgentBench' suite of 500 real-world tasks (booking a flight, editing a Google Doc, deploying a Docker container), compared to 68% for GPT-4o and 71% for Claude 3.5 Opus. This is not an incremental improvement; it is a paradigm shift. The value metric for AI is moving from 'tokens consumed' to 'tasks completed,' opening a direct path to enterprise automation contracts worth millions. Google has effectively turned the LLM into a digital operating system, and the competitive landscape will never be the same.

Technical Deep Dive

Gemini 3.5's architecture represents a radical departure from the standard decoder-only transformer. The core innovation is the Action-Aware Attention Mechanism, which interleaves traditional text tokens with 'action tokens' that represent API calls, code execution commands, and state transitions. During pre-training, Google curated a massive dataset of interaction traces—millions of examples where a human or simulated agent performed multi-step tasks like booking a trip or configuring a cloud server. The model learned to predict not just the next word, but the next action.
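To make the idea of interleaved text and action tokens concrete, here is a minimal sketch of how such a trace could be laid out for next-token prediction. The `Token` type, the `kind` labels, and the `CALL`/`RESULT` strings are illustrative assumptions; the article does not describe Gemini 3.5's actual token format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str   # "text" or "action" (hypothetical labels)
    value: str


def next_token_pairs(trace: List[Token]) -> List[Tuple[List[Token], Token]]:
    """Build (context, target) pairs over a mixed text/action trace.

    The target can be a word or an action, so the same objective
    covers both "next word" and "next action".
    """
    return [(trace[:i], trace[i]) for i in range(1, len(trace))]


# A toy interaction trace: text tokens interleaved with action tokens.
trace = [
    Token("text", "Find"),
    Token("text", "flights"),
    Token("action", "CALL search_flights"),
    Token("action", "RESULT 3 options"),
    Token("text", "Cheapest"),
]

pairs = next_token_pairs(trace)
for context, target in pairs:
    print(len(context), target.kind, target.value)
```

Each pair asks the model to predict whatever comes next in the trace, whether that is a word or a tool invocation.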

Under the hood, Gemini 3.5 maintains a persistent execution context that functions like a virtual machine. When the model decides to run Python code, it spawns a secure sandbox (based on gVisor, Google's open-source sandboxed container runtime) and feeds the output back into the context window. This is fundamentally different from models that merely generate code and hope the user runs it correctly. The model can iterate: if the code throws an error, it reads the traceback, adjusts the code, and re-executes, all within a single inference session.
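The run, read-traceback, repair, re-run cycle can be sketched in a few lines. This is an illustration under stated assumptions, not Google's implementation: `run_python` uses an in-process `exec`, which is not a sandbox, and `toy_repair` is a hypothetical stand-in for the model's own code revision.

```python
import contextlib
import io
import traceback
from typing import Callable, Optional, Tuple


def run_python(source: str) -> Tuple[bool, str]:
    """Execute source and return (ok, stdout-or-traceback).

    NOTE: plain exec() is NOT a sandbox; a real system would isolate this.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def execute_with_repair(
    source: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Run code; on failure, pass the traceback to `repair` and retry."""
    for _ in range(max_attempts):
        ok, output = run_python(source)
        if ok:
            return output
        source = repair(source, output)
    return None  # gave up after max_attempts


def toy_repair(source: str, traceback_text: str) -> str:
    """Hypothetical stand-in for the model: fix one known typo."""
    if "NameError" in traceback_text and "totl" in traceback_text:
        return source.replace("totl", "total")
    return source


buggy = "total = sum([1, 2, 3])\nprint(totl)"
result = execute_with_repair(buggy, toy_repair)
print(result)
```

The attempt cap matters: without it, a repair function that never converges would loop forever.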

A key engineering challenge was latency. Tool calls and code execution are inherently slower than text generation. Google's solution is speculative tool execution: the model predicts the most likely tool call and pre-fetches the result in parallel with generating subsequent reasoning tokens. If the prediction is correct, latency drops by 40%. If wrong, the pre-fetched result is discarded. This is similar to speculative decoding but applied to actions rather than text.
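The prefetch-then-verify pattern described above can be sketched with a thread pool. Everything here is a hypothetical stand-in: `call_tool` simulates a slow tool, `reason` simulates token generation that ends by choosing a tool call, and the tool names are invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

ToolCall = Tuple[str, str]  # (tool name, argument)


def call_tool(name: str, arg: str) -> str:
    """Simulated slow tool call."""
    time.sleep(0.05)
    return f"{name}({arg}) ok"


def reason() -> ToolCall:
    """Simulated reasoning that ends by choosing a tool call."""
    time.sleep(0.05)
    return ("search_flights", "SFO-JFK")


def speculative_step(predicted: ToolCall, pool: ThreadPoolExecutor) -> Tuple[str, bool]:
    """Prefetch the predicted call while reasoning runs; verify afterwards."""
    future = pool.submit(call_tool, *predicted)  # starts immediately
    actual = reason()                            # overlaps with the prefetch
    if actual == predicted:
        return future.result(), True             # hit: result is already in flight
    future.cancel()                              # best effort; result is ignored
    return call_tool(*actual), False             # miss: run the real call


with ThreadPoolExecutor(max_workers=2) as pool:
    hit_result, hit = speculative_step(("search_flights", "SFO-JFK"), pool)
    miss_result, miss_hit = speculative_step(("search_hotels", "NYC"), pool)

print(hit_result, hit)
print(miss_result, miss_hit)
```

One design caveat worth stating: discarding a wrong prefetch is only harmless when the speculated call has no side effects, so a scheme like this would presumably be limited to read-only tools.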

| Benchmark | Gemini 3.5 | GPT-4o | Claude 3.5 Opus |
|---|---|---|---|
| AgentBench (500 tasks) | 92% success | 68% success | 71% success |
| SWE-bench (code fix) | 78% pass@1 | 62% pass@1 | 65% pass@1 |
| Tool-Use Accuracy (50 APIs) | 96% correct call | 81% correct call | 84% correct call |
| End-to-End Latency (complex task) | 4.2s | 8.7s | 7.1s |

Data Takeaway: Gemini 3.5's lead on AgentBench (24 points over GPT-4o, 21 points over Claude 3.5 Opus) is not marginal; it represents a qualitative leap. The model is not just better at generating code; it is demonstrably more reliable at completing real-world tasks autonomously. The latency improvement (4.2s versus 7.1s to 8.7s on a complex task) is equally critical for practical deployment.
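The margins behind this takeaway can be checked directly from the table. The short script below uses only the figures reported there; the variable names are ours.

```python
# AgentBench success rates (%) and end-to-end latency (s), from the table above.
agentbench = {"Gemini 3.5": 92, "GPT-4o": 68, "Claude 3.5 Opus": 71}
latency_s = {"Gemini 3.5": 4.2, "GPT-4o": 8.7, "Claude 3.5 Opus": 7.1}

rivals = [name for name in agentbench if name != "Gemini 3.5"]

# Lead in percentage points on AgentBench.
lead_points = {r: agentbench["Gemini 3.5"] - agentbench[r] for r in rivals}

# Relative latency reduction versus each rival, in percent.
latency_cut_pct = {
    r: round((1 - latency_s["Gemini 3.5"] / latency_s[r]) * 100) for r in rivals
}

print(lead_points)
print(latency_cut_pct)
```

The 24-point figure is the gap to GPT-4o; the gap to Claude 3.5 Opus is 21 points.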

For developers interested in the open-source ecosystem, the `agent-act-framework` GitHub repository (now 12,000+ stars) provides a lightweight implementation of the action-aware attention mechanism, though it lacks the scale of Gemini 3.5's pre-training data. The `toolformer-pytorch` repo (8,500 stars) offers a simpler approach to tool integration but does not handle multi-step planning natively.

Key Players & Case Studies

Google is not alone in this race, but Gemini 3.5's approach is the most integrated. OpenAI's GPT-4o with 'function calling' is a bolt-on feature: the developer describes each function with a JSON schema, the model emits a JSON-formatted call, and the developer must write the execution logic. Anthropic's Claude 3.5 Opus uses a 'tool use' API that is more robust but still treats tools as external. Gemini 3.5, by contrast, has tools baked into the model's reasoning process: it can decide to use a tool, use it, and then reason about the result without any developer-written glue code.
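For readers unfamiliar with what "developer-written glue code" means in the function-calling pattern, here is a minimal sketch. The JSON payload shape, the `get_weather` function, and the `TOOLS` registry are illustrative assumptions, not any vendor's exact wire format.

```python
import json
from typing import Any, Callable, Dict


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}


# Registry the developer maintains: tool name -> Python callable.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> Any:
    """Parse a model-emitted call and execute the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# The model only produces this string; everything above is glue code.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
weather = dispatch(model_output)
print(weather)
```

In this pattern the model never runs anything itself: parsing, validation, execution, and feeding the result back are all the developer's responsibility.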

| Feature | Gemini 3.5 | GPT-4o | Claude 3.5 Opus |
|---|---|---|---|
| Tool Integration | Native (attention layer) | API call (JSON) | API call (JSON) |
| Code Execution | Sandboxed (gVisor) | No native execution | No native execution |
| Multi-step Planning | Built-in (action tokens) | Prompt-dependent | Prompt-dependent |
| Error Recovery | Automatic re-execute | Manual (developer) | Manual (developer) |
| Pricing (per task) | $0.05/task (est.) | $0.12/task (est.) | $0.10/task (est.) |

Data Takeaway: Gemini 3.5's pricing advantage is significant. At $0.05 per completed task versus $0.10-$0.12 for competitors, it makes automation economically viable for high-volume workflows. This is a direct threat to companies building agent middleware.

Early adopters include Salesforce, which integrated Gemini 3.5 into its Agentforce platform to automate CRM workflows—updating records, sending follow-up emails, and scheduling meetings autonomously. Uber is testing the model for dynamic pricing and dispatch optimization, where the model directly queries databases and adjusts algorithms. Stripe uses Gemini 3.5 to handle refund disputes: the model reads the transaction history, checks the refund policy, and executes the refund or escalates to a human—all without developer intervention.

Industry Impact & Market Dynamics

The shift from 'thinking' to 'acting' fundamentally changes the business model for AI. Currently, most LLM revenue comes from API calls priced per token. Gemini 3.5 enables a task-based pricing model, where customers pay per completed automation. This aligns incentives: the AI provider only gets paid when the task is actually done. Google is reportedly offering enterprise contracts at $0.05 per task for high-volume customers, with a guaranteed 90% success rate or the task is free.
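A quick calculation shows how outcome-based billing changes the numbers. This sketch assumes that "the task is free" means failed tasks are simply not billed, and it reuses the estimated per-task prices and AgentBench success rates from the tables above; the function names and the 100,000-task volume are ours.

```python
def outcome_based_bill(attempted: int, success_rate: float, price_per_task: float) -> float:
    """Bill when only completed tasks are charged (assumed reading of the guarantee)."""
    completed = attempted * success_rate
    return completed * price_per_task


def cost_per_completed_task(price_per_attempt: float, success_rate: float) -> float:
    """Effective cost per success when every attempt is charged."""
    return price_per_attempt / success_rate


# 100,000 attempted tasks at a 92% success rate and $0.05 per completed task.
gemini_bill = outcome_based_bill(100_000, 0.92, 0.05)

# A rival charging an estimated $0.12 per attempt at a 68% success rate.
rival_effective = cost_per_completed_task(0.12, 0.68)

print(f"outcome-based bill: ${gemini_bill:,.2f}")
print(f"rival cost per completed task: ${rival_effective:.3f}")
```

Under these assumptions the customer pays for 92,000 completed tasks, while a pay-per-attempt rival's effective price per success rises above its list price because failures are still billed.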

This has massive implications for the $200B enterprise automation market. Companies like UiPath and Automation Anywhere, which built RPA (Robotic Process Automation) empires on scripted bots, face existential disruption. A Gemini 3.5 agent can adapt to UI changes, handle exceptions, and reason about business rules—things that traditional RPA cannot do without human reprogramming.

| Market Segment | Pre-Gemini 3.5 | Post-Gemini 3.5 (Projected 2026) |
|---|---|---|
| RPA Software | $13B | $4B (declining) |
| AI Agent Platforms | $2B | $18B (growing) |
| Enterprise Automation Services | $85B | $120B (redefined) |
| LLM API Revenue (token-based) | $40B | $25B (cannibalized) |

Data Takeaway: The market is pivoting from selling tools (RPA, API keys) to selling outcomes (completed tasks). Google is positioned to capture the largest share of the new $18B AI agent platform market, but only if it can maintain reliability and trust.

Risks, Limitations & Open Questions

Autonomous action introduces risks that passive text generation does not. A model that can execute code and call APIs can cause real-world damage. Google has implemented action guardrails: every tool call is logged, and the model cannot execute actions that modify system-level configurations or access sensitive data without explicit user approval. However, adversarial attacks remain a concern. Researchers at Carnegie Mellon University demonstrated that a carefully crafted prompt could trick Gemini 3.5 into executing a command that deletes a user's cloud storage. Google patched this within 48 hours, but the attack surface is vastly larger than for a text-only model.
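The two guardrails named above, logging every tool call and requiring explicit approval for sensitive actions, can be sketched as a thin wrapper around the tool registry. The class name, the `SENSITIVE` set, and the tool names are hypothetical; the article does not describe how Google implements this.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical set of actions that need explicit user approval.
SENSITIVE = {"delete_storage", "modify_system_config"}


@dataclass
class GuardedToolbox:
    tools: Dict[str, Callable[..., Any]]
    approve: Callable[[str, Dict[str, Any]], bool]  # asks the user
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        """Log every call; block sensitive ones unless approved."""
        allowed = name not in SENSITIVE or self.approve(name, kwargs)
        self.audit_log.append({"tool": name, "args": kwargs, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{name} requires explicit user approval")
        return self.tools[name](**kwargs)


box = GuardedToolbox(
    tools={
        "read_file": lambda path: f"contents of {path}",
        "delete_storage": lambda bucket: f"deleted {bucket}",
    },
    approve=lambda name, args: False,  # simulate a user who declines
)

print(box.call("read_file", path="notes.txt"))
try:
    box.call("delete_storage", bucket="backups")
except PermissionError as err:
    print("blocked:", err)
print(box.audit_log)
```

Note that the denied call is still written to the audit log before the error is raised, so blocked attempts leave a trace as well.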

Another limitation is task ambiguity. In AgentBench, Gemini 3.5 achieved 92% success, but the remaining 8% included tasks where the model misinterpreted vague instructions—e.g., 'book a cheap flight' led to a 14-hour layover because the model optimized for price over time. The model lacks true understanding of user preferences; it optimizes for explicit instructions, not implicit values.
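The 'cheap flight' failure is easy to reproduce with a toy objective. In the sketch below, the flights, prices, and the `usd_per_hour` weight are made-up illustrations; the point is only that a price-only objective and one that also values the traveler's time pick different itineraries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flight:
    code: str
    price_usd: float
    duration_hours: float  # total travel time, including layovers


def cheapest(flights: List[Flight]) -> Flight:
    """Literal reading of 'cheap': minimize price only."""
    return min(flights, key=lambda f: f.price_usd)


def best_tradeoff(flights: List[Flight], usd_per_hour: float) -> Flight:
    """Minimize price plus a cost for the traveler's time (an implicit preference)."""
    return min(flights, key=lambda f: f.price_usd + usd_per_hour * f.duration_hours)


flights = [
    Flight("A", price_usd=210, duration_hours=20.0),  # long layover
    Flight("B", price_usd=260, duration_hours=6.5),   # direct
]

literal_pick = cheapest(flights)
preference_pick = best_tradeoff(flights, usd_per_hour=15)
print(literal_pick.code, preference_pick.code)
```

The second objective only works if something supplies the time weight, which is exactly the implicit preference the model is said to lack.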

Finally, there is the jailbreak problem. If a model can execute code, a malicious user could trick it into running arbitrary scripts. Google's sandbox is robust, but no sandbox is perfect. The open question is whether the benefits of autonomous action outweigh the security risks for enterprise deployment.

AINews Verdict & Predictions

Gemini 3.5 is not just a product launch; it is a declaration that the next phase of AI is about agency, not conversation. Google has bet the farm on the idea that the most valuable AI is one that does things, not one that talks about doing things. We believe this bet will pay off.

Prediction 1: By Q1 2026, task-based pricing will become the dominant model for enterprise AI, and token-based pricing will be relegated to low-value use cases like content generation. The economics are too compelling: enterprises will pay $0.05 for a guaranteed completed task rather than $0.50 for a stream of tokens that require human interpretation.

Prediction 2: OpenAI and Anthropic will be forced to release native action models within six months, or risk losing the enterprise market. Their current function-calling APIs are architectural dead ends. Expect GPT-5 and Claude 4 to feature native tool execution.

Prediction 3: The RPA industry will be largely obsolete by 2027. UiPath and Automation Anywhere will either pivot to AI agent orchestration or be acquired. The value is shifting from scripted automation to autonomous reasoning.

Prediction 4: The biggest risk is not competition from other AI labs, but a catastrophic failure—a widely publicized incident where a Gemini 3.5 agent causes financial or physical harm. Google's safety team must invest heavily in adversarial testing and transparent logging. One bad incident could set the entire agent ecosystem back by years.

What to watch next: Google's open-sourcing of the AgentBench benchmark and the 'agent-act-framework' reference implementation. If Google makes these widely available, it will accelerate the entire field—and cement its leadership. If it keeps them proprietary, the open-source community will catch up within 18 months. Either way, the era of passive AI is over.
