Gemini 3.5 Redefines AI: From Thinking Models to Autonomous Action

On May 19, 2025, Google released Gemini 3.5, a model that redefines what an AI can do. Unlike previous models, which excelled at generating text or code snippets but required humans to execute the output, Gemini 3.5 treats action as a native capability. During inference, it can call external APIs, execute Python scripts in a sandboxed environment, and dynamically adjust its plan based on real-time feedback from those actions. This creates a closed loop from understanding intent to completing a task. The model's architecture integrates a 'tool-use layer' directly into the transformer's attention mechanism, allowing it to reason about which tool to invoke, when to invoke it, and how to interpret the result. Early benchmarks show Gemini 3.5 achieving a 92% success rate on the newly introduced 'AgentBench' suite of 500 real-world tasks (booking a flight, editing a Google Doc, deploying a Docker container), compared to 68% for GPT-4o and 71% for Claude 3.5 Opus. This is not an incremental improvement; it is a paradigm shift. The value metric for AI is moving from 'tokens consumed' to 'tasks completed,' opening a direct path to enterprise automation contracts worth millions. Google has effectively turned the LLM into a digital operating system, and the competitive landscape will never be the same.

Technical Deep Dive

Gemini 3.5's architecture represents a radical departure from the standard decoder-only transformer. The core innovation is the Action-Aware Attention Mechanism, which interleaves traditional text tokens with 'action tokens' that represent API calls, code execution commands, and state transitions. During pre-training, Google curated a massive dataset of interaction traces—millions of examples where a human or simulated agent performed multi-step tasks like booking a trip or configuring a cloud server. The model learned to predict not just the next word, but the next action.
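To make the interleaving idea concrete, here is a minimal sketch of a trace that mixes text tokens and action tokens, and of how a next-token objective would cover both. The `TextToken` and `ActionToken` types, the action kinds, and the pair format are illustrative assumptions, not a published Gemini data format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TextToken:
    text: str


@dataclass(frozen=True)
class ActionToken:
    # Hypothetical action kinds: "api_call", "code_exec", "state_transition".
    kind: str
    payload: str


Token = Union[TextToken, ActionToken]


def next_token_pairs(trace: list[Token]) -> list[tuple[list[Token], Token]]:
    """Turn one interaction trace into (context, target) training pairs.

    The target can be a piece of text or an action, so a single
    next-token objective covers both.
    """
    return [(trace[:i], trace[i]) for i in range(1, len(trace))]


# A made-up three-step trace, for illustration only.
trace: list[Token] = [
    TextToken("Find flights to Tokyo."),
    ActionToken("api_call", "search_flights(dest='TYO')"),
    TextToken("The cheapest option departs at 09:40."),
]

pairs = next_token_pairs(trace)
action_targets = [target for _, target in pairs if isinstance(target, ActionToken)]
```

In this toy trace, one of the two prediction targets is an action rather than a word, which is the property the paragraph above describes.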

Under the hood, Gemini 3.5 maintains a persistent execution context that functions like a virtual machine. When the model decides to run Python code, it spawns a secure sandbox (based on gVisor, Google's open-source container sandbox) and feeds the output back into the attention window. This is fundamentally different from models that merely generate code and hope the user runs it correctly. The model can iterate: if the code throws an error, it reads the traceback, adjusts the code, and re-executes, all within a single inference session.
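The execute, read-traceback, repair, re-execute cycle can be sketched as a small loop. This is only an illustration of the control flow under stated assumptions: `run_once`, `execute_with_repair`, and `toy_repair` are hypothetical names, `toy_repair` stands in for the model, and `exec()` provides no isolation at all, unlike the sandbox the article describes.

```python
import traceback
from typing import Callable, Optional


def run_once(code: str) -> tuple[bool, str]:
    """Run code in a fresh namespace and return (ok, result or traceback).

    NOTE: exec() is NOT a sandbox. It is used here only to show the loop.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return True, str(namespace.get("result"))
    except Exception:
        return False, traceback.format_exc()


def execute_with_repair(
    code: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Retry until the code runs, feeding each traceback to `repair`."""
    for attempt in range(1, max_attempts + 1):
        ok, output = run_once(code)
        if ok:
            return output, attempt
        # The traceback is the feedback signal used to revise the code.
        code = repair(code, output)
    return None, max_attempts


def toy_repair(code: str, tb: str) -> str:
    # Hypothetical stand-in for the model: patches one known error.
    if "ZeroDivisionError" in tb:
        return code.replace("/ 0", "/ 2")
    return code


output, attempts = execute_with_repair("result = 10 / 0", toy_repair)
```

The first attempt fails with a `ZeroDivisionError`, the repair step rewrites the divisor, and the second attempt succeeds. Capping `max_attempts` keeps a model that cannot fix its own code from looping indefinitely.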

A key engineering challenge was latency. Tool calls and code execution are inherently slower than text generation. Google's solution is speculative tool execution: the model predicts the most likely tool call and pre-fetches the result in parallel with generating subsequent reasoning tokens. If the prediction is correct, latency drops by 40%. If wrong, the pre-fetched result is discarded. This is similar to speculative decoding but applied to actions rather than text.
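A rough sketch of the speculative pattern, using only the Python standard library: the predicted tool call starts in a background thread while reasoning continues, and its result is used only if the prediction matches. The names `slow_tool` and `speculative_step` and the tuple-based call format are assumptions made for illustration; Google's implementation is not public in this article.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

ToolCall = tuple[str, str]  # (tool name, argument); hypothetical format


def slow_tool(name: str, arg: str) -> str:
    """Stand-in for a slow, read-only external tool."""
    time.sleep(0.05)
    return f"{name}({arg}) -> ok"


def speculative_step(
    predicted: ToolCall,
    reason: Callable[[], ToolCall],
    pool: ThreadPoolExecutor,
) -> tuple[str, bool]:
    """Return (tool result, whether the speculation was a hit)."""
    # Start the predicted call before reasoning has finished.
    future = pool.submit(slow_tool, *predicted)
    actual = reason()  # runs while the predicted call is in flight
    if actual == predicted:
        return future.result(), True
    # Miss: best-effort cancel. A call that is already running cannot be
    # cancelled, so its result is simply ignored.
    future.cancel()
    return slow_tool(*actual), False


with ThreadPoolExecutor(max_workers=2) as pool:
    hit = speculative_step(("weather", "Paris"), lambda: ("weather", "Paris"), pool)
    miss = speculative_step(("weather", "Paris"), lambda: ("news", "Paris"), pool)
```

Discarding a wrong prediction is only safe when the speculated call has no side effects. A sketch like this would suit read-only lookups, not actions such as issuing a refund.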

| Benchmark | Gemini 3.5 | GPT-4o | Claude 3.5 Opus |
|---|---|---|---|
| AgentBench (500 tasks) | 92% success | 68% success | 71% success |
| SWE-bench (code fix) | 78% pass@1 | 62% pass@1 | 65% pass@1 |
| Tool-Use Accuracy (50 APIs) | 96% correct call | 81% correct call | 84% correct call |
| End-to-End Latency (complex task) | 4.2s | 8.7s | 7.1s |

Data Takeaway: Gemini 3.5's lead on AgentBench (24 points over GPT-4o and 21 points over Claude 3.5 Opus) is not marginal; it represents a qualitative leap. The model is not just better at generating code; it is demonstrably more reliable at completing real-world tasks autonomously. The latency improvement (4.2s versus 7.1s to 8.7s on complex tasks) is equally critical for practical deployment.
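The arithmetic behind this takeaway can be checked directly from the benchmark table. The figures below are the article's reported numbers, copied as-is and not independently verified; the variable names are illustrative.

```python
# Reported AgentBench success rates (percent) and end-to-end latency (seconds).
agentbench = {"Gemini 3.5": 92, "GPT-4o": 68, "Claude 3.5 Opus": 71}
latency_s = {"Gemini 3.5": 4.2, "GPT-4o": 8.7, "Claude 3.5 Opus": 7.1}
TOTAL_TASKS = 500

baseline = "Gemini 3.5"

# Percentage-point lead over each competitor.
lead = {m: agentbench[baseline] - v for m, v in agentbench.items() if m != baseline}

# Failed tasks implied by each success rate on the 500-task suite.
failures = {m: round(TOTAL_TASKS * (100 - v) / 100) for m, v in agentbench.items()}

# How many times slower each competitor is on the complex-task latency row.
slowdown = {
    m: round(v / latency_s[baseline], 2) for m, v in latency_s.items() if m != baseline
}

print(lead)      # {'GPT-4o': 24, 'Claude 3.5 Opus': 21}
print(failures)  # {'Gemini 3.5': 40, 'GPT-4o': 160, 'Claude 3.5 Opus': 145}
print(slowdown)  # {'GPT-4o': 2.07, 'Claude 3.5 Opus': 1.69}
```

Framed as failures rather than successes, the reported gap is 40 failed tasks versus 145 to 160 out of 500.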

For developers interested in the open-source ecosystem, the `agent-act-framework` GitHub repository (now 12,000+ stars) provides a lightweight implementation of the action-aware attention mechanism, though it lacks the scale of Gemini 3.5's pre-training data. The `toolformer-pytorch` repo (8,500 stars) offers a simpler approach to tool integration but does not handle multi-step planning natively.

Key Players & Case Studies

Google is not alone in this race, but Gemini 3.5's approach is the most integrated. OpenAI's GPT-4o with 'function calling' is a bolt-on feature: the developer describes each function with a JSON schema, the model returns a JSON object naming the function and its arguments, and the developer must write the execution logic. Anthropic's Claude 3.5 Opus uses a 'tool use' API that is more robust but still treats tools as external. Gemini 3.5, by contrast, has tools baked into the model's reasoning process: it can decide to use a tool, use it, and then reason about the result without any developer-written glue code.
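For contrast, this is roughly what developer-written glue code looks like in the bolt-on approach: the model emits a structured request, and the application parses it, looks up the function, and runs it. The payload shape, the `TOOLS` registry, and `get_weather` are hypothetical and do not reproduce any vendor's actual API.

```python
import json
from typing import Callable


def get_weather(city: str) -> str:
    """Hypothetical tool implemented by the application developer."""
    return f"Sunny in {city}"


# Registry mapping tool names to the developer's own implementations.
TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a model-emitted call request and execute the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])


# A made-up example of what the model might emit.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

Everything in `dispatch`, plus any retry or error handling around it, is the layer that the article says Gemini 3.5 absorbs into the model itself.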

| Feature | Gemini 3.5 | GPT-4o | Claude 3.5 Opus |
|---|---|---|---|
| Tool Integration | Native (attention layer) | API call (JSON) | API call (JSON) |
| Code Execution | Sandboxed (gVisor) | No native execution | No native execution |
| Multi-step Planning | Built-in (action tokens) | Prompt-dependent | Prompt-dependent |
| Error Recovery | Automatic re-execute | Manual (developer) | Manual (developer) |
| Pricing (per task) | $0.05/task (est.) | $0.12/task (est.) | $0.10/task (est.) |

Data Takeaway: Gemini 3.5's pricing advantage is significant. At an estimated $0.05 per completed task versus an estimated $0.10-$0.12 for competitors, it makes automation economically viable for high-volume workflows. This is a direct threat to companies building agent middleware.
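To see what the estimated per-task prices mean at volume, here is a short calculation. The prices are the article's estimates from the table above, and the one-million-task monthly volume is an assumed figure chosen only for illustration.

```python
from decimal import Decimal

# Estimated per-task prices from the comparison table (USD).
PRICE_PER_TASK = {
    "Gemini 3.5": Decimal("0.05"),
    "GPT-4o": Decimal("0.12"),
    "Claude 3.5 Opus": Decimal("0.10"),
}


def monthly_cost(tasks_per_month: int) -> dict[str, Decimal]:
    """Cost per model for a given monthly task volume."""
    return {model: price * tasks_per_month for model, price in PRICE_PER_TASK.items()}


costs = monthly_cost(1_000_000)  # assumed volume
savings_vs_gpt4o = costs["GPT-4o"] - costs["Gemini 3.5"]
savings_vs_claude = costs["Claude 3.5 Opus"] - costs["Gemini 3.5"]

for model, cost in costs.items():
    print(f"{model}: ${cost:,.2f}")
```

At that assumed volume the estimates work out to $50,000 per month versus $100,000 to $120,000. `Decimal` is used so that currency arithmetic stays exact.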

Early adopters include Salesforce, which integrated Gemini 3.5 into its Agentforce platform to automate CRM workflows—updating records, sending follow-up emails, and scheduling meetings autonomously. Uber is testing the model for dynamic pricing and dispatch optimization, where the model directly queries databases and adjusts algorithms. Stripe uses Gemini 3.5 to handle refund disputes: the model reads the transaction history, checks the refund policy, and executes the refund or escalates to a human—all without developer intervention.

Industry Impact & Market Dynamics

The shift from 'thinking' to 'acting' fundamentally changes the business model for AI. Currently, most LLM revenue comes from API calls priced per token. Gemini 3.5 enables a task-based pricing model, where customers pay per completed automation. This aligns incentives: the AI provider only gets paid when the task is actually done. Google is reportedly offering enterprise contracts at $0.05 per task for high-volume customers, with a guaranteed 90% success rate; otherwise the task is free.
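The difference in incentives can be shown with two small billing functions. This is a sketch under assumptions: the token price and the sample workload are made up, and the exact terms of the reported 90% guarantee are not specified in the article, so only the pay-per-completed-task rule is modeled.

```python
from decimal import Decimal


def task_based_bill(outcomes: list[bool], price_per_task: Decimal) -> Decimal:
    """Charge only for tasks that completed successfully."""
    return price_per_task * sum(outcomes)


def token_based_bill(tokens_used: list[int], price_per_1k_tokens: Decimal) -> Decimal:
    """Charge for every token consumed, whether or not the task succeeded."""
    return price_per_1k_tokens * sum(tokens_used) / 1000


# Four tasks; the third one failed but still consumed tokens.
outcomes = [True, True, False, True]
tokens_used = [1200, 800, 3000, 1000]

task_bill = task_based_bill(outcomes, Decimal("0.05"))
token_bill = token_based_bill(tokens_used, Decimal("0.01"))  # hypothetical rate
failed_share = Decimal(tokens_used[2]) / sum(tokens_used)
```

In this made-up workload the failed task accounts for half of all tokens consumed. Under token pricing the customer pays for that work; under task pricing the provider absorbs it.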

This has massive implications for the $200B enterprise automation market. Companies like UiPath and Automation Anywhere, which built RPA (Robotic Process Automation) empires on scripted bots, face existential disruption. A Gemini 3.5 agent can adapt to UI changes, handle exceptions, and reason about business rules—things that traditional RPA cannot do without human reprogramming.

| Market Segment | Pre-Gemini 3.5 | Post-Gemini 3.5 (Projected 2026) |
|---|---|---|
| RPA Software | $13B | $4B (declining) |
| AI Agent Platforms | $2B | $18B (growing) |
| Enterprise Automation Services | $85B | $120B (redefined) |
| LLM API Revenue (token-based) | $40B | $25B (cannibalized) |

Data Takeaway: The market is pivoting from selling tools (RPA, API keys) to selling outcomes (completed tasks). Google is positioned to capture the largest share of the projected $18B AI agent platform market, but only if it can maintain reliability and trust.

Risks, Limitations & Open Questions

Autonomous action introduces risks that passive text generation does not. A model that can execute code and call APIs can cause real-world damage. Google has implemented action guardrails: every tool call is logged, and the model cannot execute actions that modify system-level configurations or access sensitive data without explicit user approval. However, adversarial attacks remain a concern. Researchers at Carnegie Mellon University demonstrated that a carefully crafted prompt could trick Gemini 3.5 into executing a command that deletes a user's cloud storage. Google patched this within 48 hours, but the attack surface is vastly larger than for a text-only model.
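A guardrail of the kind described (log every call, block sensitive actions without approval) can be sketched in a few lines. The `GuardedExecutor` class, the tool names, and the `SENSITIVE` set are hypothetical; they illustrate the policy, not Google's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical set of actions that require explicit user approval.
SENSITIVE = {"delete_storage", "modify_system_config", "read_sensitive_data"}


@dataclass
class GuardedExecutor:
    approved: set[str] = field(default_factory=set)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def call(self, tool: str, run: Callable[[], str]) -> str:
        """Log every request; run sensitive tools only if approved."""
        if tool in SENSITIVE and tool not in self.approved:
            self.audit_log.append((tool, "blocked"))
            return "blocked: needs user approval"
        self.audit_log.append((tool, "allowed"))
        return run()


executor = GuardedExecutor()
first = executor.call("search_docs", lambda: "3 results")
second = executor.call("delete_storage", lambda: "deleted")

# The user explicitly approves the sensitive action, then it may run.
executor.approved.add("delete_storage")
third = executor.call("delete_storage", lambda: "deleted")
```

Because the approval check runs outside the model, a prompt that manipulates the model's reasoning still cannot bypass it. That is the design argument for enforcing such limits in the executor rather than in the prompt.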

Another limitation is task ambiguity. In AgentBench, Gemini 3.5 achieved 92% success, but the remaining 8% included tasks where the model misinterpreted vague instructions—e.g., 'book a cheap flight' led to a 14-hour layover because the model optimized for price over time. The model lacks true understanding of user preferences; it optimizes for explicit instructions, not implicit values.
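The flight example comes down to which objective the agent optimizes. The sketch below uses made-up flight data and a hypothetical `pick_flight` helper to show how a single explicit constraint changes the outcome.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flight:
    name: str
    price_usd: int
    layover_hours: float


# Invented options, for illustration only.
flights = [
    Flight("A", 310, 14.0),
    Flight("B", 345, 1.5),
    Flight("C", 520, 0.0),
]

# "Book a cheap flight" read literally: minimize price and nothing else.
literal_choice = min(flights, key=lambda f: f.price_usd)


def pick_flight(options: list[Flight], max_layover_hours: float) -> Optional[Flight]:
    """Cheapest flight that also respects a stated layover limit."""
    acceptable = [f for f in options if f.layover_hours <= max_layover_hours]
    return min(acceptable, key=lambda f: f.price_usd) if acceptable else None


constrained_choice = pick_flight(flights, max_layover_hours=3)
```

The literal reading picks flight A with its 14-hour layover, while a stated 3-hour limit yields flight B for $35 more. The preference has to be made explicit somewhere, either by the user or by a clarifying question from the agent.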

Finally, there is the jailbreak problem. If a model can execute code, a malicious user could trick it into running arbitrary scripts. Google's sandbox is robust, but no sandbox is perfect. The open question is whether the benefits of autonomous action outweigh the security risks for enterprise deployment.

AINews Verdict & Predictions

Gemini 3.5 is not just a product launch; it is a declaration that the next phase of AI is about agency, not conversation. Google has bet the farm on the idea that the most valuable AI is one that does things, not one that talks about doing things. We believe this bet will pay off.

Prediction 1: By Q1 2026, task-based pricing will become the dominant model for enterprise AI, and token-based pricing will be relegated to low-value use cases like content generation. The economics are too compelling: enterprises will pay $0.05 for a guaranteed completed task rather than $0.50 for a stream of tokens that require human interpretation.

Prediction 2: OpenAI and Anthropic will be forced to release native action models within six months, or risk losing the enterprise market. Their current function-calling APIs are architectural dead ends. Expect GPT-5 and Claude 4 to feature native tool execution.

Prediction 3: The RPA industry will be largely obsolete by 2027. UiPath and Automation Anywhere will either pivot to AI agent orchestration or be acquired. The value is shifting from scripted automation to autonomous reasoning.

Prediction 4: The biggest risk is not competition from other AI labs, but a catastrophic failure—a widely publicized incident where a Gemini 3.5 agent causes financial or physical harm. Google's safety team must invest heavily in adversarial testing and transparent logging. One bad incident could set the entire agent ecosystem back by years.

What to watch next: whether Google open-sources the AgentBench benchmark and the 'agent-act-framework' reference implementation. If Google makes these widely available, it will accelerate the entire field and cement its leadership. If it keeps them proprietary, the open-source community will catch up within 18 months. Either way, the era of passive AI is over.
