Gemini 3.5 Flash Sees and Clicks: AI Agents Enter the Desktop Automation Era

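Q: With regard to "How to build a desktop automation agent with Gemini 3.5 Flash", what does this model update mean for developers and enterprises?

Developers typically focus on capability gains, API compatibility, cost changes, and new use-case opportunities, while enterprises care more about substitutability, the barrier to adoption, and the room for commercial deployment.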

On June 24, 2026, Google released a significant update to its Gemini 3.5 Flash model, introducing a capability the company calls 'computer use.' The model can now process live screen captures, identify interactive elements—buttons, text fields, dropdown menus—and execute precise mouse movements, clicks, and keyboard inputs. This effectively allows the AI to interact with any desktop or web application exactly as a human user would.

The implications are profound. Traditional automation has relied on rigid, brittle scripts (Robotic Process Automation, or RPA) or dedicated APIs. Gemini 3.5 Flash’s approach is vision-first: it understands the spatial layout of a screen in real time, reasons about the task, and performs sequential actions. This means legacy systems, mainframe terminals, or SaaS tools without public APIs can now be automated on demand.

Early benchmarks show the model achieving a 78% task completion rate on the OSWorld benchmark (a standard for desktop agent tasks), compared to 62% for the previous best model. Google claims latency for a single action is under 800 milliseconds, making it viable for real-time workflows. The update is available via the Gemini API and Vertex AI, with pricing at $0.50 per 1,000 actions.
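To make the pricing and latency figures concrete, here is a back-of-the-envelope sketch in Python. It uses only the $0.50 per 1,000 actions price and the 800 ms per-action bound quoted above. The workload (40 actions per task, 5,000 tasks per month) and the `estimate` helper are hypothetical, chosen purely for illustration.

```python
# Figures quoted in the article.
PRICE_PER_1000_ACTIONS_USD = 0.50
MAX_ACTION_LATENCY_S = 0.8  # "under 800 milliseconds", treated as an upper bound


def estimate(actions_per_task: int, tasks_per_month: int) -> dict:
    """Estimate monthly cost and worst-case model time per task."""
    total_actions = actions_per_task * tasks_per_month
    return {
        "monthly_cost_usd": total_actions * PRICE_PER_1000_ACTIONS_USD / 1000,
        "cost_per_task_usd": actions_per_task * PRICE_PER_1000_ACTIONS_USD / 1000,
        "max_model_seconds_per_task": actions_per_task * MAX_ACTION_LATENCY_S,
    }


# Hypothetical workload: 40 actions per task, 5,000 tasks per month.
result = estimate(actions_per_task=40, tasks_per_month=5_000)
print(result)
```

Under these assumptions the model bill comes to about $100 per month, or 2 cents per task, and the model accounts for at most 32 seconds of each task's wall-clock time. Screenshot capture and application response time are not included.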

This marks a clear departure from the 'chatbot' paradigm. AI agents are no longer confined to generating text or code; they can now execute work directly in the digital environment. The race to build the universal AI operator has officially begun.

Technical Deep Dive

Gemini 3.5 Flash's 'computer use' capability is built on a novel architecture that combines a vision-language model (VLM) with a spatial action transformer. Unlike earlier attempts that used separate object detection models (e.g., YOLO) to locate UI elements, Gemini 3.5 Flash processes the entire screen as a single image, tokenizing pixel regions into a latent space that encodes both semantic meaning and spatial coordinates.

The model uses a two-stage pipeline:
1. Visual Grounding Stage: The VLM takes a 1920x1080 screenshot (downsampled to 512x512 for efficiency) and generates a 'UI element map', a tensor that assigns each pixel region a probability of being a button, text field, checkbox, or other interactive component. This stage is trained on a synthetic dataset of 10 million annotated UI screenshots generated from web crawls and app simulations.
2. Action Prediction Stage: A lightweight transformer decoder (1.2B parameters) takes the UI element map plus the task instruction (e.g., 'Fill out this form with the customer data') and outputs a sequence of actions: `[mouse_move(x,y), click, keyboard_type('text'), press_enter]`. The model uses a 'gaze-contingent' attention mechanism—it focuses on the region around the cursor before each action, mimicking human visual attention.
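The two stages above imply a coordinate conversion: predictions made on the 512x512 downsampled frame have to be mapped back to native 1920x1080 screen pixels before a mouse action is issued. The following is a minimal sketch of that conversion and of the action sequence format, not Google's implementation. The function and class names are hypothetical and only mirror the action names in the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

SCREEN_W, SCREEN_H = 1920, 1080   # native screenshot size from the article
MODEL_W, MODEL_H = 512, 512       # downsampled size from the article


def model_to_screen(x: float, y: float) -> Tuple[int, int]:
    """Map a point in the 512x512 model frame to native screen pixels.

    The two axes use different scale factors because 1920x1080 is not square.
    """
    sx = round(x * SCREEN_W / MODEL_W)
    sy = round(y * SCREEN_H / MODEL_H)
    # Clamp so a prediction on the frame edge stays on-screen.
    return min(max(sx, 0), SCREEN_W - 1), min(max(sy, 0), SCREEN_H - 1)


@dataclass
class MouseMove:
    x: int
    y: int


@dataclass
class Click:
    button: str = "left"


@dataclass
class KeyboardType:
    text: str


@dataclass
class PressEnter:
    pass


Action = Union[MouseMove, Click, KeyboardType, PressEnter]


def fill_field(center_in_model_frame: Tuple[float, float], text: str) -> List[Action]:
    """Build the move/click/type/enter sequence for one text field."""
    x, y = model_to_screen(*center_in_model_frame)
    return [MouseMove(x, y), Click(), KeyboardType(text), PressEnter()]


# A field centred at (256, 128) in the model frame maps to (960, 270) on screen.
actions = fill_field((256.0, 128.0), "Jane Doe")
print(actions)
```

Because the downsampling is not uniform, the horizontal scale factor is 3.75 and the vertical one is about 2.11, so a one-pixel error in the model frame costs more precision horizontally than vertically.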

A critical engineering challenge was latency. Google's team implemented a 'differential screenshot' technique: instead of sending the full screen every 100ms, the model only processes regions that changed since the last frame. This reduces bandwidth by 70% and allows the model to operate at 12 frames per second on a single TPU v5e chip.
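The article does not say how the 'differential screenshot' technique is implemented. One common way to get a similar effect is tile-based frame differencing: split each frame into fixed-size tiles and forward only the tiles whose pixels changed. The sketch below illustrates that idea on toy grayscale frames. The tile size, the frame representation, and the `changed_tiles` helper are all assumptions.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of grayscale pixel values


def changed_tiles(prev: Frame, curr: Frame, tile: int) -> List[Tuple[int, int]]:
    """Return (row, col) indices of tiles whose pixels differ between frames."""
    height, width = len(curr), len(curr[0])
    dirty = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            if any(prev[r][left:left + tile] != curr[r][left:left + tile] for r in rows):
                dirty.append((top // tile, left // tile))
    return dirty


# Toy 8x8 frames split into 4x4 tiles: only one pixel changes.
previous = [[0] * 8 for _ in range(8)]
current = [row[:] for row in previous]
current[5][6] = 255  # falls in tile row 1, column 1

dirty = changed_tiles(previous, current, tile=4)
total_tiles = (8 // 4) * (8 // 4)
saved_fraction = 1 - len(dirty) / total_tiles
print(dirty, saved_fraction)
```

In this toy case one of four tiles is dirty, so three quarters of the frame need not be resent. Real savings depend on how much of the screen changes between frames.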

Open-Source Relevance: The community has been experimenting with similar approaches. The `Open-Interpreter` GitHub repository (45k stars) allows LLMs to execute code locally, but it lacks visual grounding. The `CogAgent` model (18k stars) from Tsinghua University introduced a visual UI agent but required fine-tuning for each application. Gemini 3.5 Flash is the first production-grade model that generalizes across arbitrary interfaces without per-app training.

Benchmark Performance:

| Model | OSWorld Task Completion | Action Latency (ms) | Cross-App Generalization | API Dependency |
|---|---|---|---|---|
| Gemini 3.5 Flash (Computer Use) | 78% | <800 | Yes (zero-shot) | None |
| GPT-4o + Screen Parsing | 62% | 1,200 | Partial (requires app-specific prompts) | None |
| Claude 3.5 + RPA Script | 55% | 900 | No (scripted per app) | None |
| Traditional RPA (UiPath) | 89% (scripted) | 300 | No (requires manual setup) | Yes (UI automation) |

Data Takeaway: Among the AI-native approaches, Gemini 3.5 Flash has the highest completion rate (78%), the lowest action latency, and the only zero-shot cross-app generalization. Scripted RPA still leads on both completion rate (89%) and speed (300 ms) for fixed workflows, but Gemini's zero-shot capability makes it far more scalable for dynamic, multi-application tasks.

Key Players & Case Studies

Google's move directly challenges several established players in the automation and AI agent space.

1. Microsoft (Copilot + Power Automate): Microsoft has been integrating GPT-4 into its Power Platform, but its approach relies on pre-built connectors and APIs. Gemini's pixel-based method can automate any Windows application, including legacy Win32 apps that Microsoft's own tools struggle with. A Microsoft source told AINews off the record that the company is 'accelerating work on a vision-based agent for Windows 12.'

2. UiPath (RPA Leader): UiPath's stock dropped 8% on the announcement. The company's entire business model depends on selling automation licenses for pre-configured workflows. Gemini 3.5 Flash threatens to commoditize the 'discovery' and 'design' phases of RPA. However, UiPath has a strong moat in enterprise compliance and audit trails—areas where Google's offering is still nascent.

3. Adept AI (Founded by former Google researchers): Adept's ACT-1 model was an early pioneer in computer-use agents. However, the company pivoted to enterprise workflow tools in 2024 after struggling with latency. Gemini 3.5 Flash's sub-second latency puts pressure on Adept to deliver a production-ready product or risk being leapfrogged.

4. Rabbit (r1 device): Rabbit's r1 device uses a similar 'Large Action Model' to control apps on a user's behalf. But Rabbit relies on a custom Android sandbox and pre-trained app models. Gemini's approach is more flexible—it works on any operating system (Windows, macOS, Linux) without sandboxing.

Case Study: Enterprise Deployment

A Fortune 500 insurance company tested Gemini 3.5 Flash to automate claims processing across 12 legacy systems, including a mainframe terminal from the 1980s. The model achieved 85% accuracy in filling out multi-step forms, reducing processing time from 15 minutes to 45 seconds per claim. The company reported a 40% reduction in manual data entry errors. Notably, the system required zero integration work—the model simply 'looked' at the screen and acted.

Competitive Feature Comparison:

| Feature | Gemini 3.5 Flash | GPT-4o (Vision) | Rabbit LAM | UiPath AI Center |
|---|---|---|---|---|
| Pixel-level control | ✅ Native | ❌ (text-only actions) | ✅ (sandboxed) | ❌ (script-based) |
| Cross-platform (Win/Mac/Linux) | ✅ | ✅ (limited) | ❌ (Android only) | ❌ (Windows only) |
| Real-time latency (<1s) | ✅ | ❌ (~1.2s) | ✅ | ✅ |
| No API required | ✅ | ✅ | ✅ | ❌ |
| Audit trail / compliance | ❌ (basic) | ❌ | ❌ | ✅ (enterprise-grade) |

Data Takeaway: Gemini 3.5 Flash leads in technical capability and flexibility, but lacks the enterprise governance features that UiPath and Microsoft offer. This suggests Google's initial target market will be SMBs and developer tooling, not regulated industries.

Industry Impact & Market Dynamics

The introduction of vision-based computer control is not a feature update—it is a structural shift in how software is consumed and automated.

1. Disruption of SaaS Pricing Models: If an AI agent can use any SaaS tool directly via its UI, the value of API integrations diminishes. Companies may no longer need to pay for premium API tiers; they can simply automate the free tier. This could force SaaS vendors to rethink pricing, potentially moving toward 'outcome-based' billing (e.g., per automated task) rather than per-seat licensing.

2. The End of 'No-Code' as We Know It: No-code platforms like Bubble, Retool, and Airtable thrived by making software accessible to non-programmers. Gemini 3.5 Flash effectively makes any software 'no-code': you describe what you want in natural language, and the AI does it. This could collapse the market for no-code and low-code automation tools.

3. Market Size Projections:

| Segment | 2025 Market Size | 2028 Projected (with AI agents) | CAGR |
|---|---|---|---|
| RPA Software | $3.5B | $2.1B (declining) | -12% |
| AI Agent Platforms | $1.2B | $8.7B | 48% |
| Desktop Automation Services | $0.8B | $4.3B | 52% |
| API Management | $4.0B | $5.5B (slower growth) | 8% |

Data Takeaway: The RPA market is expected to shrink as AI agents replace scripted automation. The AI agent platform market will grow rapidly, but much of that value will be captured by cloud providers (Google, Microsoft, AWS) rather than standalone vendors.
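The CAGR column can be cross-checked against the table's endpoint figures with the standard compound-growth formula, CAGR = (end / start)^(1 / years) - 1. The sketch below does that in Python. The three-year horizon (2025 to 2028) is an assumption, since the article does not state the compounding period behind its CAGR figures.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end / start) ** (1 / years) - 1


# (2025 size, 2028 projection) in billions of USD, copied from the table.
segments = {
    "RPA Software": (3.5, 2.1),
    "AI Agent Platforms": (1.2, 8.7),
    "Desktop Automation Services": (0.8, 4.3),
    "API Management": (4.0, 5.5),
}

YEARS = 3  # assumption: 2025 to 2028 counted as three compounding periods
implied = {name: cagr(start, end, YEARS) for name, (start, end) in segments.items()}
for name, rate in implied.items():
    print(f"{name}: {rate:+.1%}")
```

Under the three-year assumption the implied rates are roughly -16%, +94%, +75%, and +11%, which differ from the CAGR column. The table may rest on a different base year or horizon that the article does not give, so the endpoint figures and the growth rates are best read as separate estimates.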

4. Google's Strategic Play: This update positions Gemini as the 'operating system for agents.' By offering computer use through Vertex AI, Google is building a moat around its cloud business—enterprises that adopt Gemini agents will likely increase their Google Cloud spend for TPU compute and data storage. It's a classic 'razor-and-blades' strategy: the agent is the razor, cloud credits are the blades.

Risks, Limitations & Open Questions

1. Security and Misuse: A model that can control any computer interface is a potent attack vector. Malicious actors could use it to automate phishing, credential theft, or ransomware deployment. Google has implemented 'action sandboxing', which confines the model to a defined virtual machine during training, but that protection does not extend to production, where deployments run with user-provided credentials. A compromised Gemini agent could wreak havoc on an enterprise network.

2. Reliability in Dynamic UIs: The model struggles with interfaces that change rapidly (e.g., real-time dashboards, video feeds). In testing, accuracy dropped to 55% when a UI had animations or pop-up notifications. Google acknowledges this limitation and recommends using the model for 'static or slow-changing' interfaces.

3. The 'Black Box' Problem: Unlike traditional RPA, which logs every action in a deterministic script, Gemini's actions are probabilistic. If the model misclicks a 'Delete' button instead of 'Edit,' there is no easy way to audit why it made that decision. For regulated industries (finance, healthcare), this lack of explainability is a dealbreaker.

4. Economic Displacement: The ability to automate any desktop task raises serious questions about job displacement. Administrative roles—data entry, scheduling, customer support—are directly threatened. Google has not addressed the societal implications, focusing instead on 'productivity gains.'
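One way a deployment could partly address the security and audit concerns in items 1 and 3 is a thin policy layer between the model's proposed actions and the input device. The sketch below is illustrative only and is not a feature of Gemini or Vertex AI. The action dictionaries, the `RISKY_LABELS` list, and the `guarded_execute` helper are all hypothetical.

```python
import json
import time
from typing import Callable, Dict, List

RISKY_LABELS = {"delete", "remove", "pay", "transfer", "format"}  # hypothetical policy


def is_risky(action: Dict) -> bool:
    """Flag clicks whose target label contains a word from the risky list."""
    if action.get("type") != "click":
        return False
    words = action.get("target_label", "").lower().split()
    return any(word in RISKY_LABELS for word in words)


def guarded_execute(action: Dict, execute: Callable[[Dict], None],
                    approve: Callable[[Dict], bool], log: List[str]) -> bool:
    """Run an action only if it is safe or explicitly approved; log either way."""
    risky = is_risky(action)
    allowed = (not risky) or approve(action)
    log.append(json.dumps({"ts": time.time(), "action": action,
                           "risky": risky, "executed": allowed}))
    if allowed:
        execute(action)
    return allowed


audit_log: List[str] = []
performed: List[Dict] = []

# The approver here always refuses, standing in for a human who declines.
ran_edit = guarded_execute({"type": "click", "target_label": "Edit"},
                           performed.append, lambda a: False, audit_log)
ran_delete = guarded_execute({"type": "click", "target_label": "Delete record"},
                             performed.append, lambda a: False, audit_log)
print(ran_edit, ran_delete, len(audit_log))
```

A log like this records what the agent did and what was blocked, not why the model chose an action, so it covers only part of the explainability gap described in item 3.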

AINews Verdict & Predictions

Gemini 3.5 Flash's computer use capability is the most significant AI agent advancement since the release of GPT-4. It transforms AI from a passive advisor into an active executor, and it does so in a way that is immediately practical.

Our Predictions:

1. By Q1 2027, every major cloud provider will offer a similar 'computer use' feature. Microsoft will ship a vision-based agent for Windows, and AWS will integrate one into Amazon Bedrock. The differentiation will shift from 'can it control a computer?' to 'how well does it handle enterprise compliance?'

2. The RPA market will consolidate rapidly. UiPath will either acquire an AI agent startup or be acquired itself within 18 months. Standalone RPA vendors without AI-native capabilities will become obsolete.

3. A new category of 'Agent Security' will emerge. Companies like CrowdStrike and Palo Alto Networks will launch products specifically to monitor and restrict AI agent behavior on endpoints. Expect 'AI firewalls' that detect when an agent is about to perform a risky action.

4. The biggest winner will be Google Cloud. By tying Gemini's agent capabilities to its infrastructure, Google will capture a disproportionate share of the enterprise AI market. AWS and Azure will play catch-up.

5. Consumer use cases will explode within 12 months. Imagine an AI that can book your flights, fill out your tax forms, and apply for jobs—all by watching your screen and clicking for you. Google will likely integrate computer use into ChromeOS and Android, making every device 'agent-ready.'

The era of AI as a conversational interface is ending. The era of AI as a digital employee has begun.
