Doom Runs Inside Claude.ai: LLMs Become Virtual Machines for Real-Time Gaming

In a proof-of-concept that has circulated among AI researchers and hobbyists, a developer managed to run the iconic first-person shooter Doom inside Claude.ai's chat interface. This is not a video feed of a game running elsewhere: the model itself plays the role of the game engine inside its context window, generating and parsing pixel-level data to maintain a persistent, interactive game state. The experiment is a radical stress test of large language model capabilities. It forces the model to simulate game logic, render graphics, and respond to user inputs turn by turn, far beyond the typical scope of text-based conversation.

The result blurs the line between AI inference and traditional computing. It suggests that LLMs could evolve into general-purpose execution environments, a kind of 'AI operating system' that hosts lightweight applications directly within a chat session. The current implementation is slow and resource-intensive, but it points to a future where AI assistants run code, host interactive content, and serve as native platforms for everything from education tools to entertainment, without relying on a separate runtime or plugins.

The implications for business models, security, and computational efficiency are significant, and this experiment may come to be seen as an early sign that the wall between AI and classical computing is starting to crumble.

Technical Deep Dive

The feat of running Doom inside Claude.ai is not a simple emulation trick. It exploits the fundamental structure of a large language model's context window — the limited memory space where the model processes input and generates output. In this experiment, the developer encoded the entire game state as a sequence of tokens within the context window. Each frame of Doom is represented as a grid of pixel values, serialized into text. The model, when prompted with the current state and a user action (e.g., 'move forward', 'shoot'), generates the next frame as a new block of pixel tokens. This is a form of 'in-context execution' where the model acts as both the game engine and the renderer.

Key technical components:
- State encoding: The game world — player position, enemy locations, health, ammo, map geometry — is flattened into a structured text representation. For example, a 320x200 pixel frame at 256 colors would require roughly 64,000 tokens per frame (assuming one token per pixel value). Claude's context window of 200K tokens limits the maximum resolution and complexity.
- Inference loop: The user sends a command, the model reads the current state from the context, computes the next state, and outputs the new pixel grid. This is not a pre-trained game engine; the model must infer the rules of Doom from the context alone, using its general reasoning abilities.
- Latency constraints: Each frame generation takes several seconds on Claude's backend, resulting in a playable but sluggish experience (roughly 0.2–0.5 frames per second, compared to Doom's original 35 FPS). This is a fundamental limitation of current transformer architectures, which are optimized for text generation, not real-time pixel processing.
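
To make the state-encoding arithmetic concrete, here is a minimal sketch of how a reduced-palette frame could be flattened into text. The format (one hex digit per pixel, one row per line) and the function names are assumptions for illustration; the article does not specify the encoding used in the demo.

```python
# Sketch: serialize a 16-color frame as text, one hex digit per pixel.
# The hex-digit, newline-separated layout is an assumed format, not the
# one used in the demo.

HEX = "0123456789abcdef"

def encode_frame(frame):
    """Turn a list of rows of palette indices (0-15) into text."""
    return "\n".join("".join(HEX[p] for p in row) for row in frame)

def decode_frame(text):
    """Inverse of encode_frame: text back to rows of palette indices."""
    return [[int(ch, 16) for ch in line] for line in text.split("\n")]

def pixel_count(width, height):
    """Pixels per frame; equals the token count if one pixel costs one token."""
    return width * height

frame = [[0, 15, 3], [7, 7, 1]]
text = encode_frame(frame)
print(text)
print(pixel_count(320, 200), pixel_count(160, 100))
```

Under the one-token-per-pixel assumption, a full 320x200 frame comes to 64,000 tokens and a 160x100 frame to 16,000. Real tokenizers may merge or split characters, so actual counts could differ.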

For readers interested in the underlying mechanics, the open-source repository 'llm-doom' (currently 1,200+ stars on GitHub) provides a reference implementation. It uses a lightweight Python script to interface with Claude's API, managing the token budget and state serialization. The repo's README notes that the same approach could theoretically work with any LLM that supports a sufficiently large context window and strong instruction-following capabilities.
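
The command-to-next-state loop and the token budgeting described above might look roughly like the sketch below. It is not taken from the 'llm-doom' repository: `step`, `recent_states`, and `count_tokens` are hypothetical names, `fake_model` stands in for a real API call, and the token counter is a crude one-token-per-character placeholder rather than a real tokenizer.

```python
# Sketch of a command -> next-state loop with a token budget.
# `model` is any callable that takes a prompt string and returns text;
# a real LLM client call would go there.

def count_tokens(text):
    # Placeholder: one token per character (an assumption, not a tokenizer).
    return len(text)

def recent_states(history, budget):
    """Return the newest states whose combined token count fits in budget."""
    kept, used = [], 0
    for state in reversed(history):
        cost = count_tokens(state)
        if used + cost > budget:
            break
        kept.append(state)
        used += cost
    return kept[::-1]

def step(model, history, action, budget):
    """Ask the model for the next state and append it to the history."""
    context = recent_states(history, budget)
    prompt = "\n---\n".join(context + ["ACTION: " + action, "NEXT STATE:"])
    next_state = model(prompt)
    history.append(next_state)
    return next_state

def fake_model(prompt):
    # Stand-in for an LLM: derives a dummy state from the requested action.
    action = prompt.split("ACTION: ")[-1].split("\n")[0]
    return "state after " + action

history = ["initial state"]
for action in ["move forward", "shoot"]:
    print(step(fake_model, history, action, budget=50))
```

Dropping the oldest states first is only one possible policy; a harness could instead keep a compact summary of the game world and just the latest frame.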

Data Table: Performance Metrics of In-Context Doom vs. Native Doom

| Metric | Native Doom (1993) | In-Context Doom (Claude.ai) |
|---|---|---|
| Frame rate | 35 FPS | 0.2–0.5 FPS |
| Context window usage | N/A | ~64K tokens per frame at 320x200; ~16K at the downscaled 160x100 (one token per pixel) |
| Latency per action | <30 ms | 2–5 seconds |
| Resolution | 320x200 | 160x100 (downscaled) |
| Color depth | 256 colors | 16 colors (reduced) |
| Total cost per minute | $0 (local hardware) | ~$0.50 (API usage) |

Data Takeaway: The performance gap is staggering — native Doom runs 70–175 times faster than the in-context version. However, the cost and latency are not the point. The experiment demonstrates that an LLM can *functionally* replace a game engine, even if inefficiently. As model inference speeds improve (e.g., through speculative decoding or specialized hardware), this gap will narrow, potentially making in-context execution viable for simple applications within 2–3 years.
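
The 70–175x figure follows directly from the latency numbers in the table. A short worked check, using only values quoted above:

```python
# Worked check of the speed gap, using the figures quoted in the table.
NATIVE_FPS = 35

def fps_from_latency(seconds_per_frame):
    """Frames per second when each frame takes this many seconds."""
    return 1.0 / seconds_per_frame

def slowdown(in_context_fps):
    """How many times faster native Doom runs."""
    return NATIVE_FPS / in_context_fps

for latency in (2, 5):
    fps = fps_from_latency(latency)
    print(f"{latency} s/frame -> {fps:.1f} FPS, native is {slowdown(fps):.0f}x faster")
```

A latency of 2 seconds per frame gives 0.5 FPS and a 70x gap; 5 seconds per frame gives 0.2 FPS and a 175x gap.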

Key Players & Case Studies

The developer behind this experiment, known pseudonymously as 'gamer-ai', is an independent researcher who previously demonstrated running a simple chess engine inside GPT-4. Their work builds on a growing body of research into 'in-context learning as computation', a concept explored by teams at Google DeepMind and Anthropic. Notably, Anthropic's own research on 'contextual reasoning' shows that Claude 3.5 Sonnet can maintain coherent state across 100K+ tokens, which is essential for this kind of application.

Other relevant case studies:
- OpenAI's Code Interpreter: A product that allows GPT-4 to execute Python code in a sandboxed environment. This is conceptually similar but relies on an external runtime, not the model itself. The Doom experiment goes further by using the model's own inference as the runtime.
- Anthropic's Claude Opus: The model used in the experiment. Its 200K token context window and strong instruction-following make it well suited to this task. Anthropic has not officially endorsed or commented on the experiment, but internal sources suggest the company is monitoring such use cases for potential product features.
- Google's Gemini 1.5 Pro: With a 1 million token context window, Gemini could theoretically run a higher-resolution version of Doom. However, its architecture is less optimized for pixel-level generation, and no public demonstration exists yet.

Data Table: LLM Capabilities for In-Context Execution

| Model | Context Window | Max Tokens/Frame (estimated) | Suitability for Game Execution |
|---|---|---|---|
| Claude Opus | 200K | 64K | High (used in this demo) |
| GPT-4 Turbo | 128K | 40K | Medium (shorter window, weaker state tracking) |
| Gemini 1.5 Pro | 1M | 320K | High (theoretically, but no demo) |
| Llama 3.1 405B | 128K | 40K | Low (open-source, but requires local hardware) |

Data Takeaway: Claude currently leads in practical in-context execution due to its combination of large context window and reliable state maintenance. Gemini's 1M token window offers the highest theoretical ceiling, but its performance on pixel-level generation tasks is unproven. The race is now on to see which model can first achieve a playable frame rate (at least 5 FPS) for in-context gaming.
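
The per-frame estimates in the table work out to roughly a third of each model's window, which leaves room for only a handful of frames at once. The sketch below shows how many whole frames fit in each window size for two frame costs (64,000 and 16,000 tokens, both assuming one token per pixel). The `reserved_tokens` parameter is a hypothetical allowance for instructions and game rules.

```python
# Sketch: how many serialized frames fit in a context window.
def frames_that_fit(window_tokens, tokens_per_frame, reserved_tokens=0):
    """Whole frames that fit after reserving room for instructions."""
    return max(0, (window_tokens - reserved_tokens) // tokens_per_frame)

WINDOWS = {"200K": 200_000, "128K": 128_000, "1M": 1_000_000}

for label, window in WINDOWS.items():
    full = frames_that_fit(window, 64_000)   # 320x200 frame
    small = frames_that_fit(window, 16_000)  # 160x100 frame
    print(f"{label}: {full} full-size frames, {small} downscaled frames")
```

Since each turn needs at least the previous frame as input and the next frame as output, a window that holds only two or three full-size frames leaves almost no history to work with.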

Industry Impact & Market Dynamics

This experiment is more than a novelty — it signals a potential paradigm shift in how we think about AI platforms. If LLMs can serve as general-purpose runtimes, the implications for the software industry are massive:

- New business models: AI companies could charge for 'compute time' within the context window, similar to cloud computing but at the model level. A game that runs inside Claude.ai would consume tokens for every frame, creating a usage-based revenue stream. Anthropic could offer a 'developer mode' where users pay per minute of in-context execution.
- Disintermediation of app stores: If an AI assistant can run any lightweight application without installation, traditional app stores and operating systems become less relevant. A user could play a game, edit a document, or run a simulation entirely within a chat session.
- Edge computing redefined: In-context execution could enable AI-powered devices to run complex applications without local hardware. A smart speaker could host a game or a productivity tool, using the cloud LLM as its engine. This reduces the need for powerful on-device chips.

However, the economics are currently unfavorable. Running Doom for one minute costs approximately $0.50 in API fees (based on Claude's pricing of $15 per million input tokens and $75 per million output tokens). At that rate, a 30-minute gaming session would cost $15 — far more than buying the original game outright. For mass adoption, token costs must drop by at least 10x.
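
A rough estimator shows how the per-minute figure depends on tokens per frame. It uses the quoted prices ($15 per million input tokens, $75 per million output tokens); the function names and the choice to ignore input cost in the last helper are assumptions for illustration.

```python
# Sketch: API cost of in-context frames at the quoted token prices.
INPUT_PRICE = 15.0 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 75.0 / 1_000_000  # dollars per output token

def cost_per_frame(input_tokens, output_tokens):
    """Dollar cost of generating one frame."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

def cost_per_minute(fps, input_tokens, output_tokens):
    """Dollar cost of one minute of play at the given frame rate."""
    return fps * 60 * cost_per_frame(input_tokens, output_tokens)

def output_tokens_for_budget(dollars_per_minute, fps):
    """Output tokens per frame affordable at a given spend, ignoring input cost."""
    return dollars_per_minute / (fps * 60 * OUTPUT_PRICE)

print(cost_per_frame(0, 16_000))
print(cost_per_minute(0.2, 0, 16_000))
print(round(output_tokens_for_budget(0.50, 0.2)))
```

Under the one-token-per-pixel assumption, the output alone for a 16,000-token frame costs about $1.20, or roughly $14 per minute at 0.2 FPS. Working backwards, $0.50 per minute at 0.2 FPS corresponds to only about 556 output tokens per frame. The $0.50 figure and the one-token-per-pixel figure therefore cannot both hold, so the cost estimates here should be read as order-of-magnitude at best.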

Data Table: Cost Comparison of In-Context vs. Traditional Gaming

| Gaming Method | Cost | Hardware Required | Platform Dependency |
|---|---|---|---|
| Native Doom (1993) | $0 per hour | PC or console | Standalone |
| Cloud gaming (e.g., GeForce Now) | $10–$20 per month (subscription) | Internet connection | Cloud server |
| In-context Doom (Claude.ai) | $30–$60 per hour | Internet + API access | LLM provider |

Data Takeaway: In-context gaming is currently far more expensive than cloud gaming: a single hour costs more than a month of a typical cloud gaming subscription, which itself costs more than local play. But the cost of LLM inference is falling rapidly, with some analysts predicting a 50% year-over-year reduction. If that trend holds, the gap could narrow substantially by 2027 for simple applications.

Risks, Limitations & Open Questions

While the Doom experiment is impressive, it raises serious concerns:

- Security: Running arbitrary code inside a model's context window introduces a new attack surface. A malicious user could craft prompts that cause the model to execute unintended operations, potentially leaking data from other conversations (context window poisoning). Anthropic's safety systems are designed to prevent this, but the Doom demo shows that the model can be coerced into executing complex state machines — a capability that could be exploited.
- Resource consumption: Each in-context execution consumes massive computational resources. If thousands of users run games simultaneously, the load on Anthropic's servers could degrade performance for all users. This is a scalability challenge that current infrastructure is not designed to handle.
- Model reliability: LLMs are probabilistic, not deterministic. The same input can produce different outputs, meaning the game state could become corrupted if the model 'hallucinates' a pixel or a game rule. In the Doom demo, the developer implemented a checksum system to detect and correct errors, but this adds overhead and reduces performance.
- Ethical concerns: Should AI models be used as general-purpose computers? This blurs the line between AI and traditional software, raising questions about accountability. If a game inside Claude.ai crashes or causes harm (e.g., a seizure-inducing flash), who is responsible — the developer, the user, or Anthropic?
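
The article does not describe how the developer's checksum system works. As one plausible interpretation, a harness outside the model could reject malformed frames, retry, and keep a CRC of each accepted frame so that later copies of the state can be checked for drift. Everything in the sketch below (the names, the hex-digit frame format, the retry policy) is hypothetical.

```python
# Sketch: validate model-generated frames and fingerprint accepted ones.
# This is an assumed design, not the developer's actual scheme.
import zlib

ALLOWED = set("0123456789abcdef")  # 16-color palette as hex digits

def frame_checksum(frame_text):
    """CRC32 of the serialized frame, as 8 hex digits."""
    return format(zlib.crc32(frame_text.encode("utf-8")), "08x")

def has_valid_shape(frame_text, width, height):
    """True if the text is exactly `height` rows of `width` palette digits."""
    rows = frame_text.split("\n")
    return len(rows) == height and all(
        len(row) == width and set(row) <= ALLOWED for row in rows
    )

def generate_checked(generate, width, height, max_attempts=3):
    """Call generate() until it returns a well-formed frame; return (frame, crc)."""
    for _ in range(max_attempts):
        frame = generate()
        if has_valid_shape(frame, width, height):
            return frame, frame_checksum(frame)
    raise ValueError("no well-formed frame after retries")

# Simulated model output: a truncated frame first, then a valid one.
attempts = iter(["0f3\n77", "0f3\n771"])
frame, crc = generate_checked(lambda: next(attempts), width=3, height=2)
print(frame_checksum(frame) == crc)       # unchanged copy matches
print(frame_checksum("0f3\n770") == crc)  # altered copy does not
```

A check like this catches structural corruption and later drift in a stored frame, but not a well-formed frame that depicts the wrong game state. Each retry also costs another full generation, which is the overhead the article mentions.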

AINews Verdict & Predictions

This experiment is a watershed moment. It proves that LLMs are not just text generators — they are nascent general-purpose computers. Our editorial stance is clear: this capability will be productized within 18 months, and it will fundamentally alter the competitive landscape of both AI and software.

Predictions:
1. By Q2 2027, at least one major AI provider (likely Anthropic or OpenAI) will launch a 'developer runtime' feature that allows users to run simple applications inside the context window. This will initially target education and productivity tools, not gaming.
2. By 2028, in-context execution will achieve 5–10 FPS for 2D games, thanks to specialized inference hardware and model distillation. 3D games like Doom will remain at 1–2 FPS.
3. The first security exploit targeting in-context execution will be discovered within 12 months, prompting a wave of new safety research. This will slow adoption but ultimately lead to more robust systems.
4. Traditional app stores (Apple App Store, Google Play) will begin to see competition from AI-native platforms, where apps are 'prompts' rather than binaries. This will be a slow shift, but the Doom demo is the first shot.

What to watch next: Keep an eye on GitHub repos like 'llm-doom' and 'context-engine' for community-driven improvements. Also monitor Anthropic's developer blog for any mention of 'context execution' or 'runtime mode' — that will be the signal that the company is moving on this.

The wall between AI and classical computing is not just cracking — it's being blown open, one pixel at a time.
