Technical Deep Dive
The feat of running Doom inside Claude.ai is not a simple emulation trick. It exploits the fundamental structure of a large language model's context window — the limited memory space where the model processes input and generates output. In this experiment, the developer encoded the entire game state as a sequence of tokens within the context window. Each frame of Doom is represented as a grid of pixel values, serialized into text. The model, when prompted with the current state and a user action (e.g., 'move forward', 'shoot'), generates the next frame as a new block of pixel tokens. This is a form of 'in-context execution' where the model acts as both the game engine and the renderer.
Key technical components:
- State encoding: The game world (player position, enemy locations, health, ammo, map geometry) is flattened into a structured text representation. For example, a 320x200 pixel frame at 256 colors contains 64,000 pixels, so at one token per pixel value it would require roughly 64,000 tokens. Claude's 200K-token context window holds only about three such frames at once, which limits the maximum resolution and complexity.
- Inference loop: The user sends a command, the model reads the current state from the context, computes the next state, and outputs the new pixel grid. This is not a pre-trained game engine; the model must infer the rules of Doom from the context alone, using its general reasoning abilities.
- Latency constraints: Each frame takes several seconds to generate on Claude's backend, resulting in a playable but sluggish experience (roughly 0.2–0.5 frames per second, or 2–5 seconds per frame, compared to Doom's original 35 FPS). The bottleneck is that current transformer models generate output autoregressively, one token at a time, so a frame made of thousands of tokens takes far longer to emit than a sentence of text. These architectures are optimized for text generation, not real-time pixel processing.
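The state-encoding step and its token budget can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual format: the one-hex-digit-per-pixel layout, the function names, and the one-token-per-pixel estimate are all assumptions made for the example.

```python
from typing import List

HEX_DIGITS = "0123456789abcdef"  # assumed: one character per 16-color palette index


def encode_frame(pixels: List[List[int]]) -> str:
    """Serialize a 2D grid of palette indices (0-15) into one text line per row."""
    return "\n".join("".join(HEX_DIGITS[p] for p in row) for row in pixels)


def decode_frame(text: str) -> List[List[int]]:
    """Parse the text form back into a grid of palette indices."""
    return [[int(ch, 16) for ch in line] for line in text.splitlines()]


def tokens_per_frame(width: int, height: int, tokens_per_pixel: float = 1.0) -> int:
    """Estimate the token cost of one frame (assumption: fixed tokens per pixel)."""
    return round(width * height * tokens_per_pixel)


def frames_that_fit(context_tokens: int, width: int, height: int) -> int:
    """How many whole frames fit in a context window of the given size."""
    return context_tokens // tokens_per_frame(width, height)
```

Under these assumptions, `frames_that_fit(200_000, 320, 200)` gives 3 and `frames_that_fit(200_000, 160, 100)` gives 12, which is one way to see why downscaling matters. Real tokenizers rarely map one character to one token, so actual counts would differ.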
For readers interested in the underlying mechanics, the open-source repository 'llm-doom' (currently 1,200+ stars on GitHub) provides a reference implementation. It uses a lightweight Python script to interface with Claude's API, managing the token budget and state serialization. The repo's README notes that the same approach could theoretically work with any LLM that supports a sufficiently large context window and strong instruction-following capabilities.
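The inference loop and token-budget management described above might look roughly like the following sketch. It is not taken from the 'llm-doom' repository: the prompt wording, the `run_loop` and `build_prompt` names, and the policy of keeping only the last few frames in context are hypothetical, and the model is passed in as a plain callable so that no specific API is assumed.

```python
from collections import deque
from typing import Callable, Deque, List


def build_prompt(history: Deque[str], action: str) -> str:
    """Assemble the text sent to the model: recent frames plus the new action."""
    parts = ["You are simulating a game. Previous frames, oldest first:"]
    for i, frame in enumerate(history):
        parts.append(f"FRAME {i}:\n{frame}")
    parts.append(f"ACTION: {action}")
    parts.append("Output only the next frame.")
    return "\n\n".join(parts)


def run_loop(
    model: Callable[[str], str],
    first_frame: str,
    actions: List[str],
    max_frames_in_context: int = 3,
) -> List[str]:
    """Feed each action to the model and collect the generated frames.

    The deque drops the oldest frame automatically, which keeps the prompt
    within a fixed token budget (hypothetical policy for this sketch).
    """
    history: Deque[str] = deque([first_frame], maxlen=max_frames_in_context)
    frames = [first_frame]
    for action in actions:
        next_frame = model(build_prompt(history, action))
        history.append(next_frame)
        frames.append(next_frame)
    return frames
```

In practice `model` would wrap a call to an LLM API. Passing it as a callable keeps the loop testable with a stand-in function and makes the budget policy easy to swap out.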
Data Table: Performance Metrics of In-Context Doom vs. Native Doom
| Metric | Native Doom (1993) | In-Context Doom (Claude.ai) |
|---|---|---|
| Frame rate | 35 FPS | 0.2–0.5 FPS |
| Context window usage | N/A | ~64K tokens per frame at 320x200 (~16K at 160x100, assuming one token per pixel) |
| Latency per action | <30 ms | 2–5 seconds |
| Resolution | 320x200 | 160x100 (downscaled) |
| Color depth | 256 colors | 16 colors (reduced) |
| Total cost per minute | $0 (local hardware) | ~$0.50 (API usage) |
Data Takeaway: The performance gap is staggering — native Doom runs 70–175 times faster than the in-context version. However, the cost and latency are not the point. The experiment demonstrates that an LLM can *functionally* replace a game engine, even if inefficiently. As model inference speeds improve (e.g., through speculative decoding or specialized hardware), this gap will narrow, potentially making in-context execution viable for simple applications within 2–3 years.
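The 70–175x figure follows directly from the frame rates in the table. A short sketch of the arithmetic, using only the numbers quoted above (the function names are illustrative):

```python
NATIVE_FPS = 35.0  # Doom's original frame rate, as quoted in the table


def seconds_per_frame(fps: float) -> float:
    """Convert a frame rate into the time spent on each frame."""
    return 1.0 / fps


def speedup(native_fps: float, llm_fps: float) -> float:
    """How many times faster the native engine produces frames."""
    return native_fps / llm_fps


for llm_fps in (0.5, 0.2):
    print(
        f"{llm_fps} FPS -> {seconds_per_frame(llm_fps):.1f} s/frame, "
        f"native is {speedup(NATIVE_FPS, llm_fps):.0f}x faster"
    )
```

The same conversion puts native Doom at about 28.6 ms per frame, in line with the sub-30 ms latency listed in the table.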
Key Players & Case Studies
The developer behind this experiment, known pseudonymously as 'gamer-ai', is an independent researcher who previously demonstrated running a simple chess engine inside GPT-4. Their work builds on a growing body of research into 'in-context learning as computation', a concept explored by teams at Google DeepMind and Anthropic. Notably, Anthropic's own research on 'contextual reasoning' shows that Claude 3.5 Sonnet can maintain coherent state across 100K+ tokens, which is essential for this kind of application.
Other relevant case studies:
- OpenAI's Code Interpreter: A product that allows GPT-4 to execute Python code in a sandboxed environment. This is conceptually similar but relies on an external runtime, not the model itself. The Doom experiment goes further by using the model's own inference as the runtime.
- Anthropic's Claude 3.5 Opus: The model used in the experiment. Its 200K token context window and strong instruction-following make it uniquely suited for this task. Anthropic has not officially endorsed or commented on the experiment, but internal sources suggest the company is monitoring such use cases for potential product features.
- Google's Gemini 1.5 Pro: With a 1 million token context window, Gemini could theoretically run a higher-resolution version of Doom. However, its architecture is less optimized for pixel-level generation, and no public demonstration exists yet.
Data Table: LLM Capabilities for In-Context Execution
| Model | Context Window | Max Tokens/Frame (estimated) | Suitability for Game Execution |
|---|---|---|---|
| Claude 3.5 Opus | 200K | 64K | High (used in this demo) |
| GPT-4 Turbo | 128K | 40K | Medium (shorter window, weaker state tracking) |
| Gemini 1.5 Pro | 1M | 320K | High (theoretically, but no demo) |
| Llama 3.1 405B | 128K | 40K | Low (open-source, but requires local hardware) |
Data Takeaway: Claude currently leads in practical in-context execution due to its combination of large context window and reliable state maintenance. Gemini's 1M token window offers the highest theoretical ceiling, but its performance on pixel-level generation tasks is unproven. The race is now on to see which model can first achieve a playable frame rate (at least 5 FPS) for in-context gaming.
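The 'Max Tokens/Frame' estimates in the table work out to roughly 32% of each model's context window. That ratio is inferred from the table's numbers rather than stated anywhere, so treat it as an assumption. The sketch below reproduces the estimates and, assuming one token per pixel and Doom's 8:5 aspect ratio, converts each budget into the largest frame that would fit.

```python
import math
from typing import Tuple

# Context window sizes as listed in the table above.
CONTEXT_WINDOWS = {
    "Claude 3.5 Opus": 200_000,
    "GPT-4 Turbo": 128_000,
    "Gemini 1.5 Pro": 1_000_000,
    "Llama 3.1 405B": 128_000,
}

FRAME_PERCENT = 32  # assumption: share of the window spent on a single frame


def max_frame_tokens(context_tokens: int, percent: int = FRAME_PERCENT) -> int:
    """Token budget for one frame, using integer arithmetic to avoid rounding drift."""
    return context_tokens * percent // 100


def largest_frame(tokens: int) -> Tuple[int, int]:
    """Largest 8:5 frame (width, height) whose pixel count fits in the budget."""
    k = math.isqrt(tokens // 40)  # width = 8k, height = 5k, so pixels = 40 * k^2
    return 8 * k, 5 * k


for name, window in CONTEXT_WINDOWS.items():
    budget = max_frame_tokens(window)
    print(name, budget, largest_frame(budget))
```

Under these assumptions a 200K window corresponds to exactly one 320x200 frame, a 128K window to about 256x160, and a 1M window to about 712x445.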
Industry Impact & Market Dynamics
This experiment is more than a novelty — it signals a potential paradigm shift in how we think about AI platforms. If LLMs can serve as general-purpose runtimes, the implications for the software industry are massive:
- New business models: AI companies could charge for 'compute time' within the context window, similar to cloud computing but at the model level. A game that runs inside Claude.ai would consume tokens for every frame, creating a usage-based revenue stream. Anthropic could offer a 'developer mode' where users pay per minute of in-context execution.
- Disintermediation of app stores: If an AI assistant can run any lightweight application without installation, traditional app stores and operating systems become less relevant. A user could play a game, edit a document, or run a simulation entirely within a chat session.
- Edge computing redefined: In-context execution could enable AI-powered devices to run complex applications without local hardware. A smart speaker could host a game or a productivity tool, using the cloud LLM as its engine. This reduces the need for powerful on-device chips.
However, the economics are currently unfavorable. Running Doom for one minute costs approximately $0.50 in API fees (based on Claude's pricing of $15 per million input tokens and $75 per million output tokens). At that rate, a 30-minute gaming session would cost $15 — far more than buying the original game outright. For mass adoption, token costs must drop by at least 10x.
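The cost arithmetic can be checked with a small calculator. The prices are the per-million-token rates quoted above; the function names and the idea of splitting each frame into input and output tokens are assumptions made for this sketch.

```python
INPUT_PRICE_PER_MTOK = 15.0   # dollars per million input tokens, as quoted above
OUTPUT_PRICE_PER_MTOK = 75.0  # dollars per million output tokens, as quoted above


def frame_cost(
    input_tokens: int,
    output_tokens: int,
    in_price: float = INPUT_PRICE_PER_MTOK,
    out_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Dollar cost of one model call that reads and writes the given token counts."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def session_cost(cost_per_minute: float, minutes: float) -> float:
    """Total cost of a play session at a fixed per-minute rate."""
    return cost_per_minute * minutes


def output_tokens_for_budget(dollars: float, out_price: float = OUTPUT_PRICE_PER_MTOK) -> int:
    """How many output tokens a dollar amount buys, ignoring input cost."""
    return int(dollars / out_price * 1_000_000)


print(session_cost(0.50, 30))          # cost of a 30-minute session
print(session_cost(0.50, 60))          # hourly rate
print(output_tokens_for_budget(0.50))  # output tokens bought by one minute's budget
```

At the quoted output price, $0.50 buys about 6,666 output tokens, which suggests the per-minute estimate assumes a much more compact frame encoding than one token per pixel.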
Data Table: Cost Comparison of In-Context vs. Traditional Gaming
| Gaming Method | Cost per Hour | Hardware Required | Platform Dependency |
|---|---|---|---|
| Native Doom (1993) | $0 | PC or console | Standalone |
| Cloud gaming (e.g., GeForce Now) | $10–$20 | Internet connection | Cloud server |
| In-context Doom (Claude.ai) | $30–$60 | Internet + API access | LLM provider |
Data Takeaway: Taking matching ends of each range, in-context gaming is currently about 3x more expensive than cloud gaming ($30 vs. $10 per hour at the low end, $60 vs. $20 at the high end), and up to 6x in the worst case. Cloud gaming is itself more expensive than local play. But the cost of LLM inference is falling rapidly, with some analysts predicting a 50% year-over-year reduction. By 2027, in-context execution could become cost-competitive with cloud gaming for simple applications.
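To see what a 50% year-over-year reduction would imply, the projection can be written as a simple exponential decay. The reduction rate is the analysts' prediction quoted above, not a measured value, and the function names are illustrative.

```python
import math


def projected_cost(current: float, years: float, annual_reduction: float = 0.5) -> float:
    """Cost after the given number of years at a constant annual reduction."""
    return current * (1 - annual_reduction) ** years


def years_to_reach(current: float, target: float, annual_reduction: float = 0.5) -> float:
    """Years needed for the cost to fall from current to target."""
    return math.log(target / current) / math.log(1 - annual_reduction)


# Hourly costs from the table: in-context $30-$60, cloud gaming $10-$20.
print(projected_cost(30, 2), projected_cost(60, 2))  # in-context range after two years
print(round(years_to_reach(60, 10), 2))              # worst case vs. best cloud price
print(round(years_to_reach(1.0, 0.1), 2))            # time for a 10x price drop
```

If the 50% rate held, the $30–$60 range would fall to $7.50–$15 per hour after two years, and a full 10x drop would take a little over three years.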
Risks, Limitations & Open Questions
While the Doom experiment is impressive, it raises serious concerns:
- Security: Running arbitrary code inside a model's context window introduces a new attack surface. A malicious user could craft prompts that cause the model to execute unintended operations, potentially leaking data from other conversations (context window poisoning). Anthropic's safety systems are designed to prevent this, but the Doom demo shows that the model can be coerced into executing complex state machines — a capability that could be exploited.
- Resource consumption: Each in-context execution consumes massive computational resources. If thousands of users run games simultaneously, the load on Anthropic's servers could degrade performance for all users. This is a scalability challenge that current infrastructure is not designed to handle.
- Model reliability: LLMs are probabilistic, not deterministic. The same input can produce different outputs, meaning the game state could become corrupted if the model 'hallucinates' a pixel or a game rule. In the Doom demo, the developer implemented a checksum system to detect and correct errors, but this adds overhead and reduces performance.
- Ethical concerns: Should AI models be used as general-purpose computers? This blurs the line between AI and traditional software, raising questions about accountability. If a game inside Claude.ai crashes or causes harm (e.g., a seizure-inducing flash), who is responsible — the developer, the user, or Anthropic?
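The checksum approach mentioned under model reliability is not described in detail, so the following is only one plausible shape for it. It assumes the model is asked to append a one-digit checksum to each row (the sum of the row's hex pixel values modulo 16), and that the harness rejects and regenerates any frame that fails the check. The format and all names are hypothetical.

```python
from typing import Callable, Optional

PALETTE_SIZE = 16  # assumed 16-color palette, one hex digit per pixel


def row_checksum(row: str) -> str:
    """Sum of the row's hex pixel values modulo 16, as a single hex digit."""
    return format(sum(int(ch, 16) for ch in row) % PALETTE_SIZE, "x")


def add_checksums(frame: str) -> str:
    """Append '|<checksum>' to every row of a frame."""
    return "\n".join(f"{row}|{row_checksum(row)}" for row in frame.splitlines())


def verify_frame(text: str, width: int, height: int) -> Optional[str]:
    """Return the frame without checksums if every row is valid, else None."""
    lines = text.splitlines()
    if len(lines) != height:
        return None
    rows = []
    for line in lines:
        row, sep, check = line.partition("|")
        if not sep or len(row) != width or len(check) != 1:
            return None
        try:
            if row_checksum(row) != check.lower():
                return None
        except ValueError:  # a character that is not a hex digit
            return None
        rows.append(row)
    return "\n".join(rows)


def generate_with_retry(
    generate: Callable[[], str], width: int, height: int, max_attempts: int = 3
) -> Optional[str]:
    """Call the generator until a frame passes verification or attempts run out."""
    for _ in range(max_attempts):
        frame = verify_frame(generate(), width, height)
        if frame is not None:
            return frame
    return None
```

A one-digit checksum catches malformed rows and many single-pixel errors, but it misses errors that happen to cancel out, and each retry costs a full frame of tokens. That is the overhead the bullet above refers to.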
AINews Verdict & Predictions
This experiment is a watershed moment. It proves that LLMs are not just text generators — they are nascent general-purpose computers. Our editorial stance is clear: this capability will be productized within 18 months, and it will fundamentally alter the competitive landscape of both AI and software.
Predictions:
1. By Q2 2027, at least one major AI provider (likely Anthropic or OpenAI) will launch a 'developer runtime' feature that allows users to run simple applications inside the context window. This will initially target education and productivity tools, not gaming.
2. By 2028, in-context execution will achieve 5–10 FPS for 2D games, thanks to specialized inference hardware and model distillation. 3D games like Doom will remain at 1–2 FPS.
3. The first security exploit targeting in-context execution will be discovered within 12 months, prompting a wave of new safety research. This will slow adoption but ultimately lead to more robust systems.
4. Traditional app stores (Apple App Store, Google Play) will begin to see competition from AI-native platforms, where apps are 'prompts' rather than binaries. This will be a slow shift, but the Doom demo is the first shot.
What to watch next: Keep an eye on GitHub repos like 'llm-doom' and 'context-engine' for community-driven improvements. Also monitor Anthropic's developer blog for any mention of 'context execution' or 'runtime mode' — that will be the signal that the company is moving on this.
The wall between AI and classical computing is not just cracking — it's being blown open, one pixel at a time.