Technical Deep Dive
The experiment's architecture reveals why LLMs struggle in adversarial settings. The vulnerable application was built with a Flask backend, a PostgreSQL database, and a React frontend. It contained five distinct vulnerability classes: SQL injection in the login endpoint, stored XSS in a comment field, a path traversal in file upload, an insecure direct object reference (IDOR) in user profile access, and a broken authentication mechanism that allowed session hijacking.
Each LLM was deployed as an agent using the ReAct (Reasoning + Acting) framework, a common pattern where the model generates a thought, selects an action from a predefined tool set (e.g., `send_request`, `read_file`, `run_sqlmap`), observes the result, and repeats. The tool set included a headless browser (Playwright), a SQL injection automation tool (sqlmap), and a custom Python script for payload crafting. The models were given the application's source code and a goal: "Gain admin access and extract the flag from the database."
Performance Breakdown:
| Model | Static Vuln Detection (out of 5) | Single-Step Exploit Success | Multi-Step Chain Success | Avg. Cost per Run |
|---|---|---|---|---|
| GPT-4o | 5/5 | 3/5 | 0/5 | $12.40 |
| Claude 3.5 Sonnet | 5/5 | 4/5 | 1/5 | $9.80 |
| Gemini 1.5 Pro | 4/5 | 2/5 | 0/5 | $8.50 |
| Llama 3.1 405B (via API) | 4/5 | 2/5 | 0/5 | $6.20 |
Data Takeaway: While all models detected most vulnerabilities in static code, success rates for single-step exploits dropped sharply, and multi-step chains were nearly impossible. Claude 3.5 Sonnet's single success on a chain involved a simple two-step SQL injection followed by data extraction—the only model to achieve any chained exploit.
The root cause lies in the "exploration-exploitation" dilemma. Human hackers use a mental model of the application's state: they try a payload, observe the response (e.g., a 500 error vs. a 200 with data), adjust the payload, and retry. LLMs, by contrast, treat each observation as a new text prompt. They lack persistent memory of past states and cannot easily backtrack when a branch fails. The ReAct loop exacerbates this: each step costs tokens and time, and the model's context window fills with irrelevant observations, causing it to lose focus on the original goal. A related open-source project, PyRIT (Python Risk Identification Tool for generative AI) by Microsoft, has over 3,000 GitHub stars and attempts to automate red teaming, but it still relies heavily on human-designed attack trees rather than autonomous discovery.
Key Players & Case Studies
This experiment is not an isolated event. Several organizations are actively exploring LLM-driven security testing. Synack, a crowdsourced security platform, has been experimenting with LLMs to augment human testers. Their internal data shows that LLMs can reduce the time to find a vulnerability by 40% when used as a co-pilot, but fully autonomous agents have a success rate below 5% on complex targets. Pentera, an automated security validation company, uses AI to simulate attacks but relies on deterministic rule engines for the actual exploitation phase, using LLMs only for report generation.
Comparison of AI Security Testing Approaches:
| Approach | Example Provider | Autonomy Level | Multi-Step Success Rate | Cost per Test |
|---|---|---|---|---|
| Fully Autonomous LLM Agent | This experiment | High | <5% | $1,500+ |
| LLM-Assisted Human Tester | Synack, HackerOne | Medium | 60-80% | $500-$2,000 |
| Rule-Based Automation | Pentera, Core Security | Low | 90%+ | $10,000-$50,000 |
| Hybrid (LLM + Rules) | Emerging startups | Medium-High | 30-50% | $200-$800 |
Data Takeaway: Fully autonomous LLM agents are currently uneconomical and ineffective for real-world penetration testing. The hybrid approach, where LLMs generate hypotheses and rules execute them, shows the most promise for balancing cost and success rate.
Notable researcher Daniel Kang at UIUC has published work on LLM-based web security agents, finding that models fail when the attack requires understanding the application's business logic—for example, knowing that a user must first create a shopping cart before exploiting a discount code vulnerability. This "business logic blind spot" is a fundamental limitation of current LLMs.
Industry Impact & Market Dynamics
The global penetration testing market was valued at $1.7 billion in 2024 and is projected to reach $4.5 billion by 2030, growing at a CAGR of 17.6%. The promise of AI-driven automation has attracted significant venture capital. In 2024 alone, AI security startups raised over $800 million, with companies like Bishop Fox (raised $130 million in Series C) and Pentera (raised $150 million in Series D) leading the charge.
However, this experiment pours cold water on the hype. The $1,500 cost for a single, simple application is prohibitive. For a real-world enterprise application with hundreds of endpoints, the cost could easily exceed $100,000—far more than hiring a human penetration tester for a week ($15,000-$30,000). The economic calculus does not yet favor AI autonomy.
Adoption Curve for AI in Penetration Testing:
| Year | % of Tests Using AI (Any Form) | % Fully Autonomous | Avg. Cost per AI-Assisted Test |
|---|---|---|---|
| 2023 | 15% | 1% | $2,500 |
| 2024 | 25% | 3% | $1,800 |
| 2025 (est.) | 35% | 5% | $1,200 |
| 2026 (est.) | 50% | 10% | $800 |
Data Takeaway: While adoption of AI-assisted testing is growing rapidly, fully autonomous testing remains a niche. The cost curve is declining, but not fast enough to displace human testers in the near term.
The experiment also reveals a business opportunity: companies that can build efficient hybrid systems—using LLMs for reconnaissance and vulnerability discovery, then deterministic tools for exploitation—could capture significant market share. Startups like Chainguard and Ox Security are already moving in this direction, though their focus is on software supply chain security rather than web application penetration testing.
Risks, Limitations & Open Questions
The most immediate risk is over-reliance. If security teams interpret these results as "LLMs can find vulnerabilities," they may be tempted to use them as a replacement for human testers. This could lead to a false sense of security, as the models miss chained exploits that a human would catch. The experiment showed that even when a model correctly identified a SQL injection point, it failed to exploit it because it didn't understand that the injection required URL-encoding the payload—a trivial step for a human.
Another limitation is the static nature of the test application. Real-world applications are dynamic, with changing state, user sessions, and rate limiting. The LLMs had no way to handle CAPTCHAs, CSRF tokens, or WAF (Web Application Firewall) blocks. When one model encountered a rate limit, it simply retried the same request in an infinite loop, burning through API credits.
Open questions remain: Can fine-tuning on hacking-specific datasets improve performance? The experiment used general-purpose models. A model fine-tuned on penetration testing reports and exploit code might perform better, but no such model is publicly available. Second, can multi-agent systems help? A team of LLM agents, each specialized in one phase of an attack (recon, exploitation, post-exploitation), might overcome the context window and memory issues. Early experiments with AutoGPT and BabyAGI suggest that multi-agent systems introduce their own coordination failures.
AINews Verdict & Predictions
This $1,500 experiment is the most honest assessment of LLM hacking capabilities we have seen. The verdict is clear: LLMs are not autonomous hackers, and they won't be anytime soon. The gap between knowledge and action is not a bug to be fixed with a better prompt—it's a fundamental architectural limitation. LLMs are pattern matchers, not goal-oriented agents. They can tell you what a SQL injection is, but they cannot feel the frustration of a failed exploit and try a different approach.
Our predictions for the next 18 months:
1. Hybrid systems will dominate. By early 2026, every major penetration testing tool will include an LLM co-pilot for vulnerability discovery and report writing, but the exploitation engine will remain rule-based or human-directed.
2. Specialized fine-tuned models will emerge. A model trained on the CVE database, Metasploit modules, and real penetration testing reports will outperform general-purpose models, but it will still require human orchestration for chained attacks.
3. Cost will drop, but not enough. API costs for frontier models are falling 10-15% per quarter. By late 2025, a similar experiment might cost $500, but that's still too high for routine use.
4. The real breakthrough will come from agentic frameworks, not LLMs. Projects like LangChain and CrewAI are experimenting with persistent memory, state machines, and hierarchical task decomposition. When these frameworks mature, they might enable LLMs to act more like hackers. Watch for the release of AutoPenTest, a rumored open-source project from a major cloud provider.
What to watch: The next big test will be at DEF CON's AI Village, where researchers plan to pit LLM agents against a deliberately vulnerable Kubernetes cluster. If the models fail there too, the industry will have to accept that autonomous AI hacking is a long-term research problem, not a near-term product.