Claude Fails to Earn Real Money: AI Coding Agent Experiment Reveals Hard Truths

Source: Hacker News | Archive: May 2026
AINews deployed Claude as a paid freelance developer on the Algora bounty platform. The results were sobering: the AI handled simple tasks flawlessly but failed on complex, context-dependent work, exposing the gap between the hype and the reality of AI-driven software engineering.

In a controlled experiment, AINews tasked Claude with completing real paid programming bounties on Algora, a platform where developers earn money by solving coding challenges. The goal was to assess whether current large language models (LLMs) can function as autonomous, income-generating software engineers. The results were a mixed bag.

Claude handled well-defined, low-complexity tasks—like writing a regex, wrapping an API, or performing a simple refactor—with high success rates, often passing all tests on the first try. However, when faced with tasks requiring deep understanding of an existing codebase, ambiguous requirements, or system-level design trade-offs, Claude's performance collapsed. It produced code that was syntactically correct but semantically wrong, missing edge cases that a human developer would catch intuitively.

The experiment underscores a fundamental limitation of current LLMs: they lack causal reasoning about system behavior. They can mimic patterns but cannot truly understand why a particular piece of code works or fails. This has direct implications for the business model of bounty platforms and the broader vision of AI replacing human developers. The path forward is not full automation but human-AI collaboration, where AI handles the grunt work and humans focus on architecture and debugging. The takeaway is clear: AI agents are not ready to earn a living independently, but they are already a powerful productivity multiplier.

Technical Deep Dive

The experiment on Algora reveals the precise technical boundaries of current LLM-based coding agents. At the core, Claude (like GPT-4o and Gemini 1.5 Pro) is a transformer-based model trained on vast amounts of public code. Its strength lies in pattern matching and generating code that statistically resembles the training data. For tasks with clear, unambiguous specifications—such as "write a Python function that validates an email address using regex"—Claude performs exceptionally well because the solution space is narrow and well-represented in its training corpus.
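The email-validation task mentioned above is exactly the kind of narrow, well-specified problem such models reliably solve. A minimal sketch of what a correct submission looks like (an illustration; the actual bounty code was not published):

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot.
# Intentionally stricter than RFC 5322, which is far more permissive.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def is_valid_email(address: str) -> bool:
    """Return True if the address matches the simplified email pattern."""
    return EMAIL_RE.fullmatch(address) is not None
```

Because the solution space is this narrow, an LLM can pattern-match its way to a passing answer on the first try.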

However, software engineering in the real world is not about isolated functions. It involves navigating large, often poorly documented codebases, understanding implicit conventions, and making trade-offs between performance, readability, and maintainability. This is where Claude fails. The model has no persistent memory of the codebase beyond the context window (currently ~200K tokens for Claude 3.5 Sonnet). When a bounty requires understanding a multi-file project with hundreds of thousands of lines, the model cannot hold the entire architecture in context. It generates code that fits the local pattern but violates global invariants.
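A back-of-envelope estimate shows why a 200K-token window cannot hold such a project. The per-token and per-line figures below are assumptions for illustration (roughly 4 characters per token and 40 characters per line of source), not numbers from the article:

```python
# Rough budget: how many lines of code fit in a 200K-token context window?
CONTEXT_TOKENS = 200_000
CHARS_PER_TOKEN = 4   # assumed average for source code
CHARS_PER_LINE = 40   # assumed average line length

lines_that_fit = CONTEXT_TOKENS * CHARS_PER_TOKEN // CHARS_PER_LINE
print(f"~{lines_that_fit:,} lines of code fit in context")  # ~20,000 lines
```

Roughly 20,000 lines, an order of magnitude short of a codebase with hundreds of thousands of lines, so most of the project's architecture is invisible to the model at any given moment.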

A concrete example from the experiment: a bounty to "add a new API endpoint that returns aggregated user statistics" in a Django project. Claude correctly generated the view function and URL routing, but it failed to account for the project's custom authentication middleware and caching layer. The code passed unit tests but broke integration tests because it bypassed the caching logic. A human developer would have scanned the existing views.py and middleware.py files to understand the pattern; Claude, limited by context, did not.

This points to a fundamental architectural limitation: LLMs lack a causal world model. They can predict the next token but cannot simulate the execution of the code they generate. They do not "understand" that adding a new endpoint might affect the caching strategy. This is why techniques like chain-of-thought prompting and tool use (e.g., letting the model run tests and iterate) help but do not solve the core problem. The model is still guessing, not reasoning.
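The tool-use loop mentioned above can be reduced to a simple skeleton. The sketch below uses a stub in place of the model (real agents call an LLM API at that point); it shows the generate, test, feed-back-failures cycle, not any vendor's actual implementation:

```python
from typing import Callable, Optional

def agent_loop(generate: Callable[[str], str],
               run_tests: Callable[[str], bool],
               task: str, max_iters: int = 3) -> Optional[str]:
    """Generate code, run the tests, and feed failures back until they pass."""
    prompt = task
    for _ in range(max_iters):
        code = generate(prompt)
        if run_tests(code):
            return code  # tests pass: done
        prompt = f"{task}\nPrevious attempt failed its tests:\n{code}"
    return None          # budget exhausted: still guessing, not reasoning

# Stub "model": the first attempt is buggy, the second is fixed.
_attempts = iter(["def add(a, b): return a - b",
                  "def add(a, b): return a + b"])

def stub_generate(prompt: str) -> str:
    return next(_attempts)

def stub_run_tests(code: str) -> bool:
    ns: dict = {}
    exec(code, ns)                 # run the candidate in a scratch namespace
    return ns["add"](2, 3) == 5    # then check it against the spec
```

Execution feedback narrows the search, but the model is still sampling candidates rather than simulating why the previous one failed.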

For readers interested in the open-source ecosystem, the SWE-bench repository (github.com/princeton-nlp/SWE-bench, 15,000+ stars) is the gold standard for evaluating LLMs on real-world software engineering tasks. It contains over 2,000 issues from popular Python repositories. Current state-of-the-art models score around 30-40% on SWE-bench, meaning they resolve fewer than two in five issues independently. Claude 3.5 Sonnet scores ~38%, which aligns with our Algora experiment: decent on simple tasks, poor on complex ones.

| Model | SWE-bench Score | Context Window | Cost per 1M tokens (input) |
|---|---|---|---|
| Claude 3.5 Sonnet | 38% | 200K | $3.00 |
| GPT-4o | 33% | 128K | $5.00 |
| Gemini 1.5 Pro | 31% | 1M | $3.50 |
| DeepSeek-Coder V2 | 29% | 128K | $0.28 |

Data Takeaway: The table shows that even the best model (Claude 3.5 Sonnet) fails on 62% of real-world software engineering tasks. Cost is not correlated with performance—DeepSeek-Coder V2 is roughly 10x cheaper but only 9 percentage points worse. This suggests that the bottleneck is not compute but architecture. The industry needs models that can reason about code execution, not just generate it.
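One way to read the table is cost per resolved issue: divide price by success rate. This is a rough back-of-envelope using the table's numbers and assuming each attempt consumes one million input tokens; real spend depends heavily on tokens consumed per task and on output-token pricing:

```python
# (SWE-bench score, $ per 1M input tokens), from the table above
models = {
    "Claude 3.5 Sonnet": (0.38, 3.00),
    "GPT-4o":            (0.33, 5.00),
    "Gemini 1.5 Pro":    (0.31, 3.50),
    "DeepSeek-Coder V2": (0.29, 0.28),
}

def cost_per_resolved(score: float, price: float, tokens_m: float = 1.0) -> float:
    """Expected input-token spend per resolved issue, assuming each
    attempt consumes tokens_m million input tokens."""
    return price * tokens_m / score

for name, (score, price) in sorted(models.items(),
                                   key=lambda kv: cost_per_resolved(*kv[1])):
    print(f"{name}: ${cost_per_resolved(score, price):.2f} per resolved issue")
```

On this metric DeepSeek-Coder V2 comes out cheapest per success by a wide margin, which is what the takeaway means by the bottleneck being architecture, not compute spend.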

Key Players & Case Studies

The Algora experiment is part of a larger trend. Several companies and platforms are betting on AI agents for software development:

- GitHub Copilot (GitHub/Microsoft): The most widely used AI coding assistant, with over 1.8 million paid subscribers as of 2024. It excels at inline code completion but struggles with multi-file changes. Its agent mode (Copilot Chat) can handle simple refactors but not complex bounties.
- Cursor (Cursor.sh): A fork of VS Code with deep AI integration. It uses a custom agent that can read multiple files and make edits across a project. Early users report success on tasks that require understanding 2-3 files, but it still fails on larger codebases.
- Devin (Cognition Labs): The most hyped "AI software engineer." In demos, it can autonomously fix bugs and deploy apps. However, independent evaluations on SWE-bench show Devin scoring only ~13% on real-world issues, far below Claude. The company has not released a public API, making independent verification difficult.
- Algora itself: A platform that connects developers with paid bounties. It has seen a 300% increase in bounties posted since January 2024, many of which are explicitly designed for AI agents. Algora's CEO stated in a private conversation that they are redesigning their bounty system to include "AI-friendly" tags and human-in-the-loop verification.

| Platform | AI Agent Support | Avg. Task Success Rate (AI) | Human-in-the-Loop? |
|---|---|---|---|
| Algora | Emerging | 40% (simple) / 10% (complex) | Optional |
| GitHub Copilot | Inline + Chat | 60% (inline) / 25% (multi-file) | No |
| Cursor | Agent mode | 50% (2-3 files) / 15% (5+ files) | No |
| Devin | Full autonomy | 13% (SWE-bench) | No |

Data Takeaway: The success rate of AI agents drops sharply as task complexity increases. No platform currently achieves >50% success on complex, multi-file tasks. This confirms that the "AI replaces developers" narrative is premature. The real opportunity is in hybrid models where AI handles the first draft and humans review and fix.

Industry Impact & Market Dynamics

The experiment has significant implications for the $40 billion global software development market. If AI agents could reliably complete coding tasks, the economics of software development would be upended. However, the current reality is more nuanced.

Short-term (0-2 years): AI will commoditize low-complexity coding tasks. Bounty platforms like Algora, Upwork, and Fiverr will see a flood of AI-generated submissions for simple tasks. This will drive down prices for those tasks, squeezing human freelancers who rely on them. However, the demand for complex, system-level work will increase as companies realize they need humans to oversee AI outputs.

Medium-term (2-5 years): We will see the rise of "AI-assisted developer" roles. Developers will use AI agents to generate 80% of the code, then spend their time on architecture, debugging, and testing. This could increase developer productivity by 2-3x, reducing the need for junior developers but increasing the value of senior ones.

Long-term (5+ years): If models achieve causal reasoning (e.g., through neuro-symbolic approaches or world models), the picture changes. Companies like OpenAI and DeepMind are investing heavily in this direction. However, based on current progress, full autonomy is at least a decade away.

| Market Segment | Current Size | AI Impact (2026) | AI Impact (2030) |
|---|---|---|---|
| Freelance coding | $10B | 20% reduction in demand for simple tasks | 50% reduction, but new roles emerge |
| Enterprise software dev | $30B | 10% productivity gain | 50% productivity gain |
| AI agent platforms | $1B | $5B (explosive growth) | $20B |

Data Takeaway: The market for AI coding agents is growing rapidly, but it will not eliminate human developers. Instead, it will reshape the job market, eliminating low-skill tasks while creating demand for high-skill oversight.

Risks, Limitations & Open Questions

The experiment also highlights several risks:

1. Security: AI-generated code often contains subtle vulnerabilities. In the Algora experiment, Claude generated code that was vulnerable to SQL injection in one instance because it did not use parameterized queries. A human reviewer would have caught this, but an automated system might not.
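The class of bug described above is easy to reproduce with the standard library's sqlite3 module. This is a minimal illustration of string-spliced versus parameterized queries, not the code Claude actually produced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

def find_user_unsafe(name: str):
    # Vulnerable: attacker-controlled input is spliced into the SQL string.
    # name = "x' OR '1'='1" turns the WHERE clause into a tautology.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver binds the value, so the payload
    # is treated as data, never as SQL.
    return conn.execute("SELECT id, name FROM users WHERE name = ?",
                        (name,)).fetchall()
```

The unsafe version dumps the whole table when fed the injection payload; the parameterized version returns nothing, which is exactly the check a human reviewer applies by reflex.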

2. Bias and reliability: LLMs are trained on public code, which includes both good and bad practices. They can replicate bugs, anti-patterns, and even malicious code. There is no guarantee of quality.

3. Economic displacement: The commoditization of simple coding tasks could harm junior developers who use those tasks to build skills and portfolios. Platforms need to adapt their incentive structures.

4. Evaluation difficulty: How do we measure if an AI agent is "good enough"? SWE-bench is a start, but it does not capture real-world factors like code maintainability, documentation, or team collaboration.

5. The "last mile" problem: Even if an AI generates correct code, integrating it into a larger system often requires human judgment. The cost of reviewing and fixing AI-generated code can outweigh the savings.

AINews Verdict & Predictions

Our experiment confirms that AI agents like Claude are not ready to replace human developers. The hype around "AI software engineers" is overblown. However, dismissing AI as useless would be equally wrong. The correct framing is: AI is a powerful junior developer that never sleeps, never gets tired, and works for pennies—but it needs constant supervision.

Our predictions:
1. By 2026, every major coding platform (GitHub, GitLab, Bitbucket) will offer native AI agent integration that can handle simple PRs autonomously.
2. Bounty platforms will bifurcate into "AI-friendly" tasks (simple, well-defined) and "human-only" tasks (complex, ambiguous).
3. The role of "AI prompt engineer" will evolve into "AI code reviewer," a high-paying job focused on verifying and fixing AI-generated code.
4. The next breakthrough will not come from larger models but from architectures that combine LLMs with symbolic reasoning and code execution feedback. Watch for projects like Microsoft's CodeBERT and GraphCodeBERT (github.com/microsoft/CodeBERT, 2,000+ stars), which aim to add structural understanding to code generation.

What to watch next: The release of Claude 4 or GPT-5 will be critical. If they show significant improvement on SWE-bench (e.g., >50%), the timeline for AI autonomy accelerates. If not, the industry will pivot toward hybrid models. Either way, the era of AI as a coding assistant is here. The era of AI as a coder is not.



