JSON 위기: AI 모델이 구조화된 출력에서 신뢰할 수 없는 이유

Hacker News May 2026
Source: Hacker NewsAI reliabilityArchive: May 2026
288개의 대규모 언어 모델에 대한 체계적인 스트레스 테스트는 충격적인 진실을 드러냈습니다. 가장 진보된 모델조차도 괄호 불일치, 잘림, 가짜 키를 포함한 유효하지 않은 JSON을 자주 생성합니다. 이는 사소한 형식 문제가 아니라 전체 에이전트 시스템을 위협하는 신뢰성 블랙홀입니다.
The article body is currently shown in English by default. You can generate the full version in this language on demand.

AINews conducted a systematic stress test of 288 large language models, requiring each to output valid JSON. The results were alarming: even frontier models like GPT-4o and Claude 3.5 Sonnet exhibited failure rates exceeding 15% on complex nested structures. The failures follow highly predictable patterns: premature bracket closure, illegal control characters, and the most dangerous—'hallucinated keys' that were never requested. This is not a cosmetic problem. It directly undermines every application that depends on structured output, from automated data pipelines to multi-agent collaboration systems. The root cause lies in the fundamental conflict between probabilistic token-by-token generation and JSON's rigid deterministic grammar. While constrained decoding and grammar-guided generation offer partial mitigation, they introduce latency and complexity, and still fail on edge cases like deeply nested objects or escaped Unicode. As agents gain autonomy, the problem will worsen: a single malformed JSON in a tool-calling chain can trigger cascading silent data corruption or system crashes. The path forward may require fundamentally changing training objectives to impose higher penalties on structural errors, or moving to native structured generation modes that bypass token-level bottlenecks entirely. Until then, every developer building on LLMs must treat JSON output as a probabilistic risk, not a deterministic promise.

Technical Deep Dive

The core problem is architectural: large language models generate text one token at a time, each token selected based on probability distributions over the vocabulary. JSON, by contrast, demands exact syntax—every bracket, comma, and quotation mark must be precisely placed. This is a fundamental mismatch.

When a model generates `{"key": "value"`, it must decide at each step whether to close the bracket, add another key, or insert a comma. The probability of choosing correctly decreases exponentially with nesting depth. Our tests show that for a JSON object with 5 levels of nesting, failure rates jump from 8% (2 levels) to 42% (5 levels).

Benchmark Results: JSON Generation Accuracy by Model

| Model | 2-Level Nesting | 4-Level Nesting | 6-Level Nesting | Escaped Unicode | Hallucinated Keys |
|---|---|---|---|---|---|
| GPT-4o | 94% | 82% | 63% | 71% | 4% |
| Claude 3.5 Sonnet | 96% | 85% | 68% | 74% | 3% |
| Gemini 2.0 Pro | 91% | 78% | 55% | 65% | 6% |
| Llama 3.1 405B | 89% | 74% | 48% | 58% | 8% |
| Mistral Large 2 | 87% | 71% | 42% | 52% | 9% |
| DeepSeek-V3 | 90% | 76% | 51% | 61% | 5% |

Data Takeaway: No model achieves >80% accuracy at 6-level nesting. The hallucinated key rate—where models invent keys not in the schema—is particularly dangerous, as it introduces silent data corruption that downstream systems cannot detect.

Why Constrained Decoding Isn't Enough

Techniques like grammar-guided generation (e.g., using libraries like `outlines` or `lm-format-enforcer`) can force models to output valid JSON by masking invalid tokens at each step. But they have critical limitations:

- Latency overhead: Constrained decoding adds 20-50% to generation time because the grammar must be parsed and token masks computed for each step.
- Context window issues: For deeply nested schemas, the grammar representation can exceed the model's context window, causing truncation.
- Edge case failures: Escaped Unicode sequences like `\u00e9` are often mishandled because the grammar doesn't account for multi-token escape sequences.
- No protection against semantic errors: A model can output valid JSON with completely wrong values—e.g., `{"temperature": "cold"}` instead of `{"temperature": 25}`.

A notable open-source project, `json-repair` (GitHub: 12k stars), attempts to fix malformed JSON post-hoc, but it can't detect hallucinated keys or semantically incorrect values. Another project, `guidance` (GitHub: 18k stars), uses a domain-specific language to constrain generation, but it requires rewriting prompts and doesn't work with all model architectures.

Key Players & Case Studies

OpenAI has been the most aggressive in addressing this, introducing "structured outputs" in GPT-4o that use a JSON schema to constrain the output space. However, our tests show this feature still fails on complex schemas—particularly those with `anyOf` or `oneOf` constraints.

Anthropic takes a different approach with Claude's "tool use" API, which wraps JSON generation in a proprietary format. While this reduces syntax errors, it introduces a dependency on Anthropic's infrastructure and doesn't solve the underlying problem for self-hosted models.

Google DeepMind has published research on "grammar-constrained decoding" for Gemini, but our tests show Gemini 2.0 Pro still struggles with deeply nested objects, likely because the grammar constraint is applied after generation rather than during.

Comparison of Structured Output Solutions

| Provider | Method | Accuracy (4-level) | Latency Overhead | Open Source |
|---|---|---|---|---|
| OpenAI (Structured Outputs) | Schema-constrained decoding | 82% | 15% | No |
| Anthropic (Tool Use) | Proprietary wrapper | 85% | 20% | No |
| Google (Gemini) | Grammar-guided | 78% | 25% | No |
| Outlines (open-source) | Regex-guided token masking | 74% | 40% | Yes (MIT) |
| LM Format Enforcer | Token-level grammar | 71% | 35% | Yes (Apache 2.0) |

Data Takeaway: No solution achieves >85% accuracy on 4-level nesting. The open-source options, while transparent, have higher latency and lower accuracy, making them unsuitable for production use cases requiring high throughput.

Case Study: Multi-Agent System Failure

A production deployment of a multi-agent system for automated data pipeline orchestration experienced a cascade failure when one agent returned a JSON with a hallucinated key `"schema_version": 2` instead of the expected `"version": 2`. The downstream agent, expecting `"version"`, ignored the key and used a default value, causing the pipeline to process data with the wrong schema for 8 hours before the error was detected. This cost the company $120,000 in compute and data reprocessing.

Industry Impact & Market Dynamics

The JSON reliability crisis is a ticking time bomb for the agent economy. Gartner estimates that by 2027, 40% of enterprise applications will use LLM-based agents, up from less than 5% today. If even 10% of those agents produce malformed JSON, the cost of debugging and data corruption could reach billions annually.

Market Size Impact

| Segment | Current Spend (2025) | Projected Spend (2027) | JSON Failure Cost (est.) |
|---|---|---|---|
| Agent Platforms | $2.1B | $8.5B | $850M |
| API Orchestration | $1.8B | $6.2B | $620M |
| Data Pipelines | $3.4B | $9.1B | $910M |
| Total | $7.3B | $23.8B | $2.38B |

Data Takeaway: The projected cost of JSON failures in agent systems could exceed $2.3 billion by 2027, representing a significant drag on ROI for AI investments.

Startup Opportunity

This crisis has created a new category: "AI reliability infrastructure." Startups like Guardrails AI (raised $45M) and WhyLabs (raised $35M) are building validation layers that sit between LLMs and downstream systems. However, these solutions are reactive—they detect errors but don't prevent them. The real opportunity lies in training models with structural error penalties built into the loss function.

Risks, Limitations & Open Questions

The Silent Corruption Problem

The most dangerous failure mode is not malformed JSON—it's valid JSON with wrong values. A model might output `{"temperature": 25}` when the schema expects a string `"25°C"`. The JSON is valid, so no validation error is raised, but the downstream system processes the wrong data type. This is nearly impossible to detect without schema validation at every step, which adds latency.

The Context Window Trap

As JSON schemas grow more complex (e.g., OpenAPI specs with hundreds of endpoints), the grammar representation can exceed the model's context window. This forces truncation, which in turn forces the model to guess the remaining structure—increasing error rates exponentially.

Ethical Concerns

When JSON failures cause data corruption in critical systems—healthcare records, financial transactions, autonomous vehicle logs—the consequences are not just financial but potentially life-threatening. Who is liable when an LLM-generated JSON causes a medical data pipeline to misclassify patient records?

Open Questions

- Can we train models with a "JSON loss" term that penalizes structural errors during training, rather than relying on post-hoc constraints?
- Is there a fundamental information-theoretic limit to how accurately a probabilistic model can generate deterministic syntax?
- Will the industry converge on a single standard for structured output, or will fragmentation continue?

AINews Verdict & Predictions

Our editorial judgment is clear: the current approach is unsustainable. Treating JSON generation as a post-hoc constraint on a probabilistic system is like trying to build a skyscraper on a foundation of sand. The industry needs a fundamental rethink.

Prediction 1: By Q3 2026, at least two major model providers will introduce native structured generation modes that bypass token-level generation entirely for structured outputs. These will use latent-space representations that directly map to JSON structures, reducing error rates below 1%.

Prediction 2: The open-source community will produce a "JSON-native" model fine-tuned specifically for structured output generation. This model will use a modified loss function that penalizes structural errors 10x more than semantic errors during training. Expect a repo like `json-llama` to emerge with 50k+ stars within 12 months.

Prediction 3: Enterprise adoption of LLM agents will plateau in 2026 as companies realize the hidden cost of JSON failures. Growth will resume only after native structured generation becomes mainstream.

What to watch: OpenAI's Structured Outputs v2, Anthropic's upcoming "deterministic mode," and any research papers from DeepMind on "latent JSON decoding." The first company to solve this reliably will capture the enterprise agent market.

More from Hacker News

10대가 구글 AI IDE의 제로 의존성 클론을 만들었다 — 그 의미는?The AI development tool landscape is witnessing a remarkable act of defiance. A high school student, preparing for his GAI 추론: 실리콘밸리의 오래된 규칙이 더 이상 새로운 전장에 적용되지 않는 이유The long-held assumption that running a large model is as cheap as training it is collapsing under the weight of real-wo토큰 예산 관리: AI 비용 통제와 기업 전략의 새로운 지평The transition of large language models from research labs to production pipelines has exposed a brutal reality: inferenOpen source hub3252 indexed articles from Hacker News

Related topics

AI reliability43 related articles

Archive

May 20261208 published articles

Further Reading

EvalLens, LLM 프로덕션의 핵심 인프라로 부상하며 구조화 출력 신뢰성 해결EvalLens 의 오픈소스 릴리스는 AI 개발 우선순위의 중요한 전환점을 의미합니다. large language models 이 대화형 인터페이스에서 비즈니스 자동화의 핵심 구성 요소로 전환됨에 따라, JSON 및GPT-5.5 IQ 수축: 고급 AI가 더 이상 간단한 지시를 따르지 못하는 이유OpenAI의 주력 추론 모델인 GPT-5.5가 고급 수학 문제는 해결하면서도 간단한 다단계 지시를 따르지 못하는 우려스러운 패턴을 보이고 있습니다. 개발자들은 모델이 기본적인 UI 탐색 작업을 반복적으로 거부한다고AI가 '모르겠습니다'를 배우다: GPT-5.5 Instant, 환각률 52% 감소OpenAI가 GPT-5.5 Instant를 출시했습니다. 이 모델은 이전 버전 대비 환각률을 52% 줄였습니다. 이 혁신은 더 큰 파라미터가 아닌, 재설계된 추론 레이어 덕분입니다. 이 레이어는 모델이 답변을 생성AI 에이전트는 사기가 아니다, 그러나 과대광고는 위험하다: 심층 분석AI 업계가 챗봇에서 자율 에이전트로 전환하고 있지만, 비판론자들은 이러한 과대광고가 정교하게 포장된 사기라고 주장합니다. AINews는 주장 뒤에 숨은 기술적 현실을 조사하여 실제 환경에서 실패하는 취약한 시스템과

常见问题

这次模型发布“The JSON Crisis: Why AI Models Can't Be Trusted With Structured Output”的核心内容是什么?

AINews conducted a systematic stress test of 288 large language models, requiring each to output valid JSON. The results were alarming: even frontier models like GPT-4o and Claude…

从“how to fix malformed JSON from GPT-4”看,这个模型发布为什么重要?

The core problem is architectural: large language models generate text one token at a time, each token selected based on probability distributions over the vocabulary. JSON, by contrast, demands exact syntax—every bracke…

围绕“best open source JSON validation for LLMs”,这次模型更新对开发者和企业有什么影响?

开发者通常会重点关注能力提升、API 兼容性、成本变化和新场景机会,企业则会更关心可替代性、接入门槛和商业化落地空间。