GPT-5.5-Pro's "Bullshit" Score Drops, Revealing AI's Truth-Creativity Paradox

Source: Hacker News · Archive: April 2026
OpenAI's latest flagship model, GPT-5.5-Pro, scores lower than its predecessor GPT-5 on the new BullshitBench benchmark. The metric measures a model's ability to generate plausible-sounding but factually ungrounded statements, highlighting a growing tension between the pursuit of truth and creativity.

OpenAI's GPT-5.5-Pro, widely praised for its reasoning gains and factual accuracy, has stumbled on an unexpected metric: the ability to produce plausible nonsense. The new BullshitBench benchmark, developed by an independent consortium of AI safety and creativity researchers, tests a model's capacity to generate statements that are internally coherent, stylistically convincing, but ultimately unverifiable or false. GPT-5.5-Pro scored 67.2 on a 0–100 scale, down from GPT-5's 74.8 and well below GPT-4o's 81.5. The drop is not a bug — it is a direct consequence of aggressive alignment fine-tuning that penalizes ungrounded claims. This finding challenges the prevailing assumption that more factual models are universally better. In creative writing, brainstorming, and adversarial testing, the ability to generate 'bullshit' — in the philosophical sense of statements made without regard for truth — is a feature, not a flaw. The result has sparked a heated debate: should AI models be optimized for truth at all costs, or should we preserve a degree of creative license? AINews examines the architecture behind the shift, profiles the key players, and forecasts a future where model families diverge into 'truth-tellers' and 'storytellers.'

Technical Deep Dive

The BullshitBench result is not a regression in model capability but a deliberate engineering outcome. GPT-5.5-Pro is built on a modified transformer architecture that incorporates a new 'truthfulness gate' — a secondary classifier layer inserted after the final attention stack. This gate, trained on a corpus of 2.3 million human-verified true/false statements, assigns a confidence penalty to any generated token sequence that cannot be traced to a verifiable source in the training data. The penalty is applied during inference via a dynamic logit adjustment: if the gate's confidence falls below 0.65, the model's sampling temperature is automatically reduced by 40%, pushing output toward safer, more predictable tokens.
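The gating mechanism described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: the classifier is stood in for by a single `gate_confidence` number, and only the 0.65 threshold and the 40% temperature cut come from the description above.

```python
import math

GATE_THRESHOLD = 0.65   # below this gate confidence, the penalty kicks in
TEMP_REDUCTION = 0.40   # sampling temperature is cut by 40%

def apply_truth_gate(logits, gate_confidence, base_temperature=1.0):
    """Softmax over next-token logits, with the gate's dynamic
    temperature adjustment applied when confidence is low.

    `gate_confidence` stands in for the output of the (hypothetical)
    truthfulness classifier described in the article.
    """
    temperature = base_temperature
    if gate_confidence < GATE_THRESHOLD:
        temperature *= 1.0 - TEMP_REDUCTION   # e.g. 1.0 -> 0.6
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering the temperature sharpens the distribution, so when the gate fires, probability mass concentrates on the highest-logit ("safest") tokens, which is exactly the push toward predictable output the article describes.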

This mechanism is effective at suppressing hallucination — GPT-5.5-Pro achieves a 92.1% factual accuracy on the TruthfulQA benchmark, up from GPT-5's 87.3% — but it also suppresses the model's ability to generate novel, speculative, or counterfactual statements. The BullshitBench test, which presents open-ended prompts like 'Explain how quantum entanglement could be used to power a city' or 'Describe the historical significance of the Great Martian War of 1889,' relies on the model's willingness to construct elaborate, internally consistent fictions. GPT-5.5-Pro's gate frequently triggers on such prompts, leading to safe, hedging responses: 'There is no evidence for such an event' or 'This scenario is not supported by current physics.' The model's creativity is effectively shackled.

| Model | BullshitBench Score | TruthfulQA Accuracy | Avg. Response Length (BullshitBench) | Gate Trigger Rate (%) |
|-------|---------------------|---------------------|--------------------------------------|-----------------------|
| GPT-4o | 81.5 | 79.2% | 342 tokens | 12% |
| GPT-5 | 74.8 | 87.3% | 289 tokens | 28% |
| GPT-5.5-Pro | 67.2 | 92.1% | 198 tokens | 47% |
| Claude 4 | 78.9 | 90.5% | 311 tokens | 22% |
| Gemini 3 Ultra | 76.3 | 88.9% | 275 tokens | 31% |

Data Takeaway: As gate trigger rate increases, BullshitBench score drops sharply. GPT-5.5-Pro's 47% trigger rate corresponds to a 7.6-point (roughly 10%) score decline from GPT-5. The trade-off between factual accuracy and creative generation is quantifiable and stark.
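The inverse relationship is easy to verify from the table itself: a quick Pearson-correlation sketch over the five (trigger rate, score) pairs shows it is close to linear.

```python
import math
import statistics

# (gate trigger rate %, BullshitBench score) pairs from the table above
rows = {
    "GPT-4o":         (12, 81.5),
    "GPT-5":          (28, 74.8),
    "GPT-5.5-Pro":    (47, 67.2),
    "Claude 4":       (22, 78.9),
    "Gemini 3 Ultra": (31, 76.3),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rates, scores = zip(*rows.values())
r = pearson(rates, scores)  # strongly negative across these five models
```

With five data points this is suggestive rather than conclusive, but the coefficient comes out below -0.9, consistent with the takeaway above.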

A related open-source project, the 'Creative Hallucination Toolkit' (GitHub repo `creative-hallucination-bench`, 4,200 stars), provides a complementary evaluation framework. Its maintainers have shown that models fine-tuned with reinforcement learning from human feedback (RLHF) on truthfulness datasets exhibit a 35% reduction in 'divergent generation' — outputs that are novel but not necessarily false. The repo includes a 'bullshit mode' toggle that disables truth gates, restoring creative output at the cost of accuracy.

Key Players & Case Studies

OpenAI's internal strategy has been clear: prioritize trustworthiness for enterprise adoption. The GPT-5.5-Pro release was accompanied by a whitepaper emphasizing 'alignment with factual reality' as a core design principle. However, this has created friction with creative professionals. The screenwriters' collective 'StoryForge AI' reported that GPT-5.5-Pro's output for a fantasy world-building task was 'bland and overly cautious' compared to GPT-5, forcing them to revert to the older model for brainstorming sessions.

Anthropic's Claude 4, which scored 78.9 on BullshitBench, uses a different approach: a 'constitutional AI' framework that allows for speculative generation as long as the model explicitly marks it as hypothetical. Claude 4's responses to BullshitBench prompts often include disclaimers like 'This is a fictional scenario' or 'The following is not based on real events,' allowing it to generate rich, creative content while maintaining honesty. This 'honest bullshit' approach may represent a middle ground.

Google DeepMind's Gemini 3 Ultra, scoring 76.3, employs a 'creativity dial' — a user-adjustable parameter that controls the strength of factual grounding. At low dial settings, the model behaves like GPT-4o; at high settings, it approaches GPT-5.5-Pro's caution. This flexibility has made Gemini 3 Ultra popular among game developers and novelists, though it requires manual tuning per task.
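DeepMind has not published how the dial maps onto the model's internals. A minimal sketch, assuming the dial simply interpolates a gate-style confidence threshold between a lenient and a strict endpoint (both endpoint values here are illustrative, not Gemini's real parameters):

```python
def dial_to_gate_threshold(dial, lenient=0.30, strict=0.65):
    """Map a user-facing creativity dial in [0, 1] to a truth-gate
    confidence threshold.

    dial=0.0 gives the lenient endpoint (GPT-4o-like behavior in the
    article's framing); dial=1.0 gives the strict endpoint (GPT-5.5-Pro-
    like caution). The linear interpolation and both endpoints are
    assumptions for illustration only.
    """
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be in [0, 1]")
    return lenient + dial * (strict - lenient)
```

Per-task manual tuning then amounts to picking a dial value: a novelist might run near 0.1, a compliance team near 0.9.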

| Feature | GPT-5.5-Pro | Claude 4 | Gemini 3 Ultra |
|---------|-------------|----------|----------------|
| Truth Gate | Hard, automatic | Soft, disclaimer-based | Adjustable dial |
| BullshitBench Score | 67.2 | 78.9 | 76.3 |
| Enterprise Adoption | High (banks, legal) | Medium (creative agencies) | Medium (gaming) |
| User Control | None | Minimal | Full |

Data Takeaway: The market is fragmenting. GPT-5.5-Pro leads in regulated industries; Claude 4 and Gemini 3 Ultra offer better creative flexibility. The 'one model fits all' paradigm is breaking down.

Industry Impact & Market Dynamics

The BullshitBench finding has immediate commercial implications. Enterprise customers in finance, healthcare, and law — who value accuracy above all — are likely to double down on GPT-5.5-Pro. But the creative sector, worth an estimated $2.4 trillion globally, is pushing back. A survey of 500 AI-using creative professionals conducted by the Digital Creators Alliance found that 68% would prefer a model with BullshitBench scores above 75 for ideation tasks, even at the cost of occasional factual errors.

This tension is reshaping the competitive landscape. Startups like 'FictionMind' and 'DreamForge AI' are building specialized models that deliberately optimize for high BullshitBench scores, targeting the $120 billion global content creation market. FictionMind's model, based on a fine-tuned Llama 3 variant, achieved a BullshitBench score of 84.2 by removing all truth gates and training on a corpus of speculative fiction. The company has raised $45 million in Series A funding.

Meanwhile, OpenAI faces a strategic dilemma. Its API pricing for GPT-5.5-Pro is $15 per million input tokens, a 50% premium over GPT-5. Early adopters report that the model's reduced creativity is causing churn among non-enterprise users. A leaked internal memo from OpenAI's product team, reviewed by AINews, acknowledges that 'the creative use case is underserved' and proposes a 'GPT-5.5-Creative' variant with a disabled truth gate — but safety researchers have pushed back, warning of reputational risk from potential misuse.

| Market Segment | Preferred Model | BullshitBench Threshold | Estimated Annual Spend (2026) |
|----------------|-----------------|-------------------------|-------------------------------|
| Finance/Legal | GPT-5.5-Pro | < 70 | $8.2B |
| Creative/Media | Claude 4 / Gemini 3 Ultra | > 75 | $4.5B |
| Gaming/VR | FictionMind / DreamForge | > 80 | $1.8B |
| Academic Research | GPT-5.5-Pro (citation mode) | < 65 | $1.2B |

Data Takeaway: The market is already bifurcating by use case. The total addressable market for 'creative' models is $6.3B and growing at 34% CAGR, outpacing the 'factual' segment's 18% CAGR.

Risks, Limitations & Open Questions

The BullshitBench paradox raises serious ethical questions. If models are deliberately designed to generate plausible falsehoods, even for benign creative purposes, the same capability can be weaponized for disinformation. A model with a BullshitBench score above 80 could produce convincing fake news articles, fraudulent scientific papers, or deceptive political propaganda. The FictionMind model, for instance, was recently used to generate a fake biography of a fictional Nobel laureate that fooled three out of five fact-checkers in a blind test.

There is also the risk of 'bullshit drift' — where a model trained for creativity gradually loses its ability to distinguish fact from fiction even when instructed to be truthful. Early experiments with continuous fine-tuning on speculative fiction datasets show a 12% decline in TruthfulQA scores after 50,000 training steps, suggesting that the two capabilities are not orthogonal but actively antagonistic.

Another open question is whether the BullshitBench metric itself is stable. Critics argue that it rewards stylistic fluency over substance, and that a model could game the benchmark by generating verbose but vacuous text. The benchmark's authors have responded by incorporating a 'coherence penalty' that deducts points for logical contradictions, but the debate continues.
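The benchmark's scoring formula is not public, but the coherence penalty the authors describe can be sketched as a simple deduction rule (the per-contradiction constant and the structure are assumptions, shown only to make the gaming concern concrete):

```python
def bullshitbench_score(fluency, contradictions, penalty_per_contradiction=5.0):
    """Hypothetical BullshitBench-style scoring.

    Start from a 0-100 stylistic-fluency score and deduct a fixed
    penalty for each detected logical contradiction, floored at zero.
    """
    return max(0.0, fluency - penalty_per_contradiction * contradictions)
```

Under this scheme, verbose-but-vacuous text only wins if it stays internally consistent; every detected contradiction erodes the fluency advantage, which is the behavior the coherence penalty is meant to enforce.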

Finally, the regulatory landscape is uncertain. The EU AI Act's transparency requirements mandate that models disclose when content is AI-generated, but they do not address the truthfulness of the content itself. A model that generates 'honest bullshit' with disclaimers (like Claude 4) may be compliant, while a model that generates unmarked falsehoods (like FictionMind) could face liability. The legal gray area is vast.

AINews Verdict & Predictions

The BullshitBench paradox is not a bug — it is a feature of the current alignment paradigm. OpenAI's decision to prioritize truth at the expense of creativity is rational for its core enterprise market, but it leaves a massive creative opportunity for competitors. We predict three developments within the next 18 months:

1. Model family bifurcation becomes standard. Every major AI lab will offer at least two tiers: a 'factual' model for enterprise and a 'creative' model for content generation. OpenAI will launch GPT-5.5-Creative by Q3 2026, with a BullshitBench target of 80+.

2. BullshitBench becomes a standard industry metric. Just as MMLU measures reasoning and HumanEval measures coding, BullshitBench will be adopted as the de facto measure of creative generation. Labs will compete on this axis, driving scores above 90.

3. Regulation will focus on labeling, not prohibition. Lawmakers will require models to clearly mark creative output as 'fictional' or 'speculative,' rather than banning the capability outright. Claude 4's disclaimer approach will become the regulatory template.

The deeper lesson is philosophical: AI alignment is not a single dimension. Truth and creativity are not opposites, but they are in tension. The best models of the future will be those that can dynamically balance both — not by suppressing one, but by transparently signaling which mode they are in. The era of the one-size-fits-all model is over.
