Technical Deep Dive
The Gaokao essay failure follows from how large language models (LLMs) are built. The transformer-based models tested here are trained on next-token prediction: they generate text one token at a time, each drawn from a probability distribution conditioned on the prompt and on everything generated so far. This works well for factual summarization, code generation, and structured reasoning. It falls short on creative writing that demands *intentionality* and *emotional authenticity*.
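The left-to-right generation loop can be illustrated with a toy model. This is a minimal sketch, assuming a bigram model trained on a made-up sentence; `train_bigram`, `generate`, and the corpus are hypothetical, and real LLMs condition on the whole preceding context with a neural network rather than on one previous word.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_tokens, rng):
    """Generate left to right, sampling each next token from observed frequencies."""
    out = [start]
    for _ in range(max_tokens - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token was never followed by anything in training
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


# Hypothetical toy corpus, for illustration only.
corpus = "hope is a road and sorrow is a river and hope is a light".split()
model = train_bigram(corpus)
sample = generate(model, "hope", 12, random.Random(0))
print(" ".join(sample))
```

Every step looks only backward. Nothing in the loop represents where the text is supposed to end up, which is the property the rest of this section turns on.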
The Core Problem: Statistical vs. Semantic Understanding
LLMs do not 'understand' meaning in the human sense. They model the distribution of language. When asked to write an essay with 'emotional depth,' a model like Doubao retrieves patterns associated with emotion—words like 'love,' 'sorrow,' 'hope'—but cannot *feel* or *intend* their use. The result is a hollow pastiche. In the Gaokao essay, which requires a personal narrative or argument that resonates with universal human experience, this limitation is fatal.
The 'Repetition Trap'
Doubao's failing grade was largely due to repetitive phrasing and circular arguments. Repetition is a known issue in LLMs, usually mitigated by tuning sampling parameters such as `top_k` and `temperature` or by applying a repetition penalty. Even with well-chosen settings, however, a model can fall into a 'repetition trap' when the prompt demands a specific length (the Gaokao essay requires at least 800 characters). Once it has used up the phrasing most strongly associated with the topic, it begins to recycle sentences. This is less a bug than a structural consequence of how these models generate text: they write left to right, with no explicit global plan for the essay as a whole.
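To make the two parameters concrete, here is a minimal sketch of top-k sampling with temperature over a hand-written score table. The function name `sample_top_k` and the logit values are assumptions for illustration, not any vendor's implementation.

```python
import math
import random


def sample_top_k(logits, k, temperature, rng):
    """Sample one token from the k highest-scoring candidates after temperature scaling."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    # Keep only the k candidates with the highest raw scores.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [tok for tok, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores, for illustration only.
toy_logits = {"hope": 2.0, "sorrow": 1.5, "love": 1.0, "river": 0.2, "ledger": -1.0}
rng = random.Random(0)
draws = [sample_top_k(toy_logits, k=3, temperature=0.7, rng=rng) for _ in range(200)]
print({tok: draws.count(tok) for tok in sorted(set(draws))})
```

With `k=3` the sampler can only ever choose among the three highest-scoring words. These settings widen or narrow that pool but add no planning, so they cannot on their own prevent recycling over a long essay.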
Benchmark Performance Comparison
To contextualize the results, AINews compiled performance data from standard benchmarks alongside our Gaokao evaluation:
| Model | Gaokao Essay Score (out of 60) | MMLU (Accuracy %) | HumanEval (Python Pass@1) | Chinese Poetry Generation (BLEU-4) |
|---|---|---|---|---|
| GPT-4o | 48 | 88.7 | 89.0 | 0.32 |
| DeepSeek-V3 | 46 | 87.5 | 82.6 | 0.28 |
| Claude 3.5 Sonnet | 45 | 88.3 | 84.2 | 0.30 |
| Qwen2.5-72B | 44 | 86.4 | 79.1 | 0.35 |
| Gemini 2.0 Pro | 43 | 87.1 | 80.5 | 0.29 |
| ERNIE 4.0 | 42 | 85.2 | 76.8 | 0.31 |
| Doubao (ByteDance) | 34 | 82.1 | 71.3 | 0.22 |
Data Takeaway: Across these seven models, the Gaokao essay score tracks MMLU and HumanEval closely (Pearson r of roughly 0.96 and 0.93 on the figures above), but much of that correlation comes from Doubao, which trails on every column. Among the other six models, essay scores span only 42 to 48 and MMLU only 85.2 to 88.7, so the standard benchmarks say little about which strong model writes the better essay. Relative to the best score in each column, Doubao's gap is widest on Chinese Poetry Generation (0.22 against Qwen2.5-72B's 0.35). BLEU-4 measures only surface-level n-gram overlap, so this is a rough proxy, but it may indicate that Doubao's training data or architecture is weak on the stylistic and rhythmic demands of Chinese literary forms, which are close to what the Gaokao essay requires.
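One way to check how closely the essay score tracks the other columns is to compute Pearson's r from the table's own figures. The sketch below uses only the numbers printed above; the `pearson` helper is written out by hand, and with seven data points the result is indicative at best.

```python
import math

# (essay score, MMLU %, HumanEval pass@1, BLEU-4), copied from the table above.
rows = {
    "GPT-4o": (48, 88.7, 89.0, 0.32),
    "DeepSeek-V3": (46, 87.5, 82.6, 0.28),
    "Claude 3.5 Sonnet": (45, 88.3, 84.2, 0.30),
    "Qwen2.5-72B": (44, 86.4, 79.1, 0.35),
    "Gemini 2.0 Pro": (43, 87.1, 80.5, 0.29),
    "ERNIE 4.0": (42, 85.2, 76.8, 0.31),
    "Doubao": (34, 82.1, 71.3, 0.22),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


essay, mmlu, humaneval, bleu4 = (list(col) for col in zip(*rows.values()))
results = {
    "MMLU": pearson(essay, mmlu),
    "HumanEval": pearson(essay, humaneval),
    "BLEU-4": pearson(essay, bleu4),
}
for name, r in results.items():
    print(f"Essay vs {name}: r = {r:.2f}")
```

Dropping the Doubao row and rerunning is a quick way to see how much a single low scorer drives the coefficients.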
Relevant Open-Source Efforts
Several GitHub repositories are attempting to address these limitations. [Chinese-LLaMA-Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) (over 18,000 stars) focuses on fine-tuning LLaMA for Chinese, but its outputs still lack literary flair. [MiniCPM](https://github.com/OpenBMB/MiniCPM) (over 10,000 stars) from OpenBMB has shown promise in small-model efficiency but has not been tested on creative writing tasks. The most relevant project is [LongWriter](https://github.com/THUDM/LongWriter) (over 3,000 stars), which attempts to solve the long-text coherence problem by introducing a 'pre-planning' module. Early results show improved structure in 800+ word outputs, but emotional depth remains elusive. The fundamental challenge is that no open-source model has yet cracked the 'intentionality' problem—generating text that is not just coherent but *meaningful*.
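The general idea of planning before writing can be sketched in a few lines. This is an illustrative outline-then-expand loop, not LongWriter's actual code: `plan_then_write`, the prompt wording, and `stub_generate` (a stand-in for a real model call) are all hypothetical.

```python
from typing import Callable, List


def plan_then_write(topic: str, target_chars: int,
                    generate: Callable[[str], str], sections: int = 4) -> str:
    """Ask for an outline first, then write each section against that outline."""
    outline_prompt = f"List {sections} section headings for an essay on: {topic}"
    headings = [h.strip() for h in generate(outline_prompt).splitlines() if h.strip()]
    headings = headings[:sections]
    budget = target_chars // max(len(headings), 1)  # rough per-section length budget
    parts: List[str] = []
    for heading in headings:
        context = "\n".join(parts)[-500:]  # pass a short tail of the text written so far
        prompt = (f"Essay topic: {topic}\nSection: {heading}\n"
                  f"About {budget} characters.\nText so far: {context}")
        parts.append(generate(prompt))
    return "\n\n".join(parts)


def stub_generate(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs without any API."""
    if prompt.startswith("List"):
        return "Opening\nA memory\nA turning point\nClosing"
    section = prompt.split("Section: ")[1].splitlines()[0]
    return f"[text for '{section}']"


essay = plan_then_write("resilience", 800, stub_generate)
print(essay)
```

The outline gives each section a fixed role and length budget, which addresses structure. It does nothing for the emotional depth the evaluators found missing.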
Key Players & Case Studies
ByteDance's Doubao: The High-Profile Failure
Doubao, launched in 2023, was positioned as ByteDance's flagship AI assistant, leveraging the company's vast data from Douyin (TikTok's Chinese counterpart). Its failure is particularly embarrassing given ByteDance's resources. The model uses a mixture-of-experts (MoE) architecture with an estimated 180 billion active parameters. However, the training data appears heavily skewed toward short-form, viral content, which suits catchy headlines and social media posts but not sustained, reflective prose. This is a classic mismatch between what a product was optimized for and what the task demands: Doubao is built for speed and virality, not depth.
OpenAI's GPT-4o: The Top Scorer, But Still Lacking
GPT-4o's 48-point essay was praised for its logical structure and appropriate use of classical Chinese references (e.g., quoting Confucius). However, the evaluators noted that the essay 'felt like a well-written textbook chapter, not a personal reflection.' This highlights a key insight: GPT-4o's training on a massive, curated corpus of high-quality text gives it an advantage in formal writing, but it cannot simulate genuine personal experience. The model's 'voice' is a statistical average of millions of authors—competent but soulless.
DeepSeek-V3: The Open-Source Dark Horse
DeepSeek-V3, developed by the Chinese AI lab DeepSeek, scored 46, just behind GPT-4o. Its essay was noted for its innovative use of a Tang Dynasty poem as a framing device. However, the emotional arc was judged as 'mechanical.' DeepSeek's architecture uses Multi-head Latent Attention (MLA), which compresses the attention key-value cache to make long-context inference cheaper; efficient handling of long contexts may have helped it maintain thematic consistency. Yet it still fell short on the '情理交融' (fusion of reason and emotion) criterion.
Comparison of Training Approaches
| Model | Training Data Focus | Key Architecture Feature | Essay Strength | Essay Weakness |
|---|---|---|---|---|
| GPT-4o | Web text, books, code (multilingual) | Standard Transformer, RLHF | Logical structure, formal language | Lacks personal voice |
| DeepSeek-V3 | Chinese web, academic papers | Multi-head latent attention | Good use of literary references | Mechanical emotional arc |
| Qwen2.5-72B | Chinese web, synthetic data | Dense Transformer, 72B params | Coherent narrative | Generic metaphors |
| Doubao | Short-form video captions, social media | MoE, 180B active params | Fast generation | Repetitive, shallow |
Data Takeaway: In this comparison, training data composition appears to be the strongest predictor of essay quality. Models trained on long-form, curated content (GPT-4o, DeepSeek-V3) outperform the one trained largely on short, fragmented data (Doubao). If that pattern holds, future improvements in AI writing may come less from larger models than from better data curation, specifically from including more literary and reflective texts in the training corpus.
Industry Impact & Market Dynamics
The Gaokao essay test has immediate implications for China's booming AI education market. According to industry estimates, the AI-powered tutoring segment in China is expected to grow from $2.3 billion in 2025 to $7.5 billion by 2027, driven by demand for personalized learning tools. However, the current generation of AI writing assistants, including iFLYTEK's Spark Model, Baidu's ERNIE Bot, and ByteDance's Doubao, is marketed primarily on the ability to generate essays quickly. This test suggests that speed is not enough.
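The growth estimate implies a steep annual rate, which can be worked out from the two figures quoted. The short calculation below assumes smooth compounding over the two years from 2025 to 2027; the variable names are illustrative.

```python
start_value = 2.3      # USD billions, 2025 estimate quoted above
end_value = 7.5        # USD billions, 2027 estimate quoted above
years = 2027 - 2025

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")
```

That works out to roughly 80% a year, meaning the segment would have to more than triple in two years.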
The Shift from 'Fluency' to 'Depth'
For years, AI education products have focused on fluency—generating grammatically correct, coherent text. The Gaokao test reveals that fluency is table stakes. The real differentiator will be 'depth': the ability to generate text that demonstrates original thought, emotional intelligence, and cultural literacy. This is a much harder problem. Companies that invest in 'emotional AI'—models trained on literature, philosophy, and psychology—will have a competitive advantage.
ByteDance's Strategic Blow
Doubao's failure is a significant setback for ByteDance's education ambitions. The company has been aggressively pushing Doubao as a homework helper and essay assistant, integrating it into its education app, 'Dali Education.' A public failure on the most important exam in China could erode consumer trust. ByteDance will likely need to retrain Doubao on a more literary corpus, a process that could take 6-12 months. In the meantime, competitors like DeepSeek and Alibaba's Qwen team have an opening.
Market Share Projections
| Company | AI Education Product | Current Market Share (2025 est.) | Projected Share (2027) | Key Factor |
|---|---|---|---|---|
| Baidu | ERNIE Bot for Education | 28% | 25% | Slower innovation in creative writing |
| ByteDance | Doubao / Dali Education | 22% | 18% | Reputation damage from Gaokao failure |
| Alibaba | Tongyi Qianwen (Qwen) | 18% | 22% | Strong performance in benchmarks |
| DeepSeek | DeepSeek-V3 (API) | 8% | 15% | Open-source, rapid iteration |
| iFLYTEK | Spark Model | 15% | 12% | Focus on STEM, not humanities |
Data Takeaway: ByteDance is projected to lose market share, while DeepSeek could nearly double its presence. The key driver is not raw performance but *perceived* performance in high-stakes tasks like the Gaokao. Parents and students will gravitate toward models that have 'proven' themselves in real exams.
Risks, Limitations & Open Questions
The 'Good Enough' Trap
One risk is that AI education products will lower the bar for writing quality. If students use AI to generate essays that score 42-48 (a 'B' grade), they may never develop their own writing skills. This could lead to a generation of students who are fluent but not thoughtful—a dangerous outcome for a society that values literary sophistication.
The Evaluation Problem
Our evaluation used human judges, but this is not scalable. Automated essay scoring (AES) systems, such as those used in the TOEFL, are notoriously poor at evaluating creativity. If AI models are optimized to fool AES, they could produce essays that score high on surface metrics but are even more hollow. This creates an arms race between AI writers and AI graders.
Cultural Homogenization
All models in the test produced essays that were structurally similar—introduction, body paragraphs with examples, conclusion. This is a reflection of their training data, which is dominated by 'model essays' from test prep books. Over-reliance on AI could lead to a homogenization of writing styles, where every essay follows the same formula. The unique voice of individual students could be lost.
The 'Emotion Gap' is Not Closing
Despite rapid advances in model size and reasoning, the emotional depth of AI-generated text has not improved significantly in the past two years. A 2024 study comparing GPT-3.5 and GPT-4o on creative writing tasks found only a 5% improvement in human-rated emotional resonance. This suggests that the problem is not one of scale but of fundamental architecture. Current models lack a 'theory of mind'—the ability to attribute mental states to themselves and others. Without this, genuine empathy in writing is impossible.
AINews Verdict & Predictions
Verdict: The Gaokao essay test is a clear and present warning to the AI industry. The emperor has no clothes—or rather, the emperor can write a decent five-paragraph essay but cannot make you cry. The hype around 'human-level' AI writing is overblown. We are at least 3-5 years away from a model that can consistently produce essays that rival top human students.
Predictions:
1. ByteDance will retrain Doubao within 12 months. The company cannot afford to have its flagship AI product associated with failure. Expect a 'Doubao 2.0' with a focus on literary training data, possibly acquired through partnerships with Chinese publishing houses.
2. DeepSeek will emerge as the leader in Chinese creative AI. Its open-source approach allows for rapid community-driven improvements. Within two years, DeepSeek-based products will likely outperform closed-source rivals on creative writing benchmarks.
3. The next breakthrough will come from 'emotion-aware' architectures. Research labs are already exploring models that incorporate affective computing—using sentiment analysis and psychological models to guide text generation. The first model to successfully integrate a 'theory of mind' module will win the creative writing race.
4. AI education products will pivot from 'essay generators' to 'essay coaches.' Instead of writing essays for students, AI will be used to provide feedback, suggest improvements, and challenge students' thinking. This is a harder technical problem but aligns with educational goals.
5. The Gaokao itself may adapt. If AI-generated essays become indistinguishable from human ones (a distant prospect), examiners may change the format to include more oral components, real-time writing under observation, or prompts that require personal anecdotes that cannot be faked.
What to Watch: Keep an eye on the open-source project 'LongWriter' and the startup 'EmoWriter' (founded by ex-DeepMind researchers). Both are working on the 'intentionality' problem. If either achieves a breakthrough, it will reshape the entire landscape of AI writing.