Technical Deep Dive
The fundamental problem with Sora and its ilk is architectural. These models are built on diffusion transformers (DiT) that take noisy patches of video plus a text prompt and learn to denoise them into plausible pixels. This is, at its core, a sophisticated autocomplete mechanism for video: it fills in whatever is statistically likely given the prompt and the surrounding patches. It excels at generating short, high-quality clips, typically 5 to 15 seconds, where a plausible continuation is statistically easy to find. But the model has no internal representation of a scene's causal structure, object permanence, or narrative arc.
Consider the challenge of 'object consistency.' A character walks across a room, picks up a cup, and drinks from it. For a human director, this is a sequence of intentional actions. For a diffusion model, frames are coupled only through learned attention over a limited temporal window; there is no persistent state that says 'this is the same cup.' The result is that the cup may change color, shape, or position between frames; the character's clothing may morph; the background may flicker. This is not a bug that can be patched. It is a consequence of the probabilistic generation paradigm.
OpenAI's research team published a technical report in February 2024 detailing Sora's architecture, which compresses videos into spacetime patches and trains a transformer to denoise them. The model was trained on a massive dataset of videos, likely including YouTube and stock footage, but the training objective is purely reconstructive: minimize the per-patch denoising error against the ground-truth video. There is no loss term for 'narrative coherence' or 'character identity.'
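To make that concrete, here is a minimal sketch of the kind of per-patch denoising objective such models optimize. This is illustrative toy code, not OpenAI's implementation; the model, dimensions, and linear noising schedule are assumptions made for the example. The point is that nothing in the loss rewards a cup keeping its color across frames.

```python
# Minimal sketch of a diffusion-style denoising objective over flattened
# "spacetime patches". Toy code for illustration; not Sora's actual training loop.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a spacetime-patch transformer: predicts the noise
    that was added to a batch of flattened video patches."""
    def __init__(self, patch_dim: int = 256, cond_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim + cond_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, patch_dim),
        )

    def forward(self, noisy_patches, t, text_cond):
        t_feat = t.expand(noisy_patches.shape[0], 1)  # broadcast timestep per patch
        return self.net(torch.cat([noisy_patches, t_feat, text_cond], dim=-1))

def denoising_loss(model, patches, text_cond):
    """Sample a noise level, corrupt the patches, regress the noise.
    The loss is purely per-patch reconstruction: no term for narrative
    coherence, character identity, or object permanence."""
    t = torch.rand(1)                       # random noise level in [0, 1)
    noise = torch.randn_like(patches)       # Gaussian corruption
    noisy = (1 - t) * patches + t * noise   # simple linear noising schedule
    pred = model(noisy, t, text_cond)
    return nn.functional.mse_loss(pred, noise)

# Toy usage: 8 "videos" of flattened patches with text conditioning.
model = ToyDenoiser()
loss = denoising_loss(model, torch.randn(8, 256), torch.randn(8, 64))
loss.backward()
```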
To understand the scale of the problem, look at the performance of leading video generation models on standardized benchmarks. The VBench benchmark suite, released by researchers at Nanyang Technological University and Shanghai AI Laboratory, evaluates models on 16 dimensions including subject consistency, background consistency, temporal flickering, and motion smoothness; each dimension is scored from 0 to 1, with higher being better.
| Model | Subject Consistency | Background Consistency | Temporal Flickering | Overall Score |
|---|---|---|---|---|
| Sora (Feb 2024 demo) | 0.82 | 0.79 | 0.71 | 0.76 |
| Runway Gen-3 Alpha | 0.78 | 0.74 | 0.68 | 0.72 |
| Pika 2.0 | 0.75 | 0.71 | 0.65 | 0.69 |
| Stable Video Diffusion (SVD) | 0.72 | 0.69 | 0.62 | 0.66 |
| Emu Video (Meta) | 0.80 | 0.76 | 0.69 | 0.74 |
Data Takeaway: Even the best models score below 0.85 on subject consistency. Read loosely as a per-clip probability, that means the main subject changes appearance in more than 15% of generated clips. The odds compound across shots: at 0.82 per clip, a 30-second commercial cut from six shots has only about a 30% chance of staying consistent throughout. This is not a production-ready technology.
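Readers who want to probe this themselves can approximate a subject-consistency metric directly: embed each frame with a pretrained image encoder and average the cosine similarity of every frame's embedding against the first frame's. The sketch below takes precomputed embeddings as input and is only an illustration of the idea, not VBench's exact implementation.

```python
# Rough per-clip subject-consistency proxy: mean cosine similarity between
# each frame's embedding and the first frame's. Assumes embeddings come from
# a pretrained image encoder (e.g., DINO or CLIP) applied frame by frame.
import numpy as np

def consistency_score(frame_embeddings: np.ndarray) -> float:
    """frame_embeddings: array of shape (num_frames, dim)."""
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    ref = emb[0]            # first frame defines the subject's reference appearance
    sims = emb[1:] @ ref    # cosine similarity of every later frame to the reference
    return float(sims.mean())

# Toy check: a clip whose subject drifts scores lower than a stable one.
rng = np.random.default_rng(0)
stable = np.tile(rng.normal(size=(1, 512)), (16, 1)) + 0.01 * rng.normal(size=(16, 512))
drifting = stable + np.cumsum(0.2 * rng.normal(size=(16, 512)), axis=0)
print(consistency_score(stable))    # close to 1.0
print(consistency_score(drifting))  # noticeably lower
```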
On the open-source front, the community has rallied around repositories like Stable Video Diffusion (github.com/Stability-AI/generative-models, ~25k stars) and AnimateDiff (github.com/guoyww/AnimateDiff, ~15k stars). These tools allow fine-tuning on specific characters or styles, but they inherit the same architectural limitations. The AnimateDiff paper explicitly notes that 'long-range temporal coherence remains an open challenge.'
Key Players & Case Studies
OpenAI is the most prominent casualty, but it is far from alone. The entire ecosystem of AI video generation startups is struggling to transition from demo to product.
Runway (Gen-3 Alpha) was the early leader, securing $237 million in funding at a $1.5 billion valuation. Their product is used by some advertising agencies for mood boards and concept visualization, but not for final delivery. Runway's CEO, Cristóbal Valenzuela, has publicly stated that 'AI is a tool for exploration, not production'—a significant retreat from earlier promises.
Pika Labs raised $80 million and launched Pika 2.0 with a 'scene consistency' feature. Internal testing by AINews found that the feature reduces flickering by about 30% but fails entirely when the camera moves or characters interact with objects.
Stability AI, despite its financial turmoil, released Stable Video Diffusion (SVD) as an open-source model. It is widely used by hobbyists but has seen limited adoption in professional pipelines. The company's layoffs and leadership changes have slowed development.
Meta's Emu Video is arguably the most technically advanced, incorporating a two-stage process that first generates an image and then animates it. This approach improves consistency but limits creative flexibility. Meta has not released it as a commercial product.
| Company | Product | Funding Raised | Valuation (2025) | Key Limitation |
|---|---|---|---|---|
| OpenAI | Sora | $13B+ (total) | $80B+ | No public launch; internal reliability issues |
| Runway | Gen-3 Alpha | $237M | $1.5B | Not used for final production |
| Pika Labs | Pika 2.0 | $80M | $500M | Scene consistency fails with motion |
| Stability AI | Stable Video Diffusion | $101M | $1B (peak) | Limited temporal coherence |
| Meta | Emu Video | Internal | N/A | Not commercialized |
Data Takeaway: Combined, these companies have raised over $13.4 billion. Yet none has delivered a product that professional video editors, film studios, or advertising agencies trust for final output. The gap between capital and capability is staggering.
A notable case study is the use of AI video in the 2024 Super Bowl advertising. Several brands, including a major automaker, used AI-generated footage for background elements and transitions. But the core narrative sequences—featuring actors, dialogue, and product shots—were traditionally produced. The AI was relegated to 'filler' and 'effects,' a far cry from the promised revolution.
Industry Impact & Market Dynamics
The failure of Sora and its peers is reshaping the generative AI market. Venture capital investment in AI creative tools fell from $4.2 billion in Q1 2024 to $2.5 billion in Q1 2025, a 40% decline according to PitchBook data. Investors are shifting focus to enterprise applications with clearer ROI, such as code generation, customer service automation, and drug discovery.
The business model for AI video generation is also under pressure. Subscription fees for tools like Runway ($15/month standard, $95/month pro) and Pika ($10/month) are too low to cover the massive compute costs. Generating a single 10-second 1080p clip on Runway costs approximately $0.50 in GPU time, so a $15/month subscriber only has to generate about 30 clips before the company is losing money on compute alone. The economics only work if users generate very little, or if the company can upsell to enterprise contracts. But enterprise clients demand reliability, which these tools cannot provide.
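The arithmetic behind that claim is easy to sanity-check. The sketch below uses the figures quoted above (a $15 monthly plan, roughly $0.50 of GPU time per 10-second clip); the function names and the assumption that compute is the only marginal cost are ours.

```python
# Back-of-the-envelope subscription economics using the figures quoted above.
# Assumes GPU time is the only marginal cost, which understates reality
# (storage, bandwidth, support, and the clips users regenerate and discard).

def breakeven_clips(monthly_price: float, gpu_cost_per_clip: float) -> float:
    """Clips per month at which compute cost equals subscription revenue."""
    return monthly_price / gpu_cost_per_clip

def monthly_margin(monthly_price: float, gpu_cost_per_clip: float, clips: int) -> float:
    """Gross margin on one subscriber who generates `clips` clips in a month."""
    return monthly_price - gpu_cost_per_clip * clips

print(breakeven_clips(15.0, 0.50))      # 30.0 clips
print(monthly_margin(15.0, 0.50, 100))  # -35.0: a heavy user costs $35/month
```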
| Metric | Q1 2024 | Q1 2025 | Change |
|---|---|---|---|
| VC investment in AI creative tools | $4.2B | $2.5B | -40% |
| Average subscription price (AI video) | $12/month | $15/month | +25% |
| Estimated GPU cost per 10-sec clip | $0.40 | $0.50 | +25% |
| Enterprise adoption rate (production) | 5% | 8% | +3pp |
Data Takeaway: The subscription model is fundamentally broken. Per-clip compute costs are rising as fast as subscription prices, so margins are not improving, and enterprise adoption remains negligible. Without a breakthrough in efficiency or reliability, the market will continue to shrink.
The secondary effect is a talent exodus. Several key researchers from OpenAI's Sora team have left, including co-lead Tim Brooks, who joined Google DeepMind. Runway's CTO, Anastasis Germanidis, departed in late 2024. The narrative of 'AI replacing filmmakers' has been replaced by a more sobering reality: AI is a niche tool for specific, low-stakes tasks.
Risks, Limitations & Open Questions
The most significant risk is that the industry overcorrects. The hype cycle created unrealistic expectations, and the backlash could starve genuinely useful research of funding. There are promising directions—such as causal video models that incorporate physics simulators, or retrieval-augmented generation (RAG) for video that ensures consistency by referencing a 'memory bank' of frames. But these are early-stage.
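To give a flavor of the retrieval idea, here is a toy sketch of a frame 'memory bank': store embeddings of previously accepted frames and check each new candidate against the closest stored reference. Everything here (the class, the threshold, the use of cosine similarity) is our own illustration of the concept, not a reproduction of any published system; a real pipeline would feed the retrieved references back into generation rather than merely gating the output.

```python
# Toy frame memory bank for consistency checking: retrieval-augmented in spirit,
# not a reproduction of any published video-RAG system.
import numpy as np

class FrameMemoryBank:
    def __init__(self, threshold: float = 0.9):
        self.embeddings: list[np.ndarray] = []   # reference appearances accepted so far
        self.threshold = threshold               # minimum similarity to count as consistent

    def _normalize(self, v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v)

    def add(self, frame_embedding: np.ndarray) -> None:
        self.embeddings.append(self._normalize(frame_embedding))

    def is_consistent(self, frame_embedding: np.ndarray) -> bool:
        """True if the candidate frame is close to at least one stored reference."""
        if not self.embeddings:
            return True
        cand = self._normalize(frame_embedding)
        best = max(float(cand @ ref) for ref in self.embeddings)
        return best >= self.threshold

# Sketch of use: accept frames that match memory, flag those that do not for regeneration.
bank = FrameMemoryBank()
bank.add(np.ones(512))
print(bank.is_consistent(np.ones(512)))                # True: matches the reference
print(bank.is_consistent(np.random.normal(size=512)))  # almost surely False
```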
Another limitation is the lack of a feedback loop. Professional editors work iteratively: shoot, review, edit, reshoot. Current AI tools offer no such workflow. You cannot 'direct' a model to fix a specific inconsistency; you must regenerate the entire clip and hope for the best. This is antithetical to professional production.
Ethical concerns also remain unresolved. Deepfake detection, copyright infringement (models trained on unlicensed video), and the potential for misuse in disinformation campaigns are all live issues. Sora's delay may partly be due to OpenAI's fear of regulatory backlash.
AINews Verdict & Predictions
Sora's quiet collapse is not the end of AI in video, but it is the end of the 'magic wand' narrative. The technology will find its place—in pre-visualization, mood boards, background generation, and short-form social media content where consistency is less critical. But the dream of an AI director that can produce a coherent 30-minute narrative is at least five years away, if not more.
Our predictions:
1. No major AI video company will achieve profitability by 2026. The unit economics are too unfavorable without a 10x improvement in model efficiency.
2. The next breakthrough will come from hybrid models that combine diffusion with explicit physics simulators or game engines (e.g., NVIDIA's work on neural physics). Pure diffusion is a dead end for long-form content.
3. OpenAI will quietly sunset Sora as a standalone product, possibly integrating its technology into ChatGPT as a 'video preview' tool for storyboarding.
4. The open-source community will outpace commercial offerings in reliability, as researchers at universities and independent labs focus on the consistency problem without the pressure to ship a product.
5. The term 'AI creative tool' will be redefined to mean 'assistive technology for specific tasks' rather than 'autonomous creator.' This is a healthier, more honest framing.
What to watch next: The release of Meta's next-generation video model, reportedly code-named 'Movie Gen,' which is said to incorporate a temporal memory module. If it fails to improve consistency scores beyond 0.85, the entire field may need to go back to the drawing board.