DragNUWA: Can Drag-and-Drop Video Editing Finally Go Mainstream?

GitHub April 2026
⭐ 720
Source: GitHubArchive: April 2026
DragNUWA, from the Project NUWA team, brings drag-and-drop motion control to AI video generation, promising an intuitive editing experience. But with only 720 stars and no pretrained model, is this a breakthrough or a research artifact? AINews investigates the technical reality.

DragNUWA, developed by the Project NUWA team at Microsoft Research Asia, represents a significant step in making video generation controllable by non-experts. The core innovation is extending the 'drag' interaction paradigm—popularized by image-editing tools like DragGAN—into the temporal dimension of video. Instead of typing a text prompt, users specify the trajectory of key points on an object, and the model generates a video in which that object follows the path.

Technically, DragNUWA combines optical flow estimation with spatial attention mechanisms inside a latent diffusion model. The pipeline involves three stages: first, training a text-to-video base model; second, fine-tuning with optical flow as an additional condition; and third, introducing drag-based control via a lightweight adapter. This multi-stage approach lets the model learn motion dynamics without catastrophic forgetting.

However, the project currently lacks a released pretrained model and comprehensive documentation, making it inaccessible to most practitioners. The GitHub repository has accumulated 720 stars, indicating strong community interest but also frustration over the high barrier to entry.

The significance of DragNUWA lies in its potential to democratize video editing—replacing complex keyframing and motion-tracking software with a simple drag gesture. If the team releases a working demo, it could disrupt the workflow of independent creators, animators, and social media content producers. But the missing pieces—pretrained weights, inference scripts, and a clear license—suggest the research is still in an early, exploratory phase. AINews believes the core idea is viable, but the path to a product requires solving inference speed, temporal consistency, and occlusion handling, none of which is trivial.

Technical Deep Dive

DragNUWA sits at the intersection of two hot research areas: diffusion-based video generation and interactive image editing. To understand its architecture, we must first appreciate the challenge. Video generation models like Stable Video Diffusion or AnimateDiff already produce temporally coherent clips, but controlling the exact motion of specific objects remains an open problem. Text prompts are too coarse—saying 'the cat jumps left' doesn't specify the arc, speed, or final position. DragNUWA solves this by introducing a new conditioning signal: a set of handle points and their target trajectories.
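To make the conditioning signal concrete, here is a small illustrative sketch—not DragNUWA's released code, whose exact encoding is unpublished. It linearly interpolates each handle point between its start and end positions and rasterizes the remaining displacement into a dense two-channel (dx, dy) map per frame, the kind of tensor a diffusion U-Net could consume as an extra condition:

```python
import numpy as np

def rasterize_drags(drags, frames, height, width):
    """Encode sparse drag trajectories as a dense per-frame displacement map.

    Each drag is (start_xy, end_xy); the handle point is linearly
    interpolated across frames, and the remaining displacement toward
    the target is written into a 2-channel (dx, dy) map at the handle's
    current location.
    """
    cond = np.zeros((frames, 2, height, width), dtype=np.float32)
    for (x0, y0), (x1, y1) in drags:
        for t in range(frames):
            a = t / max(frames - 1, 1)      # interpolation factor in [0, 1]
            x = x0 + a * (x1 - x0)          # current handle position
            y = y0 + a * (y1 - y0)
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < width and 0 <= yi < height:
                # remaining displacement toward the target point
                cond[t, 0, yi, xi] = x1 - x
                cond[t, 1, yi, xi] = y1 - y
    return cond

# One drag from (2, 2) to (10, 6) over an 8-frame, 16x16 clip
cond = rasterize_drags([((2, 2), (10, 6))], frames=8, height=16, width=16)
print(cond.shape)  # (8, 2, 16, 16)
```

Real systems typically also blur or splat the sparse map so the signal survives downsampling in the U-Net, but the shape of the data is the point here.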

Architecture Overview

The framework builds on a latent diffusion model (LDM) backbone, similar to Stable Diffusion. The key modifications are:
1. Optical Flow Encoder: A separate network (often a pre-trained RAFT or a lightweight variant) estimates dense optical flow from the input video. This flow is encoded into a feature map that is injected into the U-Net decoder via cross-attention or feature concatenation.
2. Spatial Attention with Drag Tokens: Instead of standard self-attention, DragNUWA uses a modified attention layer where the user-specified drag points are represented as learnable 'drag tokens'. These tokens attend to the spatial features of the frame, effectively telling the model 'this pixel should move to that location'.
3. Multi-Stage Training: The authors employ a three-stage curriculum:
- Stage 1: Train a text-to-video LDM on a large dataset (e.g., WebVid-10M) to learn basic motion priors.
- Stage 2: Freeze the base model and train the optical flow encoder using pairs of video frames and their ground-truth flow.
- Stage 3: Fine-tune the entire model with drag supervision, where synthetic drag trajectories are generated by perturbing object keypoints in existing videos.
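The 'drag token' mechanism in modification 2 can be illustrated with a minimal NumPy sketch. Note the hedging: the dimensions, single-head setup, and scaling below are assumptions for illustration, not the paper's exact layer—the idea is simply standard scaled dot-product cross-attention with drag embeddings as queries over flattened spatial features:

```python
import numpy as np

def drag_token_attention(frame_feats, drag_tokens):
    """Toy cross-attention: drag tokens attend over spatial features.

    frame_feats: (H*W, d) flattened spatial features of one frame.
    drag_tokens: (k, d) learnable embeddings, one per drag point.
    Returns (k, d) token outputs and the (k, H*W) attention map.
    """
    d = frame_feats.shape[-1]
    scores = drag_tokens @ frame_feats.T / np.sqrt(d)   # (k, H*W)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over space
    return attn @ frame_feats, attn

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 32))   # 8x8 spatial grid, 32-dim features
tokens = rng.standard_normal((2, 32))   # two drag points
out, attn = drag_token_attention(feats, tokens)
print(out.shape, attn.shape)  # (2, 32) (2, 64)
```

Each attention map row is a spatial distribution over the frame, which is how a token can softly bind to 'the pixel that should move'.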

Where It Falls Short

While the approach is elegant, the current implementation has several limitations. First, the optical flow encoder adds significant computational overhead—inference on a 512x512, 16-frame video reportedly takes over 2 minutes on an A100 GPU. Second, the drag control is limited to sparse keypoints; complex deformations (e.g., a waving flag) are poorly handled. Third, the model struggles with occlusions: if a dragged object passes behind another, the result often shows ghosting or abrupt disappearance.

Comparison with Alternatives

| Feature | DragNUWA | DragGAN (Image) | Runway Gen-3 | Pika Labs |
|---|---|---|---|---|
| Input Modality | Video + drag points | Image + drag points | Text prompt | Text prompt |
| Motion Control | Explicit trajectory | Implicit (via optimization) | Implicit (text) | Implicit (text) |
| Temporal Consistency | Good (flow-guided) | N/A (single image) | Excellent | Good |
| Inference Speed | ~2 min per 16 frames | ~10 sec per image | ~30 sec per 5 sec clip | ~45 sec per 3 sec clip |
| Pretrained Model Available | No | Yes | Yes (API) | Yes (API) |
| Open Source | Partial (code only) | Yes | No | No |

Data Takeaway: DragNUWA offers the most direct motion control but at a severe speed penalty and without a usable model. The closed-source alternatives (Runway, Pika) prioritize speed and polish, sacrificing fine-grained control. This trade-off defines the current market gap.

Relevant Open-Source Repos
- ProjectNUWA/DragNUWA (⭐720): The subject of this article. Code is available but no weights. Recent commits show a focus on documentation, not model release.
- Stability-AI/generative-models (⭐25k+): The base for many video diffusion models, including the one likely used by DragNUWA.
- NVlabs/DragGAN (⭐35k+): The image-based predecessor that inspired DragNUWA. Fully functional with pretrained models.

Key Players & Case Studies

The DragNUWA project is led by researchers from Microsoft Research Asia (MSRA), a prolific lab known for foundational work in computer vision and NLP. The team includes names like Yifan Jiang, Yue Wu, and Ziwei Liu, who have published on controllable generation before. MSRA's strategy is typical: publish cutting-edge research to establish IP and attract talent, while productization is left to internal teams like Azure AI or external partners.

Competing Products

| Product | Company | Approach | Strengths | Weaknesses |
|---|---|---|---|---|
| Runway Gen-3 | Runway ML | Diffusion transformer | High quality, fast, polished UI | No drag control, subscription cost |
| Pika Labs | Pika | Diffusion + motion modules | Easy text-to-video, good for social | Limited editing, no keyframe control |
| ComfyUI + AnimateDiff | Community | Modular diffusion | Full control, free | Steep learning curve, no drag UI |
| DragNUWA | MSRA | Flow + drag tokens | Direct motion control | No model, slow, research-only |

Case Study: The Independent Animator

Consider a motion graphics artist who wants to animate a logo flying across a screen. Using current tools, they would either keyframe in After Effects (hours of work) or use a text-to-video model (unpredictable results). DragNUWA promises a middle ground: drag the logo from point A to point B, and the model generates the in-between frames with realistic motion blur and lighting. If released, this could save hours per project. However, the lack of temporal consistency in current demos—where the logo's color or shape drifts between frames—makes it unusable for professional work.

Industry Impact & Market Dynamics

The video generation market is projected to grow from $2.5 billion in 2024 to $15 billion by 2030 (CAGR 35%). The key battleground is controllability. Early models (Make-A-Video, Imagen Video) were impressive but uncontrollable. The current leaders—Runway, Pika, and the open-source AnimateDiff—offer text-based control, but users demand more precision. DragNUWA represents the next frontier: direct manipulation.

Adoption Curve

| Phase | Timeframe | Key Enabler | Example |
|---|---|---|---|
| Text-to-Video | 2023-2024 | Diffusion models | Runway Gen-1/2 |
| Text + Image-to-Video | 2024-2025 | Reference networks | Pika 2.0 |
| Drag-to-Video | 2025-2026 | Flow-based control | DragNUWA (if released) |
| Full Scene Editing | 2026+ | 3D-aware models | N/A |

Data Takeaway: The market is moving from 'generate anything' to 'generate exactly what I want'. DragNUWA's approach is the most intuitive for non-technical users, but it must overcome the speed and quality gap to compete.

Funding Landscape

- Runway ML raised $237M total (Series D at $1.5B valuation).
- Pika raised $55M (Series A at $250M valuation).
- Microsoft's investment in AI video is indirect (Azure OpenAI, M365 Copilot).

If DragNUWA becomes a product, it could be integrated into Microsoft's Clipchamp or Adobe's Premiere Pro, creating a new revenue stream. Alternatively, a startup could license the technology and build a consumer app.

Risks, Limitations & Open Questions

1. The Pretrained Model Problem: Without a released model, the project is just a paper. The community's frustration is palpable—GitHub issues ask 'When will weights be released?' with no response. If MSRA never releases it, the project becomes a footnote.

2. Temporal Drift: In current demos, the dragged object often changes appearance over time (e.g., a red car becomes blue after 10 frames). This is a fundamental issue with diffusion models—they sample noise per frame, and the drag signal is not strong enough to enforce identity consistency.
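One widely used mitigation for per-frame noise independence—not something the DragNUWA repo is confirmed to implement—is to correlate the initial diffusion noise across frames. The sketch below shows the intuition: mixing a shared noise component into every frame's initial latent raises frame-to-frame correlation before denoising even starts:

```python
import numpy as np

def correlated_frame_noise(frames, shape, rho, seed=0):
    """Sample per-frame initial noise with a shared component.

    rho controls how much of each frame's noise is common to the clip:
    rho=0 gives fully independent noise, rho=1 identical noise. The
    sqrt mixing keeps each frame's noise unit-variance Gaussian.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(shape)
    noise = [np.sqrt(rho) * shared
             + np.sqrt(1 - rho) * rng.standard_normal(shape)
             for _ in range(frames)]
    return np.stack(noise)

def mean_adjacent_corr(x):
    """Average Pearson correlation between consecutive frames."""
    flat = x.reshape(len(x), -1)
    return np.mean([np.corrcoef(flat[t], flat[t + 1])[0, 1]
                    for t in range(len(x) - 1)])

ind = correlated_frame_noise(16, (4, 4), rho=0.0)
cor = correlated_frame_noise(16, (4, 4), rho=0.9)
print(mean_adjacent_corr(cor) > mean_adjacent_corr(ind))  # True
```

Correlated initialization alone does not fix identity drift, but it illustrates why stronger temporal coupling of the sampling process (shared noise, temporal attention, flow warping) is the usual direction of attack.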

3. Occlusion Handling: When a dragged object passes behind another, the model often 'forgets' the occluded part, leading to objects that disappear and reappear. This is a hard problem that may require explicit 3D reasoning.

4. Ethical Concerns: Drag-based editing could be used to create deepfakes with precise motion control, e.g., making a politician 'wave' in a video. The lack of safeguards in the current code (no watermarking, no detection) is concerning.

5. Compute Requirements: The multi-stage training requires hundreds of GPU hours. Even inference is expensive, limiting deployment to cloud APIs rather than edge devices.

AINews Verdict & Predictions

Verdict: DragNUWA is a brilliant research prototype that is 12-18 months away from being a usable product. The core idea—drag-based motion control—is the right direction, but the engineering challenges are substantial.

Predictions:
1. Within 6 months: MSRA will release a limited demo (maybe a Gradio app) but not the full model. This will generate a wave of press but disappoint developers.
2. Within 12 months: A startup (possibly spun out from MSRA) will build a commercial product using a similar approach, likely with a lighter architecture (e.g., replacing optical flow with a learned motion field).
3. Within 18 months: Adobe will acquire or build a drag-based video editing feature into Premiere Pro, likely using a variant of this technique.
4. The open-source community will fork the code and train a model on a smaller dataset (e.g., UCF-101), achieving reasonable results but not production quality.

What to Watch: The next commit on the DragNUWA repo. If the team adds an inference script and a model card, the timeline accelerates. If the repo goes silent, the idea will be re-implemented by others.

Final Thought: DragNUWA is a reminder that in AI, the hardest part is not the algorithm—it's the data, the infrastructure, and the product polish. The research is inspiring, but the real impact will come when someone ships a button that just works.
