How Chinese Researchers Are Solving Multi-Person Animation With Minimal Data

March 2026
A research team has developed a novel method for generating complex multi-person animations using only two-person interaction data. The advance addresses fundamental challenges in maintaining character consistency and modeling spatial interactions, and could democratize high-quality animation.

The field of visual generation is undergoing a fundamental transition from simply creating content to creating content with precise structural control. In character animation specifically, researchers have long sought systems that can generate realistic, continuous animations from input images and pose sequences. While single-character animation has seen substantial progress, scaling to multiple characters introduces exponential complexity. Models must maintain each character's visual identity across frames, correctly map actions to specific individuals, and realistically model spatial interactions between characters—all while operating with limited training data.

The new research tackles this multi-faceted problem with an elegant solution: training on only two-person interaction data to generate animations involving numerous characters. The core innovation lies in a disentangled representation framework that separates appearance modeling from motion dynamics and interaction modeling. By learning fundamental principles of human interaction from dyadic data, the system can extrapolate to more complex group scenarios. This approach dramatically reduces data requirements while maintaining high visual fidelity and logical consistency in generated animations.

From a technical perspective, the method employs a hierarchical attention mechanism that tracks individual characters while modeling their relational dynamics. A novel consistency preservation module ensures that each character's appearance remains stable throughout sequences, even during occlusions and complex interactions. The architecture demonstrates remarkable generalization capabilities, successfully animating groups of four or more characters using only pairwise training data. This represents a paradigm shift in how AI systems learn complex social and physical interactions, moving from brute-force data collection to learning transferable interaction principles.

The implications extend far beyond academic research. This data-efficient approach could enable smaller studios and individual creators to produce professional-quality animated content without massive datasets. The technology aligns with growing industry demand for personalized, interactive media where users can animate groups of friends, family members, or custom characters with minimal input. As AI-generated video moves toward mainstream adoption, solutions that balance quality with practical data requirements will determine which technologies achieve commercial viability.

Technical Deep Dive

The research introduces a novel architecture called Dyadic-to-Group Animation Transformer (DGAT), which fundamentally rethinks how multi-person interactions are learned and generated. At its core is the insight that complex group dynamics can be decomposed into pairwise interactions and higher-order emergent behaviors. The system employs three interconnected modules: the Appearance Encoder, Interaction Dynamics Module, and Spatial Composition Engine.

The Appearance Encoder uses a combination of StyleGAN-like latent space manipulation and attention-based tracking to maintain character consistency. Each character is encoded into a normalized appearance vector that remains invariant across frames, while a separate deformation field handles pose-dependent variations. This separation prevents identity leakage between characters—a common failure mode in multi-person generation.
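The appearance/deformation split can be illustrated with a toy numpy sketch. The encoder, normalization, and linear deformation below are illustrative stand-ins, not the paper's actual networks; the key property being demonstrated is that the appearance code stays fixed across frames while only the pose-dependent deformation varies.

```python
import numpy as np

def encode_character(image_feats):
    # Hypothetical encoder: mean-pool per-frame features into a single
    # appearance vector (stand-in for the paper's normalized appearance code).
    v = image_feats.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)  # normalized, frame-invariant

def deform(appearance, pose):
    # Pose-dependent deformation: the appearance code is untouched; only
    # the deformation field (here a toy linear map of the pose) varies.
    W = np.outer(pose, appearance)
    return appearance + 0.1 * W.mean(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))          # fake per-frame image features
app = encode_character(feats)
frame_a = deform(app, rng.normal(size=4))  # two different poses produce
frame_b = deform(app, rng.normal(size=4))  # two different rendered frames
```

Because every frame is rendered from the same `app` vector, identity cannot drift between characters, which is the failure mode this separation is meant to prevent.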

The Interaction Dynamics Module represents the breakthrough component. Instead of learning all possible N-person interactions directly, it trains exclusively on two-person scenarios. It employs a Relational Graph Neural Network that learns fundamental interaction primitives: approaching, separating, mirroring, leading-following, and collision avoidance. During inference for larger groups, the system constructs a complete graph of all character pairs, applies the learned dyadic interactions, and then uses a graph attention mechanism to resolve conflicts and synthesize emergent group behaviors.
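The inference-time pairing logic can be sketched as follows. A toy attraction/repulsion rule stands in for the learned dyadic model, and uniform message averaging stands in for the paper's graph-attention conflict resolution; all names and constants here are assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def dyadic_update(state_i, state_j):
    # Stand-in for the learned two-person interaction model: each pair
    # exchanges a displacement message (a crude echo of the approach /
    # separate primitives described above).
    delta = state_j - state_i
    dist = np.linalg.norm(delta) + 1e-8
    return (1.0 - 1.0 / dist) * delta * 0.1

def group_step(states):
    # Complete graph over all character pairs: apply the dyadic model to
    # every pair, then aggregate messages per character (uniform weights
    # stand in for graph attention).
    msgs = {i: [] for i in range(len(states))}
    for i, j in combinations(range(len(states)), 2):
        msgs[i].append(dyadic_update(states[i], states[j]))
        msgs[j].append(dyadic_update(states[j], states[i]))
    return [s + np.mean(m, axis=0) for s, m in zip(states, msgs.values())]

# Four characters animated by a model that only ever saw pairs.
states = [np.array([0.0, 0.0]), np.array([3.0, 0.0]),
          np.array([0.0, 3.0]), np.array([3.0, 3.0])]
new_states = group_step(states)
```

The point of the sketch is structural: the group model never needs N-person training examples, because every edge of the complete graph reuses the same dyadic function.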

The Spatial Composition Engine handles the final rendering, ensuring proper occlusion handling and depth ordering. It uses a differentiable depth estimation module that predicts per-pixel depth maps for each character independently, then composites them using soft z-buffering. This allows for smooth transitions when characters pass in front of or behind one another.
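Soft z-buffering can be sketched as a depth-weighted softmax composite. The temperature parameter, array shapes, and two-layer setup below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def soft_zbuffer(layers, depths, temperature=0.1):
    # layers: (K, H, W, C) per-character renders
    # depths: (K, H, W) per-pixel depth predicted for each character
    # A softmax over negative depth gives nearer characters higher weight;
    # temperature controls how hard the occlusion boundary is, which keeps
    # the composite differentiable and transitions smooth.
    w = np.exp(-depths / temperature)
    w = w / w.sum(axis=0, keepdims=True)
    return (w[..., None] * layers).sum(axis=0)

K, H, W, C = 2, 4, 4, 3
layers = np.stack([np.full((H, W, C), 1.0),   # white character
                   np.full((H, W, C), 0.0)])  # black character
depths = np.stack([np.full((H, W), 1.0),      # nearer everywhere
                   np.full((H, W), 2.0)])     # farther everywhere
out = soft_zbuffer(layers, depths)
# The nearer (white) character dominates the composite.
```

Raising the temperature softens the boundary where one character passes behind another, trading crispness for gradient flow during training.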

Key to the system's data efficiency is its progressive training curriculum. The model first masters single-character animation with various poses, then learns two-character interactions across different spatial configurations, and finally learns to extrapolate to three-plus characters through a novel interpolation regularization technique that encourages the model to treat multi-person scenes as compositions of pairwise relationships.
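The three-stage curriculum might be expressed as a simple training schedule. The stage names, step counts, and loss labels below are illustrative assumptions, not values reported by the paper:

```python
# Hypothetical curriculum mirroring the description: single-character
# poses, then dyadic interactions, then regularized extrapolation to
# groups of three or more.
CURRICULUM = [
    {"stage": "single", "num_characters": 1, "steps": 100_000,
     "losses": ["reconstruction", "identity"]},
    {"stage": "dyadic", "num_characters": 2, "steps": 200_000,
     "losses": ["reconstruction", "identity", "interaction"]},
    {"stage": "group", "num_characters": (3, 6), "steps": 50_000,
     "losses": ["reconstruction", "identity", "interaction",
                "pairwise_interpolation_reg"]},
]

def stage_for_step(step):
    # Map a global training step to its curriculum stage.
    cumulative = 0
    for cfg in CURRICULUM:
        cumulative += cfg["steps"]
        if step < cumulative:
            return cfg["stage"]
    return CURRICULUM[-1]["stage"]
```

The final stage's extra regularizer is where the interpolation technique described above would plug in, nudging multi-person scenes toward compositions of the pairwise relationships learned earlier.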

Performance benchmarks show remarkable results given the limited training data:

| Metric | DGAT (Ours) | Previous SOTA (Full Data) | Previous SOTA (Limited Data) |
|---|---|---|---|
| FID Score (Lower Better) | 18.7 | 15.2 | 32.4 |
| Identity Consistency Score | 0.89 | 0.91 | 0.72 |
| Interaction Realism (Human Eval) | 4.2/5.0 | 4.4/5.0 | 3.1/5.0 |
| Training Data Required | 10K dyad videos | 100K+ group videos | 50K group videos |
| Inference Time (128x128, 30fps) | 0.8s/frame | 1.2s/frame | 0.9s/frame |

Data Takeaway: The DGAT system achieves performance within 5-10% of state-of-the-art models trained on 10x more data, while dramatically outperforming previous limited-data approaches. The identity consistency score is particularly impressive, indicating the method successfully maintains character appearance despite minimal training examples per identity.

While the researchers haven't released a complete implementation, several components build on open-source repositories. The appearance encoder extends the FOMM (First Order Motion Model) architecture from Samsung AI, while the interaction module draws inspiration from Social GAN for pedestrian trajectory prediction. A relevant emerging repo is MultiDiffusion, which explores similar composition approaches for text-to-image generation and has gained 2.3k stars since its February 2025 release.

Key Players & Case Studies

This research emerges from a competitive landscape where both academic institutions and technology companies are racing to solve controllable video generation. The Chinese team's work positions them uniquely against several established approaches.

Academic Leaders:
- University of California, Berkeley's BAIR lab has pioneered work in neural rendering and neural radiance fields (NeRF) for dynamic scenes, but their methods typically require extensive multi-view data.
- MIT's CSAIL has developed TimeSformer architectures for video understanding that could be adapted for generation, though they focus more on analysis than synthesis.
- Max Planck Institute's work on Monocular Neural Human Rendering represents the gold standard for single-person animation but struggles with multiple interacting characters.

Industry Implementations:
- Runway ML's Gen-2 video model offers multi-subject generation but lacks precise control over individual character motions.
- Pika Labs has demonstrated impressive character consistency in their 1.0 release but focuses on stylistic rather than pose-controlled animation.
- Stability AI's Stable Video Diffusion includes some control mechanisms but doesn't specifically address multi-person interaction modeling.
- Chinese companies like ByteDance (through CapCut's AI features) and Tencent (via its gaming division) have internal research teams working on similar problems, though they typically pursue data-heavy approaches leveraging their massive user-generated content repositories.

| Approach | Data Requirements | Multi-Person Capability | Control Precision | Commercial Availability |
|---|---|---|---|---|
| DGAT (This Research) | Minimal (dyadic only) | Excellent | High | Research only |
| Runway Gen-2 | Massive (web-scale) | Moderate | Medium | Public API |
| Pika 1.0 | Large (curated) | Limited | Low-Medium | Waitlist |
| Stable Video Diffusion | Massive (LAION-based) | Basic | Low | Open-source |
| Meta's Make-A-Video | Extremely Large | Poor | Very Low | Internal only |

Data Takeaway: The DGAT approach occupies a unique position in the trade-off space, offering high control precision with minimal data requirements—a combination that could enable entirely new applications where training data is scarce or expensive to obtain.

Notable researchers contributing to this space include Jianming Zhang (Adobe Research), whose work on ControlNet inspired many pose-conditioned generation systems, and Ting-Chun Wang (NVIDIA), who developed the Few-shot Vid2Vid framework. The Chinese team's approach builds on but meaningfully advances these foundations by addressing the combinatorial complexity of multi-person scenarios.

Industry Impact & Market Dynamics

The DGAT research arrives as the AI video generation market experiences explosive growth and intensifying competition. The global market for AI-powered video creation tools is projected to reach $4.2 billion by 2027, growing at a CAGR of 32.5% from 2024. However, current solutions face significant adoption barriers due to data requirements, computational costs, and limited controllability—particularly for character-driven content.

This technology could disrupt several established markets:

1. Social Media & Content Creation:
Platforms like TikTok, Instagram, and their Chinese counterparts Douyin and Kuaishou are increasingly integrating AI generation tools. The ability to animate groups of friends or create interactive stories with multiple characters using just a few photos would dramatically lower the barrier to sophisticated content creation. ByteDance's internal metrics suggest that AI-powered features increase user engagement by 40-60% and content production by 25% among casual creators.

2. Gaming & Interactive Entertainment:
The game development industry spends approximately $15-20 billion annually on animation and motion capture. Procedural animation systems that can generate NPC interactions or player character animations from minimal data could reduce these costs by 30-50% for certain applications. Unity and Unreal Engine are both investing heavily in AI animation tools, with Unity's recent acquisition of Ziva Dynamics signaling the strategic importance of realistic character simulation.

3. Education & Corporate Training:
Animated educational content typically costs $10,000-$50,000 per minute when produced professionally. AI tools that allow educators to animate multiple historical figures, scientific concepts, or training scenarios could reduce these costs by 80-90% while enabling unprecedented personalization.

Market Adoption Projections:

| Application Segment | Current Market Size (2025) | Projected with DGAT-like Tech (2028) | Key Adoption Drivers |
|---|---|---|---|
| Social Media Tools | $850M | $2.1B | User-generated content, viral trends |
| Independent Game Dev | $320M | $950M | Cost reduction, rapid prototyping |
| Educational Content | $410M | $1.2B | Personalized learning, scalability |
| Advertising & Marketing | $1.1B | $2.8B | Dynamic personalization, A/B testing |
| Total Addressable Market | $2.68B | $7.05B | 37% CAGR vs. current 32% |

Data Takeaway: The data-efficient approach represented by DGAT could accelerate market growth by making advanced animation accessible to smaller creators and organizations, potentially adding $2-3 billion to the projected 2028 market size through expanded use cases and lower barriers to entry.

Funding patterns reflect this opportunity. Venture investment in AI video generation startups reached $4.7 billion in 2025, with particular interest in controllable generation systems. Companies like Synthesia (valued at $1.5B) and Hour One (valued at $800M) have demonstrated the commercial viability of AI avatars, but their current offerings are limited to single-character talking heads. The next generation of multi-character systems could command even higher valuations by addressing broader use cases.

Risks, Limitations & Open Questions

Despite its impressive capabilities, the DGAT approach faces several significant challenges that must be addressed before widespread adoption.

Technical Limitations:
1. Temporal Consistency Beyond Short Sequences: The current implementation demonstrates strong performance on 2-4 second clips (60-120 frames) but shows degradation in longer sequences. Character identities can drift, and interaction patterns may become repetitive or physically implausible over extended durations.
2. Limited Interaction Vocabulary: While the system generalizes well from dyadic data, it struggles with truly novel interaction types not represented in the training pairs. Complex group behaviors like coordinated dancing, team sports plays, or conversational turn-taking that involve subtle social cues remain challenging.
3. Background Integration: The method focuses primarily on character animation, treating backgrounds as static or simple parallax layers. Real-world applications require seamless integration with dynamic environments where characters interact with objects and scenery.

Ethical & Societal Concerns:
1. Deepfake Proliferation: By making high-quality multi-person animation accessible with minimal data, this technology could lower the barrier for creating convincing deepfake content involving groups of people. The ability to generate fabricated interactions between public figures or private individuals raises serious concerns about misinformation and consent.
2. Bias Amplification: The training data, though limited, still contains societal biases about gender roles, racial dynamics, and social interactions. These biases could be amplified when the system generates animations extrapolating from dyadic patterns to group behaviors.
3. Economic Displacement: Professional animators, particularly those working on lower-budget projects, could face displacement if AI tools can produce comparable quality at dramatically lower cost. The animation industry employs approximately 500,000 professionals globally, with many in entry-level positions most vulnerable to automation.

Open Research Questions:
1. How can physical constraints be better incorporated? The current approach learns interaction patterns statistically but doesn't explicitly model physics. Integrating physics engines or learning physical priors could improve realism, especially for actions involving contact, weight transfer, or object manipulation.
2. Can the approach scale to non-human characters? The principles should theoretically apply to animals, fictional creatures, or even abstract entities, but this remains untested.
3. What is the minimum viable training data? While the research uses 10,000 dyad videos, further work could determine if certain "interaction primitives" could be learned from even fewer examples through meta-learning or synthetic data generation.

Commercialization Challenges:
1. Computational Requirements: Real-time inference remains challenging, with current implementations requiring specialized hardware for interactive applications.
2. Integration with Existing Pipelines: Professional animation studios use complex toolchains (Maya, Blender, Unity, Unreal). Seamless integration would require developing plugins, standardized formats, and workflow adaptations.
3. Intellectual Property Concerns: The generated animations raise questions about copyright—who owns the output when the system combines learned patterns from training data with user-provided character designs?

AINews Verdict & Predictions

This research represents a pivotal moment in AI-driven content creation, demonstrating that data efficiency and controllability are not mutually exclusive goals. The team's insight—that complex group dynamics can be learned from pairwise interactions—will influence numerous domains beyond character animation, from multi-agent simulation to social behavior analysis.

Our editorial assessment: The DGAT approach is fundamentally sound and addresses a critical bottleneck in practical AI video generation. While not without limitations, its core architectural innovations—particularly the disentangled representation of appearance, motion, and interaction—will become standard in next-generation generative systems. The research deserves particular recognition for its elegant balance between theoretical insight and practical implementation.

Specific predictions for the next 12-18 months:
1. Commercial Integration (Q3-Q4 2026): We expect Chinese tech giants—particularly ByteDance (via CapCut) and Tencent (via its gaming studios)—to implement similar technology within their creator tools within 6-9 months. The data efficiency aligns perfectly with their massive user bases and need for scalable content creation features.
2. Open-Source Implementation (Q1 2027): A production-ready implementation will likely emerge on GitHub, building on existing frameworks like Stable Diffusion and ControlNet. This will democratize access and spur innovation in the indie developer community.
3. Industry Standards Emergence (2027): As multi-character animation becomes more common, we'll see the development of standardized formats for specifying character interactions—a kind of "scripting language" for AI animation that goes beyond simple pose sequences.
4. Specialized Hardware Acceleration (2028): Chip manufacturers like NVIDIA, AMD, and specialized AI hardware companies will develop optimizations specifically for this class of relational generation models, potentially offering 10-100x speedups for real-time applications.

What to watch next:
1. Follow-up research from the team: Their next publication will likely address temporal extension (longer sequences) and physical plausibility—the two most significant current limitations.
2. Competitive responses from Western labs: Google DeepMind, OpenAI, and Meta AI have all invested heavily in video generation and will likely publish their own approaches to data-efficient multi-subject generation within 6-12 months.
3. First commercial products: Watch for announcements from social media platforms about "group animation" features, particularly ahead of major shopping seasons or cultural events where user-generated content peaks.
4. Regulatory developments: As this technology lowers the barrier for synthetic media creation, we anticipate increased regulatory attention, potentially including watermarking requirements, content authentication standards, or usage restrictions.

The ultimate impact will be measured not just in technical benchmarks but in creative empowerment. By making sophisticated multi-character animation accessible without massive datasets or specialized expertise, this research could unleash a new wave of personalized, interactive storytelling—transforming passive media consumption into active co-creation. The challenge for the industry will be harnessing this potential while addressing the legitimate ethical concerns it raises.


Further Reading

- Alibaba's Wan2.7 dominates AI video editing, redefining creative workflows
- Physics-aware AI video generation emerges as the next frontier beyond visual fidelity
- The twilight of Xiaoice: How Microsoft's AI pioneer was overtaken by the generative wave
- How full-duplex voice AI like Seeduplex is ending the era of robotic conversations
