Model Architecture
15B Sandwich Transformer
A 40-layer unified sandwich Transformer that combines modality-specific and shared layers in a single stack.
The 15-billion-parameter model uses single-stream self-attention to handle video and audio as one unified sequence.
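A minimal sketch of the layering idea in PyTorch, assuming the common reading of "sandwich": modality-specific layers at the entry and exit of the stack, with shared layers in between. The 8/24/8 depth split, dimensions, and block choice are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

def block(d_model: int = 1024, n_heads: int = 16) -> nn.Module:
    # Plain pre-norm Transformer encoder layer as a stand-in block.
    return nn.TransformerEncoderLayer(
        d_model, n_heads, dim_feedforward=4 * d_model,
        batch_first=True, norm_first=True,
    )

class SandwichStack(nn.Module):
    """Illustrative sandwich: per-modality layers wrap a shared trunk.

    The 8 / 24 / 8 split summing to 40 layers is an assumption for
    illustration; the real split is not given in the source.
    """

    def __init__(self, n_outer: int = 8, n_shared: int = 24):
        super().__init__()
        self.video_in = nn.ModuleList(block() for _ in range(n_outer))
        self.audio_in = nn.ModuleList(block() for _ in range(n_outer))
        self.shared = nn.ModuleList(block() for _ in range(n_shared))
        self.video_out = nn.ModuleList(block() for _ in range(n_outer))
        self.audio_out = nn.ModuleList(block() for _ in range(n_outer))

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        n_v = v.shape[1]
        for lv, la in zip(self.video_in, self.audio_in):
            v, a = lv(v), la(a)            # modality-specific entry layers
        x = torch.cat([v, a], dim=1)       # pack into one unified sequence
        for layer in self.shared:
            x = layer(x)                   # shared joint self-attention
        v, a = x[:, :n_v], x[:, n_v:]      # split back into modalities
        for lv, la in zip(self.video_out, self.audio_out):
            v, a = lv(v), la(a)            # modality-specific exit layers
        return v, a
```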
Unified Architecture
One model handles video and audio simultaneously. No separate pipelines, no cross-attention modules, no manual syncing required.
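To make the single-stream claim concrete, here is a minimal sketch of joint self-attention over one packed video+audio sequence; shapes are made up, and the q/k/v projections are omitted for brevity:

```python
import torch
import torch.nn.functional as F

# Video and audio tokens packed into one stream: a single self-attention
# pass lets every token attend across both modalities, which is what
# makes a separate cross-attention module unnecessary.
video = torch.randn(1, 256, 64)   # (batch, video tokens, channels)
audio = torch.randn(1, 128, 64)   # (batch, audio tokens, channels)
x = torch.cat([video, audio], dim=1)

out = F.scaled_dot_product_attention(x, x, x)
assert out.shape == x.shape       # every token saw both modalities
```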
Performance Specifications
~38s to 1080p
Generation time on an H100 GPU; 8-step DMD-2 distillation enables this speed.
With MagiCompiler, generation is about 1.2x faster, roughly 32 s.
1080p Native Resolution
True 1080p output, not upscaled. Crisp details without artifacts.
5-10s Duration
Single-pass generation. Ideal for short-form content creation.
Audio Capabilities
Joint Video + Audio
One unified model generates dialogue, ambient sound, and foley, all synchronized with the video.
7 Languages Lip Sync
Lip sync that matches speech naturally:
| Language | Supported |
|---|---|
| English | ✅ |
| Mandarin | ✅ |
| Cantonese | ✅ |
| Japanese | ✅ |
| Korean | ✅ |
| German | ✅ |
| French | ✅ |
Inference Optimization
8-Step DMD-2 Distillation
Eight denoising steps per sample for fast, efficient inference. Classifier-free guidance (CFG) is distilled into the model, so no extra guidance passes are needed.
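A minimal sketch of what few-step sampling with a distilled generator looks like, assuming a standard denoise-then-renoise loop; the tiny model, noise schedule, and shapes are all illustrative stand-ins, not the real ones.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Dummy stand-in for the distilled generator (illustrative only)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        s = sigma * torch.ones(x.shape[0], 1)   # broadcast noise level
        return self.net(torch.cat([x, s], dim=-1))

@torch.no_grad()
def sample_8_step(model: nn.Module, shape, sigmas: torch.Tensor) -> torch.Tensor:
    # Each step is a single forward pass: the distilled student already
    # reflects the guided teacher, so there is no paired conditional /
    # unconditional evaluation as CFG would require.
    x = torch.randn(shape) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x0 = model(x, sigma)                               # predict clean sample
        if i + 1 < len(sigmas):
            x = x0 + torch.randn_like(x0) * sigmas[i + 1]  # re-noise to next level
        else:
            x = x0
    return x

# An 8-level noise schedule; values are illustrative, not the real ones.
sigmas = torch.tensor([80.0, 40.0, 20.0, 10.0, 5.0, 2.5, 1.2, 0.5])
sample = sample_8_step(TinyDenoiser(), (1, 64), sigmas)
```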
MagiCompiler Support
1.2x faster with MagiCompiler optimization.
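MagiCompiler's interface is not described here, so this sketch uses PyTorch's built-in `torch.compile` as a rough stand-in to show the general idea behind this kind of speedup: compile the model once, then reuse the optimized kernels on every denoising step.

```python
import torch
import torch.nn as nn

# torch.compile stands in for MagiCompiler, whose API is not public here.
net = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
net = torch.compile(net, mode="max-autotune")

x = torch.randn(8, 1024)
with torch.no_grad():
    y = net(x)   # first call compiles; later calls reuse the cached graph
```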
Rankings
| Category | Rank | Elo Score |
|---|---|---|
| Text-to-Video (No Audio) | #1 | 1,375 |
| Image-to-Video (No Audio) | #1 | 1,392 |
| Lead over Seedance 2.0 | — | +60 |
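For context, an Elo gap maps to an expected head-to-head win rate through the standard logistic formula; a 60-point lead corresponds to winning about 59% of pairwise comparisons:

$$E = \frac{1}{1 + 10^{-\Delta/400}}, \qquad \Delta = 60 \ \Rightarrow\ E \approx 0.586$$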
Open Source
The team has committed to releasing the full open-source package (base model, distilled versions, and inference code) by mid-2026.
