SII-GAIR  •  Sand.ai

daVinci-MagiHuman

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

Highlights

🧠

Single-Stream Transformer

A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.

🎭

Human-Centric Quality

Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.

🌍

Multilingual

Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French out of the box.

Blazing Fast Inference

Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.

🏆

State-of-the-Art Results

Achieves 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.

📦

Fully Open Source

Complete model stack released: base model, distilled model, super-resolution model, and inference code under Apache 2.0.

Architecture

A single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, jointly denoising video and audio within a unified token sequence.
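As a rough illustration of the unified token sequence described above, the sketch below concatenates the four input types into one list that a self-attention stack could consume. The `Token` class, the `build_sequence` helper, the modality ordering, and the token counts are all hypothetical and chosen for illustration; they are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    modality: str  # "text", "ref_image", "video", or "audio"
    index: int     # position within its own modality
    noisy: bool    # True for tokens the model is asked to denoise


def build_sequence(n_text: int, n_ref: int, n_video: int, n_audio: int) -> List[Token]:
    """Concatenate all modalities into one sequence (ordering is an assumption)."""
    layout = [
        ("text", n_text, False),      # conditioning, kept clean
        ("ref_image", n_ref, False),  # reference latent, kept clean
        ("video", n_video, True),     # noisy, denoised jointly
        ("audio", n_audio, True),     # noisy, denoised jointly
    ]
    sequence: List[Token] = []
    for modality, count, noisy in layout:
        sequence.extend(Token(modality, i, noisy) for i in range(count))
    return sequence


seq = build_sequence(n_text=4, n_ref=2, n_video=6, n_audio=3)
denoise_targets = [tok for tok in seq if tok.noisy]
print(len(seq), len(denoise_targets))
```

Because every token sits in the same sequence, plain self-attention lets video and audio tokens attend to text and reference tokens without a separate cross-attention path.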

🥪

Sandwich Architecture

The first 4 and last 4 layers use modality-specific projections; the middle 32 layers share parameters across all modalities for efficient joint reasoning.
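The layer split can be written down as a small plan. This is only a sketch of the 4 / 32 / 4 layout stated above; the function name `sandwich_plan` and the string labels are hypothetical.

```python
from typing import List


def sandwich_plan(num_layers: int = 40, edge: int = 4) -> List[str]:
    """Label each layer as modality-specific (outer) or shared (middle)."""
    if 2 * edge > num_layers:
        raise ValueError("edge layers exceed total layer count")
    plan = []
    for i in range(num_layers):
        is_edge = i < edge or i >= num_layers - edge
        plan.append("modality_specific" if is_edge else "shared")
    return plan


plan = sandwich_plan()
print(plan.count("modality_specific"), plan.count("shared"))
```

With the defaults, 8 layers are modality-specific and 32 are shared, which adds up to the 40 layers listed in the highlights.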

🕐

Timestep-Free Denoising

No explicit timestep embeddings — the model infers the denoising state directly from the input latents themselves.
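One way to picture the interface consequence: the sampler still keeps track of its own step size, but the model call receives latents only. The sketch below assumes a simple Euler-style update and uses a hypothetical `toy_model`; the source does not specify the sampler, so treat both as illustrative assumptions.

```python
from typing import Callable, List

Vector = List[float]


def sample(model: Callable[[Vector], Vector], latents: Vector, num_steps: int = 8) -> Vector:
    """Euler-style loop (an assumption); note the model gets no timestep argument."""
    step = 1.0 / num_steps
    x = list(latents)
    for _ in range(num_steps):
        velocity = model(x)  # only the current latents are passed in
        x = [xi + step * vi for xi, vi in zip(x, velocity)]
    return x


TARGET = [1.0, -1.0]


def toy_model(x: Vector) -> Vector:
    # Hypothetical stand-in for the network: points from x toward a fixed target.
    return [t - xi for t, xi in zip(TARGET, x)]


start = [0.0, 0.0]
result = sample(toy_model, start)
print(result)
```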

🔀

Per-Head Gating

Learned scalar gates with sigmoid activation on each attention head provide training stability and modality balancing.
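A minimal sketch of the gating idea, assuming each head has one learned scalar logit that is passed through a sigmoid and multiplied into that head's output. The helper names and the plain-list representation are hypothetical; the actual implementation may differ, for example in whether the gate depends on the input.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs: List[List[float]], gate_logits: List[float]) -> List[List[float]]:
    """Scale each head's output vector by sigmoid(logit) for that head."""
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    return [
        [sigmoid(logit) * value for value in head]
        for head, logit in zip(head_outputs, gate_logits)
    ]


heads = [[1.0, 2.0], [3.0, 4.0]]
gated = gate_heads(heads, gate_logits=[0.0, 10.0])
print(gated)
```

A logit of 0 halves a head's contribution, while a large positive logit passes it through almost unchanged, so each head's contribution can be scaled smoothly between 0 and 1.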

🔗

Unified Conditioning

Denoising and reference signals handled through a minimal unified interface — no dedicated conditioning branches needed.

Performance

Quantitative Quality Benchmark

| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
|---|---|---|---|---|
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |

Human Evaluation — 2,000 Pairwise Comparisons

| Matchup | daVinci Win | Tie | Opponent Win |
|---|---|---|---|
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
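The pairwise percentages above can be sanity-checked with a few lines of arithmetic. The sketch below confirms that each matchup sums to 100% and derives a win rate over decisive (non-tie) comparisons; that derived figure is not reported in the source and is shown only as a worked calculation.

```python
# (win %, tie %, opponent win %) as reported above
RESULTS = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def decisive_win_rate(win: float, tie: float, loss: float) -> float:
    """Win share among comparisons that did not end in a tie (derived, not reported)."""
    return 100.0 * win / (win + loss)


for opponent, (win, tie, loss) in RESULTS.items():
    total = win + tie + loss
    rate = decisive_win_rate(win, tie, loss)
    print(f"vs {opponent}: sums to {total:.1f}%, decisive win rate {rate:.1f}%")
```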

Inference Speed

5-second video on a single H100 GPU

256p: 2.0 seconds
540p: 8.0 seconds
1080p: 38.4 seconds
| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
|---|---|---|---|---|
| 256p | 1.6 | - | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
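The per-stage timings reported above can be cross-checked against the totals, and the share of each stage computed. One assumption in this sketch: the 256p row lists no super-resolution time, so it is entered as 0.0 here.

```python
# Seconds per stage, as reported above. The 0.0 for 256p super-resolution
# is an assumption: the source leaves that cell empty.
STAGE_SECONDS = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4},
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8},
}
REPORTED_TOTAL = {"256p": 2.0, "540p": 8.0, "1080p": 38.4}


def stage_shares(stages: dict) -> dict:
    """Fraction of total time spent in each stage."""
    total = sum(stages.values())
    return {name: seconds / total for name, seconds in stages.items()}


for resolution, stages in STAGE_SECONDS.items():
    total = sum(stages.values())
    shares = {name: f"{share:.0%}" for name, share in stage_shares(stages).items()}
    matches = abs(total - REPORTED_TOTAL[resolution]) < 1e-6
    print(resolution, round(total, 1), shares, "matches reported total:", matches)
```

The base generation cost stays at 1.6 seconds across resolutions, so at 1080p most of the time is spent in super-resolution and decoding.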

Efficient Inference

⚡ Latent-Space Super-Resolution

Two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.
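To make the saved round trip concrete, the sketch below lists the stages of a latent-space pipeline next to a hypothetical pixel-space alternative and counts VAE calls in each. All stage names are invented for illustration, and the pixel-space variant is an assumed point of comparison rather than something described in the source.

```python
from typing import List

# Hypothetical pixel-space upscaling: decode, upscale, re-encode, refine, decode.
PIXEL_SR_PIPELINE = [
    "generate_low_res_latent",
    "vae_decode",
    "upscale_pixels",
    "vae_encode",
    "refine_latent",
    "vae_decode",
]

# Latent-space upscaling as described above: refine directly, decode once.
LATENT_SR_PIPELINE = [
    "generate_low_res_latent",
    "upscale_latent",
    "refine_latent",
    "vae_decode",
]


def vae_calls(pipeline: List[str]) -> int:
    """Count stages that invoke the VAE encoder or decoder."""
    return sum(stage.startswith("vae_") for stage in pipeline)


print(vae_calls(PIXEL_SR_PIPELINE), vae_calls(LATENT_SR_PIPELINE))
```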

🔄 Turbo VAE Decoder

A lightweight re-trained decoder that substantially reduces decoding overhead while preserving visual fidelity.

🔧 Full-Graph Compilation

MagiCompiler fuses operators across Transformer layers for approximately 1.2x inference speedup through graph-level optimization.

💨 DMD-2 Distillation

Distillation enables generation with only 8 denoising steps and no classifier-free guidance, without sacrificing output quality.
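The effect on compute can be estimated by counting model forward passes. This sketch assumes the usual classifier-free guidance setup, where each step evaluates the model twice (conditional and unconditional); the helper name is hypothetical, and only the 8-step, no-guidance setting comes from the text above.

```python
def forward_passes(steps: int, use_cfg: bool) -> int:
    """Model evaluations per sample; CFG is assumed to need two per step."""
    if steps < 1:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


distilled = forward_passes(steps=8, use_cfg=False)
same_steps_with_cfg = forward_passes(steps=8, use_cfg=True)
print(distilled, same_steps_with_cfg)
```

Under that assumption, dropping guidance alone halves the number of evaluations at a fixed step count, in addition to whatever is saved by using fewer steps.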

Getting Started

    # Pull the prebuilt MagiHuman image
    docker pull sandai/magi-human:latest
    docker run -it --gpus all --network host --ipc host \
      -v /path/to/repos:/workspace \
      -v /path/to/checkpoints:/models \
      --name my-magi-human \
      sandai/magi-human:latest \
      bash

    # Install MagiCompiler
    git clone https://github.com/SandAI-org/MagiCompiler.git
    cd MagiCompiler
    pip install -r requirements.txt
    pip install .
    cd ..

    # Clone daVinci-MagiHuman
    git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
    cd daVinci-MagiHuman

    # Create environment
    conda create -n davinci-magihuman python=3.12
    conda activate davinci-magihuman
    conda install ffmpeg

    # Install PyTorch
    pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0

    # Install Flash Attention (Hopper)
    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention/hopper && python setup.py install && cd ../..

    # Install MagiCompiler
    git clone https://github.com/SandAI-org/MagiCompiler.git
    cd MagiCompiler
    pip install -r requirements.txt
    pip install .
    cd ..

    # Clone and install daVinci-MagiHuman
    git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
    cd daVinci-MagiHuman
    pip install -r requirements.txt
    pip install --no-deps -r requirements-nodeps.txt

    # Optional (only for sr-1080p): Install MagiAttention
    git clone --recursive https://github.com/SandAI-org/MagiAttention.git
    cd MagiAttention
    git checkout v1.0.5
    git submodule update --init --recursive
    pip install -r requirements.txt
    pip install --no-build-isolation .
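After installation, it can help to confirm that the pinned PyTorch packages are the ones actually installed. The following is a small sketch using only the standard library's `importlib.metadata`; the `check_versions` helper is hypothetical and not part of the repository, and the pinned versions are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above.
EXPECTED = {"torch": "2.10.0", "torchvision": "0.25.0", "torchaudio": "2.10.0"}


def check_versions(expected: dict) -> dict:
    """Return 'ok', 'missing', or 'mismatch (<found>)' for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as "+cu128" when comparing.
        base = found.split("+")[0]
        report[name] = "ok" if base == wanted else f"mismatch ({found})"
    return report


for name, status in check_versions(EXPECTED).items():
    print(f"{name}: {status}")
```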

Citation

    @misc{davinci-magihuman-2026,
      title  = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
      author = {SII-GAIR and Sand.ai},
      year   = {2026},
      url    = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
    }