Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French out of the box.
Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
Achieves 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
Complete model stack released: base model, distilled model, super-resolution model, and inference code under Apache 2.0.
A single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, jointly denoising video and audio within a unified token sequence.
First and last 4 layers use modality-specific projections; the middle 32 layers share parameters across all modalities for efficient joint reasoning.
No explicit timestep embeddings — the model infers the denoising state directly from the input latents themselves.
Learned scalar gates with sigmoid activation on each attention head provide training stability and modality balancing.
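As a rough illustration of per-head gating, the sketch below scales each attention head's output by the sigmoid of one scalar. It assumes the gate is a single learned logit per head; the source does not say whether the gate also depends on the input, and the names `gate_heads` and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate_logit).

    head_outputs: one list of floats per attention head.
    gate_logits:  one scalar per head (ASSUMPTION: a plain learned parameter).
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    return [
        [sigmoid(logit) * value for value in head]
        for head, logit in zip(head_outputs, gate_logits)
    ]


# A logit of 0 halves a head's contribution; a large positive logit passes it through.
gated = gate_heads([[2.0, 4.0], [1.0, 1.0]], [0.0, 20.0])
```

Because the sigmoid keeps every gate in (0, 1), a head can be attenuated but never amplified or sign-flipped, which is one plausible reason such gates help with stability.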
Denoising and reference signals handled through a minimal unified interface — no dedicated conditioning branches needed.
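The sandwich layout described above (modality-specific projections in the first and last 4 of 40 layers, shared parameters in the middle 32) can be sketched as a lookup from layer index and modality to a weight set. This is only an illustration of the layout: the key format and the function names `layer_kind` and `projection_key` are hypothetical, and how the reference image latent is routed is not specified in the source, so it is left out.

```python
NUM_LAYERS = 40   # from the description above
EDGE_LAYERS = 4   # modality-specific layers at each end
MODALITIES = ("text", "video", "audio")


def layer_kind(index):
    """Return 'modality_specific' for the first/last 4 layers, else 'shared'."""
    if not 0 <= index < NUM_LAYERS:
        raise IndexError(f"layer index {index} out of range")
    if index < EDGE_LAYERS or index >= NUM_LAYERS - EDGE_LAYERS:
        return "modality_specific"
    return "shared"


def projection_key(index, modality):
    """Name the weight set a token of `modality` would use at layer `index`.

    The key format is a HYPOTHETICAL naming scheme for illustration only.
    """
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if layer_kind(index) == "modality_specific":
        return f"layer{index}.{modality}"
    return f"layer{index}.shared"


# Count how many layers fall in the shared middle section.
shared_layers = sum(layer_kind(i) == "shared" for i in range(NUM_LAYERS))
```

In the middle section every modality resolves to the same key, which is what lets one set of weights attend over the whole mixed token sequence.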
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
|---|---|---|---|---|
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |
| Matchup | daVinci Win | Tie | Opponent Win |
|---|---|---|---|
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
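The pairwise figures can be sanity-checked: the win rates quoted earlier (80.0% vs Ovi 1.1, 60.9% vs LTX 2.3), together with the tie and opponent-win percentages in the table, should account for every comparison. A small check, with the dictionary layout chosen here purely for illustration:

```python
# Percentages taken from the reported human evaluation.
# Each tuple is (daVinci win, tie, opponent win).
MATCHUPS = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def outcomes_total(opponent):
    """Sum of the three outcome percentages for one matchup, rounded to 0.1."""
    return round(sum(MATCHUPS[opponent]), 1)


for name in MATCHUPS:
    print(f"vs {name}: outcomes sum to {outcomes_total(name)}%")
```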
Latency breakdown for generating a 5-second video on a single H100 GPU:
| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
|---|---|---|---|---|
| 256p | 1.6 | — | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
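The totals in the table are the sum of the three stages. The sketch below reproduces them and shows how much of each total a given stage accounts for; the 256p row has no super-resolution stage, which is modeled here as 0.0 seconds. The function names are illustrative.

```python
# Per-stage latency in seconds, copied from the table above.
STAGES = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4},
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8},
}


def total_latency(resolution):
    """End-to-end seconds for one resolution, rounded to 0.1 s."""
    return round(sum(STAGES[resolution].values()), 1)


def stage_share(resolution, stage):
    """Fraction of the end-to-end time spent in one stage."""
    stages = STAGES[resolution]
    return stages[stage] / sum(stages.values())


for res in STAGES:
    share = stage_share(res, "super_res")
    print(f"{res}: {total_latency(res)} s total, super-res share {share:.0%}")
```

At 1080p, super-resolution accounts for roughly 81% of the total, so the base generation stage is not the bottleneck at high resolution.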
Two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.
A lightweight re-trained decoder that substantially reduces decoding overhead while preserving visual fidelity.
MagiCompiler fuses operators across Transformer layers for approximately 1.2x inference speedup through graph-level optimization.
Distillation enables generation with only 8 denoising steps and no classifier-free guidance, without sacrificing output quality.
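To see why dropping classifier-free guidance matters as much as cutting steps, it helps to count network evaluations. In its standard form, classifier-free guidance evaluates the model twice per denoising step (once conditional, once unconditional). The sketch below counts evaluations for the distilled setting stated above (8 steps, no guidance). The comparison setting of 32 guided steps is a hypothetical baseline chosen only for illustration; the source does not state the base model's step count.

```python
def forward_passes(num_steps, use_cfg):
    """Network evaluations needed to generate one sample.

    With classifier-free guidance, each step runs the model twice
    (conditional and unconditional); without it, once.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return num_steps * (2 if use_cfg else 1)


# Distilled setting from the text: 8 steps, no guidance.
DISTILLED = forward_passes(8, use_cfg=False)

# HYPOTHETICAL baseline for comparison only; not a figure from the source.
HYPOTHETICAL_BASELINE = forward_passes(32, use_cfg=True)

print(f"distilled: {DISTILLED} passes, hypothetical baseline: {HYPOTHETICAL_BASELINE} passes")
```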
# Pull the prebuilt MagiHuman image
docker pull sandai/magi-human:latest
docker run -it --gpus all --network host --ipc host \
-v /path/to/repos:/workspace \
-v /path/to/checkpoints:/models \
--name my-magi-human \
sandai/magi-human:latest \
bash
# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..
# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman

# Alternative to the Docker image: create a Conda environment
conda create -n davinci-magihuman python=3.12
conda activate davinci-magihuman
conda install ffmpeg
# Install PyTorch
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0
# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..
# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..
# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt
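After the Conda-based install, it can be worth confirming that the pinned PyTorch packages actually resolved to the versions listed above. The following is a minimal sketch using only the standard library; `check_pins` is a hypothetical helper, not part of the repository. It ignores local version suffixes such as `+cu128`, which some wheels carry.

```python
from importlib import metadata

# Pins copied from the pip install line above.
PINNED = {"torch": "2.10.0", "torchvision": "0.25.0", "torchaudio": "2.10.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched package.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Compare only the public version, dropping any "+local" suffix.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for pkg, (want, got) in check_pins(PINNED).items():
        print(f"{pkg}: expected {want}, found {got}")
```

An empty result means all three packages match their pins.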
# Optional (only for sr-1080p): Install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .

@misc{davinci-magihuman-2026,
title = {Speed by Simplicity: A Single-Stream Architecture
for Fast Audio-Video Generative Foundation Model},
author = {SII-GAIR and Sand.ai},
year = {2026},
url = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}