daVinci-MagiHuman — Fast Audio-Video Generative Foundation Model

Why daVinci

Highlights

🧠

Single-Stream Transformer

A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.

🎭

Human-Centric Quality

Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.

🌍

Multilingual

Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French out of the box.

⚡

Blazing Fast Inference

Generates a 5-second 256p video in 2.0 seconds and a 5-second 1080p video in 38.4 seconds on a single H100 GPU.

🏆

State-of-the-Art Results

Achieves an 80.0% win rate against Ovi 1.1 and a 60.9% win rate against LTX 2.3 in a pairwise human evaluation of 2,000 comparisons.

📦

Fully Open Source

Complete model stack released: base model, distilled model, super-resolution model, and inference code under Apache 2.0.

Model Design

Architecture

A single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, jointly denoising video and audio within a unified token sequence.
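As a rough illustration of a unified token sequence, the sketch below lays the four input groups end to end and records which spans are denoised. The segment order, the token counts, and the names `Segment` and `build_sequence` are assumptions made for the example, not details of the released code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    modality: str  # "text", "image", "video", or "audio"
    start: int     # first token index (inclusive)
    end: int       # last token index (exclusive)
    noisy: bool    # True if this span is denoised by the model


def build_sequence(n_text, n_image, n_video, n_audio):
    """Place all modalities back to back in one sequence (assumed order)."""
    layout = [
        ("text", n_text, False),    # conditioning tokens
        ("image", n_image, False),  # reference image latent
        ("video", n_video, True),   # noisy video tokens
        ("audio", n_audio, True),   # noisy audio tokens
    ]
    segments, cursor = [], 0
    for modality, length, noisy in layout:
        segments.append(Segment(modality, cursor, cursor + length, noisy))
        cursor += length
    return segments, cursor


# Hypothetical token counts, chosen only to make the layout concrete.
segments, total_tokens = build_sequence(n_text=32, n_image=64, n_video=512, n_audio=128)
denoised = [s.modality for s in segments if s.noisy]
print(total_tokens, denoised)
```

Because every token sits in the same sequence, a single self-attention pass lets video and audio tokens attend to the text and reference tokens without a separate cross-attention module.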

🥪

Sandwich Architecture

The first 4 and last 4 layers use modality-specific projections; the middle 32 layers share parameters across all modalities for efficient joint reasoning.
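The layer split can be written out as a small helper. This is a sketch only: the function name `layer_roles` and the role labels are hypothetical, while the defaults (40 layers, 4 at each end) come from the figures stated on this page.

```python
def layer_roles(num_layers=40, edge_layers=4):
    """Label each layer as modality-specific (outer) or shared (middle)."""
    if 2 * edge_layers > num_layers:
        raise ValueError("edge layers would overlap")
    roles = []
    for index in range(num_layers):
        is_edge = index < edge_layers or index >= num_layers - edge_layers
        roles.append("modality_specific" if is_edge else "shared")
    return roles


roles = layer_roles()
print(roles.count("modality_specific"), roles.count("shared"))
```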

🕐

Timestep-Free Denoising

No explicit timestep embeddings: the model infers the denoising state directly from the input latents.

🔀

Per-Head Gating

Learned scalar gates with sigmoid activation on each attention head provide training stability and modality balancing.
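A minimal sketch of per-head gating, assuming each head owns one learned scalar logit and that the gate multiplies that head's output before the heads are merged. Where exactly the gate is applied in the real model is not stated here, so treat the placement and the name `gate_heads` as assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(logit) for that head."""
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    return [
        [sigmoid(logit) * value for value in head]
        for head, logit in zip(head_outputs, gate_logits)
    ]


# Three toy heads with two output channels each (illustrative values).
heads = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]
logits = [0.0, 10.0, -10.0]  # half open, almost fully open, almost closed
gated = gate_heads(heads, logits)
print(gated)
```

Since the sigmoid keeps every gate between 0 and 1, a head can be attenuated smoothly but never amplified, which is one plausible reading of how the gates help stability.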

🔗

Unified Conditioning

Denoising and reference signals are handled through a minimal unified interface, with no dedicated conditioning branches.

Benchmarks

Performance

Quantitative Quality Benchmark

Model	Visual Quality ↑	Text Alignment ↑	Physical Consistency ↑	WER ↓
Ovi 1.1	4.73	4.10	4.41	40.45%
LTX 2.3	4.76	4.12	4.56	19.23%
daVinci-MagiHuman	4.80	4.18	4.52	14.60%

Human Evaluation — 2,000 Pairwise Comparisons

Matchup	daVinci Win	Tie	Opponent Win
vs Ovi 1.1	80.0%	8.2%	11.8%
vs LTX 2.3	60.9%	17.2%	21.9%
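One way to read the table is to set ties aside and look at the win rate among decisive comparisons only. The snippet below is plain arithmetic on the reported percentages; the helper name `decisive_win_rate` is hypothetical, and the derived figures are not part of the reported evaluation.

```python
# (win %, tie %, opponent win %) copied from the table above.
RESULTS = {
    "Ovi 1.1": (80.0, 8.2, 11.8),
    "LTX 2.3": (60.9, 17.2, 21.9),
}


def decisive_win_rate(win, tie, loss):
    """Share of non-tied comparisons that were wins."""
    return win / (win + loss)


for opponent, (win, tie, loss) in RESULTS.items():
    # Each row should account for all comparisons.
    assert abs(win + tie + loss - 100.0) < 1e-6
    print(f"vs {opponent}: {decisive_win_rate(win, tie, loss):.1%} of decisive comparisons")
```

With ties excluded, the model wins about 87% of decisive comparisons against Ovi 1.1 and about 74% against LTX 2.3.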

Inference Speed

5-second video on a single H100 GPU

256p: 2.0 seconds
540p: 8.0 seconds
1080p: 38.4 seconds

Resolution	Base (s)	Super-Res (s)	Decode (s)	Total (s)
256p	1.6	—	0.4	2.0
540p	1.6	5.1	1.3	8.0
1080p	1.6	31.0	5.8	38.4
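The breakdown can be checked with a few lines of arithmetic that confirm each total equals the sum of its stages and show where the time goes. The missing super-resolution entry at 256p is treated as 0.0 seconds, which is an assumption about what the dash means.

```python
# Stage timings in seconds, copied from the table above.
TIMINGS = {
    "256p": {"base": 1.6, "super_res": 0.0, "decode": 0.4, "total": 2.0},  # 0.0: no SR stage listed
    "540p": {"base": 1.6, "super_res": 5.1, "decode": 1.3, "total": 8.0},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8, "total": 38.4},
}
STAGES = ("base", "super_res", "decode")


def stage_sum(row):
    return sum(row[stage] for stage in STAGES)


def stage_shares(row):
    """Fraction of the summed stage time spent in each stage."""
    total = stage_sum(row)
    return {stage: row[stage] / total for stage in STAGES}


for resolution, row in TIMINGS.items():
    assert abs(stage_sum(row) - row["total"]) < 1e-9  # totals are consistent
    shares = {stage: f"{share:.0%}" for stage, share in stage_shares(row).items()}
    print(resolution, shares)
```

The base generation cost stays at 1.6 seconds at every resolution, so at 1080p the super-resolution stage accounts for about 81% of the total.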

Speed

Efficient Inference

⚡ Latent-Space Super-Resolution

Two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.

🔄 Turbo VAE Decoder

A lightweight re-trained decoder that substantially reduces decoding overhead while preserving visual fidelity.

🔧 Full-Graph Compilation

MagiCompiler fuses operators across Transformer layers, a graph-level optimization that yields an approximately 1.2x inference speedup.

💨 DMD-2 Distillation

Distillation enables generation with only 8 denoising steps and no classifier-free guidance, without sacrificing output quality.
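To see why dropping classifier-free guidance matters for speed, recall that guidance is commonly implemented with two model evaluations per step, one conditional and one unconditional, which are then blended. The sketch below counts evaluations under that common setup; it is a generic illustration with hypothetical helper names, not the project's sampler.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance blend: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(num_steps, use_cfg):
    """Model evaluations for a full sampling run (two per step with guidance)."""
    return num_steps * (2 if use_cfg else 1)


# The distilled model: 8 steps, no guidance.
distilled = forward_passes(num_steps=8, use_cfg=False)
# The same 8 steps with guidance enabled, for comparison only.
with_guidance = forward_passes(num_steps=8, use_cfg=True)
print(distilled, with_guidance)
```

At a fixed step count, removing guidance halves the number of model evaluations, on top of whatever reduction comes from using fewer steps.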

Installation

Getting Started

# Docker setup: pull the prebuilt MagiHuman image
docker pull sandai/magi-human:latest

docker run -it --gpus all --network host --ipc host \
  -v /path/to/repos:/workspace \
  -v /path/to/checkpoints:/models \
  --name my-magi-human \
  sandai/magi-human:latest \
  bash

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman

# Conda setup (alternative to the Docker image): create environment
conda create -n davinci-magihuman python=3.12
conda activate davinci-magihuman
conda install ffmpeg

# Install PyTorch
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt

# Optional (only for sr-1080p): Install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
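
After the installation steps above, a quick sanity check can confirm that the core packages and the ffmpeg binary are visible from the active environment. This is a hypothetical helper, not part of the repository; it only checks the PyTorch packages and ffmpeg, since the import names of the other components are not given here.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchvision", "torchaudio"),
                      binaries=("ffmpeg",)):
    """Report which top-level modules are importable and which binaries are on PATH."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        report[f"binary:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print(f"{item}: {'ok' if found else 'MISSING'}")
```

Pass extra names through `modules` or `binaries` to extend the check once you know the import names of the other installed components.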

Reference

Citation

@misc{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture
             for Fast Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  year    = {2026},
  url     = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}