architecture · training framework · sampling method
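The network \(\epsilon_\theta\) learns to predict the noise \(\epsilon\) that was added to the clean image \(x_0\) to produce \(x_t\). Since \(x_t\) and \(\epsilon\) together determine \(x_0\) for a known noise schedule, this is equivalent to predicting the clean image.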
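To make the equivalence concrete, here is a minimal sketch assuming the standard DDPM forward process \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\). The symbol \(\bar\alpha_t\) and the helper names `add_noise` and `recover_x0` are assumptions introduced for illustration, and a short list of floats stands in for an image.

```python
import math
import random

def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * e for x, e in zip(x0, eps)]

def recover_x0(x_t, eps, alpha_bar_t):
    """Invert the forward process: given x_t and the noise, solve for x0."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [(xt - s * e) / a for xt, e in zip(x_t, eps)]

random.seed(0)
x0 = [0.5, -0.2, 0.9]                        # toy "clean image"
eps = [random.gauss(0.0, 1.0) for _ in x0]   # the noise the network would predict
x_t = add_noise(x0, eps, alpha_bar_t=0.3)
x0_hat = recover_x0(x_t, eps, alpha_bar_t=0.3)
```

With the exact noise, `x0_hat` matches `x0` up to floating-point error. A trained network only approximates \(\epsilon\), so its implied \(x_0\) is an estimate.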
Straight paths mean the ODE solver needs far fewer steps than it does on curved diffusion trajectories. The architecture can be the same as for DDPM; only the training target differs.
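As a hedged sketch of what "straight paths" and "training target" mean here, assume the linear interpolation \(x_t = (1-t)\,x_0 + t\,\epsilon\) with velocity target \(v = \epsilon - x_0\). This is one common convention and is not stated in the text above; the function names are hypothetical.

```python
def interpolate(x0, eps, t):
    """Point on the straight line between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * x + t * e for x, e in zip(x0, eps)]

def velocity_target(x0, eps):
    """Regression target for the network: the constant velocity d x_t / d t."""
    return [e - x for x, e in zip(x0, eps)]

def euler_step(x, v, dt):
    """One explicit Euler step of the ODE dx/dt = v."""
    return [xi + dt * vi for xi, vi in zip(x, v)]

x0 = [0.5, -0.2, 0.9]    # toy data sample
eps = [1.3, 0.1, -0.7]   # toy noise sample
v = velocity_target(x0, eps)

x_mid = interpolate(x0, eps, 0.5)
# Because the path is a straight line, a single Euler step from noise (t=1)
# back to t=0 lands exactly on the data point for this pair.
x_end = euler_step(eps, v, dt=-1.0)
x_end_from_mid = euler_step(x_mid, v, dt=-0.5)
```

This exactness holds for a single data and noise pair with the true velocity. A learned velocity field averages over many pairs, so real trajectories are only approximately straight and a few steps are still needed.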
Sampling requires \(T \approx 1000\) sequential steps. Each step is a single network evaluation, but there are many of them. The injected noise \(\sigma_t z\) is needed for the sampler to follow the reverse process, but it makes sampling stochastic and slow.
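A single reverse step can be sketched as follows, assuming the standard DDPM update \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\bigr) + \sigma_t z\). The symbols \(\alpha_t\) and \(\bar\alpha_t\) and the function `ddpm_step` are assumptions for illustration, and `eps_pred` stands in for the network output.

```python
import math
import random

def ddpm_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One stochastic reverse step: denoised mean plus fresh Gaussian noise."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    out = []
    for x, e in zip(x_t, eps_pred):
        mean = (x - coef * e) / math.sqrt(alpha_t)
        z = rng.gauss(0.0, 1.0)          # the injected noise z
        out.append(mean + sigma_t * z)
    return out

x_t = [0.2, -1.1]
eps_pred = [0.4, -0.3]   # placeholder for the network's noise prediction

# Different random draws give different samples from the same inputs.
a = ddpm_step(x_t, eps_pred, 0.99, 0.5, sigma_t=0.1, rng=random.Random(1))
b = ddpm_step(x_t, eps_pred, 0.99, 0.5, sigma_t=0.1, rng=random.Random(2))

# With sigma_t = 0 the step reduces to its deterministic mean.
m = ddpm_step(x_t, eps_pred, 0.99, 0.5, sigma_t=0.0, rng=random.Random(3))
```

A full sampler would call this once per timestep, from \(t = T\) down to \(t = 1\), with a fresh network evaluation each time.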
Replaces the stochastic DDPM sampler with a deterministic update that can be read as solving an ODE: same trained model, roughly 10× fewer steps. A precursor to flow matching: it showed that deterministic ODE sampling works.
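The deterministic update can be sketched like this, assuming the noise-free (\(\eta = 0\)) DDIM step: estimate \(x_0\) from the predicted noise, then re-noise it to the earlier timestep with the same noise. The \(\bar\alpha\) notation and the function names are assumptions, and the true noise is used as an oracle in place of a trained network.

```python
import math

def noised(x0, eps, alpha_bar):
    """Forward process at noise level alpha_bar."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * e for x, e in zip(x0, eps)]

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic jump from noise level alpha_bar_t to alpha_bar_prev."""
    a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, s_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    out = []
    for x, e in zip(x_t, eps_pred):
        x0_hat = (x - s_t * e) / a_t       # implied clean image
        out.append(a_p * x0_hat + s_p * e)  # re-noise with the same eps, no fresh z
    return out

x0 = [0.5, -0.2, 0.9]
eps = [1.3, 0.1, -0.7]
x_t = noised(x0, eps, 0.1)

# A large jump in noise level is taken in a single step.
x_s = ddim_step(x_t, eps, alpha_bar_t=0.1, alpha_bar_prev=0.6)
x_final = ddim_step(x_t, eps, alpha_bar_t=0.1, alpha_bar_prev=1.0)
```

With oracle noise the jump is exact for any step size, which is why timesteps can be skipped. A real network's prediction changes along the trajectory, so larger jumps trade accuracy for speed.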
Unlike in CNNs, every patch attends to every other patch from the first layer. There is little built-in spatial bias beyond patching and position embeddings, so the model learns spatial structure from data. Scales reliably with compute.
DiT is an architecture only: it is agnostic to whether you train with DDPM or flow matching. Modern models (Flux, SD3) pair DiT-style transformers with flow matching, combining a scalable architecture with few-step sampling.