Generative Model Components

DDPM

Denoising Diffusion Probabilistic Models

Training Framework

Defines a stochastic forward process that gradually corrupts data into Gaussian noise, then trains a network to reverse it by predicting the noise added at each step.

Forward Process (closed form)
\( x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \)
where \( \bar\alpha_t = \prod_{s=1}^{t}(1 - \beta_s) \)

Training Loss
\( \mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right] \)

↯

Network \(\epsilon_\theta\) learns to predict the noise \(\epsilon\). Given \(x_t\) and \(t\), \(\epsilon\) determines \(x_0\), so predicting the noise is equivalent to predicting the clean image, up to a timestep-dependent weighting of the loss.
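The closed-form forward process and the noise-prediction loss above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the linear \(\beta\) schedule, the 1-indexed timestep convention, the helper names, and the `zero_model` stand-in for \(\epsilon_\theta\) are all assumptions made for the example.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and its cumulative alpha-bar."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)  # alpha_bars[t-1] = prod_{s<=t} (1 - beta_s)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form forward process for a 1-indexed timestep t."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def x0_from_eps(x_t, t, alpha_bars, eps):
    """Invert q_sample: x_t and eps together determine x0."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)

def ddpm_loss(eps_model, x0, alpha_bars, rng):
    """One-sample Monte Carlo estimate of the loss.

    Uses the mean of squared errors, which equals the squared norm
    divided by the number of elements.
    """
    t = int(rng.integers(1, len(alpha_bars) + 1))  # uniform over 1..T
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, eps)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8))                 # toy "data" batch
zero_model = lambda x_t, t: np.zeros_like(x_t)   # hypothetical stand-in for a network
print(ddpm_loss(zero_model, x0, alpha_bars, rng))
```

In a real training loop `eps_model` would be a neural network and the loss would be averaged over a batch of independently sampled timesteps.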

Flow Matching

Conditional Flow Matching (CFM)

Training Framework

Defines a deterministic ODE that transports noise to data along straight-line paths. Trains a vector field \(v_\theta\) via simple MSE — no score estimation or SDE needed.

Linear Interpolant Path
\( x_t = (1-t)\,x_0 + t\,x_1, \quad t \in [0,1] \)
with noise \( x_0 \sim \mathcal{N}(0, I) \) and a data sample \( x_1 \)

Conditional Target Vector Field
\( u_t(x \mid x_0, x_1) = x_1 - x_0 \)

CFM Loss
\( \mathcal{L}_\text{CFM} = \mathbb{E}_{t,\, x_0,\, x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right] \)

Inference ODE
\( \dfrac{dx_t}{dt} = v_\theta(x_t,\, t), \qquad x_0 \sim \mathcal{N}(0, I) \), integrated from \(t = 0\) to \(t = 1\)

↯

Each conditional path is a straight line, so the learned ODE can usually be integrated in far fewer steps than a curved diffusion trajectory. The network architecture can be the same as for DDPM; only the training target differs.
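As a rough sketch of how the interpolant, the regression target, and the inference ODE fit together, the NumPy snippet below builds one training pair and integrates a vector field with fixed-step Euler. The function names and the Euler integrator are assumptions for illustration; `toy_field` is a hypothetical closed-form field for a dataset consisting of a single point, standing in for a trained \(v_\theta\).

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Return the interpolated point x_t and the regression target x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def cfm_loss(v_model, x0, x1, t):
    """Mean squared error between the predicted and target velocities."""
    x_t, target = cfm_pair(x0, x1, t)
    return float(np.mean((v_model(x_t, t) - target) ** 2))

def euler_sample(v_model, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * v_model(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0])        # the single "data" point of the toy problem
x0 = rng.standard_normal(2)       # noise sample

# Hypothetical field: for a one-point dataset the velocity at (x, t)
# points straight at x1 with magnitude (x1 - x) / (1 - t).
toy_field = lambda x, t: (x1 - x) / (1.0 - t)

print(cfm_loss(toy_field, x0, x1, t=0.3))     # ~0: this field matches the target
print(euler_sample(toy_field, x0, num_steps=4))
```

Because every trajectory of this toy field is a straight line, Euler integration reaches `x1` regardless of the step count; a learned field on real data is generally only approximately straight.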

DDPM Sampler

Stochastic Ancestral Sampling

Sampling Method

Reverses the forward noising chain step by step. At each step, the predicted noise is removed and fresh Gaussian noise is reinjected, so sampling is inherently stochastic.

Reverse Mean
\( \mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}}\!\left(x_t - \dfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) \)
where \( \alpha_t = 1 - \beta_t \)

Sampling Step (stochastic)
\( x_{t-1} = \mu_\theta(x_t, t) + \sigma_t\, z, \quad z \sim \mathcal{N}(0, I) \)
with \( z = 0 \) at the final step \(t = 1\); a common choice is \( \sigma_t^2 = \beta_t \)

↯

Typically runs all \(T \approx 1000\) steps of the training chain. Each step is a single network evaluation, but there are many of them. The injected noise \(\sigma_t z\) is part of the reverse process, so sampling is both stochastic and slow.
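The reverse mean and the stochastic step can be sketched as a short sampling loop. The snippet assumes \(\alpha_t = 1 - \beta_t\), picks \(\sigma_t^2 = \beta_t\) (one common choice, not the only one), skips the noise at \(t = 1\), and uses a hypothetical `zero_model` in place of a trained \(\epsilon_\theta\); the 20-step schedule is only there to keep the example fast.

```python
import numpy as np

def ddpm_step(eps_model, x_t, t, betas, alpha_bars, rng):
    """One ancestral step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    eps_hat = eps_model(x_t, t)
    # Reverse mean: remove the predicted noise, then rescale.
    coef = beta_t / np.sqrt(1.0 - alpha_bars[t - 1])
    mean = (x_t - coef * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean                      # no noise on the final step
    sigma_t = np.sqrt(beta_t)            # assumed choice: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def ddpm_sample(eps_model, shape, betas, rng):
    """Run the full reverse chain from pure noise down to t = 1."""
    alpha_bars = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)       # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        x = ddpm_step(eps_model, x, t, betas, alpha_bars, rng)
    return x

betas = np.linspace(1e-4, 0.02, 20)              # short toy schedule
zero_model = lambda x_t, t: np.zeros_like(x_t)   # hypothetical stand-in for a network
sample = ddpm_sample(zero_model, (2, 3), betas, np.random.default_rng(0))
print(sample.shape)
```

Running the loop twice with different generator seeds gives different samples even from the same model, which is the stochasticity the card describes.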

DDIM

Denoising Diffusion Implicit Models

Sampling Method

An inference-time trick that re-interprets a DDPM-trained model as an ODE. No noise is injected — trajectories are deterministic. Needs no retraining.

Predicted Clean Image
\( \hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \)

Deterministic Update Step
\( x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t) \)

↯

Replaces the stochastic DDPM sampler with a deterministic one that reuses the same trained model, typically with 10 to 50× fewer steps. A precursor to flow matching: it showed that deterministic ODE sampling works in practice.
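The two formulas above translate almost directly into code. The sketch below generalises the update from \(t \to t-1\) to \(t \to t_\text{prev}\) so that timesteps can be skipped, and treats \(\bar\alpha_0 = 1\) for the last step; the evenly spaced timestep grid, the function names, and the `oracle` noise predictor are assumptions for illustration only.

```python
import numpy as np

def ddim_step(eps_model, x_t, t, t_prev, alpha_bars):
    """Deterministic update from timestep t to t_prev (1-indexed, t_prev = 0 is clean data)."""
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0  # assumed: alpha_bar_0 = 1
    eps_hat = eps_model(x_t, t)
    # Predicted clean image from the current noisy sample.
    x0_hat = (x_t - np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(a_bar_t)
    # Re-noise the prediction to the earlier noise level with the same eps_hat.
    return np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_hat

def ddim_sample(eps_model, x_T, alpha_bars, num_steps=10):
    """Walk an evenly spaced subset of timesteps from T down to 0."""
    T = len(alpha_bars)
    timesteps = np.linspace(T, 0, num_steps + 1).round().astype(int)
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(eps_model, x, int(t), int(t_prev), alpha_bars)
    return x

# Toy check with a hypothetical oracle that always returns the true noise.
rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_T = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1.0 - alpha_bars[-1]) * eps
oracle = lambda x_t, t: eps
print(np.allclose(ddim_sample(oracle, x_T, alpha_bars), x0))
```

With a perfect noise predictor the trajectory returns to `x0` for any number of steps; with a real network, fewer steps trade sample quality for speed.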

ViT

Vision Transformer

Architecture

Applies a standard Transformer to images by splitting them into fixed-size patches, projecting each into a token, and processing the sequence with self-attention.

Patchify + Embed
\( \mathbf{Z}_0 = [\,\mathbf{x}^1_p\mathbf{E};\; \mathbf{x}^2_p\mathbf{E};\;\ldots;\;\mathbf{x}^N_p\mathbf{E}\,] + \mathbf{E}_\text{pos} \)

Scaled Dot-Product Attention
\( \text{Attn}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V \)

Transformer Block
\( \mathbf{x} \leftarrow \mathbf{x} + \text{MHA}(\text{LN}(\mathbf{x})) \)
\( \mathbf{x} \leftarrow \mathbf{x} + \text{MLP}(\text{LN}(\mathbf{x})) \)

↯

Unlike CNNs, every patch attends to every other patch from layer 1. Apart from patching and position embeddings, there is little built-in spatial inductive bias; the model learns spatial structure from data. Scales reliably with compute.
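The patchify, embed, attention, and block equations above can be sketched in NumPy as follows. To keep it short, the example makes several simplifying assumptions that differ from a full ViT: a single attention head instead of multi-head attention, LayerNorm without learned scale and shift, ReLU in place of GELU, and random matrices standing in for learned weights.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into N = (H/p)*(W/p) flattened p*p*C patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, p, p, C)
    return x.reshape(-1, p * p * C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a sequence of tokens."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def vit_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    h = layer_norm(x)
    x = x + attention(h @ Wq, h @ Wk, h @ Wv) @ Wo
    h = layer_norm(x)
    return x + np.maximum(h @ W1, 0.0) @ W2   # ReLU used here as a simplification

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
p, d = 8, 16
patches = patchify(img, p)                                  # (16, 192)
E = rng.standard_normal((p * p * 3, d)) * 0.02              # patch projection
E_pos = rng.standard_normal((patches.shape[0], d)) * 0.02   # position embeddings
Z0 = patches @ E + E_pos
weights = [rng.standard_normal(s) * 0.02
           for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
print(vit_block(Z0, *weights).shape)
```

The attention matrix here is N × N over all patches, which is what lets every patch see every other patch in the first layer.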

DiT

Diffusion Transformer

Architecture

A ViT adapted to generative modelling. Takes \((x_t, t, c)\) as input. The key addition over plain ViT: adaptive LayerNorm (adaLN) injects time and class conditioning into every block.

Condition Embedding
\( \mathbf{c}_\text{cond} = \text{embed}(t) + \text{embed}(c) \)

Adaptive LayerNorm (adaLN)
\( (\gamma,\, \beta) = \text{MLP}(\mathbf{c}_\text{cond}) \)
\( \text{adaLN}(\mathbf{x}) = \gamma \cdot \dfrac{\mathbf{x} - \mu}{\sigma} + \beta \)

adaLN-Zero (best variant)
\( \text{init final linear} \to \mathbf{0} \implies \text{each block} \approx \text{identity at initialisation} \)
The conditioning MLP also outputs a gate that scales each residual branch, so zero-initialising its final linear layer switches every branch off at the start of training.

Framework-Agnostic Output
\( \text{output} = v_\theta(x_t, t) \;\text{ or }\; \epsilon_\theta(x_t, t) \)

↯

DiT is an architecture only; it is agnostic to whether training uses DDPM or flow matching. Recent models such as Flux and SD3 pair a DiT-style backbone with flow matching.
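A minimal NumPy sketch of one adaLN-Zero-style residual branch is given below. It assumes the conditioning network is reduced to a single linear layer (`W_mod`, `b_mod`) that emits \(\gamma\), \(\beta\), and a residual gate, with the gate following the adaLN-Zero idea of scaling the branch output before the residual add; `sublayer` is a hypothetical stand-in for attention or the MLP. All names are illustrative rather than taken from any particular codebase.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine parameters; adaLN supplies them instead."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_branch(x, c_cond, W_mod, b_mod, sublayer):
    """x: (N, d) tokens, c_cond: (d_c,), W_mod: (d_c, 3d), b_mod: (3d,)."""
    # Regress scale, shift and gate from the conditioning vector.
    gamma, beta, gate = np.split(c_cond @ W_mod + b_mod, 3)
    h = gamma * layer_norm(x) + beta          # adaLN modulation
    return x + gate * sublayer(h)             # gated residual branch

rng = np.random.default_rng(0)
N, d, d_c = 16, 8, 8
x = rng.standard_normal((N, d))
c_cond = rng.standard_normal(d_c)             # e.g. embed(t) + embed(c)
W_sub = rng.standard_normal((d, d))
sublayer = lambda h: h @ W_sub                # hypothetical stand-in for attention/MLP

# Zero-initialised modulation layer: the branch contributes nothing yet.
W_mod = np.zeros((d_c, 3 * d))
b_mod = np.zeros(3 * d)
out = adaln_zero_branch(x, c_cond, W_mod, b_mod, sublayer)
print(np.allclose(out, x))
```

Once training moves `W_mod` and `b_mod` away from zero, the gate opens and the conditioning starts to shape each block's output.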