Generative Model Components

architecture · training framework · sampling method

Training Frameworks
DDPM
Denoising Diffusion Probabilistic Models
Training Framework
Defines a stochastic forward process that gradually corrupts data into Gaussian noise, then trains a network to reverse it by predicting the noise added at each step.
Forward Process (closed form)
\( x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \)
where
\( \bar\alpha_t = \prod_{s=1}^{t}(1 - \beta_s) \)
Training Loss
\( \mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right] \)

The network \(\epsilon_\theta\) learns to predict the noise \(\epsilon\). Because the closed form above can be solved for \(x_0\) given \(x_t\) and \(\epsilon\), predicting the noise is equivalent, up to a time-dependent rescaling, to predicting the clean image.
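A minimal NumPy sketch of the closed-form forward process and the noise-prediction loss. The linear schedule (\(\beta_1 = 10^{-4}\) to \(\beta_T = 0.02\), \(T = 1000\)), the names `q_sample` and `ddpm_loss`, and the `zero_model` stand-in for a trained network are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule; index t-1 holds the value for step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward process: jump from x_0 directly to x_t."""
    a = alpha_bars[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def ddpm_loss(eps_model, x0, t, eps):
    """Squared error between true and predicted noise.
    Averaged over elements, which differs from the norm only by a constant factor."""
    x_t = q_sample(x0, t, eps)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Hypothetical stand-in for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

x0 = rng.standard_normal((4, 8))      # toy batch of "clean" data
eps = rng.standard_normal(x0.shape)
t = 500
x_t = q_sample(x0, t, eps)
loss = ddpm_loss(zero_model, x0, t, eps)

# Solving the closed form for x_0 shows why predicting eps and predicting x_0 are interchangeable.
a = alpha_bars[t - 1]
x0_rec = (x_t - np.sqrt(1.0 - a) * eps) / np.sqrt(a)
```

In real training, \(t\) would be drawn uniformly per example and `zero_model` replaced by the network being optimized.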

Flow Matching
Conditional Flow Matching (CFM)
Training Framework
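Defines a deterministic ODE that transports noise \(x_0\) to data \(x_1\) along straight-line conditional paths, and trains a vector field \(v_\theta\) with a plain MSE regression; no score estimation or SDE is needed. Note that the time convention is flipped relative to DDPM: here \(x_0\) is noise and \(x_1\) is data.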
Linear Interpolant Path
\( x_t = (1-t)\,x_0 + t\,x_1, \quad t \in [0,1] \)
Conditional Target Vector Field
\( u_t(x_t \mid x_0, x_1) = x_1 - x_0 \)
CFM Loss
\( \mathcal{L}_\text{CFM} = \mathbb{E}_{t,\, x_0,\, x_1}\!\left[\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\right] \)
Inference ODE
\( \dfrac{dx}{dt} = v_\theta(x_t,\, t), \qquad x_0 \sim \mathcal{N}(0, I) \)

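Because the conditional paths are straight, an ODE solver typically needs far fewer steps than on the curved trajectories of diffusion models. The same network architecture can be used as for DDPM; only the interpolation path and the training target differ.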
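The interpolant, the regression target, and Euler integration of the inference ODE can be sketched as follows. This is a toy illustration under stated assumptions: all data is placed at a single point `x1_star`, so the exact vector field \((x_1 - x)/(1 - t)\) is known in closed form and `oracle_v` stands in for a trained \(v_\theta\). The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_pair(x0, x1, t):
    """Point on the linear path and its regression target."""
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return x_t, target

def cfm_loss(v_model, x0, x1, t):
    """MSE between the model's velocity and the straight-line velocity."""
    x_t, target = cfm_pair(x0, x1, t)
    return np.mean((v_model(x_t, t) - target) ** 2)

def euler_sample(v_model, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_model(x, k * dt)
    return x

# Toy assumption: every data sample equals x1_star, so the exact field is known.
x1_star = np.array([2.0, -1.0])

def oracle_v(x, t):
    return (x1_star - x) / (1.0 - t)

x0 = rng.standard_normal((5, 2))               # noise samples
loss = cfm_loss(oracle_v, x0, x1_star, t=0.3)  # exact field, so the loss vanishes
sample_1 = euler_sample(oracle_v, x0, n_steps=1)
sample_8 = euler_sample(oracle_v, x0, n_steps=8)
```

In this idealized case a single Euler step already lands on the data point, because the trajectory is exactly straight. A learned field over a real data distribution is only approximately straight, so more steps are normally used.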

Sampling Methods
DDPM Sampler
Stochastic Ancestral Sampling
Sampling Method
Reverses the forward Markov chain step by step. At each step, part of the predicted noise is removed and fresh Gaussian noise is reinjected, so sampling is inherently stochastic.
Reverse Mean
\( \mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}}\!\left(x_t - \dfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) \)
where
\( \alpha_t = 1 - \beta_t \)
Sampling Step (stochastic)
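\( x_{t-1} = \mu_\theta(x_t, t) + \sigma_t\, z, \quad z \sim \mathcal{N}(0, I) \)
where
\( \sigma_t^2 = \beta_t \;\text{(a common choice)}, \quad z = 0 \;\text{at the final step } t = 1 \)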

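Typically run with \(T \approx 1000\) steps, matching the number of training steps. Each step costs one network evaluation, so sampling is slow. The injected noise \(\sigma_t z\) is part of the reverse Markov chain, which makes the output stochastic even for a fixed starting point \(x_T\).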
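The ancestral sampling loop can be sketched as below. Assumptions: a linear \(\beta\) schedule with \(T = 1000\), the choice \(\sigma_t^2 = \beta_t\), no noise on the final step, and a toy `oracle_eps` that stands in for a trained \(\epsilon_\theta\) by assuming all data equals one point `x0_star`. None of these names come from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_step(eps_model, x_t, t, rng):
    """One ancestral step x_t -> x_{t-1}; t is 1-indexed."""
    i = t - 1
    eps_hat = eps_model(x_t, t)
    mean = (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
    if t == 1:
        return mean                     # no noise is added on the final step
    sigma = np.sqrt(betas[i])           # assumed choice sigma_t^2 = beta_t
    return mean + sigma * rng.standard_normal(x_t.shape)

def ddpm_sample(eps_model, shape, rng):
    """Run the full reverse chain from x_T ~ N(0, I) down to x_0."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        x = ddpm_step(eps_model, x, t, rng)
    return x

# Toy oracle: if all data equals x0_star, the noise contained in x_t is known exactly.
x0_star = np.array([1.5, -0.5])

def oracle_eps(x_t, t):
    a = alpha_bars[t - 1]
    return (x_t - np.sqrt(a) * x0_star) / np.sqrt(1.0 - a)

sample = ddpm_sample(oracle_eps, (3, 2), rng)
```

With the oracle, intermediate states are still random, but the final noise-free step lands on `x0_star`. The loop makes the cost visible: one model call per timestep, 1000 in total.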

DDIM
Denoising Diffusion Implicit Models
Sampling Method
An inference-time reinterpretation of a DDPM-trained model: the sampler follows a non-Markovian process whose noise-free case is a discretized ODE. No noise is injected, so trajectories are deterministic given \(x_T\). Needs no retraining.
Predicted Clean Image
\( \hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \)
Deterministic Update Step
\( x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t) \)

Replaces the stochastic DDPM sampler with a deterministic update on the same trained model, typically with \(\sim\)10× fewer steps or better. A precursor to flow matching: it showed that deterministic ODE-style sampling works in practice.
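A sketch of the deterministic update on a strided timestep schedule. Assumptions: the same linear \(\beta\) schedule as a typical DDPM setup, the convention \(\bar\alpha_0 = 1\), evenly spaced timesteps, and a toy `oracle_eps` in place of a trained network. The helper names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def abar(t):
    """alpha_bar_t, with the convention alpha_bar_0 = 1 (clean data)."""
    return 1.0 if t == 0 else alpha_bars[t - 1]

def ddim_step(eps_model, x_t, t, t_prev):
    """Deterministic DDIM update from timestep t to an earlier timestep t_prev."""
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - np.sqrt(1.0 - abar(t)) * eps_hat) / np.sqrt(abar(t))
    return np.sqrt(abar(t_prev)) * x0_hat + np.sqrt(1.0 - abar(t_prev)) * eps_hat

def ddim_sample(eps_model, x_T, n_steps):
    """Walk an evenly spaced subsequence of timesteps from T down to 0."""
    ts = np.linspace(T, 0, n_steps + 1).round().astype(int)
    x = x_T
    for t, t_prev in zip(ts[:-1], ts[1:]):
        x = ddim_step(eps_model, x, int(t), int(t_prev))
    return x

# Toy oracle: all data equals x0_star, so the noise in x_t is known exactly.
x0_star = np.array([1.5, -0.5])

def oracle_eps(x_t, t):
    return (x_t - np.sqrt(abar(t)) * x0_star) / np.sqrt(1.0 - abar(t))

x_T = np.random.default_rng(0).standard_normal((3, 2))
sample_a = ddim_sample(oracle_eps, x_T, n_steps=50)
sample_b = ddim_sample(oracle_eps, x_T, n_steps=50)
```

Two runs from the same `x_T` give identical results because nothing random happens inside the loop, and 50 model calls replace the 1000 of the ancestral sampler.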

Architectures
ViT
Vision Transformer
Architecture
Applies a standard Transformer to images by splitting them into fixed-size patches, projecting each into a token, and processing the sequence with self-attention.
Patchify + Embed
\( \mathbf{Z}_0 = [\,\mathbf{x}^1_p\mathbf{E};\; \mathbf{x}^2_p\mathbf{E};\;\ldots;\;\mathbf{x}^N_p\mathbf{E}\,] + \mathbf{E}_\text{pos} \)
Scaled Dot-Product Attention
\( \text{Attn}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V \)
Transformer Block
\( \mathbf{x} \leftarrow \mathbf{x} + \text{MHA}(\text{LN}(\mathbf{x})) \)
\( \mathbf{x} \leftarrow \mathbf{x} + \text{MLP}(\text{LN}(\mathbf{x})) \)

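Unlike CNNs, every patch can attend to every other patch from the first layer. The spatial inductive bias is much weaker than in a CNN (only patchification and position embeddings encode layout), so the model learns most spatial structure from data. Scales reliably with compute.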
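The patchify, attention, and block equations above can be sketched in NumPy as follows. Simplifications that are assumptions of this sketch: a single attention head instead of MHA, a ReLU MLP instead of GELU, LayerNorm without learned scale and shift, and small random weights in place of trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into (H/p)*(W/p) flattened patches of length p*p*C."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; also returns the attention weights."""
    w = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return w @ V, w

def block(x, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-LN Transformer block with one attention head (a simplification of MHA)."""
    h = layer_norm(x)
    a, _ = attention(h @ Wq, h @ Wk, h @ Wv)
    x = x + a @ Wo
    h = layer_norm(x)
    return x + np.maximum(h @ W1, 0.0) @ W2   # ReLU MLP for brevity

img = rng.standard_normal((32, 32, 3))
p, d = 8, 64
patches = patchify(img, p)                          # 16 patches of length 192
E = rng.standard_normal((p * p * 3, d)) * 0.02      # patch projection
E_pos = rng.standard_normal((patches.shape[0], d)) * 0.02
Z0 = patches @ E + E_pos                            # token sequence, shape (16, 64)

Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) * 0.02
W2 = rng.standard_normal((4 * d, d)) * 0.02
Z1 = block(Z0, Wq, Wk, Wv, Wo, W1, W2)
_, attn = attention(Z0 @ Wq, Z0 @ Wk, Z0 @ Wv)
```

The attention matrix `attn` has one row per patch and one column per patch, which is the all-to-all connectivity described above.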

DiT
Diffusion Transformer
Architecture
A ViT adapted to generative modelling. Takes \((x_t, t, c)\) as input. The key addition over plain ViT: adaptive LayerNorm (adaLN) injects time and class conditioning into every block.
Condition Embedding
\( \mathbf{c}_\text{cond} = \text{embed}(t) + \text{embed}(c) \)
Adaptive LayerNorm (adaLN)
\( (\gamma,\, \beta) = \text{MLP}(\mathbf{c}_\text{cond}) \)
\( \text{adaLN}(\mathbf{x}) = \gamma \cdot \dfrac{\mathbf{x} - \mu}{\sigma} + \beta \)
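adaLN-Zero (best variant in the DiT paper)
\( (\gamma,\, \beta,\, \alpha) = \text{MLP}(\mathbf{c}_\text{cond}), \quad \mathbf{x} \leftarrow \mathbf{x} + \alpha \cdot f(\text{adaLN}(\mathbf{x})), \quad f \in \{\text{MHA}, \text{MLP}\} \)
\( \text{zero-init final linear of the MLP} \implies \alpha = 0 \implies \text{each block} = \text{identity at initialization} \)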
Framework-Agnostic Output
\( \text{output} = v_\theta(x_t, t) \;\text{ or }\; \epsilon_\theta(x_t, t) \)

DiT is an architecture only: it is agnostic to whether training uses DDPM or flow matching. Recent models such as Flux and SD3 pair DiT-style transformers with flow matching.
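A simplified sketch of conditioning through adaLN with zero initialization. It follows the adaLN formula given above and assumes, as in the adaLN-Zero description of the DiT paper, an extra gate \(\alpha\) on the residual branch that starts at zero. The sinusoidal `timestep_embedding`, the random `class_table`, the single linear modulation layer, and the `tanh` layer standing in for attention or the MLP are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (assumed form)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

class_table = rng.standard_normal((10, d)) * 0.02   # hypothetical class embeddings

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = np.sqrt(x.var(-1, keepdims=True) + eps)
    return (x - mu) / sigma

def adaln_zero_block(x, c_cond, W_mod, b_mod, W_ff):
    """Modulated LayerNorm, a toy layer, and a gated residual connection."""
    gamma, beta, alpha = np.split(c_cond @ W_mod + b_mod, 3, axis=-1)
    h = gamma * layer_norm(x) + beta     # adaLN: scale and shift come from the condition
    h = np.tanh(h @ W_ff)                # hypothetical stand-in for attention / MLP
    return x + alpha * h                 # alpha gates the residual branch

x = rng.standard_normal((4, d))                          # 4 tokens
c_cond = timestep_embedding(250, d) + class_table[3]     # embed(t) + embed(c)
W_ff = rng.standard_normal((d, d)) * 0.1

# Zero-initialized modulation layer: gamma = beta = alpha = 0, so the block is the identity.
W_zero, b_zero = np.zeros((d, 3 * d)), np.zeros(3 * d)
out_init = adaln_zero_block(x, c_cond, W_zero, b_zero, W_ff)

# After training the modulation weights are nonzero and the block changes its input.
W_rand = rng.standard_normal((d, 3 * d)) * 0.1
out_trained = adaln_zero_block(x, c_cond, W_rand, b_zero, W_ff)
```

Nothing in the block refers to a noise schedule or a loss: the same function can serve as \(\epsilon_\theta\) or \(v_\theta\), which is the sense in which the architecture is framework-agnostic.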

How they combine
SD 1.x      →  DDPM training  +  U-Net  +  DDPM/DDIM sampler
DiT paper   →  DDPM training  +  DiT    +  DDPM sampler (250 steps)
Flux / SD3  →  Flow Matching  +  DiT    +  Euler / RK4 ODE solver