Neural Tech Daily

Diffusion image generation from DDPM to Flux: a multi-paper lineage review

Multi-paper review of the diffusion-image-generation lineage — DDPM, Latent Diffusion, SDXL, and Flux. What each paper changed about the loss, the architecture, and…


Reading-register key

From the paper: — directly supported by the paper. [Analysis] — the publication’s reasoned assessment. [Reconstructed] — faithful reconstruction from partial disclosure. [External comparison] — comparison to prior work or general knowledge. [Reviewer Perspective] — critical or speculative.

Figure 2 of Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (arXiv:2006.11239): the directed graphical model showing the forward noising chain and the learned reverse chain. Reproduced for editorial coverage.

Section 1 — Paper identity and scope

This review covers four papers that together define the modern diffusion-image-generation lineage.

  1. Ho, Jain, Abbeel — Denoising Diffusion Probabilistic Models (DDPM). NeurIPS 2020. arXiv:2006.11239. The foundational paper that turned Sohl-Dickstein et al.’s 2015 non-equilibrium-thermodynamics generative framework into a working high-quality image model, and recast the training objective as noise-prediction.
  2. Rombach, Blattmann, Lorenz, Esser, Ommer — High-Resolution Image Synthesis with Latent Diffusion Models (LDM, the paper behind Stable Diffusion). CVPR 2022. arXiv:2112.10752. Moves diffusion from pixel space into the latent space of a pre-trained autoencoder and adds cross-attention conditioning, the architectural template Stable Diffusion 1.x and 2.x build on directly.
  3. Podell, English, Lacey, Blattmann, Dockhorn, Müller, Penna, Rombach — SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952, July 2023. Scales the LDM U-Net to 2.6B parameters, adds a second text encoder, introduces micro-conditioning on image size and crop coordinates, and ships a two-stage base-plus-refinement pipeline.
  4. Black Forest Labs: FLUX.1 family release. August 2024; the founding team includes Robin Rombach (lead author of LDM and a co-author of SDXL). bfl.ai announcement and the FLUX.1-dev model card. 12B-parameter rectified-flow transformer; one of the first widely-deployed open models to drop the DDPM-derived U-Net entirely in favour of a flow-matching transformer architecture. No peer-reviewed paper as of the access date; the model card plus the BFL announcement plus the rectified-flow paper by Liu, Gong, and Liu are the citable primary sources.

Retrieval confirmation. All four primary sources were retrieved on 2026-05-20. DDPM, LDM, and SDXL were fetched in full from arXiv abstracts and the ar5iv HTML rendering; the Flux entry combines the BFL announcement and the Hugging Face model card, since there is no Flux-specific peer-reviewed paper. Where the Flux description relies on the rectified-flow theory background, the Liu, Gong, Liu 2022 paper is the supporting citation.

Paper classification. All four entries fall under Generative model and Architecture proposal. DDPM is also Theoretical (variational bound derivation). LDM and SDXL are also Representation learning (autoencoder pre-training is load-bearing). Flux is also Inference method (rectified flow shortens the sampling chain).

One-paragraph technical abstract in the publication’s voice. This review traces how the diffusion-image-generation field moved from a pixel-space U-Net trained for 1,000-step DDPM sampling on CIFAR-10 in 2020 to a 12B-parameter rectified-flow transformer producing 1024x1024 photorealistic images at single-digit sampling steps in 2024. The four papers covered are linked structurally: LDM keeps DDPM’s loss and replaces the data space with a learned latent; SDXL keeps LDM’s autoencoder and scales the U-Net while adding micro-conditioning; Flux drops the U-Net entirely for a transformer backbone and replaces DDPM’s stochastic reverse process with deterministic rectified flow. The review reproduces each paper’s central loss equation, traces the corresponding algorithm on a small worked example, and locates the contribution within the broader score-matching and flow-matching theoretical context.

Primary research question. How do you train a deep network to sample from a complex high-dimensional image distribution, and how do you keep that pipeline tractable as resolution, conditioning, and sample quality demand all increase?

Core technical claim. Diffusion-style generative models are tractable because the forward noising process is a fixed Markov chain with a closed-form Gaussian transition, the reverse process can be learned by predicting added noise rather than predicting the data, and the entire pipeline can be made dramatically more efficient by operating in a learned latent space (LDM), scaling the denoiser (SDXL), and straightening the sampling trajectory (Flux’s rectified flow).

Core technical domains and depth labels. Probabilistic modelling and variational inference (deep). Score-based generative modelling and SDEs (moderate to deep). Convolutional and transformer architectures for image generation (deep). Classifier-free guidance and conditioning (moderate). Flow-matching and rectified flow (moderate). Autoencoder pre-training (surface to moderate).

Reader prerequisites. High-school algebra. Some familiarity with neural-network basics helps but is not required — every term used in the body is in the Glossary in Section 2.5. Probability at the level of “what a Gaussian is” is assumed; the rest of the probability machinery (KL divergence, variational bound) is explained inline.

Section 2 — TL;DR and executive overview

3-sentence TL;DR. Diffusion models generate images by learning to reverse a noising process: you take a clean image, gradually add Gaussian noise until it becomes pure static, then train a neural network to undo the noise one small step at a time. Four papers built this field: DDPM (2020) showed the training trick that made the loss simple and the samples high-quality, Latent Diffusion (2021) moved the whole pipeline into a compressed latent space so it could run on a single GPU, SDXL (2023) scaled the model and added conditioning on image size and cropping to fix artefacts, and Flux (2024) replaced the U-Net with a 12-billion-parameter transformer and the stochastic sampler with deterministic rectified flow. Each paper is a self-contained technical contribution; together they explain why a $1,000 consumer GPU in 2024 can produce 1024x1024 photorealistic images that a 2020 supercomputer could not.

Executive summary. The diffusion-image-generation lineage condenses cleanly into four canonical reference papers. DDPM in 2020 fixed the noise-prediction parameterisation $\epsilon_\theta(x_t, t)$ as the standard training objective and the U-Net as the standard architecture. Latent Diffusion in 2021 moved the diffusion process into the latent space of a pre-trained autoencoder, cutting training and inference cost by an order of magnitude and making text-conditioning via cross-attention practical. SDXL in 2023 scaled the U-Net to 2.6B parameters, added a second CLIP text encoder, and introduced micro-conditioning on image size and crop coordinates to fix the long-standing “headless cat” artefact from random-crop training. Flux in 2024 replaced the U-Net with a 12B-parameter transformer and the DDPM-style stochastic reverse process with rectified flow. The progression is the canonical reading list for anyone trying to understand what generative image models do under the hood.

Five practitioner-relevant takeaways.

  • The DDPM noise-prediction loss $L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$ is still the training loss for SDXL and, up to the rectified-flow velocity-field reparameterisation, conceptually the loss for Flux. Implementations of any diffusion model start here.
  • The autoencoder downsampling factor $f$ in latent diffusion is the single most important hyperparameter for the speed-quality trade-off. Stable Diffusion 1.x and 2.x use $f=8$; the LDM paper found $f \in \{4, 8\}$ is the sweet spot.
  • SDXL’s micro-conditioning trick (telling the model the original image size and crop coordinates at training time) recovers ~39% of training data that earlier filters discarded, and is the cleanest example of “free” data-efficiency improvements available in this lineage.
  • The Flux-style transformer-plus-rectified-flow stack is the practical successor to LDM-style U-Net architectures. Stable Diffusion 3 and Flux both moved to this stack independently and concurrently in 2024.
  • Sampling steps dropped from 1000 (DDPM) to 50 (LDM/SDXL with DDIM) to ~4 (Flux schnell, the distilled few-step variant) over four years. Most of the wall-clock speedup in this lineage came from sampler improvements, not model architecture.

Pipeline overview. All four papers share the same training-time outline: take a clean data point $x_0$ (image for DDPM, latent code $z_0 = E(x_0)$ for LDM/SDXL/Flux), sample a timestep $t$, sample noise $\epsilon$, form the noised input, ask the network to predict the noise (DDPM, LDM, SDXL) or the velocity field (Flux), and minimise squared error. At inference time, all four start from pure noise and iteratively denoise; LDM/SDXL/Flux decode the final latent back to a pixel image via the autoencoder decoder.

Section 2.5 — Glossary

| Term | Plain-English explanation | First appears in |
| --- | --- | --- |
| Forward process | A fixed, non-learned recipe for gradually adding noise to a clean image until it becomes pure Gaussian noise. The opposite of generation. | Section 3 |
| Reverse process | The learned chain of denoising steps the network performs at sampling time; the actual image generator. | Section 3 |
| Timestep $t$ | An integer (DDPM: $t \in \{1, \dots, 1000\}$) or a continuous value (rectified flow: $t \in [0, 1]$) indexing how noisy the current state is. In the DDPM convention $t=0$ is clean data and $t=T$ is pure noise; the rectified-flow formulation in Section 5.4 runs the other way, with $t=0$ noise and $t=1$ data. | Section 3 |
| Noise schedule | The sequence of variances $\beta_1, \dots, \beta_T$ controlling how much noise is added at each forward step. DDPM uses a linear schedule from $10^{-4}$ to $0.02$. | Section 3 |
| U-Net | A specific convolutional neural-network architecture with a symmetric encoder-decoder shape and skip connections. The default DDPM/LDM/SDXL denoiser backbone. | Section 5 |
| Latent space | The compressed representation produced by an autoencoder’s encoder. A 512x512 RGB image (786,432 numbers) compresses into a 64x64x4 latent (16,384 numbers) at $f=8$, which is 48 times smaller. | Section 5 |
| Cross-attention | A neural-network operation that lets the U-Net look at conditioning information (text embeddings) at every spatial location of every feature map. The key mechanism enabling text-to-image conditioning. | Section 5 |
| Variational bound (ELBO) | A surrogate loss derived from a probabilistic model; minimising the bound minimises a related divergence. DDPM derives one but trains a simpler loss in practice. | Section 6 |
| KL divergence | A measure of how different two probability distributions are; zero when they are identical. | Section 6 |
| Score function | The gradient of the log-probability density, $\nabla_x \log p(x)$. Score-matching trains a network to predict this. | Section 6 |
| Classifier-free guidance | A sampling trick that runs the network twice (once with the conditioning, once without) and extrapolates between the two predictions to amplify the conditioning’s effect. | Section 7 |
| Rectified flow | A flow-matching training objective that learns a velocity field $v_\theta(x_t, t)$ pointing from noise to data along nearly-straight paths, enabling few-step deterministic sampling. | Section 6 |
| ELBO | Evidence lower bound. A lower bound on the log-likelihood used as a training objective. | Section 6 |
| FID | Fréchet Inception Distance. A standard image-quality metric; lower is better. | Section 9 |
| Inception Score (IS) | An older image-quality metric correlating with sample diversity and class confidence. Higher is better. | Section 9 |
| [Analysis] label | The publication’s own reasoned assessment, distinct from what the paper itself claims. | Throughout |
| [Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the paper proves. | Sections 11, 12 |
| [Reconstructed] label | Content the publication faithfully reconstructed from partial disclosure. | Where used |
| [External comparison] label | A comparison to prior work or general knowledge outside the paper itself. | Sections 4, 11 |
| “From the paper:” prefix | Content directly supported by the paper’s text, equations, tables, or figures. | Throughout |

Section 3 — Problem formalisation

Notation table (common to all four papers, harmonised here; original papers use slightly different conventions which are flagged where they diverge).

| Symbol | Type | Meaning | First appears in |
| --- | --- | --- | --- |
| $x_0$ | tensor in $\mathbb{R}^{H \times W \times 3}$ | A clean data sample, typically an RGB image. | Section 3 |
| $x_t$ | tensor of same shape | The data sample after $t$ forward noising steps. | Section 3 |
| $t$ | integer or real | The current timestep. DDPM: $t \in \{1, \dots, T\}$, $T=1000$. Flow: $t \in [0, 1]$. | Section 3 |
| $T$ | integer | Number of diffusion steps. DDPM: 1000. LDM/SDXL: 1000 train, 50 inference (DDIM). Flux schnell: 4 inference. | Section 3 |
| $\beta_t$ | real in $(0, 1)$ | Noise variance schedule at step $t$. Linear from $10^{-4}$ to $0.02$ in DDPM. | Section 3 |
| $\alpha_t = 1 - \beta_t$ | real | Per-step signal-retention coefficient. | Section 3 |
| $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ | real | Cumulative product; used for the closed-form noised state. | Section 3 |
| $\epsilon$ | tensor, same shape as $x_0$ | A standard-Gaussian noise sample. | Section 3 |
| $\epsilon_\theta(x_t, t)$ | tensor, same shape as $x_t$ | The neural network’s noise prediction. | Section 3 |
| $q(\cdot)$ | distribution | The fixed forward process. | Section 3 |
| $p_\theta(\cdot)$ | distribution | The learned reverse process. | Section 3 |
| $z_t$ | tensor in $\mathbb{R}^{H/f \times W/f \times c}$ | The latent counterpart of $x_t$ in LDM/SDXL/Flux. | Section 5 |
| $E(\cdot), D(\cdot)$ | encoder, decoder | The pre-trained autoencoder used by LDM/SDXL/Flux. | Section 5 |
| $\tau_\theta(y)$ | tensor | The conditioning encoder output (text embedding) used by LDM cross-attention. | Section 5 |
| $v_\theta(x_t, t)$ | tensor | Flux’s learned velocity field. | Section 6 |
| $f$ | integer | Autoencoder downsampling factor; LDM evaluates $f \in \{1, 2, 4, 8, 16, 32\}$. | Section 5 |

Formal problem statement. Given a dataset $\mathcal{D}$ of clean images $x_0 \sim p_{\text{data}}$ and (for text-conditional models) paired text captions $y$, learn a sampler that draws fresh samples from $p_{\text{data}}$ (or from $p_{\text{data}}(\cdot \mid y)$ conditional on a caption $y$).

The diffusion formulation. All four papers approach this by constructing a forward Markov chain that destroys structure in $x_0$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \mathbf{I})$$

This chain has a useful property: the distribution of $x_t$ given $x_0$ at any timestep $t$ is also Gaussian, in closed form:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \mathbf{I})$$

which means $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$ for $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. From the paper (DDPM Equation 4): this lets you sample any noised state $x_t$ directly without simulating the $t-1$ intermediate steps.
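[Analysis] The closed form above can be checked numerically. The following is a minimal NumPy sketch, not code from any of the four papers: the function names `linear_beta_schedule` and `q_sample` and the 32x32x3 toy array are illustrative assumptions, while the schedule endpoints and $T=1000$ are the DDPM values quoted in this section.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_1..beta_T (DDPM values quoted in the text)."""
    return np.linspace(beta_1, beta_T, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # cumulative product of alpha_s

# Step-by-step variance recursion of the Markov chain (for x_0 held fixed):
# var_t = alpha_t * var_{t-1} + beta_t. It should land on 1 - alpha_bar_T.
var = 0.0
for b in betas:
    var = (1.0 - b) * var + b

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(32, 32, 3))   # toy stand-in for an image in [-1, 1]
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)    # partially noised
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)     # almost pure noise
```

At $t=T$ the signal coefficient $\sqrt{\bar{\alpha}_T}$ is close to zero under this schedule, which is why sampling can start from a standard Gaussian.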

Explicit assumption list.

  • The data distribution $p_{\text{data}}$ has a density (it is not concentrated on a measure-zero set in $\mathbb{R}^d$). [Analysis] Standard but worth surfacing, and true for natural images at typical resolutions.
  • The noise schedule $\beta_t$ is small enough that the reverse process is also approximately Gaussian. DDPM uses $\beta_T = 0.02$, well within the regime where this holds. [Analysis] Potentially limiting at very high noise levels; flow-based methods do not need this assumption.
  • The U-Net (or transformer) has sufficient capacity to approximate the reverse-process mean over all timesteps from a single set of weights. [Analysis] Empirically validated; theoretically not characterised in any of the four papers.
  • For LDM/SDXL/Flux: the autoencoder’s latent space is “diffusable”: its statistics are smooth enough that a diffusion model can be trained on it. From the paper (LDM Section 3.1): KL- or VQ-regularisation is required to keep latent variances bounded.

Why the problem is hard. A direct attempt to learn $p_\theta(x_0)$ via maximum likelihood on a high-dimensional image distribution runs into well-documented failure modes. Per Sohl-Dickstein et al. 2015 and Ho et al. 2020, autoregressive and normalising-flow approaches scale poorly with resolution because of strict architectural constraints (e.g. invertibility) or sequential generation. VAEs trained directly on pixels produce blurry samples. GANs sidestep likelihood entirely but suffer from mode collapse and training instability. The diffusion framework converts a hard one-shot generation problem into a sequence of small, well-conditioned regression problems (denoising at a fixed noise level) that a U-Net or transformer can solve effectively.

Conditioning structure. For text-to-image generation (LDM onward), the problem becomes learning $p_\theta(x_0 \mid y)$ where $y$ is a text caption. LDM introduces cross-attention as the conditioning mechanism; SDXL extends this to two text encoders whose outputs are concatenated; Flux conditions on T5-XXL and CLIP-L text encoders, with text and image tokens attending to each other inside the transformer blocks (Section 5.4).

Section 4 — Motivation and gap

Real-world problem. Generative image modelling needs to produce high-quality, controllable samples at resolutions readers actually use (512x512 and up). Pre-2020 attempts hit hard walls. GANs (BigGAN, StyleGAN2) produced sharp samples but trained unstably and lost mode coverage. Autoregressive models (PixelCNN, PixelSNAIL, ImageGPT) had likelihood guarantees but scaled poorly past 64x64 because of their sequential generation pattern. VAEs trained directly in pixel space produced blurry samples. The field needed a method that combined likelihood-friendly training, mode coverage comparable to a likelihood model, and sample quality competitive with adversarial methods.

Existing approaches and their failure modes.

  • GANs (Goodfellow 2014 onward). Per Ho et al.’s related-work section: high sample quality but mode collapse and unstable training. [External comparison] The 2018 BigGAN paper showed that scaling GANs to high quality required extreme hyperparameter tuning.
  • Autoregressive models. Per the DDPM introduction: PixelCNN-family models compute likelihood exactly but generate one pixel at a time, making sampling at 256x256 prohibitively slow.
  • Normalising flows. Per the DDPM related work: invertibility constraints limit expressiveness; sample quality lagged GANs.
  • VAEs trained in pixel space. Per the LDM introduction and [External comparison] the original Kingma-Welling 2014 paper: produced blurry samples because of the Gaussian decoder likelihood interacting with pixel-level reconstruction.
  • Score-matching with NCSN / Song-Ermon 2019. The conceptual ancestor of DDPM. Required annealed Langevin dynamics for sampling; sample quality was decent but not state of the art until DDPM.

What gap each paper claims to fill.

  • DDPM (2020). Sohl-Dickstein et al.’s 2015 framework existed but had not produced competitive image samples. From the paper (DDPM abstract): “We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.” DDPM achieves FID 3.17 on unconditional CIFAR-10, which the paper reports as state of the art at the time. The gap closed is “diffusion models can compete with GANs on sample quality” while keeping likelihood-based training stability.
  • LDM (2021). Pixel-space diffusion is computationally prohibitive at 512x512 and above — training a pixel-space DDPM at 256x256 takes hundreds of GPU-days. From the paper (LDM introduction): training and inference cost makes the field inaccessible outside large industrial labs. LDM closes this gap by moving diffusion into latent space.
  • SDXL (2023). Stable Diffusion 1.5 (the productionised LDM) produces visible artefacts: cropped subjects, “headless cat” outputs, distorted faces. From the paper (SDXL Section 2.2): random-crop training and rejection of small images both contribute to these artefacts. SDXL closes the gap by adding explicit conditioning on the image size and crop coordinates the training pipeline applied.
  • Flux (2024). SDXL’s U-Net is still the architectural template inherited from DDPM; sampling still takes 25-50 steps. The Flux release argues (from the BFL announcement) that combining a transformer backbone with rectified-flow training closes the gap to single-digit-step sampling and substantially improves prompt following.

Practical stakes. Image generation was the first large-scale ML application where the model architecture, training objective, and sampler were genuinely under active research simultaneously over a four-year window. [Analysis] The downstream applications (Stable Diffusion, Midjourney, Flux derivatives) created the consumer-AI category in 2022-2024; the lineage covered here is the canonical technical history.

[External comparison] Position in broader research landscape. Score-matching (Hyvärinen 2005, Vincent 2011, Song-Ermon 2019) and DDPM converged in 2020 — Song et al.’s 2021 “Score-Based Generative Modeling through Stochastic Differential Equations” paper unified them under a continuous-time SDE formulation. LDM bridged the score/diffusion framework with VAE-style representation learning. SDXL combined LDM’s architecture with classifier-free guidance (Ho-Salimans 2021) and aspect-ratio bucketing common in production text-to-image pipelines. Flux’s rectified-flow approach connects diffusion to optimal-transport (Tong et al. 2023, Lipman et al. 2023 flow matching), with the Liu-Gong-Liu 2022 rectified-flow paper as the immediate ancestor.

Section 5 — Method overview

Each paper builds on the previous. This section presents the four methods in chronological order, with explicit cross-references for what each paper inherits and what it changes.

5.1 DDPM — pixel-space diffusion with noise prediction

Plain-English intuition. Pick an image, add a tiny bit of Gaussian noise; do that 1000 times until the image is pure static. Now train a U-Net to look at a noisy image and predict the noise that was added. To generate a new image, start with random static and iteratively subtract the network’s noise predictions.

Exact mechanism. From the paper (DDPM Section 3.2):

  1. The reverse-process Gaussian has a parameterised mean and a fixed variance: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$ with $\sigma_t^2 = \beta_t$ (chosen for simplicity over the alternative $\tilde\beta_t$).
  2. The mean is reparameterised in terms of a noise-prediction network $\epsilon_\theta(x_t, t)$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)$$

  3. The training loss simplifies dramatically. The full variational bound (Section 6, MATH ENTRY 2) decomposes into a sum of KL terms; with the noise-prediction parameterisation and an unweighted version of the loss, training reduces to:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, t) \|^2 \right]$$

  4. Architecture: a U-Net with group normalisation, sinusoidal timestep embeddings, and self-attention at the 16x16 resolution. From the paper (DDPM Appendix B): the U-Net follows the design of PixelCNN++ and Wide ResNet.
  5. Hyperparameters from the paper: $T = 1000$, $\beta_t$ linear from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
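[Analysis] The loss and the reverse-step mean above translate into a few lines of NumPy. This is a hedged sketch rather than the authors’ implementation: `toy_eps_model` is a hypothetical placeholder that always predicts zero noise (a real $\epsilon_\theta$ is a trained U-Net), and `l_simple` and `reverse_step` are names chosen here for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # DDPM linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def toy_eps_model(x_t, t):
    """Hypothetical placeholder for eps_theta(x_t, t); always predicts zero noise."""
    return np.zeros_like(x_t)

def l_simple(eps_model, x0_batch, rng):
    """Monte Carlo estimate of L_simple on one batch of clean samples."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)              # uniform timesteps in {1..T}
    eps = rng.standard_normal(x0_batch.shape)
    a = alpha_bar[t - 1].reshape((n,) + (1,) * (x0_batch.ndim - 1))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    err = eps - eps_model(x_t, t)
    return np.mean(np.sum(err.reshape(n, -1) ** 2, axis=1))

def reverse_step(eps_model, x_t, t, rng):
    """One ancestral sampling step x_t -> x_{t-1}, using sigma_t^2 = beta_t."""
    b, a, ab = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
    mean = (x_t - b / np.sqrt(1.0 - ab) * eps_model(x_t, t)) / np.sqrt(a)
    # No noise is added on the final step (t = 1).
    noise = rng.standard_normal(x_t.shape) if t > 1 else np.zeros_like(x_t)
    return mean + np.sqrt(b) * noise

rng = np.random.default_rng(0)
batch = rng.uniform(-1.0, 1.0, size=(8, 16, 16, 3))   # toy "images"
loss = l_simple(toy_eps_model, batch, rng)

x = rng.standard_normal((16, 16, 3))
x_prev = reverse_step(toy_eps_model, x, 1, rng)
```

With the zero-prediction placeholder the loss is simply the expected squared norm of the noise, roughly the number of elements per sample; a trained network drives it below that baseline.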

Connection to score matching. From the paper (DDPM Section 3.2): “training our model resembles denoising score matching over multiple noise scales.” The noise-prediction parameterisation is equivalent (up to a scale factor) to predicting the score $\nabla_{x_t} \log q(x_t \mid x_0)$; see Section 6 MATH ENTRY 3.
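[Analysis] The scale factor can be made explicit from the closed-form Gaussian $q(x_t \mid x_0)$ in Section 3. The short derivation below uses only that equation; the symbol $s_\theta$ for a score network is introduced here for illustration.

```latex
\begin{aligned}
\log q(x_t \mid x_0)
  &= -\frac{\| x_t - \sqrt{\bar{\alpha}_t}\, x_0 \|^2}{2\,(1 - \bar{\alpha}_t)} + \text{const} \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t}
   = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
  \qquad \text{since } x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \\
s_\theta(x_t, t)
  &\approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
\end{aligned}
```

So a noise predictor and a score predictor differ only by the factor $-1/\sqrt{1 - \bar{\alpha}_t}$, which depends on the timestep but not on the data.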

Design rationale. From the paper (DDPM Section 3.4): the noise-prediction parameterisation performs empirically better than predicting the mean directly. The simplified unweighted loss outperforms the ELBO-weighted loss on FID even though the ELBO is the principled bound.

What breaks if removed. Without the noise parameterisation, training is much harder. Without the simplified loss, FID degrades. Without classifier-free guidance (added later in Ho-Salimans 2021, not in the original DDPM), conditional generation requires a separate classifier.

Classification. [Adapted] from Sohl-Dickstein et al. 2015 (the non-equilibrium-thermodynamics framework). [New] the noise-prediction parameterisation and the simplified loss formulation. [Adopted] U-Net architecture from prior work.

5.2 LDM — diffusion in latent space with cross-attention conditioning

Plain-English intuition. Pixel-space diffusion is expensive because the model has to denoise high-dimensional images. LDM trains a separate autoencoder first that compresses a $512 \times 512 \times 3$ image into a much smaller latent code (e.g. $64 \times 64 \times 4$ at downsampling factor $f=8$). Then the diffusion model runs entirely in this compressed latent space. For text conditioning, the model uses cross-attention to let the U-Net look at text embeddings at every layer.
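[Analysis] The size reduction is simple arithmetic. The helper names below (`latent_shape`, `compression_ratio`) are illustrative, and the default of 4 latent channels is the example value used in this review rather than a fixed property of every LDM.

```python
def latent_shape(height, width, f=8, channels=4):
    """Shape of the latent for an image of the given size at downsampling factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, channels)

def compression_ratio(height, width, f=8, channels=4):
    """Number of RGB pixel values divided by number of latent values."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * 3) / (h * w * c)

shape = latent_shape(512, 512)          # (64, 64, 4)
ratio = compression_ratio(512, 512)     # 786432 / 16384 = 48.0
```

The denoiser therefore processes 48 times fewer values per sample at $f=8$, which is where most of the compute saving comes from.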

Exact mechanism. From the paper (LDM Section 3):

  1. Stage 1: perceptual compression autoencoder. Train an encoder $E$ and decoder $D$ such that $D(E(x)) \approx x$. Loss is a combination of pixel reconstruction, perceptual loss (LPIPS), patch-based adversarial loss, and either KL-regularisation (to keep latent variances bounded) or VQ-regularisation. From the paper: this autoencoder is trained once and frozen.
  2. Stage 2: latent diffusion. Replace $x_t$ in DDPM with $z_t$, the latent counterpart. The training loss becomes:

$$L_{\text{LDM}} = \mathbb{E}_{E(x), \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \|_2^2 \right]$$

where $z_t = \sqrt{\bar{\alpha}_t} \, E(x) + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$ and $\tau_\theta(y)$ is the encoded conditioning input.

  3. Cross-attention conditioning. From the paper (LDM Section 3.3): at every U-Net layer, intermediate features attend to the conditioning encoder’s output via standard scaled dot-product attention. The $Q$ matrix derives from latent features, $K$ and $V$ from the conditioning encoder.

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d}} \right) V$$

  4. Downsampling factor evaluation. The paper evaluates $f \in \{1, 2, 4, 8, 16, 32\}$ and finds $f \in \{4, 8\}$ gives the best quality-efficiency trade-off. Stable Diffusion 1.x and 2.x productionised $f=8$.
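[Analysis] The cross-attention step in the list above can be sketched in NumPy as a single-head layer. All dimensions here (64 spatial positions, 77 conditioning tokens, projection width 16) and the random projection matrices are illustrative assumptions; a real LDM uses learned multi-head projections inside each U-Net block.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_feats, cond_tokens, W_q, W_k, W_v):
    """Single-head cross-attention.

    latent_feats: (N, d_latent) flattened spatial positions of a feature map.
    cond_tokens:  (M, d_cond) output of the conditioning encoder.
    """
    Q = latent_feats @ W_q                  # queries come from the latent features
    K = cond_tokens @ W_k                   # keys come from the conditioning
    V = cond_tokens @ W_v                   # values come from the conditioning
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)    # (N, M)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, M, d_latent, d_cond, d = 64, 77, 32, 48, 16          # illustrative sizes only
feats = rng.standard_normal((N, d_latent))              # e.g. an 8x8 feature map, flattened
tokens = rng.standard_normal((M, d_cond))               # e.g. text-token embeddings
W_q = rng.standard_normal((d_latent, d))
W_k = rng.standard_normal((d_cond, d))
W_v = rng.standard_normal((d_cond, d))

out, weights = cross_attention(feats, tokens, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over conditioning tokens, so every spatial position gets its own weighted mix of the text information.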

Connection to full pipeline. LDM keeps DDPM’s training objective and U-Net structure but changes the data space. Sampling: generate latent $z_0$ via DDIM or DDPM sampler, then decode with $D(z_0)$ to get the final image.

Design rationale. From the paper (LDM Section 1): pixel-space diffusion “wastes” capacity learning to denoise semantically unimportant high-frequency detail. The autoencoder separates perceptual compression (handled by the autoencoder) from semantic compression (handled by the diffusion U-Net). Training cost drops by ~10x at the same sample quality.

What breaks if removed. Without the autoencoder, you are back to pixel-space DDPM with all its compute cost. Without cross-attention conditioning, text-to-image becomes hard — LDM specifically demonstrates that cross-attention generalises across conditioning modalities (text, semantic maps, bounding boxes).

Classification. [Adopted] DDPM loss, U-Net architecture. [New] latent-space diffusion, cross-attention conditioning at every U-Net layer. [Adapted] autoencoder design from VQ-GAN / VQ-VAE prior work.

5.3 SDXL — scaling LDM with micro-conditioning and refinement

Plain-English intuition. SDXL is “LDM but bigger and with three new conditioning tricks.” The U-Net has 2.6 billion parameters instead of 0.9 billion. There are two text encoders concatenated instead of one. And the model is told at training time what the original image size was, where it got cropped, and what aspect ratio bucket it came from — which fixes a long-standing class of artefacts.

Exact mechanism. From the paper (SDXL Sections 2.1-2.5):

  1. U-Net scaling. From the paper: SDXL’s U-Net has 2.6B parameters, roughly 3x larger than SD 1.5’s 860M. Most of the increase comes from more transformer blocks at lower-resolution feature-map levels.
  2. Dual text encoders. OpenCLIP ViT-bigG/14 (the larger encoder, 694M params) concatenated with CLIP ViT-L/14 (123M params). From the paper (SDXL Section 2.1): “we concatenate the penultimate text encoder outputs along the channel axis.” Total text-encoder size ~817M params.
  3. Size conditioning $c_{\text{size}}$. From the paper (SDXL Section 2.2): the model receives the original image’s height and width as Fourier-embedded conditioning. This means the model can train on small images (which would normally be discarded) and learn to generate at any target resolution.
  4. Crop conditioning $c_{\text{crop}}$. From the paper: random-crop training tells the model the top-left crop coordinates $(c_{\text{top}}, c_{\text{left}})$ as Fourier-embedded conditioning. At inference time, setting $(0, 0)$ tells the model to generate as if the subject were centred.
  5. Multi-aspect-ratio training. From the paper (SDXL Section 2.3): images are bucketed into ~10 aspect-ratio bins around 1024² pixel count (e.g. 1024x1024, 1152x896, 896x1152). Training mixes buckets.
  6. Refinement model. From the paper (SDXL Section 2.5): a second smaller diffusion model trained specifically on the high-signal-to-noise-ratio end of the schedule. At inference, the base model produces a latent at an intermediate noise level, then the refinement model continues denoising from there, following the SDEdit approach.
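[Analysis] The size and crop micro-conditioning in the list above can be sketched as follows. The paper states that each integer is Fourier-embedded; the embedding width of 256, the `max_period` constant, and the function names are assumptions made here for illustration, following the common sinusoidal timestep-embedding recipe.

```python
import numpy as np

def fourier_embed(value, dim=256, max_period=10000.0):
    """Sinusoidal embedding of one scalar (dim and max_period are assumed values)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = float(value) * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)])

def micro_conditioning(orig_size, crop_top_left, dim=256):
    """Concatenate embeddings of (height, width) and (c_top, c_left)."""
    values = (*orig_size, *crop_top_left)
    return np.concatenate([fourier_embed(v, dim) for v in values])

# Training time: pass the values the data pipeline actually applied.
c_train = micro_conditioning(orig_size=(512, 384), crop_top_left=(64, 0))

# Inference time: request an uncropped, full-resolution image.
c_infer = micro_conditioning(orig_size=(1024, 1024), crop_top_left=(0, 0))
```

Because the conditioning is just an extra vector, changing it at inference costs nothing, which is what makes the `(0, 0)` crop setting a cheap fix for cut-off subjects.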

Training schedule. From the paper: 600,000 steps at 256x256, 200,000 steps at 512x512, then multi-aspect training near 1024x1024. Batch size 2048 at the 256x256 stage.

Classification. [Adopted] LDM autoencoder + cross-attention + U-Net structure. [New] size and crop micro-conditioning. [Adapted] multi-aspect bucketing (production technique formalised) and refinement model (SDEdit-style technique applied at scale).

5.4 Flux — rectified-flow transformer

Plain-English intuition. Flux replaces two pieces of the previous lineage simultaneously. First, the U-Net is replaced with a 12-billion-parameter transformer architecture (similar in spirit to DiT). Second, the DDPM-style stochastic noising-and-denoising process is replaced with rectified flow, which learns a deterministic velocity field $v_\theta(x_t, t)$ pointing from noise to data along nearly-straight paths. Together, these allow much faster sampling (4 steps for FLUX.1 schnell) and much better prompt following.

Exact mechanism. From the BFL announcement and the FLUX.1-dev model card, supplemented by the rectified-flow theory in Liu, Gong, Liu 2022:

  1. Architecture. A hybrid transformer combining “multimodal and parallel diffusion transformer blocks” scaled to 12B parameters. Per the BFL announcement: a two-stage block structure — double-stream blocks process image tokens and text tokens in parallel streams, then single-stream blocks process the concatenated sequence. Rotary positional embeddings are used.
  2. Text encoders. T5-XXL (≈4.7B params) and CLIP-L. T5 carries most of the prompt-following burden; CLIP supplies a coarser global alignment signal. [Reconstructed] from the model card and community-replicated implementation details; not all hyperparameters are formally documented.
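  3. Latent space. Flux operates in the latent space of an autoencoder, inherited in spirit from LDM. The model card does not publish the autoencoder’s exact downsampling factor; community measurements (and the diffusers FluxPipeline implementation) suggest an 8x spatial downsampling into 16 latent channels, followed by 2x2 patch packing before the transformer, for an effective $f = 16$ (a 64x64 token grid at 1024x1024).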
  4. Training objective. Rectified-flow training. The network learns a velocity field $v_\theta(x_t, t)$ such that the ODE $\frac{d x_t}{d t} = v_\theta(x_t, t)$ transports noise samples ($t=0$) to data samples ($t=1$). Training minimises:

$$L_{\text{RF}} = \mathbb{E}_{t, x_0, x_1} \left[ \| (x_1 - x_0) - v_\theta(x_t, t) \|^2 \right]$$

with $x_t = (1-t) x_0 + t x_1$, where $x_0$ is noise and $x_1$ is data (or in latent-flow variants, $z_0$ noise and $z_1$ latent data). See MATH ENTRY 4 in Section 6.

  5. Guidance distillation. From the FLUX.1-dev model card: “FLUX.1 [dev] was trained using guidance distillation, making it more efficient.” This bakes the classifier-free-guidance step into the model, halving inference-time forward passes.
  6. Variants. FLUX.1 [pro] (closed, API only). FLUX.1 [dev] (open weights, non-commercial license, 50 sampling steps recommended). FLUX.1 [schnell] (Apache 2.0, 4 sampling steps, distilled from dev).

Classification. [Adopted] latent-space operation from LDM. [Adapted] cross-attention conditioning into transformer cross-attention. [New for this lineage] rectified-flow training objective (the technique itself is from Liu-Gong-Liu 2022). [New] 12B-parameter transformer backbone replacing the U-Net.

Section 6 — Mathematical contributions

MATH ENTRY 1: DDPM’s forward and reverse processes (closed-form noising).

  • Source: DDPM Section 2, Equations 1-4.
  • What it is: Gaussian transitions that destroy structure in $x_0$ over $T$ small steps, with a closed-form expression for the noised state at any intermediate timestep.
  • Formal definition:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t \mathbf{I})$$

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \, x_0, (1 - \bar{\alpha}_t) \mathbf{I})$$

  • Term-by-term + dimensional/type analysis.
    • $x_t \in \mathbb{R}^{H \times W \times 3}$: an image-sized tensor.
    • $\beta_t \in (0, 1)$: scalar noise variance at step $t$. DDPM uses $\beta_t \in [10^{-4}, 0.02]$.
    • $\alpha_t = 1 - \beta_t \in [0.98, 0.9999]$: per-step signal coefficient.
    • $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$: cumulative product. At $t = T = 1000$ with the linear schedule, $\bar{\alpha}_T \approx 4 \times 10^{-5}$, so $x_T$ is dominated by noise.
    • $\mathbf{I}$ is the identity matrix of dimension $H \cdot W \cdot 3$.
  • Worked numerical example. Take a tiny case: a 4-pixel grayscale image $x_0 = [0.6, -0.2, 0.4, 0.1]$ (after normalisation to $[-1, 1]$). Use $T = 4$ and a coarse schedule $\beta = [0.1, 0.2, 0.3, 0.4]$. Then $\alpha = [0.9, 0.8, 0.7, 0.6]$, and $\bar\alpha_1 = 0.9$, $\bar\alpha_2 = 0.72$, $\bar\alpha_3 = 0.504$, $\bar\alpha_4 = 0.3024$. Drawing $\epsilon = [0.5, -0.7, 0.2, 1.1]$ once and reusing it at each timestep:
    • At $t=1$: $x_1 = \sqrt{0.9} \cdot x_0 + \sqrt{0.1} \cdot \epsilon = 0.949 \cdot x_0 + 0.316 \cdot \epsilon = [0.727, -0.411, 0.443, 0.443]$.
    • At $t=4$: $x_4 = \sqrt{0.3024} \cdot x_0 + \sqrt{0.6976} \cdot \epsilon = 0.550 \cdot x_0 + 0.835 \cdot \epsilon = [0.748, -0.694, 0.387, 0.974]$.
    • The signal coefficient ($0.550$) is now smaller than the noise coefficient ($0.835$). At $t = T$ in the real schedule, $\sqrt{\bar\alpha_T} \approx 0.006$ and the signal is negligible.
  • Role. The closed-form $q(x_t \mid x_0)$ is what makes training tractable. Without it, you would need to simulate $t$ sequential noising steps to get a training sample at step $t$.
  • Edge cases. $\bar\alpha_0 = 1$ recovers $x_0$. As $T \to \infty$ with the schedule, $\bar\alpha_T \to 0$ and the marginal approaches $\mathcal{N}(0, \mathbf{I})$.
  • Novelty. [Adapted] from Sohl-Dickstein et al. 2015 (the chain structure) with the closed-form rewriting and explicit cumulative-product notation.
  • Why it matters. Every paper in this lineage trains by sampling $t$, sampling $\epsilon$, and computing $x_t$ in closed form. This single equation underpins the whole field.
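The toy schedule above can be checked in a few lines. This is a minimal sketch in plain Python, assuming the same 4-step schedule and a fixed (not sampled) noise vector so the arithmetic stays reproducible by hand. The helper names `cumulative_alpha_bar` and `q_sample` are illustrative and do not come from the DDPM codebase.

```python
import math

def cumulative_alpha_bar(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] for a list of per-step variances."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta          # alpha_t = 1 - beta_t
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, eps, alpha_bar_t):
    """Closed-form noising: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * e for x, e in zip(x0, eps)]

betas = [0.1, 0.2, 0.3, 0.4]           # toy schedule from the worked example
x0 = [0.6, -0.2, 0.4, 0.1]
eps = [0.5, -0.7, 0.2, 1.1]            # fixed here; a real run samples N(0, I)

alpha_bars = cumulative_alpha_bar(betas)
x1 = q_sample(x0, eps, alpha_bars[0])  # noised state at t = 1
x4 = q_sample(x0, eps, alpha_bars[3])  # noised state at t = 4, no intermediate steps needed

print("alpha_bar:", [round(a, 4) for a in alpha_bars])
print("x_1:", [round(v, 3) for v in x1])
print("x_4:", [round(v, 3) for v in x4])
```

Note that `x4` is computed directly from `x0` without visiting $t = 1, 2, 3$, which is the property the Role bullet relies on.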

MATH ENTRY 2: DDPM’s variational bound and the simplified loss.

  • Source: DDPM Section 3, Equations 5, 8, 12, 14.
  • What it is: A variational lower bound on $\log p_\theta(x_0)$ derived from the standard ELBO trick, written below as its negative, $L_{\text{VB}}$, which is minimised. After reparameterising in terms of noise prediction and dropping the timestep weights, the loss collapses to a simple mean-squared error.
  • Formal definition. The full variational bound:

$$L_{\text{VB}} = \mathbb{E}_q \left[ D_{\text{KL}}(q(x_T \mid x_0) \,\|\, p(x_T)) + \sum_{t > 1} D_{\text{KL}}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) - \log p_\theta(x_0 \mid x_1) \right]$$

After the noise-prediction reparameterisation (DDPM Section 3.2), each per-timestep KL term reduces to a squared-error between true noise $\epsilon$ and predicted noise $\epsilon_\theta(x_t, t)$, weighted by a $t$-dependent coefficient. DDPM’s empirical finding (Equation 14): dropping the weight (using uniform weight 1) gives better samples:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, t) \|^2 \right]$$

  • Each term.
    • The first term measures how far the fully-noised $x_T$ is from a standard Gaussian. It has no learnable parameters when the $\beta_t$ are fixed, and with a suitable schedule it is essentially zero, so it is not optimised.
    • The middle sum is the per-step denoising KL. This is where the bulk of training signal comes from.
    • The last term $-\log p_\theta(x_0 \mid x_1)$ is the final pixel-level reconstruction term.
  • Proof sketch step-by-step (variational-bound derivation).
    1. Start with the variational lower bound on $\log p_\theta(x_0)$: $\log p_\theta(x_0) \geq \mathbb{E}_q[\log p_\theta(x_{0:T}) - \log q(x_{1:T} \mid x_0)]$. This is the standard VAE / ELBO inequality from Jensen.
    2. Factorise both the forward chain and the reverse chain: $p_\theta(x_{0:T}) = p(x_T) \prod_t p_\theta(x_{t-1} \mid x_t)$ and $q(x_{1:T} \mid x_0) = \prod_t q(x_t \mid x_{t-1})$.
    3. Use Bayes’ rule on the forward chain (DDPM Equation 6) to express $q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t, x_0) \, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$. The conditional posterior $q(x_{t-1} \mid x_t, x_0)$ has a closed-form Gaussian (DDPM Equations 6, 7).
    4. After algebraic rearrangement, the bound becomes the sum of KL divergences shown above.
    5. With the noise parameterisation $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \, \epsilon_\theta\right)$, the KL between two Gaussians with the same variance reduces to a squared-error between means. After substitution and simplification, each per-step term becomes $\lambda_t \, \mathbb{E}[\| \epsilon - \epsilon_\theta(x_t, t) \|^2]$ for a $t$-dependent weight $\lambda_t$.
    6. DDPM Section 3.4 sets $\lambda_t = 1$ and shows empirically this gives better FID than the principled weighting.
  • Worked numerical example. With $T=4$ from MATH ENTRY 1: at $t=2$, $\sqrt{\bar\alpha_2} = 0.849$, $\sqrt{1 - \bar\alpha_2} = 0.529$. If the network outputs $\hat\epsilon = [0.4, -0.6, 0.3, 1.0]$ and the true noise was $\epsilon = [0.5, -0.7, 0.2, 1.1]$, the per-sample loss is $\frac{1}{4}\| \epsilon - \hat\epsilon \|^2 = \frac{1}{4}(0.01 + 0.01 + 0.01 + 0.01) = 0.01$. Averaged over a batch and over $t$, this is the gradient signal.
  • Role. This is the training loss for DDPM, LDM, and SDXL with only minor changes (LDM/SDXL operate on latents; LDM/SDXL add a conditioning input to $\epsilon_\theta$).
  • Novelty. [Adapted] the variational decomposition from Sohl-Dickstein et al. 2015. [New] the noise-prediction reparameterisation and the empirical observation that the unweighted loss is better.
  • Why it matters. This loss is what made diffusion models competitive on sample quality. The principled ELBO weighting places most of its weight on small-$t$, nearly-clean timesteps, where the denoising task is easy. Dropping the weight (DDPM Section 3.4) relatively down-weights those terms, so the network spends more of its capacity on the harder, noisier timesteps.
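To make the weighting discussion concrete, the sketch below computes the per-timestep ELBO coefficient for DDPM’s linear schedule and the toy $L_{\text{simple}}$ value from the worked example. It assumes the coefficient takes the form $\lambda_t = \beta_t^2 / (2 \sigma_t^2 \alpha_t (1 - \bar\alpha_t))$ with $\sigma_t^2 = \beta_t$, which is this review’s reading of DDPM Equation 12. The function names are illustrative.

```python
def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule from beta_1 to beta_T over T steps."""
    step = (beta_T - beta_1) / (T - 1)
    return [beta_1 + step * i for i in range(T)]

def elbo_weights(betas):
    """Per-step weight lambda_t, assuming sigma_t^2 = beta_t (so one beta_t cancels)."""
    weights, alpha_bar = [], 1.0
    for beta in betas:
        alpha = 1.0 - beta
        alpha_bar *= alpha
        weights.append(beta / (2.0 * alpha * (1.0 - alpha_bar)))
    return weights

def simple_loss(eps, eps_hat):
    """Unweighted noise-prediction loss, averaged over the vector components."""
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)

weights = elbo_weights(linear_betas())
toy_loss = simple_loss([0.5, -0.7, 0.2, 1.1], [0.4, -0.6, 0.3, 1.0])

print(f"lambda at t=1:    {weights[0]:.4f}")
print(f"lambda at t=1000: {weights[-1]:.4f}")
print(f"toy L_simple:     {toy_loss:.4f}")
```

Comparing the first and last printed coefficients shows how uneven the principled weighting is across the chain; $L_{\text{simple}}$ replaces all of them with 1.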

MATH ENTRY 3: Connection between noise prediction and score matching.

  • Source: DDPM Section 3.2; Song-Ermon 2019; Song et al. 2021 SDE paper.
  • What it is: The DDPM noise-prediction network $\epsilon_\theta$ is, up to a scale factor, predicting the score $\nabla_{x_t} \log q(x_t \mid x_0)$.
  • Formal definition. From $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} \, x_0, (1 - \bar\alpha_t) \mathbf{I})$, the score is:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t} \, x_0}{1 - \bar\alpha_t} = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}$$

So $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t} \, s_\theta(x_t, t)$ where $s_\theta$ is the score network from Song-Ermon NCSN.

  • Worked numerical example. Continuing the $t=2$ case from MATH ENTRY 2: $\sqrt{1 - \bar\alpha_2} = 0.529$, $\epsilon = [0.5, -0.7, 0.2, 1.1]$. The implied score is $-\epsilon / 0.529 = [-0.945, 1.323, -0.378, -2.079]$. The score points away from the noised state toward the data manifold, which is exactly what an annealed Langevin sampler uses.
  • Why it matters. This connection is what Song et al.’s 2021 SDE paper formalises. It unifies the DDPM lineage with the score-matching lineage and lets researchers use Langevin / Euler-Maruyama / probability-flow ODE samplers interchangeably with the DDPM ancestral sampler.
  • Novelty. [Adapted] the score-matching equivalence is implicit in Sohl-Dickstein et al. 2015 but the explicit DDPM-to-score-matching mapping is the DDPM paper’s contribution.
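The noise-to-score conversion is a single rescaling, which the following sketch spells out for the $t = 2$ toy case. It is an illustration of the identity above, not code from either the DDPM or NCSN repositories, and the two function names are assumptions.

```python
import math

def score_from_noise(eps, alpha_bar_t):
    """score = -eps / sqrt(1 - alpha_bar_t)."""
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [-e / scale for e in eps]

def noise_from_score(score, alpha_bar_t):
    """Inverse map: eps = -sqrt(1 - alpha_bar_t) * score."""
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [-scale * s for s in score]

alpha_bar_2 = 0.9 * 0.8                # toy schedule, t = 2
eps = [0.5, -0.7, 0.2, 1.1]

score = score_from_noise(eps, alpha_bar_2)
recovered = noise_from_score(score, alpha_bar_2)

print([round(s, 3) for s in score])    # [-0.945, 1.323, -0.378, -2.079]
print([round(e, 3) for e in recovered])
```

Because the map is invertible at every timestep, a trained $\epsilon_\theta$ can be dropped into a score-based sampler by applying `score_from_noise` to its output.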

MATH ENTRY 4: Rectified flow (Flux’s training objective).

  • Source: Liu, Gong, Liu 2022 (arXiv:2209.03003), the rectified-flow paper. Flux uses this objective per the BFL announcement and the FLUX.1-dev model card.
  • What it is: A flow-matching training objective that learns a deterministic velocity field $v_\theta(x_t, t)$ pointing from noise to data along nearly-straight paths in $\mathbb{R}^d$.
  • Formal definition. Define a straight-line interpolation between a noise sample $x_0 \sim \mathcal{N}(0, \mathbf{I})$ and a data sample $x_1 \sim p_{\text{data}}$:

$$x_t = (1 - t) x_0 + t \, x_1, \qquad t \in [0, 1]$$

The exact velocity along this path is $\dot{x}_t = x_1 - x_0$, constant in $t$. The training loss matches a learned velocity to this target:

$$L_{\text{RF}} = \mathbb{E}_{t \sim U[0,1], \, x_0, \, x_1} \left[ \| (x_1 - x_0) - v_\theta(x_t, t) \|^2 \right]$$

  • Each term.
    • $x_0 \in \mathbb{R}^d$: noise sample (or latent noise for Flux).
    • $x_1 \in \mathbb{R}^d$: data sample.
    • $t \in [0, 1]$: continuous interpolation parameter.
    • $v_\theta(x_t, t) \in \mathbb{R}^d$: the velocity-field network output.
  • Worked numerical example. Take $x_0 = [0.0, 0.0]$ (noise) and $x_1 = [1.0, 2.0]$ (a 2D data point). Then $x_t = (t, 2t)$ for $t \in [0, 1]$. The true velocity is $(1, 2)$ everywhere along the path. If the network outputs $v_\theta(x_t = (0.3, 0.6), t = 0.3) = (0.9, 1.8)$, the loss is $\| (1, 2) - (0.9, 1.8) \|^2 = 0.01 + 0.04 = 0.05$.
  • Why “nearly-straight.” With many data points and many noise samples, the straight lines connecting random pairs cross — the true marginal velocity field has curvature. The rectified-flow paper proves that running an additional “reflow” step (training a new model on samples generated by the first one) progressively straightens the trajectories.
  • Sampling. Once $v_\theta$ is trained, generate by solving the ODE $\frac{dx}{dt} = v_\theta(x, t)$ from $t=0$ (noise) to $t=1$ (data). Euler integration: $x_{t + \Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t)$. With straight enough trajectories, a handful of Euler steps (4 for FLUX.1 schnell) gives high-quality samples.
  • Role. Flux’s entire training loss. Replaces DDPM’s $L_{\text{simple}}$.
  • Novelty. [Adopted] from Liu, Gong, Liu 2022. [New for Flux] the combination with a 12B-parameter transformer backbone, T5-XXL conditioning, and large-scale text-to-image training.
  • Why it matters. Straight trajectories are the conceptual change that makes few-step sampling viable: FLUX.1 schnell samples in 4 steps instead of 50, with distillation applied on top of the rectified-flow objective. The trade-off versus DDPM-style stochastic sampling is determinism (good for reproducibility, less good for diversity at very low step counts).
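The rectified-flow target and loss for the 2D toy case can be written out directly. This is a sketch of the objective as defined in this entry, using plain lists in place of tensors; `interpolate` and `rectified_flow_loss` are illustrative names, and the "network output" is a hard-coded vector rather than a real model.

```python
def interpolate(x0, x1, t):
    """Straight-line path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def rectified_flow_loss(x0, x1, v_pred):
    """Squared error between the constant target velocity (x1 - x0) and a prediction."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((tv - pv) ** 2 for tv, pv in zip(target, v_pred))

x0 = [0.0, 0.0]                        # noise sample
x1 = [1.0, 2.0]                        # data sample
x_t = interpolate(x0, x1, 0.3)         # point on the path at t = 0.3
loss = rectified_flow_loss(x0, x1, [0.9, 1.8])

print("x_t:", x_t)
print("loss:", round(loss, 4))         # 0.05
```

The target velocity does not depend on `t`; only the input point `x_t` handed to the network does.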

MATH ENTRY 5: Cross-attention conditioning (LDM/SDXL/Flux).

  • Source: LDM Section 3.3.
  • What it is: Standard scaled-dot-product attention where queries come from image latent features and keys/values come from a conditioning encoder output (e.g. CLIP text embeddings).
  • Formal definition.

$$\text{CrossAttn}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V$$

with $Q = W_Q \cdot \phi(z_t)$, $K = W_K \cdot \tau_\theta(y)$, $V = W_V \cdot \tau_\theta(y)$, where $\phi(z_t)$ is a flattened image-feature tensor and $\tau_\theta(y)$ is the text-embedding tensor.

  • Each term.
    • $\phi(z_t) \in \mathbb{R}^{N \times d}$ where $N$ is the number of spatial positions in the current feature map (e.g. $N = 64 \times 64 = 4096$ for an early SDXL layer) and $d$ is the channel dim.
    • $\tau_\theta(y) \in \mathbb{R}^{L \times d_y}$ where $L$ is the text token count (LDM uses up to 77 for CLIP; Flux uses 512 for T5-XXL) and $d_y$ is the conditioning embedding dim.
    • $W_Q, W_K, W_V$ are learned projection matrices.
  • Worked numerical example. Suppose at one U-Net layer the feature map is $8 \times 8 \times 16$ ($N = 64$, $d = 16$), the text encoder gives 3 tokens at $d_y = 16$, and $d_k = 16$. Then $Q$ is $64 \times 16$, $K$ is $3 \times 16$, $V$ is $3 \times 16$. The attention score matrix $QK^\top / \sqrt{16}$ is $64 \times 3$. Softmax over the 3 columns gives a $64 \times 3$ probability matrix; each spatial location gets a probability distribution over the 3 text tokens. Multiplying by $V$ gives $64 \times 16$ output features, the same shape as the flattened input features. This means every spatial location is conditioned on a custom mixture of the text tokens.
  • Role. This is how text “reaches” the image features in LDM, SDXL, and (with a transformer twist) Flux. Without cross-attention, the conditioning signal would have to be concatenated or added globally, which empirically gives much worse prompt following.
  • Novelty. [Adopted] from standard Transformer attention. [New for diffusion] the application as a conditioning mechanism inserted into every U-Net layer.
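The shape bookkeeping in the worked example can be reproduced with a dependency-free sketch. It uses random matrices in place of learned projections and real features, applies projections on the right (features times weight matrix, a row-vector convention chosen here for convenience), and omits multi-head splitting and the output projection that real implementations include. All helper names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(image_feats, text_feats, w_q, w_k, w_v):
    q = matmul(image_feats, w_q)       # (N, d_k): queries from image positions
    k = matmul(text_feats, w_k)        # (L, d_k): keys from text tokens
    v = matmul(text_feats, w_v)        # (L, d_k): values from text tokens
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)     # (N, L): one distribution over tokens per position
    return matmul(weights, v), weights

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

image_feats = rand_matrix(64, 16)      # flattened 8x8 feature map, 16 channels
text_feats = rand_matrix(3, 16)        # 3 text tokens, 16-dim embeddings
out, weights = cross_attention(
    image_feats, text_feats,
    rand_matrix(16, 16), rand_matrix(16, 16), rand_matrix(16, 16),
)

print("output shape:", (len(out), len(out[0])))
print("attention shape:", (len(weights), len(weights[0])))
print("row 0 sums to:", round(sum(weights[0]), 6))
```

Each of the 64 rows of `weights` sums to 1, which is the per-location mixture over text tokens described in the worked example.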

Section 7 — Algorithmic contributions

ALGORITHM ENTRY 1: DDPM training and sampling (Algorithms 1 and 2 of the paper).

  • Source: DDPM Algorithms 1 and 2.
  • Purpose: Train the noise-prediction network and sample new images.
  • Training pseudocode (DDPM Algorithm 1).
repeat:
    x_0 ~ q(x_0)                                # sample a clean image
    t ~ Uniform({1, ..., T})                    # random timestep
    epsilon ~ N(0, I)                           # random Gaussian noise
    x_t = sqrt(bar_alpha_t) * x_0
          + sqrt(1 - bar_alpha_t) * epsilon     # closed-form noising
    take a gradient step on:
        || epsilon - epsilon_theta(x_t, t) ||^2
until converged
  • Sampling pseudocode (DDPM Algorithm 2).
x_T ~ N(0, I)
for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1, else z = 0
    x_{t-1} = (1 / sqrt(alpha_t)) * (
                  x_t - ((1 - alpha_t) / sqrt(1 - bar_alpha_t)) * epsilon_theta(x_t, t)
              ) + sigma_t * z
return x_0
  • Hand-traced example. Use the toy $T=4$, $\beta = [0.1, 0.2, 0.3, 0.4]$ schedule from MATH ENTRY 1. Assume a fully-trained network that perfectly predicts noise, and start with $x_4 = [0.748, -0.694, 0.387, 0.974]$ (the noised state at $t=4$ from MATH ENTRY 1).
    • At $t=4$: the network outputs $\hat\epsilon_4 = [0.5, -0.7, 0.2, 1.1]$ (matches the true noise). $\sqrt{\alpha_4} = 0.775$, $(1 - \alpha_4)/\sqrt{1 - \bar\alpha_4} = 0.4 / 0.835 = 0.479$. The mean update: $\mu_4 = (1/0.775)(x_4 - 0.479 \cdot \hat\epsilon_4) = (1/0.775)[0.748 - 0.239, -0.694 - (-0.335), 0.387 - 0.096, 0.974 - 0.527] = (1/0.775)[0.509, -0.359, 0.291, 0.447] = [0.657, -0.463, 0.376, 0.577]$. Draw $z \sim \mathcal{N}(0, \mathbf{I})$, scale by $\sigma_4 = \sqrt{\beta_4} = 0.632$, and add to get $x_3$.
    • The chain continues until $t=1$, when $z = 0$ deterministically. Final $x_0$ should be close to the original $[0.6, -0.2, 0.4, 0.1]$ up to accumulated sampler noise.
  • Complexity. Training: $O(N \cdot |\theta|)$ for $N$ gradient steps. Sampling: $T$ forward passes through the network, $T = 1000$ for vanilla DDPM. Each forward pass is $O(|\theta|)$. Bottleneck step: the inner forward pass over a 256x256 image through the U-Net, which DDPM Appendix B reports at roughly 114M parameters for the 256x256 models (about 256M for the larger LSUN Bedroom variant).
  • Hyperparameters. From the paper: $T = 1000$, $\beta_1 = 10^{-4}$, $\beta_T = 0.02$ (linear), $\sigma_t = \sqrt{\beta_t}$, learning rate $2 \times 10^{-4}$ with EMA decay $0.9999$. Batch size 128 for CIFAR-10.
  • Failure modes. With too few sampling steps (T below 100) and the ancestral sampler, FID degrades sharply. DDIM (Song et al. 2020) was the first deterministic alternative that allowed step counts of 20-50.
  • Novelty. [New] the noise-prediction loss formulation. [Adapted] the ancestral sampling formula from Sohl-Dickstein et al. 2015.
  • Transferability. [Analysis] Anyone training a diffusion model on any modality (images, audio, video, 3D, molecules) starts from this Algorithm 1.
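The first step of the hand trace can be reproduced with a short sketch of one reverse update. It assumes $\sigma_t^2 = \beta_t$ as in the hyperparameters listed above, takes the predicted noise as an argument rather than calling a network, and uses illustrative function names.

```python
import math

def reverse_mean(x_t, eps_hat, alpha_t, alpha_bar_t):
    """Mean of p(x_{t-1} | x_t) under the noise parameterisation (Algorithm 2, without the noise term)."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]

def reverse_step(x_t, eps_hat, alpha_t, alpha_bar_t, z):
    """One ancestral sampling step; pass z as all zeros at t = 1."""
    sigma_t = math.sqrt(1.0 - alpha_t)          # sigma_t^2 = beta_t (assumed choice)
    mean = reverse_mean(x_t, eps_hat, alpha_t, alpha_bar_t)
    return [m + sigma_t * zi for m, zi in zip(mean, z)]

x4 = [0.748, -0.694, 0.387, 0.974]              # noised state at t = 4
eps_hat = [0.5, -0.7, 0.2, 1.1]                 # assumed perfect noise prediction

mu4 = reverse_mean(x4, eps_hat, alpha_t=0.6, alpha_bar_t=0.3024)
x3_no_noise = reverse_step(x4, eps_hat, 0.6, 0.3024, z=[0.0] * 4)

print([round(m, 3) for m in mu4])               # [0.657, -0.463, 0.376, 0.577]
```

Passing a sampled `z` instead of zeros gives the stochastic $x_3$ described in the trace.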

ALGORITHM ENTRY 2: LDM training and sampling.

  • Source: LDM Section 3, with autoencoder pre-training in Section 3.1.
  • Purpose: Train the latent diffusion U-Net given a frozen autoencoder.
  • Pseudocode.
# Stage 1 (one-time): train autoencoder E, D on the image dataset
# using L_recon + L_perceptual + L_adversarial + L_KL

# Stage 2: latent diffusion training
freeze E and D
repeat:
    x_0 ~ q(x_0)
    z_0 = E(x_0)                                # encode to latent
    t ~ Uniform({1, ..., T})
    epsilon ~ N(0, I_latent_shape)
    z_t = sqrt(bar_alpha_t) * z_0
          + sqrt(1 - bar_alpha_t) * epsilon
    if conditional:
        y = caption(x_0)
        c = tau_theta(y)                         # encode conditioning
    else:
        c = None
    take a gradient step on:
        || epsilon - epsilon_theta(z_t, t, c) ||^2
until converged
  • Sampling pseudocode. Same as DDPM Algorithm 2 but in latent space, with classifier-free guidance applied to the conditional case:
z_T ~ N(0, I_latent_shape)
for t = T, T-1, ..., 1:
    eps_uncond = epsilon_theta(z_t, t, None)
    eps_cond   = epsilon_theta(z_t, t, c)
    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    # ... DDPM-style update with eps_guided ...
x_0 = D(z_0)                                    # decode the final latent to pixels
  • Hand-traced example. Skipped here for brevity — the loop structure is identical to DDPM Algorithm 2, only the dimension changes (a 64x64x4 latent for SD 1.x at 512x512 instead of a 512x512x3 image). The hand-trace from ALGORITHM ENTRY 1 transfers.
  • Complexity. Training one step on a 512x512 image: encoder forward ($O(|\theta_E|)$) + diffusion forward ($O(|\theta_U|)$ on a 64x64x4 tensor) + gradient. The autoencoder is frozen so $|\theta_E|$ is not in the gradient cost. The U-Net cost drops by roughly $f^2 = 64$x (for $f = 8$) compared to a pixel-space U-Net at the same architectural depth.
  • Hyperparameters. From the paper: $f = 4$ or $f = 8$ for text-to-image; latent channel count $c \in \{3, 4, 8\}$; DDPM-style schedule with $T = 1000$; guidance scale $w \in [1.5, 7.5]$ at sampling time.
  • Novelty. [Adapted] the DDPM training and sampling loops to latent space. [New] the autoencoder + cross-attention combination.
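Two pieces of the latent-space loop are simple enough to sketch directly: the classifier-free guidance combination and the latent shape arithmetic. The noise predictions below are made-up vectors standing in for two network calls, and `guided_noise` and `latent_shape` are illustrative names, not functions from the LDM codebase.

```python
def guided_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def latent_shape(height, width, f=8, channels=4):
    """Spatial size of the latent for a downsampling factor f (height, width, channels)."""
    return (height // f, width // f, channels)

eps_uncond = [0.10, -0.20, 0.30]       # hypothetical epsilon_theta(z_t, t, None)
eps_cond = [0.30, -0.10, 0.20]         # hypothetical epsilon_theta(z_t, t, c)

print(guided_noise(eps_uncond, eps_cond, w=7.5))
print(latent_shape(512, 512))          # (64, 64, 4)
```

Setting `w = 1` returns the conditional prediction unchanged and `w = 0` returns the unconditional one; values above 1 extrapolate past the conditional prediction.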

ALGORITHM ENTRY 3: SDXL training with micro-conditioning.

  • Source: SDXL Sections 2.1-2.5.
  • Purpose: Train the scaled-up latent diffusion model with size, crop, and aspect-ratio conditioning.
  • Pseudocode (training).
freeze E, D (the SDXL autoencoder)
repeat:
    (x_0, h_orig, w_orig) ~ dataset          # image + original dimensions
    bucket = pick_aspect_ratio_bucket(h_orig, w_orig)
    x_0_resized = resize_to_bucket(x_0, bucket)
    (c_top, c_left) ~ random_crop_coords(x_0_resized, bucket)
    x_0_cropped = crop(x_0_resized, c_top, c_left, bucket)
    z_0 = E(x_0_cropped)
    t ~ Uniform({1, ..., T})
    epsilon ~ N(0, I)
    z_t = sqrt(bar_alpha_t) * z_0
          + sqrt(1 - bar_alpha_t) * epsilon
    y = caption(x_0)
    c_text = concat(OpenCLIP-G(y), CLIP-L(y))
    c_size = fourier(h_orig, w_orig)
    c_crop = fourier(c_top, c_left)
    c_aspect = fourier(bucket.h, bucket.w)
    c = (c_text, c_size, c_crop, c_aspect)
    take a gradient step on:
        || epsilon - epsilon_theta(z_t, t, c) ||^2
until converged
  • Hand-traced example. Take a 480x720 landscape image (original $h_{\text{orig}} = 480$, $w_{\text{orig}} = 720$). The closest aspect-ratio bucket near 1024² pixels is the 832x1216 landscape bucket (height x width). Resize the image to cover 832x1216 while preserving aspect, then crop. Suppose $(c_{\text{top}}, c_{\text{left}}) = (32, 0)$. The Fourier embeddings are $c_{\text{size}} = \text{fourier}(480, 720)$, $c_{\text{crop}} = \text{fourier}(32, 0)$, $c_{\text{aspect}} = \text{fourier}(832, 1216)$. These get concatenated with the dual text-encoder output and added (or cross-attended) into the U-Net.
  • Inference-time pattern. At sampling time, set $c_{\text{crop}} = (0, 0)$ to tell the model “this is a centred subject”; set $c_{\text{size}}$ to the target resolution (e.g. $(1024, 1024)$) to tell the model “generate as if the original was already at this resolution.”
  • Complexity. SDXL training: roughly $3\times$ the FLOPs per step of SD 1.5 due to U-Net scaling; same step count. Inference: 50 DDIM steps for base + 50 DDIM steps for refinement at default settings; can be lowered to 30 + 30 with minimal quality drop.
  • Novelty. [New] the micro-conditioning scheme. [Adapted] aspect-ratio bucketing from production text-to-image practice. [Adopted] dual text encoder pattern from prior text-to-image work.
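The bucket selection and the `fourier(...)` calls in the pseudocode can be sketched as follows. This is an assumption-heavy illustration: the bucket list is a short hypothetical subset in (height, width) order, the nearest bucket is chosen by log aspect ratio, and the embedding is a generic sinusoidal encoding with an arbitrary dimension of 8 per scalar. None of these details are taken from the SDXL paper or code.

```python
import math

# Hypothetical (height, width) buckets near 1024^2 pixels; the real list is longer.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216)]

def pick_bucket(h_orig, w_orig, buckets=BUCKETS):
    """Choose the bucket whose log aspect ratio is closest to the image's."""
    target = math.log(h_orig / w_orig)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))

def fourier(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one scalar: dim/2 sines followed by dim/2 cosines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def micro_conditioning(h_orig, w_orig, c_top, c_left, bucket):
    """Concatenate embeddings for original size, crop offset, and bucket size."""
    emb = []
    for v in (h_orig, w_orig, c_top, c_left, bucket[0], bucket[1]):
        emb.extend(fourier(v))
    return emb

bucket = pick_bucket(480, 720)
cond = micro_conditioning(480, 720, 32, 0, bucket)

print("bucket:", bucket)               # (832, 1216)
print("conditioning length:", len(cond))
```

At inference the same `micro_conditioning` call would be made with a crop offset of (0, 0) and the desired target size, following the inference-time pattern above.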

ALGORITHM ENTRY 4: Flux rectified-flow training.

  • Source: [Reconstructed] from the rectified-flow paper combined with the BFL announcement and the Hugging Face model card. Black Forest Labs has not published a Flux-specific training paper as of the access date.
  • Purpose: Train the velocity-field transformer.
  • Pseudocode.
freeze E, D (the Flux autoencoder)
repeat:
    x_data ~ q(x_data)
    z_1 = E(x_data)                          # latent data sample
    z_0 ~ N(0, I_latent_shape)               # latent noise sample
    t ~ U[0, 1]                              # continuous timestep
    z_t = (1 - t) * z_0 + t * z_1            # straight-line interpolation
    y = caption(x_data)
    c = (T5-XXL(y), CLIP-L(y))
    target_velocity = z_1 - z_0
    take a gradient step on:
        || target_velocity - v_theta(z_t, t, c) ||^2
until converged
  • Sampling (Euler integration).
z = z_0 ~ N(0, I)
dt = 1 / N_steps                              # e.g. 1/4 for schnell
for k = 0, 1, ..., N_steps - 1:
    t = k * dt
    z = z + dt * v_theta(z, t, c)
x_out = D(z)
  • Hand-traced example. Take a 2D toy with $z_0 = (0.0, 0.0)$, target data $z_1 = (1.0, 2.0)$, $N_{\text{steps}} = 4$, $dt = 0.25$. Assume the network has learned $v_\theta(z, t, c) = (1, 2)$ along the straight line. Then:
    • $k=0$: $t=0$, $z = (0,0) + 0.25 \cdot (1, 2) = (0.25, 0.5)$.
    • $k=1$: $t=0.25$, $z = (0.25, 0.5) + 0.25 \cdot (1, 2) = (0.5, 1.0)$.
    • $k=2$: $t=0.5$, $z = (0.75, 1.5)$.
    • $k=3$: $t=0.75$, $z = (1.0, 2.0)$. Recovered the target.
    • In realistic settings, the learned velocity field is not perfectly constant and the path curves slightly; reflow improves straightness.
  • Complexity. Inference cost: $N_{\text{steps}}$ forward passes through the transformer plus one autoencoder decode. FLUX.1 schnell at 4 steps is roughly 6-10x faster wall-clock than SDXL at 50 DDIM steps on the same hardware. [Analysis] FLUX.1 dev at 50 steps is comparable wall-clock to SDXL at 50 steps: the transformer is larger, but guidance distillation means each step is a single forward pass, whereas SDXL with classifier-free guidance needs two.
  • Hyperparameters. From the FLUX.1-dev model card: guidance scale 3.5, 50 inference steps, max sequence length 512 for T5-XXL, output resolution 1024x1024.
  • Novelty. [Adopted] rectified-flow training from Liu-Gong-Liu 2022. [New] the combination with a transformer of this scale on text-to-image at production quality.
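The Euler sampling loop and the hand trace above fit in a few lines. This sketch replaces the transformer with a hypothetical constant velocity function, which is the idealised perfectly-straight case; a real $v_\theta$ also takes the conditioning `c` and returns a different vector at each point.

```python
def euler_sample(z0, velocity_fn, n_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    z = list(z0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

def constant_velocity(z, t):
    """Stand-in for a trained network on a perfectly straight path."""
    return [1.0, 2.0]

print(euler_sample([0.0, 0.0], constant_velocity, n_steps=4))   # [1.0, 2.0]
print(euler_sample([0.0, 0.0], constant_velocity, n_steps=1))   # [1.0, 2.0]
```

With a constant field, one step and four steps land on the same point. Step count only matters once the learned field curves, which is why trajectory straightness governs how few steps a sampler can use.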

Section 8 — Specialised design contributions

Subsection 8A — LLM / prompt design. Not applicable to this paper lineage. None of DDPM, LDM, SDXL, or Flux use an LLM in a prompting capacity within the generation pipeline; the text models (CLIP, OpenCLIP, T5) are used as encoders only (for T5, just the encoder half of the encoder-decoder model) and produce embeddings, not prompts.

Subsection 8B — Architecture-specific details.

  • DDPM U-Net. Group normalisation, sinusoidal timestep embeddings, self-attention at 16x16 resolution.
  • LDM U-Net. Adds cross-attention layers between self-attention and feed-forward at every resolution where the original U-Net had self-attention. Conditioning input dim: 768 (CLIP ViT-L) or 1280 (CLIP ViT-bigG).
  • SDXL U-Net. Removes self-attention at the highest resolution; adds more transformer blocks at the middle and bottom resolutions. Total 2.6B parameters. Two text-encoder embeddings are concatenated channel-wise before cross-attention.
  • Flux transformer. Hybrid double-stream + single-stream block structure. Rotary positional embeddings. Total 12B parameters. No U-Net.

Subsection 8C — Training specifics.

  • DDPM. Single 8-GPU node sufficient for CIFAR-10. CelebA-HQ and LSUN trained on larger setups.
  • LDM. From the paper (Section 4): autoencoder trained for 200K-400K steps; text-to-image LDM trained on LAION-400M (later LAION-2B for SD 1.x).
  • SDXL. From the paper (Section 2.6): 600K + 200K + multi-aspect training stages, batch size 2048 at 256x256. Compute not officially disclosed; community estimates are in the high-six-figure GPU-hour range.
  • Flux. From the BFL announcement: compute scale not disclosed. [Analysis] Inferring from the 12B parameter count and the production deployment, training cost is in the same order of magnitude as Stable Diffusion 3.

Subsection 8D — Inference / deployment specifics.

  • DDPM. 1000 sampling steps; impractical for production until DDIM (Song 2020) and DPM-Solver (Lu 2022) cut this to 20-50.
  • LDM/SDXL. 50 DDIM steps default; classifier-free guidance scale $w \in [3, 8]$.
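  • Flux. 50 steps for dev, with guidance distilled into the model; 4 steps for schnell, whose model card attributes the low step count to latent adversarial diffusion distillation rather than to guidance distillation alone.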
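[Analysis] The step counts above translate into network evaluations as sketched below. The calculation assumes classifier-free guidance costs two forward passes per step (implementations often batch them, which changes wall-clock but not FLOPs) and that guidance-distilled models need one. It ignores per-pass cost differences between backbones, so it compares call counts only, not latency.

```python
def network_evaluations(steps, uses_cfg):
    """Forward passes per image: two per step with classifier-free guidance, else one."""
    return steps * (2 if uses_cfg else 1)

# (sampling steps, classifier-free guidance at inference), taken from the bullets above
configs = {
    "DDPM ancestral": (1000, False),
    "LDM/SDXL, DDIM + CFG": (50, True),
    "FLUX.1 dev": (50, False),
    "FLUX.1 schnell": (4, False),
}

for name, (steps, uses_cfg) in configs.items():
    print(f"{name}: {network_evaluations(steps, uses_cfg)} forward passes")
```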

Section 9 — Experiments and results

Datasets across the lineage.

  • DDPM: CIFAR-10 (32x32), CelebA-HQ (256x256), LSUN Bedrooms / Churches / Cats (256x256).
  • LDM: CelebA-HQ, FFHQ, LSUN-Bedrooms, LSUN-Churches, ImageNet (class-conditional), COCO (text-to-image), LAION-400M (large-scale text-to-image).
  • SDXL: Internal large-scale text-image dataset (LAION-derived; not publicly fully specified). Evaluation on PartiPrompts and a user-preference study.
  • Flux: Training data not disclosed by Black Forest Labs.

Reproducing DDPM’s headline results (Table 1 of the paper).

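| Dataset | Method | FID | Inception Score |
| --- | --- | --- | --- |
| CIFAR-10 | NCSN (Song-Ermon 2019) | 25.32 | 8.87 |
| CIFAR-10 | StyleGAN2 + ADA | 2.92 | 9.83 |
| CIFAR-10 | DDPM (this paper) | 3.17 | 9.46 |
| LSUN Church | StyleGAN | 4.21 | - |
| LSUN Church | DDPM (this paper) | 7.89 | - |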

Table 1 of Ho, Jain, Abbeel — Denoising Diffusion Probabilistic Models (arXiv:2006.11239), reproduced for editorial coverage.

Reproducing LDM’s headline results (selected from Tables 8-13).

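| Dataset | Resolution | Method | FID |
| --- | --- | --- | --- |
| CelebA-HQ | 256x256 | LDM-4 | 5.11 |
| FFHQ | 256x256 | LDM-4 | 4.98 |
| LSUN-Churches | 256x256 | LDM-8 | 4.02 |
| LSUN-Bedrooms | 256x256 | LDM-4 | 2.95 |
| ImageNet (class-cond.) | 256x256 | LDM-4-G | 3.60 |
| COCO (text-to-image) | 256x256 | LDM-KL-8-G | 12.63 |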

Tables 8-13 of Rombach et al. (arXiv:2112.10752), reproduced for editorial coverage.

SDXL user-preference study results. From the paper (Section 3): in a head-to-head preference study, SDXL with refinement was preferred 48.44% of the time vs SD 1.5 (7.91%) and SD 2.1 (6.71%). Against Midjourney v5.1 on PartiPrompts (17,153 user comparisons), SDXL was favoured 54.9% of the time. [Analysis] User-preference studies are subjective and the sampling demographics aren’t publicly described; treat as indicative of the qualitative jump, not as a SOTA claim.

Flux benchmarks. Black Forest Labs has not released a peer-reviewed benchmark paper for Flux. The BFL announcement reports internal benchmarks favouring FLUX.1 [pro] over Midjourney v6.0, DALL-E 3, and SD3-Ultra on visual quality, prompt following, size/aspect variability, typography, and output diversity. [Analysis] These are vendor self-reports; independent peer-reviewed comparison would strengthen the SOTA claim. Community benchmarks (Image-Arena, lmsys leaderboards) have ranked FLUX.1 [pro] in the top tier among open and closed text-to-image models through 2024-2025.

Independent benchmark cross-check. [External comparison] Papers With Code’s text-to-image leaderboard (accessed 2026-05-20) places SDXL and FLUX.1 [pro] within the top tier on CLIP-score and aesthetic-score benchmarks; reproducibility studies (Conde 2024 and others) report FLUX.1 [dev] matching or exceeding SDXL on prompt-following metrics at 1024x1024.

Ablations.

  • DDPM Section 4.2 ablation: the simplified vs full ELBO loss; simplified wins on FID. The noise-prediction parameterisation vs predicting $\mu$ or $x_0$ directly; noise-prediction wins.
  • LDM Section 4.1 ablation: downsampling factor sweep. $f=1$ (pixel space) is worst on FID-vs-compute; $f \in \{4, 8\}$ is the sweet spot; $f=32$ loses too much detail.
  • SDXL ablations: the impact of each micro-conditioning. From the paper (Section 2.2): turning off size conditioning leaves the model unable to use ~39% of training data. Turning off crop conditioning brings back the “headless cat” artefact.
  • Flux: No public ablations.

Evidence audit.

  • Strongly supported. DDPM’s CIFAR-10 FID; LDM’s CelebA-HQ and LSUN FIDs (independently reproduced many times); SDXL’s qualitative jump over SD 1.5/2.1; Flux’s 4-step sampling capability.
  • Partially supported. SDXL’s “competitive with black-box” claim (subjective; vendor-friendly framing). Flux’s prompt-following claims vs DALL-E 3 (vendor self-report).
  • Reliant on narrow evidence. Flux benchmark numbers — no peer-reviewed paper. [Reviewer Perspective] The community has independently validated Flux’s quality through deployment; the absence of a paper is unusual for a model of this prominence.

Section 10 — Technical novelty summary

Component | Type | Novelty level | Justification | Source
Noise-prediction parameterisation | Training method | Fully novel | Empirically dominant; reframed the entire field | DDPM Section 3.2
Simplified unweighted loss | Training method | Incrementally novel | Drops the ELBO weighting; better FID | DDPM Section 3.4
U-Net for diffusion | Architecture | Combination novel | U-Net existed (Ronneberger 2015); diffusion application was new | DDPM Appendix B
Latent-space diffusion | Architecture | Fully novel | First demonstration that VAE-style latent + diffusion is the winning combination | LDM Section 3
Cross-attention conditioning at every U-Net layer | Architecture | Fully novel | Replaced earlier global conditioning patterns | LDM Section 3.3
Perceptual + adversarial autoencoder loss | Training method | Adapted | Borrowed from VQ-GAN / Esser-Rombach-Ommer 2020 | LDM Section 3.1
Dual text encoder concatenation | Architecture | Combination novel | Both encoders existed; the channel-axis concat is the contribution | SDXL Section 2.1
Size and crop micro-conditioning | Training method | Fully novel | No prior text-to-image system had this conditioning surface | SDXL Section 2.2
Multi-aspect-ratio bucketing | Training procedure | Adapted | Production technique formalised | SDXL Section 2.3
Refinement model (SDEdit-at-scale) | Inference method | Combination novel | SDEdit existed; using it as a deployed second-stage was new at this scale | SDXL Section 2.5
Rectified-flow transformer | Architecture + training | Combination novel | Both components existed independently; this combination at production scale was new | Flux release
Guidance distillation | Inference method | Adapted | Salimans-Ho 2022 technique applied to Flux | FLUX.1-dev model card

Single most novel contribution across the lineage. [Analysis] The DDPM noise-prediction parameterisation is the single most consequential contribution. It is the conceptual move that converted a hard variational-bound optimisation into a trivial mean-squared-error regression, which is what made the entire post-2020 diffusion era possible. LDM’s latent-space move and Flux’s rectified-flow move are both major engineering advances, but each builds on that regression framing: LDM applies the noise-prediction loss to latents, and rectified flow keeps the mean-squared-error form while swapping the noise target for a velocity target.
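
For reference, the regression in question is the simplified objective from the DDPM paper. It is written here in the paper's notation, with $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$, $\epsilon \sim \mathcal{N}(0, I)$, $t$ drawn uniformly from $\{1, \dots, T\}$, and $\epsilon_\theta$ the denoising network:

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon}
    \left[
      \left\lVert
        \epsilon - \epsilon_\theta\!\left(
          \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t
        \right)
      \right\rVert^2
    \right]
```

The argument of $\epsilon_\theta$ is the closed-form noised sample $x_t$, so each training step needs only one draw of $t$ and $\epsilon$ rather than a simulation of the full forward chain.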

What the papers do NOT claim as novel. DDPM does not claim the diffusion framework itself (Sohl-Dickstein et al. 2015) or the U-Net architecture (Ronneberger et al. 2015). LDM does not claim the autoencoder architecture (Esser-Rombach-Ommer 2020 VQ-GAN) or cross-attention (Vaswani et al. 2017). SDXL does not claim the classifier-free-guidance scheme (Ho-Salimans 2021) or aspect-ratio bucketing. Flux does not claim rectified-flow training (Liu-Gong-Liu 2022) or the transformer-for-diffusion pattern (Peebles-Xie DiT 2022).

Section 11 — Situating the work

Prior work. [External comparison] The conceptual ancestor is Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli 2015 — Deep Unsupervised Learning using Nonequilibrium Thermodynamics — which introduced the forward-backward Markov chain framework but produced only low-quality samples. The score-matching ancestor is Hyvärinen 2005, then Vincent 2011 (denoising score matching), then Song-Ermon 2019 (NCSN). Song et al. 2021 unified DDPM and NCSN under a continuous-time SDE.

What this lineage changes conceptually.

  • DDPM — diffusion is competitive with GANs on sample quality.
  • LDM — diffusion is computationally tractable at production resolution.
  • SDXL — micro-conditioning recovers training data and fixes artefacts.
  • Flux — flow matching + transformer + few-step sampling is the future.

Contemporaneous related work.

  • GLIDE (Nichol et al. 2021) and DALL-E 2 (Ramesh et al. 2022) — the first large-scale text-to-image diffusion deployments, contemporaneous with LDM but closed-source. LDM’s contribution relative to GLIDE is that LDM operates in latent space and was released openly; the architectural ideas are otherwise close cousins.
  • Imagen (Saharia et al. 2022) — Google’s pixel-space text-to-image cascade with T5 conditioning. Contemporaneous with LDM, demonstrated that a strong text encoder (T5-XXL) matters more than the visual model for prompt following. Flux’s T5-XXL choice is a direct descendant of this finding.
  • DiT (Peebles, Xie 2022) — the first paper proposing transformers as the diffusion backbone for class-conditional ImageNet generation. The architectural blueprint behind Stable Diffusion 3 and Flux.
  • Stable Diffusion 3 (Esser, Kulal et al. 2024, arXiv:2403.03206) — the immediate counterpart to Flux. SD3 also uses rectified flow + transformer (MMDiT). The two architectures converged independently and concurrently on essentially the same design.
  • DALL-E 3 (OpenAI, 2023) — closed-weights text-to-image system based on diffusion. Contemporaneous with SDXL; cited for prompt-following quality.

[Reviewer Perspective] Strongest skeptical objection. DDPM’s contribution is more incremental than the field’s standard narrative suggests: the noise-prediction parameterisation is one line of algebra away from the score-matching formulation that Song-Ermon already had. LDM’s contribution is engineering rather than science: the autoencoder + diffusion combination is conceptually obvious in hindsight and the heavy lifting is the empirical sweep over $f$. SDXL’s micro-conditioning is clever but the paper does not deeply justify why Fourier-embedding size/crop coordinates as conditioning is better than alternative parameterisations. Flux ships without a paper at all.

[Reviewer Perspective] Strongest author-side rebuttal grounded in the paper. DDPM’s noise-prediction parameterisation, while related to score matching in retrospect, was not the standard approach in the diffusion literature at the time, and DDPM’s empirical demonstration that this parameterisation gives state-of-the-art FID was what shifted the field’s consensus. LDM’s engineering contribution is exactly what made diffusion accessible — the field’s reproducibility and downstream-application explosion (Stable Diffusion, ControlNet, LoRA, etc.) is a direct consequence. SDXL’s ablations explicitly show the data-recovery effect of size conditioning; the empirical justification is strong even if the theoretical justification is light. Flux’s lack of a paper is a known open question; the model’s open-weights release and community validation are partial substitutes.

What remains unsolved.

  • Mode coverage of text-conditional models. All four papers in the lineage suffer from concept-bleeding, attribute-confusion, and long-tail concept coverage failures. SDXL Section 4 explicitly acknowledges this.
  • Compositional generation. None of the four reliably handle prompts like “a cube on top of a sphere, both red, next to a blue cylinder.”
  • Text rendering inside images. Improved substantially from SD 1.5 to SDXL to Flux but is not yet solved.
  • Computational cost at inference time. Even Flux schnell’s 4-step sampling is dominated by 12B-parameter transformer forward passes.
  • Open evaluation. No standardised text-to-image benchmark with public ground truth and reproducible evaluation protocols at the level of GLUE or ImageNet.

Three future research directions, paper-grounded.

  1. Replacing the autoencoder. The LDM/SDXL/Flux autoencoder is trained once and frozen; quality bottlenecks at high resolution come from the autoencoder’s reconstruction, not the diffusion model. A jointly-trained or end-to-end-replaced autoencoder could be the next quality unlock. [Analysis] grounded in LDM Section 4’s reconstruction-quality discussion.
  2. Distillation across the lineage. FLUX.1 [schnell] shows that few-step distillation works for a rectified-flow model (guidance distillation is the technique listed for FLUX.1 [dev]); analogous distillation could compress SDXL or pixel-space DDPM to single-digit steps. [Reviewer Perspective] partially done in research literature (Consistency Models, Progressive Distillation) but not unified in a single deployment.
  3. Better text encoders. The Imagen + Flux convergence on T5-XXL suggests text-encoder choice matters more than visual model choice. A vision-language model specifically pre-trained for diffusion conditioning could close the prompt-following gap with closed-weights systems. [Analysis] not explicitly proposed in any of the four papers; emerges from comparing them.

Section 12 — Critical analysis

Strengths with concrete evidence.

  • DDPM establishes a complete training-and-sampling pipeline. Algorithms 1 and 2 are short, implementable in a single afternoon, and the loss is a one-line MSE.
  • LDM delivers the order-of-magnitude compute reduction that made the field accessible. From the paper (Section 4): training on a single 8-GPU node is sufficient for 256x256 text-to-image at LDM-4 scale.
  • SDXL ships a clear ablation of each micro-conditioning component (Section 2.2-2.5), letting practitioners understand which conditioning carries which artefact-fix.
  • Flux demonstrates that few-step sampling at production quality is achievable open-weights, releasing FLUX.1 [schnell] under Apache 2.0.
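
How short the DDPM sampling loop is can be shown with a sketch in the style of Algorithm 2. It is a toy under stated assumptions: the data are scalars, the "model" is a hypothetical closed-form oracle for a dataset containing one value, and the oracle receives the cumulative product $\bar{\alpha}_t$ instead of the timestep index to keep the example small.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule, the setting reported in the DDPM paper."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def ancestral_sample(eps_model, betas, rng):
    """Reverse chain in the style of DDPM Algorithm 2, for scalar data."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = rng.gauss(0.0, 1.0)  # start from x_T ~ N(0, 1)
    for t in reversed(range(len(betas))):
        alpha, alpha_bar = 1.0 - betas[t], alpha_bars[t]
        eps_hat = eps_model(x, alpha_bar)
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha)
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * noise  # sigma_t^2 = beta_t variant
    return x

# Hypothetical oracle for a dataset containing the single value C;
# a trained network would take this function's place.
C = 0.7

def oracle_eps(x_t, alpha_bar):
    return (x_t - math.sqrt(alpha_bar) * C) / math.sqrt(1.0 - alpha_bar)

sample = ancestral_sample(oracle_eps, linear_betas(), random.Random(0))
```

With a perfect noise predictor the chain lands on the single data value, which is a quick sanity check for a sampler implementation before a real network is plugged in.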

Weaknesses explicitly stated by the authors.

  • DDPM Section 5: “diffusion models do not have competitive log likelihoods compared to other likelihood-based models.” Sampling is 1000 steps.
  • LDM Section 5: sequential sampling is still slower than GANs; reconstruction fidelity at very high resolution is bottlenecked by the autoencoder.
  • SDXL Section 4 (limitations): “The model may encounter challenges when synthesizing intricate structures, such as human hands.” Text rendering: “still encounters difficulties when rendering long, legible text.” Concept bleeding. Training-data biases. Imperfect photorealism.
  • Flux: the BFL announcement and the model card do not enumerate limitations; [Reviewer Perspective] the absence is itself a documentation gap.

Weaknesses not stated or understated by the authors.

  • [Reviewer Perspective] All four models inherit the training data’s biases and copyrighted-content risks. LAION-derived training data for Stable Diffusion has been at issue in Getty Images v Stability AI (filed 2023); the SDXL paper does not engage with the copyright surface despite training on similar data.
  • [Reviewer Perspective] DDPM’s sampling-step count was a known weakness from the start; the paper’s reported FID assumes 1000-step sampling, which is impractical for most downstream uses.
  • [Reviewer Perspective] SDXL’s user-preference study uses a non-publicly-described population; the framing of “48.44% preferred vs 7.91%” overweights the qualitative jump.
  • [Reviewer Perspective] independent commentary: the Hugging Face diffusers documentation and community implementations have surfaced subtle reproducibility issues across the lineage (latent-scale-factor choices, VAE numerical precision at fp16) that the papers themselves do not document.

Reproducibility check.

  • DDPM. Code released (Ho et al. official TensorFlow at github.com/hojonathanho/diffusion). Data: CIFAR-10 public, CelebA-HQ public, LSUN public. Hyperparameters fully specified. Compute fully reported. Trained weights: released for CIFAR-10. Overall: fully reproducible.
  • LDM. Code released (github.com/CompVis/latent-diffusion). Stable Diffusion 1.x weights released openly. Hyperparameters specified. Compute: partially specified (training cost stated approximately). Overall: fully reproducible.
  • SDXL. Code released (github.com/Stability-AI/generative-models). Weights released openly under OpenRAIL++ license. Training data: not fully disclosed. Hyperparameters: largely specified. Compute: not explicitly stated. Overall: partially reproducible — the model is downloadable but full retraining requires access to undisclosed training data.
  • Flux. Code: FLUX.1 [dev] weights released under non-commercial license at huggingface.co/black-forest-labs/FLUX.1-dev. FLUX.1 [schnell] under Apache 2.0. Training code: not released. Training data: not disclosed. Hyperparameters for inference: in the model card. Compute: not disclosed. Overall: weights-available but not training-reproducible.

Methodology callout.

Methodology

  • Sample sizes. DDPM: CIFAR-10 (50K train), CelebA-HQ (30K), LSUN subsets (1-3M). LDM: LAION-400M for text-to-image (400M image-caption pairs). SDXL: undisclosed LAION-derived dataset. Flux: undisclosed.
  • Evaluation sets. DDPM and LDM: standard FID computation against held-out test splits. SDXL: PartiPrompts (1,632 prompts) for user-preference study. Flux: vendor self-report, no public benchmark dataset.
  • Baselines. DDPM vs NCSN, BigGAN, StyleGAN2-ADA. LDM vs DDPM (pixel-space), VQ-GAN, GLIDE. SDXL vs SD 1.5, SD 2.1, Midjourney v5.1. Flux vs SD3, Midjourney v6.0, DALL-E 3 (vendor-reported, no peer review).
  • Hardware / compute. DDPM: not centrally tabulated; community estimates of 100-200 GPU-days for full results. LDM: stated as ~150-250 GPU-days for text-to-image training. SDXL: not stated. Flux: not stated.

Generalisability. [Analysis] The pipeline generalises to other modalities — audio diffusion (AudioLDM), video diffusion (Video Diffusion Models, Sora), 3D (DreamFusion), molecules (Boltz-1 lineage). The core noise-prediction loss is modality-agnostic.

Assumption audit. Revisiting Section 3 assumptions: the data-distribution-has-a-density assumption holds for natural images. The “small enough $\beta_t$” assumption is well within the operating regime. The U-Net (or transformer) capacity assumption is empirically validated by every scaling experiment in the lineage.
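
The small-step assumption can be checked numerically. The sketch below uses the linear schedule reported in the DDPM paper (1000 steps, from 1e-4 to 0.02); the helper name is hypothetical.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced per-step noise variances."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

betas = linear_betas()

# Cumulative signal retention after the full forward chain.
alpha_bar_T = 1.0
for beta in betas:
    alpha_bar_T *= 1.0 - beta

print(f"largest beta_t: {max(betas):.4f}")
print(f"alpha_bar_T:    {alpha_bar_T:.2e}")
```

Every individual step adds at most about 2% noise variance, yet the cumulative product is small enough that the final state retains almost no signal. Both conditions are needed: small steps keep the Gaussian reverse approximation reasonable, and a near-zero product lets sampling start from pure noise.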

What would make each paper significantly stronger. [Analysis] DDPM: a formal characterisation of why the unweighted loss outperforms the weighted ELBO would close a long-standing theoretical gap. LDM: a more principled analysis of the autoencoder’s bottleneck rather than only an empirical sweep. SDXL: ablations of alternative micro-conditioning parameterisations (additive vs cross-attended vs token-injected). Flux: a public training paper documenting the training data, compute, and architectural choices.

Section 13 — What is reusable for a new study

REUSABLE COMPONENT 1: The noise-prediction training loss.

  • What it is: $L_{\text{simple}}$ from DDPM.
  • Why worth reusing: Modality-agnostic, simple to implement, well-understood empirically.
  • Preconditions: A forward process with a closed-form noised state (any variance-preserving SDE qualifies).
  • What would need to change in a different setting: Replace the U-Net with the modality-appropriate encoder (transformer for video / language; equivariant network for molecules; etc.).
  • Risks: Sample quality at low step counts requires either distillation, DDIM-style deterministic sampling, or a flow-matching reformulation.
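
The precondition above (a closed-form noised state) is what keeps the training step short. A minimal sketch for list-valued data follows; the zero predictor and the three-entry schedule are placeholders chosen for illustration, not values from any of the papers.

```python
import math
import random

def q_sample(x0, alpha_bar, eps):
    """Closed-form noised state: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo draw of the noise-prediction MSE for a single example."""
    t = rng.randrange(len(alpha_bars))        # uniform timestep index
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # regression target
    x_t = q_sample(x0, alpha_bars[t], eps)
    eps_hat = eps_model(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

# Hypothetical stand-in predictor that always outputs zeros; a real
# model would be a U-Net or transformer trained by gradient descent.
def zero_model(x_t, t):
    return [0.0] * len(x_t)

alpha_bars = [0.99, 0.5, 0.01]  # illustrative values, not a real schedule
loss = simple_loss(zero_model, [0.2, -0.4, 0.9], alpha_bars, random.Random(0))
```

Porting this to another modality changes only the shape of `x0` and the network behind `eps_model`; the noising and the loss stay the same.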

REUSABLE COMPONENT 2: Latent-space diffusion via a pre-trained autoencoder.

  • What it is: Train an autoencoder once; train a diffusion model in the latent space.
  • Why worth reusing: Order-of-magnitude compute reduction with minimal quality loss for high-dimensional data.
  • Preconditions: A high-quality autoencoder with smooth latents (KL- or VQ-regularised).
  • What would need to change: For video, replace the 2D autoencoder with a 3D / temporal autoencoder. For audio, use mel-spectrogram + audio decoder pattern.
  • Risks: Autoencoder reconstruction becomes the quality ceiling. Latent statistics drift can break diffusion training mid-run.
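
One common guard against the latent-statistics risk is to rescale latents to roughly unit variance before diffusion training. The sketch below assumes the scale is estimated once from a sample batch and then frozen; the function name and the sample values are hypothetical.

```python
import statistics

def latent_scale_factor(latent_values):
    """Reciprocal of the empirical standard deviation of a batch of latents."""
    return 1.0 / statistics.pstdev(latent_values)

# Hypothetical flattened latents from a first batch; real values would
# come from the frozen encoder.
first_batch = [5.1, -4.8, 6.3, -5.9, 0.4, -0.2, 4.4, -5.3]
scale = latent_scale_factor(first_batch)
scaled = [scale * z for z in first_batch]
```

The same factor has to be undone before decoding, so it is worth storing alongside the autoencoder checkpoint, not recomputing it per run.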

REUSABLE COMPONENT 3: Cross-attention conditioning at every U-Net layer.

  • What it is: Inject conditioning via cross-attention rather than concatenation / addition.
  • Why worth reusing: Empirically the strongest conditioning mechanism for diffusion; generalises across modalities.
  • Preconditions: A conditioning encoder that produces a sequence of token embeddings (CLIP, T5, or domain-specific).
  • What would need to change: Choose the right cross-attention placement; not every layer needs it.
  • Risks: Adds parameter count; care needed with attention masking for variable-length conditioning.
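
The mechanism can be sketched as single-head scaled dot-product attention in which image positions supply the queries and conditioning tokens supply the keys and values. This is a simplified illustration: learned projection matrices, multiple heads, and masking are omitted, and all inputs are made-up toy vectors.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per image position; keys and values: one vector
    per conditioning token. Learned projections are omitted for brevity.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Two image positions attend over three hypothetical text-token embeddings.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
attended = cross_attention(image_queries, text_keys, text_values)
```

Because every image position computes its own weights over the token sequence, different regions can draw on different words of the prompt, which a single global conditioning vector cannot express.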

REUSABLE COMPONENT 4: SDXL-style micro-conditioning (size, crop, aspect ratio).

  • What it is: Telling the model at training time what the data preprocessing did, so the model can compensate at inference.
  • Why worth reusing: Recovers training data that would otherwise be discarded; gives users explicit inference-time controls.
  • Preconditions: A training pipeline with deterministic preprocessing whose parameters can be recorded.
  • What would need to change: The conditioning encoder for whatever preprocessing-parameters matter in the new domain.
  • Risks: Conditioning on too many factors can create training instability; needs ablation.
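
The SDXL paper describes embedding each size and crop value with a Fourier feature encoding and concatenating the results into one conditioning vector. A minimal sketch of that idea follows; the embedding width of 8 and the frequency spacing are assumptions for illustration, and the function names are hypothetical.

```python
import math

def fourier_embed(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one scalar, timestep-embedding style."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]

def micro_conditioning(original_size, crop_top_left, target_size, dim=8):
    """Embed each preprocessing scalar separately and concatenate the results."""
    scalars = [*original_size, *crop_top_left, *target_size]
    vector = []
    for s in scalars:
        vector.extend(fourier_embed(float(s), dim))
    return vector

# Sizes are (height, width): a 384x512 source image, cropped at
# top=0, left=64, trained in a 1024x1024 bucket.
cond = micro_conditioning((384, 512), (0, 64), (1024, 1024))
```

At inference the same vector becomes a user-facing control: passing a large original size and a zero crop offset asks the model for the kind of image it saw under those preprocessing conditions.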

REUSABLE COMPONENT 5: Rectified-flow training for few-step sampling.

  • What it is: Replace DDPM-style stochastic training with deterministic flow matching along straight paths.
  • Why worth reusing: Single-digit step sampling at production quality.
  • Preconditions: A latent space (or pixel space) where straight-line interpolation between noise and data is meaningful.
  • What would need to change: Sampling code becomes an Euler ODE integrator rather than an ancestral-Gaussian sampler.
  • Risks: Trajectory curvature may require reflow iterations to fully realise the few-step sampling benefit.
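
The change in sampling code can be sketched as a plain Euler integrator. The example assumes the convention x_t = (1 - t) * data + t * noise and uses a hypothetical closed-form velocity for a dataset holding one scalar value, standing in for a trained network.

```python
def euler_sample(velocity, noise, steps):
    """Integrate dx/dt = velocity(x, t) from t=1 (noise) down to t=0 (data)."""
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)
    return x

# Convention assumed here: x_t = (1 - t) * data + t * noise, so the
# regression target is noise - data. For a dataset holding the single
# value C the ideal velocity has a closed form; it is a hypothetical
# stand-in for a trained network.
C = 0.7

def oracle_velocity(x_t, t):
    return (x_t - C) / t

one_step = euler_sample(oracle_velocity, noise=-1.3, steps=1)
four_step = euler_sample(oracle_velocity, noise=-1.3, steps=4)
```

In this toy the path is perfectly straight, so one Euler step already reaches the data value. Learned velocity fields are generally curved, which is why the reflow risk above matters for real few-step sampling.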

Dependency map. Component 2 (latent diffusion) depends on Component 1 (noise-prediction loss applied to latents). Component 3 (cross-attention conditioning) integrates with either Component 1 or Component 5. Component 4 (micro-conditioning) integrates with Components 2 and 3. Component 5 (rectified flow) replaces the loss in Component 1 and is most powerful when paired with Component 2.

Highest-value components. [Analysis] For a new study trying to build a text-to-image system from scratch in 2026, the recommended stack is Component 5 (rectified flow) + Component 2 (latent diffusion via a strong autoencoder) + Component 3 (cross-attention with T5-XXL or successor) + Component 4 (micro-conditioning adapted to the deployment scenario). This is exactly the Flux / SD3 architecture.

What type of new study benefits most. [Analysis] Domain-specific generative modelling (medical imaging, satellite imagery, scientific visualisation) where the model needs strong conditioning controls and few-step sampling — Components 3, 4, and 5 combined deliver the biggest wins.

Section 14 — Known limitations and open problems

Limitations stated by the authors.

  • DDPM: slow sampling, weak log-likelihood relative to other likelihood-based models.
  • LDM: autoencoder reconstruction is the quality ceiling; sampling is still slower than GANs.
  • SDXL: human hands, text rendering, concept bleeding, training-data biases, imperfect photorealism.
  • Flux: limitations not enumerated in the public release materials.

Limitations not stated, sourced from independent commentary.

  • [Reviewer Perspective] grounded in community follow-up: the consistency-model lineage (Song et al. 2023) explicitly identifies DDPM’s single-step sampling impossibility as a structural limitation; this drove the entire distillation literature.
  • [Reviewer Perspective] grounded in the Stable Diffusion 3 paper (Esser, Kulal et al. 2024, arXiv:2403.03206): identifies SDXL’s U-Net as a scalability ceiling, motivating the MMDiT transformer in SD3.
  • [Reviewer Perspective] grounded in policy discourse: none of the four papers address watermarking, attribution, or provenance, despite these being central to the responsible-deployment conversation.

Technical root causes.

  • Slow sampling: the forward chain destroys structure progressively; reversing it well requires many small steps unless the trajectory is straightened (Flux).
  • Concept bleeding: text-encoder embeddings entangle semantically related concepts; the U-Net cannot disentangle.
  • Hands and text: training data contains few high-resolution hand and text examples; the diffusion loss has no special weighting for them.

Open problems left behind.

  • A unified theoretical framework explaining why noise prediction empirically dominates other parameterisations.
  • A principled approach to compositional generation.
  • Reliable text rendering inside images.
  • Open standardised text-to-image benchmarks.
  • Open training-data documentation across the lineage.

What a follow-up paper would need to solve to address the most critical limitation. [Analysis] The most critical open problem for the diffusion lineage in 2026 is compositional generation. A follow-up paper would need to (a) propose a conditioning mechanism that separates object identity from spatial relations cleanly, (b) train at scale on data with rich compositional annotations, and (c) demonstrate on a standardised benchmark with held-out compositional prompts. The 2024-2025 follow-up literature (RPG-DiffusionMaster, BoxDiff, structured-prompt approaches) has made progress; a definitive solution is not yet in hand.

How this article reads at three depths

For the curious high-school reader. Diffusion models generate pictures by undoing a noising process: a fixed recipe gradually turns training pictures into static, and the model learns to run that process backwards. Over four years and four key papers, the field went from grainy 32x32 CIFAR samples to 1024x1024 photorealistic images that a high-end consumer GPU can produce in seconds. The four papers each made one big change: the loss function (DDPM), where the model lives (LDM moved it to a compressed space), what it sees during training (SDXL added image-size and crop conditioning), and how it samples (Flux replaced the random noising with straight-line flows).

For the working developer or ML engineer. The DDPM noise-prediction MSE loss is your conceptual starting point and ports unchanged to LDM/SDXL with the latent substitution. For production deployment, run latent diffusion at $f=8$ with a KL-regularised autoencoder; use cross-attention conditioning at every U-Net layer with CLIP text embeddings; bucket training data into aspect ratios and condition on size + crop + aspect to recover small images. For 2024-onward greenfield work, prefer rectified flow + transformer (SD3 / Flux architecture): the few-step sampling at production quality is a real win, and the transformer backbone scales more cleanly than the U-Net. Pin sampling to DDIM (50 steps) for U-Net diffusion or Euler ODE (4 steps for distilled, 50 for non-distilled) for rectified flow.
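
The aspect-ratio bucketing step mentioned above can be sketched as a nearest-ratio lookup. The bucket list is hypothetical (resolutions near a 1024x1024 pixel budget, sides in multiples of 64) and is not the list used by any of the papers.

```python
import math

# Hypothetical (height, width) buckets near a 1024x1024 pixel budget.
BUCKETS = [(1024, 1024), (896, 1152), (1152, 896), (768, 1344), (1344, 768)]

def nearest_bucket(height, width, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest in log space."""
    target = math.log(height / width)
    return min(buckets, key=lambda hw: abs(math.log(hw[0] / hw[1]) - target))

bucket = nearest_bucket(600, 800)  # a 4:3 landscape photo (height 600, width 800)
```

Comparing ratios in log space treats a 2:1 and a 1:2 image symmetrically. After assignment, each batch is drawn from a single bucket so that all images in it share one resolution.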

For the ML researcher. DDPM’s noise parameterisation is the foundational reframing; the simplified-vs-weighted-ELBO discrepancy remains theoretically uncharacterised. LDM’s primary contribution is engineering: the autoencoder-plus-diffusion combination is conceptually simple but the empirical sweep over $f$ and the cross-attention placement are load-bearing. SDXL’s micro-conditioning is the cleanest example of data-efficiency-via-conditioning in the lineage; the ablation is tight, but alternative parameterisations are under-explored. Flux moves to the rectified-flow + transformer regime convergently with SD3 (Esser, Kulal et al. 2024); the strongest objection is the absence of a peer-reviewed training paper from Black Forest Labs. A follow-up paper closing the theoretical gap on compositional generation, end-to-end-trained autoencoders, and standardised open benchmarks would be the highest-impact next step.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
