Neural Tech Daily
ai-research

Generative video models from DiT to Sora, Veo, and Movie Gen: a multi-paper review

Multi-paper review of generative video models — the DiT architecture, OpenAI's Sora, DeepMind's Veo 2/3, and Meta Movie Gen. Architecture, training, data curation…

Updated ~64 min read
Share

Reading-register key

From the paper: — directly supported by the paper. [Analysis] — the publication’s reasoned assessment. [Reconstructed] — faithful reconstruction from partial disclosure. [External comparison] — comparison to prior work or general knowledge. [Reviewer Perspective] — critical or speculative.

Figure 3 of Peebles and Xie — Scalable Diffusion Models with Transformers (arXiv:2212.09748), the four DiT block variants (in-context, cross-attention, adaLN, adaLN-Zero) that became the architectural template behind Sora, Veo, and Movie Gen

Figure 3 of Peebles and Xie — Scalable Diffusion Models with Transformers (arXiv:2212.09748), reproduced for editorial coverage.

Section 1 — Paper identity and scope

This review covers four artefacts that together define the modern generative-video stack.

  1. Peebles and Xie, Scalable Diffusion Models with Transformers (DiT). ICCV 2023. arXiv:2212.09748. The architectural ancestor: replaces the U-Net backbone of latent diffusion with a Vision-Transformer-style block on patchified latents, and proposes adaLN-Zero modulation for timestep and class conditioning. Every video model in this review uses a DiT-family backbone.
  2. OpenAI, Video generation models as world simulators (Sora technical report). February 2024. OpenAI blog post plus the later Sora System Card. Not a peer-reviewed paper; a written report describing Sora’s spacetime-patch tokenisation, diffusion-transformer architecture, variable-resolution training, and re-captioning approach inherited from DALL-E 3. Most architectural detail is intentionally withheld.
  3. Polyak et al. (Meta), Movie Gen: A Cast of Media Foundation Models. October 2024. arXiv:2410.13720. A 92-page technical report on Meta’s joint text-to-video, text-to-audio, video-editing, and personalisation system. The 30-billion-parameter Movie Gen Video backbone is the most extensively documented frontier video DiT in the open literature.
  4. Google DeepMind, Veo 2 and Veo 3. Veo 2 announced December 2024; Veo 3 announced at Google I/O 2025. Sources: the Veo model family page and the Veo 3 launch post. No peer-reviewed paper; the launch posts and system cards are the citable sources. Veo 3’s distinguishing claim is natively synchronised audio generation in the same diffusion pass as video.

Retrieval confirmation. DiT was retrieved in full from the arXiv abstract page and the ar5iv HTML rendering. Movie Gen was retrieved from the arXiv abstract and the official Meta paper PDF on 2026-05-19. The Sora technical report was retrieved from OpenAI’s blog index (cached snapshots used when the live page returned 403 to automated fetch); the Sora System Card was retrieved separately. The Veo entries combine the DeepMind model family page and the Google blog announcement. [Analysis] None of Sora, Veo 2, or Veo 3 has a peer-reviewed paper as of the access date; Movie Gen is the only one with a full technical paper, and DiT is the only one in the lineage with conference publication (ICCV 2023). This asymmetry shapes how strongly each section can ground its claims.

Paper classification. All four entries are Generative model and Architecture proposal. Movie Gen and Sora are additionally Multimodal (text-video and, for Movie Gen and Veo 3, text-audio). DiT is also Representation learning (the patchified-latent representation is load-bearing). Movie Gen and Veo are also Data-driven (data-curation pipelines are an explicit contribution). The Sora and Veo write-ups are also Application: they describe a deployed system more than they describe a model.

One-paragraph technical abstract in the publication’s voice. This review traces how generative-video models converged on a common architectural template between 2022 and 2025. The Peebles–Xie DiT paper showed that a transformer block with adaptive layer-normalisation modulation could match and exceed the U-Net on class-conditional ImageNet at the same compute budget, and that the resulting architecture scaled cleanly with FLOPs. Sora generalised the DiT block to spacetime patches, decoupled training from a fixed resolution or duration, and added a DALL-E-3-style re-captioning pipeline to manufacture high-quality text-video training pairs. Movie Gen documented the recipe at scale: a 30B-parameter DiT, a temporal autoencoder compressing video 8x in each of T, H, W, flow-matching as the training objective, joint image-and-video pre-training, and a five-stage data-curation pipeline. Veo 2 and Veo 3 are deployed systems whose architectural detail is mostly withheld; Veo 3’s headline contribution is that audio and video are generated in the same diffusion pass, producing lip-sync without a separate audio model. [Analysis] The four artefacts read as a single trajectory: DiT supplies the block, Sora supplies the tokenisation and the captioning, Movie Gen supplies the recipe at frontier scale, Veo supplies the multimodal output surface.

Primary research question. How do you build a deep generative model that produces minute-long, photorealistic, prompt-faithful, audio-synced video at frontier scale, and what architectural and data-curation choices are load-bearing for that result?

Core technical claim. A patchified-latent diffusion transformer trained with flow-matching (or noise-prediction) on a heavily curated joint image-video corpus, with timestep-and-text conditioning injected via adaptive layer-normalisation or cross-attention, is the current state-of-the-art recipe for text-to-video generation. The DiT block, spacetime patches, the temporal autoencoder, and the data-curation pipeline are each individually necessary contributions; together they define the recipe.

Core technical domains and depth labels. Diffusion and flow-matching generative modelling (deep). Transformer architecture design and adaptive normalisation (deep). Video tokenisation and temporal autoencoders (moderate to deep). Data curation and re-captioning pipelines (moderate). Human evaluation and A/B win-rate protocols (moderate). Audio-visual joint modelling (surface to moderate; Veo 3 detail is limited).

Reader prerequisites. High-school algebra. Familiarity with neural-network basics helps but is not required because the Glossary in Section 2.5 covers every prerequisite term. The diffusion-image-generation lineage is the natural prequel; every formula in this review extends one from the DDPM-to-Flux multi-paper review. Probability at the level of “what a Gaussian is” is assumed; everything else is explained inline.

Section 2 — TL;DR and executive overview

3-sentence TL;DR. Generative video models are diffusion models that take random noise and turn it into a clip by gradually denoising it; the modern recipe replaces the older U-Net architecture with a transformer that operates on small video patches, and adds an autoencoder that compresses the video by a factor of about eight in each of width, height, and time before the transformer ever touches it. Four artefacts define the current recipe: DiT (2022) introduced the transformer block, Sora (2024) generalised it to spacetime patches and variable-resolution training, Movie Gen (2024) documented the full recipe at thirty-billion parameters with a five-stage data-curation pipeline, and Veo 3 (2025) added natively synchronised audio so the same diffusion pass produces both pixels and sound. The review reproduces the central equation behind each system, traces the patchification operation on a small worked example, and locates each contribution within the broader diffusion and flow-matching theoretical context.

Executive summary. The video-generation field condenses onto a common architectural spine. DiT showed in 2022 that a transformer with adaptive-layer-normalisation timestep conditioning beats the U-Net on class-conditional ImageNet at matched FLOPs, scaling cleanly from 33M to 675M parameters. Sora generalised the DiT to videos by tokenising compressed latents as spacetime patches and training across variable resolutions, durations, and aspect ratios; OpenAI’s report makes no architectural disclosures beyond the patch-and-transformer framing. Movie Gen documents what the recipe looks like at frontier scale: a 30B transformer with Llama-3-style RMSNorm and SwiGLU, full bidirectional attention, cross-attention for text conditioning, flow-matching as the training objective, a temporal autoencoder with 8x compression in each spatial-temporal axis, and a five-stage data-curation pipeline producing roughly 73.8M curated video clips. Veo 2 and Veo 3 are deployed systems; their main editorial contribution is the demonstration that joint audio-video diffusion is practical at production quality, with the same transformer attending to both visual spacetime patches and temporal audio tokens.

Five practitioner-relevant takeaways.

  • The DiT-with-adaLN-Zero block is the conceptual atom of every model in this review. AdaLN-Zero conditions each transformer block on the timestep and the text embedding by regressing per-channel scale, shift, and gate parameters from those embeddings, initialised so each block starts as the identity function and the network gradually learns to do something.
  • The temporal autoencoder is the single most expensive choice in the recipe. Movie Gen’s TAE compresses video by 8×8×88 \times 8 \times 8 across (T,H,W)(T, H, W) with C=16C=16 latent channels; this collapses a 768x768 256-frame clip from 1.5×108\approx 1.5 \times 10^8 raw pixel-channel values to a latent tensor of 32×96×96×164.4×10632 \times 96 \times 96 \times 16 \approx 4.4 \times 10^6 entries, a 33x reduction.
  • Variable-resolution, variable-aspect-ratio, variable-duration training is the engineering pattern Sora popularised. It avoids per-resolution data bucketing and lets a single model serve a 4-second vertical clip and a 60-second landscape clip from the same checkpoint.
  • Re-captioning, not better text-encoder choice, is what unlocks prompt fidelity. Sora’s report attributes its strong instruction-following to applying the DALL-E-3 re-captioning trick to the training videos: a captioning model writes long, descriptive captions, the diffusion model trains on those, and at inference a separate prompt-expansion model rewrites short user prompts into the same descriptive register.
  • Native multimodal generation is the 2025 frontier. Veo 3’s claim is that audio is generated in the same transformer pass as video, not as a separate post-hoc model; Movie Gen ships text-to-video and text-to-audio as separate models in the same family, so the joint-audio-and-video diffusion pass is Veo 3’s distinct contribution.

Pipeline overview. All four artefacts share a training-time outline: encode a clip xx into a compact latent tensor z=E(x)z = E(x) using a pre-trained spatiotemporal autoencoder; patchify zz into a sequence of spacetime tokens; sample a timestep tt; form a noisy input ztz_t by interpolating between zz and Gaussian noise (or by adding noise per a DDPM schedule); run the DiT to predict either the noise or the flow-matching velocity field; minimise squared error. At inference, all four start from pure Gaussian noise, iteratively denoise via a numerical ODE or SDE solver, and decode the final latent back to a video via the autoencoder decoder. Text conditioning enters via cross-attention layers; timestep conditioning enters via adaptive layer-normalisation modulation parameters regressed from the timestep embedding.

Section 2.5 — Glossary

TermPlain-English explanationFirst appears in
Diffusion modelA generative model that learns to invert a noise-adding process: at training it sees increasingly noisy versions of real data and learns to predict the noise; at sampling it starts from pure noise and gradually denoises.Section 3
Forward / reverse processForward: the fixed recipe that progressively adds Gaussian noise to a clean clip until it becomes static. Reverse: the learned chain of denoising steps the network performs at sampling time.Section 3
Timestep ttAn integer (DDPM-style) or continuous value in [0,1][0, 1] (flow-matching style) indexing how noisy the current state is. t=0t=0 is clean data; t=Tt=T or t=1t=1 is pure noise.Section 3
Latent spaceA compressed representation of the input produced by an autoencoder; the diffusion model operates here rather than in pixel space, cutting compute by orders of magnitude.Section 3
Autoencoder E,DE, DA pair of neural networks. EE compresses input video to a small latent tensor; DD decompresses a latent back to pixels. Trained once, frozen during diffusion training.Section 3
PatchifyCut a 2D image or 3D video into fixed-size non-overlapping patches and flatten each patch into a token. The transformer then treats the patches as a sequence.Section 5
Spacetime patchA 3D patch covering a small region in space AND a short window in time, used by Sora and Movie Gen to tokenise video.Section 5
DiT blockA transformer block that processes patchified-latent tokens with timestep and text conditioning. The basic building block of every model in this review.Section 5
adaLN / adaLN-ZeroAdaptive layer-normalisation: the scale and shift parameters of LayerNorm are predicted from a conditioning vector. “Zero” initialises a per-channel gate at zero so each block starts as the identity function.Section 5
RoPERotary positional embedding: a way to inject positional information into attention by rotating query and key vectors as a function of their position.Section 5
Flow matchingA training objective that replaces noise-prediction with prediction of a velocity field interpolating between data and noise; produces a deterministic ODE at sampling time.Section 6
Re-captioningTraining a captioning model first, then using it to write long descriptive captions for training videos, then training the diffusion model on those captions. Boosts prompt-following.Section 8
Classifier-free guidanceA sampling-time trick that strengthens the model’s conditioning by interpolating between conditional and unconditional predictions.Section 6
TAETemporal AutoEncoder: the spatiotemporal autoencoder Movie Gen and similar systems use to compress video before diffusion.Section 5
VMAFA perceptual video-quality metric Netflix open-sourced; used by Movie Gen as one filter in its data-curation pipeline.Section 9
[Analysis] labelThe publication’s own reasoned assessment, distinct from what the paper itself claims.Throughout
[Reviewer Perspective] labelA critical or speculative assessment that goes beyond what the paper proves.Section 11, 12
[Reconstructed] labelContent the publication faithfully reconstructed because the paper only partially disclosed it.Section 5, 8
[External comparison] labelA comparison to prior work or general knowledge outside the paper itself.Section 4, 11
”From the paper:” prefixContent directly supported by the paper’s text, equations, tables, or figures.Throughout

Section 3 — Problem formalisation

Notation table.

SymbolTypeMeaningFirst appears in
xxclip (real)A raw video clip, a tensor of shape (T,H,W,3)(T, H, W, 3) for TT RGB frames at resolution H×WH \times W.This section
z=E(x)z = E(x)latent (real)The autoencoder-compressed latent of xx, shape (T,H,W,C)(T', H', W', C) with TT,HH,WWT' \le T, H' \le H, W' \le W and channel count CC.This section
T,H,WT, H, WintegersFrame count, height, width of the raw clip.This section
T,H,WT', H', W'integersFrame count, height, width of the latent tensor. For Movie Gen T=T/8,H=H/8,W=W/8T'=T/8, H'=H/8, W'=W/8.This section
CCintegerLatent channel dimension; Movie Gen uses C=16C=16.Section 5
ps,ptp_s, p_tintegersPatch sizes along spatial and temporal axes. Movie Gen uses ps=2,pt=1p_s=2, p_t=1 on top of the TAE.Section 5
NNintegerToken-sequence length after patchification, N=THW/(ps2pt)N = T' \cdot H' \cdot W' / (p_s^2 \cdot p_t).Section 5
ddintegerTransformer hidden dimension. DiT-XL: d=1152d=1152. Movie Gen 30B: d=6144d=6144.Section 5
LLintegerNumber of transformer layers (depth). DiT-XL: L=28L=28. Movie Gen 30B: L=48L=48.Section 5
ttscalarDiffusion timestep. DDPM: t{1,...,1000}t \in \{1, ..., 1000\} integer. Flow matching: t[0,1]t \in [0, 1] continuous.Section 6
ϵ\epsilontensorStandard Gaussian noise sampled at training. Same shape as zz.Section 6
ztz_ttensorNoisy latent at timestep tt; under flow matching, zt=tϵ+(1(1σmin)t)zz_t = t \cdot \epsilon + (1 - (1-\sigma_{\min}) t) \cdot z.Section 6
VtV_ttensorThe flow-matching velocity-field target, Vt=ϵ(1σmin)zV_t = \epsilon - (1 - \sigma_{\min}) z.Section 6
ccembeddingThe text conditioning embedding, produced by a frozen text encoder (CLIP, T5, or similar).Section 5
θ\thetaparametersThe DiT model parameters.Section 6
fθ(zt,t,c)f_\theta(z_t, t, c)functionThe DiT’s prediction; either noise ϵ\epsilon (DDPM-style) or velocity VtV_t (flow-matching style).Section 6

Formal problem statement. Given a corpus of (x,c)(x, c) pairs where xx is a video clip and cc is a text caption describing it, learn a parametric distribution pθ(zc)p_\theta(z \mid c) in latent space such that:

zpθ(c),xgenerated=D(z)z \sim p_\theta(\cdot \mid c), \quad x_{\text{generated}} = D(z)

produces high-quality, prompt-faithful, temporally-coherent video. The decoded xgeneratedx_{\text{generated}} should be a plausible sample from the conditional video distribution p(xc)p(x \mid c) implied by the training data. The latent representation z=E(x)z = E(x) comes from a separately-trained spatiotemporal autoencoder (E,D)(E, D) that is frozen during diffusion training.

Input space. Variable-shape video tensors. Sora is the first to make non-fixed resolution a first-class input; Movie Gen documents the same idea concretely. A training batch can contain clips of mixed durations, resolutions, and aspect ratios; the transformer processes each clip’s spacetime-patch sequence and the positional encoding handles the shape.

Output space. Same shape as input. A 256-frame 768x768 input produces a 256-frame 768x768 output. Sora and Veo both let the user specify resolution and duration at inference.

Objective. For DiT (image-only): match the DDPM noise-prediction loss L=Et,x0,ϵ[midϵϵθ(xt,t,c)mid2]L = \mathbb{E}_{t, x_0, \epsilon}[\\mid \epsilon - \epsilon_\theta(x_t, t, c)\\mid ^2]. For Movie Gen and (per its public framing) Veo: match the flow-matching velocity field L=Et,x0,ϵ[midVtfθ(zt,t,c)mid2]L = \mathbb{E}_{t, x_0, \epsilon}[\\mid V_t - f_\theta(z_t, t, c)\\mid ^2].

Constraints and assumptions.

  • The pre-trained autoencoder (E,D)(E, D) is high-fidelity enough that pixel quality is bottlenecked by the diffusion model, not the decoder. [Analysis] Movie Gen reports the TAE separately and emphasises reconstruction-quality benchmarks for this reason.
  • The text encoder cϕc_\phi supplies an embedding faithful enough that text conditioning is informative. Movie Gen uses MetaCLIP, ByT5, and Long-prompt MetaCLIP in combination; Sora is opaque on this point but uses an internal text encoder.
  • Training data is heavily curated: visual-quality filters, motion filters, scene-detection filters, OCR-based text removal, deduplication, and concept-balancing resampling are all applied before training.
  • [Analysis] Potentially strong assumption: the joint distribution p(x,c)p(x, c) implied by the curated corpus is representative enough of the use-case distribution at inference. This is the core training-distribution-to-deployment-distribution generalisation assumption that every supervised generative model leans on, and is the failure mode all of Sora’s stated limitations (object permanence, physical-law violations) ultimately reduce to.

Complexity arguments. The dominant compute cost at training time is the attention operation, O(N2d)O(N^2 \cdot d) per DiT layer per training step. For Movie Gen the worst-case NN is roughly 73,72873{,}728 (a 768x768 256-frame clip at 8×8×88\times 8\times 8 TAE compression and 2×2×12 \times 2 \times 1 patchification), which makes full quadratic attention expensive but tractable; Movie Gen reports parallelisation across tensor, context, and sequence dimensions to handle the cost. [Analysis] The cubic scaling of NN with resolution (since NTHWN \propto T' H' W') is the binding constraint on producing longer clips; doubling clip duration doubles NN and quadruples the attention budget.

If multimodal (Movie Gen, Veo 3). Audio is processed as a separate token stream. Movie Gen ships a separate text-to-audio model; Veo 3 generates audio in the same transformer pass. For Veo 3 the public framing is that the transformer attends jointly to “visual spacetime patches and temporal audio information”; no architectural detail is disclosed.

Section 4 — Motivation and gap

The pre-DiT generative-video literature was dominated by U-Net-based latent diffusion (the architecture Stable Diffusion popularised for images), extended to video either by adding 3D convolutions (Imagen Video, Make-a-Video) or by inserting temporal attention blocks between spatial blocks. Two structural problems with this template:

  1. The U-Net does not scale cleanly. Beyond ~1B parameters the U-Net’s parameter-to-compute ratio worsens, the skip-connection bandwidth becomes a bottleneck, and the architecture’s hard inductive bias toward local-spatial processing fights with the transformer-style attention every team eventually wanted to add. [External comparison] The same pattern played out in image classification, where ViT eventually supplanted ConvNets at scale.
  2. U-Nets and per-resolution training do not mix. Random crop / fixed-resolution training is the default; running the same U-Net at multiple resolutions or aspect ratios requires per-resolution heads or per-resolution fine-tunes. For video this matters more than for images because the use-case distribution spans vertical phone clips, landscape 16:9 clips, and square social-media clips.

DiT addressed problem (1): a Vision-Transformer-style block on patchified latents, with adaLN-Zero modulation for timestep conditioning, beats the U-Net on FID at matched FLOPs on class-conditional ImageNet. Sora addressed problem (2): spacetime patches with positional encoding make resolution and duration first-class inputs the transformer can handle in a single batch.

Real-world stakes. Pre-2024 generative-video output was characterised by 2-4-second clips at fixed resolutions, weak prompt-following, and frequent temporal incoherence (objects flickering, faces morphing). The Sora announcement landed because the demo videos showed minute-long clips with strong subject consistency, varied aspect ratios, and (apparently) plausible physical behaviour for the duration of the clip. [Analysis] Sora’s release was the cultural inflection point at which “video generation” graduated from research demo to product category.

Gap each paper claims to fill.

  • DiT: a scaling-friendly transformer alternative to the U-Net for diffusion image generation. The paper’s framing is architectural, not video-specific.
  • Sora: “world simulator” framing covers video models as a path to general physical understanding. The technical report explicitly positions Sora as a step toward “AGI for the physical world”. [Reviewer Perspective] The framing is editorial; the technical contributions (spacetime patches, variable-resolution training, re-captioning) are narrower and clearly described, while the “world simulator” claim is aspirational and harder to evaluate.
  • Movie Gen: the full recipe at frontier scale, with documentation that prior frontier models (Sora, Veo) had withheld. The paper’s value proposition is the technical disclosure itself.
  • Veo 2/3: deployed product-quality video generation, with Veo 3 specifically addressing the audio-synchronisation gap.

Position in the broader landscape. [External comparison] Concurrent and adjacent work includes Stable Video Diffusion (Stability AI, 2023), Imagen Video (Google, 2022), Make-a-Video (Meta, 2022), Phenaki (Google, 2022), Kling (Kuaishou, 2024), Runway Gen-3 (Runway, 2024), and Pika (2023-2024). The DiT-Sora-Movie Gen-Veo trajectory is the dominant-architecture branch; alternative branches (autoregressive video tokens, masked-token modelling, cascaded super-resolution U-Nets) remain active but lower-publicity.

Section 5 — Method overview

This section walks the architectural spine shared across all four artefacts. The components below are introduced once and reused per-paper in later sections.

5.1 The DiT block (Peebles–Xie)

Plain-English intuition. Take a small image, cut it into non-overlapping square patches (say 2×22 \times 2 pixels each), flatten each patch into a vector, and feed the resulting sequence into a transformer. The transformer’s attention layer lets every patch “see” every other patch so the model can capture long-range structure (a corner of an image conditioning the opposite corner) that a U-Net struggles with. The diffusion-specific addition is that each transformer block receives a timestep embedding telling it how noisy the current input is, and a class-or-text embedding telling it what to generate; both inject via modulation parameters that scale and shift the layer-normalisation output.

Exact mechanism. Given an input latent zRH×W×Cz \in \mathbb{R}^{H' \times W' \times C} (the autoencoder’s output), DiT does the following:

  1. Patchify. Cut zz into H/pW/pH'/p \cdot W'/p patches of shape (p,p,C)(p, p, C); flatten each into a length-p2Cp^2 C vector; project to dimension dd via a learned linear map. Add a learned positional embedding. Result: a sequence of N=HW/p2N = H' W'/p^2 tokens of dimension dd.
  2. Compute conditioning embedding. Sum the timestep embedding (a sinusoidal tt-embedding passed through an MLP) and the class embedding (or text embedding). Call this hRdh \in \mathbb{R}^d.
  3. DiT blocks. Repeat LL times: for each block, regress per-channel scale γ\gamma, shift β\beta, and (in adaLN-Zero) gate α\alpha parameters from hh via a linear map. Apply layer-norm to the token sequence, then modulate: LN(z)(1+γ)+β\text{LN}(z) \cdot (1 + \gamma) + \beta. Run multi-head self-attention. Gate the residual by α\alpha. Same pattern for the FFN sub-block.
  4. Final layer. Layer-norm, modulate one last time, linear-project back to patch space, un-patchify. The output has the same shape as zz and is interpreted as predicted noise.

adaLN-Zero specifically. The “Zero” qualifier means the linear map producing the gate α\alpha is initialised to zero, so at initialisation each DiT block is the identity function and the residual stream passes through unchanged. [Analysis] This is the same trick that ControlNet uses for stable conditioning and that ResNet uses with batch-norm γ\gamma initialised to zero; the residual gate is reliably the “safe” parameter to zero-init when adding new modules to a deep stack.

DiT model sizes.

ModelLayers LLHidden ddHeadsParamsGflops (256x256, p=4p=4)
DiT-S12384633M1.4
DiT-B1276812130M5.6
DiT-L24102416458M19.7
DiT-XL28115216675M29.1

Connection to full pipeline. The DiT block is the only block. Everything else (patchification, conditioning, sampler) wraps around it.

Design rationale and tradeoffs. AdaLN-Zero strictly outperformed in-context (just appending the conditioning as extra tokens), cross-attention (a separate conditioning sequence), and plain adaLN (without the zero-init gate) in DiT’s ablation. The choice has been carried into nearly every successor video model.

Novelty: [New] DiT was the first paper to demonstrate transformer-on-latent diffusion at scale with adaLN-Zero modulation.

5.2 Spacetime patches (Sora)

Sora’s main architectural disclosure is the move from 2D patches to 3D spacetime patches.

Plain-English intuition. Cut a video clip into 3D boxes, each spanning a small spatial region AND a short window of frames. Each box becomes a single token the transformer processes. The transformer now attends across both space (one corner of a frame to the other) and time (frame 1 to frame 64) in the same self-attention operation. The implication: temporal coherence is enforced by the attention mechanism itself, not by a separate temporal-attention block bolted onto a spatial U-Net.

Mechanism (reconstructed from Sora’s disclosure). [Reconstructed]

  1. Spatiotemporal autoencoder. Compress raw video by both spatial and temporal factors. Sora’s report does not disclose the compression ratio; Movie Gen later disclosed 8×8×88 \times 8 \times 8 as the working choice.
  2. Spacetime patchify. Cut the latent tensor zRT×H×W×Cz \in \mathbb{R}^{T' \times H' \times W' \times C} into 3D patches of shape (pt,ps,ps,C)(p_t, p_s, p_s, C); flatten each to a vector of length ptps2Cp_t p_s^2 C; linear-project to dimension dd. Movie Gen uses pt=1,ps=2p_t=1, p_s=2 explicitly.
  3. Add positional encoding. Sora uses positional encoding that handles variable (T,H,W)(T', H', W'), exact form not disclosed; Movie Gen uses rotary positional embedding (RoPE) generalised to 3D.
  4. Feed the resulting token sequence into a DiT. Same adaLN-Zero modulation, same self-attention, same FFN.

Sequence-length math. For a Movie Gen 768x768 256-frame clip:

  • TAE output: T=256/8=32T' = 256/8 = 32, H=W=768/8=96H' = W' = 768/8 = 96, C=16C = 16.
  • Patchify with ps=2,pt=1p_s = 2, p_t = 1: token count N=3296/296/2=324848=73,728N = 32 \cdot 96/2 \cdot 96/2 = 32 \cdot 48 \cdot 48 = 73{,}728.

The “73K tokens” headline Movie Gen reports is this calculation.

Design rationale. Variable-shape input is supported by construction: a 256x256 4-second clip and a 768x768 16-second clip produce different NN, but the same DiT processes both. The transformer’s attention naturally handles variable sequence lengths; positional encoding handles variable shapes.

Novelty: [Adapted] Spacetime patches generalise Vision-Transformer 2D patches to 3D. Pre-Sora video transformers used either flat 2D-then-temporal-attention architectures or 3D convolutions. Sora’s contribution is using a single 3D patchification + a single transformer attending to the full spacetime sequence.

5.3 The temporal autoencoder TAE (Movie Gen)

[New] documentation of an architecture Sora hinted at but did not specify.

Plain-English intuition. Compress video before the diffusion model sees it. The compression network sees raw RGB frames; the output is a small 4D tensor with most of the temporal and spatial redundancy squeezed out. The diffusion model operates on this compact representation; at inference the decoder expands the generated latent back to pixels.

Architecture. Movie Gen inflates a standard 2D image VAE by inserting a 1D temporal convolution after each 2D spatial convolution and a 1D temporal attention after each 2D spatial attention. The result is a 3D-aware autoencoder that produces a latent tensor with compression ratio 8×8×88 \times 8 \times 8 in (T,H,W)(T, H, W) and C=16C=16 channels.

Why C=16C=16? Movie Gen reports an ablation showing C=16C=16 outperforms C=8C=8 for reconstruction quality at the same compute budget. [Analysis] The choice aligns with the Stable Diffusion 3 ablation, which also moved to C=16C=16 for its latent.

Connection to full pipeline. The TAE is trained first, frozen, and then used as the input-output bridge for diffusion training. Pixel-quality artefacts in the final video can come from either the diffusion model or the TAE decoder; Movie Gen reports TAE benchmarks separately.

5.4 The Movie Gen 30B backbone

[New] disclosure of the full backbone configuration.

ComponentValue
Total parameters30B
Layers LL48
Hidden dim dd6144
FFN hidden dim16384
Attention heads48
Max context length73,728\approx 73{,}728 tokens
NormalisationRMSNorm
ActivationSwiGLU
Positional encodingRoPE (3D-generalised)
Attention patternFull bidirectional
Text conditioningCross-attention layers
Timestep conditioningAdaptive layer-norm
Bias termsNone

The backbone closely follows the Llama 3 architecture, with three modifications: cross-attention layers for text conditioning, adaLN for timestep conditioning, and full bidirectional attention replacing Llama’s causal attention. [Analysis] This is the most transparent disclosure in the lineage. Sora’s report makes no architectural disclosures of comparable specificity, and the Veo writeups make none at all.

5.5 Veo 2/3 architecture (mostly withheld)

Public disclosure: Veo is a “diffusion-based” video model; Veo 3 generates audio in the same diffusion pass. No architectural detail is published. [Reviewer Perspective] The conservative inference is that Veo 3 uses a DiT-family backbone (since every other 2024-2025 frontier video model does), with some form of joint visual-audio token sequence; the precise tokenisation of audio, the relative loss weighting, and the conditioning structure are unknown to readers outside DeepMind.

Section 6 — Mathematical contributions

The math in this lineage condenses into a small number of equations, each appearing in multiple papers with minor reparameterisations.

MATH ENTRY [1]: DDPM noise-prediction loss

  • Source: Ho, Jain, Abbeel (DDPM, arXiv:2006.11239) Eq. 14; used by DiT.
  • What it is: a one-line training objective. Sample a clean image, sample a timestep, add scheduled noise, ask the network to predict the noise; minimise squared error.
  • Formal definition:

Lsimple(θ)=Et,x0,ϵ[ϵϵθ(xt,t)2]L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right]

where xt=αˉtx0+1αˉtϵx_t = \sqrt{\bar\alpha_t} \cdot x_0 + \sqrt{1 - \bar\alpha_t} \cdot \epsilon and αˉt\bar\alpha_t is the cumulative noise schedule.

  • Term-by-term:
    • x0x_0: a clean training sample (image or latent), tensor of shape matching the model’s input.
    • ϵ\epsilon: standard Gaussian noise, same shape as x0x_0.
    • tt: timestep sampled uniformly from {1,...,T}\{1, ..., T\}.
    • αˉt\bar\alpha_t: a scalar in (0,1)(0, 1); close to 1 at small tt (mostly clean), close to 0 at large tt (mostly noise).
    • xtx_t: the noisy input the network sees, same shape as x0x_0.
    • ϵθ(xt,t)\epsilon_\theta(x_t, t): the network’s noise prediction, same shape as ϵ\epsilon.
  • Worked numerical example. Take x0R4x_0 \in \mathbb{R}^{4} as a tiny “image” [0.5,0.2,0.8,0.1][0.5, -0.2, 0.8, 0.1], ϵ=[1.0,1.0,0.5,0.5]\epsilon = [1.0, -1.0, 0.5, -0.5], t=500t=500, αˉ500=0.36\bar\alpha_{500} = 0.36. Then αˉt0.60,1αˉt0.80\sqrt{\bar\alpha_t} \approx 0.60, \sqrt{1 - \bar\alpha_t} \approx 0.80, so xt=0.60[0.5,0.2,0.8,0.1]+0.80[1.0,1.0,0.5,0.5]=[1.10,0.92,0.88,0.34]x_t = 0.60 \cdot [0.5, -0.2, 0.8, 0.1] + 0.80 \cdot [1.0, -1.0, 0.5, -0.5] = [1.10, -0.92, 0.88, -0.34]. The network sees xtx_t and t=500t=500, must predict ϵ\epsilon. If it predicts ϵ^=[0.9,1.1,0.4,0.4]\hat\epsilon = [0.9, -1.1, 0.4, -0.4], the per-element squared error is [0.01,0.01,0.01,0.01][0.01, 0.01, 0.01, 0.01], mean 0.010.01.
  • Role: standard objective for DDPM-style diffusion models. DiT uses this loss verbatim.
  • Edge cases: at t=Tt=T the input is pure noise, xtϵx_t \approx \epsilon, and the optimal prediction is trivially ϵ^=ϵ\hat\epsilon = \epsilon; the loss is near-zero. At t=1t=1 the input is nearly clean and the loss is the most informative.
  • Novelty: [Adopted] by DiT from DDPM.
  • Transferability: [Analysis] reusable wherever a Gaussian-noise forward process is appropriate. Replaced by flow-matching variants in newer models (Movie Gen, Stable Diffusion 3, Flux).
  • Why it matters: the conceptual atom of every diffusion model in this lineage.

MATH ENTRY [2]: Flow-matching velocity objective

  • Source: Movie Gen Section 3.1 (arXiv:2410.13720); Lipman et al. (arXiv:2210.02747) for the underlying framework.
  • What it is: the modern replacement for the DDPM noise-prediction loss. The network predicts a velocity field that transports samples from the data distribution to a Gaussian; at sampling time the velocity field is integrated as an ODE.
  • Formal definition. The interpolation path:

zt=tϵ+(1(1σmin)t)z0z_t = t \cdot \epsilon + (1 - (1 - \sigma_{\min}) \cdot t) \cdot z_0

The velocity target:

Vt=ϵ(1σmin)z0V_t = \epsilon - (1 - \sigma_{\min}) \cdot z_0

The loss:

LFM(θ)=Et,z0,ϵ[Vtfθ(zt,t,c)2]L_{\text{FM}}(\theta) = \mathbb{E}_{t, z_0, \epsilon}\left[\| V_t - f_\theta(z_t, t, c) \|^2\right]

where t[0,1]t \in [0, 1] continuous, σmin\sigma_{\min} a small constant (Movie Gen uses σmin=105\sigma_{\min} = 10^{-5}).

  • Term-by-term:
    • z0z_0: a clean latent (the TAE output of a training clip), tensor of shape (T,H,W,C)(T', H', W', C).
    • ϵ\epsilon: standard Gaussian noise, same shape as z0z_0.
    • tt: a continuous timestep sampled from a distribution Movie Gen specifies as logit-normal.
    • ztz_t: the interpolated noisy latent at time tt. At t=0t=0, zt=z0z_t = z_0 (clean); at t=1t=1, ztϵz_t \approx \epsilon (noise) up to the σmin\sigma_{\min} correction.
    • VtV_t: the velocity-field target. With σmin\sigma_{\min} near zero this simplifies to Vtϵz0V_t \approx \epsilon - z_0, the constant-velocity straight-line interpolation between data and noise.
    • fθ(zt,t,c)f_\theta(z_t, t, c): the DiT’s predicted velocity, same shape as VtV_t.
  • Worked numerical example. Take z0=[0.5,0.2,0.8,0.1]z_0 = [0.5, -0.2, 0.8, 0.1], ϵ=[1.0,1.0,0.5,0.5]\epsilon = [1.0, -1.0, 0.5, -0.5], t=0.4t = 0.4, σmin=105\sigma_{\min} = 10^{-5} (so 1σmin11 - \sigma_{\min} \approx 1). Then zt=0.4[1.0,1.0,0.5,0.5]+0.6[0.5,0.2,0.8,0.1]=[0.70,0.52,0.68,0.14]z_t = 0.4 \cdot [1.0, -1.0, 0.5, -0.5] + 0.6 \cdot [0.5, -0.2, 0.8, 0.1] = [0.70, -0.52, 0.68, -0.14]. The velocity target is Vt=ϵz0=[0.5,0.8,0.3,0.6]V_t = \epsilon - z_0 = [0.5, -0.8, -0.3, -0.6]. If the network predicts V^t=[0.4,0.9,0.25,0.5]\hat V_t = [0.4, -0.9, -0.25, -0.5], per-element squared error is [0.01,0.01,0.0025,0.01][0.01, 0.01, 0.0025, 0.01], mean 0.008\approx 0.008.
  • Role: training objective for Movie Gen. [Reconstructed] near-certainly the training objective for Veo 2/3 given the broader 2024-2025 industry move, but DeepMind does not disclose.
  • Edge cases: at t=0t=0 the path is at the data and the velocity target is V0=ϵz0V_0 = \epsilon - z_0; the model still has to predict this from zt=z0z_t = z_0 alone, with no information about ϵ\epsilon. The model effectively learns the conditional expectation of the velocity given the current state.
  • Novelty: [Adopted] Movie Gen adopts flow matching from Lipman et al. [Adapted] the specific logit-normal timestep distribution and σmin\sigma_{\min} values are Movie-Gen-specific choices.
  • Transferability: [Analysis] universally applicable; this is the 2024-2025 default objective for high-end diffusion-style generative models.
  • Why it matters: produces a deterministic ODE sampling path, supports fewer sampling steps than DDPM, and admits cleaner classifier-free-guidance scaling.

MATH ENTRY [3]: adaLN-Zero modulation

  • Source: DiT (arXiv:2212.09748) Section 3.
  • What it is: the recipe by which each DiT block consumes the timestep and class (or text) embeddings. Three sets of scalars per channel (scale, shift, gate) are regressed from the conditioning vector and used to modulate the post-LayerNorm activations and the residual stream.
  • Formal definition. Let hRdh \in \mathbb{R}^d be the sum of the timestep and class embeddings. Predict six per-channel modulation vectors via linear maps:

γ1,β1,α1,γ2,β2,α2Rd=Linear(h)\gamma_1, \beta_1, \alpha_1, \gamma_2, \beta_2, \alpha_2 \in \mathbb{R}^d = \text{Linear}(h)

Then the block computes:

y=x+α1MHSA(LN(x)(1+γ1)+β1)y = x + \alpha_1 \odot \text{MHSA}\left(\text{LN}(x) \odot (1 + \gamma_1) + \beta_1\right)

z=y+α2FFN(LN(y)(1+γ2)+β2)z = y + \alpha_2 \odot \text{FFN}\left(\text{LN}(y) \odot (1 + \gamma_2) + \beta_2\right)

where MHSA\text{MHSA} is multi-head self-attention, FFN\text{FFN} is the feed-forward sub-block, LN\text{LN} is layer-norm, \odot is element-wise multiplication. The linear maps producing α1,α2\alpha_1, \alpha_2 are initialised to zero.

  • Term-by-term + dimensional analysis:
    • xx: token sequence, tensor of shape (N,d)(N, d).
    • hh: conditioning vector, length dd.
    • γ,β,α\gamma, \beta, \alpha: each length dd, broadcast across the NN tokens.
    • LN(x)\text{LN}(x): layer-norm across the dd axis, shape (N,d)(N, d).
    • LN(x)(1+γ1)+β1\text{LN}(x) \odot (1 + \gamma_1) + \beta_1: per-channel affine modulation, shape (N,d)(N, d).
    • MHSA()\text{MHSA}(\cdot): self-attention, shape (N,d)(N, d).
    • yy: residual output, shape (N,d)(N, d).
  • Worked numerical example. Take d=4d=4, N=2N=2. Suppose x=(0.50.51.00.00.30.70.20.8)x = \begin{pmatrix} 0.5 & -0.5 & 1.0 & 0.0 \\ 0.3 & 0.7 & -0.2 & 0.8 \end{pmatrix}. After LN per row: row 1 mean 0.250.25, std 0.56\approx 0.56; row 1 LN [0.45,1.34,1.34,0.45]\approx [0.45, -1.34, 1.34, -0.45]. Suppose γ1=[0.1,0.2,0.3,0.4],β1=[0.0,0.0,0.0,0.0]\gamma_1 = [0.1, 0.2, 0.3, 0.4], \beta_1 = [0.0, 0.0, 0.0, 0.0]. Then modulated row 1 is [0.451.1,1.341.2,1.341.3,0.451.4][0.50,1.61,1.74,0.63][0.45 \cdot 1.1, -1.34 \cdot 1.2, 1.34 \cdot 1.3, -0.45 \cdot 1.4] \approx [0.50, -1.61, 1.74, -0.63]. At initialisation α1=0\alpha_1 = 0 exactly, so the entire MHSA contribution is gated out; the block produces y=xy = x.
  • Role: the conditioning mechanism for every model in this review. Movie Gen uses adaLN for timestep; cross-attention for text.
  • Edge cases: α=0\alpha = 0 at initialisation means the block is the identity function; gradients flow unobstructed through the residual; the block only starts to do something once α\alpha moves off zero. This is what makes deep stacks (DiT-XL has 28 blocks; Movie Gen 30B has 48) train stably from scratch.
  • Novelty: [New] in DiT in this exact form, though the zero-init residual gate is a generalisation of the BatchNorm-γ\gamma-zero-init trick from much earlier work.
  • Transferability: [Analysis] universally used. Every video model in this review uses adaLN or adaLN-Zero for timestep conditioning.
  • Why it matters: makes deep transformer stacks trainable for diffusion. Without it the DiT does not scale; with it the FLOPs-vs-FID curve becomes monotonic and predictable.

MATH ENTRY [4]: Classifier-free guidance scaling

  • Source: Ho, Salimans (CFG, 2022) for the underlying mechanism; applied by all four artefacts at inference.
  • What it is: the sampling-time trick that strengthens text conditioning by interpolating between conditional and unconditional predictions of the model.
  • Formal definition. At inference, given current noisy latent ztz_t, text embedding cc, and a “null” embedding \emptyset representing unconditional generation:

f~(zt,t,c)=(1+w)fθ(zt,t,c)wfθ(zt,t,)\tilde f(z_t, t, c) = (1 + w) \cdot f_\theta(z_t, t, c) - w \cdot f_\theta(z_t, t, \emptyset)

where w0w \ge 0 is the guidance weight. w=0w = 0 recovers conditional generation; larger ww pushes the prediction further away from the unconditional and toward the conditional.

  • Term-by-term:
    • fθ(zt,t,c)f_\theta(z_t, t, c): the conditional prediction (noise or velocity).
    • fθ(zt,t,)f_\theta(z_t, t, \emptyset): the unconditional prediction. Training drops the text conditioning with some probability so the same model can produce both.
    • ww: typically in [1,10][1, 10] for diffusion image models; Movie Gen reports ww tuned per-stage.
  • Worked numerical example. Take fθ(zt,t,c)=[0.5,0.3,0.8,0.1]f_\theta(z_t, t, c) = [0.5, -0.3, 0.8, 0.1], fθ(zt,t,)=[0.4,0.0,0.6,0.2]f_\theta(z_t, t, \emptyset) = [0.4, 0.0, 0.6, 0.2], w=2w = 2. Then f~=3[0.5,0.3,0.8,0.1]2[0.4,0.0,0.6,0.2]=[0.7,0.9,1.2,0.1]\tilde f = 3 \cdot [0.5, -0.3, 0.8, 0.1] - 2 \cdot [0.4, 0.0, 0.6, 0.2] = [0.7, -0.9, 1.2, -0.1]. The guided prediction is pushed further from the unconditional toward the direction the conditional emphasised.
  • Role: standard inference-time mechanism for all diffusion-based video models. Sora, Movie Gen, and (by inference) Veo all use it. Movie Gen specifically reports per-stage CFG weight tuning.
  • Edge cases: at ww large the samples become high-contrast and over-saturated; quality-fidelity-vs-diversity tradeoff is the classic CFG knob.
  • Novelty: [Adopted] from earlier work.
  • Why it matters: a single hyperparameter controls how strictly the model follows the text prompt; the entire prompt-fidelity-vs-diversity tradeoff sits here.

MATH ENTRY [5]: Attention compute scaling for video

  • Source: [Analysis] derived from Movie Gen’s reported model and sequence dimensions (arXiv:2410.13720).
  • What it is: the back-of-envelope FLOPs estimate for the dominant attention operation in a video DiT.
  • Formal definition. Per transformer layer per forward pass:

FLOPsattn4N2d+2Nd2\text{FLOPs}_{\text{attn}} \approx 4 \cdot N^2 \cdot d + 2 \cdot N \cdot d^2

The first term is the QKTQK^T and attention-weighted VV computation; the second is the linear projections. For Movie Gen 30B at full context: N=73,728N = 73{,}728, d=6144d = 6144, so N2d3.3×1013N^2 \cdot d \approx 3.3 \times 10^{13} and the attention compute per layer is 1.3×1014\approx 1.3 \times 10^{14} FLOPs; over 48 layers, 6.4×1015\approx 6.4 \times 10^{15} FLOPs per forward pass. [Analysis]

  • Term-by-term:
    • NN: token-sequence length, 73\approx 73K for a 768x768 256-frame clip after TAE compression and patchification.
    • dd: hidden dimension, 6144 for Movie Gen 30B.
    • The N2dN^2 \cdot d term dominates at long sequence lengths; the Nd2N \cdot d^2 term dominates at short sequences.
  • Worked numerical example. Movie Gen training on a 256x256 4-second clip (32 frames at 8fps): T=4,H=W=32T' = 4, H' = W' = 32, N=41616=1024N = 4 \cdot 16 \cdot 16 = 1024 at ps=2,pt=1p_s = 2, p_t = 1. Then N2d1024261446.4×109N^2 \cdot d \approx 1024^2 \cdot 6144 \approx 6.4 \times 10^9 FLOPs per layer for the attention quadratic; the linear term is 21024614427.7×10102 \cdot 1024 \cdot 6144^2 \approx 7.7 \times 10^{10}, which dominates at this scale. For the 768x768 16-second clip the quadratic term dominates by an order of magnitude. The crossover point is N2dN \approx 2d, here N12,000N \approx 12{,}000.
  • Role: explains why Movie Gen’s parallelisation strategy is critical and why per-clip training cost scales nonlinearly with clip length.
  • Why it matters: the cubic scaling of NN with (T,H,W)(T, H, W) is the fundamental constraint on producing longer or higher-resolution clips from a DiT-family model.

[Analysis] Multi-line proofs / theorem statements are not the contribution category in this lineage; Sora, Movie Gen, and Veo are recipe-and-engineering papers, not theorem-and-proof papers. The deepest “proof” content sits in the DDPM and flow-matching parent papers (cited above for context), not in any of the four artefacts under review.

Section 7 — Algorithmic contributions

ALGORITHM ENTRY [1]: DiT training step (headline algorithm)

  • Source: Peebles–Xie, arXiv:2212.09748. Inferred from Section 3 + Algorithm 1 / training-loop description.
  • Purpose: train the DiT denoiser on a batch of clean latents and the DDPM noise-prediction loss.
  • Inputs:
    • z0z_0: batch of clean latents, shape (B,H/p,W/p,d)(B, H'/p, W'/p, d) after patchify.
    • cc: batch of class (or text) embeddings, shape (B,dc)(B, d_c).
  • Outputs: gradient of LsimpleL_{\text{simple}} with respect to model parameters θ\theta; updated θ\theta.
  • Pseudocode (also rendered as PNG below):
# DiT training step (Peebles-Xie 2022)
def dit_training_step(z0_batch, c_batch, model, optimizer):
    B = batch size
    # 1. Sample timesteps uniformly
    t = uniform({1, ..., T}, size=B)
    # 2. Sample standard Gaussian noise
    eps = N(0, I), same shape as z0_batch
    # 3. Form noisy latents per DDPM forward
    alpha_bar = schedule[t]               # shape (B,)
    z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps
    # 4. Compute conditioning embedding
    h = sinusoidal(t) + class_embed(c_batch)
    # 5. Forward pass
    eps_pred = model(z_t, h)             # patchify -> DiT blocks -> unpatchify
    # 6. Loss
    L = mean((eps - eps_pred) ** 2)
    # 7. Backward + update
    L.backward()
    optimizer.step()
    return L
  • Hand-traced example on minimal input. Take B=1B=1, z0z_0 a 2×22 \times 2 “latent” [[0.5,0.2],[0.8,0.1]][[0.5, -0.2], [0.8, 0.1]] (flattened: [0.5,0.2,0.8,0.1][0.5, -0.2, 0.8, 0.1]), c=c = class 7, T=1000T=1000, αˉt\bar\alpha_t from a linear schedule.
    • Step 1: sample t=500t = 500.
    • Step 2: sample ϵ=[1.0,1.0,0.5,0.5]\epsilon = [1.0, -1.0, 0.5, -0.5].
    • Step 3: αˉ500=0.36\bar\alpha_{500} = 0.36, so zt=0.6z0+0.8ϵ=[0.30+0.80,0.120.80,0.48+0.40,0.060.40]=[1.10,0.92,0.88,0.34]z_t = 0.6 \cdot z_0 + 0.8 \cdot \epsilon = [0.30 + 0.80, -0.12 - 0.80, 0.48 + 0.40, 0.06 - 0.40] = [1.10, -0.92, 0.88, -0.34].
    • Step 4: h=sin-cos(500)+embed(7)h = \text{sin-cos}(500) + \text{embed}(7), a length-dd vector.
    • Step 5: DiT forward produces ϵ^=[0.9,1.1,0.4,0.4]\hat\epsilon = [0.9, -1.1, 0.4, -0.4] (illustrative).
    • Step 6: L=mean([(1.00.9)2,(1.0+1.1)2,(0.50.4)2,(0.5+0.4)2])=mean([0.01,0.01,0.01,0.01])=0.01L = \text{mean}([(1.0-0.9)^2, (-1.0+1.1)^2, (0.5-0.4)^2, (-0.5+0.4)^2]) = \text{mean}([0.01, 0.01, 0.01, 0.01]) = 0.01.
    • Step 7: AdamW update on θ\theta from θL\nabla_\theta L.
  • Complexity. Per step: O(N2d)O(N^2 d) attention + O(Nd2)O(N d^2) linear projections per layer per block, ×L\times L blocks. DiT-XL on 256×256256 \times 256 ImageNet: N=256N = 256, d=1152d = 1152, L=28L = 28; total 29.1\approx 29.1 Gflops per forward pass.
  • Hyperparameters. Optimizer: AdamW. Learning rate: 10410^{-4}. EMA: 0.9999 on weights. CFG drop probability: 10%.
  • Failure modes. CFG drop probability too low (<5%<5\%) hurts unconditional sample quality; too high (>20%>20\%) hurts conditional quality. Linear noise schedule produces uneven loss across timesteps; cosine schedules (later work) improved on this.
  • Novelty: [Adapted] training loop from DDPM; the architectural choice (DiT block, adaLN-Zero modulation) is new.
  • Transferability: [Analysis] reused, in flow-matching reparameterised form, by Movie Gen and (per public framing) Veo.
DiT training step pseudocode rendered as a syntax-highlighted code block

Algorithm 1 reconstruction of the DiT training step, from Peebles and Xie — Scalable Diffusion Models with Transformers (arXiv:2212.09748). Reproduced for editorial coverage.

ALGORITHM ENTRY [2]: Movie Gen flow-matching training step

  • Source: Movie Gen Section 3, arXiv:2410.13720. [Reconstructed] in pseudocode form; paper provides equations but not a single-step algorithm listing.
  • Purpose: train the Movie Gen DiT on a batch of clip latents and the flow-matching loss.
  • Inputs:
    • z0z_0: batch of TAE-encoded clip latents, shape (B,T,H,W,C)(B, T', H', W', C).
    • cc: batch of text conditioning, MetaCLIP + ByT5 + Long-prompt MetaCLIP embeddings.
  • Outputs: gradient of LFML_{\text{FM}} with respect to θ\theta; updated θ\theta.
  • Pseudocode:
# Movie Gen training step (Polyak et al. 2024)
def movie_gen_training_step(z0_batch, c_batch, model, optimizer):
    # 1. Sample timesteps from logit-normal
    t = sigmoid(N(0, 1))                # in (0, 1)
    # 2. Sample noise
    eps = N(0, I), same shape as z0_batch
    # 3. Interpolate
    sigma_min = 1e-5
    z_t = t * eps + (1 - (1 - sigma_min) * t) * z0
    # 4. Velocity target
    V_t = eps - (1 - sigma_min) * z0
    # 5. Text conditioning embedding
    c_emb = concat([metaclip(c), byt5(c), long_metaclip(c)])
    # 6. Forward pass: patchify spacetime, DiT blocks with adaLN(t) + cross-attn(c_emb)
    V_pred = model(z_t, t, c_emb)
    # 7. Flow-matching loss
    L = mean((V_t - V_pred) ** 2)
    # 8. Backward + update
    L.backward()
    optimizer.step()
    return L
  • Hand-traced example. Take B=1B=1, z0=[0.5,0.2,0.8,0.1]z_0 = [0.5, -0.2, 0.8, 0.1] (a tiny “clip latent”), text c=c = “a cat”.
    • Step 1: uN(0,1)=0.405u \sim N(0, 1) = 0.405, t=sigmoid(0.405)0.6t = \text{sigmoid}(0.405) \approx 0.6.
    • Step 2: ϵ=[1.0,1.0,0.5,0.5]\epsilon = [1.0, -1.0, 0.5, -0.5].
    • Step 3: zt=0.6ϵ+0.4z0=[0.80,0.68,0.62,0.26]z_t = 0.6 \cdot \epsilon + 0.4 \cdot z_0 = [0.80, -0.68, 0.62, -0.26].
    • Step 4: Vt=ϵz0=[0.5,0.8,0.3,0.6]V_t = \epsilon - z_0 = [0.5, -0.8, -0.3, -0.6].
    • Step 5: cembc_{\text{emb}} is a concatenation of three text-encoder outputs.
    • Step 6: V^t=[0.4,0.9,0.25,0.5]\hat V_t = [0.4, -0.9, -0.25, -0.5] (illustrative).
    • Step 7: L=mean([0.01,0.01,0.0025,0.01])0.008L = \text{mean}([0.01, 0.01, 0.0025, 0.01]) \approx 0.008.
  • Complexity. Per step: attention scales O(N2d)O(N^2 d) with N73N \approx 73K at full clip size, d=6144d = 6144, 48 layers; total 6×1015\approx 6 \times 10^{15} FLOPs per forward pass at 768x768 256-frame scale. [Analysis]
  • Hyperparameters. Optimizer: AdamW with weight decay 0.1. Learning rate: stage-dependent, with cosine decay. Logit-normal timestep distribution. σmin=105\sigma_{\min} = 10^{-5}.
  • Failure modes. Insufficient timestep variance hurts late-stage quality; over-aggressive CFG at inference saturates.
  • Novelty: [Adapted] flow-matching framework from Lipman et al.; logit-normal timestep schedule and text-encoder stack are Movie-Gen-specific.

ALGORITHM ENTRY [3]: Sora’s variable-resolution training (reconstructed)

  • Source: OpenAI Sora technical report; details intentionally withheld.
  • Purpose: train a single DiT that handles arbitrary input shapes.
  • Inputs: a batch of clips of mixed (T,H,W)(T, H, W).
  • Outputs: a single set of model parameters that can sample at any (T,H,W)(T, H, W) within the training distribution.
  • Pseudocode (reconstructed):
# Sora variable-resolution training (reconstructed)
def sora_training_step(clip_batch, captions, model, optimizer):
    # 1. Encode each clip to latent (variable shape per clip)
    z_list = [TAE.encode(clip) for clip in clip_batch]
    # 2. Spacetime patchify each latent
    token_seqs = [spacetime_patchify(z, p_s, p_t) for z in z_list]
    # 3. Pad or pack sequences to common length (implementation choice)
    z_t_list, V_t_list = compute_diffusion_targets(z_list)
    # 4. Forward pass with shape-aware positional encoding
    preds = model(z_t_list, positional_info_per_clip, captions)
    # 5. Loss accumulated over the batch
    L = mean over batch of squared-error losses
    L.backward()
    optimizer.step()
  • Hand-traced example. Take a batch of two clips: clip A is 256×256256 \times 256 4-second (TAE output 4×32×32×164 \times 32 \times 32 \times 16, patchify 2×2×12 \times 2 \times 141616=10244 \cdot 16 \cdot 16 = 1024 tokens); clip B is 768×768768 \times 768 16-second (TAE output 16×96×96×1616 \times 96 \times 96 \times 16, patchify 2×2×12 \times 2 \times 1164848=36,86416 \cdot 48 \cdot 48 = 36{,}864 tokens). Either pad clip A to clip B’s length and mask, or run the two through the same DiT with separate forward passes accumulated into one optimizer step.
  • Complexity. The padding strategy hurts utilisation; the per-clip-forward strategy hurts throughput.
  • Novelty: [New] in Sora as a first-class architectural disclosure for video. The fundamental ideas (NaViT-style packing for ViTs) were concurrent in image-classification research.
  • Transferability: [Analysis] every frontier video model since has adopted some variant; the engineering detail differs across teams.

Section 8 — Specialised design contributions

8A — LLM / prompt design

PROMPT ENTRY [1]: Sora’s re-captioning pipeline

  • Source: Sora technical report.
  • Role: at training time, generate detailed captions for training videos so the diffusion model trains on long descriptive captions rather than short noisy ones. At inference, expand short user prompts into the same descriptive register before they enter the model.
  • Prompt type: meta-prompt (captioning model is a separate trained model whose outputs are training data for the diffusion model).
  • Components: a video-captioning model trained on a smaller hand-labelled set; an inference-time prompt-expansion model (likely GPT-4-class). Specific prompt wording is not disclosed.
  • Reconstructed inference-time template: [Reconstructed]
You are a video-prompt-expansion model. Given a short user prompt,
rewrite it as a long, descriptive video caption in the style of
the captioning model's training data: include camera framing,
lighting, motion, subject details, and scene composition. Do not
introduce facts not implied by the user prompt.

User prompt: <USER_PROMPT>
Expanded caption: <EXPANSION>
  • Failure handling: not disclosed.
  • Design rationale: training-time and inference-time caption distributions are matched, so the model is fluent at the long-descriptive register users naturally underspecify.
  • Complexity: prompt-expansion calls add an LLM call per user prompt; negligible compared to the diffusion-sampling cost.
  • Novelty: [Adapted] from DALL-E 3’s re-captioning approach.
  • Transferability: [Analysis] the recipe is standard in 2024-2025 frontier image and video models.

8B — Architecture-specific details

DiT block variants (in-context, cross-attention, adaLN, adaLN-Zero) covered in Section 5.1. Movie Gen’s Llama-3-style modifications (RMSNorm, SwiGLU, no biases, full bidirectional attention, RoPE) covered in Section 5.4. Veo: not disclosed.

8C — Training specifics

SpecifierDiTSoraMovie GenVeo
Largest modelDiT-XL/2 (675M)Not disclosed30BNot disclosed
Batch size256 (ImageNet)Not disclosedStage-dependentNot disclosed
HardwareA100 clusterNot disclosedH100 cluster (size not disclosed)Not disclosed
Compute (FLOPs)DiT-XL/2: 950\approx 950 Gflops · 7M stepsNot disclosedNot disclosed in headline FLOPsNot disclosed
Training stagesSingle stageMulti-resolution variable-shapeT2I warmup → T2I/V joint at 256px → joint at 768px → supervised fine-tuneNot disclosed
Mixed precisionYesLikelyYesLikely

[Analysis] The disclosure gap between Movie Gen and Sora/Veo is the cleanest case in this lineage of a research-vs-product trade-off. Meta’s choice to publish a 92-page report has educated the field; OpenAI and DeepMind have not.

8D — Inference / deployment specifics

  • DiT: DDIM sampler with 250 steps in the paper; 50 steps at deployment-grade quality. CFG weight 1.5 for the headline result.
  • Sora: not disclosed.
  • Movie Gen: flow-matching ODE solver, exact number of steps not disclosed in the main paper but reported in the order of 50\sim 50 for production samples. CFG weight tuned per stage.
  • Veo: not disclosed.

Section 9 — Experiments and results

9.1 DiT — class-conditional ImageNet

Datasets. ImageNet 256×256256 \times 256 and 512×512512 \times 512, class-conditional.

Baselines. ADM, LDM, CDM, DiffusionGAN, BigGAN-deep, StyleGAN-XL, U-ViT, and others.

Metrics. FID, sFID, IS, Precision, Recall.

Main results (reproduced from Table 4 / Figure 4 of the paper).

ModelImageNet 256x256 FIDImageNet 512x512 FID
ADM-U (prior SoTA at 256256)3.94
LDM-4-G (cfg=1.5)3.60
U-ViT-H2.29
DiT-XL/2 (cfg=1.50)2.273.04

DiT-XL/2 sets state-of-the-art on both resolutions at the time of publication.

Ablations. Four DiT block variants compared: adaLN-Zero strictly beats adaLN, cross-attention, and in-context at matched Gflops. Patch size p=2p=2 beats p=4p=4 and p=8p=8 at the same model size.

9.2 Movie Gen — text-to-video human evaluation

Datasets. Proprietary curated corpus, 100\approx 100M videos at the input to the curation pipeline; 73.8\approx 73.8M at the 768x768 stage.

Baselines. OpenAI Sora, Runway Gen-3, LumaLabs Dream Machine, Kling 1.5.

Metrics. Human A/B win-rate on overall quality, realness, text alignment, motion quality, motion naturalness.

Main results (reproduced from Table 11 / paper Section 3.7). Movie Gen Video net win rate (positive = Movie Gen wins more):

ComparisonOverallRealnessMotion
Movie Gen vs OpenAI Sora+8.23%+35.7%+27.5%
Movie Gen vs Runway Gen-3+35.02%+48.49%+44.84%
Movie Gen vs Kling 1.5+3.87%+20.69%+24.07%
Movie Gen vs LumaLabs+60.58%+61.83%+60.42%

[Reviewer Perspective] These are Meta-side numbers from Meta-curated annotation protocols; treat the magnitudes as ordinal not absolute. The most defensible reading is “Movie Gen is in the same tier as Sora and Kling on overall quality, ahead of Runway Gen-3 and LumaLabs, and visibly stronger on realness than all four.” [External comparison] Independent reproducibility for these comparisons has not been published as of the access date; reproducing them would require API access to all four competitor models on a comparable prompt set, which is not currently feasible at academic scale.

9.3 Sora and Veo — qualitative-only

Sora’s technical report contains no quantitative benchmarks; the evidence is exclusively the demo gallery and stated emergent capabilities (3D consistency, object permanence, simulated digital worlds). [Reviewer Perspective] The “world simulator” framing is unfalsifiable at the report’s level of disclosure. The Sora System Card later added safety-evaluation specifics (red-team results, C2PA watermarking) but no generation-quality benchmarks.

Veo 2 and Veo 3 are similarly disclosure-light. The Veo 3 launch post claims “state-of-the-art” video generation with native audio; no leaderboard, no FID, no human-eval win-rate against named competitors. [Reviewer Perspective] Independent comparisons (community video-arena leaderboards like Artificial Analysis’s Video Generation Arena) place Veo 3 and Sora in the top tier alongside Kling and Runway as of mid-2025, but the publication has not independently reproduced these.

9.4 Independent benchmark cross-check

The publication’s reading of the cross-benchmark landscape, for the SOTA-positioning claims each artefact makes:

  • DiT: peer-reviewed at ICCV 2023; FID numbers reproducible by anyone with the released code on ImageNet. SOTA claim well-grounded. [External comparison] U-ViT-H reaches FID 2.29 concurrently; DiT-XL/2 is within noise of contemporary alternatives.
  • Sora: no quantitative SOTA claim made.
  • Movie Gen: Meta-side win-rates only; no independent reproduction. SOTA claim is “competitive with or ahead of” rather than “categorically ahead of”. [Analysis]
  • Veo 3: no benchmark numbers. Audio-synchronisation claim is qualitatively visible in DeepMind demos; not independently quantified. [Analysis]

9.5 Evidence audit

  • Strongly supported. DiT-XL/2 outperforms U-Net on class-conditional ImageNet at matched FLOPs. Movie Gen’s TAE achieves the stated reconstruction quality. AdaLN-Zero is the best DiT conditioning variant in DiT’s own ablation.
  • Partially supported. Movie Gen win-rates against Sora and Kling on Meta-internal protocols. Sora’s spacetime-patch and variable-resolution claims rely on demo evidence; the technical assertion is supported only at the framing level.
  • Narrow evidence. Sora’s “world simulator” framing. Veo 3’s audio-synchronisation quality at the demo level only.

Section 10 — Technical novelty summary

ComponentTypeNovelty levelJustificationSource
DiT block on patchified latentsArchitectureFully novel (at scale)First scaling-friendly transformer for latent diffusionDiT
adaLN-Zero modulationConditioningCombination novelComposes adaLN with zero-init residual gateDiT
Spacetime patchesTokenisationIncrementally novel3D generalisation of ViT patchesSora
Variable-resolution training for diffusionTrainingIncrementally novelMixes per-resolution NaViT-style ideas with diffusionSora
Re-captioning for videoDataAdoptedDALL-E 3 recipe transplanted to videoSora
TAE with 8×8×88 \times 8 \times 8 compression and C=16C=16ArchitectureCombination novelInflates standard 2D VAE with temporal axes; documents specific choiceMovie Gen
30B DiT with Llama-3-style RMSNorm/SwiGLUArchitectureCombination novelLLM-architecture transplant into video-diffusion DiTMovie Gen
Flow-matching with logit-normal timestepTrainingAdaptedLogit-normal schedule on top of Lipman et al. frameworkMovie Gen
Five-stage data-curation pipelineDataCombination novelSpecific filter stack and ordering documentedMovie Gen
Joint audio-video diffusionArchitectureIncrementally novelFirst production deployment of single-pass A/V diffusionVeo 3

Single most novel contribution. [Analysis] DiT’s adaLN-Zero conditioning. It is the architectural innovation that made deep transformer stacks trainable for diffusion at scale; every other entry in this review either adopts it or extends it. Movie Gen’s TAE documentation is the most useful single disclosure (because Sora’s was withheld), but the underlying spatiotemporal-autoencoder idea is older.

What the papers do not claim to be novel. DDPM forward process (adopted from Ho et al.). Cross-attention text conditioning (adopted from Latent Diffusion). CLIP text encoders (adopted from OpenAI’s CLIP). RoPE positional encoding (adopted from Su et al.). Classifier-free guidance (adopted from Ho-Salimans). The flow-matching framework itself (adopted from Lipman et al.).

Section 11 — Situating the work

What prior work did. Pre-2022 video generation was dominated by autoregressive models (VQ-VAE-style discrete tokens), GAN-based models (MoCoGAN, TGAN, DVD-GAN), and U-Net-based latent diffusion extensions (Imagen Video, Make-a-Video, Phenaki). All three families produced 2-4-second clips at fixed resolutions with weak prompt-following.

What this lineage changes conceptually. Three things. First, the transformer becomes the dominant architecture (DiT). Second, the training data and training-time captioning become explicit first-class contributions, not afterthoughts (Sora’s re-captioning, Movie Gen’s curation pipeline). Third, the architectural disclosure norm in the field has weakened: Sora and Veo are deployed without architectural papers, and Movie Gen’s 92-page paper is the exception not the norm. [Analysis] The third change is the most editorially significant for readers: the gap between “what was built” and “what is documented” is wider in video than in any other ML modality as of 2025.

Contemporaneous related papers.

  • Stable Video Diffusion (Blattmann et al., Stability AI 2023, arXiv:2311.15127). Open-weights U-Net video-diffusion model with explicit data-curation discussion. This lineage builds on its training-data-curation framing; Movie Gen cites it directly. [External comparison]
  • Stable Diffusion 3 (Esser, Kulal et al. 2024, arXiv:2403.03206). Image-only, but the architectural recipe (MMDiT with cross-attention text conditioning, rectified-flow training, C=16C=16 latent) is the closest sibling to Movie Gen Video minus the temporal axis. [External comparison]

[Reviewer Perspective] The strongest skeptical objection is that none of Sora, Veo 2, or Veo 3 has produced a citable technical paper, and that Movie Gen’s 30B Video model is the only frontier-scale video DiT a reader can actually study. The cited “DiT is the foundation of Sora” is a near-universal-but-uncited inference; OpenAI has confirmed “diffusion transformer” but never disclosed the specific block, conditioning, or tokenisation choices. The field is reading tea leaves on what frontier video models actually do.

[Reviewer Perspective] Author-side rebuttal grounded in the paper. From a Movie-Gen-centric reading: the paper’s documentation is genuinely deep, the architectural choices are well-motivated, and the recipe is reproducible at smaller scale by anyone with the data and compute. From a Sora-centric reading: the technical report is intentionally framing-level because the technical novelty (in OpenAI’s framing) is the demonstration of the capability frontier itself, not any single block-level innovation; the system is to be evaluated as a product.

Three future research directions grounded in paper-specific limitations:

  1. Longer-clip generation beyond the ~16-second ceiling Movie Gen reports. The quadratic attention cost is the binding constraint; sparse-attention variants (the kind that worked for long-context LLMs) are the obvious entry point. [Analysis]
  2. Better physical-consistency evaluation protocols. Sora’s stated limitations (water cup that doesn’t shatter, glass that liquefies) are not currently measurable on any standard benchmark. Building a physical-plausibility benchmark would let progress on this dimension be tracked. [Reviewer Perspective]
  3. Open-weights frontier video models. The publication’s understanding of Sora and Veo is bottlenecked by the closed-source posture. A Llama-3-scale open-weights video model would let the field study what Sora is actually doing. [Analysis]

Section 12 — Critical analysis

Strengths with concrete evidence.

  • DiT’s FLOPs-vs-FID scaling curve is monotone and predictable across DiT-S to DiT-XL (Figure 8 of the paper). The scaling claim is well-supported.
  • Movie Gen’s TAE ablations isolate the contribution of each compression-ratio choice (Section 3.2 of the paper).
  • Sora’s spacetime-patch + variable-shape framing is the cleanest explanation in the public literature of how variable-resolution video generation is supposed to work.

Weaknesses explicitly stated by the authors.

  • Sora: “weaknesses as a simulator”, inability to model glass shattering, occasional object-permanence failures, hands and complex multi-agent motion.
  • Movie Gen: 16-second maximum clip length; bias toward Western-style content in training data; no real-time inference.
  • Veo: not disclosed.

Weaknesses understated by the authors. [Reviewer Perspective]

  • Sora’s evaluation gap. No quantitative benchmark; no human-eval win-rate against named baselines. The publication has nothing to compare Sora against systematically.
  • Movie Gen’s win-rate protocol. Meta-side annotation infrastructure, Meta-chosen prompt distribution; the comparison numbers are likely systematically generous. [Reviewer Perspective] Discussion in the Anthropic-affiliated AI safety community’s reading group threads on Movie Gen (general venue, no specific Movie Gen post cited because no rigorous independent peer-review exists for it as of the access date) flags the same concern.
  • Veo’s complete disclosure gap. DeepMind’s choice not to publish a Veo paper, despite Veo 3’s launch position as “first joint audio-video diffusion at scale”, is the editorially most significant unstated weakness.

Reproducibility check.

ArtefactCodeDataHyperparametersComputeWeightsEval setOverall
DiTReleased on GitHubPublic (ImageNet)Fully disclosedDisclosedReleasedPublicFully reproducible
SoraNot releasedNot releasedNot disclosedNot disclosedClosedNot releasedNot reproducible
Movie GenNot released as of access dateNot releasedPartially disclosedPartially disclosedClosedNot releasedPartially reproducible at smaller scale
VeoNot releasedNot releasedNot disclosedNot disclosedClosedNot releasedNot reproducible

Methodology disclosure.

Methodology

  • Sample size: DiT: 250K ImageNet samples generated for FID; Sora: not stated; Movie Gen: 1,003 prompts for the headline T2V eval per the paper’s Table 11; Veo: not stated.
  • Evaluation set: DiT: held-out ImageNet validation; Movie Gen: proprietary Movie Gen Bench prompt set; Sora/Veo: not stated.
  • Baselines: DiT: ADM, LDM, CDM, U-ViT, others; Movie Gen: Sora, Runway Gen-3, Kling 1.5, LumaLabs; Sora/Veo: none.
  • Hardware/compute: DiT: A100 cluster, exact size not in paper; Movie Gen: H100 cluster, exact size not in paper; Sora/Veo: not disclosed.

[Analysis] The disclosure asymmetry above is the single most important fact for any reader trying to navigate this literature. DiT is reproducible; Movie Gen is partially reproducible at smaller scale; Sora and Veo are not reproducible at any scale.

Generalisability. [Analysis] The DiT block generalises across image modalities (class-conditional, text-conditional, image-to-image) and into video. Spacetime patches and the TAE generalise to any video-shaped data including animated 2D scenes, simulation outputs, and 3D rendered content. The flow-matching objective generalises to any continuous data; Movie Gen ships a text-to-audio model in the same family using the same objective.

Assumption audit. Revisiting Section 3’s assumption list:

  • Frozen autoencoder is high-fidelity enough. Verified for DiT (LDM autoencoder is well-studied) and Movie Gen (TAE benchmarks in the paper). Likely true for Sora/Veo but unverified. [Analysis]
  • Text-conditioning embedding is informative enough. True at the modern scale; CLIP and T5 family embeddings are well-established.
  • Curated training data is representative. The most fragile assumption. Movie Gen’s curation pipeline filters out static clips, low-motion clips, and low-aesthetic clips; the model is therefore biased toward dynamic, aesthetically-curated content. [Reviewer Perspective] Sora’s stated bias toward camera-motion-heavy content shows this assumption breaking at deployment.

What would make these papers significantly stronger. [Analysis]

  • For Sora: a follow-up technical paper documenting the actual architectural choices.
  • For Movie Gen: an open release of the TAE and a smaller (1B-class) public checkpoint so the recipe can be studied.
  • For Veo 2/3: any technical disclosure beyond the launch post.
  • Across all four: an open, independent video-generation benchmark with shared prompts and shared human-eval protocols, so cross-model comparison stops being vendor-claim-driven.

Section 13 — What is reusable for a new study

REUSABLE COMPONENT [1]: DiT-with-adaLN-Zero block

  • What it is: a transformer block on patchified-latent tokens with timestep modulation via zero-initialised adaLN scale/shift/gate.
  • Why worth reusing: scales cleanly; trains stably from scratch; is the dominant block in 2024-2025 generative video.
  • Preconditions: a latent representation produced by a pre-trained autoencoder; a timestep embedding and (optionally) a class or text embedding.
  • What would need to change in a different setting: for audio, replace 2D patchify with 1D windowed patchify on log-mel spectrograms; for 3D point clouds, replace with set-transformer or PointNet-style tokenisation.
  • Risks: needs careful learning-rate scheduling at scale; AdamW β2=0.999\beta_2 = 0.999 is standard but 0.950.95 has shown improvements in some Stable Diffusion 3 ablations.
  • Interaction effects: pairs with cross-attention layers if text conditioning is rich; pairs with flow-matching loss for newer training stacks.

REUSABLE COMPONENT [2]: Spacetime patchification + RoPE-on-3D

  • What it is: tokenisation of video latents as (pt,ps,ps,C)(p_t, p_s, p_s, C) patches with rotary positional encoding generalised to three axes.
  • Why worth reusing: variable-shape input is supported by construction.
  • Preconditions: a spatiotemporal autoencoder; a transformer with attention.
  • Risks: long sequences (N73N \approx 73K for Movie Gen at full clip size) make attention expensive; sparse-attention or windowed-attention is the next engineering move.

REUSABLE COMPONENT [3]: Re-captioning pipeline

  • What it is: train a captioning model first, use it to generate long descriptive captions for training videos, train the diffusion model on those, and rewrite user prompts at inference into the same register.
  • Why worth reusing: the cleanest known “free” prompt-fidelity improvement.
  • Preconditions: a captioning model strong enough that its outputs are useful; an inference-time LLM call budget per generation.
  • Risks: amplifies the captioning model’s biases; the diffusion model becomes optimal at the captioning register, which may not match end-user prompt distributions if not aligned at inference.

REUSABLE COMPONENT [4]: Movie Gen’s five-stage data-curation pipeline

  • What it is: visual-quality filtering (720px minimum, aspect ratio, OCR text removal, scene detection), motion filtering (static removal, VMAF scoring, jitter detection), embedding-similarity deduplication, concept-balancing resampling.
  • Why worth reusing: documented filter stack and ordering; reduces from \sim100M raw to \sim73.8M training clips.
  • Preconditions: ability to run each filter at scale on raw video.
  • Risks: each filter encodes implicit aesthetic choices that bias the final model.

REUSABLE COMPONENT [5]: Flow-matching with logit-normal timestep

  • What it is: predict velocity field Vt=ϵz0V_t = \epsilon - z_0; train with squared error; sample with continuous-time ODE.
  • Why worth reusing: fewer sampling steps than DDPM at matched quality; deterministic at inference; cleaner CFG scaling.
  • Preconditions: standard.
  • Risks: the timestep distribution choice is a non-trivial knob; logit-normal worked for Movie Gen and Stable Diffusion 3 but is not the only choice.

Dependency map. Component 1 (DiT block) is the base; components 2 and 5 plug into it; component 3 attaches at the data preparation stage; component 4 is upstream of all of the above.

Recommendation. [Analysis] The three highest-value reusable components are (a) the DiT-with-adaLN-Zero block as the architectural foundation, (b) flow-matching with the logit-normal timestep schedule as the training objective, and (c) the re-captioning pipeline as the prompt-fidelity multiplier. A new video-generation project that adopts these three already captures the bulk of the 2024-2025 frontier recipe. The TAE and the data-curation pipeline are valuable but more domain-specific.

Type of new study that benefits most. [Analysis] Any team building a domain-specific video model (medical-imaging video, scientific visualisation, industrial sensor video) benefits from inheriting the architectural spine and replacing only the autoencoder and the data-curation filters; the rest of the recipe is roughly domain-agnostic.

Section 14 — Known limitations and open problems

Limitations explicitly stated by the authors.

  • Sora: limited physical-law fidelity (the famous glass-cup demo where the glass liquefies rather than shatters); occasional object-permanence failures (an object that should remain stationary drifts); difficulty with hand morphology and multi-agent action coordination.
  • Movie Gen: maximum clip length 16 seconds at 16 fps; bias toward English-language captions; bias toward Western-style content; no real-time inference; computational cost prevents widespread deployment.
  • DiT: compute-FID curve flattens at the largest tested model; whether scaling continues to FID <2<2 is open.
  • Veo: not disclosed.

Limitations not explicitly stated. [Reviewer Perspective]

  • Architectural disclosure. Sora and Veo’s choice not to publish architectural detail is itself a limitation for the research community.
  • Evaluation infrastructure. The field has no shared benchmark for video-generation human evaluation; all comparisons are vendor-curated.
  • Reproducibility at frontier scale. No public team can reproduce Sora or Veo; only Meta is in a position to reproduce Movie Gen.
  • Cost. The 30B Movie Gen Video model is the largest open documentation but is still beyond academic compute budgets to train from scratch.

Technical root cause of each. Cubic compute scaling with (T,H,W)(T, H, W) for quadratic attention. Data-curation bias is downstream of the curation filters themselves. Architectural-disclosure gaps are governance and competition choices, not technical limitations.

Open problems left behind.

  • Long-context attention for video (efficient N>100N > 100K attention).
  • Better-than-VMAF perceptual quality metrics for video.
  • Open, jointly-evaluable text-to-video benchmarks.
  • 3D-consistent video generation that respects physical laws by construction.

What a follow-up paper would need to solve. [Analysis] The most-impactful next contribution is an efficient attention variant that breaks the quadratic scaling without sacrificing temporal-coherence quality; this would unlock minute-long generation at production resolution and would let the entire lineage scale beyond Movie Gen’s 16-second ceiling.

Section 9 supplementary — body figures

Figure 4 of Peebles and Xie — DiT FLOPs vs FID scaling on class-conditional ImageNet

Figure 4 of Peebles and Xie — Scalable Diffusion Models with Transformers (arXiv:2212.09748), reproduced for editorial coverage.

Figure 2 of Polyak et al. — Movie Gen Video, the architectural overview of the 30B-parameter video DiT

Figure 2 of Polyak et al. — Movie Gen: A Cast of Media Foundation Models (arXiv:2410.13720), reproduced for editorial coverage.

Figure showing Movie Gen's temporal autoencoder TAE compression pipeline

Figure from Polyak et al. — Movie Gen (arXiv:2410.13720) on the temporal autoencoder, reproduced for editorial coverage.

Figure showing Movie Gen's five-stage data curation pipeline from raw video corpus to curated training set

Figure from Polyak et al. — Movie Gen (arXiv:2410.13720) on the data-curation pipeline, reproduced for editorial coverage.

How this article reads at three depths

For the curious high-school reader. Modern AI video generators (Sora, Veo, Movie Gen) work by gradually turning random static into a clip, guided by a text prompt. The neural network doing the denoising is called a “diffusion transformer”, or DiT. Four research artefacts together built this recipe: a 2022 paper introduced the transformer block; a 2024 OpenAI report showed how to generalise it from images to video; a 2024 Meta paper documented the full thirty-billion-parameter recipe; and a 2025 DeepMind launch added synchronised audio in the same pass. The recipe is now the dominant approach across the industry, and most of the architectural choices trace back to those four artefacts.

For the working developer or ML engineer. The DiT-with-adaLN-Zero block is the architectural atom; spacetime patches with RoPE are the standard tokenisation; flow matching with a logit-normal timestep schedule is the standard training objective; classifier-free guidance is the standard inference-time prompt-strength knob. Movie Gen’s documentation is the most rigorous reference in the open literature, including the temporal autoencoder configuration (8x8x8 compression, C=16), the Llama-3-style transformer modifications (RMSNorm, SwiGLU, bidirectional attention, RoPE), the five-stage data-curation pipeline, and the joint image-and-video pretraining schedule. Sora’s report supplies the spacetime-patches and re-captioning framings; Veo is essentially undocumented at architectural detail. If you are building a domain-specific video model in 2025-2026, the highest-payoff components to inherit are the DiT block, the flow-matching objective, and the re-captioning pipeline.

For the ML researcher. The single most-novel architectural contribution in the lineage is DiT’s adaLN-Zero modulation; everything else is either an adaptation (spacetime patches generalising ViT, flow matching from Lipman et al.) or an engineering documentation (Movie Gen’s TAE, curation pipeline, 30B configuration). The most load-bearing assumption is that the curated training distribution is representative of deployment use; Sora’s stated limitations all reduce to this assumption breaking. The strongest skeptical objection is the disclosure-asymmetry across the four artefacts: DiT is fully reproducible, Movie Gen is partially reproducible at smaller scale, Sora and Veo are not reproducible at any scale, and the field is forced to read frontier-model behaviour through demo galleries and vendor-curated win-rates. A follow-up paper that produced an efficient attention variant breaking the quadratic-with-NN scaling, paired with an open jointly-evaluable benchmark for video-generation quality, would be the most consequential next contribution.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Anonymous · no cookies set

Report a problem with this article

Articles are produced by an autonomous AI pipeline; mistakes do happen. Tell us what's wrong and the editorial review will revisit the claim.

Category

Found this useful? Share it.