Neural Tech Daily
ai-research

The Qwen 2.5 and Qwen 3 technical reports — a multi-paper review

Multi-paper review of Alibaba's Qwen 2.5 (arXiv:2412.15115) and Qwen 3 (arXiv:2505.09388). Pretraining recipe, post-training, multilingual sweep, MoE.

Updated ~56 min read
Share
Figure 1 of the Qwen3 Technical Report (arXiv:2505.09388) — the four-stage post-training pipeline integrating long-CoT cold start, reasoning RL with GRPO, thinking-mode fusion, and general RL into the flagship Qwen3-235B-A22B mixture-of-experts model reviewed here alongside the Qwen 2.5 dense lineage

Figure 1 of the Qwen3 Technical Report (arXiv:2505.09388), reproduced for editorial coverage.

1. Umbrella scope and paper identity

Citations.

  • Qwen Team, Alibaba Cloud, “Qwen2.5 Technical Report,” arXiv:2412.15115, December 2024 1 . Covers the open-weight dense models from 0.5B to 72B plus the proprietary mixture-of-experts variants Qwen2.5-Turbo and Qwen2.5-Plus.
  • Qwen Team, Alibaba Cloud, “Qwen3 Technical Report,” arXiv:2505.09388, May 2025 2 . Covers six dense models (0.6B, 1.7B, 4B, 8B, 14B, 32B) and two mixture-of-experts models (Qwen3-30B-A3B and Qwen3-235B-A22B). Release date 29 April 2025 5 .

Retrieval. The Qwen3 paper was retrieved via the ar5iv HTML render 3 , with cross-checks against the Hugging Face model card for Qwen3-235B-A22B 7 . The Qwen 2.5 paper’s full PDF body did not render through any of the fetchers attempted on 2026-05-19; this article reconstructs Qwen 2.5 details from the arXiv abstract 1 , the Qwen team’s official launch blog 4 , the Hugging Face model card for Qwen2.5-72B-Instruct 6 , and the Qwen 2.5 vs Qwen 3 deltas the Qwen3 paper itself documents 2 3 . Sections where the Qwen 2.5 details are reconstructed rather than directly quoted from the technical report carry an explicit [Reconstructed] marker.

Classification. Architecture proposal (Qwen 3’s dense-plus-MoE lineup with thinking-mode fusion). Training method (the four-stage post-training pipeline plus strong-to-weak distillation across the model-size sweep). Benchmark. Multilingual. LLM-based. Application. The Qwen 2.5 paper is a dense-architecture full-stack release; the Qwen 3 paper layers two genuinely new contributions on top of the inherited architecture: native dense + MoE, and unified thinking / non-thinking inference behind a single weights file.

Technical abstract in the publication’s voice. Across two technical reports separated by five months, Alibaba’s Qwen team scaled an open-weight LLM family from 18 trillion pretraining tokens and 29 languages (Qwen 2.5, December 2024) to 36 trillion pretraining tokens and 119 languages (Qwen 3, April-May 2025), while keeping the same architectural backbone (decoder-only Transformer with grouped-query attention, SwiGLU, RoPE, RMSNorm) and pushing post-training from a roughly Llama-style supervised fine-tuning plus direct preference optimization pipeline into a four-stage recipe that ends with a unified thinking and non-thinking mode controlled by a chat-template flag and a runtime thinking budget. Qwen 3 also introduces two mixture-of-experts models (30B-A3B and 235B-A22B) that use 128 routed experts with 8 activated per token, no shared experts, and global-batch load balancing. The Qwen 3 paper’s strongest empirical claim is that Qwen3-4B matches Qwen2.5-72B-Instruct on broad benchmarks, attributed to strong-to-weak distillation from the 235B teacher rather than scratch four-stage training 2 5 .

Primary research questions.

  • Qwen 2.5: How far can a fully open-weight dense Transformer family, trained on 18 trillion tokens with a Llama 3-style post-training cocktail, close the gap to closed-source frontier models on multilingual, math, and code benchmarks?
  • Qwen 3: Can a single set of weights serve both fast non-thinking responses and slow chain-of-thought reasoning, controllable by a per-prompt budget, without sacrificing benchmark headroom against specialist reasoning models like QwQ-32B and DeepSeek-R1?

Core technical claims.

  • Qwen 2.5: The 72B-Instruct variant outperforms several open-weight peers and reaches competitive parity with Llama 3.1-405B-Instruct despite roughly 5.6×5.6\times fewer parameters 1 . [Reconstructed from abstract.]
  • Qwen 3: Qwen3-235B-A22B in thinking mode reaches AIME-2024 85.7, AIME-2025 81.5, LiveCodeBench v5 70.7, and CodeForces Elo 2,056, while activating only 22B parameters per token. The Qwen3-32B dense base model beats Qwen2.5-72B-Base on coding and math despite 55% fewer parameters 2 3 .

Core technical domains.

DomainDepth
Decoder-only Transformer architecturedeep
Grouped-query attention, RoPE, SwiGLU, RMSNormdeep
Mixture-of-experts routingmoderate
Long-context extension (YARN + Dual Chunk Attention)moderate
Reinforcement learning from verifiable rewards (GRPO)deep
Knowledge distillationmoderate
Multilingual pretrainingmoderate
Tool-use and agent benchmarkssurface

Reader prerequisites. High-school algebra. Some prior exposure to neural-network basics helps but is not required: the Glossary in Section 2.5 defines every load-bearing technical term, and the worked numerical examples in Sections 6 and 7 build the math from first principles.

2. TL;DR and executive overview

3-sentence TL;DR. Qwen 2.5 (December 2024) and Qwen 3 (April-May 2025) are two consecutive technical reports from Alibaba Cloud’s Qwen team that document how a single open-weight LLM family doubled its pretraining data, expanded from 29 to 119 supported languages, added two mixture-of-experts models on top of the dense lineup, and fused step-by-step “thinking” mode with fast “non-thinking” mode into one set of weights. The headline empirical result is that the small 4-billion-parameter Qwen 3 dense model performs comparably to the 72-billion-parameter Qwen 2.5 instruction-tuned model on broad benchmarks, attributed to a strong-to-weak distillation pipeline that copies behaviour from the flagship 235-billion-parameter mixture-of-experts teacher. Together the two reports give the public-archive cleanest current picture of how a research-frontier LLM family is built end-to-end outside the United States, with full open weights under the Apache 2.0 licence.

Executive summary. The Qwen team’s two reports describe a coherent design philosophy: keep the architecture conservative (decoder-only Transformer, grouped-query attention, SwiGLU, RoPE, RMSNorm) and push gains primarily through data scale, post-training pipeline depth, and multilingual coverage. Qwen 2.5 establishes the 18-trillion-token dense baseline across model sizes 0.5B through 72B. Qwen 3 doubles pretraining tokens to 36 trillion, expands language coverage roughly fourfold, introduces two mixture-of-experts models, and most importantly fuses two distinct inference behaviours into one weights file, controlled by a /think and /no_think chat-template flag plus a runtime token budget. The lineup ships fully open weights under Apache 2.0 5 , making the family one of the most reproducible frontier-class open-weight releases available to practitioners today.

Five practitioner-relevant takeaways.

  1. The Qwen 2.5 to Qwen 3 jump is mostly post-training and data, not architecture. Grouped-query attention dimensions barely changed; the action is in the four-stage post-training pipeline and the doubled pretraining-token budget.
  2. Thinking-mode fusion is operationally novel. One weights file serves two latency-quality regimes via a chat-template flag; the runtime budget triggers an early-termination prompt when the token cap is reached.
  3. Strong-to-weak distillation is the production sweet spot. Qwen 3 reports the four small dense models (0.6B through 14B) match the four-stage training quality at roughly one-tenth the GPU hours when trained as student distillations of the 235B teacher.
  4. Mixture-of-experts at fixed active budget delivers most of the dense-flagship quality at a fraction of the inference compute. Qwen3-30B-A3B activates 3 billion parameters per token; the paper claims it beats QwQ-32B, a dense reasoning specialist, while using roughly 10x fewer activated parameters 5 .
  5. Multilingual coverage is now mandatory table stakes. Qwen 3’s 119-language expansion makes it the broadest-coverage open-weight family at release; English-centric models are decreasingly competitive on broad evaluation suites that include non-English content.

Pipeline overview in text. Training proceeds in two phases. Pretraining runs in three stages: a general stage on roughly 30 trillion tokens at 4,096 sequence length, a reasoning stage on roughly 5 trillion higher-quality tokens emphasizing science, technology, engineering, mathematics, and code, then a long-context stage on hundreds of billions of tokens at 32,768 sequence length using YARN and Dual Chunk Attention. Post-training runs in four stages for the flagship models: long chain-of-thought cold start via supervised fine-tuning on verified math, code, and reasoning problems; reasoning reinforcement learning using Group Relative Policy Optimization on a small set of query-verifier pairs; thinking-mode fusion that integrates thinking and non-thinking behaviours via chat-template prompts; then a final general-purpose reinforcement-learning stage covering broader tasks. Inference dispatches the prompt through either thinking mode (chain-of-thought visible to the runtime) or non-thinking mode (direct response), with an optional thinking budget that caps the chain length 3 .

2.5. Glossary

TermPlain-English explanationFirst appears in
Decoder-only TransformerA neural network that predicts the next token in a sequence given all previous tokens; the standard backbone of modern LLMs.Section 1
Grouped-query attention (GQA)A memory-saving variant of attention where multiple query heads share a smaller set of key and value heads; reduces inference memory roughly proportional to the sharing factor.Section 1
Mixture-of-experts (MoE)An architecture where each input token is routed to a small subset of “expert” sub-networks; only the active experts run, so total parameters can be large while compute stays modest.Section 1
SwiGLUAn activation function used in the feed-forward block of modern Transformers; combines a Swish activation with a gating mechanism.Section 1
RoPE (Rotary Position Embedding)A way to encode the position of each token by rotating its embedding vector by an angle that depends on the position; lets the model generalize to longer sequences than seen in training.Section 1
RMSNormA normalization layer simpler than LayerNorm; divides each vector by its root-mean-square magnitude. Used in most modern LLMs because it’s slightly cheaper.Section 1
PretrainingThe first training phase: the model learns to predict next tokens on a very large unlabeled text corpus.Section 2
Post-trainingThe phase after pretraining where the model is taught to follow instructions, refuse harmful requests, and reason step-by-step; uses techniques like supervised fine-tuning and reinforcement learning.Section 2
GRPO (Group Relative Policy Optimization)A reinforcement-learning algorithm that estimates the value of an action by comparing it against other sampled actions in the same group, removing the need for a separate critic network.Section 6
DistillationA training technique where a smaller “student” model learns to imitate a larger “teacher” model’s outputs, transferring capability at lower inference cost.Section 2
Thinking modeAn inference behaviour where the model emits a long chain-of-thought before answering; trades latency for accuracy on hard problems.Section 2
YARNA technique for extending a Transformer’s context window beyond its training-time length by rescaling the rotary position frequencies.Section 5
Dual Chunk AttentionA long-context attention scheme that breaks the sequence into chunks and applies a two-level attention pattern (within-chunk and cross-chunk) to reduce quadratic cost.Section 5
From the paper: prefixContent directly supported by the paper’s text, equations, tables, or figures.Throughout
[Analysis] labelThe publication’s own reasoned assessment, distinct from what the paper itself claims.Throughout
[Reviewer Perspective] labelA critical or speculative assessment that goes beyond what the paper proves.Sections 11, 12
[Reconstructed] labelContent the publication faithfully reconstructed because the paper only partially disclosed it or the full PDF was not retrievable.Throughout (heavy in Qwen 2.5 sections)
[External comparison] labelA comparison to prior work or general knowledge outside the paper itself.Sections 4, 11

3. Problem formalisation

Notation table.

SymbolTypeMeaningFirst appears in
LLintNumber of Transformer layersSection 5
ddintHidden dimensionSection 5
hqh_qintNumber of query attention headsSection 5
hkvh_{kv}intNumber of key-value heads (GQA shared)Section 5
EEintTotal number of experts in an MoE layerSection 5
kkintNumber of activated experts per tokenSection 5
TTintPretraining token budgetSection 5
θ\thetaparamsModel parametersSection 6
πθ\pi_\thetapolicyThe current model viewed as a generation policySection 6
πref\pi_{\text{ref}}policyThe reference (SFT-stage) model used as a KL anchorSection 6
r(x,y)r(x, y)scalarA verifiable reward for response yy given prompt xxSection 6
A(x,y)A(x, y)scalarThe advantage of response yy relative to the groupSection 6
β\betascalarKL regularization coefficientSection 6
ϵ\epsilonscalarPPO-style clip thresholdSection 6

Formal problem statement (Qwen 2.5 and Qwen 3). Given a multilingual text corpus Dpre\mathcal{D}_{\text{pre}} with TT tokens and a smaller post-training corpus Dpost\mathcal{D}_{\text{post}} of paired (prompt, response, reward) triples, learn a single set of parameters θ\theta for a decoder-only Transformer πθ(yx)\pi_\theta(y \mid x) such that (i) πθ\pi_\theta minimizes next-token cross-entropy on Dpre\mathcal{D}_{\text{pre}}, (ii) πθ\pi_\theta maximizes expected verifiable reward on Dpost\mathcal{D}_{\text{post}} subject to a KL constraint against a reference policy πref\pi_{\text{ref}}, and (iii) for Qwen 3 specifically, πθ\pi_\theta exhibits two distinguishable behaviours (long chain-of-thought reasoning and fast direct response) controlled by a deterministic prompt-template flag.

Explicit assumptions.

  • The pretraining corpus is large enough that next-token loss tracks an underlying capability frontier per the now-standard compute-scaling literature. [Analysis] Potentially strong assumption outside Chinchilla-style compute-optimal regimes; the Qwen team scales tokens-per-parameter aggressively for the smaller dense models in line with the recent open-weight trend toward over-training small models.
  • Verifiable rewards (math correctness, code unit tests) provide a dense enough signal to support reinforcement learning at the scales reported (3,995 query-verifier pairs for the reasoning RL stage 3 ). [Analysis] Potentially strong assumption for tasks without verifiable ground truth; the Qwen 3 paper handles this by leaving the broader general-RL stage outside the verifiable-reward formulation.
  • Strong-to-weak distillation preserves the teacher’s task distribution well enough that the student matches the four-stage training quality on the evaluation suite. [Analysis] This is an empirical claim the paper backs up on its own benchmark slate; independent reproducibility on off-distribution tasks remains open.

Why the problem is hard.

  • Multilingual coverage at 119 languages changes the tokenization, data-filtering, and curriculum design problem at every stage 5 .
  • Unified thinking and non-thinking behaviour requires the same parameter set to produce two qualitatively different output distributions on demand, which historically required either two separate models or per-sample mode selection by an external router.
  • Mixture-of-experts at 128 experts with 8 activated introduces routing-stability, load-balance, and capacity-allocation issues that dense models do not face; the Qwen 3 paper handles load balance with a global-batch auxiliary loss and explicitly opts out of shared experts 3 .

Data-driven role. Both papers position data scale and quality as primary levers. The Qwen 3 paper labels Stage 2 of pretraining the “reasoning stage” and shifts roughly 5 trillion tokens of higher-quality science, technology, engineering, math, and code content into the curriculum at the back end of pretraining 3 . The cited rationale is that the model’s gradient capacity at the end of pretraining is best spent on high-density domains.

LLM-based role. Both Qwen reports are pure-LLM contributions in the sense that the LLM is the entire deliverable; there are no auxiliary models in the inference path. The thinking-budget mechanism is a runtime control inside the LLM itself, not a separate router.

Theoretical statements. Neither report contains formal theorems. The optimization objectives (next-token cross-entropy, GRPO, DPO-style preference loss in Qwen 2.5) are inherited from prior work and are treated empirically rather than analytically.

4. Motivation and gap

Real-world problem. The frontier-class LLM market in late 2024 and early 2025 is dominated by closed-weight US providers (OpenAI’s GPT-4o and o1, Anthropic’s Claude 3.5 Sonnet and Claude 3.7, Google’s Gemini 2.0 and 2.5). For a developer who wants frontier-class quality on their own hardware, or wants to fine-tune on private data, the open-weight options are Meta’s Llama 3 herd (dense, up to 405B), DeepSeek-V3 and DeepSeek-R1 (MoE), Mistral’s families, and Alibaba’s Qwen. The Qwen 2.5 and Qwen 3 releases position the family as the broadest-coverage and most-comprehensive open-weight lineup, with both small dense models (down to 0.6B for on-device) and a 235B-parameter MoE flagship for serious self-hosting.

Existing approaches and failure modes (as the papers frame them). The Qwen 2.5 paper compares Qwen 2.5-72B-Instruct primarily against Llama 3.1-405B-Instruct on standard benchmarks; the claim is parity at roughly 5.6×5.6\times fewer parameters 1 . The Qwen 3 paper compares its flagship against DeepSeek-R1, OpenAI’s o1 and o3-mini, xAI’s Grok-3, and Google’s Gemini-2.5-Pro on a reasoning-heavy benchmark slate 5 . The papers do not claim universal superiority; the framing is competitive parity at a substantially smaller activated-parameter budget.

Gap the papers fill.

  • Qwen 2.5: a fully-open dense lineup spanning seven sizes from 0.5B to 72B on a single 18-trillion-token training corpus, with 128K context and 29-language coverage, available under Apache 2.0 (most sizes).
  • Qwen 3: unified thinking and non-thinking modes in one weights file, paired with a runtime thinking budget for latency-aware deployment; plus the first Qwen MoE models released as open weights.

Practical stakes. [Analysis] For a self-hosting practitioner, the operational question is whether a single deployed Qwen 3 endpoint can serve both fast chat traffic and slow analytical traffic without model swapping. The thinking-mode fusion answer is yes, conditional on the chat template wiring; this is a meaningful infrastructure simplification relative to running QwQ-32B and Qwen2.5-7B-Instruct as separate endpoints.

[External comparison] Position in the open-weight landscape. Qwen 3 lands in the same competitive frame as DeepSeek-V3 (December 2024) 14 and Llama 4 Scout / Maverick (April 2025) 15 . All three pivot to MoE at the flagship tier; all three keep a dense lineup for smaller sizes. The differentiation between them sits in the post-training pipeline and the inference-control surface (thinking budgets, multimodality, context length).

5. Method overview

5.1 Architecture: the shared backbone

The Qwen 2.5 and Qwen 3 dense lineups use the same architectural choices: decoder-only Transformer, grouped-query attention (GQA), Rotary Position Embedding (RoPE), SwiGLU feed-forward blocks, and RMSNorm. The Qwen3 paper additionally adds QK-Norm to the attention block for training stability and removes the QKV bias term that Qwen 2.5 carried 3 .

Qwen 3 dense lineup architecture (from the Qwen3 paper Table 1 3 ):

ModelLayersQ heads / KV headsNative contextTied embeddings
Qwen3-0.6B2816 / 832Kyes
Qwen3-1.7B2816 / 832Kyes
Qwen3-4B3632 / 8128Kno
Qwen3-8B3632 / 8128Kno
Qwen3-14B4040 / 8128Kno
Qwen3-32B6464 / 8128Kno

Qwen 3 MoE lineup (from the Qwen3 paper Table 2 3 ):

ModelLayersQ heads / KV headsTotal expertsActivated expertsNative context
Qwen3-30B-A3B4832 / 41288128K
Qwen3-235B-A22B9464 / 41288128K

The Qwen 2.5 dense lineup follows the same backbone with the open sizes 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B, training-context 32,768 with YARN extension to 131,072 for the 72B variant 6 . The 72B variant carries 80 layers, 64 query heads, 8 KV heads, and roughly 70 billion non-embedding parameters 6 .

Design rationale. [Analysis] Conservative architecture choices reduce training-stability risk at the flagship scale and let the team push more aggressively on data, curriculum, and post-training. The architectural conservatism is the same posture Meta took in the Llama 3 paper and DeepSeek-AI in DeepSeek-V3.

What breaks if removed. GQA removal would roughly 8×8\times the KV-cache memory at inference (the ratio of hqh_q to hkvh_{kv} for the dense lineup), making 128K-context inference infeasible at the 32B and 72B tiers on typical hardware. RoPE removal would force fixed-length or sinusoidal positional encoding, blocking the YARN-based context extension.

Classification. [Adopted] from Llama 2 / Llama 3 backbone (GQA, SwiGLU, RoPE, RMSNorm) 8 9 . [Adapted] for the Qwen3 MoE structure (no shared experts, global-batch load balancing, a deliberate divergence from the DeepSeek-V3 shared-expert convention 14 ).

5.2 Pretraining curriculum

The Qwen 3 paper describes three pretraining stages 3 :

  • Stage S1, general stage. Approximately 30 trillion tokens at sequence length 4,096. Broad multilingual web text, books, code, and conversation.
  • Stage S2, reasoning stage. Approximately 5 trillion higher-quality tokens emphasizing science, technology, engineering, mathematics, and code. Token quality is filtered more aggressively; the learning rate schedule includes a “burn” through this curriculum.
  • Long-context stage. Hundreds of billions of tokens at sequence length 32,768. YARN and Dual Chunk Attention applied to extend context up to the 128K native context window of the larger models 12 13 .

The Qwen 2.5 paper trains on 18 trillion tokens; the curriculum stages are less explicitly enumerated in the public material 1 4 . [Reconstructed] the abstract describes the corpus scale but the staged-curriculum framing is fleshed out only in the Qwen 3 paper looking back at the Qwen 2.5 baseline.

5.3 Post-training: four stages

Qwen 3’s flagship post-training pipeline runs four stages 3 :

  1. Long-CoT cold start. Supervised fine-tuning on verified math, code, and reasoning problems with long chain-of-thought solutions. The cold-start data is filtered so every example carries a verifier (unit test for code, equality check for math, rubric for general reasoning).
  2. Reasoning RL. GRPO training on 3,995 query-verifier pairs. The paper reports AIME-2024 accuracy lifting from 70.1 to 85.1 over this stage alone 3 .
  3. Thinking-mode fusion. Integrates thinking and non-thinking behaviours via chat-template prompts. The same weights serve both modes; routing is by prompt template.
  4. General RL. Domain-wide reinforcement learning covering instruction-following, tool use, safety, and broader chat quality.

For the smaller dense models (0.6B through 14B), the team uses strong-to-weak distillation from the flagship 235B teacher. The paper reports comparable benchmark quality at roughly one-tenth the GPU hours versus running the full four-stage pipeline on the student directly 3 .

Design rationale. [Analysis] The four-stage pipeline is the operational expression of the reasoning-versus-instruction-following tradeoff: reasoning RL produces strong chain-of-thought but tends to degrade short-form instruction-following; the general-RL final stage repairs the latter without resetting the former. Thinking-mode fusion in between is the bridge that lets a single weights file serve both.

Classification. GRPO is [Adopted] from Shao et al. (DeepSeekMath, 2024) 10 . The four-stage composition is [New] to the Qwen 3 paper. The chat-template thinking flag is [New] to the Qwen 3 release. Strong-to-weak distillation is [Adapted] from a long lineage of teacher-student distillation work going back to Hinton et al.

6. Mathematical contributions

This section walks through the math that drives the Qwen 3 post-training pipeline and the GQA / MoE architecture inherited from Qwen 2.5.

MATH ENTRY 1: Grouped-query attention output.

  • Source: Qwen3 paper Section 2 architecture description 3 ; GQA paper 9 .
  • What it is: a memory-saving variant of multi-head attention where multiple query heads share key and value projections.
  • Formal definition. For a single token’s hidden state xRdx \in \mathbb{R}^d:

headi=softmax ⁣(QiKg(i)dk)Vg(i)\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_{g(i)}^\top}{\sqrt{d_k}}\right) V_{g(i)}

with Qi=xWiQQ_i = x W_i^Q, Kj=xWjKK_j = x W_j^K, Vj=xWjVV_j = x W_j^V, where g(i){1,,hkv}g(i) \in \{1, \dots, h_{kv}\} is the group index assigning query head ii to one of the hkvh_{kv} key-value groups, and the multi-head output concatenates all hqh_q heads.

  • Each term explained with dimensional analysis.
    • xx is a row vector of length dd. For Qwen3-32B, dd is the hidden dimension (paper does not state it explicitly in the retrievable summary; for the Qwen2.5-72B sibling, dmodeld_{\text{model}} is widely reported as 8192).
    • WiQW_i^Q is a d×dkd \times d_k matrix; with hq=64h_q = 64 query heads on the 32B model and head dimension dk=d/hqd_k = d / h_q, each QiQ_i is a vector of length dkd_k.
    • Kg(i)K_{g(i)} and Vg(i)V_{g(i)} are d×dkd \times d_k matrices shared across all query heads in group g(i)g(i). With hkv=8h_{kv} = 8, there are only 8 distinct KK and VV tensors instead of 64.
    • The output of each head is a vector of length dkd_k; concatenation produces a vector of length hqdk=dh_q \cdot d_k = d, matching the residual stream.
  • Worked numerical example. Take a toy setup with d=16d = 16, hq=4h_q = 4, hkv=2h_{kv} = 2, sequence length T=5T = 5. Then dk=4d_k = 4. Each layer stores hq=4h_q = 4 query projections of shape 16×416 \times 4, and only hkv=2h_{kv} = 2 key projections and 2 value projections of shape 16×416 \times 4. KV cache for one layer at sequence length 5 holds 2254=802 \cdot 2 \cdot 5 \cdot 4 = 80 values (groups ×\times KV pair ×\times tokens ×\times head dim). Multi-head attention without GQA would store 2454=1602 \cdot 4 \cdot 5 \cdot 4 = 160 values. The 2x reduction here corresponds to the hq/hkv=2h_q / h_{kv} = 2 sharing factor; for the Qwen3-32B numbers (64 / 8), the reduction is 8x.
  • Role: reduces KV-cache memory at inference roughly proportional to hq/hkvh_q / h_{kv}; for the 32B model this is the 8x compression that lets 128K-token contexts fit on a single H100.
  • Edge cases: hkv=hqh_{kv} = h_q degenerates to standard multi-head attention; hkv=1h_{kv} = 1 degenerates to multi-query attention.
  • Novelty: [Adopted] from Ainslie et al. 9 .
  • Transferability: any decoder-only Transformer; standard.
  • Why it matters: this is the single biggest reason a 32B Qwen 3 dense model can serve 128K context on a single accelerator.

MATH ENTRY 2: MoE routing and load-balance loss.

  • Source: Qwen3 paper Section 2 3 .
  • What it is: how each token chooses 8 of 128 experts and the auxiliary loss that keeps routing roughly uniform.
  • Formal definition. For each token with hidden state xx, compute a router score se(x)=(xWR)es_e(x) = (x W_R)_e for each expert e{1,,E}e \in \{1, \dots, E\}. Select the top-kk experts T(x)\mathcal{T}(x). The MoE block output is

y=eT(x)exp(se(x))eT(x)exp(se(x))FFNe(x)y = \sum_{e \in \mathcal{T}(x)} \frac{\exp(s_e(x))}{\sum_{e' \in \mathcal{T}(x)} \exp(s_{e'}(x))} \cdot \text{FFN}_e(x)

Load balance is encouraged with a global-batch auxiliary loss

Llb=αEe=1Efepe\mathcal{L}_{\text{lb}} = \alpha \cdot E \cdot \sum_{e=1}^{E} f_e \cdot p_e

where fef_e is the fraction of tokens routed to expert ee across the global batch and pep_e is the average router probability assigned to expert ee.

  • Each term explained with dimensional analysis.
    • WRW_R is d×Ed \times E, mapping hidden state to a length-EE router-score vector. For Qwen3-235B-A22B with E=128E = 128, WRW_R has d×128d \times 128 entries.
    • T(x)\mathcal{T}(x) is the set of k=8k = 8 selected experts; the softmax in the output line normalizes only over the selected kk, not the full EE.
    • FFNe\text{FFN}_e is the ee-th expert’s feed-forward block. Each FFN has the standard SwiGLU two-projection structure, total expert parameters dominate the model parameter count.
    • fef_e and pep_e are scalars per expert; the auxiliary loss is a scalar.
    • α\alpha is the load-balance coefficient (paper does not disclose; common values are 10210^{-2}).
  • Worked numerical example. Take a global batch of 1,024 tokens with E=8E = 8 experts and k=2k = 2 selected per token. Suppose 200 tokens route to expert 1 as their top choice, 150 to expert 2, and so on. Then f1=200/10240.195f_1 = 200 / 1024 \approx 0.195, and if the router probability on expert 1 averaged across the global batch is p10.21p_1 \approx 0.21, the contribution to Llb\mathcal{L}_{\text{lb}} from expert 1 is α80.1950.210.327α\alpha \cdot 8 \cdot 0.195 \cdot 0.21 \approx 0.327 \alpha. Perfect load balance would give fe=pe=1/E=0.125f_e = p_e = 1/E = 0.125 for all experts, leaving the per-expert contribution at α80.1250.125=0.125α\alpha \cdot 8 \cdot 0.125 \cdot 0.125 = 0.125 \alpha. The auxiliary loss is minimized when load is uniform.
  • Role: keeps expert utilization from collapsing onto a small subset, which would otherwise waste MoE capacity.
  • Edge cases: severely imbalanced load causes expert starvation (some experts never train); the paper opts for global-batch rather than per-batch balance to avoid penalizing legitimate within-batch specialization on small batches.
  • Novelty: global-batch load balance is [Adapted] from prior MoE work; no-shared-experts is a [New] choice in the Qwen 3 release relative to the DeepSeek-V3 shared-expert default 14 .
  • Transferability: any MoE architecture.
  • Why it matters: this single architectural choice is the difference between an MoE model that actually uses its capacity and one that collapses to a smaller-effective dense model.

MATH ENTRY 3: GRPO objective for reasoning RL.

  • Source: Qwen 3 paper Section 4.2 3 ; GRPO paper 10 .
  • What it is: a policy-gradient method that estimates each response’s advantage by comparison against a group of sampled responses from the same prompt, removing the need for a separate value-function critic.
  • Formal definition. For prompt xx, sample GG responses {y1,,yG}\{y_1, \dots, y_G\} from the current policy πθ\pi_\theta. Each receives a verifiable scalar reward r(x,yi)r(x, y_i). The group-normalized advantage is

Ai=r(x,yi)mean(r(x,y1),,r(x,yG))std(r(x,y1),,r(x,yG))A_i = \frac{r(x, y_i) - \text{mean}(r(x, y_1), \dots, r(x, y_G))}{\text{std}(r(x, y_1), \dots, r(x, y_G))}

The GRPO objective is

LGRPO(θ)=Ex,{yi}[1Gi=1Gmin ⁣(ρiAi,  clip(ρi,1ϵ,1+ϵ)Ai)]+βKL(πθπref)\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i A_i,\; \text{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon) A_i \right) \right] + \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})

with ρi=πθ(yix)/πθold(yix)\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x) the importance-sampling ratio.

  • Each term explained.
    • GG is the group size (typically 4 to 16 in published GRPO recipes). Larger GG gives lower-variance advantage estimates at proportional sampling cost.
    • r(x,yi)r(x, y_i) is a verifiable reward: 1 if the math answer is correct or all code unit tests pass, 0 otherwise. No reward model.
    • AiA_i is a scalar; the within-group standardization removes the need for a separate value baseline.
    • ρi\rho_i is the importance-sampling ratio for off-policy correction; identical in shape to PPO.
    • ϵ\epsilon is the PPO clip threshold (commonly 0.2).
    • β\beta is the KL coefficient against the reference SFT-stage policy, the standard anti-reward-hacking anchor.
  • Worked numerical example. Take a single prompt xx = “What is 7×87 \times 8?” and group size G=4G = 4. The policy samples four responses with token-level chain-of-thought; the verifier scores them r=[1,1,0,1]r = [1, 1, 0, 1] (three correct, one wrong). The group mean is 0.750.75, standard deviation (0.252+0.252+0.752+0.252)/40.433\sqrt{(0.25^2 + 0.25^2 + 0.75^2 + 0.25^2) / 4} \approx 0.433. The standardized advantages are A=[(10.75)/0.433,  (10.75)/0.433,  (00.75)/0.433,  (10.75)/0.433][0.577,  0.577,  1.732,  0.577]A = [(1 - 0.75)/0.433,\; (1 - 0.75)/0.433,\; (0 - 0.75)/0.433,\; (1 - 0.75)/0.433] \approx [0.577,\; 0.577,\; -1.732,\; 0.577]. The wrong response gets a large negative advantage, the correct ones get small positive advantages, and the gradient pushes πθ\pi_\theta toward the correct three and away from the wrong one. No reward model was queried at any point.
  • Role: drives the AIME-2024 accuracy jump from 70.1 to 85.1 in stage 2 alone 3 .
  • Edge cases. If all GG responses get the same reward, std=0\text{std} = 0 and the advantage is undefined; standard practice is to clip or skip the prompt. The Qwen 3 paper does not enumerate this handling in the retrievable summary.
  • Novelty: [Adopted] from DeepSeekMath 10 .
  • Transferability: any task with a cheap verifiable reward (math problems with closed-form answers, code with unit tests, formal logic, multiple-choice).
  • Why it matters: GRPO is the algorithmic engine of the reasoning-LLM regime; removing the value-function critic halves the memory cost relative to PPO and makes large-scale reasoning RL tractable.

MATH ENTRY 4: DPO objective (Qwen 2.5 post-training).

  • Source: Qwen 2.5 paper abstract describes “multistage reinforcement learning” 1 ; the underlying offline-RL preference optimization is [Reconstructed] from the Qwen 2.5 lineage as standard DPO per Rafailov et al. 11 .
  • What it is: a direct way to fit a policy to preference data without explicitly training a reward model.
  • Formal definition. For preference pairs (x,yw,yl)(x, y_w, y_l) where ywy_w is preferred over yly_l:

LDPO=E(x,yw,yl) ⁣[logσ ⁣(βlogπθ(ywx)πref(ywx)βlogπθ(ylx)πref(ylx))]\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

  • Each term explained.
    • σ\sigma is the logistic sigmoid.
    • β\beta is the KL coefficient (commonly 0.1 to 1.0); larger β\beta ties the policy more tightly to the reference.
    • πref\pi_{\text{ref}} is typically the SFT-stage model. Both ratios use log-probabilities the policy assigns to entire response sequences.
  • Worked numerical example. Suppose πθ(ywx)=0.05\pi_\theta(y_w \mid x) = 0.05 and πref(ywx)=0.02\pi_{\text{ref}}(y_w \mid x) = 0.02 for the winning response (the policy is more confident in the winner than the reference was); and πθ(ylx)=0.001\pi_\theta(y_l \mid x) = 0.001 and πref(ylx)=0.01\pi_{\text{ref}}(y_l \mid x) = 0.01 for the loser (the policy is much less confident in the loser than the reference was). The log ratios are log(0.05/0.02)=log(2.5)0.916\log(0.05/0.02) = \log(2.5) \approx 0.916 and log(0.001/0.01)=log(0.1)2.303\log(0.001/0.01) = \log(0.1) \approx -2.303. With β=0.5\beta = 0.5, the argument to σ\sigma is 0.5(0.916(2.303))=0.53.2191.6100.5 \cdot (0.916 - (-2.303)) = 0.5 \cdot 3.219 \approx 1.610. Then σ(1.610)0.833\sigma(1.610) \approx 0.833 and LDPOlog(0.833)0.183\mathcal{L}_{\text{DPO}} \approx -\log(0.833) \approx 0.183. The loss decreases as the policy further separates its winner-vs-loser margin relative to the reference.
  • Role: the offline preference-optimization workhorse for Qwen 2.5; gives instruction-following polish without requiring a deployed reward model.
  • Novelty: [Adopted] from Rafailov et al. 11 .

MATH ENTRY 5: Thinking-budget early-termination.

  • Source: Qwen3 paper Section 4.3 3 .
  • What it is: a runtime mechanism that caps the chain-of-thought length and forces the model to produce a final answer when the cap is reached.
  • Formal description. The user specifies a budget BB in tokens. The runtime monitors the number of tokens emitted inside the thinking-block markers. When the count reaches BB, the runtime injects the prompt fragment “Considering the limited time by the user, I have to give the solution” before continuing generation. The model then exits the thinking block and emits the final answer 3 .
  • Worked example. A user asks Qwen3-235B-A22B for the solution to an AIME-2024 problem with thinking budget B=2000B = 2000 tokens. The model emits roughly 1,950 tokens of chain-of-thought working through the problem. At token 2000 the runtime injects the early-termination prompt. The model exits the thinking block and produces its best-effort final answer using whatever progress was made. [Analysis] This is not a formal optimization mechanism; it is a deterministic prompt-injection scheme. Its effectiveness depends on the policy having been trained during the thinking-mode-fusion stage to handle the early-termination prompt gracefully, which the paper’s stage 3 explicitly does.
  • Novelty: [New] to the Qwen 3 release; no equivalent in the public Qwen 2.5 material.

7. Algorithmic contributions

ALGORITHM ENTRY 1: GRPO reasoning RL step (Qwen 3 stage 2).

  • Source: Qwen3 paper Section 4.2 3 ; GRPO paper 10 .
  • Purpose: one optimization step of the reasoning RL stage.
  • Inputs: current policy πθ\pi_\theta, reference policy πref\pi_{\text{ref}}, batch of NN verifiable prompts {xn}\{x_n\}, group size GG, verifier function r(x,y)r(x, y), learning rate η\eta, KL coefficient β\beta, clip ϵ\epsilon.
  • Outputs: updated policy parameters θ\theta.
  • Pseudocode.
for each prompt x_n in batch:
    sample G responses y_{n,1}, ..., y_{n,G} from pi_theta(.|x_n)
    score each: r_{n,i} = verifier(x_n, y_{n,i})           # 0 or 1
    compute group mean mu_n and std sigma_n of r_{n,1..G}
    for i in 1..G:
        A_{n,i} = (r_{n,i} - mu_n) / max(sigma_n, eps_std)
compute per-response log-prob ratios:
    rho_{n,i} = pi_theta(y_{n,i}|x_n) / pi_theta_old(y_{n,i}|x_n)
compute clipped policy-gradient loss:
    L_pg = - mean over n,i of min(rho_{n,i} * A_{n,i},
                                  clip(rho_{n,i}, 1-eps, 1+eps) * A_{n,i})
compute KL penalty:
    L_kl = beta * KL(pi_theta || pi_ref) averaged over batch
total loss:
    L = L_pg + L_kl
    theta = theta - eta * grad(L)
  • Hand-traced example on a minimal input. Take a batch of N=1N = 1 prompt: x=x = “What is 12+512 + 5?”, group size G=3G = 3. The policy samples three responses with chain-of-thought, scored by a verifier that checks the final numeric answer:
    • y1y_1 = “let me add … 17” → r1=1r_1 = 1
    • y2y_2 = “I think it is 18” → r2=0r_2 = 0
    • y3y_3 = “carrying through … 17” → r3=1r_3 = 1
    • Group mean μ=2/30.667\mu = 2/3 \approx 0.667, std σ0.471\sigma \approx 0.471.
    • Advantages: A1=(10.667)/0.4710.707A_1 = (1 - 0.667)/0.471 \approx 0.707, A2=(00.667)/0.4711.414A_2 = (0 - 0.667)/0.471 \approx -1.414, A30.707A_3 \approx 0.707.
    • Ratios ρi\rho_i are 1 at step 0 because the policy hasn’t changed since sampling.
    • Lpg=(1/3)(0.707+(1.414)+0.707)=0L_{\text{pg}} = -(1/3) \cdot (0.707 + (-1.414) + 0.707) = 0. The policy gradient is zero on this micro-batch because positive and negative advantages cancel; the next batch will move the policy as soon as ratios drift from 1.
    • Lkl=βKL(πθπref)L_{\text{kl}} = \beta \cdot \text{KL}(\pi_\theta \mid \mid \pi_{\text{ref}}), a small positive number anchoring the policy to the SFT model.
    • Variables after the step: θ\theta has moved by ηL\eta \cdot \nabla L. The wrong response y2y_2 becomes less likely; the right ones become more likely.
  • Complexity. Time per step: O(NGTrespcost(πθ))O(N \cdot G \cdot T_{\text{resp}} \cdot \text{cost}(\pi_\theta)) where TrespT_{\text{resp}} is the response length. Memory: dominated by the policy parameters plus GG stored responses per prompt. No critic; the value function is replaced by the within-group standardization.
  • Hyperparameters in the Qwen 3 stage-2 setup 3 .
    • 3,995 query-verifier pairs (the entire reasoning-RL dataset).
    • Group size GG: not specified in the retrievable summary.
    • β\beta: not specified.
    • ϵ\epsilon: not specified (paper does not enumerate clip threshold).
  • Failure modes. (1) All responses in a group correct or all wrong → degenerate standardization. (2) Reward hacking on verifier-only training: the policy learns superficially-correct outputs that game the verifier; standard mitigation is the KL anchor plus broader-domain general-RL in stage 4.
  • Novelty: [Adopted] from DeepSeekMath 10 .

ALGORITHM ENTRY 2: Thinking-mode dispatch at inference.

  • Source: Qwen3 paper Section 4.3 3 ; Qwen3 blog 5 .
  • Purpose: route a prompt through thinking or non-thinking mode based on the chat-template flag and optionally cap the thinking length.
  • Inputs: user prompt xx, chat-template flag {think,no_think}\in \{\text{think}, \text{no\_think}\}, optional thinking budget BB.
  • Outputs: response yy.
  • Pseudocode.
parse chat template; extract flag F in {think, no_think}
if F == no_think:
    y = pi_theta.generate(x, mode=non_thinking)
    return y
else:
    emit special token <think>
    thinking_tokens = 0
    while not end_of_thinking:
        next_token = pi_theta.sample(x + so_far)
        emit next_token
        thinking_tokens += 1
        if B is set and thinking_tokens >= B:
            inject prompt "Considering the limited time by
            the user, I have to give the solution"
            break
    emit special token </think>
    while not end_of_response:
        next_token = pi_theta.sample(x + so_far)
        emit next_token
    return y
  • Hand-traced example. User prompt: “Compute the integral of x2x^2 from 0 to 3,” chat template /think, budget B=200B = 200. Step 1: parse → F=F = think, B=200B = 200. Step 2: emit <think>. Step 3: the policy samples chain-of-thought tokens, and the variable thinking_tokens increments by 1 each iteration. Suppose at iteration 180 the policy naturally emits </think>; the loop exits at iteration 180, well under the budget. Step 4: the policy then emits the final answer “9” plus surrounding prose. Total response: a 180-token thinking block followed by a short final answer. Now consider the same prompt with B=50B = 50: at iteration 50 the runtime injects the early-termination prompt; the policy exits the thinking block and emits its best-effort answer with much less reasoning. [Analysis] In practice the thinking-budget behaviour is asymmetric: budgets well above the policy’s natural thinking length are no-ops, but budgets below it materially degrade response quality on hard problems.
  • Complexity. Time: O(Tthink+Tresp)O(T_{\text{think}} + T_{\text{resp}}), dominated by thinking length on hard problems. Memory: standard KV-cache.
  • Failure modes. (1) Mis-parsed chat template flag → wrong mode; runtime-side bug surface. (2) Thinking budget set too low → answers are confidently wrong because the chain was truncated. (3) Models trained without the stage-3 fusion pass don’t handle the early-termination prompt gracefully; the Qwen 3 paper specifically trains for this.
  • Novelty: [New] to the Qwen 3 release.

8. Specialised design contributions

Subsection 8A, LLM / prompt design. The thinking-mode chat-template flag is the load-bearing prompt design. The /think and /no_think markers in user prompts switch the model’s behaviour deterministically at runtime 5 . The paper’s Section 4.3 describes the integration but does not publish a verbatim chat template; the model card on Hugging Face 7 documents the canonical chat template, including the special tokens that wrap the thinking block. [Reconstructed] The early-termination prompt fragment “Considering the limited time by the user, I have to give the solution” is verbatim from the Qwen 3 paper Section 4.3 summary 3 .

Subsection 8B, architecture-specific details. The dense lineup’s smaller tiers (0.6B and 1.7B) use tied input-output embeddings to save parameters; the 4B and above untie them 3 . QK-Norm was added to all Qwen 3 dense and MoE models for training stability; the Qwen 2.5 lineup did not use it. Bias terms were removed from QKV projections in Qwen 3, a small deviation from Qwen 2.5 3 . The MoE blocks do not use shared experts (the DeepSeek-V3 convention 14 ); routing is purely top-kk over the 128 expert pool.

Subsection 8C, training specifics. The Qwen 3 paper does not enumerate hardware, GPU count, or wall-clock time in the public material retrievable on 2026-05-19. [Reconstructed] the broader Qwen team’s earlier disclosures place the cluster size in the multi-thousand-H800-or-equivalent range, but the technical report itself does not surface specific numbers in the abstract or the retrievable Section 3 summary. The 36-trillion-token pretraining total is split across the three stages described in Section 5.2; the 30T-plus-5T-plus-long-context decomposition is the only explicit budget breakdown 3 .

Subsection 8D, inference / deployment specifics. Both lineups ship full Apache 2.0 open weights 5 , with Hugging Face Transformers integration documented in the model cards 6 7 . The Qwen 2.5-72B-Instruct model card lists the native 32,768-token context with YARN scaling factor 4.0 to reach 131,072 tokens 6 . Qwen 3 native context spans 32K (for the 0.6B / 1.7B sizes) and 128K (for the 4B+ sizes), with YaRN extension supported on the larger tiers 7 .

9. Experiments and results

Datasets. Qwen 3’s benchmark slate includes (cited from the paper’s Section 1 and Section 3.3 summaries 3 ):

  • General knowledge: MMLU, MMLU-Pro, C-Eval, CMMLU.
  • Reasoning: GPQA Diamond.
  • Math: MATH, AIME-2024, AIME-2025.
  • Code: EvalPlus (combining HumanEval and MBPP with strengthened tests), LiveCodeBench v5, CodeForces Elo.
  • Multilingual: the paper claims 119-language coverage; specific multilingual benchmark numbers are sparser in the retrievable Section 3 summary.

Baselines. Qwen 3’s headline comparison set includes DeepSeek-R1, OpenAI o1 and o3-mini, xAI Grok-3, Google Gemini-2.5-Pro, QwQ-32B (the Qwen team’s earlier dedicated reasoning model), and the Qwen 2.5 dense lineup itself 5 .

Evaluation metrics. Pass-at-1 for math and code; standard accuracy for multiple-choice and short-answer; Elo rating for CodeForces.

Key quantitative results, Qwen3-235B-A22B-Base (cited from Qwen3 paper Section 3.3 3 ):

BenchmarkQwen3-235B-A22B-Base
MMLU87.81
MMLU-Pro68.18
GPQA47.47
MATH71.84
EvalPlus77.60

Key quantitative results, Qwen3-235B-A22B post-trained (thinking mode):

BenchmarkQwen3-235B-A22B (thinking)
AIME-202485.7
AIME-202581.5
LiveCodeBench v570.7
CodeForces Elo2,056

Cross-generation comparison. The Qwen 3 paper claims Qwen3-32B-Base outperforms Qwen2.5-72B-Base on coding and math benchmarks despite 55% fewer parameters 3 . The paper also claims Qwen3-4B-Instruct matches Qwen2.5-72B-Instruct on broad benchmarks, attributed to strong-to-weak distillation from the 235B teacher 5 .

Ablations. The paper reports the AIME-2024 lift over reasoning-RL stage 2 alone: 70.1 → 85.1 3 . This is the cleanest single-stage ablation in the retrievable summary. The four-stage versus distillation comparison is reported at the system level (small models match four-stage quality at ~10% the GPU hours).

Hyperparameter sensitivity. The retrievable paper summary does not surface sensitivity sweeps for GRPO hyperparameters (group size, β\beta, ϵ\epsilon) or the thinking-budget cap.

Qualitative results. The thinking-budget mechanism is illustrated in the paper with an example of the early-termination prompt; the runtime-control surface is the key qualitative claim.

Experimental scope limits. [Analysis] The retrievable Qwen 3 summary is comparatively light on multilingual numerical benchmarks given the 119-language framing; the headline reasoning numbers carry the launch narrative. Independent multilingual reproducibility studies on the 119-language claim are not yet in the literature at the time of writing.

Independent benchmark cross-checks for SOTA claims. [Analysis] The Qwen 3 reasoning numbers (AIME-2024 85.7, AIME-2025 81.5, LiveCodeBench v5 70.7) are competitive with DeepSeek-R1 and OpenAI o1 on the launch-time comparison; independent reproducibility on the AIME-2025 number specifically is worth watching because the benchmark was released close to the model training cutoff and contamination risk exists. The publication has not located an independent reproducibility study as of 2026-05-19; the SOTA claim is the authors’ framing on their chosen benchmark slate.

Evidence audit.

  • Strongly supported claims ([Analysis]): the dense 32B beating Qwen2.5-72B is reported on multiple benchmarks (Section 3.3); strong-to-weak distillation matching four-stage quality on small dense models is reported with cost numbers.
  • Partially supported: thinking-budget effectiveness is shown qualitatively but not via a sweep of budgets versus accuracy in the retrievable summary.
  • Narrow evidence: the 119-language multilingual claim relies on coverage rather than per-language quality numbers.

10. Technical novelty summary

ComponentTypeNovelty levelJustificationSource
Decoder-only Transformer backboneArchitectureAdoptedInherited from Llama 2 / 3 lineageSection 5.1 3
Grouped-query attentionArchitectureAdoptedAinslie et al. (2023) 9 Section 5.1
MoE with 128 experts, top-8, no shared expertsArchitectureCombination novelCombines top-kk routing with deliberate no-shared-experts designSection 5.1
QK-Norm in attentionArchitectureAdoptedTraining-stability technique from prior workSection 5.1
Thinking-mode fusion via chat-template flagInferenceFully novelFirst open-weight family to unify reasoning and non-reasoning behaviour in one weights fileSections 5.3 + 7
Thinking-budget early-terminationInferenceFully novelRuntime control over chain-of-thought lengthSection 6 MATH ENTRY 5
Four-stage post-training pipelineTrainingCombination novelComposes long-CoT SFT, GRPO, thinking-fusion, general-RLSection 5.3
GRPO objectiveTrainingAdoptedDeepSeekMath (2024) 10 Section 6 MATH ENTRY 3
Strong-to-weak distillation across model-size sweepTrainingIncrementally novelLong-established technique applied at the four-stage-pipeline boundarySection 5.3
119-language pretraining coverageDataIncrementally novelQuantitative expansion over Qwen 2.5’s 29 languagesSection 5.2
36-trillion-token pretraining budgetDataAdoptedContinuation of the open-weight token-scaling trendSection 5.2

Single most novel contribution. [Analysis] The unified thinking-and-non-thinking inference mode with a runtime thinking budget is the single most operationally important contribution of the Qwen 3 paper. It collapses what was previously a two-model deployment problem (a reasoning specialist plus a fast-chat model) into a single weights file with a runtime control surface. Every other contribution (architectural conservatism, GRPO adoption, distillation across the size sweep) has clearer prior art.

What the papers do NOT claim to be novel. The decoder-only Transformer, GQA, RoPE, SwiGLU, RMSNorm, GRPO, DPO, YARN, and Dual Chunk Attention are all explicitly adopted. The Qwen 3 paper attributes them to the prior literature where applicable.

11. Situating the work

Prior work. The Qwen 2.5 paper situates itself relative to Llama 3.1-405B (the dense open-weight benchmark of late 2024) and the prior Qwen 2 release 1 . The Qwen 3 paper situates itself against DeepSeek-R1, OpenAI o1 and o3-mini, Grok-3, Gemini-2.5-Pro, and QwQ-32B 5 .

Conceptual change. Qwen 3’s headline conceptual change is not architectural; it is the inference-control surface. By making thinking mode a chat-template flag and adding a runtime budget, the team converts what was previously a model-selection problem into a parameter on every inference call.

Two contemporaneous related papers cited.

  • DeepSeek-V3 (December 2024) 14 . The direct MoE competitor in the open-weight tier. DeepSeek-V3 ships shared experts plus routed experts (a design choice Qwen 3 deliberately rejects); both papers report MoE-at-fixed-active-budget achieving dense-flagship quality. The Qwen 3 paper does not directly cite DeepSeek-V3 in the retrievable summary but lands on essentially the same MoE design template with a deliberate divergence on shared-experts.
  • Llama 3 herd (July 2024) 15 . The dense-architecture benchmark. The Qwen 2.5 paper directly compares to Llama 3.1-405B-Instruct; Qwen 3 inherits the dense-backbone conservatism Llama 3 popularized.

[Reviewer Perspective] Strongest skeptical objection. The thinking-mode fusion’s empirical evidence in the retrievable paper summary is qualitative (early-termination example) rather than quantitative (a sweep of budget versus accuracy at multiple budgets across multiple benchmarks). A reviewer-side concern is that the budget mechanism could be a deployment convenience that does not meaningfully change the accuracy-latency Pareto frontier; that is, that a fixed budget effectively truncates reasoning at the cost of accuracy, with no gain over simply running a smaller non-thinking model at the matching latency. The paper’s strongest counter would be a budget-versus-accuracy sweep that shows graceful degradation; this is absent from the retrievable summary.

[Reviewer Perspective] Strongest author-side rebuttal grounded in the paper. The stage-3 thinking-mode-fusion training pass is explicitly designed to make the model handle the early-termination prompt gracefully. Without that pass, the budget mechanism would degrade catastrophically; with it, the policy’s behaviour when budget is hit is in-distribution.

What remains unsolved.

  • Independent reproducibility of the multilingual coverage claim across the 119-language span.
  • Independent reproducibility of the AIME-2025 numbers (released close to training cutoff; contamination risk).
  • A formal theoretical account of why the four-stage post-training pipeline outperforms a single end-to-end RL pass.
  • Multi-modal extension of the thinking-mode dispatch (Qwen 3 is text-only at the flagship tier).

Three future research directions.

  • [Analysis] A thinking-budget calibration regime that learns the budget per-prompt rather than taking a user-supplied integer.
  • [Analysis] Replacing global-batch load balance with a learned routing prior that targets specific experts for specific languages or domains.
  • [Analysis] Per-layer thinking-mode dispatch, where the chain-of-thought is interleaved across layers rather than emitted as a contiguous pre-answer block.

12. Critical analysis

Strengths.

  • Reproducible open-weight release at flagship scale. All Qwen 3 models ship under Apache 2.0 5 , with Hugging Face Transformers integration. Practitioners can self-host the 235B MoE flagship without negotiating a commercial licence.
  • Multilingual ambition. 119-language coverage is the broadest in the open-weight landscape at release.
  • Operationally novel inference control. Thinking-mode fusion plus runtime budget is a concrete deployment win.
  • Strong-to-weak distillation across the size sweep. Small dense models matching the four-stage training quality at 10% the GPU hours is a non-trivial practitioner-facing result 3 .

Weaknesses explicitly stated by the authors. The retrievable Qwen 3 summary does not contain a dedicated limitations section in the abstract or Section 1 framing. [Reconstructed] the paper likely contains one further in the body that the publication could not retrieve through the available fetchers on 2026-05-19; the body-figure backfill task surfaces this gap.

Weaknesses not stated or understated by the authors.

  • [Reviewer Perspective] Hyperparameter disclosure for GRPO (group size, β\beta, ϵ\epsilon) is sparse in the retrievable summary; reproducibility of the AIME-2024 70.1 → 85.1 jump from the 3,995 query-verifier pairs alone is harder than it should be without these numbers.
  • [Reviewer Perspective] The thinking-mode-fusion stage 3’s training data and recipe are described conceptually but the retrievable summary does not enumerate sample counts.
  • [Reviewer Perspective] No formal account of how the four-stage pipeline avoids catastrophic forgetting between stages 2 (reasoning RL) and 4 (general RL).
  • [Reviewer Perspective] Hardware and wall-clock are not disclosed in the retrievable summary; total training cost is opaque.

Reproducibility check.

  • Code: Qwen team has historically released training and inference code; the Qwen 3 release ships with Hugging Face Transformers integration and the chat-template definition in the model card 7 .
  • Data: pretraining corpus not released (standard for the open-weight tier).
  • Hyperparameters: partial; key reasoning-RL hyperparameters not disclosed in the retrievable summary.
  • Compute: not reported in the retrievable summary.
  • Trained model weights: released under Apache 2.0 at full open weights 5 7 .
  • Evaluation set: standard public benchmarks (MMLU, GPQA, AIME, MATH, EvalPlus, LiveCodeBench v5, CodeForces).
  • Overall: partially reproducible. Model inference is fully reproducible from the released weights; training pipeline reproducibility requires hyperparameter disclosure not in the retrievable summary.

Methodology callout.

  • Sample size: 3,995 query-verifier pairs for reasoning-RL stage 2; sample counts for other stages not enumerated in the retrievable summary.
  • Evaluation set: MMLU, MMLU-Pro, GPQA, MATH, AIME-2024, AIME-2025, EvalPlus, LiveCodeBench v5, CodeForces, C-Eval, CMMLU. Held-out vs training-distribution and contamination check not explicitly noted in the retrievable summary; AIME-2025 contamination risk flagged above.
  • Baselines: DeepSeek-R1, OpenAI o1 / o3-mini, Grok-3, Gemini-2.5-Pro, QwQ-32B, prior Qwen 2.5 lineup.
  • Hardware / compute: not reported in the retrievable summary.

Generalisability. [Analysis] The thinking-mode-fusion mechanism is architecturally agnostic; any decoder-only LLM trained on the appropriate stage-3 fusion data could in principle support a /think flag and a runtime budget. The chat-template wiring is not Qwen-specific. The four-stage post-training pipeline is more recipe than algorithm; it should transfer to other base models with similar pretraining budgets. The MoE no-shared-experts choice is a clearer architectural opinion that other teams may or may not adopt depending on their routing-stability profile.

Assumption audit. The strong-to-weak distillation claim relies on the teacher’s behaviour being learnable by the smaller student on the same distribution as the four-stage training would have produced. [Analysis] This is plausible on-distribution but the off-distribution behaviour of the distilled small models against agentic and tool-use tasks is an open empirical question.

What would make the papers significantly stronger. [Analysis]

  • A full hyperparameter table for the reasoning-RL stage (group size, β\beta, ϵ\epsilon, learning rate schedule).
  • A thinking-budget-versus-accuracy sweep across at least three benchmarks.
  • Per-language multilingual benchmark numbers across a representative cut of the 119 languages.
  • A formal limitations section enumerating what the four-stage pipeline does not yet solve.

13. What is reusable for a new study

REUSABLE COMPONENT 1: GRPO reasoning RL on verifiable rewards.

  • What it is: stage 2 of the four-stage pipeline; trains chain-of-thought reasoning using only a verifier function, no reward model.
  • Why worth reusing: the AIME-2024 jump from 70.1 to 85.1 over this stage alone 3 is a substantial empirical signal that verifiable-reward RL works at flagship scale.
  • Preconditions: a high-quality verifier for the target domain; an SFT-stage model as the reference policy; group size large enough to give meaningful within-group variance.
  • What would need to change in a different setting: the verifier function is domain-specific; for non-math non-code domains, the within-group standardization can fail when most responses get the same reward.
  • Risks: reward hacking on verifier-only training; standard mitigation is the KL anchor plus broader-domain general-RL stage.
  • Interaction effects: depends on the SFT cold-start producing a model that already attempts chain-of-thought; pure-pretraining base models often need an explicit cold start.

REUSABLE COMPONENT 2: Thinking-mode-fusion via chat-template flag.

  • What it is: stage 3 of the post-training pipeline; integrates two distinct inference behaviours into one weights file via prompt-template conditioning.
  • Why worth reusing: it operationally collapses the reasoning-versus-chat deployment from two endpoints to one.
  • Preconditions: a base model that already supports both behaviours after stages 1 and 2; chat-template parser at the runtime layer.
  • What would need to change: the early-termination prompt fragment is in English; multilingual deployments need translated fragments or a language-neutral special token.
  • Risks: mis-parsed chat-template flag at the runtime layer; the model handles the early-termination prompt gracefully only after the stage-3 fusion pass. Without that pass, the budget mechanism degrades catastrophically.

REUSABLE COMPONENT 3: Strong-to-weak distillation across the size sweep.

  • What it is: train the flagship via the full four-stage pipeline; train smaller models as student distillations of the teacher.
  • Why worth reusing: ~10% the GPU hours for comparable benchmark quality on the smaller tier 3 .
  • Preconditions: a strong teacher trained to convergence; a student architecture compatible with the teacher’s tokenizer.
  • What would need to change: the distillation data mixture is recipe-specific; off-distribution generalization of the student is the empirical risk.
  • Risks: the student inherits the teacher’s biases, including any reward-hacked artifacts from stage 2.

REUSABLE COMPONENT 4: MoE no-shared-experts with global-batch load balance.

  • What it is: top-8-of-128 expert routing with no shared experts, balanced via global-batch auxiliary loss.
  • Why worth reusing: a clearer design opinion than the DeepSeek-V3 shared-experts default 14 ; the Qwen 3 flagship’s performance shows it works at the 235B-A22B scale.
  • Preconditions: training infrastructure that supports global-batch loss accumulation; a routing stability regime tolerant of larger expert counts.
  • Risks: expert specialization may collapse if the auxiliary loss coefficient α\alpha is set too high; needs sweep.

Dependency map. Components 1 and 2 are independent of each other but both depend on a strong SFT cold start (an implicit reusable component itself, not separately enumerated). Component 3 depends on Components 1 and 2 having produced a strong teacher first. Component 4 is architectural and independent of the post-training stack; it could be combined with any of the three post-training components above.

Recommendation. [Analysis] The three highest-value reusable components for a new open-weight team are (1) GRPO reasoning RL on verifiable rewards, (2) thinking-mode fusion via chat-template flag, and (3) strong-to-weak distillation across the size sweep. Component 4 (MoE no-shared-experts) is valuable but requires more infrastructure than the others.

What type of new study benefits most. [Analysis] Any open-weight reasoning-LLM project below the flagship-compute tier benefits most from Components 1 and 3. Inference-stack engineering work on closed-source LLMs benefits most from the thinking-budget runtime-control surface (Component 2 conceptually, even if the weights file is different).

14. Known limitations and open problems

Limitations explicitly stated by the authors. The retrievable Qwen 3 summary does not enumerate a formal limitations section in the abstract or Section 1. [Reconstructed] the paper likely contains one further in the body that the publication could not access through the fetchers attempted on 2026-05-19.

Limitations not stated.

  • [Analysis] The four-stage pipeline depends on a high-quality verifier for stage 2; domains without one cannot use the GRPO step directly.
  • [Analysis] The thinking-budget mechanism handles a runtime token cap but not a runtime semantic stop condition (e.g., stop when the model is confident enough); the budget is a fixed integer, not adaptive.
  • [Analysis] The 119-language framing is coverage, not per-language quality; benchmark numbers on the long-tail languages are sparser in the retrievable summary.
  • [Reviewer Perspective] Catastrophic-forgetting analysis between stage 2 (reasoning RL) and stage 4 (general RL) is not in the retrievable summary; whether the general-RL stage degrades the reasoning gains is an open empirical question.
  • [Reviewer Perspective] The Qwen 2.5 paper’s full body was not retrievable on 2026-05-19; readers should consult the canonical arXiv PDF 1 for the original Qwen 2.5 limitations framing.

Technical root cause. The verifier-dependence root cause is structural to GRPO: without a verifier, the within-group standardization needs an alternative source of variance (e.g., a learned reward model). The budget-non-adaptive root cause is that the early-termination mechanism is a deterministic prompt injection rather than a learned policy decision.

Open problems left behind.

  • An adaptive thinking-budget policy that decides per-prompt how much chain-of-thought to emit.
  • A formal account of why the four-stage composition outperforms an end-to-end RL pass.
  • A multimodal extension of the thinking-mode dispatch.
  • Per-language quality numbers across the 119-language span.

What a follow-up paper would need to solve. [Analysis] The single most critical limitation to solve in a follow-up is the adaptive thinking budget. A learned per-prompt budget would close the gap between the current deterministic mechanism and an actual reasoning-economy policy.

How this article reads at three depths

For the curious high-school reader. Qwen 2.5 and Qwen 3 are two technical reports from Alibaba’s Qwen team that document how a single open-weight LLM family scaled up to 36 trillion training tokens and 119 languages. The most interesting trick in Qwen 3 is that one set of model weights can either think out loud for a long time before answering hard questions, or answer fast and short for easy questions, controlled by a simple flag in the prompt. The whole family is released for free under a permissive licence, so anyone can download the weights and run them.

For the working developer or ML engineer. Qwen 3’s headline operational claim is one weights file that supports both /think and /no_think modes via the chat template, with a runtime thinking-token budget that triggers an early-termination prompt when hit. Deployment-wise this collapses what was previously a two-endpoint problem (reasoning specialist plus fast-chat model) into a single endpoint. The MoE 30B-A3B model activates only 3B parameters per token, making it serviceable on hardware that would not run a 30B dense model at the same latency. The four-stage post-training pipeline is the recipe to study if reasoning-LLM behaviour is the goal; the strong-to-weak distillation result is the recipe to study if a smaller deployment target with flagship-like quality is the goal. Apache 2.0 weights remove the commercial-licence friction that gates Llama 4’s broader adoption.

For the ML researcher. Architecturally the Qwen lineup is conservative: GQA, SwiGLU, RoPE, RMSNorm, plus QK-Norm added in Qwen 3. The genuinely novel contributions are (1) the thinking-mode fusion via chat-template flag in Section 4.3 of the Qwen 3 paper, and (2) the four-stage post-training pipeline that composes long-CoT SFT, GRPO on verifiable rewards, thinking-mode fusion, and general RL into a single recipe. The 235B-A22B MoE flagship’s load-bearing architectural choice is no-shared-experts with global-batch load balance, a deliberate divergence from the DeepSeek-V3 convention. The strongest empirical signal worth probing in follow-up work is the reasoning-RL stage’s AIME-2024 lift from 70.1 to 85.1 over 3,995 query-verifier pairs, but the retrievable summary does not disclose group size, β\beta, or ϵ\epsilon, making independent reproduction harder than it should be. AIME-2025 contamination risk is the strongest near-term skeptical objection.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. 1. Qwen Team, "Qwen2.5 Technical Report," arXiv:2412.15115, December 2024. Abstract retrieved 2026-05-19; full PDF body did not render through the fetchers attempted. (accessed )
  2. 2. Qwen Team, "Qwen3 Technical Report," arXiv:2505.09388, May 2025. Abstract retrieved 2026-05-19. (accessed )
  3. 3. Qwen3 paper HTML render via ar5iv. Retrieved 2026-05-19 with Sections 1–4 summaries including the architecture tables, three pretraining stages, four post-training stages, and the AIME-2024 70.1→85.1 lift over reasoning-RL stage 2. (accessed )
  4. 4. Qwen Team, "Qwen2.5: A Party of Foundation Models!" official blog announcement. Documents 18T pretraining tokens, dense 0.5B through 72B lineup, 128K context, 29-language multilingual coverage. (accessed )
  5. 5. Qwen Team, "Qwen3: Think Deeper, Act Faster," official blog announcement, 29 April 2025. Confirms ~36T pretraining tokens, 119 languages, Apache 2.0 licence, `/think` and `/no_think` chat-template flags, Qwen3-4B matching Qwen2.5-72B-Instruct claim. (accessed )
  6. 6. Qwen2.5-72B-Instruct model card on Hugging Face. Documents 80 layers, 64 query heads, 8 KV heads, 72.7B total parameters, 70.0B non-embedding, native 32,768 context with YaRN scaling factor 4.0 to 131,072 tokens. (accessed )
  7. 7. Qwen3-235B-A22B model card on Hugging Face. Documents 94 layers, 64 query heads, 4 KV heads, 128 total experts with 8 activated, 235B total / 22B activated parameters, native 32,768 context extending to 131,072 with YaRN. (accessed )
  8. 8. Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," arXiv:2104.09864. Original RoPE paper. (accessed )
  9. 9. Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," arXiv:2305.13245. Original GQA paper. (accessed )
  10. 10. Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," arXiv:2402.03300. Introduces Group Relative Policy Optimization (GRPO). (accessed )
  11. 11. Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," arXiv:2305.18290. Original DPO paper. (accessed )
  12. 12. Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models," arXiv:2309.00071. The YARN context-extension method cited by the Qwen 3 long-context pretraining stage. (accessed )
  13. 13. An et al., "Training-Free Long-Context Scaling of Large Language Models," arXiv:2402.17463. Dual Chunk Attention referenced by the Qwen 3 long-context stage. (accessed )
  14. 14. DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437. Contemporaneous MoE open-weight release with shared-experts design Qwen 3 deliberately diverges from. (accessed )
  15. 15. Grattafiori et al., "The Llama 3 Herd of Models," arXiv:2407.21783. Dense-architecture benchmark of late 2024 referenced by the Qwen 2.5 paper's competitive framing. (accessed )

Anonymous · no cookies set

Report a problem with this article

Articles are produced by an autonomous AI pipeline; mistakes do happen. Tell us what's wrong and the editorial review will revisit the claim.

Category

Found this useful? Share it.