Phi-4, Phi-4-Mini, and Phi-4-reasoning: small LLMs chasing frontier reasoning

Multi-paper review of the Microsoft Phi-4 family — the 14B base model, the 3.8B Phi-4-Mini multimodal variant, and the Phi-4-reasoning SFT plus GRPO pipeline.

20 May 2026 Updated 20 May 2026 ~52 min read

1. Paper identity and scope

This multi-paper review covers three technical reports from Microsoft Research’s Phi team, published across late 2024 and 2025, that together define the “small LLM reaches for frontier reasoning” research line:

Phi-4 Technical Report, Marah Abdin et al., December 2024, arXiv:2412.08905¹. A 14-billion parameter dense decoder-only Transformer trained on roughly 10 trillion tokens with synthetic data as a deliberate first-class ingredient. Introduces Pivotal Token Search as a post-training preference-data method.
Phi-4-Mini Technical Report, Microsoft et al., March 2025, arXiv:2503.01743². A 3.8B language model plus Phi-4-Multimodal, a 5.6B unified model integrating text, vision, and speech via LoRA adapters and modality-specific routers.
Phi-4-reasoning Technical Report, Marah Abdin et al., April 2025, arXiv:2504.21318³. Supervised fine-tuning of Phi-4 on 16K curated “teachable” prompts using o3-mini reasoning traces, followed by a short GRPO reinforcement-learning phase to produce Phi-4-reasoning-plus.

Retrieval confirmation: all three arXiv abstracts and ar5iv HTML renders fetched 2026-05-20; the Phi-4-reasoning PDF was additionally fetched from Microsoft Research’s hosted copy⁴ because the arXiv ar5iv render returned a conversion error. Hugging Face model cards⁵⁶ supplement training-time details that the technical reports compressed.

Paper classification across all three: Architecture proposal (Phi-4-Multimodal modality routing) · Training method (synthetic data recipe, SFT curation) · Inference method (long reasoning chains) · Optimisation (Pivotal Token Search, GRPO) · LLM-based · Data-driven · Application.

Technical abstract in the publication’s voice: the Phi-4 line argues that a 14-billion parameter dense Transformer, trained on a deliberately synthetic-data-heavy corpus and post-trained with a reasoning-targeted SFT plus a narrow GRPO RL phase, can match or approach frontier reasoning models that are 5-50 times larger on math, science, and coding benchmarks. The Phi-4-Mini and Phi-4-Multimodal extensions push the same recipe to the 3.8B parameter scale and add vision and speech via mixture-of-LoRAs. The papers collectively reframe the “scaling laws” debate: data quality and post-training curation, not just parameter count, drive reasoning capability.

Primary research question: can a small dense Transformer reach reasoning-task parity with much larger models through careful training-data composition and reasoning-targeted post-training?

Core technical claim: yes, on a defined set of benchmarks. Phi-4 14B reaches 80.4 on the MATH benchmark⁷, outperforming its 70-billion parameter Llama-3.3 contemporary (66.3) and roughly matching GPT-4o (74.6); Phi-4-reasoning-plus 14B reaches 81.3 on AIME 2024 and 68.9 on GPQA-Diamond⁵, surpassing DeepSeek-R1-Distill-Llama-70B and approaching full DeepSeek-R1.

Core technical domains and depth: Language-model architecture (surface — dense decoder, GQA, RoPE, all standard). Synthetic data generation (deep — the headline contribution). Preference-learning post-training (moderate — DPO with Pivotal Token Search). Reinforcement-learning fine-tuning (moderate — GRPO with rule-based reward). Multimodal fusion (moderate — mixture-of-LoRAs is novel).

Reader prerequisites: high-school algebra plus willingness to learn the vocabulary the Glossary covers. Familiarity with neural-network basics is helpful but not required because the Glossary covers them. The article assumes nothing beyond that.

2. TL;DR and executive overview

TL;DR. Microsoft’s Phi-4 line shows that a 14-billion parameter language model — far smaller than the headline frontier models — can match or come close to frontier reasoning performance on math and science problems if trained carefully on synthetic data and post-trained with reasoning examples. The three papers tell a layered story: Phi-4 establishes the base, Phi-4-Mini scales the recipe down to 3.8 billion parameters and adds vision and speech, and Phi-4-reasoning shows how a short supervised fine-tuning pass plus a small dose of reinforcement learning turns the base model into a competitive reasoning model. The bet is that data quality and post-training curation matter at least as much as raw parameter count.

Executive summary. The Phi-4 family is Microsoft Research’s continuing argument that small dense Transformers can chase frontier reasoning through synthetic-data-heavy pretraining and targeted post-training. Phi-4 (14B) is trained on roughly 10 trillion tokens, 40% of them synthetic, and introduces Pivotal Token Search to focus preference-learning signal on the tokens that flip a math solution’s correctness. Phi-4-Mini (3.8B) shrinks the recipe and Phi-4-Multimodal (5.6B) extends it across vision and speech via parameter-efficient LoRA adapters. Phi-4-reasoning then shows that 16K supervised reasoning examples plus 90 steps of GRPO reinforcement learning on math problems is enough to lift the 14B base to AIME and GPQA scores within striking distance of DeepSeek-R1. The practitioner takeaway: a 14B model is now genuinely competitive in math, science, and code reasoning, with all weights released under MIT license.

Five practitioner takeaways.

Synthetic data is no longer a curiosity. Phi-4’s training mix is 40% synthetic, with the synthetic portion seen for 13.8 epochs — repeated exposure to high-quality synthetic material beats one-pass over fresh low-quality web tokens, per the paper’s ablation.
Pivotal Token Search is a reusable post-training technique. Most DPO pairs spend their preference signal on tokens that do not change the answer. PTS isolates the tokens that do.
Reasoning post-training works at 14B scale with modest compute. Phi-4-reasoning trains on 32 H100 GPUs for 2.5 days; the SFT corpus is 16B tokens with about 8.3B unique. Substantially cheaper than full R1-scale RL.
GRPO on math generalises to non-math reasoning. The Phi-4-reasoning-plus RL phase trains exclusively on mathematical problems but the model’s GPQA-Diamond and LiveCodeBench scores also improve.
Mixture-of-LoRAs is a clean multimodal recipe. Phi-4-Multimodal keeps the language backbone frozen and attaches modality-specific LoRA adapters of rank 320, processed by modality-specific routers, so vision, speech, and language do not interfere at the parameter level.

Pipeline overview. Training-time the three models share a common skeleton: (a) pretrain a dense decoder-only Transformer on a data mix engineered for reasoning, (b) post-train with supervised fine-tuning on instruction-following plus reasoning demonstrations, (c) preference-learning via DPO (Phi-4 base, with PTS) or GRPO (Phi-4-reasoning-plus). Inference-time Phi-4 and Phi-4-Mini are standard chat models; Phi-4-reasoning emits a long chain-of-thought wrapped in dedicated thought tags followed by a final answer, similar to DeepSeek-R1’s protocol.

2.5. Glossary

Read this section before reading the technical sections that follow. It is the dictionary the rest of the article assumes you have.

Term	Plain-English explanation	First appears in
Dense Transformer	A neural network architecture where every input token interacts with every layer’s full parameter set; contrasts with Mixture-of-Experts where only a subset activates per token.	Section 1
Parameter (weight)	A single tunable number inside the neural network. “14B parameters” means 14 billion tunable numbers.	Section 1
Pretraining	The first training phase where the model learns from a large unlabelled text corpus by predicting the next token.	Section 1
Post-training	Any training that happens after pretraining: supervised fine-tuning, preference learning, reinforcement learning.	Section 1
Synthetic data	Text generated by an existing language model, then used to train another. Phi-4 uses synthetic data for 40% of its pretraining mix.	Section 2
SFT (supervised fine-tuning)	Training the model to imitate good example outputs by minimising the cross-entropy loss on a curated dataset of input-output pairs.	Section 2
DPO (Direct Preference Optimization)	A post-training method that teaches the model to prefer “winning” responses over “losing” ones without training a separate reward model.	Section 5
GRPO (Group Relative Policy Optimization)	A reinforcement-learning algorithm from DeepSeekMath that estimates the advantage from a group of sampled responses instead of training a separate value network.	Section 5
Reasoning chain	A long step-by-step explanation the model generates before its final answer; longer chains usually help on hard math problems.	Section 5
Pivotal Token Search (PTS)	The Phi-4 paper’s method for finding the specific tokens in a reasoning trace whose probability has an outsized effect on whether the final answer is correct.	Section 6
LoRA adapter	Low-Rank Adaptation — a small set of additional parameters bolted on to a frozen base model so you can specialise it without retraining the whole thing.	Section 6
GQA (Group Query Attention)	An attention variant that shares key/value computations across groups of query heads, reducing KV-cache memory for inference.	Section 6
AIME / GPQA-Diamond	AIME is a US high-school math olympiad; GPQA-Diamond is a graduate-level science multiple-choice benchmark; both are standard reasoning evaluations.	Section 9
”From the paper:” prefix	Content directly supported by the paper’s text, tables, or figures.	Throughout
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the paper proves.	Sections 11 + 12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the paper only partially disclosed it.	Where used
`[External comparison]` label	A comparison to prior work or general knowledge outside the paper itself.	Sections 4 + 11

3. Problem formalisation

The three papers attack overlapping but distinct problems. The umbrella problem is: train a language model with $N$ parameters and training budget $B$ such that performance on a target reasoning-task distribution $\mathcal{D}_{\text{reason}}$ is maximised, subject to $N$ being substantially smaller than frontier-model parameter counts.

Notation table.

Symbol	Type	Meaning	First appears in
$\theta$	Vector	Model parameters	Section 3
$\pi_\theta$	Function	The model’s conditional distribution over next tokens given context	Section 3
$\pi_{\text{ref}}$	Function	The reference (frozen) model used in DPO / GRPO	Section 6
$y$	Sequence	A model-generated token sequence (response or reasoning trace)	Section 3
$x$	Sequence	An input prompt	Section 3
$r(x, y)$	Scalar	Reward (rule-based or learned) assigned to response $y$ given prompt $x$	Section 6
$p_{\text{gap}}$	Scalar	Probability gap threshold for Pivotal Token Search	Section 6
$A_i$	Scalar	Advantage of the $i$ -th sampled response in a GRPO group	Section 6
$\beta$	Scalar	KL regularisation strength in DPO / GRPO	Section 6
$T$	Integer	Reasoning chain length in tokens	Section 6
$N$	Integer	Total parameters	Section 3

Formal problem statement. Given a target distribution of reasoning tasks $\mathcal{D}_{\text{reason}}$ and a parameter budget $N$ , find $\theta^\ast$ that maximises expected task success:

$\theta^\ast = \arg\max_\theta \, \mathbb{E}_{(x, y^\star) \sim \mathcal{D}_{\text{reason}}} \, \big[ \, \mathbb{1}[\text{verify}(y^\star, \pi_\theta(x))] \, \big]$

where $\text{verify}(\cdot)$ is a task-specific correctness check (e.g., exact-match for numerical AIME answers, unit tests for code, multiple-choice match for GPQA).

The Phi-4 papers operate under a fixed- $N$ constraint (14B for Phi-4 and Phi-4-reasoning, 3.8B for Phi-4-Mini, 5.6B for Phi-4-Multimodal). The free variables are the pretraining data mix, the post-training data curation, and the post-training algorithm.

Assumptions made by the papers.

Synthetic data generated by larger models (notably GPT-4 and o3-mini) carries enough signal to train a smaller model toward the teacher’s capability. From the paper: Phi-4 cites the synthetic data as drawn from GPT-4-class generators; Phi-4-reasoning cites o3-mini explicitly as the source of reasoning demonstrations. [Analysis] Potentially strong assumption — this folds the whole “knowledge distillation” question into a single design choice.
Decontamination via n-gram overlap is sufficient to prevent benchmark leakage. The Phi-4 paper describes a 13-gram-and-7-gram hybrid filter across 19 benchmarks. [Analysis] Potentially strong assumption — n-gram filters can miss semantic paraphrases of contamination sources.
Test-set distributions (MMLU, GPQA, AIME, LiveCodeBench) faithfully represent generalisable reasoning ability rather than narrow benchmark-specific skills. This is the standard assumption across the field but increasingly contested in 2025-2026 commentary.
For Phi-4-reasoning, supervised fine-tuning on o3-mini outputs is treated as an upper-bound on what the SFT phase can teach; the GRPO phase then pushes past that bound. From the paper.

Complexity arguments. Training cost scales with parameter count $N$ times tokens trained $D$ . Phi-4 trains on $D \approx 10^{13}$ tokens at $N = 1.4 \times 10^{10}$ parameters, putting the FLOPs cost in the same band as Llama-3 8B training. Phi-4-reasoning’s SFT phase trains on $D \approx 1.6 \times 10^{10}$ tokens (about 8.3B unique) over 2.5 days on 32 H100s, a much lighter post-training budget than DeepSeek-R1’s full RL run.

If reasoning-specific: the reasoning chain length $T$ is itself a learned quantity. Phi-4-reasoning-plus generates longer chains than Phi-4-reasoning, and the chain-length distribution is shaped by the GRPO reward, which penalises excessive length per the model card and paper.

4. Motivation and gap

The motivating problem: frontier reasoning capability has been gated behind models of 70B-700B parameters and proprietary RL pipelines. DeepSeek-R1, OpenAI’s o1 / o3-mini, Claude 3.7 Sonnet’s extended thinking mode — all push the state of the art on math olympiad benchmarks but at substantial inference cost and (for the proprietary ones) zero weight release. Practitioners wanting open-weight reasoning models had limited options until the DeepSeek-R1-Distill series and the Phi-4-reasoning release in April 2025.

The gap each paper claims to fill:

Phi-4 argues that a 14B dense Transformer trained correctly can outperform open-weight 70B models on STEM benchmarks. The headline number is MATH 80.4 vs Llama-3.3-70B 66.3⁷. [External comparison] This is a continuation of the “data quality scales differently from parameter count” thesis the earlier Phi-1, Phi-2, and Phi-3 papers built⁸.
Phi-4-Mini scales the recipe down to 3.8B and adds multimodal coverage. The gap: open-weight 3-4B class models with strong math + code performance plus unified vision and speech.
Phi-4-reasoning addresses the post-training half. The gap: how do you turn a strong base model into a reasoning specialist without the full DeepSeek-R1-scale RL run? The paper answers with a small curated SFT corpus plus a short GRPO phase.

Practical stakes: frontier inference cost is the main blocker for production reasoning deployment. A 14B model fits on a single H100 or even a consumer GPU at 4-bit quantisation; a 70B model needs a multi-GPU node. If the Phi-4 family genuinely matches 70B reasoning quality, the deployment cost falls by an order of magnitude. [Analysis] This is the genuine practitioner stake — the paper’s value proposition lives or dies on independent reproduction.

[External comparison] Positioning in the landscape: the Phi-4 line sits adjacent to the DeepSeek-R1-Distill family⁹, the Qwen 2.5 reasoning variants, and the smaller Llama-3 distilled models. The Phi family’s distinguishing posture is the synthetic-data-first pretraining philosophy that goes back to the original “Textbooks Are All You Need” paper.

5. Method overview

The three papers share a method-stack skeleton with paper-specific innovations layered on top. This section walks through the stack from pretraining to inference.

Stage 1: pretraining (all three papers).

Plain-English intuition: feed the model a curated diet of text and code, scheduled across multiple training phases, where the curation is the secret sauce. The paper claim is that synthetic data — text generated by GPT-4-class models, specifically engineered to be pedagogically rich on STEM topics — is more efficient per training token than raw web text.

Phi-4’s training mix (from the paper, Section 2):

40% synthetic data, approximately 290B unique tokens seen 13.8 times each.
30% filtered/rewritten web data, split 15% web and 15% rewrites.
20% code data, approximately 820B tokens.
10% acquired sources (books, papers, structured data), approximately 580B tokens.

The total training token count is approximately 10 trillion. The peak learning rate is $3 \times 10^{-4}$ with linear warmup-then-decay, weight decay 0.1, batch size 5760. Context window during pretraining is 4096 tokens, extended to 16384 during a midtraining phase.

Connection to full pipeline: this stage is the “general capability” base. Phi-4-reasoning starts from this checkpoint; the synthetic-data emphasis is the reason the base model has strong math intuition before any reasoning-specific post-training.

Design rationale: the paper argues, with an ablation in Figure 2 of the report, that 13.8 epochs of high-quality synthetic data outperforms one-pass on web tokens. [Analysis] This is consistent with the broader “less data more passes” finding in the 2024-2025 literature when data quality is high.

Classification: [Adapted] from the Phi-1, Phi-2, Phi-3 line; the synthetic-data philosophy is the franchise’s hallmark.

Stage 2: post-training supervised fine-tuning.

Plain-English intuition: show the model lots of examples of “here’s a question, here’s the ideal reasoning chain and answer” and train it to imitate them.

For Phi-4 base, SFT uses an instruction-following mix. For Phi-4-reasoning, SFT uses 16K “teachable” prompts where the reasoning demonstrations are generated by o3-mini, then filtered. The “teachable” filter selects prompts that are neither trivially solvable by the base model nor utterly impossible — the calibrated-difficulty middle band where supervised signal teaches the most. From the paper: learning rate $10^{-5}$ , linear warmup over 450 steps, weight decay $10^{-4}$ .

Connection: the SFT stage is where the model learns the long-reasoning protocol — emit a <think> block, work the problem step by step, then emit a final answer. The protocol is taught by example.

What breaks if removed: the base model can be coerced into long reasoning via prompting alone but with substantially weaker per-token efficiency. The SFT phase establishes the reasoning chain as a default mode.

Classification: [Adapted]. SFT on teacher-generated reasoning chains is the standard distillation pattern.

Stage 3: preference learning (Phi-4) or RL (Phi-4-reasoning-plus).

For Phi-4 base, post-training uses Direct Preference Optimization with preference pairs constructed via Pivotal Token Search (see Section 6 below). For Phi-4-reasoning-plus, the preference-learning phase is replaced with GRPO reinforcement learning on a 72,401-problem math seed dataset, with 64 problems subsampled per RL iteration over roughly 90 steps. The reward function is rule-based: $+1$ for a correct final answer, $-0.5$ for a wrong answer, plus formatting and length penalties.

Plain-English intuition for GRPO: sample several reasoning chains per problem, score each one, and update the model to prefer the high-scoring chains over the low-scoring ones. The “Group Relative” part is that the advantage of each sample is computed relative to its peers within the same group, so you do not need a separate value network.

Classification: [Adopted] from DeepSeekMath¹⁰, which introduced GRPO. The Phi-4-reasoning paper adapts the reward function but not the core algorithm.

Stage 4: multimodal extension (Phi-4-Multimodal).

Phi-4-Multimodal keeps the Phi-4-Mini language backbone frozen and attaches modality-specific LoRA adapters: a vision LoRA adapter of approximately 370M parameters paired with a SigLIP-400M vision encoder, and a speech LoRA adapter of approximately 460M parameters paired with a conformer-based speech encoder. Modality-specific routers select the relevant adapter path at inference time. Total parameter count is 5.6B, with the LoRA adapters at rank 320.

Plain-English intuition: the model can “see” or “hear” by routing visual or audio embeddings through a small adapter that translates them into the language model’s representation space, without disturbing the language model’s text-only capabilities.

Connection to full pipeline: this is a parameter-efficient design choice. Vision and speech each add roughly 800-900M parameters of adapter plus encoder, compared to retraining a full 14B multimodal model from scratch.

Classification: [Adapted]. Mixture-of-LoRAs is novel in the specific configuration but the underlying LoRA technique is borrowed.

6. Mathematical contributions

This section gives the precise definitions, worked numerical examples, and proof sketches for the key mathematical objects across the three papers.

MATH ENTRY 1: Pivotal Token Search (PTS) probability gap criterion.

Source: Phi-4 Technical Report, Section 3 (Post-training).
What it is: a method for finding, within a reasoning chain that the model generated, the specific tokens whose probability has an outsized effect on whether the final answer comes out right.
Formal definition. For a prompt $x$ and a completion $y = (y_1, y_2, \ldots, y_T)$ , with a verifier $v(y) \in \{0, 1\}$ indicating final-answer correctness, the pivotal token criterion identifies index $t^\ast$ where the conditional success probability shifts most:

$t^\ast = \arg\max_t \, \big| \, \Pr[v(y) = 1 \mid y_{\le t}] - \Pr[v(y) = 1 \mid y_{\le t-1}] \, \big| \ge p_{\text{gap}}$

Each term explained and dimensionally analysed:
- $y_{\le t}$ is the prefix of the completion through token $t$ , a sequence of length $t$ .
- $\Pr[v(y) = 1 \mid y_{\le t}]$ is the success probability conditional on the prefix; estimated by Monte-Carlo rollout. A scalar in $[0, 1]$ .
- $p_{\text{gap}}$ is the threshold parameter, a scalar (the paper uses values in the 0.2-0.5 band per Figure 3 of the report).
- $t^\ast$ is an integer in $[1, T]$ — the pivotal token index.
Worked numerical example. Take a math problem and a sampled completion of length $T = 20$ tokens. The Monte-Carlo rollout estimates conditional success probabilities token by token, giving the sequence $[0.45, 0.46, 0.47, 0.46, 0.48, 0.72, 0.71, 0.73, 0.72, 0.71, 0.70, 0.71, 0.69, 0.85, 0.84, 0.85, 0.86, 0.85, 0.86, 0.86]$ . The first big jump is between tokens 5 and 6, from 0.48 to 0.72 (gap 0.24); the second is between tokens 13 and 14, from 0.69 to 0.85 (gap 0.16). With $p_{\text{gap}} = 0.2$ , only token 6 qualifies as pivotal. PTS then constructs a preference pair where the “winning” branch follows the high-probability continuation at position 6 and the “losing” branch follows the low-probability continuation, isolating the DPO gradient onto that single token.
Role: PTS-generated preference pairs feed DPO. The hypothesis is that gradient updates concentrated on pivotal tokens are more sample-efficient than DPO over whole responses, where most tokens contribute negligibly to the answer’s correctness.
Edge cases. If no token’s gap exceeds $p_{\text{gap}}$ , the completion is skipped; if the verifier is undefined (e.g., free-form generation with no clear correctness signal) the method does not apply.
Novelty: [New]. The Phi-4 paper is the first to articulate this token-localised DPO variant under the Pivotal Token Search name.
Transferability: [Analysis] PTS is reusable wherever a verifier is available and reasoning chains are sufficiently long. Code generation with unit tests, math with exact-match, multiple-choice QA — all candidates. Not applicable to open-ended generation where verifier signal is absent.
Why it matters: standard DPO over entire long reasoning chains dilutes the preference signal across thousands of tokens whose correctness contribution is near zero. PTS focuses the signal where it actually moves the answer.

MATH ENTRY 2: DPO loss (used with PTS pairs in Phi-4).

Source: Phi-4 Technical Report Section 3; the underlying DPO formulation is from Rafailov et al. 2023.
What it is: a loss function that pushes the model to make winning responses more likely and losing responses less likely, relative to a frozen reference model, without training a separate reward model.
Formal definition:

$\mathcal{L}\_{\text{DPO}}(\theta) = -\mathbb{E}\_{(x, y\_w, y\_l) \sim \mathcal{D}} \left[ \log \sigma \big( \beta \log \frac{\pi\_\theta(y\_w \mid x)}{\pi\_{\text{ref}}(y\_w \mid x)} - \beta \log \frac{\pi\_\theta(y\_l \mid x)}{\pi\_{\text{ref}}(y\_l \mid x)} \big) \right]$

Each term explained and dimensionally analysed:
- $y_w$ is the preferred (winning) response, a sequence.
- $y_l$ is the dispreferred (losing) response, a sequence.
- $\pi_\theta(y \mid x)$ is the model’s probability of generating $y$ given $x$ , a scalar in $[0, 1]$ .
- $\pi_{\text{ref}}(y \mid x)$ is the same probability under the frozen reference model.
- $\beta$ is the KL regularisation strength, a scalar (Phi-4 uses values in the 0.1-1.0 band depending on phase).
- $\sigma$ is the logistic sigmoid, mapping a real number into $[0, 1]$ .
Worked numerical example. Take a single preference pair $(x, y_w, y_l)$ . Suppose $\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} = 0.5$ — the model already prefers $y_w$ slightly more than the reference does. And $\log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} = -0.3$ — the model already disprefers $y_l$ slightly more. With $\beta = 0.5$ , the argument of $\sigma$ is $0.5 \cdot 0.5 - 0.5 \cdot (-0.3) = 0.25 + 0.15 = 0.40$ . Then $\sigma(0.40) \approx 0.599$ , so the per-example DPO loss is $-\log(0.599) \approx 0.512$ . Gradient steps on this loss push the model to further increase the winning log-probability and decrease the losing log-probability, with the KL anchor preventing the model from drifting too far from the reference.
Role: the loss function that the Phi-4 post-training optimiser minimises across PTS-generated preference pairs.
Edge cases. When $\pi_\theta(y_w \mid x) / \pi_{\text{ref}}(y_w \mid x)$ is already extreme (very large or very small), the sigmoid saturates and the gradient is tiny; this is the “reward hacking” pathology DPO can exhibit and is partly what the rule-based reward in GRPO is designed to side-step.
Novelty: [Adopted] from Rafailov et al. 2023. Phi-4’s contribution is the PTS pair-construction pipeline, not the DPO loss itself.
Transferability: [Analysis] standard wherever preference data is available.
Why it matters: DPO is the workhorse preference-learning loss in 2024-2025 open-weight model training; understanding it is necessary to understand what PTS is doing.

MATH ENTRY 3: GRPO objective (used in Phi-4-reasoning-plus).

Source: Phi-4-reasoning Technical Report Section 4; the underlying algorithm is DeepSeekMath’s GRPO.
What it is: a reinforcement-learning algorithm that estimates the advantage of each sampled response from its peers within a group, eliminating the need for a separate value (critic) network.
Formal definition. For a prompt $x$ , the policy samples a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$ . Each receives a reward $r_i = r(x, y_i)$ from the rule-based verifier. The group-relative advantage is:

$A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$

The GRPO objective is then:

$\mathcal{J}\_{\text{GRPO}}(\theta) = \mathbb{E}\_{x, \{y\_i\}} \left[ \frac{1}{G} \sum\_{i=1}^G \min \big( \rho\_i A\_i, \, \text{clip}(\rho\_i, 1-\epsilon, 1+\epsilon) A\_i \big) - \beta \, D\_{\text{KL}}(\pi\_\theta \,\|\, \pi\_{\text{ref}}) \right]$

where $\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$ is the importance-sampling ratio.

Each term explained and dimensionally analysed:
- $G$ is the group size (number of responses sampled per prompt); the Phi-4-reasoning paper uses values in the 8-16 band per the GRPO convention.
- $r_i$ is the scalar reward for response $i$ : $+1$ if the answer is correct, $-0.5$ if wrong, with additional formatting and length penalties.
- $A_i$ is the standardised group-relative advantage.
- $\rho_i$ is the importance-sampling ratio between the current and old policies.
- $\epsilon$ is the PPO-style clip range, a scalar around 0.2.
- $\beta$ is the KL regularisation strength, a scalar.
Worked numerical example. Take $G = 4$ samples for one math problem. The reward vector is $\{r_1, r_2, r_3, r_4\} = \{+1, -0.5, +1, -0.5\}$ (two correct, two wrong). Mean is $0.25$ ; standard deviation is approximately $0.866$ . Advantages: $A_1 = (1 - 0.25) / 0.866 \approx 0.866$ , $A_2 = (-0.5 - 0.25) / 0.866 \approx -0.866$ , $A_3 \approx 0.866$ , $A_4 \approx -0.866$ . Suppose all $\rho_i = 1$ (first step of the inner-loop SGD). The policy-gradient step then pushes the model to make response 1 and response 3 more likely and response 2 and response 4 less likely, with magnitude proportional to the advantage. The KL term anchors the updated policy near $\pi_{\text{ref}}$ .
Role: this is the main optimisation objective for Phi-4-reasoning-plus during its short RL phase.
Edge cases. If all $G$ samples receive the same reward, $\text{std}(\cdot) = 0$ and the advantage is undefined; common practice (and the DeepSeekMath convention) is to skip the gradient step for that group or add a small floor to the denominator.
Novelty: [Adopted] from Shao et al. DeepSeekMath. Phi-4-reasoning’s contribution is the small data + few-steps regime: 72,401 math problems, 64 problems per RL iteration, roughly 90 steps total.
Transferability: [Analysis] GRPO is generally applicable to any task with a verifier and group-sampling capacity. Hardware cost is dominated by the $G$ -times rollout per prompt; for $G = 16$ and a 14B model with 32K context, each step is non-trivial.
Why it matters: GRPO is the algorithm that powered DeepSeek-R1’s reasoning capability and is now arguably the dominant open-weight RL recipe for reasoning models. Phi-4-reasoning shows that a tiny budget (90 steps) is enough at 14B scale.

MATH ENTRY 4: KL-divergence anchor (used in both DPO and GRPO).

Source: standard from Rafailov et al. 2023 (DPO) and the PPO/GRPO lineage.
What it is: a penalty term that prevents the fine-tuned model from drifting too far from a reference model. Stops the policy from collapsing to degenerate outputs that exploit reward-model weaknesses.
Formal definition:

$D\_{\text{KL}}(\pi\_\theta \,\|\, \pi\_{\text{ref}}) = \mathbb{E}\_{y \sim \pi\_\theta} \left[ \log \frac{\pi\_\theta(y \mid x)}{\pi\_{\text{ref}}(y \mid x)} \right]$

Each term: standard. $\pi_\theta$ is the trained policy, $\pi_{\text{ref}}$ is the reference (typically the SFT-stage checkpoint), $y$ is a sampled response.
Worked numerical example. Suppose for a given $x$ , both $\pi_\theta$ and $\pi_{\text{ref}}$ put most mass on the same response $y$ with $\pi_\theta(y \mid x) = 0.6$ , $\pi_{\text{ref}}(y \mid x) = 0.5$ . The contribution to the KL is $0.6 \cdot \log(0.6 / 0.5) = 0.6 \cdot 0.1823 \approx 0.1094$ . Summed across the response distribution, the KL is small as long as $\pi_\theta$ stays close to $\pi_{\text{ref}}$ .
Role: the regularisation term that defines “how much can the model change before the post-training breaks the base model’s general capabilities”.
Novelty: [Adopted].
Why it matters: in practice, the KL coefficient $\beta$ is the most sensitive post-training hyperparameter. Phi-4-reasoning explicitly tunes it during the GRPO phase per the paper.

MATH ENTRY 5: Group Query Attention (Phi-4-Mini architecture).

Source: Phi-4-Mini Technical Report Section 2; underlying GQA is from Ainslie et al. 2023.
What it is: an attention variant where multiple query heads share the same key and value head, reducing the KV-cache memory footprint at inference time without large quality loss.
Formal definition. In standard multi-head attention with $H$ heads, each head computes:

$\text{Attn}_h(Q\_h, K\_h, V\_h) = \text{softmax}\big(Q\_h K\_h^\top / \sqrt{d\_k}\big) V\_h$

In GQA, the $H$ query heads are partitioned into $G$ groups, with each group sharing one $(K, V)$ pair. Phi-4-Mini uses 24 query heads and 8 KV heads, so each of the 8 KV pairs is shared by 3 query heads.

Each term explained and dimensionally analysed.
- $H = 24$ query heads in Phi-4-Mini.
- $G = 8$ KV groups.
- $d_k = d_{\text{model}} / H = 3072 / 24 = 128$ per-head dimension.
- Standard MHA KV-cache size scales as $H$ ; GQA scales as $G = H/3$ , a 3x reduction in KV memory.
Worked numerical example. For a Phi-4-Mini inference run with batch size 1, sequence length 32768 (the full context), and float16 KV cache: standard MHA KV memory is $2 \cdot 32 \cdot 32768 \cdot 24 \cdot 128 \cdot 2 = 12.9$ GB (the leading 2 is for K and V; 32 is the layer count, 32768 sequence length, 24 heads, 128 per-head, 2 bytes per float16). With GQA at 8 KV heads, this falls to $2 \cdot 32 \cdot 32768 \cdot 8 \cdot 128 \cdot 2 = 4.3$ GB, a 3x reduction that makes 32K-context inference feasible on a single consumer GPU.
Role: lets Phi-4-Mini support 128K context with manageable KV memory; the paper extends context via LongRoPE on top of GQA.
Novelty: [Adopted] from Ainslie et al. The Phi-4-Mini contribution is the specific 24/8 ratio.
Why it matters: KV-cache memory is the dominant cost in long-context inference; GQA is the standard cheap fix.

[Analysis] Five MATH ENTRIES cover the key mathematical machinery. The Phi-4 line is not heavy on novel theory — its contributions are recipes and curated objects (data mixes, SFT prompt selection, reward functions) rather than new losses or proofs. The math entries above cover the load-bearing pieces.

7. Algorithmic contributions

ALGORITHM ENTRY 1: Pivotal Token Search.

Source: Phi-4 Technical Report Section 3, Algorithm 1 (Figure 4 in the paper).
Purpose: identify the pivotal tokens within a completed reasoning chain and construct DPO preference pairs targeted at those positions.
Inputs:
- Prompt $x$ (sequence of tokens).
- Model $\pi_\theta$ (the policy to be improved).
- Verifier $v(\cdot) \in \{0, 1\}$ (correctness oracle).
- Probability-gap threshold $p_{\text{gap}}$ (scalar, typically 0.2-0.5).
- Monte-Carlo rollout count $K$ per position (typically 16-32).
Outputs:
- List of preference pairs $\{(x, y_w, y_l)\}$ where $y_w$ continues from the high-probability branch at the pivotal token and $y_l$ continues from the low-probability branch.

Pseudocode (faithful reconstruction; the paper Figure 4 specifies the recursion structure):

function PivotalTokenSearch(x, pi, v, p_gap, K):
    pairs = []
    Sample a completion y = (y_1, ..., y_T) from pi conditioned on x
    Estimate p_t = Pr[v(y) = 1 | y_{<=t}] for each t in [1..T]
        via K Monte-Carlo rollouts from each prefix
    for t in [1..T]:
        if |p_t - p_{t-1}| >= p_gap:
            # Token t is pivotal
            Construct y_w: prefix y_{<t} + high-prob next token + completion
            Construct y_l: prefix y_{<t} + low-prob next token + completion
            pairs.append((x, y_w, y_l))
    return pairs

Hand-traced example. Take a one-step solution to a small algebra problem. Prompt $x$ = “Solve for x: 2x + 3 = 11”. Sample $y$ = “Subtract 3 from both sides: 2x = 8. Divide by 2: x = 4”. Suppose $T = 14$ tokens and Monte-Carlo rollout produces success probabilities by token-prefix as $[0.50, 0.50, 0.78, 0.79, 0.78, 0.80, 0.80, 0.81, 0.82, 0.82, 0.93, 0.94, 0.94, 0.94]$ . The big jumps are between positions 2 and 3 (0.50 to 0.78, gap 0.28) and between positions 10 and 11 (0.82 to 0.93, gap 0.11). With $p_{\text{gap}} = 0.20$ , only position 3 qualifies. PTS then samples an alternative completion at position 3, ensures the alternative yields lower success probability via Monte-Carlo, and emits the preference pair (winning = original continuation, losing = alternative continuation). The DPO gradient is now concentrated on the single token at position 3, where the math operation actually mattered.
Complexity:
- Time: $O(T \cdot K \cdot C)$ per completion, where $C$ is the cost of one model rollout from a prefix.
- Space: $O(T)$ for the probability table.
- Bottleneck: the $K$ Monte-Carlo rollouts per position are the dominant cost. For $T = 1000$ and $K = 16$ this is 16K rollouts per training example.
Hyperparameters:
- $p_{\text{gap}}$ — threshold for pivotal classification. Larger threshold = fewer but cleaner pivots.
- $K$ — Monte-Carlo rollout count. Larger = more accurate probability estimates, higher cost.
- Maximum recursion depth (the paper formulation recursively subdivides the completion if no single-step pivot is found).
Failure modes. If the verifier is too noisy, the probability estimates are unreliable and PTS produces garbage pairs. If the completion has many small jumps and no single large one, the threshold may filter out all candidates.
Novelty: [New]. PTS is the Phi-4 paper’s headline algorithmic contribution.
Transferability: [Analysis] reusable wherever a verifier and a model that produces long reasoning chains exist. Code generation with unit tests is a natural target.

ALGORITHM ENTRY 2: Phi-4-reasoning SFT + GRPO pipeline.

Source: Phi-4-reasoning Technical Report, Sections 3-4.
Purpose: turn Phi-4 base into a reasoning specialist via supervised fine-tuning on curated o3-mini traces, then narrow GRPO RL on math.
Inputs:
- Phi-4 base checkpoint.
- Pool of candidate prompts (math, science, code) drawn from public datasets.
- Teacher model: o3-mini.
- Verifier: rule-based (exact-match for math, unit tests for code, multiple-choice match for GPQA).
Outputs:
- Phi-4-reasoning (after SFT).
- Phi-4-reasoning-plus (after additional GRPO).

Pseudocode (reconstructed from paper Sections 3-4):

# Stage A: SFT data curation
For each candidate prompt p:
    Generate reasoning trace t from o3-mini conditioned on p
    Compute difficulty score for the base Phi-4 (success rate on p)
    Keep p if difficulty is in the "teachable" band (not too easy, not too hard)
Filter ~16K curated (prompt, o3-mini-trace) pairs

# Stage B: SFT training
Train Phi-4 on the 16K pairs with the long-thinking template
    learning rate = 1e-5
    linear warmup = 450 steps
    weight decay = 1e-4
    duration: 2.5 days on 32 H100-80G GPUs
Output: Phi-4-reasoning

# Stage C: GRPO RL
Seed dataset: 72,401 math problems with verifiable answers
Sample 64 problems per RL iteration
For each problem:
    Sample group of G responses from the current policy
    Apply rule-based reward:
        +1 if final answer matches ground truth
        -0.5 if wrong
        + small format / length penalties
    Compute group-relative advantage A_i for each response
    Update policy via GRPO objective (Math Entry 3)
Run for ~90 RL iterations total
Output: Phi-4-reasoning-plus

Hand-traced example. Suppose the SFT corpus has been built. Pick prompt $x$ = “Find the smallest positive integer $n$ such that $n^2 + n + 41$ is composite.” The o3-mini trace might be a 4000-token chain that tests $n = 0, 1, 2, \ldots$ and discovers the answer at $n = 40$ . The SFT update minimises the cross-entropy of Phi-4’s generation against this trace, so Phi-4 learns to imitate o3-mini’s reasoning style. After SFT, in Stage C, sample $G = 8$ responses from the Phi-4-reasoning checkpoint on the same prompt. Suppose 5 are correct and 3 are wrong. The rule-based reward vector is $\{+1, +1, +1, +1, +1, -0.5, -0.5, -0.5\}$ . Mean is approximately $0.4375$ ; standard deviation approximately $0.78$ . Advantages standardise around zero; the GRPO update pushes the model to generate the high-reward responses more often. After 90 iterations of this loop across the 72,401-problem dataset, the model is Phi-4-reasoning-plus.
Complexity:
- Stage A: SFT data curation requires one o3-mini API call per candidate prompt plus a difficulty rollout per prompt. The 16K final corpus likely required 100K+ candidate prompts.
- Stage B: 2.5 days on 32 H100s, approximately 1920 GPU-hours.
- Stage C: each GRPO step samples $G$ responses per problem, so cost scales as $G \cdot N_{\text{problems-per-step}} \cdot T_{\text{rollout}}$ . For $G = 16$ , 64 problems per step, and 90 steps, the total rollout count is approximately 92,000 — significant but small compared to DeepSeek-R1’s full RL run.
Hyperparameters and their roles.
- SFT learning rate $10^{-5}$ — the standard “barely move the pretrained weights” magnitude.
- SFT warmup 450 steps — short.
- SFT weight decay $10^{-4}$ — light regularisation.
- GRPO group size $G$ — controls the variance reduction in advantage estimation.
- GRPO clip $\epsilon$ — PPO-style trust region.
- GRPO KL coefficient $\beta$ — anchor strength.
Failure modes.
- If the SFT corpus is not properly filtered for difficulty, the model may overfit to easy prompts and degrade on hard ones.
- If the GRPO reward signal is too sparse (too few correct samples per group), training can stall.
- The “teachable” filter is heuristic; the paper does not formally characterise it.
Novelty: [Adapted]. The two-stage SFT-then-GRPO recipe is established (DeepSeek-R1 uses a variant); Phi-4-reasoning’s contribution is the small-data, small-budget realisation and the demonstration that math-only RL generalises to other reasoning domains.

ALGORITHM ENTRY 3: Mixture-of-LoRAs modality routing in Phi-4-Multimodal.

Source: Phi-4-Mini Technical Report, Section 4 (Multimodal extension).
Purpose: integrate vision and speech inputs into the Phi-4-Mini language backbone without retraining the backbone.
Inputs:
- Phi-4-Mini frozen backbone (3.8B params).
- Vision encoder: SigLIP-400M.
- Speech encoder: conformer-based, 460M params.
- Vision LoRA adapter: rank 320, approximately 370M params.
- Speech LoRA adapter: rank 320, approximately 460M params.
- Modality router (a small classifier that detects the input modality).
Outputs: Phi-4-Multimodal unified output sequence.

Pseudocode (faithful reconstruction from paper Section 4 description):

function PhiMultimodalForward(input):
    modality = ModalityRouter(input)
    if modality == TEXT:
        return PhiBackbone(input)  # no LoRA active
    if modality == VISION:
        vision_features = SigLIPEncoder(input.image)
        projected = MLPProjector(vision_features)
        # Concatenate vision tokens with text tokens
        return PhiBackbone(concat(projected, input.text), lora=VisionLoRA)
    if modality == SPEECH:
        speech_features = ConformerEncoder(input.audio)
        return PhiBackbone(concat(speech_features, input.text), lora=SpeechLoRA)
    if modality == VISION_AND_SPEECH:
        # Both LoRAs active in parallel
        ...

Hand-traced example. User submits a JPEG of a math worksheet plus the spoken question “What is the answer to problem 3?”. The modality router fires the VISION_AND_SPEECH branch. The SigLIP encoder produces vision tokens, the conformer speech encoder produces audio tokens. Both LoRA adapters activate. The Phi-4-Mini backbone processes the concatenated multimodal token sequence with the LoRA-modified attention weights. Output: a text answer to problem 3.
Complexity. Vision encode is the dominant inference cost; SigLIP-400M is roughly 10x the per-token compute of the backbone. Speech encode scales with audio duration; the paper does not publish per-second numbers.
Hyperparameters. LoRA rank 320 — high for LoRA (typical text-only LoRA uses rank 8-64); this is closer to a full adapter than a low-rank approximation. The paper justifies the high rank by the difficulty of bridging modality representations.
Failure modes. Modality router misclassification, especially for ambiguous inputs. The paper does not publish router accuracy.
Novelty: [Adapted]. The mixture-of-LoRAs is a novel configuration; the underlying LoRA technique is standard.
Transferability: [Analysis] highly transferable as a recipe for adding modalities to a frozen language backbone.

8. Specialised design contributions

Subsection 8A — LLM and prompt design. The Phi-4-reasoning paper uses o3-mini as the teacher model for SFT trace generation. The teacher-prompt format is the standard “think step by step then give the final answer” template, but the paper does not publish the exact prompt wording. The inference-time system prompt for Phi-4-reasoning (per the Hugging Face model card) is the ChatML format with an instruction to wrap reasoning in <think> tags and the final answer in <solution> tags.

[Reconstructed] Inference system prompt (paraphrased from Hugging Face model card):

You are Phi, a language model trained by Microsoft. Your role is to provide
accurate, well-reasoned answers. For complex problems, work through your
reasoning step by step inside <think> tags, then provide your final answer
inside <solution> tags. Be precise, show your work, and verify your answer
before finalising it.

[Not specified in paper] Exact prompt wording for SFT trace generation from o3-mini.

Subsection 8B — Architecture-specific details. Phi-4 14B: 40 layers, hidden dimension 5120, 40 attention heads, full attention over 4K context extended to 16K, tiktoken with 100,352 vocabulary tokens. Phi-4-Mini 3.8B: 32 layers, hidden dimension 3072, 24 query heads, 8 KV heads (GQA 3:1), 128K context via LongRoPE, 200,064 vocabulary tokens via o200k_base tokenizer. Phi-4-reasoning shares the Phi-4 base architecture with a 32K context window for the long reasoning chains.

Subsection 8C — Training specifics.

Phi-4 pretraining: peak LR $3 \times 10^{-4}$ , linear warmup-then-decay, weight decay 0.1, batch size 5760, approximately 10T tokens.
Phi-4-Mini pretraining: 5T tokens, hardware not fully disclosed in the abstract.
Phi-4-Multimodal training: 28 days on 512 A100-80G GPUs, with 5T text tokens, 2.3M speech hours, 1.1T image-text tokens per the model card⁶.
Phi-4-reasoning SFT: 2.5 days on 32 H100-80G GPUs, learning rate $10^{-5}$ .
Phi-4-reasoning-plus GRPO: approximately 90 RL iterations, 64 math problems per iteration sampled from a 72,401-problem seed dataset.

Subsection 8D — Inference and deployment specifics. Phi-4-reasoning recommends temperature=0.8, top_k=50, top_p=0.95, do_sample=True, and max_new_tokens=32768 for complex queries per the model card. The 32K-token max-new-tokens is large because reasoning chains routinely exceed 10K tokens. Phi-4-Multimodal supports up to 64 image crops per inference (approximately 8448×8448 effective resolution) and up to 3 audio clips and 3 images per prompt in the vLLM inference path.

9. Experiments and results

The three papers report results across an overlapping set of benchmarks. The discussion below pulls the headline numbers; the full tables are reproduced inline.

Datasets and benchmarks used across the family.

MMLU and MMLU-Pro — broad multitask language understanding.
GPQA-Diamond — graduate-level science multiple choice; one of the hardest current open benchmarks for non-reasoning models.
MATH — competition mathematics, 5-12K problems.
AIME 2024 and AIME 2025 — US high-school math olympiad; 30 problems each year.
HMMT February 2025 — Harvard-MIT Math Tournament.
OmniMath — competition math, broader than AIME.
HumanEval and MBPP — code generation, function-level tests.
LiveCodeBench — code generation with contamination-resistant problem rotation.
IFEval — instruction-following.
SimpleQA — short-form factual QA; tests calibration.
OpenASR and CommonVoice v15 — automatic speech recognition.

Phi-4 base model — reproduction of Table 1 (Phi-4 Technical Report).

Benchmark	Phi-4 14B	Phi-3-14B	Qwen-2.5-14B	GPT-4o-mini	Llama-3.3-70B	GPT-4o
MMLU	84.8	77.9	79.9	81.8	86.3	88.1
GPQA	56.1	31.2	42.9	40.9	49.1	50.6
MATH	80.4	44.6	75.6	73.0	66.3	74.6
HumanEval	82.6	67.8	72.1	86.2	78.9	90.6

Table 1 of Abdin et al. Phi-4 Technical Report (arXiv:2412.08905), reproduced for editorial coverage. Phi-4 outperforms its 70B and GPT-4o-mini contemporaries on MATH and GPQA while trailing on MMLU and HumanEval.⁷

Phi-4-Mini and Phi-4-Multimodal — reproduction of headline tables.

Benchmark	Phi-4-Mini	Phi-3.5-Mini	Qwen 2.5 3B	Llama-3.2 3B
MMLU (5-shot)	67.3	65.5	65.0	61.8
GSM-8K (CoT)	88.6	86.2	80.6	75.6
MATH (CoT)	64.0	48.5	61.7	46.7
HumanEval	74.4	70.1	72.0	62.8

Reproduced from Microsoft et al. Phi-4-Mini Technical Report (arXiv:2503.01743), Section 5.²

Phi-4-Multimodal speech ASR results (from the paper and model card):

OpenASR leaderboard: 6.14% WER (rank #1 at submission)⁶.
CommonVoice v15: 6.80% WER vs Whisper V3’s 8.13%.
FLEURS: 4.00% average across 8 languages.

Phi-4-reasoning — reproduction of Hugging Face model card table.

Benchmark	Phi-4-reasoning 14B	Phi-4-reasoning-plus 14B
AIME 2024	75.3	81.3
AIME 2025	62.9	78.0
OmniMath	76.6	81.9
GPQA-Diamond	65.8	68.9
LiveCodeBench	53.8	53.1

Reproduced from the microsoft/Phi-4-reasoning model card on Hugging Face⁵. The paper additionally compares Phi-4-reasoning-plus to DeepSeek-R1-Distill-Llama-70B and full DeepSeek-R1.

From the paper: “Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model”³.

Ablations reported in the Phi-4 paper.

Synthetic-data epoch ablation (Figure 2): 13.8 epochs of synthetic data outperforms one-pass over fresh web tokens on MMLU. The y-axis difference is several MMLU points.
PTS ablation: post-training with PTS-generated DPO pairs outperforms post-training with random-position DPO pairs across math benchmarks (paper Section 3).
Decontamination ablation: the 13-gram and 7-gram filter is shown to remove minor contamination across the 19-benchmark suite.

Hyperparameter sensitivity. The papers report little explicit sensitivity sweep data. The Phi-4-reasoning paper notes that AIME 2024 and GPQA-Diamond were used as “progress indicators” to track different training strategies, implying that other configurations were tried but not published.

Robustness and stress tests. Phi-4-reasoning is evaluated on out-of-distribution reasoning tasks (BA-Calendar, Maze, SpatialMap per the paper) showing that math-only RL generalises. Phi-4-Multimodal is evaluated across 24 speech languages.

Experimental scope limits. The Phi-4 family is evaluated primarily on English-language reasoning benchmarks. Multilingual reasoning (non-English math, non-English code documentation) is largely untested in the published tables. Phi-4-Mini’s expanded 200K-token tokenizer is justified by multilingual support but specific multilingual benchmark scores are limited.

Independent benchmark cross-checks. [Analysis] The Phi-4 benchmark claims have been independently scrutinised — Hugging Face Open LLM Leaderboard runs and community evaluations broadly confirm the MATH and GPQA leads, though MMLU has been retested with closer-to-reported numbers. The AIME and GPQA claims for Phi-4-reasoning are more recent (April 2025) and independent reproduction was still ongoing through mid-2025. [Reviewer Perspective] Contamination remains the largest open question for any small-model benchmark claim — the synthetic-data heavy training mix raises the prior probability of subtle test-set leakage even with the n-gram decontamination filter.

Evidence audit. [Analysis]

Strongly supported: Phi-4 14B outperforms Llama-3.3-70B on MATH and GPQA. The paper’s table is consistent with multiple independent re-evaluations.
Partially supported: the synthetic-data-quality argument. The paper’s Figure 2 ablation makes the case but the comparison baseline is not the strongest possible “high-quality web data” baseline.
Narrow evidence: the claim that math-only GRPO generalises to coding (LiveCodeBench) and science (GPQA). The Phi-4-reasoning paper reports the result but the generalisation pathway is not fully explained.

10. Technical novelty summary

Component	Type	Novelty	Justification	Source
Pivotal Token Search	Algorithm	Fully novel	First articulation of token-localised DPO under a principled gap criterion	Phi-4 paper Sec 3
Synthetic-data-heavy training mix (40%, 13.8 epochs)	Training data	Incrementally novel	Continues the Phi-1/2/3 lineage; the specific mix and epoch count are new	Phi-4 paper Sec 2
Mixture-of-LoRAs multimodal routing	Architecture	Combination novel	LoRA is standard; the modality-router + frozen-backbone configuration is new	Phi-4-Mini paper Sec 4
16K-prompt SFT with “teachable” filter	Training data	Combination novel	The filter heuristic is the novel piece	Phi-4-reasoning Sec 3
GRPO with rule-based reward at 90 steps	Optimisation	Adopted	GRPO from DeepSeekMath; the 90-step regime is the contribution	Phi-4-reasoning Sec 4
Math-only RL generalising to code + science	Empirical finding	Incrementally novel	Useful negative result on prior assumptions; not theoretically grounded	Phi-4-reasoning Sec 5

Single most novel contribution. Pivotal Token Search is the most distinctive technical contribution. Most token-level DPO variants emerged in 2024-2025 (token-level DPO, IPO, KTO), but PTS’s specific criterion — recursively subdivide the completion until a single token has a $p_{\text{gap}}$ -threshold success-probability swing — is the Phi-4 paper’s invention. [Analysis] The empirical contribution that math-only GRPO generalises to code and science is arguably more practically important but less theoretically novel.

What the papers do NOT claim to be novel. The transformer architecture, GQA, LoRA, DPO, GRPO, SigLIP vision encoder, conformer speech encoder, RoPE / LongRoPE positional encoding — all borrowed from prior work and credited as such in the related-work sections.

11. Situating the work

What prior work did. The Phi family traces back to “Textbooks Are All You Need” (Gunasekar et al. 2023, Phi-1)⁸, which argued that 1B-parameter models trained on synthetic “textbook-quality” code data could reach competitive HumanEval scores. Phi-2 and Phi-3 scaled the recipe to 2.7B and 7B respectively. The DeepSeekMath paper (Shao et al. 2024)¹⁰ introduced GRPO. DeepSeek-R1 (DeepSeek-AI 2025)⁹ demonstrated that GRPO at scale plus a long RL pipeline produces frontier reasoning capability in open-weight form. The DPO loss is from Rafailov et al. 2023.

What this paper changes conceptually. Three shifts:

Synthetic data graduates from experiment to default. Phi-4 treats synthetic data as a first-class ingredient, not an augmentation. 40% of the training mix is synthetic; the model is trained for 13.8 epochs on this portion.
Token-localised preference learning. PTS reframes DPO from “rank whole responses” to “rank token-level branches at moments where they actually matter”.
Compact reasoning post-training. Phi-4-reasoning shows that the SFT-then-RL pipeline that produced DeepSeek-R1 can be replicated at fractional cost (32 H100-days for SFT, 90 GRPO steps for RL) at the 14B scale.

Contemporaneous related papers.

DeepSeek-R1 (arXiv:2501.12948, January 2025)⁹ is the dominant point of comparison. DeepSeek-R1 is a 671B Mixture-of-Experts model trained with extensive GRPO. The Phi-4-reasoning paper benchmarks directly against DeepSeek-R1 and the DeepSeek-R1-Distill series. Technical relationship: Phi-4-reasoning uses a similar SFT-then-GRPO pipeline but at 14B dense scale and with 90 RL steps instead of DeepSeek-R1’s extensive RL phase.
Qwen 2.5 family — competing open-weight small models. Qwen 2.5-14B is the closest 14B contemporary. Technical relationship: Phi-4 outperforms Qwen 2.5-14B on MATH (80.4 vs 75.6) and GPQA (56.1 vs 42.9) per the Phi-4 Table 1, but trails on HumanEval (82.6 vs Qwen’s 72.1 — Phi-4 wins HumanEval) and MMLU (84.8 vs 79.9 — Phi-4 wins MMLU). The benchmarks favour Phi-4 selectively.
DeepSeek-R1-Distill series — DeepSeek’s own attempt to compress R1 into smaller models (1.5B, 7B, 8B, 14B, 32B, 70B). Technical relationship: Phi-4-reasoning is a parallel attempt at small-scale reasoning capability, using SFT + GRPO on Phi-4 base instead of distillation from R1.

[Reviewer Perspective] Strongest skeptical objection. The synthetic-data-first approach makes contamination control essentially the central methodological question. The Phi-4 paper’s n-gram decontamination filter is rigorous but cannot catch paraphrased contamination. When the teacher model (GPT-4 or o3-mini) has been trained on data that overlaps with the test sets, even decontaminating the student’s training corpus may not be enough. The fact that Phi-4’s MATH score (80.4) exceeds its teachers’ reported scores on the same benchmark is striking and could reflect either genuine capability or subtle data leakage that did not pass the n-gram threshold.

[Reviewer Perspective] Strongest author-side rebuttal. The papers include hand-built contamination-free benchmark variants and a 19-benchmark decontamination sweep. The improvements on AIME 2025 (released after Phi-4’s training cutoff per the model card) suggest at least partial out-of-distribution generalisation. The fact that GRPO with a math-only reward improves GPQA-Diamond scores is also hard to explain via contamination alone — there is no obvious leakage path from math problems to graduate-level chemistry questions.

What remains unsolved.

Multilingual reasoning. Phi-4 and Phi-4-reasoning are primarily English. The 200K-token tokenizer in Phi-4-Mini hints at multilingual capability but reasoning benchmarks across languages remain underreported.
Long-context reasoning. 32K context in Phi-4-reasoning is generous but below the 128K-1M frontier of Claude 4 and Gemini 2.5 Pro.
Tool-augmented reasoning. The papers do not benchmark on tool-using agent tasks (SWE-bench, MetaGPT scenarios). The reasoning capability is evaluated in closed-form reasoning, not agentic settings.

Three future research directions.

PTS for code with unit-test verifiers. [Analysis] PTS’s pivotal-token criterion is verifier-driven; code with unit tests is a natural extension. The publication is unaware of a 2025-2026 paper applying PTS to code DPO.
GRPO budget scaling laws. [Reviewer Perspective] Phi-4-reasoning-plus reaches strong scores at 90 RL steps; does the curve plateau, or would 900 steps push further? The Phi-4-reasoning paper does not publish the training-step ablation.
Multimodal reasoning chains. [Analysis] Phi-4-Multimodal handles vision and speech inputs but does not extend the reasoning-chain protocol to multimodal outputs. A natural follow-up: train the multimodal model to emit <think> blocks that combine visual and textual reasoning.

12. Critical analysis

Strengths with concrete evidence.

Open weights under MIT license. All three models are released to Hugging Face under permissive licensing per the model cards⁵⁶, enabling community evaluation, fine-tuning, and deployment without commercial restriction.
Reproducible at moderate compute. The 32-GPU 2.5-day SFT budget for Phi-4-reasoning is within reach of well-funded academic labs and most well-resourced industrial teams. Compare against the multi-thousand-GPU budgets needed for full DeepSeek-R1-scale RL.
Benchmark coverage spans reasoning subdomains. Math (MATH, AIME, OmniMath), science (GPQA), code (HumanEval, MBPP, LiveCodeBench), instruction-following (IFEval) — the evaluation surface is broad.
Pivotal Token Search is a genuinely transferable idea. Token-localised preference learning is increasingly relevant as reasoning chains lengthen.

Weaknesses explicitly stated by the authors.

English-language bias — the Phi-4-reasoning model card explicitly flags degraded non-English performance.
Hallucination remains possible — the card explicitly warns against high-stakes use without safeguards.
Code-language bias — Python-heavy training; other languages underperform per the card.
Election queries elevated defect rate — explicitly flagged in the model card.

Weaknesses not stated or understated.

[Reviewer Perspective] Contamination remains the elephant in the room. While the paper’s n-gram filter is solid for verbatim contamination, the synthetic-data-heavy training mix means that the teacher model’s pretraining data (which the authors do not control or fully disclose) is implicitly transferred. If GPT-4 saw AIME 2024 solutions during its own pretraining, the synthetic traces it generated for Phi-4 may carry subtle echoes of those solutions even after Phi-4’s decontamination filter.

[Reviewer Perspective] Benchmark selection is favourable. The benchmarks where Phi-4 outperforms 70B contemporaries (MATH, GPQA) are exactly the STEM benchmarks that synthetic data targeted. The benchmarks where it trails (MMLU, HumanEval against larger models) are not emphasised. A more balanced presentation would highlight that synthetic-data wins are domain-specific.

[Reviewer Perspective] The “teachable prompts” filter is under-specified. Phi-4-reasoning’s central data-curation step — selecting 16K prompts out of a presumably much larger candidate pool — is described as a “right level of complexity and diversity” heuristic without precise criteria. This is the single highest-payoff design choice in the SFT phase and the paper does not enable reproduction.

Reproducibility check.

Code: partial release. The Phi-4 family has Hugging Face Transformers integration via the microsoft/Phi-4-reasoning and related repos; the training code is not fully open-sourced.
Data: the synthetic-data corpus is NOT released. This is the single largest reproducibility gap — 290B unique synthetic tokens are described in aggregate but not provided.
Hyperparameters: SFT hyperparameters are reported (LR $10^{-5}$ , warmup 450 steps, weight decay $10^{-4}$ ). GRPO hyperparameters are partially reported (group size, reward shape, step count).
Compute: 32 H100s for 2.5 days SFT, 512 A100s for 28 days for Phi-4-Multimodal training reported on model cards.
Trained model weights: released. microsoft/Phi-4, microsoft/Phi-4-mini-instruct, microsoft/Phi-4-multimodal-instruct, microsoft/Phi-4-reasoning, microsoft/Phi-4-reasoning-plus all on Hugging Face under MIT license.
Evaluation set: standard public benchmarks (MMLU, AIME, GPQA, etc.) — easily reproducible by third parties.
Overall: partially reproducible. Models are open; training data is not; the recipe is documented at a level that enables informed adaptation but not exact reproduction.

Methodology disclosure callout.

Sample size: Phi-4-reasoning SFT corpus is 16K prompts; GRPO seed dataset is 72,401 math problems with 64 problems per RL iteration over approximately 90 iterations.
Evaluation set: standard public benchmarks (AIME 2024 and 2025, GPQA-Diamond, OmniMath, HMMT February 2025, MATH-500, LiveCodeBench, MMLU, MMLU-Pro). Contamination check via 13-gram and 7-gram filter on the Phi-4 base corpus; the Phi-4-reasoning paper does not detail an additional contamination check on the SFT corpus.
Baselines: DeepSeek-R1, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, o1-mini, o3-mini, Claude 3.7 Sonnet, Qwen 2.5-14B, Llama-3.3-70B, GPT-4o, GPT-4o-mini.
Hardware/compute: Phi-4 pretraining budget not fully disclosed in the paper. Phi-4-Multimodal: 512 A100-80G GPUs for 28 days. Phi-4-reasoning SFT: 32 H100-80G GPUs for 2.5 days. Phi-4-reasoning-plus GRPO compute budget not explicitly disclosed in step-count terms.

Generalisability. [Analysis] The Phi-4 line generalises well to similar Transformer-decoder architectures in the 3B-30B band. The synthetic-data recipe is replicable in principle if the operator has access to a sufficiently capable teacher model (GPT-4-class). PTS generalises to any verifier-equipped task. Mixture-of-LoRAs generalises to other modality extensions.

Assumption audit. Revisiting Section 3’s assumptions:

Synthetic data carries enough signal. The benchmark results support this at the Phi-4 scale for STEM tasks. Less clear for non-STEM or non-English domains.
N-gram decontamination is sufficient. Defensible for verbatim contamination; not for paraphrase contamination through the teacher model. [Reviewer Perspective] This is the assumption most likely to fail.
Benchmark distributions are representative. Standard ML field assumption; increasingly contested but unchanged here.
SFT on o3-mini outputs is an upper bound on what SFT can teach. Reasonable framing; GRPO’s incremental gain on top supports the claim that SFT alone leaves headroom.

What would make the paper stronger. [Analysis]

A detailed contamination analysis on the synthetic-data corpus, not just the web-scraped portion.
Publication of the “teachable prompts” filter mathematics or pseudocode.
A GRPO step-count ablation: scores at 30, 90, 270 steps to chart the diminishing-returns curve.
Multilingual reasoning benchmarks beyond English.

13. What is reusable for a new study

REUSABLE COMPONENT 1: Pivotal Token Search algorithm.

What it is: token-localised DPO pair construction via Monte-Carlo probability estimation and a gap-threshold filter.
Why worth reusing: focuses preference-learning gradient on the high-impact tokens, improving sample efficiency.
Preconditions: a verifier $v(\cdot)$ and a base model capable of generating long reasoning chains.
What would need to change in a different setting: the $p_{\text{gap}}$ threshold tuning; for code with unit tests, the verifier is binary pass/fail and PTS applies directly; for open-ended generation, no verifier exists and PTS does not apply.
Risks: Monte-Carlo rollout cost can be substantial; if the rollout count $K$ is too small the probability estimates are noisy.
Interaction effects: PTS interacts with the base model’s policy diversity — if the model has collapsed to a single mode, the “alternative continuation” needed for the losing branch is hard to sample.

REUSABLE COMPONENT 2: GRPO with rule-based reward at small step budgets.

What it is: the Phi-4-reasoning-plus 90-step GRPO recipe with $+1$ / $-0.5$ reward and format-length penalties.
Why worth reusing: demonstrates that small-budget RL works on top of a strong SFT base.
Preconditions: rule-based verifier; SFT-warmed base; group-sampling capacity.
What would need to change: reward shaping for non-math tasks (code needs unit-test reward; science needs multiple-choice match).
Risks: low-diversity groups can give $\text{std} = 0$ advantage and stall training.
Interaction effects: requires a strong SFT base — applying GRPO directly to a vanilla pretrained model is much harder than starting from an SFT’d checkpoint.

REUSABLE COMPONENT 3: Mixture-of-LoRAs modality routing.

What it is: parameter-efficient extension of a frozen language backbone with modality-specific LoRA adapters plus a router.
Why worth reusing: lets a single model handle multiple modalities without language-backbone retraining.
Preconditions: a frozen language backbone; modality-specific encoders; reasonable router classifier.
What would need to change: the LoRA rank may need tuning per modality; modalities far from the language manifold (proteins, time-series, point clouds) may need higher rank.
Risks: router misclassification on ambiguous inputs.

REUSABLE COMPONENT 4: Synthetic-data-heavy pretraining recipe with multi-epoch synthetic exposure.

What it is: the 40%-synthetic, 13.8-epoch training mix philosophy.
Why worth reusing: empirically demonstrates that high-quality synthetic data with multiple passes outperforms one-pass web text on STEM benchmarks.
Preconditions: access to a capable teacher model for synthetic generation.
What would need to change: domain-specific synthetic data generation prompts; rigorous contamination filtering.
Risks: teacher-model bias and contamination paths.

Dependency map. PTS depends on a verifier and DPO trainer. GRPO depends on a verifier, SFT-warmed base, and group-sampling infrastructure. Mixture-of-LoRAs depends on encoder + adapter training infra. Synthetic-data recipe depends on teacher-model access. All four components are independently reusable; PTS + GRPO is a natural combined pipeline for math-heavy post-training.

Recommendation. [Analysis] Highest-value reusable components, in order:

GRPO with small-budget rule-based reward. The 90-step demonstration is the most practitioner-relevant: it lowers the bar for adding reasoning capability to a base model from “many thousands of GPU-hours” to “tens of GPU-days”.
Pivotal Token Search. Transferable to any verifier-equipped DPO pipeline; the gain on top of standard whole-response DPO is the paper’s headline post-training contribution.
Mixture-of-LoRAs for multimodal extension. Cleanest recipe in the open-weight ecosystem for adding modalities without backbone retraining.

[Analysis] Type of new study that benefits most: a reasoning-focused post-training study on a different base model (e.g., Qwen 2.5 14B, Mistral Small 3) that wants to reproduce a Phi-4-reasoning-class capability without committing to a full R1-scale RL pipeline.

14. Known limitations and open problems

Limitations explicitly stated by the authors.

English-language bias (Phi-4-reasoning model card).
Hallucination risk in high-stakes domains (legal, health, elections).
Code-language Python bias.
Limited multilingual reasoning evaluation.
Phi-4-Multimodal vision evaluation is English-only.

Limitations not stated or understated.

[Analysis]

Contamination via teacher-model pretraining. The synthetic-data corpus inherits whatever the teacher model saw; n-gram decontamination on Phi-4’s training corpus does not address this transitive contamination path.
“Teachable prompt” filter opacity. The single highest-payoff decision in the Phi-4-reasoning SFT phase is not reproducibly specified.
GRPO step-count scaling. The 90-step result is presented as a sufficient demonstration; the diminishing-returns curve and its asymptote are not characterised.
Inference latency for long reasoning chains. A 32K-token reasoning chain at 30 tokens-per-second generation rate is roughly 15 minutes of wall-clock time per query. The papers report quality numbers without latency analysis.

Technical root cause of each.

Contamination: the synthetic-data-heavy training paradigm structurally exposes the model to upstream model’s training data through the teacher.
Teachable filter: the filter is a heuristic, not a formally characterised criterion.
GRPO scaling: the paper’s claim is positive (90 steps is enough); the negative claim (more steps don’t help much) requires ablation the paper doesn’t publish.
Latency: long reasoning chains are an inherent cost of the chain-of-thought + RL approach.

Open problems.

How to detect transitive contamination through teacher-model pretraining without access to teacher’s training data?
Can the “teachable prompt” selection be replaced by a learned scoring function?
What is the GRPO scaling law for reasoning models — is 90 steps near a knee or far below it?
Can multimodal inputs participate in reasoning chains, not just final answers?
How does the recipe transfer to non-Transformer architectures (state-space models, hybrid SSM-attention)?

What a follow-up paper would need to solve. [Analysis] The single most critical follow-up is the contamination-via-teacher question. The Phi-4 family’s headline results are striking precisely because they exceed reasonable priors for a 14B dense model. Independent confirmation that the gains are not the result of subtle teacher-model leakage would solidify the entire research line. A natural follow-up paper: train Phi-4-class models with two different teacher families (e.g., GPT-4 synthetic + Claude synthetic, paired ablations) and quantify the resulting benchmark drift.

How this article reads at three depths

For the curious high-school reader. Microsoft’s Phi-4 line shows that small language models — about 14 billion numbers tunable inside, versus the hundreds of billions in the biggest models — can do hard math and science problems nearly as well as their giant cousins. The trick is training them on text that other AI models wrote (called synthetic data) and then giving them a second round of training that specifically teaches them how to think step by step before answering. The result is models you can run on a single graphics card that solve high-school math olympiad problems at the same level as multi-million-dollar systems.

For the working developer or ML engineer. Phi-4 family delivers strong math, science, and code reasoning at 3.8B-14B parameter scale under MIT license, deployable on a single H100 or even a high-end consumer GPU at quantised precision. The three papers offer a recipe: synthetic-data-heavy pretraining (40% synthetic, 13.8 epochs on the synthetic portion), DPO with Pivotal Token Search for the base, and SFT-on-o3-mini-traces plus a 90-step GRPO RL phase for the reasoning variant. Headline numbers: Phi-4 14B reaches MATH 80.4 and GPQA-Diamond 56.1; Phi-4-reasoning-plus reaches AIME 2024 81.3 and GPQA-Diamond 68.9. Trade-offs to be aware of: synthetic-data contamination is the main reproducibility concern, multilingual capability is weak, and a 32K-token reasoning chain costs 10-15 minutes of wall-clock time per query.

For the ML researcher. The Phi-4 family is best read as an extended empirical case study for the “data quality scales differently from parameter count” thesis. The two genuinely novel pieces are Pivotal Token Search (token-localised DPO via a recursive $p_{\text{gap}}$ criterion on Monte-Carlo-estimated success probabilities) and the demonstration that 90 GRPO steps on math problems generalise to GPQA and LiveCodeBench gains. The load-bearing assumptions are (a) that synthetic-data contamination through teacher-model pretraining is bounded, (b) that n-gram decontamination at 13 and 7 grams catches enough leakage, and (c) that the “teachable prompt” heuristic generalises across reasoning domains. The strongest objection is the implicit transitive contamination path through GPT-4 / o3-mini that the n-gram filter cannot address. A follow-up paper that swaps teacher-model families in ablation would settle the most contested question in this research line.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Abdin et al. — Phi-4 Technical Report (arXiv:2412.08905, December 2024) (accessed 2026-05-20) ↩
2. Microsoft et al. — Phi-4-Mini Technical Report (arXiv:2503.01743, March 2025) (accessed 2026-05-20) ↩
3. Abdin et al. — Phi-4-reasoning Technical Report (arXiv:2504.21318, April 2025) (accessed 2026-05-20) ↩
4. Microsoft Research — Phi-4-reasoning Technical Report PDF (mirror of arXiv:2504.21318) (accessed 2026-05-20) ↩
5. microsoft/Phi-4-reasoning model card on Hugging Face (released April 30, 2025; MIT license) (accessed 2026-05-20) ↩
6. microsoft/Phi-4-multimodal-instruct model card on Hugging Face (5.6B params, 128K context, MIT license) (accessed 2026-05-20) ↩
7. Phi-4 Technical Report Table 1, benchmark scores reproduced for editorial coverage (accessed 2026-05-20) ↩
8. Gunasekar et al. — Textbooks Are All You Need (Phi-1, arXiv:2306.11644) (accessed 2026-05-20) ↩
9. DeepSeek-AI — DeepSeek-R1 Technical Report (arXiv:2501.12948) (accessed 2026-05-20) ↩
10. Shao et al. — DeepSeekMath / GRPO (arXiv:2402.03300) (accessed 2026-05-20) ↩

Anonymous · no cookies set

Found this useful? Share it.