GRPO and Reinforcement Fine-Tuning: A Multi-Paper Technical Reference

GRPO replaces PPO's value network with group-averaged rewards. Multi-paper review of DeepSeekMath, DeepSeek-R1, and OpenAI RFT — math, algorithms, benchmarks.

20 May 2026 Updated 20 May 2026 ~52 min read

Section 1: Cluster identity and scope

This review covers three connected works that defined post-training for reasoning-tuned large language models in 2024 and 2025:

Shao, Wang et al., 2024, “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (arXiv:2402.03300). The paper that introduced Group Relative Policy Optimization (GRPO).¹
DeepSeek-AI, 2025, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” (arXiv:2501.12948; published in Nature vol. 645, pp. 633-638, 2025). The paper that applied GRPO at scale to produce an o1-class reasoning model, fully open-weight.²
OpenAI, December 2024 → 2025, Reinforcement Fine-Tuning (RFT). A closed-source product/API surface for adapting OpenAI’s o-series reasoning models with a programmable grader. There is no peer-reviewed paper; the canonical sources are OpenAI’s API documentation and Cookbook examples.⁸⁹

Retrieval confirmation: DeepSeekMath and DeepSeek-R1 retrieved via arXiv abstracts + ar5iv HTML renders on 2026-05-20. OpenAI RFT documentation retrieved from platform.openai.com/docs and cookbook.openai.com on 2026-05-20. No appendix-only material was inaccessible.

Cluster classification: RL · Training method · LLM-based · Theoretical (GRPO derivation) · Application (RFT product).

Why these three together. GRPO is the algorithm. DeepSeek-R1 is its proof at frontier scale. OpenAI’s RFT is the productised parallel: it is closed-source, but its published “sample, grade, update” description outlines a near-identical training regime. Reading them side by side surfaces what is fundamentally one paradigm: post-training a reasoning model with verifier-driven rewards rather than the human-preference rewards that anchored the original RLHF era.⁷

Primary research question (cluster-level). Can a reasoning-capable LLM be trained without a learned value network and without dense human preference labels, relying instead on rule-based or grader-based rewards over groups of sampled completions?

Core technical claim (cluster-level). Yes. GRPO eliminates PPO’s value model by using group-relative advantages; DeepSeek-R1 demonstrates the approach matches OpenAI’s o1-1217 across reasoning benchmarks; OpenAI’s RFT operationalises the same loop as a paid API surface for adapting o-series models with as few as a dozen training examples.

Reader prerequisites. High-school algebra. Familiarity with probability (random variables, expectation), gradient descent, and the basic idea of a neural network. Section 2.5 brings the high-school reader up to speed on RL-specific terms. Working ML researchers will find Sections 5-7 the most useful; practitioners will find Sections 8 and 13 most useful.

Section 2: TL;DR and executive overview

3-sentence TL;DR. Group Relative Policy Optimization (GRPO) is a training recipe that teaches a language model to reason better by generating several answers to the same question, comparing how good each one was relative to its peers, and pushing the model toward the answers that scored highest. DeepSeek’s 2024 math paper introduced it; their 2025 DeepSeek-R1 model used it to match OpenAI’s then-best reasoning model while staying fully open-weight; OpenAI’s Reinforcement Fine-Tuning service offers the same idea as a paid API where the customer writes a “grader” function. The headline technical move is dropping the separate “value network” that older methods like PPO require, which removes one of the roughly four model-sized networks the training loop must hold in GPU memory.

Executive summary. Reasoning post-training in 2024-2025 converged on a shared recipe: sample many completions per prompt, score each one with an automatic verifier (a math-answer checker, a code-execution test, or a model-graded rubric), normalise the scores within the group, and run a policy gradient update. GRPO is the loss function that makes this clean; DeepSeek-R1 is the existence proof that it scales; RFT is the commercial wrapper. The trio matters because it replaces the human-preference labelling pipeline of RLHF with a much cheaper verifier-based loop: when the domain admits a verifier, post-training can be done with thousands of prompts instead of millions of preference pairs.

Five practitioner takeaways.

GRPO removes the value-function network used by PPO. Memory cost during training drops from roughly 4 model-sized state tensors (policy, reference, reward, value) to 3 (policy, reference, reward).
The “advantage” of each token is the group-normalised reward: $(r_i - \mathrm{mean}(r))/\mathrm{std}(r)$ over the $G$ sampled completions to the same prompt. No critic network needed.
DeepSeek-R1-Zero (pure RL, no SFT) reached 71.0% pass@1 on AIME 2024 from a base score of 15.6%, the headline result that GRPO + rule-based rewards on a strong base model is sufficient to surface reasoning behaviour.²
DeepSeek-R1 (the four-stage SFT + RL + SFT + RL pipeline) reached 79.8% AIME 2024 pass@1 versus OpenAI o1-1217’s 79.2%.²
OpenAI’s RFT API exposes the same loop to customers: provide a JSONL dataset and a grader (string match, model grader, or multi-grader), and the platform runs the sample-grade-update cycle. Supported on o4-mini at general availability; pre-GA alphas ran on o1-mini.⁸¹⁰

Pipeline overview in text. Training time: for each prompt $q$, sample $G$ completions $o_1, \ldots, o_G$ from the current policy $\pi_{\theta_{\text{old}}}$. Score each completion with a reward function to obtain $r_i$. Compute group-normalised advantages $\hat{A}_i = (r_i - \mathrm{mean}(r))/\mathrm{std}(r)$. Update the policy by maximising a clipped surrogate objective with a KL penalty against a frozen reference policy $\pi_{\text{ref}}$. Inference time: nothing changes; the trained policy generates as usual.

Section 2.5: Glossary

Term	Plain-English explanation	First appears in
Policy ( $\pi_\theta$ )	The language model itself, viewed as a function that takes a prompt and outputs a probability distribution over next tokens. The parameters $\theta$ are what training updates.	Section 2
Reward	A number that says “how good was this completion.” Higher is better. In GRPO it comes from a rule (e.g., “is the math answer correct?”) or a model grader.	Section 2
Advantage	How much better a particular completion was compared to a baseline. In GRPO the baseline is the group’s average reward.	Section 2
Value network / critic	An auxiliary neural network used by PPO to estimate “expected future reward from this state.” GRPO’s headline move is to delete it.	Section 4
KL divergence	A number that measures how different two probability distributions are. Zero when identical. Used as a “don’t drift too far from the reference model” penalty.	Section 6
Reference model ( $\pi_{\text{ref}}$ )	A frozen copy of the model at the start of RL training. The KL penalty keeps the live policy close to this anchor.	Section 6
Group ( $G$ )	The number of completions sampled per prompt during GRPO training. Typical values are 8, 16, or 64.	Section 6
PPO	Proximal Policy Optimization, Schulman et al. 2017. The mainstream RL algorithm before GRPO; uses a learned value function.	Section 4
RLHF	Reinforcement Learning from Human Feedback. The Christiano-Stiennon-Ouyang lineage that uses pairwise human preference data to train a reward model, then runs PPO against it.	Section 4
Verifier reward	A reward that comes from a deterministic check (math-answer match, code unit-test pass, regex on output format) rather than a learned reward model.	Section 5
Pass@1	The benchmark metric where the model gets exactly one attempt per problem and you measure accuracy. As opposed to pass@k with k attempts.	Section 9
Cold-start SFT	A short supervised fine-tuning step on a small curated dataset before RL begins, used in DeepSeek-R1 to stabilise the model’s output format.	Section 5
“From the paper:” prefix	Content directly supported by the paper’s text, equations, tables, or figures.	Throughout
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the paper proves.	Sections 11-12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the paper only partially disclosed it.	Where used
`[External comparison]` label	A comparison to prior work or general knowledge outside the papers themselves.	Sections 4, 11

Section 3: Problem formalisation

Notation table.

Symbol	Type	Meaning	First appears in
$q$	string	A prompt / question.	Section 3
$o = (o_1, o_2, \ldots, o_{\mid o\mid })$	token sequence	A completion sampled from the policy.	Section 3
$o_t$	token	The $t$ -th token of completion $o$ .	Section 6
$\pi_\theta(o \mid q)$	distribution	The policy’s probability of generating completion $o$ given prompt $q$ .	Section 3
$\pi_{\theta_{\text{old}}}$	distribution	The policy snapshot used to sample the current training batch.	Section 6
$\pi_{\text{ref}}$	distribution	The frozen reference policy used in the KL penalty.	Section 6
$r_i$	scalar	The reward assigned to completion $o_i$ .	Section 6
$G$	positive integer	The number of completions sampled per prompt.	Section 6
$\hat{A}_{i,t}$	scalar	The advantage estimate for the $t$ -th token of completion $o_i$ .	Section 6
$\beta$	non-negative scalar	The coefficient on the KL penalty.	Section 6
$\varepsilon$	small positive scalar	The clip range in the surrogate objective (typically 0.2).	Section 6
$\mathbb{D}_{\mathrm{KL}}$	functional	Kullback-Leibler divergence between two distributions.	Section 6

Formal problem statement. Given a base language model with parameters $\theta_0$ and a prompt distribution $\mathcal{D}$ , find updated parameters $\theta$ such that completions sampled from $\pi_\theta$ achieve higher expected reward $\mathbb{E}_{q \sim \mathcal{D}, o \sim \pi_\theta(\cdot \mid q)}[r(q, o)]$ while remaining “close” to $\pi_{\text{ref}}$ under KL divergence. The reward $r$ may be a learned reward model (the classical RLHF setting) or a rule-based verifier (the GRPO-for-reasoning setting that dominates DeepSeek-R1 and OpenAI RFT).
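One common way to write the statement above as a single expression is the KL-regularised form below. This is a sketch in the notation of the table, with $\beta$ as the penalty coefficient; it is a restatement for orientation, not an equation quoted from any of the three sources.

$\max_{\theta}\; \mathbb{E}_{q \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid q)}\left[ r(q, o) \right] \;-\; \beta\, \mathbb{E}_{q \sim \mathcal{D}}\left[ \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(\cdot \mid q) \,\|\, \pi_{\text{ref}}(\cdot \mid q)\right] \right]$

Setting $\beta = 0$ recovers plain expected-reward maximisation; larger $\beta$ keeps the policy closer to $\pi_{\text{ref}}$.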

Assumptions (from DeepSeekMath Section 4 and Section 5):

The reward $r_i$ is well-defined for any completion $o_i$, and dependable enough that group-normalisation is informative. [Analysis] Potentially strong assumption when the reward is a learned model: reward hacking remains a documented failure mode.
Group statistics (mean, std) over $G$ completions are reasonable proxies for an advantage baseline. This implicitly requires $G$ to be large enough that the sample std is non-zero. [Analysis] Potentially fragile when most completions in the group score identically (e.g., all-correct or all-wrong groups on easy / impossible problems).
The policy and reference model share the same tokeniser and vocabulary (required to compute the per-token KL).

Why the problem is hard. Three structural difficulties remain even after GRPO’s simplifications. (a) The reward signal is sparse and outcome-only in the DeepSeek-R1 setting: the model must allocate credit across hundreds of reasoning tokens from a single end-of-completion reward. (b) The policy distribution shifts during training, so the off-policy correction (the $\pi_\theta / \pi_{\theta_{\text{old}}}$ ratio) must be clipped to prevent destructively large updates. (c) Without a value baseline, advantage estimates are noisier; the group-normalisation has to do the work that a learned critic used to do.

Causal vs correlational. Not applicable: this is a policy-optimisation paradigm, not a causal-discovery paper.

LLM-based formal role. The policy is an LLM. The reward function may itself be an LLM (model grader, in OpenAI RFT terminology) or a deterministic verifier (math-answer matcher, code unit-test runner). The reference model is a frozen LLM.

Section 4: Motivation and gap

The real-world problem. Pre-training and supervised fine-tuning (SFT) produce capable LLMs, but they do not naturally surface multi-step reasoning behaviour at competition-math difficulty. Through 2023, the standard recipe for “make a model better at reasoning” was: collect a chain-of-thought SFT dataset, fine-tune, and hope. OpenAI’s o1 demonstrated in late 2024 that a different recipe (post-training with reinforcement learning, where the model is rewarded for getting the answer right) could produce qualitatively different behaviour: long internal monologues, self-correction, backtracking. OpenAI did not disclose o1’s training recipe.³

Existing approaches and their failure modes (from DeepSeekMath Section 4.1, Related Work).

PPO with a learned reward model (RLHF lineage). Christiano-Stiennon-Ouyang.⁷ Requires (a) a separate reward model trained on pairwise preference data, and (b) a separate value/critic network of roughly policy size for advantage estimation. The combination doubles or triples the memory footprint during training. DeepSeekMath flags: “in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.”¹
DPO (Direct Preference Optimization). Skips the value network and the explicit reward model, but still needs pairwise preference data. DPO and its descendants (IPO, KTO, SimPO) are the alternative line of attack on RLHF cost.¹¹ [External comparison] See the publication’s prior coverage of DPO and its successors for the parallel branch of the post-training literature.
Process reward models (PRMs). Score every reasoning step, not just the final answer. Densely supervised but expensive: labelling step correctness at scale is harder than labelling final-answer correctness.

Gap GRPO claims to fill. A policy-optimisation algorithm that (a) keeps PPO’s clipped-surrogate stability, (b) deletes the value network, and (c) works equally well with rule-based or model-based rewards. DeepSeekMath’s framing: “we propose Group Relative Policy Optimization (GRPO), a variant of PPO that forgoes the critic model. Instead of learning a value function, GRPO estimates the baseline from group scores, significantly reducing training resources.”¹

Why prior methods were insufficient (per the paper). The DeepSeekMath authors had a concrete operational pain: they were training a 7B math model with PPO, and the critic network was eating GPU memory they would rather spend on larger group sizes. The paper’s Section 4 motivation reads as “we already have multiple samples per prompt because we are doing best-of-N inference anyway, why not use that group as the baseline?”

Practical stakes. Reasoning-tuned post-training was, before GRPO, gated by RLHF infrastructure that only the largest labs operated. GRPO + rule-based rewards moved the cost into the range of well-resourced academic groups and the larger open-source teams. DeepSeek-R1’s full open release (weights, distilled variants, technical report) operationalised this for the community.²¹³

[External comparison] Position in the landscape. GRPO occupies a specific niche in the 2024-2025 post-training taxonomy: outcome-reward RL with multiple-sample baselines. DPO/IPO/KTO occupy the preference-pair niche. Process reward models with step-level supervision occupy the dense-reward niche. RLOO (REINFORCE with Leave-One-Out baseline) is a closely related design that pre-dates GRPO and uses a similar group baseline idea. The Sea AI Lab “Dr. GRPO” paper, March 2025, argues GRPO’s normalisation introduces length bias and proposes a corrected variant.¹¹

Section 5: Method overview

5.1 GRPO at a glance (DeepSeekMath Section 4)

Plain-English intuition. Imagine you ask the model the same question eight times, getting eight different attempts. Score each attempt (e.g., 1 if the math answer is right, 0 if wrong). The group’s average score is the baseline; any individual attempt is “good” if it scored above that average, “bad” if below. Push the model’s parameters in the direction that makes the above-average attempts more likely and the below-average attempts less likely. Do this with a clipped ratio (so any one update can’t be too aggressive) and with a KL leash to the starting model (so the model can’t drift into gibberish).

Step-by-step mechanism (from DeepSeekMath Algorithm 1):

Initialise the policy $\pi_\theta$ from a starting checkpoint. Snapshot it as $\pi_{\text{ref}}$ .
For each training step:
   a. Set $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$.
   b. Sample a batch of prompts $\{q^{(b)}\}$ from the training distribution.
   c. For each prompt $q^{(b)}$, sample $G$ completions $o_1, \ldots, o_G$ from $\pi_{\theta_{\text{old}}}(\cdot \mid q^{(b)})$.
   d. Score each completion to get $r_1, \ldots, r_G$.
   e. Compute group-normalised advantages $\hat{A}_i = (r_i - \mathrm{mean}(r))/\mathrm{std}(r)$.
   f. Assign $\hat{A}_{i,t} = \hat{A}_i$ for every token $t$ in completion $o_i$ (outcome supervision) or accumulate step-level rewards (process supervision).
   g. Update $\theta$ by maximising the GRPO surrogate objective.
In the iterative variant, periodically reset $\pi_{\text{ref}}$ to the current policy; DeepSeekMath also keeps retraining the reward model, using a replay mechanism that mixes in 10% historical data.¹
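The loop above can be sketched on a toy problem. The example below is a minimal illustration under strong simplifying assumptions, not DeepSeekMath's implementation: the "policy" is a softmax over four candidate answers (each completion is a single token), the reward is a rule-based exact match, one gradient step is taken per sampled group (so $\rho = 1$ and the clip never activates), and the KL penalty uses the exact KL of the toy distribution instead of a per-token estimator. The function name `grpo_toy_step` and all hyperparameter values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grpo_toy_step(logits, ref_probs, correct, rng, G=8, lr=0.5, beta=0.04, tiny=1e-8):
    """One GRPO-style update on a toy single-token policy over K answers."""
    K = len(logits)
    probs = softmax(logits)                      # (a) snapshot: pi_old = pi_theta
    # (c) sample G one-token "completions" from pi_old
    samples = rng.choices(range(K), weights=probs, k=G)
    # (d) rule-based reward: 1 if the sampled answer is correct, else 0
    rewards = [1.0 if a == correct else 0.0 for a in samples]
    # (e) group-normalised advantages (population std; `tiny` guards std == 0)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    adv = [(r - mean) / (std + tiny) for r in rewards]
    # (g) gradient of (1/G) * sum_i rho_i * A_i at theta = theta_old,
    #     using d log pi(a) / d logit_k = 1[k == a] - pi(k)
    grad = [0.0] * K
    for a, A in zip(samples, adv):
        for k in range(K):
            grad[k] += A * ((1.0 if k == a else 0.0) - probs[k]) / G
    # KL leash: subtract beta * gradient of the exact KL(pi_theta || pi_ref)
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))
    for k in range(K):
        grad[k] -= beta * probs[k] * (math.log(probs[k] / ref_probs[k]) - kl)
    # gradient ascent on the objective
    return [z + lr * g for z, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]        # uniform start: 25% chance of the right answer
ref_probs = softmax(logits)          # frozen reference policy
for _ in range(200):
    logits = grpo_toy_step(logits, ref_probs, correct=2, rng=rng)
p_correct = softmax(logits)[2]       # probability mass on the correct answer
```

Groups in which every sample is correct (or every sample is wrong) produce zero advantages, so in this sketch only the KL term moves the parameters on those steps.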

Connection to the full pipeline. GRPO is the inner training loop. The outer pipeline differs across the three papers:

DeepSeekMath: SFT → GRPO on math problems.
DeepSeek-R1-Zero: GRPO directly on the base model (no SFT).
DeepSeek-R1: Cold-start SFT → GRPO (reasoning) → rejection-sampling SFT → GRPO (all scenarios).
OpenAI RFT: customer-provided dataset + grader → OpenAI-managed sample-grade-update loop on an o-series model.

Design rationale and tradeoffs. Group sampling is naturally parallelisable: the $G$ completions per prompt fit cleanly into the same forward-pass batch. The value model removal saves memory roughly equal to one policy-sized state. The cost is that advantage estimates are noisier than those from a learned critic, and groups whose completions all receive the same reward (all-correct on easy problems, all-wrong on impossible ones) contribute no policy-gradient signal, because every $r_i$ equals the group mean and the advantage numerator is zero.

What breaks if removed. Drop the KL penalty and the policy drifts off the reference; drop the clipping and updates destabilise; drop the group baseline (use raw rewards) and the gradient is dominated by the absolute reward scale rather than relative quality. The structural pieces are interdependent.

Novelty: [Adapted] from PPO. The clipped surrogate, the per-token policy ratio, and the KL penalty are PPO-native. The group-relative advantage and value-function removal are GRPO’s contribution.

5.2 DeepSeek-R1-Zero (DeepSeek-R1 Section 2.2)

Plain-English intuition. Take a strong base model (DeepSeek-V3-Base). Don’t do SFT. Just run GRPO directly on math and code problems, with rule-based rewards: “did you get the right answer?” plus “did you format your reasoning inside the <think> and </think> tags?” Watch what emerges.

Reward design (Section 2.2.2):

Accuracy reward. For math: extract the final answer and compare to ground truth. For code: run the unit tests. Binary or near-binary.
Format reward. A small bonus if the response contains the reasoning inside <think>...</think> followed by the answer in the expected format.
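A rule-based reward of this shape might look like the sketch below. It is illustrative only: the `\boxed{...}` answer convention, the exact-string comparison, and the `format_weight` of 0.1 are assumptions made for the example, and the function names are hypothetical. The paper does not publish its reward code or weights.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, then a non-empty answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)
# Assumed answer convention: the final answer appears as \boxed{...}.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed think-then-answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last boxed answer after the reasoning matches exactly."""
    answer_part = completion.split("</think>")[-1]
    found = BOXED_PATTERN.findall(answer_part)
    if not found:
        return 0.0
    return 1.0 if found[-1].strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str, format_weight: float = 0.1) -> float:
    """Accuracy reward plus a small format bonus (weight is hypothetical)."""
    return accuracy_reward(completion, ground_truth) + format_weight * format_reward(completion)

good = "<think>2 + 2 is 4.</think> The answer is \\boxed{4}."
unformatted = "The answer is \\boxed{4}."
wrong = "<think>Maybe 5?</think> The answer is \\boxed{5}."
scores = [total_reward(c, "4") for c in (good, unformatted, wrong)]
```

A production verifier would normally compare answers after mathematical normalisation (for example treating `1/2` and `0.5` as equal); the exact-string check here keeps the sketch short.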

What emerged (Section 2.2.3, 2.2.4). The model’s chain-of-thought length grew steadily across training steps. AIME 2024 pass@1 climbed from 15.6% at the start to 71.0% after RL training (86.7% with majority voting at 64 samples).² The paper’s “aha moment” anecdote: at an intermediate training step the model spontaneously inserts phrases like “wait, let me reconsider” and re-derives a step. This was not in the cold-start data because there was no cold-start data; it emerged from reward optimisation alone.

Failure modes the paper acknowledges. R1-Zero outputs are hard to read: language mixing, missing markdown structure, occasional incoherence. The model is reasoning-effective but presentation-poor. This motivated R1’s four-stage pipeline.

5.3 DeepSeek-R1 four-stage pipeline (DeepSeek-R1 Section 2.3)

Plain-English intuition. R1-Zero proved RL alone could surface reasoning, but the output was hard to read. R1 fixes that with a multi-stage pipeline that alternates supervised fine-tuning (for output quality / format / general capability) and RL (for reasoning power).

Stage 1, Cold-start SFT (Section 2.3.1). Curate “thousands” of long chain-of-thought examples from R1-Zero (filtered for readability) and from prompted DeepSeek-V3 outputs. Fine-tune V3-Base on this small dataset to lock in the reasoning + summary output format.

Stage 2, Reasoning-oriented RL (Section 2.3.2). Apply GRPO with the same rule-based rewards as R1-Zero, plus a language-consistency reward to discourage Chinese/English code-switching mid-thought.

Stage 3, Rejection sampling SFT (Section 2.3.3). From the stage-2 model, generate completions on a wider prompt distribution, score them with rule-based + DeepSeek-V3 model-grader signals, keep the high-scoring ones. Mix with non-reasoning data (writing, QA, translation). The paper reports “approximately 600k reasoning related training samples” plus “approximately 200k training samples” of non-reasoning data, ~800k total, two epochs.²
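The core of rejection sampling is a filter: sample several completions per prompt, score them, and keep only those that clear a bar. The sketch below shows that filter with stand-in `toy_generate` and `toy_score` functions; these names, the threshold, and the sample count are hypothetical, and the paper's additional readability filtering is not modelled.

```python
from typing import Callable, Dict, List

def rejection_sample(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    n_samples: int = 4,
    threshold: float = 1.0,
) -> List[Dict[str, str]]:
    """Keep (prompt, completion) pairs whose score reaches the threshold."""
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, n_samples):
            if score(prompt, completion) >= threshold:
                kept.append({"prompt": prompt, "completion": completion})
    return kept

# Toy stand-ins for the stage-2 model and the verifier.
ANSWERS = {"2+2": "4", "3*3": "9"}
POOL = {"2+2": ["4", "5", "22", "3"], "3*3": ["6", "9", "8", "7"]}

def toy_generate(prompt: str, n: int) -> List[str]:
    return POOL[prompt][:n]

def toy_score(prompt: str, completion: str) -> float:
    return 1.0 if completion == ANSWERS[prompt] else 0.0

sft_data = rejection_sample(["2+2", "3*3"], toy_generate, toy_score)
```

The surviving pairs then serve as ordinary supervised fine-tuning examples.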

Stage 4, RL for all scenarios (Section 2.3.4). A second GRPO pass with rule-based rewards for reasoning and a preference-based reward model for helpfulness and harmlessness. The aim: bring back the alignment behaviour that pure-reasoning RL had partially eroded.

Why four stages? [Analysis] The pipeline is best read as alternation between capability injection (RL surfaces reasoning) and behaviour shaping (SFT enforces format and alignment). R1-Zero shows you can skip the alternation if you only care about benchmark reasoning; R1’s pipeline shows what it takes to ship a model.

5.4 OpenAI RFT (December 2024 → 2025)

Plain-English intuition. OpenAI’s customer-facing version of the same loop. The customer provides (a) a training dataset and (b) a “grader”, a function or rubric that scores model outputs. OpenAI’s platform runs the sample-grade-update cycle on an o-series reasoning model and returns a fine-tuned checkpoint.⁸

Mechanism (per OpenAI’s documentation and Cookbook):

Training data: JSONL with prompts and reference answers / contexts.
Grader types: string-match graders (regex / exact match), model graders (an LLM scores the output against a rubric), multi-graders (composite of the above).
Training algorithm: OpenAI does not publish the loss function. The description “sample multiple candidate answers, run your grader to score them, apply a policy-gradient update” is consistent with GRPO or with a closely related group-baseline REINFORCE variant. [Reconstructed] the precise loss is not in public documentation.
Models supported: at GA, o4-mini. The December 2024 alpha ran on o1-mini.¹⁰
Data efficiency claim: “tens to hundreds of samples” suffice. The Berkeley Lab + Charité Hospital rare-disease demo cited at the December 2024 launch used a few dozen examples and reportedly outperformed o1 on rare-disease gene identification.¹⁰
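To make the grader types concrete, the sketch below implements local stand-ins for a string-match grader, a regex grader, and a weighted multi-grader, and serialises a tiny dataset as JSONL. Everything here is hypothetical: the field names `prompt` and `reference_answer`, the function names, and the weights are chosen for illustration and are not OpenAI's request schema, which should be taken from the API documentation.

```python
import json
import re

def exact_match_grader(output: str, reference: str) -> float:
    """String-match grader: 1.0 on an exact (whitespace-trimmed) match."""
    return 1.0 if output.strip() == reference.strip() else 0.0

def regex_grader(output: str, pattern: str) -> float:
    """Regex grader: 1.0 if the output matches the expected shape."""
    return 1.0 if re.search(pattern, output.strip()) else 0.0

def multi_grader(output: str, reference: str, pattern: str,
                 w_exact: float = 0.8, w_regex: float = 0.2) -> float:
    """Composite grader: weighted sum of the two graders above."""
    return (w_exact * exact_match_grader(output, reference)
            + w_regex * regex_grader(output, pattern))

# A two-example training set, one JSON object per line (JSONL).
records = [
    {"prompt": "What is 12 * 12?", "reference_answer": "144"},
    {"prompt": "What is 7 + 8?", "reference_answer": "15"},
]
jsonl_text = "\n".join(json.dumps(r) for r in records)

digits_only = r"^\d+$"
score_right = multi_grader("144", "144", digits_only)          # correct and well-formed
score_near = multi_grader("143", "144", digits_only)           # well-formed but wrong
score_off = multi_grader("one hundred", "144", digits_only)    # neither
```

A model grader would slot into the same composite as another callable returning a float; it is omitted here because it needs a live model.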

Novelty: [Adopted]. RFT brings a GRPO-style loop into a productised API. OpenAI did not publish a paper; the timing (DeepSeekMath February 2024 → OpenAI RFT December 2024) and the algorithmic description suggest the lineage, although the exact loss remains undisclosed. [Reviewer Perspective] Independent commentators have noted the close resemblance.¹⁰

Section 6: Mathematical contributions

This is the depth section. Every formula gets a worked numerical example.

MATH ENTRY 1: PPO clipped surrogate objective (the baseline GRPO modifies)

Source: DeepSeekMath Section 4.1, Eq. (1); Schulman et al. 2017.¹⁶
What it is: the per-token clipped policy-gradient loss that PPO maximises. It says “push the policy towards high-advantage tokens, but don’t let any single update push the ratio of new-policy to old-policy probability outside a small window.”
Formal definition:

$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} \min\left(\rho_t A_t,\; \mathrm{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\, A_t\right)\right]$

where $\rho_t = \dfrac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$ is the importance ratio and $A_t$ is an advantage estimate.

Term-by-term:
- $\pi_\theta(o_t \mid q, o_{<t})$ is a scalar in $[0, 1]$ , the probability the live policy assigns to token $o_t$ at position $t$ .
- $\pi_{\theta_{\text{old}}}$ is the same kind of scalar from the snapshot policy.
- $\rho_t \in (0, \infty)$ is dimensionless. $\rho_t = 1$ means no change; $\rho_t > 1$ means the new policy is more likely to emit this token.
- $A_t$ is a scalar (typically computed via Generalised Advantage Estimation from a learned value network $V_\phi$ ).
- $\varepsilon$ is a small scalar, e.g., 0.2.
- The expectation is over prompts, completions, and token positions.
Worked numerical example. Take $\varepsilon = 0.2$, $A_t = +1.0$ (good token), $\rho_t = 1.5$ (live policy overshoots).
- Unclipped term: $\rho_t A_t = 1.5 \times 1.0 = 1.5$ .
- Clipped term: $\mathrm{clip}(1.5, 0.8, 1.2) \times A_t = 1.2 \times 1.0 = 1.2$ .
- PPO takes $\min(1.5, 1.2) = 1.2$ . The clip prevents the gradient from being driven by the unreasonably large ratio.
- Now $A_t = -1.0$ (bad token), $\rho_t = 1.5$ . Unclipped: $1.5 \times (-1) = -1.5$ . Clipped: $1.2 \times (-1) = -1.2$ . The $\min$ picks $-1.5$ , the larger negative, which means PPO does allow large penalties when the action was bad. (This asymmetry is intentional in PPO.)
Role: baseline against which GRPO is presented.
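Edge cases: $\rho_t = 0$ when $\pi_\theta$ assigns zero probability to a token (numerically rare with softmax outputs but possible after aggressive updates). The $\min$ over the clipped and unclipped branches lets gradient flow only when the unclipped branch is the one selected (or the two coincide); once the clipped branch is selected, that token's term is constant in $\theta$ and contributes no gradient.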
Novelty: [Adopted] from PPO (Schulman 2017). DeepSeekMath uses it as the explicit baseline.
Transferability: [Analysis] PPO is also widely used outside RLHF (game-playing, robotics).
Why it matters: this is the loss GRPO’s authors set out to simplify. The structural changes in MATH ENTRY 2-4 are best understood as edits to this expression.
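The per-token term inside the PPO expectation is small enough to check by hand. The sketch below reproduces the worked example for MATH ENTRY 1; the function names are hypothetical and the code evaluates the scalar term only, with no automatic differentiation.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_clipped_term(rho: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    unclipped = rho * advantage
    clipped = clip(rho, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

def gradient_flows(rho: float, advantage: float, eps: float = 0.2) -> bool:
    """False when the min selects a clipped, constant-in-theta branch."""
    too_high = advantage > 0 and rho > 1.0 + eps
    too_low = advantage < 0 and rho < 1.0 - eps
    return not (too_high or too_low)

good_overshoot = ppo_clipped_term(1.5, +1.0)   # clip caps the reward for overshooting
bad_overshoot = ppo_clipped_term(1.5, -1.0)    # full penalty is kept
good_undershoot = ppo_clipped_term(0.5, +1.0)  # unclipped branch is the smaller one
bad_undershoot = ppo_clipped_term(0.5, -1.0)   # clip caps the penalty relief
```

The four cases show the asymmetry: clipping only ever removes incentive to move further in a direction the update has already pushed too far.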

MATH ENTRY 2: GRPO objective (the main contribution)

Source: DeepSeekMath Section 4.1.2, Eq. (3).¹
What it is: the same clipped-ratio policy gradient as PPO, but averaged over a group of $G$ completions per prompt, with the per-token advantage replaced by a group-normalised reward and a KL penalty added directly into the loss.
Formal definition:

$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\!\left(\rho_{i,t}\hat{A}_{i,t},\; \mathrm{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_{i,t}\right) - \beta\, \mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]\right\}\right]$

where $\rho_{i,t} = \dfrac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ and $\hat{A}_{i,t}$ is the group-normalised advantage (defined in MATH ENTRY 3).

Term-by-term:
- The outer $\frac{1}{G}\sum_{i=1}^{G}$ averages over the $G$ completions sampled for the same prompt.
- The inner $\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}$ averages over the tokens in completion $i$. Completion lengths $|o_i|$ vary across $i$.
- $\rho_{i,t}$ is the per-token importance ratio, scalar in $(0, \infty)$ .
- $\hat{A}_{i,t}$ is a scalar, the same for every $t$ in a given $i$ under outcome supervision, but varies with $t$ under process supervision (MATH ENTRY 4).
- $\beta$ is a hyperparameter, typical values 0.01-0.05.
- The KL term is computed per-token (MATH ENTRY 5).
Worked numerical example. Suppose $G = 4$, all four completions have length $|o_i| = 5$ tokens, and the four group rewards are $r = (1.0, 0.0, 1.0, 0.0)$ (two correct, two wrong). Then:
- $\mathrm{mean}(r) = 0.5$ , $\mathrm{std}(r) = 0.5$ .
- Per-completion advantages: $\hat{A}_1 = \hat{A}_3 = (1.0 - 0.5)/0.5 = +1.0$ ; $\hat{A}_2 = \hat{A}_4 = (0.0 - 0.5)/0.5 = -1.0$ .
- Under outcome supervision, $\hat{A}_{i,t} = \hat{A}_i$ for every $t$ .
- Now suppose for completion 1, token 3, $\rho_{1,3} = 1.1$ and $\hat{A}_{1,3} = +1.0$ . The clipped term contributes $\min(1.1 \times 1.0, \mathrm{clip}(1.1, 0.8, 1.2) \times 1.0) = \min(1.1, 1.1) = 1.1$ .
- If $\rho_{1,3} = 1.5$ (unreasonably large), the clip kicks in: $\min(1.5, 1.2) = 1.2$ .
Role: the loss function the training loop maximises. Gradients of this loss with respect to $\theta$ drive policy updates.
Edge cases:
- If all $G$ completions get the same reward, $\mathrm{std}(r) = 0$ and the normalised advantage is undefined (0/0). Practical implementations either add a small constant to the denominator (unrelated to the clip range $\varepsilon$), which makes the advantage zero, or skip such groups.
- If $\hat{A}_{i,t} = 0$ , the token contributes only the KL penalty term. This is why “easy” problems (all-correct groups) and “impossible” problems (all-wrong groups) provide no learning signal.
Novelty: [New], this exact combination of group-normalised advantages + KL-in-the-loss + clipped ratio is the GRPO contribution.
Transferability: [Analysis] widely transferable to any setting where one can sample multiple completions per prompt and score them. The 2024-2025 open-source RL stacks (TRL, OpenRLHF, verl) all ship GRPO trainers.¹²
Why it matters: this is the loss. Everything in the rest of Section 6 is either an input to this expression (the advantage in MATH ENTRY 3-4, the KL in MATH ENTRY 5) or a baseline it replaces (MATH ENTRY 1).
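As a check on the double average in the GRPO objective, the sketch below evaluates it for one prompt given per-token ratios, per-completion advantages, and per-token KL estimates. It is a scalar evaluation in plain Python under outcome supervision, not a training implementation; `grpo_objective` is a hypothetical name and `beta=0.04` is an illustrative value inside the range quoted above.

```python
from typing import List

def grpo_objective(
    ratios: List[List[float]],      # ratios[i][t] = rho_{i,t}
    advantages: List[float],        # advantages[i] = A_i (same for every token)
    kl: List[List[float]],          # kl[i][t] = per-token KL estimate
    eps: float = 0.2,
    beta: float = 0.04,
) -> float:
    """Mean over completions of the mean over tokens of (clipped term - beta * KL)."""
    total = 0.0
    for rho_i, adv_i, kl_i in zip(ratios, advantages, kl):
        token_terms = []
        for rho, k in zip(rho_i, kl_i):
            clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
            surrogate = min(rho * adv_i, clipped_rho * adv_i)
            token_terms.append(surrogate - beta * k)
        total += sum(token_terms) / len(token_terms)   # 1/|o_i| average
    return total / len(ratios)                         # 1/G average

# Worked example: G = 4, five tokens each, advantages (+1, -1, +1, -1).
ratios = [[1.0] * 5 for _ in range(4)]
ratios[0][2] = 1.5                       # rho_{1,3} = 1.5 is clipped to 1.2
advantages = [1.0, -1.0, 1.0, -1.0]
no_kl = [[0.0] * 5 for _ in range(4)]
some_kl = [[0.5] * 5 for _ in range(4)]

j_no_kl = grpo_objective(ratios, advantages, no_kl)      # (1.04 - 1 + 1 - 1) / 4
j_with_kl = grpo_objective(ratios, advantages, some_kl)  # shifted down by beta * 0.5
```

Because each completion is averaged over its own length before the group average, a long completion's tokens each carry less weight than a short completion's tokens.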

MATH ENTRY 3: Group-normalised advantage under outcome supervision

Source: DeepSeekMath Section 4.1.2, Eq. (4); the standard GRPO advantage.¹
What it is: the per-completion advantage when the reward arrives only at the end of the completion.
Formal definition:

$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}$

for every $t \in \{1, \ldots, |o_i|\}$.

Term-by-term:
- $r_i$ is the scalar reward for completion $i$ .
- The mean and std are computed across the $G$ completions for the same prompt.
- $\hat{A}_{i,t}$ is a scalar; the same value applies to every token in completion $i$ .
- Dimensionally, $\hat{A}_{i,t}$ is dimensionless (rewards normalised to z-scores).
Worked numerical example. $G = 8$ completions with rewards $r = (1, 1, 0, 1, 0, 0, 1, 0)$:
- $\mathrm{mean}(r) = 4/8 = 0.5$ .
- $\mathrm{var}(r) = \frac{1}{8}\sum (r_i - 0.5)^2 = \frac{1}{8}(8 \times 0.25) = 0.25$ , so $\mathrm{std}(r) = 0.5$ .
- Per-completion advantages: $\hat{A}_i = +1.0$ for the four correct ( $r_i = 1$ ); $\hat{A}_i = -1.0$ for the four wrong ( $r_i = 0$ ).
Role: drives the policy-gradient direction in the GRPO loss.
Edge cases: when $\mathrm{std}(r) = 0$ (all completions tied), the advantage is undefined. Implementations clamp or skip. When reward is continuous (a model grader returning 0-1 floats), the advantage is also continuous and the gradient signal is finer-grained.
Novelty: [New] as the explicit advantage definition in GRPO; the group-baseline idea has antecedents in RLOO and other REINFORCE variants.
Transferability: [Analysis] works wherever multiple completions can be sampled and scored.
Why it matters: this is the formula that replaces PPO’s learned value network. Saving the value-network memory is the headline operational advantage.
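The advantage formula is a z-score over the group, which the sketch below computes directly. It uses the population standard deviation (divide by $G$), matching the worked example; a sample standard deviation (divide by $G - 1$) would give about $\pm 0.94$ for the same rewards. Returning zeros for tied groups is one of the two handling strategies mentioned in the edge cases, chosen here as an assumption; the function name is hypothetical.

```python
import math
from typing import List

def group_normalised_advantages(rewards: List[float], tiny: float = 1e-8) -> List[float]:
    """Z-score each reward against its own group (population std)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    if std < tiny:
        # All completions tied: no relative signal, so every advantage is zero.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

mixed = group_normalised_advantages([1, 1, 0, 1, 0, 0, 1, 0])
all_correct = group_normalised_advantages([1, 1, 1, 1])
graded = group_normalised_advantages([0.2, 0.9, 0.5, 0.4])   # continuous model-grader scores
```

With continuous rewards the advantages are no longer just two values, which is the finer-grained signal the edge-case note refers to.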

MATH ENTRY 4: Group-normalised advantage under process supervision

Source: DeepSeekMath Section 4.1.2, Eq. (5).¹
What it is: when a process reward model assigns rewards at intermediate reasoning steps, the per-token advantage is the cumulative future normalised reward from that token onward.
Formal definition:

$\hat{A}_{i,t} = \sum_{j:\, \mathrm{index}(j) \geq t} \tilde{r}_i^{(j)}$

where $\tilde{r}_i^{(j)}$ is the normalised reward for the $j$ -th step in completion $i$ , and $\mathrm{index}(j)$ is the token position where step $j$ ends.

Term-by-term:
- $\tilde{r}_i^{(j)}$ is a normalised step reward, computed across the same group of $G$ completions but at the step granularity.
- $\mathrm{index}(j)$ is an integer token position.
- $\hat{A}_{i,t}$ accumulates all step rewards whose end-token is at or after $t$ , analogous to a Monte-Carlo return.
Worked numerical example. A completion has three reasoning steps ending at tokens 10, 20, 30, with normalised step rewards $\tilde{r}^{(1)} = +0.5$, $\tilde{r}^{(2)} = -0.3$, $\tilde{r}^{(3)} = +0.8$. Then:
- Tokens 1-10: $\hat{A}_t = 0.5 + (-0.3) + 0.8 = 1.0$ .
- Tokens 11-20: $\hat{A}_t = -0.3 + 0.8 = 0.5$ .
- Tokens 21-30: $\hat{A}_t = 0.8$ .
- Tokens after step 3: 0.
Role: used when process reward models are available; DeepSeekMath reports both supervision modes.
Edge cases: requires reliable step boundaries, which themselves can be ambiguous in free-form text.
Novelty: [Adapted] from generalised advantage estimation, applied to grouped completions.
Why it matters: process supervision is the “denser reward signal” alternative. DeepSeek-R1 chose outcome supervision (MATH ENTRY 3) for its simplicity and reliability: a rule-based math/code verifier is more trustworthy than a learned step grader.
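The suffix-sum in the formal definition can be sketched as follows. The function name and the 1-based token indexing are illustrative assumptions; the step rewards are taken as already normalised across the group.

```python
def process_advantages(step_end_tokens, step_rewards, num_tokens):
    """Per-token advantage under process supervision.

    step_end_tokens: 1-based token index at which each step ends.
    step_rewards:    normalised reward for each step (same order).
    A token's advantage is the sum of rewards of all steps that end
    at or after that token.
    """
    advantages = []
    for t in range(1, num_tokens + 1):
        total = sum(r for end, r in zip(step_end_tokens, step_rewards) if end >= t)
        advantages.append(total)
    return advantages

# The worked example, with two trailing tokens after the last step.
adv = process_advantages([10, 20, 30], [0.5, -0.3, 0.8], num_tokens=32)
```

Here `adv[0]`, `adv[10]`, `adv[20]`, and `adv[30]` correspond to tokens 1, 11, 21, and 31 in the worked example.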

MATH ENTRY 5: KL divergence (token-level, in the loss directly)

Source: DeepSeekMath Section 4.1.2, Eq. (4); the “unbiased estimator” of KL.¹
What it is: a token-level estimator of the KL divergence between the live policy and the reference. Put directly in the loss rather than added as a post-hoc reward.
Formal definition:

$\mathbb{D}_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}] = \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log\frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1$

Term-by-term:
- The two probabilities are scalars in $(0, 1]$ .
- The expression is the per-token unbiased estimator of $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ (see Schulman’s blog post on KL estimators). It is non-negative and zero when $\pi_\theta = \pi_{\mathrm{ref}}$.
Worked numerical example. Suppose at a particular token, $\pi_\theta(o_t \mid \cdot) = 0.6$ and $\pi_{\mathrm{ref}}(o_t \mid \cdot) = 0.4$. Then:
- $r = \pi_{\mathrm{ref}}/\pi_\theta = 0.4/0.6 \approx 0.667$ .
- $\log r \approx -0.405$ .
- $\mathbb{D}_{\mathrm{KL}} \approx 0.667 - (-0.405) - 1 = 0.072$ .
- If the two probabilities are equal ( $r = 1$ ): $1 - 0 - 1 = 0$ . Confirms the zero point.
- If $\pi_\theta = 0.9, \pi_{\mathrm{ref}} = 0.1$ : $r \approx 0.111$ , $\log r \approx -2.197$ , $\mathbb{D}_{\mathrm{KL}} \approx 0.111 + 2.197 - 1 = 1.308$ . Larger drift, larger penalty.
Role: keeps the live policy from drifting too far from the reference. $\beta$ (the coefficient in MATH ENTRY 2) controls the strength.
Edge cases: if $\pi_\theta$ assigns very low probability to a token that $\pi_{\mathrm{ref}}$ likes, the ratio $r$ blows up, and the KL spikes.
Novelty: [Adapted]. The unbiased KL estimator is from Schulman’s notes; its use directly inside the loss (rather than as a reward correction) is a GRPO design choice that DeepSeekMath frames as cleaner than the PPO-RLHF “KL-in-reward” pattern.
Why it matters: the KL leash is what stops the model from collapsing into a degenerate policy that always emits the same answer.
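The estimator is short enough to check by hand. The sketch below takes raw probabilities to mirror the worked example; the function name is illustrative, and a real trainer would more likely work from log-probabilities for numerical stability.

```python
import math

def kl_estimate(p_theta, p_ref):
    """Per-token estimator r - log(r) - 1, with r = p_ref / p_theta.

    Non-negative for any positive inputs and exactly zero when the
    two probabilities agree.
    """
    r = p_ref / p_theta
    return r - math.log(r) - 1.0

small_drift = kl_estimate(0.6, 0.4)   # first worked case, about 0.072
large_drift = kl_estimate(0.9, 0.1)   # second worked case, about 1.308
no_drift = kl_estimate(0.5, 0.5)      # identical policies
```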

MATH ENTRY 6: Memory savings vs PPO

Source: [Analysis] derived from DeepSeekMath Section 4.1.2 motivation (“significantly reducing training resources”) and standard PPO-RLHF implementation knowledge.
What it is: a back-of-envelope account of why GRPO’s removal of the value network matters operationally.
During PPO-RLHF training, four model-sized state tensors live in GPU memory: policy ( $\pi_\theta$ ), reference ( $\pi_{\mathrm{ref}}$ ), reward model ( $r_\phi$ ), and value/critic model ( $V_\psi$ ). The value model is typically the same architecture as the policy.
During GRPO training, only three live in memory: policy, reference, reward model. The value model is gone.
Worked numerical example. For a 7B policy at bf16 precision, one model-sized state is ~14 GB of weights plus optimizer state. The value network alone consumes roughly that footprint. Removing it frees ~25-33% of the training-time memory budget (depending on whether the reward model is also that large or smaller), which can be redirected to larger group sizes $G$ , longer sequences, or larger batch.
Novelty: [Analysis]. The calculation is this publication’s own, derived from the algorithm.
Why it matters: this is the operational reason GRPO took over the open-source post-training stack within 12 months of publication.
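The back-of-envelope numbers can be reproduced in a few lines. This sketch counts weights only and assumes every resident model is the same size; optimizer state, activations, and KV cache are deliberately ignored, so the fractions are rough bounds rather than measurements.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Weight memory in GB (1 GB = 1e9 bytes); bf16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

def fraction_freed(num_models_before):
    """Share of weight memory freed by dropping one of N equally sized models."""
    return 1 / num_models_before

policy_gb = weights_gb(7e9)            # bf16 weights of a 7B model
four_equal = fraction_freed(4)         # policy, reference, reward, value all 7B
small_reward = fraction_freed(3)       # reward model small enough to ignore
```

The two fractions bracket the 25-33% range quoted above.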

Section 7: Algorithmic contributions

ALGORITHM ENTRY 1: GRPO training loop (headline algorithm)

Source: DeepSeekMath Section 4.1.2, Algorithm 1.¹
Purpose: train a policy by repeated cycles of “sample $G$ completions, score them, compute group-normalised advantages, update the policy.”
Inputs:
- Initial policy $\pi_{\theta_0}$ (LLM parameters; one model-sized tensor).
- Reward function $r$ (rule-based or model-based; takes a (prompt, completion) pair and returns a scalar).
- Prompt dataset $\mathcal{D}$ .
- Hyperparameters: group size $G$ (positive integer, typical 8-64), clip range $\varepsilon$ (typical 0.2), KL coefficient $\beta$ (typical 0.01-0.05), learning rate $\eta$ , total training steps $T$ .
Outputs: trained policy $\pi_\theta$ .
Pseudocode:

# DeepSeekMath Algorithm 1, GRPO training loop (paraphrased).
# clone_frozen, sample_batch, sample, and token_logprob are placeholder helpers;
# token_logprob(model, q, prefix, token) returns log pi(token | q, prefix) as a tensor.
import torch
from statistics import fmean, pstdev

def grpo_train(pi_theta, reward_fn, prompt_dataset, G, eps, beta, lr, T):
    pi_ref = clone_frozen(pi_theta)        # Reference snapshot for KL leash.
    optimizer = torch.optim.AdamW(pi_theta.parameters(), lr=lr)
    for step in range(T):
        pi_old = clone_frozen(pi_theta)    # Snapshot at start of step.
        q_batch = sample_batch(prompt_dataset)
        loss = 0.0                         # Objective to maximise; its negative is minimised below.
        for q in q_batch:
            # 1. Sample G completions per prompt from pi_old.
            outputs = [sample(pi_old, q) for _ in range(G)]
            # 2. Score each completion (plain floats, so no gradient flows through rewards).
            rewards = [reward_fn(q, o) for o in outputs]
            # 3. Group-normalise to get advantages.
            mu, sigma = fmean(rewards), pstdev(rewards) + 1e-8
            advantages = [(r - mu) / sigma for r in rewards]
            # 4. Accumulate the GRPO surrogate over every token of every completion.
            for o_i, A_i in zip(outputs, advantages):
                for t, token in enumerate(o_i):
                    logp = token_logprob(pi_theta, q, o_i[:t], token)
                    with torch.no_grad():  # Snapshots are constants.
                        logp_old = token_logprob(pi_old, q, o_i[:t], token)
                        logp_ref = token_logprob(pi_ref, q, o_i[:t], token)
                    rho = torch.exp(logp - logp_old)
                    surrogate = torch.minimum(rho * A_i,
                                              torch.clamp(rho, 1 - eps, 1 + eps) * A_i)
                    log_r = logp_ref - logp            # log(pi_ref / pi_theta)
                    kl_token = torch.exp(log_r) - log_r - 1
                    loss = loss + (surrogate - beta * kl_token) / len(o_i)
        loss = loss / (G * len(q_batch))
        # 5. One gradient step per batch (the gradient flows only into pi_theta via rho and the KL term).
        optimizer.zero_grad()
        (-loss).backward()
        optimizer.step()
        # Optional: periodically update pi_ref to current pi_theta (iterative GRPO variant).
    return pi_theta

Hand-traced example on minimal input. Take $G = 2$, one prompt $q$, completions $o_1, o_2$ each of length 3 tokens. Rewards $r = (1, 0)$.
- Step 3 (group normalise): $\mu = 0.5$, $\sigma = 0.5$, advantages $= (+1, -1)$.
- Step 4 loop for $i = 1$, $A_1 = +1$:
  - Token 1: $\rho_{1,1} = \pi_\theta / \pi_{\mathrm{old}}$. Assume both = 0.5 initially (fresh snapshot), so $\rho = 1.0$. Surrogate = $\min(1.0 \times 1, 1.0 \times 1) = 1.0$. KL term = 0 (initial $\pi_\theta = \pi_{\mathrm{ref}}$). Contribution: $1.0/3$.
  - Tokens 2, 3 similar: contribution $1.0/3$ each.
  - Sub-total for $i = 1$: 1.0.
- Step 4 loop for $i = 2$, $A_2 = -1$:
  - All three tokens contribute $-1.0/3$. Sub-total: $-1.0$.
- Total loss = $(1.0 + (-1.0))/(G \times 1) = 0$. The value is zero because the policy still equals the snapshot and the advantages sum to zero, but the gradient is not zero: differentiating $\rho_{i,t} \hat{A}_i$ at $\rho = 1$ gives $\hat{A}_i \nabla_\theta \log \pi_\theta$, so the first update already raises the log-probability of $o_1$’s tokens and lowers that of $o_2$’s.
- The trace shows why $\rho_{i,t}$ being a function of current $\theta$ matters: the gradient flows through $\rho$ , not through the rewards (which are detached scalars).
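A zero objective value does not imply a zero gradient, and a toy case makes this concrete. The sketch below assumes a hypothetical one-parameter policy that emits token `a` with probability `sigmoid(theta)` and token `b` otherwise, two one-token completions (`a` with advantage +1, `b` with advantage -1), and no KL term. The gradient is estimated by central finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def surrogate(theta, theta_old=0.0, eps=0.2):
    """Clipped GRPO surrogate for the two-completion toy group."""
    p_a, p_a_old = sigmoid(theta), sigmoid(theta_old)
    group = [(p_a, p_a_old, +1.0), (1.0 - p_a, 1.0 - p_a_old, -1.0)]
    terms = []
    for p_new, p_old, adv in group:
        rho = p_new / p_old
        clipped = max(1.0 - eps, min(rho, 1.0 + eps))
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)

# At the snapshot the surrogate evaluates to zero...
value_at_snapshot = surrogate(0.0)
# ...but its slope with respect to theta does not.
h = 1e-5
grad_at_snapshot = (surrogate(h) - surrogate(-h)) / (2 * h)
```

In this toy setting the slope is positive, so a gradient-ascent step increases the probability of the rewarded token `a`.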
Complexity:
- Forward + backward: $\mathcal{O}(|\text{batch}| \times G \times |o_{\max}| \times d_{\text{model}}^2)$ per step.
- Memory: 3 model-sized state tensors (policy, reference, reward model) vs PPO’s 4.
- Bottleneck step explicitly: the $G$ sequential decode passes per prompt. This is partially amortised when implementations batch the $G$ completions together.
Hyperparameters used in DeepSeekMath: $G = 64$ , $\varepsilon = 0.2$ , $\beta = 0.04$ , learning rate $1 \times 10^{-6}$ . Source: paper Section 5.2.¹
Failure modes: groups with $\mathrm{std}(r) = 0$ (skip); reward hacking when the verifier admits exploits; instability if $\beta$ is too small (drift) or too large (no learning).
Novelty: [New] algorithm structure; [Adapted] from PPO.
Transferability: [Analysis] works for any sequence-modelling task with a sample-able policy and a scorable output. Used in code generation, math, scientific reasoning, and conversational tuning across the 2024-2025 open-source RL stack.
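The clipping term inherited from PPO is easiest to read through a few concrete ratios. This is a minimal scalar sketch with an illustrative function name; the clip range of 0.2 is the typical value quoted above.

```python
def clipped_surrogate(rho, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_rho = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * advantage, clipped_rho * advantage)

# Good completion, policy already moved far toward it: the gain is capped.
gain_capped = clipped_surrogate(1.5, +1.0)
# Bad completion, policy moved toward it: the full penalty is kept.
penalty_kept = clipped_surrogate(1.5, -1.0)
# Bad completion, policy already moved away: no further credit for moving.
credit_capped = clipped_surrogate(0.5, -1.0)
```

The asymmetry is the point of the `min`: the objective stops rewarding moves that are already large in the helpful direction, but never hides a move in the harmful direction.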

ALGORITHM ENTRY 2: DeepSeek-R1 four-stage training pipeline

Source: DeepSeek-R1 Section 2.3.²
Purpose: turn DeepSeek-V3-Base into a polished reasoning model (DeepSeek-R1) with readable output and broad capability, using GRPO as the inner loop.
Inputs: DeepSeek-V3-Base checkpoint; reasoning prompt set; rule-based math/code verifiers; preference reward model.
Outputs: DeepSeek-R1 model.
Pseudocode:

Stage 1, Cold-start SFT
  curate ~thousands of long-CoT examples (filtered R1-Zero outputs + prompted V3 outputs)
  SFT(V3-Base) on this dataset → model_1

Stage 2, Reasoning-oriented RL
  GRPO(model_1) with:
    - accuracy reward (math: rule check; code: unit test)
    - format reward (<think>...</think> structure)
    - language-consistency reward (penalise English/Chinese code-switching)
  → model_2

Stage 3, Rejection sampling SFT
  generate completions from model_2 on broad prompt distribution
  score with rule-based + V3-as-judge rewards
  keep top-scoring ~600k reasoning samples + ~200k non-reasoning samples
  SFT(V3-Base) on this 800k mix (two epochs) → model_3

Stage 4, RL for all scenarios
  GRPO(model_3) with:
    - rule-based reward for reasoning prompts
    - preference reward model for general prompts (helpfulness + harmlessness)
  → DeepSeek-R1

Hand-traced example. Suppose after Stage 1 model_1 has AIME 2024 pass@1 ≈ 30% (rough analogue to DeepSeekMath-Instruct’s level). After Stage 2 RL with GRPO, AIME climbs into the 70s; Stage 2 does the heavy lifting. Stage 3 SFT re-bakes the reasoning capability into a checkpoint that also writes well and answers non-reasoning prompts coherently. Stage 4 RL then polishes alignment without sacrificing reasoning. The final R1 reports 79.8% AIME 2024 pass@1.²
Complexity: each RL stage runs the GRPO loop in ALGORITHM ENTRY 1 over the relevant prompt distribution; SFT stages are standard. The full pipeline is compute-heavy but parallelisable.
Novelty: [New] pipeline composition; each individual stage uses known building blocks.
Why it matters: this is the existence proof that a fully open-weight model trained on a published recipe can match a leading closed-source reasoner on benchmark suites.
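Stage 3's rejection sampling reduces to "generate, score, keep what passes". The sketch below shows only that filtering logic. `generate` and `score` are caller-supplied stand-ins for the Stage 2 model and the rule-based or judge rewards, and the threshold and sample count are illustrative assumptions, not values from the paper.

```python
def rejection_sample(prompts, generate, score, samples_per_prompt=4, threshold=1.0):
    """Keep (prompt, completion) pairs whose score clears the threshold."""
    kept = []
    for prompt in prompts:
        for completion in generate(prompt, samples_per_prompt):
            if score(prompt, completion) >= threshold:
                kept.append({"prompt": prompt, "completion": completion})
    return kept

# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt, k):
    # Pretend the model proposes the answers 3, 4, 5, ... for every prompt.
    return [f"{prompt} = {answer}" for answer in range(3, 3 + k)]

def toy_score(prompt, completion):
    # Rule-based check: only the completion ending in "= 4" is correct.
    return 1.0 if completion.endswith("= 4") else 0.0

sft_data = rejection_sample(["2 + 2"], toy_generate, toy_score)
```

The surviving pairs would then form the SFT dataset for the next stage.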

ALGORITHM ENTRY 3: OpenAI RFT sample-grade-update loop

Source: OpenAI Reinforcement Fine-Tuning documentation + Cookbook.⁸⁹ [Reconstructed]: the precise loss is not in OpenAI’s public documentation; the description below is based on the public framing.
Purpose: customer-facing API for adapting an o-series reasoning model to a verifier-defined task.
Inputs:
- Training dataset (JSONL: prompts + reference contexts / answers).
- Grader specification (string-match grader, model grader, or multi-grader composite).
- Base model choice (at GA: o4-mini).
- Training configuration (epochs, learning rate, possibly group size; OpenAI’s API exposes few knobs).
Outputs: a fine-tuned checkpoint, available via the standard model: ft:... parameter.
Pseudocode (reconstructed):

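# Reconstructed sketch. sample, group_normalize, and update are placeholders
# for steps OpenAI does not document; none of them is an OpenAI API call.
import random

def rft_loop(current_model, dataset, grader, epochs, G):
    for epoch in range(epochs):
        # Visit the examples in a fresh random order each epoch.
        for example in random.sample(dataset, len(dataset)):
            prompt, reference = example.prompt, example.reference
            # Sample G candidate completions (G not disclosed; likely in the 8-32 range).
            candidates = [sample(current_model, prompt) for _ in range(G)]
            # Run the grader on each candidate.
            scores = [grader(prompt, candidate, reference) for candidate in candidates]
            # Compute advantages (group-normalised, plausibly).
            advantages = group_normalize(scores)
            # Apply a policy-gradient update (loss form not published).
            update(current_model, candidates, advantages)
    return current_model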

Hand-traced example. Customer dataset: a few dozen rare-disease cases with reference gene labels. Grader: model grader prompted with “did the model identify the correct causative gene?” Each training example → $G$ candidate completions → $G$ scores in $[0, 1]$ → group-normalised advantages → policy update on o4-mini.¹⁰
Complexity: OpenAI does not publish per-step cost; the pricing surface is per-token. The “tens to hundreds of samples” data-efficiency claim suggests the cycle is computationally lean per example, with most of the cost in the o-series model’s reasoning chain itself.
Novelty: [Adopted]. The GRPO-style loop is adopted into a commercial API surface. [Analysis] The technical content is not new; the productisation is.
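To make the grader idea concrete, here is a sketch of a string-match grader, a looser containment grader, and a weighted composite. These are plain Python functions written for illustration: the names, weights, and labels are hypothetical, and this is not OpenAI's grader configuration schema.

```python
def exact_match_grader(candidate, reference):
    """1.0 when the answer equals the reference, ignoring case and outer whitespace."""
    return 1.0 if candidate.strip().lower() == reference.strip().lower() else 0.0

def contains_grader(candidate, reference):
    """1.0 when the reference label appears anywhere in the answer."""
    return 1.0 if reference.strip().lower() in candidate.lower() else 0.0

def composite_grader(candidate, reference, weights=(0.5, 0.5)):
    """Weighted blend of the two graders; stays in [0, 1] when weights sum to 1."""
    w_exact, w_contains = weights
    return (w_exact * exact_match_grader(candidate, reference)
            + w_contains * contains_grader(candidate, reference))

# Hypothetical labels: an exact answer, a verbose answer, and a wrong answer.
answers = ["GENE-A", "Most likely GENE-A, pending confirmation", "GENE-B"]
scores = [composite_grader(a, "GENE-A") for a in answers]
```

A composite like this yields graded scores rather than a binary pass/fail, which gives the group-normalised advantage more to work with when few candidates are exactly right.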

Section 8: Specialised design contributions

Subsection 8A, LLM / prompt design.

PROMPT ENTRY 1: DeepSeek-R1-Zero training template

Source: DeepSeek-R1 Section 2.2.3, Table 1.²
Role: enforces the <think>...</think><answer>...</answer> structure that the format reward latches onto.
Prompt type: zero-shot template prepended to every training prompt.
Reconstructed template (a close paraphrase; the authoritative wording is in the paper’s Table 1):

A conversation between User and Assistant. The user asks a question, and the
Assistant solves it. The assistant first thinks about the reasoning process
in the mind and then provides the user with the answer. The reasoning process
and answer are enclosed within <think> </think> and <answer> </answer> tags,
respectively, i.e., <think> reasoning process here </think> <answer> answer
here </answer>. User: {prompt}. Assistant:

Failure handling: the format reward pays a bonus for correct tagging; the model converges on the structure within a few thousand RL steps.
Design rationale: the authors deliberately did not include content-specific scaffolding (“think step by step”, “show your work”) to let reasoning behaviour emerge from reward optimisation alone.
Complexity: ~50 tokens of preamble per training example.
Novelty: [New], this exact template is DeepSeek-R1’s design choice.
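A sketch of how the template and a format check might be wired together. The template string is copied from the reconstruction above; the regular expression and the 0/1 reward values are assumptions for illustration, since the paper's exact format-reward rule is not reproduced here.

```python
import re

R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process "
    "in the mind and then provides the user with the answer. The reasoning process "
    "and answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> <answer> answer "
    "here </answer>. User: {prompt}. Assistant:"
)

# One <think> block followed by one <answer> block, and nothing else.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def build_prompt(question):
    """Fill the training template with a single user question."""
    return R1_ZERO_TEMPLATE.format(prompt=question)

def format_reward(completion):
    """1.0 if the completion follows the tag structure, else 0.0 (assumed values)."""
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0

prompt = build_prompt("What is 2 + 2?")
well_formed = format_reward("<think>2 + 2 = 4</think> <answer>4</answer>")
malformed = format_reward("The answer is 4.")
```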

Subsection 8B, Architecture-specific details. Not applicable to the GRPO papers themselves: DeepSeekMath uses a standard 7B transformer; DeepSeek-R1 inherits DeepSeek-V3’s MoE architecture (671B total parameters, 37B active per token). GRPO is architecture-agnostic.

Subsection 8C, Training specifics (DeepSeek-R1 Section 2.3 + DeepSeekMath Section 5.2):

DeepSeekMath GRPO: group size $G = 64$ , KL $\beta = 0.04$ , clip $\varepsilon = 0.2$ , learning rate $1 \times 10^{-6}$ , single epoch over the chain-of-thought + program-aided RL prompt set.
DeepSeek-R1: hardware is not fully specified in the public paper. Secondary commentary puts training compute on the order of $10^{6}$ GPU-hours [Reconstructed]; the paper does not publish a single headline FLOP number.
OpenAI RFT: training hyperparameters not exposed to the customer.

Subsection 8D, Inference / deployment specifics. Not the focus of these papers. DeepSeek-R1 inference is standard transformer decoding; DeepSeek’s deployment notes emphasise that the distilled variants (Qwen 1.5B/7B/14B/32B, Llama 8B/70B) are far cheaper to serve than the full R1.¹³

Section 9: Experiments and results

Datasets. DeepSeekMath’s training: the paper’s own DeepSeekMath-Corpus (120B math-domain tokens) for pre-training; MATH and GSM8K-style problems for the RL phase. DeepSeek-R1’s training: undisclosed reasoning prompt set for stages 2 and 4; an internal mix for the rejection-sampling SFT.

Baselines. DeepSeekMath compares against Mistral 7B, LLaMA-2 70B, Qwen 7B, and InternLM2 across math benchmarks. DeepSeek-R1 compares against OpenAI o1-mini, o1-1217, Claude 3.5 Sonnet, GPT-4o, and the DeepSeek-V3 base model.

Evaluation metrics. Pass@1 (single-attempt accuracy), cons@64 (accuracy of the majority-vote answer over 64 samples), Codeforces Elo rating, MMLU multiple-choice accuracy.
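The difference between pass@1 and majority-vote consensus is clearest on a single problem with several sampled answers. This sketch uses illustrative function names and a toy group of 8 samples rather than 64; pass@1 is computed as the mean accuracy over the samples.

```python
from collections import Counter

def pass_at_1(sampled_answers, reference):
    """Mean single-attempt accuracy over k independent samples of one problem."""
    return sum(a == reference for a in sampled_answers) / len(sampled_answers)

def cons_at_k(sampled_answers, reference):
    """1.0 if the most common answer among the k samples is correct, else 0.0."""
    majority, _count = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0

samples = ["42", "42", "41", "42", "40", "42", "17", "41"]
p1 = pass_at_1(samples, "42")      # half of the individual samples are right
cons = cons_at_k(samples, "42")    # but the plurality answer is right
```

Averaged over a benchmark, this is why a consensus score can sit well above pass@1 for the same model: wrong answers scatter while correct ones agree.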

Reproduce key results.

Table 1: DeepSeekMath benchmark progression (Section 5).¹

Model	MATH	GSM8K
DeepSeekMath-Base 7B	36.2%	64.2%
DeepSeekMath-Instruct 7B (SFT)	46.8%	82.9%
DeepSeekMath-RL 7B (SFT + GRPO)	51.7%	88.2%

Table reproduced from DeepSeekMath (arXiv:2402.03300), Section 5, for editorial coverage.

The +4.9 point gain on MATH from SFT to SFT+GRPO is the paper’s headline empirical claim for GRPO.

Table 2: DeepSeek-R1 reasoning benchmarks (Section 3.1, Table 4).²

Benchmark	DeepSeek-R1	OpenAI o1-1217	OpenAI o1-mini	DeepSeek-V3 (base)
AIME 2024 pass@1	79.8%	79.2%	63.6%	39.2%
MATH-500 pass@1	97.3%	96.4%	90.0%	90.2%
GPQA Diamond pass@1	71.5%	75.7%	60.0%	59.1%
Codeforces (Elo)	2,029	2,061	1,820	1,134
MMLU (pass@1)	90.8%	91.8%	85.2%	88.5%

Table reproduced from DeepSeek-R1 (arXiv:2501.12948), Section 3.1 Table 4, for editorial coverage.

R1 matches or narrowly beats o1-1217 on AIME 2024 and MATH-500; trails on GPQA Diamond, Codeforces Elo, and MMLU.

Table 3: DeepSeek-R1-Zero vs DeepSeek-R1 (Sections 2.2.4 and 3.1).²

Benchmark	DeepSeek-R1-Zero (RL only, no SFT)	DeepSeek-R1 (full pipeline)
AIME 2024 pass@1	71.0%	79.8%
AIME 2024 cons@64	86.7%	(not separately reported)
MATH-500	(paper reports the matched-quality range)	97.3%

Reproduced from DeepSeek-R1 (arXiv:2501.12948), Sections 2.2.4 and 3.1, for editorial coverage. The R1-Zero column shows pure-RL performance from the base model.

Table 4: Distilled models (Section 3.2, Table 5).²

Distilled model	AIME 2024 pass@1	MATH-500 pass@1
R1-Distill-Qwen-1.5B	28.9%	83.9%
R1-Distill-Qwen-7B	55.5%	92.8%
R1-Distill-Qwen-14B	69.7%	93.9%
R1-Distill-Qwen-32B	72.6%	94.3%
R1-Distill-Llama-8B	50.4%	89.1%
R1-Distill-Llama-70B	70.0%	94.5%

Reproduced from DeepSeek-R1 (arXiv:2501.12948), Section 3.2 Table 5, for editorial coverage.

The Qwen-32B distilled model at 72.6% AIME 2024 beats o1-mini’s 63.6% with a far smaller serving footprint. This is the result that drove the open-source community’s adoption of R1-Distill as a default reasoning baseline through 2025.

Ablations. DeepSeekMath Section 5.2 includes ablations comparing GRPO with and without iterative reference-model updates, and outcome vs process supervision. The headline ablation is GRPO vs PPO with the same setup: GRPO matches or marginally beats PPO on MATH while using less memory.

Hyperparameter sensitivity. Not extensively explored in the paper. The published $\beta = 0.04$ and $\varepsilon = 0.2$ are reused across follow-up work. [Analysis] Subsequent community work (Dr. GRPO, $\lambda$ -GRPO) suggests the length-normalisation in the inner average is more impactful than $\beta$ or $\varepsilon$ .¹¹

Qualitative results. DeepSeek-R1’s “aha moment” example (Section 2.2.4): the model spontaneously inserts “wait, let me reconsider that step” mid-reasoning at an intermediate training step. The paper presents this as evidence of emergent self-correction.

Experimental scope limits. R1’s training compute, exact prompt-set composition, and rejection-sampling filtering criteria are partially specified. The Stage 3 600k + 200k sample count is given as approximate; selection criteria are described qualitatively.²

Independent benchmark cross-check. AIME 2024 and MATH-500 leaderboards on Papers With Code corroborate R1’s reported scores within rounding. The Codeforces rating of 2,029 is an expected rating that the paper computes from the model’s results on contest problems, not a rating earned in live rated contests, so it should be read as an estimate. [External comparison] Independent third-party reproductions of R1’s RL recipe (e.g., the Open-R1 community effort and various Hugging Face replications during 2025) reach similar AIME numbers when starting from comparable base models, supporting the claim that the recipe is reproducible.

Evidence audit.

Strongly supported: GRPO converges on math benchmarks; outcome reward is sufficient to surface long-CoT behaviour; distillation transfers reasoning to smaller models.
Partially supported: the “aha moment” framing as evidence of emergent meta-reasoning. The behaviour is real, but the causal claim that GRPO specifically caused it (rather than the base model already containing the capability and RL surfacing it) is underdetermined [Analysis].
Narrow evidence: hardware / training-compute disclosure for R1 is thin in the open paper.

Section 10: Technical novelty summary

Component	Type	Novelty level	Justification	Source
Group-relative advantage (no critic)	Algorithm	Incrementally novel	Group baselines have antecedents in RLOO and similar variants; the specific framing as PPO-without-the-critic is new.	DeepSeekMath §4.1.2
KL-in-loss (not KL-in-reward)	Algorithm	Combination novel	The unbiased KL estimator is Schulman’s; placing it in the surrogate loss rather than as a reward correction is a deliberate GRPO design.	DeepSeekMath §4.1.2
Pure-RL reasoning emergence (R1-Zero)	Empirical	Fully novel	Demonstrates strong reasoning without SFT cold-start — the first widely reproduced such demonstration in open weights.	DeepSeek-R1 §2.2
Four-stage SFT/RL alternation	Pipeline	Combination novel	Each stage uses established techniques; the alternation pattern with rejection sampling between RL passes is the design.	DeepSeek-R1 §2.3
R1-Distill family	Empirical	Incrementally novel	Distilling reasoning chains into smaller models is standard; the scale and quality of release is the contribution.	DeepSeek-R1 §3.2
OpenAI RFT API surface	Product	Adopted	Algorithmic content adopted from GRPO/group-baseline literature; productisation is the contribution.	OpenAI docs

Single most novel contribution. Across the cluster: DeepSeek-R1-Zero. The empirical demonstration that strong reasoning capability can be elicited from a base model with pure RL (no SFT, no human preference data, just verifier rewards and GRPO) reframed the post-training literature. Within twelve months it had triggered the “RLVR” (Reinforcement Learning with Verifiable Rewards) line of work, including the publication’s prior coverage of RLVR-implicitly-incentivises-correct-reasoning.

What the papers do NOT claim to be novel. The clipped surrogate (PPO). The KL estimator (Schulman). The MoE backbone (DeepSeek-V3). The rule-based math verifier (long pre-dates the paper).

Section 11: Situating the work

What prior work did. Before 2024, the dominant post-training recipe for reasoning models was: SFT on chain-of-thought data, sometimes followed by PPO-RLHF against a learned reward model. The InstructGPT lineage⁷ defined this stack. Christiano et al. 2017 established the human-preference-pairs methodology. DPO (Rafailov et al. 2023) and its successors offered the “preference learning without RL infrastructure” alternative.

What this paper changes conceptually. The GRPO + DeepSeek-R1 line of work shifts the centre of post-training away from human preferences and toward verifier rewards: rule-based checks, executable tests, model-graded rubrics. The conceptual move is “if the domain admits a verifier, you don’t need preference labels.” This has reshaped the 2025-2026 post-training literature.

[External comparison] Contemporaneous related papers (≥2 required).

Lambert et al. 2025, “Tülu 3” (or contemporary AI2 work) explored a similar verifier-reward + group-baseline approach for open-source instruction tuning. Differs from R1 by emphasising broader task coverage over reasoning-depth.
OpenAI o1 system card / system documentation, December 2024. o1’s exact training recipe is not disclosed, but the broad description (“scale up reinforcement learning”) and the timing make it the unstated counterpart to R1. The relationship is competitive rather than collaborative.
Liu et al. 2025, “Understanding R1-Zero-Like Training” (Dr. GRPO).¹¹ Sea AI Lab. Critiques GRPO’s per-completion length normalisation as length-biased and proposes a corrected variant. The paper became the canonical reference for “GRPO is good but the normalisation has issues.”
The RLVR line of work, 2025-2026. Verifier-driven RL as a general post-training paradigm beyond math. Cited in this publication’s prior coverage.

[Reviewer Perspective] Strongest skeptical objection. Two related objections compete for first place. (1) GRPO is essentially REINFORCE-with-baseline plus PPO clipping, which is closer to a recombination than a new algorithm. The “novel” framing in DeepSeekMath is generous. (2) DeepSeek-R1’s emergent-reasoning narrative (“aha moment”) is overstated: the base model (V3) already contained sophisticated reasoning capability; RL surfaced it but did not create it. A more careful claim would say “RL is sufficient to surface latent reasoning in a strong base model” rather than “RL produces reasoning.”

[Reviewer Perspective] Author-side rebuttal. On (1): the value of an algorithm is its empirical reliability, not its categorical novelty; GRPO works, is simple to implement, and saves memory. On (2): the paper does not claim RL creates reasoning ex nihilo; the “incentivize” framing in the title is deliberate.

What remains unsolved.

Reward-hacking with model graders: when the reward is a learned model rather than a rule, RL eventually finds ways to game it.
Length bias: Dr. GRPO and follow-ups document that GRPO can prefer longer or shorter completions for reasons unrelated to quality.
Out-of-distribution generalisation: R1’s reasoning gains on AIME / MATH transfer partially but not fully to less verifiable domains (creative writing, open-ended QA).

Three future research directions.

Verifier robustness. Designing graders that resist reward-hacking when scaled to harder tasks (open mathematical research, theorem proving, legal reasoning). [Analysis]
Mixed reward signals. Combining rule-based, model-grader, and human-preference rewards in a single GRPO objective without one dominating. [Analysis]
Length-bias-corrected variants. Dr. GRPO is one attempt; $\lambda$ -GRPO is another. The community has not converged on the right normalisation. [Analysis]
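The length-bias issue comes down to how much weight each token's term receives in the group objective. The sketch below compares the per-completion average used in GRPO with a length-independent constant normaliser. The constant form is an assumption written in the spirit of the Dr. GRPO correction, not a transcription of that paper's loss.

```python
def per_token_weights(lengths, mode, max_len=None):
    """Weight of one token's term for each completion in a group.

    mode="sequence_mean": 1 / (G * len_i), the per-completion average.
    mode="constant":      1 / (G * max_len), identical for every completion.
    """
    G = len(lengths)
    if mode == "sequence_mean":
        return [1.0 / (G * n) for n in lengths]
    if mode == "constant":
        return [1.0 / (G * max_len) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")

lengths = [10, 100]  # a short and a long completion in the same group
grpo_weights = per_token_weights(lengths, "sequence_mean")
constant_weights = per_token_weights(lengths, "constant", max_len=100)
```

Under the per-completion average, each token of the 100-token completion carries a tenth of the weight of a token in the 10-token one. For a wrong completion that means a weaker per-token penalty the longer it runs, which is one route to the length bias discussed above.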

Section 12: Critical analysis

Strengths.

The algorithm is genuinely simple. ALGORITHM ENTRY 1 fits on a single page; the open-source TRL GRPOTrainer is < 1000 lines of trainer code.¹²
The empirical claims are reproduced in independent community efforts (Open-R1, various HF replications). R1’s recipe is real, not a labs-only artefact.
The release of R1 weights and the full distilled family under permissive licences has brought the open-source community closer to OpenAI-tier reasoning capability than any prior release.¹³

Weaknesses stated by the authors.

DeepSeek-R1 Section 4.1: distillation outperforms direct RL on smaller models; the authors flag that small models cannot independently discover the reasoning patterns. Implication: the recipe is not “GRPO produces reasoning at any scale”; it’s “GRPO surfaces latent reasoning in sufficiently capable base models.”
DeepSeek-R1 Section 4.2: R1 underperforms on tasks where rule-based rewards are unavailable (creative writing, multilingual non-Chinese/English contexts).

Weaknesses not stated by the authors ([Reviewer Perspective]).

Independent commentary (Dr. GRPO and follow-ups) documents that GRPO’s length-normalisation introduces systematic bias toward longer-or-shorter completions depending on per-token vs per-completion averaging.¹¹ The publication’s reading is that this is a real artefact of the loss form, not a minor implementation detail.
The R1 paper’s compute disclosure is thin. [Analysis] For a paper published in Nature, the absence of a single headline training-compute figure is notable.
OpenAI RFT inherits the same opacity: customers cannot inspect or modify the inner loss; the only knobs are dataset + grader + base model + epoch count.

Reproducibility check.

Code: DeepSeek-R1 GitHub repository is public.¹³ DeepSeekMath’s GRPO is implemented in TRL.¹²
Data: R1’s training data is not released; the rejection-sampling SFT mix is described qualitatively, not enumerated.
Hyperparameters: partially released ( $G$ , $\beta$ , $\varepsilon$ , learning rate for DeepSeekMath; less complete for R1).
Compute: not fully reported.
Trained model weights: released on Hugging Face for R1 base, R1, and all distilled variants.¹⁴
Evaluation set: AIME / MATH / GPQA / Codeforces are public; R1’s exact eval harness for self-reported numbers is partially documented.
Overall: partially reproducible. The model is fully available and the recipe is described at a useful level of detail, but the exact prompt set and compute disclosure would be needed for a clean reproduction.

Methodology.

Sample size: AIME 2024 has 30 problems; pass@1 is computed over multiple sample seeds. MATH-500 is a 500-problem subset.
Evaluation set: standard public benchmarks (AIME 2024, MATH-500 from Lightman et al., GPQA Diamond, Codeforces). Contamination checks are not exhaustively discussed in the open paper. [Analysis] For AIME 2024 specifically, the test set was released after R1’s training-data cutoff, so contamination is low risk.
Baselines: o1-1217, o1-mini, Claude 3.5 Sonnet, GPT-4o, V3-Base, plus distillation comparisons.
Hardware/compute: not fully reported in the public paper; commentary suggests on the order of $10^{6}$ H800 GPU-hours [Reconstructed] from secondary sources, not stated explicitly in the paper.

Generalisability.

To other domains: GRPO + verifier rewards works wherever a verifier exists. Math and code are the natural fit. Open-ended domains (writing, summarisation) need model graders, which reintroduce reward-hacking risk.
To larger scales: R1 is already at frontier scale (DeepSeek-V3 backbone, 671B total params MoE). Scaling further is more an engineering than an algorithmic question.
To smaller models: per the paper’s own ablation, RL on small base models underperforms distillation from a stronger RL-trained teacher. Practical recommendation: distill, don’t direct-RL, below ~14B parameters.

Assumption audit.

“Verifier rewards are noise-free”: true for math-answer matching and code unit tests; false for model graders, which can hallucinate.
“Group statistics are informative”: fails when groups are uniformly correct or uniformly wrong. Curriculum design can mitigate.
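How often a group is uninformative can be estimated directly. The sketch below assumes binary rewards and independent samples with a fixed per-sample success probability, which is an idealisation; the function name is illustrative.

```python
def prob_uninformative_group(p_correct, G):
    """Probability that all G binary rewards in a group are identical.

    Either every completion is correct (p^G) or every one is wrong
    ((1 - p)^G); in both cases std(r) = 0 and the group gives no gradient.
    """
    return p_correct ** G + (1.0 - p_correct) ** G

easy = prob_uninformative_group(0.95, 8)      # prompt the model nearly always solves
balanced = prob_uninformative_group(0.5, 8)   # prompt at the edge of its ability
hard = prob_uninformative_group(0.02, 8)      # prompt it almost never solves
```

Under these assumptions, well over half of the groups are wasted for the easy and hard prompts, while the balanced prompt almost always yields a usable signal. That is the quantitative case for curriculum design and for larger $G$.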

What would make the papers significantly stronger. [Analysis] (1) Full compute disclosure for R1. (2) An explicit ablation comparing GRPO and PPO on the same R1 base + reward setup (the paper compares against PPO indirectly via DeepSeekMath, but a head-to-head at the R1 scale would close the loop). (3) Acknowledgement and discussion of the length-bias critique (Dr. GRPO) in a v3 of the paper.

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: GRPO trainer (the algorithm itself)

What it is: the loss function + training loop in ALGORITHM ENTRY 1.
Why worth reusing: simpler than PPO, no value network, well-supported in TRL and verl.¹²
Preconditions: a verifier or grader that can score completions; a base model strong enough to occasionally produce correct answers (otherwise rewards are all-zero and the gradient vanishes).
What would need to change in a different setting: the reward function. Math benchmark → rule-based check. Code → unit-test runner. Open-ended → model grader (with reward-hacking caveats).
Risks: groups with zero variance, reward hacking on model graders, length bias.
Interaction effects: the KL coefficient $\beta$ interacts with reward scale. If rewards are 0/1, $\beta = 0.04$ is reasonable; if rewards are dense floats, retune.
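
To show where $\beta$ enters, here is a minimal sketch of one token's contribution to a GRPO-style objective, assuming a PPO-style clipped surrogate and the ratio-based KL estimator used in DeepSeekMath. The function name, the scalar log-probability inputs, and the `clip_eps=0.2` default are illustrative assumptions; real trainers operate on tensors and average over tokens and completions.

```python
import math


def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, beta=0.04):
    """Per-token objective (to be maximised) for one sampled token.

    logp_new: log-prob under the policy being updated
    logp_old: log-prob under the policy that sampled the completion
    logp_ref: log-prob under the frozen reference policy
    advantage: group-normalised advantage of the whole completion
    """
    # Clipped surrogate, as in PPO.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)

    # KL estimate: pi_ref/pi_new - log(pi_ref/pi_new) - 1, always >= 0.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0

    return surrogate - beta * kl


# On-policy and equal to the reference: the objective is just the advantage.
print(grpo_token_objective(-1.0, -1.0, -1.0, advantage=0.5))

# The policy has drifted from the reference: the KL term now subtracts.
print(grpo_token_objective(-1.0, -1.0, -2.0, advantage=0.5))
```

The KL term is measured in absolute units while the surrogate scales with the advantage, so any change to how rewards are scaled or normalised shifts the balance between the two and is a reason to revisit $\beta$.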

REUSABLE COMPONENT 2: DeepSeek-R1 four-stage pipeline

What it is: cold-start SFT → reasoning RL → rejection-sampling SFT → general RL.
Why worth reusing: documented existence proof that this composition produces a polished reasoning model. Each individual stage is well-understood.
Preconditions: a strong base model (R1 used V3-Base, 671B MoE); rule-based verifiers for the reasoning RL stage; access to compute for two RL passes.
What would need to change: the cold-start dataset has to match the target output format; the language-consistency reward is bespoke and may not transfer.
Risks: any stage can erode capability from the prior stage if the reward signal misweights.

REUSABLE COMPONENT 3: Rule-based reasoning rewards (accuracy + format)

What it is: deterministic functions that return scalar rewards for correct math answers, passing code unit tests, and correct output formatting.
Why worth reusing: zero-cost at training time relative to learned reward models; no reward-hacking on the verifier itself (modulo problem-set leakage).
Preconditions: a domain that admits ground-truth verification.
Risks: verifier scope is limited; extending to open-ended reasoning requires model graders.
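
A rule-based reward of this kind can be sketched as two deterministic checks. This is a hedged illustration, not the papers' implementation: it assumes completions wrap reasoning in `<think>` tags and the final answer in `<answer>` tags, uses exact string match for accuracy, and sums the two rewards with equal weight. The function names and the weighting are hypothetical.

```python
import re

# Assumed output template: <think>...</think> followed by <answer>...</answer>.
TEMPLATE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference string."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Equal-weight sum of the two checks (the weighting is an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)


good = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(total_reward(good, "4"))
print(total_reward("The answer is 4.", "4"))
```

Exact string match is the weakest link here: a real math verifier would normalise equivalent forms (for example `0.5` and `1/2`) before comparing, and a code verifier would run unit tests in a sandbox instead.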

REUSABLE COMPONENT 4: R1-Distill checkpoints

What it is: open-weight Qwen 1.5B/7B/14B/32B and Llama 8B/70B fine-tuned on R1’s reasoning traces.¹⁴
Why worth reusing: practical reasoning capability at sub-frontier scale; much cheaper to serve than R1 proper.
Preconditions: standard transformer inference infrastructure.
Risks: licensing terms vary by base model (Qwen and Llama have distinct terms; R1’s MIT licence applies to the DeepSeek-side weights only).

Dependency map. GRPO trainer → enables → R1 four-stage pipeline → produces → R1 model + R1-Distill family. The trainer is the foundation; everything else is empirical work that the trainer made tractable.

Recommendation. [Analysis] Highest-value component for a new study: the GRPO trainer itself, applied to a domain with a clean verifier. The four-stage pipeline is overkill for most use cases; the rule-based reward design is task-specific and not directly portable.

[Analysis] Study types that benefit most. Math, code, theorem-proving, and structured-output generation tasks where automated scoring is reliable. Closed-domain QA with reference answers. Tool-use and function-call training where success is verifiable from execution traces.

Section 14: Known limitations and open problems

Limitations explicitly stated by the authors.

Smaller-base-model failure. DeepSeek-R1 Section 4.1: applying large-scale RL directly to a smaller base model underperforms distillation from R1 into the same base (the paper’s comparison uses Qwen-32B-Base). The reasoning patterns must already exist in latent form for RL to surface them.
Domain coverage. DeepSeek-R1 Section 4.2: R1 underperforms on non-verifiable domains (creative writing, certain multilingual contexts where the rule-based reward doesn’t apply).
Format brittleness. R1-Zero outputs are hard to read without the cold-start SFT step.

Limitations not stated by the authors ([Analysis] / [Reviewer Perspective]).

Length bias in GRPO normalisation (Liu et al. 2025, Dr. GRPO). The per-completion length normalisation in the inner average can incentivise longer or shorter completions for reasons unrelated to reasoning quality.¹¹
Reward-hacking with model graders. When the reward is itself an LLM, RL eventually finds reward-overestimation patterns. The R1 paper sidesteps this by using rule-based rewards in the reasoning RL stage; OpenAI RFT customers using model graders should expect to monitor for it.
Compute opacity. R1’s full training compute is not disclosed in the open paper.

Technical root causes.

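Length bias root cause: the $\frac{1}{\mid o_i\mid }$ normalisation in the GRPO loss divides each completion's summed token contributions by that completion's own length, so a token in a long completion carries less gradient weight than a token in a short one with the same advantage. Per Liu et al., correct completions are therefore pushed toward brevity, while incorrect completions are penalised less per token the longer they run.¹¹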
Reward-hacking root cause: any learned reward model has a finite training distribution; RL optimises against this approximation and finds out-of-distribution maxima.
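
The length-bias mechanism comes down to simple arithmetic. The sketch below assumes a completion-level advantage shared by every token and a per-completion $\frac{1}{\mid o_i\mid }$ average, and ignores clipping and the KL term; `per_token_weight` is a hypothetical helper, not a function from any library.

```python
def per_token_weight(advantage: float, length: int) -> float:
    """Gradient weight each token receives under 1/|o_i| normalisation."""
    return advantage / length


# Two incorrect completions with the same negative advantage.
short_wrong = per_token_weight(-1.0, 100)
long_wrong = per_token_weight(-1.0, 1000)

# Two correct completions with the same positive advantage.
short_right = per_token_weight(1.0, 100)
long_right = per_token_weight(1.0, 1000)

# Each token of the long wrong answer is penalised ten times less
# than a token of the short wrong answer.
print(short_wrong, long_wrong)

# Each token of the short right answer is reinforced ten times more
# than a token of the long right answer.
print(short_right, long_right)
```

Under these assumptions a wrong answer dilutes its own penalty by running longer, which is one way to read the growth in length of incorrect completions that Dr. GRPO reports.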

Open problems.

The right normalisation for grouped policy-gradient losses (Dr. GRPO, $\lambda$-GRPO, MAD-GRPO are competing proposals).
Verifier design at scale for harder reasoning (research math, open theorem-proving).
Combining rule-based and model-grader rewards without one dominating.

What a follow-up paper would need to solve to address the most critical limitation. A formal characterisation of GRPO’s bias-variance tradeoff: when does the group baseline win over a learned critic, and what is the gradient noise floor? Dr. GRPO is a step in this direction empirically; a theoretical analysis would close the loop.

How this article reads at three depths

For the curious high-school reader. GRPO is a way to make AI language models better at math by asking the same question many times, comparing the answers, and pushing the model toward the best ones. DeepSeek used it to build a reasoning model that matches what was the leading closed-source model at the time, and they gave the weights away for free. OpenAI sells the same idea as a paid service called Reinforcement Fine-Tuning.

For the working developer or ML engineer. GRPO is a drop-in replacement for PPO in post-training pipelines: same clipped surrogate, same KL leash, but no value network. The advantage is the group-normalised reward instead. The memory savings (no critic) let you fit larger groups or longer sequences in the same GPU budget. The TRL GRPOTrainer and verl’s GRPO implementation are production-quality; building a custom training loop is a few hundred lines. The decision tree: if your task admits a clean verifier (math, code, structured output), GRPO + rule-based rewards is now the default. If you only have preference pairs, DPO/IPO/KTO is the alternative. If you have a small base model, distill from a strong RL-trained teacher rather than applying RL directly.

For the ML researcher. GRPO is incrementally novel as an algorithm (group baselines have RLOO antecedents, and the clipped surrogate is PPO-native), but the empirical impact through DeepSeek-R1 is large. R1-Zero’s pure-RL reasoning emergence is the load-bearing empirical claim; the four-stage pipeline is engineering polish. The strongest objection is that R1’s “emergent reasoning” framing overstates the algorithmic contribution (the base model already had the capability). The most useful follow-up direction is a clean theoretical characterisation of the bias-variance tradeoff of group baselines vs learned critics, and a principled fix for the length-bias artefact that Dr. GRPO surfaced. The interaction with OpenAI’s RFT productisation is worth tracking: when a research lab and a commercial API converge on the same training loop within 12 months of each other, the loop has stabilised.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao, Wang et al., arXiv:2402.03300, Feb 2024). The paper introducing GRPO. Equations 1 (PPO baseline), 3 (GRPO surrogate), 4 (outcome advantage + KL estimator), 5 (process advantage). Algorithm 1 (iterative GRPO). (accessed 2026-05-20)
2. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, arXiv:2501.12948, Jan 2025; published in Nature vol. 645, pp. 633-638, 2025). Sections 2.2 (R1-Zero pure-RL), 2.3 (four-stage pipeline), 2.4 (distillation), 3 (benchmarks). Tables 4 and 5. (accessed 2026-05-20)
3. OpenAI: Learning to Reason with LLMs (o1 announcement, September 2024). The o1 announcement that demonstrated RL post-training for reasoning without disclosing the recipe. (accessed 2026-05-20)
6. Schulman et al. 2017: Proximal Policy Optimization Algorithms (arXiv:1707.06347). The PPO baseline that GRPO modifies. (accessed 2026-05-20)
7. Ouyang et al. 2022: Training Language Models to Follow Instructions with Human Feedback (InstructGPT). The canonical PPO-RLHF reference. (accessed 2026-05-20)
8. OpenAI: Reinforcement Fine-Tuning API documentation. The official RFT framing. (accessed 2026-05-20)
9. OpenAI Cookbook: Exploring Model Graders for Reinforcement Fine-Tuning. The grader-type and training-data format reference. (accessed 2026-05-20)
10. CTOL Digital: OpenAI Debuts Reinforcement Fine-Tuning (December 2024 launch coverage). Berkeley Lab + Charité Hospital rare-disease demo; o1-mini alpha; the "tens to hundreds of samples" data-efficiency framing. (accessed 2026-05-20)
11. Liu et al. 2025: Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO, arXiv:2503.20783, COLM 2025). The length-bias critique of GRPO's per-completion normalisation. (accessed 2026-05-20)
12. Hugging Face TRL: GRPOTrainer reference implementation. The production-grade open-source GRPO trainer. (accessed 2026-05-20)
13. DeepSeek-R1 official GitHub repository. Model card, distilled checkpoints, licence (MIT for DeepSeek-side weights). (accessed 2026-05-20)
14. Hugging Face: deepseek-ai/DeepSeek-R1 model card. R1 and the full R1-Distill family. (accessed 2026-05-20)