Evol-Instruct, WizardLM, WizardCoder, and the Tülu lineage: a multi-paper review

Multi-paper review of Evol-Instruct and the WizardLM/WizardCoder/Tülu open-instruction-tuning lineage — what each paper changes, and where the lineage still leaks.

19 May 2026 Updated 19 May 2026 ~48 min read

Figure 1 of WizardLM (arXiv:2304.12244) — the Evol-Instruct schematic showing initial instruction iteratively rewritten via in-depth and in-breadth operations

Figure 1 of WizardLM (arXiv:2304.12244), reproduced for editorial coverage.

Reading-register key

From the paper: claims drawn verbatim or near-verbatim from the source paper’s text, equations, tables, or figures.

Facts: dates, citations, vendor specifications verified at writer-time from primary sources.

[Analysis]: the publication’s own reasoned assessment, distinct from any claim a paper itself makes.

[Reviewer Perspective]: a critical or speculative assessment that goes beyond what the papers prove.

[External comparison]: a comparison to prior work or general knowledge outside the papers themselves.

[Reconstructed]: content faithfully reconstructed because a paper only partially disclosed it.

Section 1: Cluster scope

This review covers three closely-related instruction-tuning papers plus their immediate descendants:

WizardLM (Xu et al., ICLR 2024; arXiv preprint April 2023). Introduces Evol-Instruct, an iterative LLM-driven method for synthesising harder and more diverse instructions.¹
WizardCoder (Luo et al., ICLR 2024; arXiv preprint June 2023). Adapts Evol-Instruct to code generation and fine-tunes a 15B-parameter StarCoder backbone.²
Tülu 1 (Wang et al., NeurIPS 2023 Datasets and Benchmarks Track; arXiv preprint June 2023). Systematically evaluates a dozen open instruction datasets on a unified benchmark suite and releases a best-mix LLaMA fine-tune.³
Tülu 2 (Ivison et al., November 2023). Updates the Tülu mix on Llama-2 and adds DPO post-training, including a 70B-parameter DPO-trained release.⁴
Auto Evol-Instruct (Zeng et al., June 2024). Removes the human-designed evolution prompt from Evol-Instruct by having an LLM design the evolution strategies itself.⁵

The papers are linked. WizardLM proposes the Evol-Instruct technique; WizardCoder shows it transfers to a different domain; Tülu situates it in a wider open-instruction-tuning landscape and asks the empirical question “which dataset actually helps which capability?”; Auto Evol-Instruct closes the loop by automating the only human-authored piece of Evol-Instruct that remained.

Paper classification (all five): Training method; data-driven; LLM-based. WizardCoder and Tülu 2 additionally classify as application papers (code, multi-capability post-training).

Primary research question for the cluster. Can synthetic instruction data, generated by a teacher LLM acting on a seed set, match or beat the result of fine-tuning on human-written or human-curated instructions, and if so, what does the evolved-data design space look like?

Core technical claim. Yes, with caveats. Evol-Instruct produces instructions that outperform human-written alternatives on AlpacaEval-style preference benchmarks at fixed dataset size; the gain transfers to code; but the Tülu papers show no single dataset (synthetic or human) wins on every capability and that careful mixing matters more than any one source.

Core technical domains (with depth label):

Synthetic data generation via prompted LLM rewriting — deep.
Supervised fine-tuning (SFT) of decoder-only transformers on instruction data — moderate.
Open-source LLM benchmarking (MMLU, GSM8K, BBH, HumanEval, MBPP, AlpacaEval, MT-Bench) — moderate.
Direct Preference Optimization (DPO) post-training — surface (covered in depth in this publication’s separate DPO multi-paper review).

Reader prerequisites. High-school algebra; some prior exposure to what “fine-tuning a neural network” means is helpful but not required because Section 2.5 (Glossary) covers every prerequisite term used below. No graduate ML background assumed.

Section 2: TL;DR and executive overview

3-sentence TL;DR

If a small team wants a chat or code model that follows hard instructions well, the cheapest known route is to start from an open base model, generate synthetic training instructions by asking a stronger teacher LLM (often GPT-3.5 or GPT-4) to iteratively rewrite simpler seed instructions into harder ones, then fine-tune on those. WizardLM showed this Evol-Instruct trick beats fine-tuning on the equivalent human-written instructions; WizardCoder showed it transfers to code; the Tülu papers showed no single synthetic dataset wins everything and that mixing high-quality human and synthetic data still matters. The whole lineage is a story about how synthetic data became the default for open-model post-training, and where its limits surface.

Executive summary

Between April 2023 and mid-2024, five papers reshaped how open-weight language models are fine-tuned. WizardLM introduced Evol-Instruct: prompt a strong LLM to rewrite a simpler instruction into a harder one along five “in-depth” axes (add constraints, deepen, concretise, increase reasoning steps, complicate input) or one “in-breadth” axis (mutate to a new topic), repeat for several rounds, fine-tune on the resulting 250K-instruction corpus. WizardCoder adapted the same recipe to code and posted a 22-point HumanEval gain over its StarCoder base. The Tülu papers ran the systematic baseline study the field had been missing, comparing twelve open datasets and showing that the best Tülu mix reached roughly 87% of ChatGPT’s performance and 73% of GPT-4’s. Auto Evol-Instruct then removed the last hand-engineered piece. Anyone building a custom instruction-tuned model in 2026 inherits this lineage.

Five practitioner-relevant takeaways

Evol-Instruct is essentially free engineering leverage: a few hundred dollars of teacher-LLM inference produces 100K-plus high-quality instructions that beat human-written ones at the same volume on AlpacaEval-style judging.
Distillation of capability is bounded by the teacher. Every gain WizardLM and WizardCoder report is a gain relative to a smaller open base; absolute ceilings still track the GPT-3.5/GPT-4 teacher used to generate the data.
No single instruction dataset wins all axes. Tülu’s headline finding is empirical: SuperNI helps factual knowledge, ShareGPT helps open-ended preference, code-specific data helps code. Build the mix.
Synthetic data has a contamination smell. Several follow-up papers and audits flag that AlpacaEval and MT-Bench, which Evol-Instruct optimises for, rely on a judge LLM from the same model family as the teacher LLM that wrote the training data. This is a structural conflict the original papers do not flag.
Auto Evol-Instruct closes the manual-prompt loop but does not remove the teacher-LLM dependency; the field is still distilling, just from a different prompt-design surface.

Pipeline overview (training-time vs inference-time)

Training-time for Evol-Instruct family. Step 1: seed instruction set (Alpaca’s 52K self-instructed examples for WizardLM; Code-Alpaca’s 20K for WizardCoder). Step 2: for $M$ rounds, prompt a teacher LLM (ChatGPT/GPT-3.5 or GPT-4) with one of the evolution templates to generate a harder variant of each instruction. Step 3: have the teacher LLM answer each evolved instruction (synthetic responses). Step 4: filter out failed evolutions via the elimination rules. Step 5: supervised-fine-tune the open base (LLaMA 7B for WizardLM; StarCoder 15B for WizardCoder) on the cumulative pool. Tülu adds a step 6: blend in human-curated datasets (FLAN, CoT, Dolly, OpenAssistant) to broaden capability coverage. Tülu 2 adds a step 7: DPO on a preference dataset.

Inference-time. Identical to standard open-weight LLM inference. No special runtime support is needed; the lineage is purely a training-data + SFT-recipe story.

Section 2.5: Glossary

Term	Plain-English explanation	First appears in
Instruction tuning / SFT	Supervised fine-tuning of a pretrained language model on (instruction, response) pairs so the model learns to follow user requests rather than just continue text.	Section 1
Seed instructions	A starting pool of instructions (here, Alpaca’s 52K or Code-Alpaca’s 20K) that the evolution process rewrites into harder ones.	Section 2
Teacher LLM	A stronger language model (GPT-3.5 or GPT-4 in these papers) used to generate training data for a weaker open-weight student model.	Section 2
Evol-Instruct	An iterative prompt-based method that rewrites an instruction into a harder version along one of six axes (five in-depth, one in-breadth).	Section 2
In-depth evolution	Five rewriting axes that make an instruction harder while keeping its topic: add constraints, deepen, concretise, increase reasoning steps, complicate input.	Section 5
In-breadth evolution	One rewriting axis that mutates an instruction to a new topic of comparable difficulty, increasing diversity rather than depth.	Section 5
Elimination evolution	Filtering step that throws away evolved instructions the teacher LLM cannot answer well, copies of the prompt, or non-informative outputs.	Section 5
Pass@1	On a code benchmark, the fraction of problems the model solves correctly on its first sampled completion.	Section 9
HumanEval / MBPP / DS-1000	Three widely-used code generation benchmarks. HumanEval is OpenAI’s 164-problem Python suite; MBPP is Google’s 974-problem Python set; DS-1000 is data-science-specific.	Section 9
AlpacaEval	A judging-LLM-based preference benchmark where GPT-4 compares model outputs against text-davinci-003 responses on 805 prompts.	Section 9
MMLU / GSM8K / BBH / TyDiQA	Standard open-source benchmarks for, respectively, multi-task knowledge, grade-school math, hard reasoning, and multilingual QA.	Section 9
DPO	Direct Preference Optimization — a loss that fine-tunes on preference pairs without a separate reward model; used in Tülu 2.	Section 1
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what a paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the papers prove.	Section 11 + 12
`[Reconstructed]` / `[External comparison]`	Content faithfully reconstructed from partial disclosure / a comparison to prior work outside the papers.	Where used
“From the paper:” prefix	Content directly supported by a paper’s text, equations, tables, or figures.	Throughout

Section 3: Problem formalisation

Notation

Symbol	Type	Meaning	First appears in
$\mathcal{D}_0$	set of strings	Seed instruction set (Alpaca’s 52K for WizardLM; Code-Alpaca’s 20K for WizardCoder)	Section 5
$I$	string	A single instruction in natural language	Section 5
$R$	string	A response generated by the teacher LLM to instruction $I$	Section 5
$\mathcal{T}$	function	The teacher LLM, modelled as a string-to-string map	Section 5
$p_{\text{evol}}(I)$	string	Evolution prompt template applied to instruction $I$	Section 5
$M$	integer	Number of evolution rounds (4 in WizardLM; 3 in WizardCoder)	Section 5
$\mathcal{D}_m$	set of strings	Instruction pool after evolution round $m$	Section 5
$\pi_\theta$	function	Student LLM with parameters $\theta$ being fine-tuned	Section 6
$\mathcal{L}_{\text{SFT}}$	scalar	Standard cross-entropy SFT loss	Section 6
$\beta$	scalar	DPO temperature (Tülu 2 only)	Section 6

Formal problem statement

The instruction-tuning post-training problem these papers tackle is: given a base pretrained language model $\pi_{\theta_0}$ with initial weights $\theta_0$ and a budget of $N$ (instruction, response) pairs, choose the dataset $\mathcal{D} = \{(I_i, R_i)\}_{i=1}^N$ that, after standard supervised fine-tuning of $\pi_{\theta_0}$ on $\mathcal{D}$, maximises the resulting model’s performance on a held-out evaluation suite $\mathcal{E}$. The papers differ in how they construct $\mathcal{D}$.
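One compact way to write this selection problem is sketched below. The symbols $\mathrm{Score}_{\mathcal{E}}$ (aggregate benchmark score on $\mathcal{E}$) and $\mathrm{SFT}(\cdot,\cdot)$ (the fine-tuning procedure viewed as a map from a base model and a dataset to a tuned model) are this review's own shorthand, not notation taken from any of the papers:

$\mathcal{D}^{\star} = \arg\max_{\mathcal{D} \,:\, |\mathcal{D}| = N} \; \mathrm{Score}_{\mathcal{E}}\bigl(\mathrm{SFT}(\pi_{\theta_0}, \mathcal{D})\bigr), \qquad \mathrm{SFT}(\pi_{\theta_0}, \mathcal{D}) = \pi_{\hat{\theta}}, \quad \hat{\theta} \approx \arg\min_{\theta} \mathcal{L}_{\text{SFT}}(\theta)$

Nobody solves this outer maximisation exactly. In practice each paper compares a small number of candidate datasets, so the results are best read as points sampled from the design space rather than as an optimum.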

WizardLM and WizardCoder treat $\mathcal{D}$ as a function of an evolution policy $\pi_{\text{evol}}$ , a teacher LLM $\mathcal{T}$ , and a seed set $\mathcal{D}_0$ . Tülu treats $\mathcal{D}$ as a mixture over publicly-available instruction sources and asks which mixture coefficients win.

Assumptions

The teacher LLM is competent on the evolved instructions. WizardLM/WizardCoder assume $\mathcal{T}$ (GPT-3.5 or GPT-4) can both rewrite an instruction $I \to I'$ and answer $I'$ correctly. The elimination step (Section 5) discards cases where this fails. [Analysis] Potentially strong assumption — a teacher that cannot solve a problem cannot teach it; this puts a hard ceiling on the student.
GPT-4 judging is a faithful proxy for instruction-following quality. WizardLM and WizardCoder both report GPT-4-judged AlpacaEval-style win rates as primary evidence. [Analysis] Potentially strong assumption — Tülu’s paper itself flags that AlpacaEval and similar judges correlate imperfectly with human preference and overweight surface fluency.³
Pass@1 on HumanEval extrapolates to general code competence (WizardCoder).²
No test-set leakage. None of the papers explicitly run a contamination audit against MMLU/GSM8K/HumanEval test prompts in the evolved data. [Analysis] Potentially fragile: synthetic data generated from a teacher with broad training exposure can leak benchmark content; this risk is not zero.

Why the problem is hard

The combinatorial space of natural-language instructions is unbounded. Machine-generated SFT datasets like Stanford Alpaca’s 52K cost a few hundred dollars of OpenAI inference to generate, but their instructions skew easy; having human curators write harder instructions at the scale of millions needed for capability-saturating training would cost millions of dollars. The Evol-Instruct lineage answers the question “can we automate the harder-instruction-generation step too?”

Causal vs correlational

These are observational training-data construction papers. None claim a causal identification of “which evolution operator caused which capability gain”; the closest the lineage comes is WizardLM’s ablation removing in-breadth evolution, which the paper reports degrades diversity but with the caveat of small sample sizes.

Section 4: Motivation and gap

The motivation in early 2023 was concrete. Stanford Alpaca (March 2023) had shown that 52K instructions generated by Self-Instruct could fine-tune LLaMA 7B into a usable assistant for about $500 of GPT-3.5 inference. Vicuna (March 2023) had pushed this further with 70K ShareGPT conversations. Both used relatively easy instructions; both saturated on hard tasks (multi-step reasoning, code generation, complex constraint-following) well below ChatGPT’s level. The gap WizardLM identified: instruction difficulty was the bottleneck, not instruction count.

[External comparison] In the broader landscape, the lineage sits between Self-Instruct (Wang et al. 2022, which originated automated instruction generation) and the FLAN line (Wei et al. 2021, Chung et al. 2022, which used human-written templated instructions). Evol-Instruct is what you get when you ask: “what if the generator is asked to also crank the difficulty knob?”

Practical stakes. Open-weight LLM post-training in 2023–2024 ran on academic and small-team budgets. A method that turned $5,000 of GPT-4 inference into a 70K-instruction dataset that beat OpenAssistant’s human-curated 161K was an order-of-magnitude leverage win. The Tülu papers’ framing — open resources matching 87% of ChatGPT — was the field’s argument that closed-LLM superiority was less about secret architecture than about post-training data.

Section 5: Method overview

WizardLM paper teaser card showing the in-depth and in-breadth evolution operators arrayed around a seed instruction

WizardLM (arXiv:2304.12244) paper card, reproduced for editorial coverage. The in-depth axes (add constraints, deepening, concretising, increase reasoning steps, complicate input) and in-breadth axis (mutate to new topic) define the six evolution operators discussed below.

5.1 Evol-Instruct (WizardLM)

Plain-English intuition. Take any instruction. Ask a strong LLM to make it harder along one specific axis (e.g., “add a new constraint”). Use the resulting harder instruction as the new training example. Repeat for several rounds. The dataset grows in difficulty, not just size.

Mechanism, step by step. From the paper:¹

Seed pool. Start with $\mathcal{D}_0$ = Alpaca’s 52K instructions, each a relatively simple Self-Instruct-generated request.
Sample an evolution operator uniformly at random from six choices: five in-depth (add constraints, deepening, concretizing, increase reasoning steps, complicate input) and one in-breadth (mutate to new topic).
Construct the evolution prompt. A fixed template, e.g., for “add constraints”: “I want you to act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version. … Please add one more constraints/requirements into #Given Prompt#. … #Given Prompt#: $I$ ”.
Query the teacher LLM with the evolution prompt to produce evolved instruction $I'$ .
Generate the response $R' = \mathcal{T}(I')$ by querying the teacher again with $I'$ alone.
Elimination check. Discard $(I', R')$ if any of the elimination criteria fire:
- No information gain over $I$ (teacher LLM judges the pair semantically identical).
- Teacher refuses or struggles to respond (presence of “sorry” with response length under 80 words).
- Response contains only punctuation or stop words.
- $I'$ literally copies a phrase from the evolution prompt template (a known failure mode where the teacher echoes the template).
Accumulate. Add surviving $(I', R')$ to $\mathcal{D}_{m+1}$ .
Repeat for $M = 4$ rounds. Final corpus reaches roughly 250K instructions across the original 52K and the four evolution rounds.
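The four elimination rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' code: the stop-word list and the template-echo fragments are hypothetical placeholders, and the "no information gain" check is passed in as a callable (`same_meaning`) because in the paper it is a teacher-LLM judgement rather than a string comparison.

```python
import string
from typing import Callable

# Hypothetical stop-word list; the exact list used in the paper is not given in this review.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "it"}

# Hypothetical fragments of the rewriting template that an echoing teacher might copy.
TEMPLATE_ECHOES = ("#given prompt#", "#rewritten prompt#", "given prompt", "rewritten prompt")


def accept(
    instruction: str,
    evolved: str,
    response: str,
    same_meaning: Callable[[str, str], bool],
) -> bool:
    """Return True if the evolved pair survives all four elimination rules."""
    # Rule 1: no information gain (delegated to a judge, e.g. a teacher-LLM call).
    if same_meaning(instruction, evolved):
        return False
    # Rule 2: short apologetic refusal ("sorry" present and under 80 words).
    words = response.split()
    if "sorry" in response.lower() and len(words) < 80:
        return False
    # Rule 3: response holds nothing but punctuation and stop words.
    content = [w.strip(string.punctuation).lower() for w in words]
    if not any(w and w not in STOP_WORDS for w in content):
        return False
    # Rule 4: evolved instruction echoes a phrase from the rewriting template.
    lowered = evolved.lower()
    if any(fragment in lowered for fragment in TEMPLATE_ECHOES):
        return False
    return True
```

Passing the judge in as a callable keeps the cheap string heuristics testable without any API access; a real pipeline would plug an LLM call into `same_meaning`.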

Connection to full pipeline. The final 250K corpus is used as a standard SFT dataset to fine-tune LLaMA 7B; nothing else about training differs from Alpaca.

Design rationale + tradeoffs. From the paper: each evolution operator targets a different difficulty axis. In-depth axes deepen reasoning; the in-breadth axis preserves topic diversity that pure depth-deepening would collapse. The elimination check exists because raw teacher rewrites have a high failure rate — without it the corpus becomes noisier than the seed.

What breaks if removed. The paper’s ablation removes in-breadth evolution and reports a drop in topical diversity. Removing the elimination step is not ablated explicitly, but [Analysis] removing it would re-introduce the templated-echo failure mode that the elimination criterion explicitly targets.

Classification: [New] in 2023 as an instruction-evolution policy. The underlying primitive of LLM-as-rewriter is [Adopted] from prior prompt-engineering work.

Schematic illustration of one round of Evol-Instruct turning a simple seed instruction into a harder variant

Conceptual schematic of a single Evol-Instruct round, reproduced from WizardLM (arXiv:2304.12244) for editorial coverage.

5.2 Code Evol-Instruct (WizardCoder)

The same six-axis template is collapsed into five code-specific axes. From the paper:²

Add new constraints and requirements (~10 extra words).
Replace a common requirement with a less common, more specific one.
Add more reasoning steps if the problem is too easy.
Provide a piece of erroneous code as a reference (debugging variant).
Propose higher time or space complexity (used sparingly).

The unified prompt template is “Please increase the difficulty of the given programming test question a bit. You can increase the difficulty using, but not limited to, the following methods: {method} {question}”. The in-breadth axis is removed; deepening and complicate-input are merged into the constraint axis.
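Filling the unified template is a plain string substitution, sketched below. The template text is the one quoted above; the method strings are paraphrases of the five axes listed in this section rather than the paper's exact wording, and `build_code_evol_prompt` is a hypothetical helper name.

```python
import random

# Template text as quoted in this review; {method} and {question} are its two slots.
CODE_EVOL_TEMPLATE = (
    "Please increase the difficulty of the given programming test question a bit. "
    "You can increase the difficulty using, but not limited to, the following methods: "
    "{method} {question}"
)

# Paraphrases of the five code-specific axes (assumed wording, not verbatim from the paper).
CODE_EVOL_METHODS = [
    "Add new constraints and requirements to the original problem, adding about 10 words.",
    "Replace a commonly used requirement with a less common and more specific one.",
    "If the problem can be solved with only a few logical steps, add more reasoning steps.",
    "Provide a piece of erroneous code as a reference to increase misdirection.",
    "Propose higher time or space complexity requirements, but do so sparingly.",
]


def build_code_evol_prompt(question: str, rng: random.Random) -> str:
    """Pick one evolution method at random and instantiate the template."""
    method = rng.choice(CODE_EVOL_METHODS)
    return CODE_EVOL_TEMPLATE.format(method=method, question=question)


prompt = build_code_evol_prompt("Write a function that reverses a string.", random.Random(0))
```

Only the template is parsed by `str.format`, so braces inside `question` (common in code problems) are substituted verbatim and do not raise.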

Seed. Code-Alpaca’s 20K Python instructions.

Evolution depth. 3 rounds, producing 38K → 58K → 78K cumulative samples (paper Figure 2). The paper stops at 3 rounds because HumanEval pass@1 begins to decline at round 4 — a self-limiting signal that more evolution past a point introduces more noise than capability.

Backbone. StarCoder 15B (BigCode’s open code LLM, May 2023). Fine-tuned with batch size 512, sequence length 2048, learning rate 2e-5, cosine schedule, 30 warmup steps, 200 total fine-tuning steps, fp16 mixed precision.²
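These hyperparameters imply a fairly small training budget, which a few lines of arithmetic make concrete. The sketch assumes the final 78K-sample corpus and one training sequence per sample (no sequence packing); both are assumptions made here for illustration, and `FineTuneConfig` is a hypothetical container rather than any library's class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values as reported above for the WizardCoder fine-tune."""
    batch_size: int = 512
    max_seq_len: int = 2048
    learning_rate: float = 2e-5
    lr_schedule: str = "cosine"
    warmup_steps: int = 30
    total_steps: int = 200
    precision: str = "fp16"


cfg = FineTuneConfig()
corpus_size = 78_000  # assumed: cumulative samples after round 3

sequences_seen = cfg.batch_size * cfg.total_steps       # sequences consumed over the run
approx_epochs = sequences_seen / corpus_size            # passes over the corpus
warmup_fraction = cfg.warmup_steps / cfg.total_steps    # share of steps spent warming up
max_tokens_seen = sequences_seen * cfg.max_seq_len      # upper bound; real sequences are shorter
```

Under these assumptions the run consumes 102,400 sequences, a little over one pass through the corpus, with 15% of the steps spent in warmup.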

Classification: [Adapted] from WizardLM Evol-Instruct, domain-specialised to code.

WizardCoder paper card showing the code-specific evolution operator set

WizardCoder (arXiv:2306.08568) paper card, reproduced for editorial coverage. Five code-specific evolution operators replace the six-axis general set.

5.3 The Tülu 1 mix (Wang et al.)

Tülu 1 does not propose a new instruction-generation method. It evaluates twelve existing instruction datasets (SuperNI, CoT, Flan V2, Dolly, OpenAssistant, Self-Instruct, Unnatural Instructions, Alpaca, Code-Alpaca, GPT4-Alpaca, Baize, ShareGPT) under a single fine-tuning protocol on LLaMA 7B/13B/30B/65B and benchmarks each on MMLU, GSM8K, BBH, TyDiQA, Codex-Eval (the paper’s name for its HumanEval-based code evaluation), AlpacaEval, and ToxiGen. The “Tülu” mix combines Flan V2 + CoT + Dolly + OpenAssistant + GPT4-Alpaca + Code-Alpaca + ShareGPT.³

Design rationale. From the paper: each dataset shows a strikingly different capability signature (Section 9 results). The mix is chosen to cover all axes, not maximise any one.
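The idea of a dataset mix with per-source coefficients can be sketched as weighted sampling. This is an illustration of the concept only: the review does not describe how the Tülu authors actually assemble their mix (it may well be plain concatenation with subsampling), and the source names, weights, and `sample_mixture` helper below are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (instruction, response)


def sample_mixture(
    sources: Dict[str, List[Example]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, Example]]:
    """Draw n examples, choosing the source of each draw in proportion to its weight."""
    rng = random.Random(seed)
    names = sorted(sources)                      # fixed order for reproducibility
    probs = [weights[name] for name in names]
    picked = rng.choices(names, weights=probs, k=n)
    # Sample with replacement inside each source; small sources get upsampled.
    return [(name, rng.choice(sources[name])) for name in picked]


toy_sources = {
    "flan_v2": [("Translate 'cat' to French.", "chat")],
    "sharegpt": [("Plan a weekend trip.", "Day 1: ...")],
    "code_alpaca": [("Reverse a list in Python.", "xs[::-1]")],
}
toy_weights = {"flan_v2": 0.5, "sharegpt": 0.3, "code_alpaca": 0.2}
mix = sample_mixture(toy_sources, toy_weights, n=10)
```

Treating the weights as explicit parameters is what turns "which mixture wins" into an experiment that can be rerun with different coefficients.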

Classification: [New] as a systematic comparison; the constituent datasets are all [Adopted] from prior work.

5.4 Tülu 2

Tülu 2 (November 2023) is an incremental refresh: same protocol, new Llama-2 base, updated mix (Tülu-V2-mix), and a DPO post-training stage on UltraFeedback preferences. The DPO variant at 70B is reported as the largest DPO-trained model at submission time.⁴ The DPO mechanics are covered in this publication’s separate DPO multi-paper review.

Classification: [Adapted] from Tülu 1 with [Adopted] DPO loss.

5.5 Auto Evol-Instruct

Auto Evol-Instruct (Zeng et al., June 2024) removes the hand-written evolution prompt. Instead, an LLM is asked to design the evolution operator itself given a seed instruction sample, and then iteratively refines the operator based on issues observed during evolution. The paper reports gains over WizardLM’s manual operator on MT-Bench, AlpacaEval, GSM8K, and HumanEval.⁵

Classification: [New] at the meta-level (the operator is learned, not specified). The base mechanism is [Adapted] from Evol-Instruct.

Section 6: Mathematical contributions

The Evol-Instruct lineage is largely procedural; the formal mathematics is light because the underlying training loss is standard. The four math objects worth fully expanding are: the SFT objective, the evolution-as-Markov-chain framing, the elimination filter as a Bayesian rejection step, and the Tülu 2 DPO loss.

MATH ENTRY [1]: Supervised fine-tuning loss

Source: implicit in all five papers; the standard next-token cross-entropy SFT loss applied to instruction-response pairs.
What it is: a number that measures how surprised the student model is by the correct response to each training instruction; the optimiser minimises this number.
Formal definition:

$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(I, R) \sim \mathcal{D}} \left[ \sum_{t=1}^{|R|} \log \pi_\theta(R_t \mid I, R_{<t}) \right]$

Each term explained, with dimensional analysis:
- $\theta \in \mathbb{R}^P$ — student model parameters, $P \approx 7 \times 10^9$ for LLaMA 7B.
- $(I, R) \sim \mathcal{D}$ — sample an (instruction, response) pair uniformly from the training set $\mathcal{D}$ of size $N \approx 250{,}000$ .
- $\mid R\mid$ — token-length of the response, typically 50–500 tokens.
- $R_t$ — the $t$ -th token of $R$ ; an integer in $\{1, \dots, V\}$ where $V \approx 32{,}000$ is the vocabulary.
- $\pi_\theta(R_t \mid I, R_{<t})$ — the student’s predicted probability for that specific next token. A scalar in $(0, 1]$ .
- The outer expectation averages the per-example log-loss; in practice it is a mini-batch mean (batch 128 in WizardLM, 512 in WizardCoder).
Worked numerical example. Suppose a tiny vocabulary $V = 4$ and response $R = [3, 1]$ (length 2) for some instruction $I$. The student outputs softmax probabilities $\pi_\theta(\cdot \mid I) = [0.1, 0.2, 0.3, 0.4]$ at position 1, so $\pi_\theta(R_1 = 3 \mid I) = 0.3$; then at position 2, $\pi_\theta(\cdot \mid I, R_1) = [0.5, 0.2, 0.2, 0.1]$, so $\pi_\theta(R_2 = 1 \mid I, R_1 = 3) = 0.5$. The example loss is $-(\log 0.3 + \log 0.5) = -(-1.204 - 0.693) = 1.897$ (natural logarithms). The gradient update pushes $\theta$ so these two probabilities both rise.
Role: this is the loss that turns Evol-Instruct data into a fine-tuned model. The Evol-Instruct contribution lies entirely in $\mathcal{D}$ , not in $\mathcal{L}$ .
Edge cases: very long responses dominate the loss numerically; both WizardLM and Tülu mask the loss to skip the instruction tokens (loss on response tokens only). Padding tokens are excluded.
Novelty: [Adopted] — standard since GPT-2-era language modelling.
Transferability: [Analysis] universal; any SFT recipe uses some form of this loss.
Why it matters: the loss is the constant that lets the papers’ contributions be a pure data story.
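The worked example in this entry can be reproduced in a few lines. The sketch uses plain Python rather than a deep-learning framework, follows the text's 1-indexed token convention, and `sft_example_loss` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def sft_example_loss(step_probs: Sequence[Sequence[float]], response: List[int]) -> float:
    """Sum of negative log-probabilities of the target tokens.

    step_probs[t] is the model's distribution at position t+1;
    tokens are 1-indexed, matching the worked example in the text.
    """
    return -sum(math.log(probs[token - 1]) for probs, token in zip(step_probs, response))


step_probs = [
    [0.1, 0.2, 0.3, 0.4],  # distribution at position 1, given I
    [0.5, 0.2, 0.2, 0.1],  # distribution at position 2, given I and R_1
]
response = [3, 1]
loss = sft_example_loss(step_probs, response)  # about 1.897
```

A real training loop would compute the same quantity from logits with a numerically stable log-softmax and average it over a mini-batch.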

MATH ENTRY [2]: Evolution as a string-to-distribution map

Source: implicit framing of WizardLM and Auto Evol-Instruct.
What it is: each evolution step is a probabilistic rewriter that maps one instruction string to a distribution over possible harder instructions.
Formal definition:

$\Pr(I' \mid I) = \sum_{k=1}^{6} \Pr(k) \cdot \mathcal{T}\bigl(I' \mid p_{\text{evol}}^{(k)}(I)\bigr)$

where $k$ indexes the six evolution operators (5 in-depth + 1 in-breadth in WizardLM), $\Pr(k) = 1/6$ uniform, and $\mathcal{T}(\cdot \mid \cdot)$ is the teacher LLM’s conditional distribution over completions.

Each term explained:
- $I$ — current instruction string; some sequence of natural-language tokens.
- $k$ — operator index; an integer in $\{1, \dots, 6\}$ .
- $p_{\text{evol}}^{(k)}(I)$ — the prompt template for operator $k$ instantiated on $I$ , a longer string.
- $\mathcal{T}(I' \mid \cdot)$ — the teacher LLM’s autoregressive distribution over $I'$ , factored as $\prod_t \mathcal{T}(I'_t \mid I'_{<t}, \text{prompt})$ .
- The full Evol-Instruct corpus after $M$ rounds is the marginal $\Pr(I^{(M)} \mid I^{(0)}) = \sum_{I^{(1)}, \dots, I^{(M-1)}} \prod_{m=0}^{M-1} \Pr(I^{(m+1)} \mid I^{(m)})$ , an $M$ -step Markov chain over instruction space.
Worked numerical example. Suppose three operators (simplified from six), uniform selection. Start: $I^{(0)} =$ “Write a Python function to add two numbers.” Operator 1 (add constraints) → $I^{(1)}_a =$ “Write a Python function to add two numbers without using the + operator.” (probability mass concentrated by teacher LLM). Operator 2 (increase reasoning steps) → $I^{(1)}_b =$ “Write a Python function that adds two numbers, with intermediate variable names matching their decimal place.” Operator 3 (complicate input) → $I^{(1)}_c =$ “Write a function that adds two numbers represented as string-encoded base-3 integers.” If the teacher’s distribution at each step is reasonably concentrated, after $M = 4$ rounds the marginal $\Pr(I^{(4)} \mid I^{(0)})$ spreads across a long tail of hard-instruction variants of the original add-two-numbers seed.
Role: this framing is what makes the elimination step (next entry) interpretable as a rejection sampler.
Edge cases: operators can compose badly. For example, “complicate input” applied after “concretise” can produce instructions the teacher itself cannot answer, which the elimination step then drops.
Novelty: [Analysis] the Markov-chain framing is the publication’s reconstruction; WizardLM does not state it formally, but it follows directly from how the procedure is implemented.
Transferability: any iterative LLM-driven data-augmentation method shares this structure.
Why it matters: it makes precise what “evolving” means — sampling from an $M$ -step random walk over instruction space whose transition kernel is the teacher LLM.
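The operator-selection half of this transition kernel is simple enough to enumerate exactly. The sketch below models only the uniform choice of operator at each round, not the teacher's distribution over rewrites; the operator identifiers are hypothetical labels for the six axes.

```python
from fractions import Fraction
from itertools import product
from typing import Tuple

OPERATORS = [
    "add_constraints",
    "deepening",
    "concretising",
    "increase_reasoning",
    "complicate_input",
    "in_breadth",
]


def path_probability(path: Tuple[str, ...]) -> Fraction:
    """Probability of one specific operator sequence under uniform selection."""
    return Fraction(1, len(OPERATORS)) ** len(path)


def prob_at_least_one(operator: str, rounds: int) -> Fraction:
    """Probability that a chain of the given length uses `operator` at least once."""
    paths = product(OPERATORS, repeat=rounds)
    return sum(path_probability(p) for p in paths if operator in p)


n_paths = len(OPERATORS) ** 4                    # distinct operator sequences over 4 rounds
p_breadth = prob_at_least_one("in_breadth", 4)   # equals 1 - (5/6)**4
```

With six operators and four rounds there are 1,296 operator sequences, and roughly half of all chains (671/1296) include at least one in-breadth step, which gives a rough sense of how often a chain can leave its seed topic.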

MATH ENTRY [3]: Elimination evolution as rejection sampling

Source: WizardLM Section 3.4.¹
What it is: the filter step that throws away evolved (instruction, response) pairs failing a quality predicate; viewable as rejecting samples from the raw evolution distribution to obtain a higher-quality posterior.
Formal definition:

$\Pr_{\text{kept}}(I', R' \mid I) \propto \Pr(I' \mid I) \cdot \mathcal{T}(R' \mid I') \cdot \mathbb{1}\bigl[\text{accept}(I, I', R')\bigr]$

where the indicator is 1 iff the pair passes all four criteria.

Each term explained:
- $\mathbb{1}[\cdot]$ — indicator, 0 or 1.
- accept $(I, I', R')$ is the conjunction of four checks: (a) $I'$ is informationally distinct from $I$ (judged by a separate ChatGPT call); (b) $R'$ is not a refusal under 80 words containing “sorry”; (c) $R'$ contains more than just punctuation and stop words; (d) $I'$ does not literally copy the evolution-prompt template.
Worked numerical example. Suppose 100 evolved candidates per evolution round, and assume the elimination step removes roughly 30% (the paper does not give a single number; this is the publication’s reading of the reported corpus shrinkage). Then 70 survive and the acceptance rate is $\approx 0.7$. If only survivors were evolved further, the number of seed chains still alive after $M = 4$ rounds would be $52{,}000 \times 0.7^4 \approx 52{,}000 \times 0.24 = 12{,}480$. The cumulative pool does not shrink, however, because each round's survivors are added to it alongside the seeds and earlier rounds; the paper reports a final corpus of roughly 250K.
Role: keeps the corpus quality from degrading as $M$ grows; without it the noise compounds.
Edge cases: the “no information gain” check is itself an LLM judgement and inherits the teacher’s biases; the “sorry under 80 words” rule is a heuristic.
Novelty: [New] in the specific four-criterion form.
Transferability: [Analysis] the rejection-sampling framing transfers; the specific predicate is domain-bound.
Why it matters: this is the unsung hero of Evol-Instruct. Without quality filtering, iterative LLM rewriting decays into a noise machine.
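The survival arithmetic in this entry can be checked with a short sketch. It assumes a constant acceptance rate and that rejected candidates are dropped rather than retried; both are simplifications made here for illustration, and `chains_alive` and `cumulative_pool` are hypothetical helper names.

```python
def chains_alive(seed_size: int, accept_rate: float, rounds: int) -> float:
    """Expected number of seed chains that pass the filter in every round."""
    return seed_size * accept_rate ** rounds


def cumulative_pool(seed_size: int, accept_rate: float, rounds: int) -> float:
    """Seeds plus the survivors of every round, when only survivors are evolved again."""
    return seed_size * sum(accept_rate ** m for m in range(rounds + 1))


alive = chains_alive(52_000, 0.7, 4)      # about 12,485 (12,480 if 0.7**4 is rounded to 0.24)
pool = cumulative_pool(52_000, 0.7, 4)    # about 144,201
```

Under these assumptions the cumulative pool comes to about 144K, short of the roughly 250K corpus the paper reports. The 30% rejection rate is therefore best read as an illustrative figure rather than one that reproduces the published corpus size.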

MATH ENTRY [4]: DPO loss (Tülu 2 only)

Source: Tülu 2 Section 3, citing Rafailov et al. 2023.⁴
What it is: the loss used in Tülu 2’s preference-tuning stage, defined directly on (preferred, dispreferred) response pairs.
Formal definition:

$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(I, R_w, R_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(R_w \mid I)}{\pi_{\text{ref}}(R_w \mid I)} - \beta \log \frac{\pi_\theta(R_l \mid I)}{\pi_{\text{ref}}(R_l \mid I)} \right) \right]$

Each term explained:
- $R_w$ — preferred (“winning”) response; $R_l$ — dispreferred (“losing”) response. Both are strings.
- $\pi_\theta$ — student policy being optimised.
- $\pi_{\text{ref}}$ — reference policy (the SFT checkpoint before DPO).
- $\beta$ — scalar temperature, typically 0.1; controls how aggressively the policy can deviate from $\pi_{\text{ref}}$ .
- $\sigma(z) = 1/(1 + e^{-z})$ — sigmoid.
Worked numerical example. Suppose for one preference pair the student computes $\log \pi_\theta(R_w \mid I) - \log \pi_{\text{ref}}(R_w \mid I) = 0.4$ and the analogous loser quantity = -0.2. With $\beta = 0.1$ : the argument of $\sigma$ is $0.1 \times (0.4 - (-0.2)) = 0.06$ . Then $\sigma(0.06) \approx 0.515$ , and the loss contribution is $-\log 0.515 \approx 0.664$ . The gradient update raises $\pi_\theta(R_w \mid I)$ and lowers $\pi_\theta(R_l \mid I)$ .
Role: post-SFT preference fine-tuning in Tülu 2; treated as additive over the SFT stage.
Edge cases: numerical stability can suffer when $\pi_\theta$ drifts far from $\pi_{\text{ref}}$; some implementations clip the log-ratio.
Novelty: [Adopted] from Rafailov et al. 2023.
Transferability: covered in detail in this publication’s separate DPO multi-paper review.
Why it matters: Tülu 2’s 70B DPO release demonstrated DPO scales to large open models, which mattered for the field’s adoption.
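The worked example in this entry can be verified numerically. The sketch takes the two policy-versus-reference log-ratios as inputs instead of computing them from a model, and `dpo_pair_loss` is a hypothetical helper name rather than a function from any training library.

```python
import math


def dpo_pair_loss(logratio_w: float, logratio_l: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    logratio_w = log pi_theta(R_w | I) - log pi_ref(R_w | I)
    logratio_l = log pi_theta(R_l | I) - log pi_ref(R_l | I)
    """
    margin = beta * (logratio_w - logratio_l)
    # -log(sigmoid(margin)) written as a stable softplus of -margin,
    # so large positive or negative margins do not overflow exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


loss = dpo_pair_loss(0.4, -0.2, beta=0.1)  # margin 0.06, loss about 0.664
```

At a margin of zero the loss is $\log 2 \approx 0.693$, so the example pair, with its small positive margin, sits only slightly below the "no preference learned yet" value.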

Section 7: Algorithmic contributions

ALGORITHM ENTRY [1]: Evol-Instruct dataset construction (WizardLM, the headline algorithm)

Source: WizardLM Section 3, Algorithm 1.¹
Purpose: turn a seed of 52K simple instructions into a 250K-instruction training corpus of varied difficulty.
Inputs:
- Seed pool $\mathcal{D}_0$ (set of strings, $\mid \mathcal{D}_0\mid = 52{,}000$ ).
- Teacher LLM $\mathcal{T}$ (callable string-to-string).
- Six evolution prompt templates $\{p_{\text{evol}}^{(k)}\}_{k=1}^6$ (fixed strings with one $I$ slot).
- Number of rounds $M = 4$ .
Outputs:
- Final corpus $\mathcal{D}_M$ (set of (instruction, response) pairs, $\mid \mathcal{D}_M\mid \approx 250{,}000$ ).
Pseudocode (reconstructed faithfully from the paper’s prose):

function EvolInstruct(D0, T, {p_evol[1..6]}, M):
    D ← D0                                       # cumulative pool
    Dcurrent ← D0                                # this round's input
    for m = 1 to M:
        Dnext ← {}
        for each I in Dcurrent:
            k ← uniform({1,…,6})                 # pick evolution operator
            prompt ← p_evol[k] applied to I
            I' ← T(prompt)                       # teacher rewrites
            R' ← T(I')                           # teacher answers
            if accept(I, I', R'):                # 4-rule elimination
                Dnext ← Dnext ∪ {(I', R')}
        D ← D ∪ Dnext
        Dcurrent ← {I' for (I', R') in Dnext}
    return D

Hand-traced example on a minimal input. Take $\mathcal{D}_0 = \{I_1\}$ where $I_1 =$ “Write a Python function to add two numbers.” with $M = 2$ rounds.
- Round $m = 1$ . Pick $k = 1$ (“add constraints”). $p_{\text{evol}}^{(1)}(I_1) \to I'_1 =$ “Write a Python function to add two numbers without using the + operator.” Teacher answers $R'_1 =$ a bitwise-add implementation. accept passes (informationally distinct, response > 80 words, real code, no template echo). $\mathcal{D}_1 = \{I_1, (I'_1, R'_1)\}$ , current = $\{I'_1\}$ .
- Round $m = 2$ . Pick $k = 3$ (“increase reasoning steps”). $p_{\text{evol}}^{(3)}(I'_1) \to I''_1 =$ “Write a Python function that adds two numbers without using +, and explain at each step how the carry bit propagates.” Teacher answers $R''_1$ . accept passes. $\mathcal{D}_2 = \{I_1, (I'_1, R'_1), (I''_1, R''_1)\}$ .
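- Final corpus size: 3 examples (the seed plus two evolved pairs). On the real seed, $\mid \mathcal{D}_0\mid = 52{,}000$ with $M = 4$ gives an upper bound of $5 \times 52{,}000 = 260{,}000$ at full acceptance, so a ~250K corpus implies that nearly every evolution passes the filter. Under the pseudocode above, where rejected items leave the pool, a ~70% per-round acceptance rate would compound to only about 144K.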
Complexity: time $O(\mid \mathcal{D}_0\mid \cdot M \cdot c_{\mathcal{T}})$ , dominated by teacher inference. WizardLM does not report total teacher cost. [Analysis] At ~250K evolved instructions × 2 teacher calls (rewrite + answer) × ~500 tokens per call on average × ChatGPT pricing ($0.002 per 1K tokens at the time), the back-of-envelope cost is ~$500-1,000.
Hyperparameters: $M = 4$ rounds (sensitivity tested on Vicuna test set; gains saturate); uniform operator sampling (not ablated).
Failure modes: teacher LLM refuses some evolved instructions (mitigated by elimination); template echo (mitigated by elimination rule 4); operator composition collapsing to incoherent prompts (no explicit fix).
Novelty: [New].
Transferability: [Analysis] very high; the recipe is domain-agnostic and the WizardCoder paper proves at least one successful transfer.
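The pseudocode in this entry translates almost line for line into Python. The sketch below is illustrative only: `stub_teacher`, the single placeholder template, and the trivial `accept` rule are stand-ins invented here so the loop runs without an LLM, and `teacher_cost_usd` simply restates the back-of-envelope cost arithmetic from the Complexity field.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def evol_instruct(
    seed: List[str],
    teacher: Callable[[str], str],
    templates: List[str],  # each template has one "{instruction}" slot
    accept: Callable[[str, str, str], bool],
    rounds: int,
    rng: random.Random,
) -> List[Pair]:
    """Return the accepted (evolved instruction, response) pairs over all rounds."""
    evolved: List[Pair] = []
    current = list(seed)
    for _ in range(rounds):
        accepted: List[Pair] = []
        for instruction in current:
            template = rng.choice(templates)  # uniform operator sampling
            rewritten = teacher(template.format(instruction=instruction))
            response = teacher(rewritten)
            if accept(instruction, rewritten, response):
                accepted.append((rewritten, response))
        evolved.extend(accepted)
        current = [inst for inst, _ in accepted]  # rejected items leave the pool
    return evolved


def teacher_cost_usd(n_instructions: int, calls: int, tokens_per_call: int,
                     usd_per_1k_tokens: float) -> float:
    """Back-of-envelope teacher cost: instructions x calls x tokens x price."""
    return n_instructions * calls * tokens_per_call / 1000 * usd_per_1k_tokens


def stub_teacher(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: appends a marker so output differs from input.
    return prompt + " [evolved]"


pairs = evol_instruct(
    seed=["Write a Python function to add two numbers."],
    teacher=stub_teacher,
    templates=["{instruction}"],  # placeholder template, not a WizardLM prompt
    accept=lambda original, rewritten, response: rewritten != original,
    rounds=2,
    rng=random.Random(0),
)
print(len(pairs))  # evolved pairs, excluding the seed
print(teacher_cost_usd(250_000, 2, 500, 0.002))
```

With one seed and two rounds, the run yields two evolved pairs, matching the three-example corpus of the hand trace once the seed is counted.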

ALGORITHM ENTRY [2]: Code Evol-Instruct dataset construction (WizardCoder)

Source: WizardCoder Section 3, prose description (no formal pseudocode in paper).²
Purpose: produce ~78K code-instruction pairs from Code-Alpaca’s 20K seed.
Inputs: Code-Alpaca seed (20K), teacher LLM (the paper does not state which; widely understood to be GPT-3.5 or GPT-4), five code-specific evolution templates, $M = 3$ rounds.
Outputs: 78K cumulative pairs (38K after round 1, 58K after round 2, 78K after round 3).
Pseudocode (reconstructed):

function CodeEvolInstruct(D0, T, {p_evol[1..5]}, M=3):
    D ← D0
    Dcurrent ← D0
    for m = 1 to M:
        Dnext ← {}
        for each I in Dcurrent:
            k ← uniform({1,…,5})
            prompt ← "Please increase the difficulty of the given
                      programming test question a bit..." + p_evol[k] + I
            I' ← T(prompt)
            R' ← T(I')
            if quality_check(R'):                # paper-side: pass@1 on
                                                 # a held-out probe declines
                                                 # if quality degrades
                Dnext ← Dnext ∪ {(I', R')}
        D ← D ∪ Dnext
        Dcurrent ← {I' for (I', R') in Dnext}
    return D

Hand-traced example. Seed $I_1 =$ “Write a Python function that returns the maximum of a list.” Round 1, $k = 2$ (replace common requirement): $I'_1 =$ “Write a Python function that returns the third-largest element of a list.” Teacher answers a sort-and-index implementation. Round 2, $k = 1$ (add new constraints and requirements): $I''_1 =$ “Write a Python function that returns the third-largest element of a list, without using a sort, and using only constant extra space.” Teacher answers a partial-selection implementation. Each round increases problem difficulty in a code-specific way.
Complexity: $O(20{,}000 \cdot M \cdot c_{\mathcal{T}}) \approx O(20{,}000 \cdot 3 \cdot c_{\mathcal{T}})$ .
Hyperparameters: $M = 3$ (paper notes pass@1 declines at $M = 4$ ); SFT hyperparameters: batch 512, sequence 2048, learning rate 2e-5, cosine schedule, 30 warmup steps, 200 fine-tuning steps, fp16. Hardware not reported in the paper.
Failure modes: round-4 degradation; in-breadth diversity loss (in-breadth axis removed entirely from the code variant).
Novelty: [Adapted] from WizardLM Evol-Instruct.
Transferability: [Analysis] transfers naturally to any other narrow domain (math, SQL, legal) given a domain-specific seed and adjusted operator set.

ALGORITHM ENTRY [3]: Tülu mix construction (Wang et al.)

Source: Tülu 1 Section 3, Table 1.³
Purpose: build a single fine-tuning corpus that covers factual knowledge, reasoning, code, and open-ended preference axes simultaneously.
Inputs: twelve open instruction datasets (SuperNI, CoT, Flan V2, Dolly, OpenAssistant, Self-Instruct, Unnatural-Instructions, Alpaca, Code-Alpaca, GPT4-Alpaca, Baize, ShareGPT).
Outputs: the “Human+GPT” mixture combining FLAN V2 + CoT + Dolly + OpenAssistant + GPT4-Alpaca + Code-Alpaca + ShareGPT (specific per-dataset row counts not stated explicitly in the paper as a single table).
Pseudocode (reconstructed):

function BuildTuluMix(datasets):
    mix ← []
    for d in [FlanV2, CoT, Dolly, OASST, GPT4Alpaca, CodeAlpaca, ShareGPT]:
        mix ← mix ∪ d                            # concatenate
    return shuffle(mix)

Hand-traced example. Take three datasets: $A = \{(I_1, R_1), (I_2, R_2)\}$ , $B = \{(I_3, R_3)\}$ , $C = \{(I_4, R_4)\}$ . The mix is $\{(I_1, R_1), \dots, (I_4, R_4)\}$ shuffled. Fine-tuning proceeds over the shuffled stream. The actual Tülu mix runs to a few hundred K examples; ShareGPT contributes the most preference-style chat data.
Complexity: $O(N_{\text{total}})$ in dataset size; no LLM calls (Tülu does not generate new data, only mixes).
Hyperparameters: learning rate 2e-5 (1e-5 for 30B/65B), 2 epochs, max sequence 2048, DeepSpeed ZeRO. Hardware: CSC LUMI cluster, 4× AMD MI250x GPUs per node.
Failure modes: imbalanced mix can let a high-volume dataset dominate (Tülu reports ShareGPT volume dominates open-ended-judging axes).
Novelty: [New] as a systematic comparison + mix protocol.
Transferability: highly reusable; the protocol generalises to any new candidate dataset.
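The mixing step and the imbalance failure mode noted in this entry can be sketched together. This is a minimal illustration on toy data, not the Tülu codebase: `build_mix` and `source_shares` are hypothetical names, and the source tag kept on each example is an addition made here so the per-dataset share can be inspected after shuffling.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)
Tagged = Tuple[str, Pair]  # (source dataset name, pair)


def build_mix(datasets: Dict[str, List[Pair]], names: List[str], seed: int = 0) -> List[Tagged]:
    """Concatenate the named datasets and shuffle; each item keeps its source tag."""
    mix = [(name, pair) for name in names for pair in datasets[name]]
    random.Random(seed).shuffle(mix)
    return mix


def source_shares(mix: List[Tagged]) -> Dict[str, float]:
    """Fraction of the mix contributed by each source dataset."""
    counts = Counter(name for name, _ in mix)
    return {name: count / len(mix) for name, count in counts.items()}


# Toy datasets from the hand-traced example.
toy = {
    "A": [("I1", "R1"), ("I2", "R2")],
    "B": [("I3", "R3")],
    "C": [("I4", "R4")],
}
mix = build_mix(toy, ["A", "B", "C"])
shares = source_shares(mix)
print(len(mix), shares["A"])
```

Checking `source_shares` before training is one cheap way to see whether a single high-volume dataset is about to dominate the stream.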

Section 8: Specialised design contributions

8A. Prompt design — Evol-Instruct prompt templates

The five in-depth + one in-breadth WizardLM prompt templates and the five WizardCoder code templates are the lineage’s most reused artefacts. Both papers publish them verbatim in an appendix.

PROMPT ENTRY [1]: In-depth “add constraints” (WizardLM)

Source: WizardLM Appendix A.¹
Role: rewrite an instruction with one additional constraint.
Prompt type: zero-shot meta-prompt with a single $I$ placeholder.
Components in order: persona (“Prompt Rewriter”) → objective → rewriting axis → length cap → output schema.
Template (quoted from the paper; “…” marks an elision):

“I want you to act as a Prompt Rewriter. Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle. But the rewritten prompt must be reasonable and must be understood and responded by humans. Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#. … You SHOULD complicate the given prompt using the following method: Please add one more constraints/requirements into #Given Prompt#. You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#. #Given Prompt#: {instruction} #Rewritten Prompt#:”

Input schema: a single seed instruction string.
Output schema: a single rewritten instruction string.
Failure handling: handled at the elimination step rather than in-prompt.
Design rationale: the persona + axis-specific instruction is what makes operator choice a controllable knob; a generic “make it harder” prompt without axis specification reportedly collapsed to mostly verbose rewrites.
Complexity: single teacher call per rewrite; constant prompt length.
Novelty: [New].
Transferability: [Analysis] high; the template structure is the standard for prompted data augmentation as of 2026.

The other five WizardLM templates (deepening, concretising, increase reasoning steps, complicate input, in-breadth) follow the same structure with a different axis-specific instruction. WizardCoder’s five templates use the same scaffold with code-specific axes.
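The shared scaffold makes the axis-specific sentence the only part that varies between operators. The sketch below shows that substitution on a deliberately abbreviated scaffold: `SCAFFOLD` keeps only the persona, the method slot, and the `#Given Prompt#` / `#Rewritten Prompt#` markers, so it is not the paper's full template, and `build_evol_prompt` is a hypothetical helper name.

```python
# Abbreviated scaffold (assumption): most of the quoted prompt is omitted here.
SCAFFOLD = (
    "I want you to act as a Prompt Rewriter. "
    "You SHOULD complicate the given prompt using the following method: {method} "
    "#Given Prompt#: {instruction} #Rewritten Prompt#:"
)

# Axis-specific sentence taken from the "add constraints" template quoted in PROMPT ENTRY [1].
ADD_CONSTRAINTS = "Please add one more constraints/requirements into #Given Prompt#."


def build_evol_prompt(instruction: str, method: str) -> str:
    """Fill the scaffold with one axis-specific method sentence and one seed instruction."""
    return SCAFFOLD.format(method=method, instruction=instruction)


prompt = build_evol_prompt("Write a Python function to add two numbers.", ADD_CONSTRAINTS)
print(prompt)
```

Swapping `ADD_CONSTRAINTS` for another axis sentence is the whole operator change, which is why operator choice behaves as a controllable knob.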

8B. Architecture-specific details

Not applicable. None of the papers modify the underlying transformer architecture; all five use the standard decoder-only backbone of LLaMA / LLaMA 2 / StarCoder / Code Llama as released by the original model authors.

8C. Training specifics

WizardLM: LLaMA 7B base; Adam optimiser; learning rate 2e-5; batch size 8 per GPU; 8× V100 GPUs with DeepSpeed ZeRO-3; 3 epochs; total wall-clock ~70 hours; sequence length 2048.¹

WizardCoder: StarCoder 15B base; batch size 512; sequence length 2048; learning rate 2e-5 with cosine schedule; 30 warmup steps; 200 fine-tuning steps total; fp16 mixed precision. Optimiser and hardware not reported.²

Tülu 1: LLaMA 6.7B/13B/30B/65B bases; learning rate 2e-5 (1e-5 for 30B/65B); 2 epochs; max sequence 2048; DeepSpeed ZeRO; CSC LUMI cluster with 4× AMD MI250x GPUs per node. Total GPU-hours not reported in a single number.³

Tülu 2: Llama-2 7B/13B/70B bases; same SFT recipe as Tülu 1 with a DPO stage at $\beta = 0.1$ on UltraFeedback preferences. Hardware: same LUMI cluster context.⁴

8D. Inference / deployment specifics

Not applicable. Inference is standard decoder-only autoregressive generation; no special runtime support is needed.

Section 9: Experiments and results

WizardCoder benchmark comparison card from arXiv showing pass@1 results across HumanEval, MBPP, DS-1000

WizardCoder benchmark card, reproduced from arXiv:2306.08568 for editorial coverage. Pass@1 results discussed in Section 9.2 below.

9.1 WizardLM results

Datasets evaluated.

Evol-Instruct test set: a 218-instruction test bed the paper constructs to balance difficulty levels.
Vicuna test set: 80 instructions from the Vicuna evaluation framework.
29-skill GPT-4 evaluation suite (each skill tested with ~10 prompts).

Baselines. Alpaca 7B, Vicuna 7B, ChatGPT (GPT-3.5).

Headline numerical results (from the paper, Tables 2-4).¹

Human evaluation, Evol-Instruct test set. WizardLM wins 12.4 percentage points more often than Vicuna in pairwise comparison.
Human evaluation, Vicuna test set. WizardLM wins 3.8 percentage points more than Vicuna.
Human evaluation, high-difficulty subset (difficulty ≥ 8 on the paper’s 1-10 scale). WizardLM beats ChatGPT 42.9% to 35.0% — a 7.9-point lead on the hardest prompts.
GPT-4 automatic evaluation. WizardLM scores 6.2% higher than Alpaca and 5.8% higher than Vicuna averaged across the 29 skills. WizardLM matches “more than 90% capacity of ChatGPT on 17 out of 29 skills” and achieves approximately 78% of ChatGPT’s average score across all 29 skills.

Ablations. Removing in-breadth evolution reduces topical diversity; removing the elimination step (mentioned in discussion, not run as a numerical ablation) is reported to degrade quality.

9.2 WizardCoder results

Datasets evaluated. HumanEval (164 Python problems), HumanEval+ (extended test cases), MBPP (974 Python problems), DS-1000 (data-science-specific).

Baselines. StarCoder 15B (the paper’s base), InstructCodeT5+, Code-Davinci-002, CodeGen-Mono, PaLM-Coder, PaLM-2-S, Claude, Bard, GPT-3.5, GPT-4.

Headline pass@1 numbers (from the paper, Table 1 + Figure 1).²

WizardCoder pass@1 on key benchmarks vs. baselines (values from WizardCoder paper, Table 1 / Figure 1).

Benchmark	WizardCoder-15B	StarCoder 15B	GPT-3.5	GPT-4
HumanEval	57.3%	33.6%	48.1%	67.0%
MBPP	51.8%	43.6%	—	—
DS-1000 (Completion)	29.2%	26.0%	—	—
DS-1000 (Insertion)	32.8%	25.4%	—	—

WizardCoder posts a 23.7 percentage-point HumanEval pass@1 gain over its StarCoder base (57.3% vs 33.6%), beats Claude (53.0%) and Bard (44.5%) on HumanEval, and remains behind GPT-4 (67.0%).

Ablations. The paper reports HumanEval pass@1 across evolution rounds: pass@1 climbs through rounds 1–3, then declines at round 4 — the paper stops at $M = 3$ for this reason.
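The stopping rule described here, evolve until held-out pass@1 first declines, is simple to state in code. The sketch below is an illustration under assumptions: `best_round` is a hypothetical helper, and the scores in `illustrative` are made-up placeholder values chosen only to show the rise-then-fall shape, not numbers from the WizardCoder paper.

```python
from typing import Sequence


def best_round(pass_at_1_by_round: Sequence[float]) -> int:
    """Return the 1-indexed round to stop at: the last round before the first decline."""
    for m in range(1, len(pass_at_1_by_round)):
        if pass_at_1_by_round[m] < pass_at_1_by_round[m - 1]:
            return m  # round m (1-indexed) was the last one that did not decline
    return len(pass_at_1_by_round)  # no decline observed; keep every round


# Placeholder scores for rounds 1..4 (NOT from the paper): climb for three rounds, then drop.
illustrative = [0.50, 0.54, 0.57, 0.55]
print(best_round(illustrative))
```

A rule like this needs a probe set that is disjoint from the final evaluation benchmark, otherwise the choice of $M$ is itself tuned on the test set.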

Tülu paper card showing the open-instruction-tuning benchmark grid across twelve datasets

Tülu 1 (arXiv:2306.04751) paper card, reproduced for editorial coverage. The systematic twelve-dataset benchmark grid below is the paper’s empirical contribution.

9.3 Tülu 1 results

The Tülu 1 paper’s core empirical contribution is the per-dataset benchmark grid. From the paper, on LLaMA 13B:³

Dataset	MMLU	GSM8K	BBH	TyDiQA	Codex-Eval	AlpacaEval
SuperNI	49.7	4.0	4.5	50.2	12.9	4.2
CoT	44.2	40.0	41.9	47.8	23.7	6.0
Flan V2	50.6	20.0	40.8	47.2	16.8	3.2
Dolly	45.6	18.0	28.4	46.5	31.0	13.7
OpenAssistant	43.3	15.0	39.6	33.4	31.9	58.1
Self-Instruct	30.4	11.0	30.7	41.3	12.5	5.0
Unnatural Instructions	46.4	8.0	33.7	40.9	23.9	8.4
Alpaca	45.0	9.5	36.6	31.1	29.9	21.9
Code-Alpaca	42.5	13.5	35.6	38.9	34.2	15.8
GPT4-Alpaca	46.9	16.5	38.8	23.5	36.6	63.1
Baize	43.7	10.0	38.7	33.6	28.7	21.9
ShareGPT	49.3	27.0	40.4	30.5	34.1	70.5
Tülu mix (13B)	49.3	40.5	43.3	45.6	35.9	56.5

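The mix has the top score on GSM8K and BBH and sits within a point of the best single dataset on Codex-Eval (35.9 vs GPT4-Alpaca’s 36.6); Flan V2 leads MMLU, SuperNI leads TyDiQA, and ShareGPT alone wins AlpacaEval (because preference-data-only training overfits the preference metric).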
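The per-column leaders can be recomputed directly from the grid rather than read off by eye. The sketch below transcribes the thirteen rows of the table above into a dictionary and reports the top-scoring row for each benchmark; `top_per_benchmark` is a hypothetical helper name, and ties (none occur in the top position here) would resolve to the first row listed.

```python
BENCHMARKS = ["MMLU", "GSM8K", "BBH", "TyDiQA", "Codex-Eval", "AlpacaEval"]

# Rows transcribed from the LLaMA 13B grid in this section.
GRID = {
    "SuperNI": [49.7, 4.0, 4.5, 50.2, 12.9, 4.2],
    "CoT": [44.2, 40.0, 41.9, 47.8, 23.7, 6.0],
    "Flan V2": [50.6, 20.0, 40.8, 47.2, 16.8, 3.2],
    "Dolly": [45.6, 18.0, 28.4, 46.5, 31.0, 13.7],
    "OpenAssistant": [43.3, 15.0, 39.6, 33.4, 31.9, 58.1],
    "Self-Instruct": [30.4, 11.0, 30.7, 41.3, 12.5, 5.0],
    "Unnatural Instructions": [46.4, 8.0, 33.7, 40.9, 23.9, 8.4],
    "Alpaca": [45.0, 9.5, 36.6, 31.1, 29.9, 21.9],
    "Code-Alpaca": [42.5, 13.5, 35.6, 38.9, 34.2, 15.8],
    "GPT4-Alpaca": [46.9, 16.5, 38.8, 23.5, 36.6, 63.1],
    "Baize": [43.7, 10.0, 38.7, 33.6, 28.7, 21.9],
    "ShareGPT": [49.3, 27.0, 40.4, 30.5, 34.1, 70.5],
    "Tülu mix (13B)": [49.3, 40.5, 43.3, 45.6, 35.9, 56.5],
}


def top_per_benchmark(grid: dict) -> dict:
    """Map each benchmark to the row name with the highest score in that column."""
    return {
        bench: max(grid, key=lambda name: grid[name][col])
        for col, bench in enumerate(BENCHMARKS)
    }


top = top_per_benchmark(GRID)
for bench in BENCHMARKS:
    print(f"{bench}: {top[bench]}")
```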

Headline cross-model comparison. The Tülu 65B fine-tune reaches approximately 87% of ChatGPT’s overall score and 73% of GPT-4’s.³

9.4 Tülu 2 and DPO

Tülu 2 reports state-of-the-art among open models at submission (November 2023) and matches or exceeds GPT-3.5-turbo-0301 on several benchmarks.⁴ Specific per-benchmark numbers vary across versions; this review does not reproduce them all since DPO mechanics are covered separately. [Analysis] The headline contribution is “DPO scales to 70B”; on the Evol-Instruct lineage axis, Tülu 2 is incremental.

9.5 Independent benchmark cross-checks

[External comparison] For the SOTA HumanEval claim in WizardCoder (57.3% pass@1), independent leaderboards on Papers With Code corroborate the number within rounding for the original 15B release. The follow-up issue is that HumanEval saturates fast — by late 2024, multiple non-Evol-Instruct fine-tunes (DeepSeek-Coder, Code Llama Instruct, Magicoder) cross 70% pass@1. WizardCoder’s 57.3% reflects what was achievable on StarCoder 15B in mid-2023; it is not a 2026 frontier number.

For WizardLM’s “90% of ChatGPT on 17 of 29 skills” claim, the LMSYS Chatbot Arena leaderboard (which post-dates the paper) has consistently placed WizardLM-70B (the follow-up release) competitively against early 2024 ChatGPT versions but well below GPT-4. [Reviewer Perspective] The original WizardLM-7B claim is weaker than the 70B follow-up; the 90% figure should be read as evidence that the technique works, not as a stable ranking against any specific frontier model.

For Tülu, the AllenAI open-instruct repository has continued releasing new Tülu generations (Tülu 3 in late 2024), which independent third-party benchmarks have validated.

9.6 Evidence audit

Strongly supported:

Evol-Instruct produces a corpus that outperforms equivalent-sized human-written or Self-Instruct alternatives at fixed model and budget (multiple ablations across both WizardLM and WizardCoder agree).
WizardCoder’s roughly 24-point HumanEval gain over StarCoder (57.3% vs 33.6%) is reproducible (independent leaderboards confirm).
Tülu’s claim that no single dataset wins every benchmark is robust (the per-dataset grid is the evidence).

Partially supported:

The “90% of ChatGPT” framing is GPT-4-judged, not human-judged at the same scale; GPT-4-as-judge has known biases.
WizardCoder’s “outperforms Claude and Bard on HumanEval and HumanEval+” claim depended on specific Claude and Bard versions at submission time; the claim does not transfer to current frontier models.

Narrow-evidence:

The “Tülu 65B reaches 87% of ChatGPT” headline is a single benchmark-averaged number; the per-axis story is much more mixed.

Section 10: Technical novelty summary

Component	Type	Novelty level	Justification	Source
Evol-Instruct evolution policy	Training method	Fully novel	First systematic prompted-LLM iterative instruction-difficulty escalation	WizardLM
Elimination evolution filter	Training method	Combination novel	Rejection-sampling on LLM-generated data is not new, but the four-criterion form is	WizardLM
In-breadth + in-depth axis decomposition	Prompt design	Fully novel	The six-axis taxonomy is original to WizardLM	WizardLM
Code-specific evolution operators	Prompt design	Incrementally novel	Code-domain re-specialisation of the six-axis recipe	WizardCoder
Twelve-dataset SFT benchmark grid	Empirical contribution	Fully novel	First systematic open-instruction-dataset comparison	Tülu 1
Tülu Human+GPT mix	Training data	Combination novel	The constituent datasets are all prior work; the mix is new	Tülu 1
DPO at 70B	Training method scaling	Incrementally novel	DPO itself was prior; scaling demonstration was new	Tülu 2
Auto Evol-Instruct LLM-as-operator-designer	Training method	Fully novel	The meta-level automation of evolution prompts	Auto Evol-Instruct
LLaMA / StarCoder backbones	Architecture	Adopted	Used as-released	Touvron et al. / Li et al.

Single most novel contribution. Evol-Instruct’s six-axis iterative LLM-driven instruction-difficulty escalation. The technique reframed open-source instruction-tuning from a “how do I get more instructions” problem into a “how do I get harder instructions” problem. [Analysis] Almost every open-instruction-tuning paper after WizardLM cites or extends this framing.

What the papers do not claim to be novel. Self-Instruct-style automated data generation (adopted from Wang et al. 2022); SFT itself; the standard cross-entropy loss; the LLaMA / StarCoder / Code Llama backbones; the AlpacaEval / MT-Bench / HumanEval / MBPP benchmarks; the DPO loss (Tülu 2 explicitly cites Rafailov et al.); the underlying transformer architecture.

Section 11: Situating the work

Prior work. Self-Instruct (Wang et al., December 2022) introduced LLM-driven instruction generation. Alpaca (Taori et al., March 2023) productionised it for LLaMA. Vicuna (Chiang et al., March 2023) showed real-conversation data (ShareGPT) outperforms purely synthetic data on chat. The FLAN line (Wei et al. 2021, Chung et al. 2022) showed massive human-written templated instruction-tuning improves zero-shot generalisation. The Evol-Instruct lineage sits between Self-Instruct (which scales count) and FLAN (which scales coverage) by scaling difficulty.

What this lineage changes conceptually. Before Evol-Instruct, instruction-tuning data was treated as a corpus to be enlarged. After Evol-Instruct, it is treated as a curriculum to be evolved. The shift is operationally consequential: every major open-instruction-tuning effort from 2023 onward (Open-Hermes, Nous-Hermes, OpenChat, the Tülu series, Mistral Instruct, Llama 2-Chat) includes some Evol-Instruct-style difficulty escalation in the data pipeline.

[External comparison] Two contemporaneous related papers.

Orca (Mukherjee et al., June 2023). Distils from GPT-4 with reasoning traces rather than evolved instructions. Same “distil from frontier teacher” framing, different lever (chain-of-thought traces instead of difficulty escalation). The two lines were briefly competitors and increasingly are combined in production recipes (most strong 2024-onward open chat models use both).
LIMA (Zhou et al., May 2023). The opposite end of the spectrum: 1,000 carefully hand-curated examples beat automatically generated instruction sets roughly 50 times larger (Alpaca’s 52K). The Evol-Instruct claim is essentially “automate the curation”; LIMA’s claim is “humans still do it better at small scale.” [Reviewer Perspective] The two papers are not in direct conflict: Evol-Instruct wins at the 100K-plus scale, LIMA wins at the 1K scale, and the curve crosses somewhere around 10K.

[Reviewer Perspective] Strongest skeptical objection. Evol-Instruct distils from a closed-weights frontier teacher; the gains are bounded by the teacher’s competence and inherit the teacher’s biases. Every Evol-Instruct-trained model is in some sense a partial copy of the teacher LLM. As frontier models improve, the lineage stays bounded by where the teacher is, not where the student could in principle reach. The Tülu papers’ multi-dataset framing partially escapes this by mixing in human data, but the synthetic-only papers do not.

[Reviewer Perspective] Strongest author-side rebuttal. From WizardLM Section 6 (limitations): the authors explicitly acknowledge the teacher-bound ceiling and the GPT-4 judging bias. Their argument is not “Evol-Instruct produces frontier capability” but “Evol-Instruct produces best-in-class results at fixed open-base + fixed teacher budget.” That weaker claim is defensible.

What remains unsolved.

Whether Evol-Instruct gains transfer cleanly to non-English domains (the papers evaluate in English).
Whether the curriculum-style escalation actually improves out-of-distribution generalisation or merely better fits the AlpacaEval-style judge distribution.
Whether the teacher-bounded ceiling can be lifted by self-evolution (student evolves its own instructions). Auto Evol-Instruct edges toward this but still uses a teacher.

Three future research directions.

Teacher-free Evol-Instruct. Use the student model itself as both evolver and answerer; bootstrap difficulty without a stronger teacher. [Analysis] Risks the well-known model-collapse problem with synthetic data.
Capability-targeted evolution. Currently operators are uniformly sampled. Sampling proportional to the student’s per-capability weakness would be more efficient. [Analysis] Some open-instruct work moves in this direction, but it remains unexplored within this lineage.
Contamination-aware Evol-Instruct. Evolution that explicitly avoids producing instructions semantically close to public benchmark items. [Reviewer Perspective] An audit-driven follow-up here would address the contamination concern Section 12 below flags.
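The capability-targeted direction above amounts to replacing the uniform operator draw with a weighted one. The following sketch shows one possible weighting, proportional to $1 - \text{score}$; it is a speculative illustration, not something any of the papers implement, and the operator names, the one-to-one mapping from operator to capability score, and the scores themselves are all hypothetical.

```python
import random
from typing import Dict


def weakness_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Map per-capability scores in [0, 1] to sampling weights proportional to (1 - score)."""
    gaps = {name: 1.0 - score for name, score in scores.items()}
    total = sum(gaps.values())
    if total == 0:
        return {name: 1.0 / len(scores) for name in scores}  # nothing is weak: fall back to uniform
    return {name: gap / total for name, gap in gaps.items()}


def sample_operator(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one operator name according to the given weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical student scores on the capability each operator is assumed to exercise.
scores = {"add_constraints": 0.9, "increase_reasoning": 0.6, "in_breadth": 0.5}
weights = weakness_weights(scores)
print(weights)
print(sample_operator(weights, random.Random(0)))
```

The weakest capability receives the largest share of evolution calls, which is the efficiency argument; the open question is how to measure per-capability scores without leaking the evaluation set into the data pipeline.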

Section 12: Critical analysis

Strengths

WizardLM: cleanly-defined evolution operators, transparent prompt templates, public code + data, multiple complementary evaluations (human + GPT-4 + skill-decomposed).¹
WizardCoder: a single concrete benchmark gain (roughly 24 points on HumanEval over the StarCoder base) that independent leaderboards confirm.²
Tülu 1: methodologically clean per-dataset benchmark grid; this is the kind of empirical-baseline paper the field needed and would have lacked otherwise.³
Tülu 2: DPO scaling demonstration at 70B; useful precedent for the open ecosystem.⁴

Weaknesses stated by the authors

WizardLM Section 6: GPT-4 evaluator bias toward verbose responses; the 90% of ChatGPT claim is conditional on the GPT-4 judge.
WizardLM Section 6: teacher-LLM-bound capability ceiling; the student cannot exceed what the teacher could solve in principle.
WizardCoder discussion: stops at $M = 3$ because round-4 quality declines; the technique has a natural ceiling.
Tülu 1: “no single instruction-tuning dataset (or combination) provides the best performance across all evaluations.”

Weaknesses not stated or understated

[Reviewer Perspective] Benchmark-evaluator family conflict. AlpacaEval is judged by GPT-4, which is the teacher LLM commonly used to generate Evol-Instruct data. A model fine-tuned on GPT-4-generated data and judged by GPT-4 is a structurally suspect evaluation; LMSYS Chatbot Arena human votes consistently place GPT-4-distilled models lower than the AlpacaEval ordering suggests.
[Reviewer Perspective] Synthetic data contamination. The teacher LLM was trained on data that includes benchmark prompts (MMLU, HumanEval, GSM8K). Synthetic data generated from such a teacher can leak benchmark content. None of the papers run a contamination audit.
[Reviewer Perspective] Lack of multilingual evaluation. The papers all evaluate in English; the lineage’s gains in non-English instruction following are unverified.
[Reviewer Perspective] Cost transparency. Neither WizardLM nor WizardCoder reports total teacher-LLM inference cost, which is the primary practitioner-relevant economic axis.

Reproducibility check

Code (WizardLM): released — https://github.com/nlpxucan/WizardLM.⁶
Code (WizardCoder): released — same repository, separate branch.⁶
Code (Tülu 1 + 2): released — https://github.com/allenai/open-instruct.⁷
Data (WizardLM): partially released — earlier versions of the 250K corpus were on Hugging Face; subsequent versions were removed for licensing reasons.
Data (WizardCoder): released as Hugging Face dataset.
Data (Tülu): released, including the evaluation framework.
Hyperparameters (WizardLM): fully reported.
Hyperparameters (WizardCoder): hardware unstated, otherwise reported.
Hyperparameters (Tülu): fully reported.
Compute (WizardLM): reported (8× V100, 70 hours).
Compute (WizardCoder): not reported.
Compute (Tülu): partially reported (LUMI cluster context, total GPU-hours not stated).
Weights (WizardLM, WizardCoder, Tülu 1+2): released on Hugging Face.
Overall: WizardLM and Tülu reproducible; WizardCoder partially reproducible (missing hardware spec).

Methodology disclosure

Sample size: WizardLM evaluates on 218 (Evol-Instruct test set) + 80 (Vicuna) + ~290 (29 skills × ~10 prompts) instructions. WizardCoder evaluates on 164 (HumanEval) + 974 (MBPP) + 1000 (DS-1000). Tülu evaluates on each benchmark’s standard test split.
Evaluation set: held-out by construction for HumanEval/MBPP/DS-1000; AlpacaEval is a fixed 805-prompt set; contamination check not run by any of the papers.
Baselines: WizardLM vs Alpaca, Vicuna, ChatGPT. WizardCoder vs StarCoder + 10 other code LLMs. Tülu vs 12 individual datasets and ChatGPT/GPT-4 references.
Hardware/compute: 8× V100 (WizardLM, 70 hours); not reported (WizardCoder); CSC LUMI 4× MI250x per node (Tülu); total GPU-hours not aggregated.

Generalisability

To other domains: WizardCoder is the proof that Evol-Instruct transfers to code. Extension to math, SQL, and structured data is plausible but uncertain (some 2024 follow-ups attempt math-specific variants with mixed results).
To larger scales: Tülu 2 confirms the recipe scales to 70B. The teacher-bounded ceiling becomes more binding at larger scales.
To different backbones: Tülu shows the recipe generalises across LLaMA, Llama-2, Code Llama. Auto Evol-Instruct extends to Mistral-family.
To non-English: untested by the lineage itself.

Assumption audit

“Teacher LLM is competent on evolved instructions” — fragile at the difficulty frontier; partially mitigated by elimination.
“GPT-4 judging is a faithful proxy” — fragile; conflict-of-interest with teacher-LLM-as-data-source.
“Pass@1 on HumanEval extrapolates” — partially fragile; HumanEval has known coverage gaps that subsequent benchmarks (LiveCodeBench, SWE-bench) expose.
“No test-set leakage” — unverified.

What would make the lineage significantly stronger

A formal contamination audit on the 250K Evol-Instruct corpus against MMLU / HumanEval / GSM8K test sets.
Human-rater AlpacaEval-equivalent at the same scale as GPT-4-judged AlpacaEval.
Total-cost reporting (teacher-LLM dollars, student-fine-tuning GPU-hours) per benchmark gain.

Section 13: What is reusable for a new study

REUSABLE COMPONENT [1]: Evol-Instruct evolution prompts (six-axis)

What it is: the six prompt templates from WizardLM Appendix A.
Why worth reusing: every open-instruction-tuning effort from 2023 onward depends on something downstream of these prompts; using them verbatim ensures comparability to a large baseline corpus.
Preconditions: a teacher LLM API budget; a seed of ~50K instructions.
What would need to change in a different setting: domain-specific operators (WizardCoder demonstrates the pattern for code; analogous specialisation needed for math, SQL, etc.).
Risks: copying the verbose-output bias inherited from the original templates.
Interaction effects: composes well with later elimination filtering and downstream DPO post-training.

REUSABLE COMPONENT [2]: Elimination evolution filter (4-criterion)

What it is: the four-rule rejection check (info gain, refusal detection, content check, template-echo check).
Why worth reusing: any iterative LLM-driven data pipeline degrades without a quality filter; this one is battle-tested.
Preconditions: a separate LLM call for the info-gain check; access to response token-length.
What would need to change: domain-specific quality predicates (for code, compilability + unit-test pass should be added).
Risks: the predicate is itself an LLM judgement and inherits the teacher’s biases.
Interaction effects: stacks well with Evol-Instruct above; standalone applicability to Self-Instruct corpora as well.
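The four rules can be combined into a single predicate. The sketch below is a simplified reading of the filter, not the WizardLM implementation: the info-gain rule is delegated to a caller-supplied judge (an LLM call in practice, a string comparison in this toy), the refusal rule assumes an apology keyword plus the 80-word threshold, the content rule is reduced to "contains at least one alphanumeric character", and the echo markers are assumed from the template's `#Given Prompt#` / `#Rewritten Prompt#` labels.

```python
from typing import Callable

# Assumed echo markers, based on the labels used in the evolution prompt template.
ECHO_MARKERS = ("given prompt", "rewritten prompt")


def accept(
    original: str,
    evolved: str,
    response: str,
    adds_information: Callable[[str, str], bool],
    refusal_word_limit: int = 80,
) -> bool:
    """Return True only if the evolved pair passes all four elimination rules."""
    # Rule 1 (info gain): delegated to a judge; an LLM judgement in practice.
    if not adds_information(original, evolved):
        return False
    # Rule 2 (refusal detection): a short response that apologises is treated as a refusal.
    if "sorry" in response.lower() and len(response.split()) < refusal_word_limit:
        return False
    # Rule 3 (content check, simplified): reject responses with no alphanumeric content.
    if not any(ch.isalnum() for ch in response):
        return False
    # Rule 4 (template echo): the evolved instruction copied words from the evolving prompt.
    if any(marker in evolved.lower() for marker in ECHO_MARKERS):
        return False
    return True


def toy_judge(original: str, evolved: str) -> bool:
    # Hypothetical stand-in for the LLM info-gain judge.
    return original.strip().lower() != evolved.strip().lower()


ok = accept("Add two numbers.", "Add two numbers without using +.",
            "def add(a, b): return a if b == 0 else add(a ^ b, (a & b) << 1)", toy_judge)
print(ok)
```

For code data, the domain-specific predicates suggested above (compilability, unit-test pass) would slot in as additional rules after rule 3.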

REUSABLE COMPONENT [3]: Tülu evaluation harness

What it is: the AllenAI open-instruct evaluation framework covering MMLU / GSM8K / BBH / TyDiQA / Codex-Eval / AlpacaEval / ToxiGen.
Why worth reusing: it’s the most-used cross-capability eval in the open-instruct space; new datasets need to plug into something, and this is what they plug into.
Preconditions: model weights or inference endpoint.
What would need to change: add benchmarks that post-date Tülu 2 (LiveCodeBench, MMLU-Pro, IFEval).
Risks: the AlpacaEval component carries the teacher-judge conflict noted above.
Interaction effects: standalone usability is high.

REUSABLE COMPONENT [4]: Tülu mix recipe (Human+GPT)

What it is: the FLAN V2 + CoT + Dolly + OASST + GPT4-Alpaca + Code-Alpaca + ShareGPT mix.
Why worth reusing: it is, as of the original release, the strongest open-resources-only SFT mix; subsequent Tülu generations build on this base.
Preconditions: dataset licenses (ShareGPT in particular has licensing nuance).
What would need to change: re-balancing for use cases that need more code or more math.
Risks: stale relative to 2026; newer mixes (Open-Hermes 2.5, Dolphin, the Tülu 3 mix) outperform.
Interaction effects: composes with subsequent DPO post-training as Tülu 2 demonstrates.

Dependency map. Components 1 and 2 are tightly coupled (the filter exists because of the failures the prompts produce). Component 3 is independent of 1 and 2. Component 4 builds on the broader open-dataset ecosystem rather than on Evol-Instruct specifically.

Recommendation. [Analysis] For a small team building a new instruction-tuned model in 2026: start with the Tülu evaluation harness (component 3) to define success, build a data mix similar in spirit to Tülu Human+GPT (component 4), apply Evol-Instruct prompts (component 1) with the elimination filter (component 2) to a domain-specific seed for any capability the mix under-covers, then DPO on a preference set. That stack is the inherited 2026 default.

[Analysis] The new study most likely to benefit is a domain-specific code-or-math instruction-tuning project; the components map cleanly onto it. The least likely to benefit is a multilingual project, where the lineage offers no validated guidance.

Section 14: Known limitations and open problems

Stated by the authors. WizardLM acknowledges a teacher-bound ceiling and judging-LLM bias. WizardCoder acknowledges the round-4 quality decline. Tülu 1 acknowledges that no single dataset wins on all axes. Tülu 2 acknowledges DPO sensitivity to $\beta$.
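
The sensitivity to $\beta$ is easier to see from the standard DPO objective for a single preference pair, which is the negative log-sigmoid of $\beta$ times the difference between the policy-versus-reference log-ratio on the chosen response and on the rejected one. The sketch below evaluates that loss at a few values of $\beta$. The log-probabilities are made-up toy numbers and the `dpo_loss` helper is hypothetical; nothing here reproduces Tülu 2's settings.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy log-probabilities (assumed values): the implied margin is 2.0 nats.
for beta in (0.01, 0.1, 0.5):
    loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta)
    print(f"beta={beta}: loss={loss:.4f}")
```

With the margin held fixed, a small $\beta$ leaves the loss close to $\log 2$ and the gradient signal weak, while a large $\beta$ saturates the sigmoid quickly. That is one way to read why the choice of $\beta$ needs tuning.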

Not stated (or understated).

Synthetic data contamination of public benchmarks ([Reviewer Perspective], grounded in subsequent independent audits in the open-instruct community).
Teacher-judge family conflict on AlpacaEval ([Reviewer Perspective], grounded in LMSYS Chatbot Arena comparisons showing systematic divergence).
Reproducibility of WizardCoder hardware setup ([Analysis], grounded in the paper itself not reporting it).
Multilingual generalisation ([Analysis], grounded in lack of non-English evaluation in any of the papers).

Technical root causes.

Contamination: teacher LLM training data overlap with benchmark test sets.
Judge-family conflict: a GPT-4 judge tends to score outputs distilled from GPT-family teachers sympathetically.
WizardCoder hardware gap: reporting omission, no deeper technical cause.
Multilingual gap: training data and evaluation harness are English-centric.
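
The contamination root cause cannot be inspected directly when the teacher is closed-weights, but its downstream symptom can be probed: verbatim overlap between the synthetic corpus and benchmark test items. The sketch below is a minimal n-gram overlap check. The function names, the tokeniser, and the window size of 8 are assumptions chosen for illustration; the audits mentioned above may use different criteria, and paraphrased leakage would pass this check undetected.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase alphanumeric n-grams in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(train_texts, benchmark_texts, n=8):
    """Return indices of training texts sharing any n-gram with a benchmark item."""
    benchmark_grams = set()
    for item in benchmark_texts:
        benchmark_grams |= ngrams(item, n)
    return [
        index
        for index, text in enumerate(train_texts)
        if ngrams(text, n) & benchmark_grams
    ]


# Toy data (invented for this example, not drawn from any real benchmark).
benchmark = ["Write a function that returns the sum of two integers a and b."]
training = [
    "Write a function that returns the sum of two integers a and b quickly",
    "Explain photosynthesis to a child",
]
print(flag_contaminated(training, benchmark))
```

Texts shorter than the window produce no n-grams and are never flagged, so a smaller `n` may be needed for short prompts at the cost of more false positives.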

Open problems left behind.

A contamination-aware Evol-Instruct variant.
A teacher-free or self-evolution-style escalation that does not inherit a closed-weights ceiling.
Capability-targeted operator sampling (proportional to student weakness).
Non-English Evol-Instruct evaluation suites.
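
Capability-targeted operator sampling, the third item above, can be sketched as a weighted draw in which each evolution operator's probability grows with the student's measured weakness on the capability it exercises. The operator labels, the per-operator scores, and the `floor` parameter below are all hypothetical; none of the papers in this lineage specifies such a scheme.

```python
import random


def operator_weights(scores, floor=0.05):
    """Turn per-operator student scores in [0, 1] into sampling probabilities.

    Weakness is 1 - score; `floor` (an assumed value) keeps strong areas from
    dropping out of the sample entirely.
    """
    weakness = {op: max(1.0 - score, floor) for op, score in scores.items()}
    total = sum(weakness.values())
    return {op: value / total for op, value in weakness.items()}


def sample_operator(scores, rng):
    """Draw one operator name with probability proportional to weakness."""
    probs = operator_weights(scores)
    names = sorted(probs)
    return rng.choices(names, weights=[probs[name] for name in names], k=1)[0]


# Invented student scores on evaluations mapped to three illustrative operators.
toy_scores = {"add_constraints": 0.9, "deepening": 0.6, "increase_reasoning": 0.3}

rng = random.Random(0)
print(operator_weights(toy_scores))
print([sample_operator(toy_scores, rng) for _ in range(5)])
```

The open part of the problem is the mapping itself: deciding which benchmark score stands in for which operator is an assumption this sketch takes as given.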

What a follow-up paper would need to solve. The most critical open problem is the teacher-bound ceiling. A follow-up that demonstrates Evol-Instruct-style escalation purely with open-weight teacher LLMs (no GPT-4) and matches the closed-teacher result would be the natural next step; Auto Evol-Instruct moves toward this but still relies on a strong teacher.

How this article reads at three depths

For the curious high-school reader. A team in 2023 found a clever trick: ask a powerful AI to make practice problems harder, then train a smaller AI on those problems. The smaller AI got much better at hard problems, sometimes nearly as good as the powerful AI itself. Other teams then showed the trick works for code, and that mixing in human-written problems works even better. Today, many open AI chat models use some version of this idea.

For the working developer or ML engineer. The Evol-Instruct lineage gives a concrete, reproducible recipe: take Alpaca’s or Code-Alpaca’s seed, iterate the WizardLM six-axis (or WizardCoder five-axis) evolution prompts via a GPT-3.5/4 teacher for 3–4 rounds, apply the four-rule elimination filter, supervised-fine-tune on the resulting corpus, optionally blend with the Tülu Human+GPT mix, optionally DPO on UltraFeedback. Total cost: a few thousand dollars of teacher inference + standard SFT compute. The recipe is bounded by the teacher’s capability and judged on AlpacaEval-style benchmarks with known biases. It is the right default for “make an open base follow hard instructions better” in 2026; it is not the right tool when frontier-level capability is the requirement.
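
The data-generation half of that recipe reduces to a short loop: evolve each instruction, keep the result if it passes the elimination filter, and retry the original in the next round if it does not. The sketch below shows only that control flow. The teacher and the filter are passed in as callables and stubbed with toy functions, so `stub_teacher`, `stub_keep`, the operator label, and the 60-character limit are placeholders and not the papers' prompts or filter rules.

```python
import random


def evolve_corpus(seed, operators, teacher, keep, rounds=4, rng=None):
    """Run `rounds` of evolution over `seed` and return seed plus kept evolutions.

    teacher(operator, instruction) -> evolved instruction (an LLM call in practice).
    keep(original, evolved) -> bool (the elimination filter in practice).
    """
    rng = rng or random.Random(0)
    frontier = list(seed)   # instructions to evolve in the current round
    corpus = list(seed)     # everything that ends up in the SFT data
    for _ in range(rounds):
        next_frontier = []
        for instruction in frontier:
            operator = rng.choice(operators)
            evolved = teacher(operator, instruction)
            if keep(instruction, evolved):
                corpus.append(evolved)
                next_frontier.append(evolved)
            else:
                # Failed evolution: carry the unevolved instruction forward.
                next_frontier.append(instruction)
        frontier = next_frontier
    return corpus


def stub_teacher(operator, instruction):
    """Toy stand-in for the teacher LLM: tags the instruction with the operator."""
    return f"{instruction} [{operator}]"


def stub_keep(original, evolved):
    """Toy stand-in for the filter: reject unchanged or overly long outputs."""
    return evolved != original and len(evolved) <= 60


corpus = evolve_corpus(["Sort a list."], ["add_constraints"], stub_teacher, stub_keep)
for row in corpus:
    print(row)
```

Generating responses for the evolved instructions, supervised fine-tuning, and the optional DPO stage sit outside this loop and are not shown.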

For the ML researcher. WizardLM’s six-axis evolution operator is the lineage’s structurally novel contribution; everything else extends, scales, or transfers it. The strongest objection is the teacher-bound ceiling combined with the GPT-4-as-judge family conflict on AlpacaEval — a structural issue not yet cleanly addressed in any follow-up. Load-bearing assumptions are (a) teacher competence on evolved instructions, (b) judge fidelity to human preference, and (c) absence of benchmark contamination, all of which warrant explicit audits. A follow-up that demonstrates teacher-free escalation matching the closed-teacher result, plus a contamination audit on the canonical 250K corpus, would close the most important gaps. The lineage is essential reading for anyone working on post-training of open-weight models.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Xu et al., WizardLM: Empowering Large Language Models to Follow Complex Instructions, arXiv:2304.12244 (ICLR 2024). (accessed 2026-05-19)
2. Luo et al., WizardCoder: Empowering Code Large Language Models with Evol-Instruct, arXiv:2306.08568 (ICLR 2024). (accessed 2026-05-19)
3. Wang et al., How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources, arXiv:2306.04751 (NeurIPS 2023 Datasets and Benchmarks). (accessed 2026-05-19)
4. Ivison et al., Camels in a Changing Climate: Enhancing LM Adaptation with Tülu 2, arXiv:2311.10702. (accessed 2026-05-19)
5. Zeng et al., Automatic Instruction Evolving for Large Language Models, arXiv:2406.00770. (accessed 2026-05-19)
6. WizardLM official GitHub repository (code + WizardCoder branch). (accessed 2026-05-19)
7. AllenAI open-instruct repository (Tülu code + evaluation harness). (accessed 2026-05-19)