Neural Tech Daily

Synthetic data + textbook-quality data for small LLMs: Phi, Orca, Self-Instruct, Cosmopedia

Multi-paper review of the synthetic-data lineage that produced small but capable LLMs: Self-Instruct, Orca, Phi-1, Phi-3, and the Cosmopedia open replication.


Reading-register key

  • From the paper: claims drawn from the source paper’s text, tables, equations, or figures.
  • [Analysis] the publication’s own reasoned assessment, distinct from any claim a source paper itself makes.
  • [External comparison] comparison to prior work or general knowledge outside the five sources.
  • [Reviewer Perspective] critical or speculative assessment that goes beyond what any of the five sources prove.

Section 1: Cluster scope

This review covers five linked artefacts from the synthetic-data-for-small-LLMs research line that emerged between late 2022 and 2024:

  • Self-Instruct (Wang et al., 2022; arXiv:2212.10560 4 ), the open template for bootstrapping instruction-following data from a teacher model.
  • Orca (Mukherjee et al., 2023; arXiv:2306.02707 3 ), explanation-tuned distillation from GPT-4 traces into a 13B student.
  • Phi-1 (Gunasekar et al., 2023; arXiv:2306.11644 1 ), the original “Textbooks Are All You Need” 1.3B code model trained on filtered code plus GPT-3.5-generated synthetic textbooks and exercises.
  • Phi-3 (Abdin et al., 2024; arXiv:2404.14219 2 ), the production-scaled descendant: phi-3-mini (3.8B), phi-3-small (7B), phi-3-medium (14B), trained on heavily filtered web data plus synthetic data.
  • Cosmopedia (Ben Allal et al., 2024 5 ), Hugging Face’s open replication attempt at the Phi-1.5 / Phi-2 training data, generated with Mixtral-8x7B-Instruct.

The publication’s reading is that these five are not five independent papers; they are the canonical genealogy of one research idea: a small model trained on high-quality, curated, often synthetically-generated data can match or beat a larger model trained on raw web scrape. Self-Instruct introduced the bootstrap recipe; Alpaca proved it at consumer scale; Orca scaled it with richer teacher signals; Phi-1 transferred the idea from instruction-tuning into pre-training; Phi-3 demonstrated it at production scale; Cosmopedia tried to reproduce the recipe in the open.

The paper-classification taxonomy on this cluster: all five sit at the intersection of training method, data-driven, and LLM-based. Self-Instruct, Orca, and Cosmopedia explicitly use an LLM teacher to author training data; Phi-1 and Phi-3 inherit synthetic data from upstream pipelines. Reader prerequisites: high-school algebra; some familiarity with what an LLM is and what “fine-tuning” means is helpful but not required, because the Glossary in Section 2.5 covers every prerequisite term used downstream.

Section 2: TL;DR and executive overview

Most large language models are trained on enormous scrapes of the internet. The papers in this cluster ask a different question: can a much smaller model do almost as well if its training data is much cleaner, more educational, or written by another, larger AI specifically to teach it? Across roughly two years (late 2022 to 2024) and five papers, the answer the research consensus has converged on is “mostly yes, with caveats.” Phi-3-mini, a 3.8-billion-parameter model, matches benchmark scores that previously required models four to ten times its size. 2 The cost is paid up front in compute used to generate or filter the training data, and in residual concerns about how much the small models really know versus how much they have been taught to imitate.

[Executive summary] The five papers collectively define a recipe. Self-Instruct supplies the data-bootstrap pattern: take a small seed of human-written examples, prompt a teacher LLM to generate many more, filter for diversity and quality, train a student on the result. 4 Orca refines this for reasoning tasks by drawing rich step-by-step explanation traces from GPT-4 rather than terse imitation outputs. 3 Phi-1 transposes the pattern from instruction-tuning into pre-training itself, building a “CodeTextbook” from filtered code and GPT-3.5-generated synthetic textbooks plus a “CodeExercises” set; phi-1 reaches 50.6 percent pass@1 on HumanEval at 1.3 billion parameters. 1 Phi-3 scales this to trillions of tokens; phi-3-mini matches Mixtral 8x7B on MMLU at roughly one-twelfth the total parameter count (3.8B against roughly 47B, of which Mixtral activates about 13B per token). 2 Cosmopedia open-sources the recipe. 5

Practitioner-relevant takeaways:

  • Synthetic data is now a serious pre-training ingredient, not an instruction-tuning afterthought. Phi-3’s training mix is “heavily filtered publicly available web data and synthetic data” by the authors’ own characterisation. 2
  • A bigger teacher generates better synthetic data. Self-Instruct used GPT-3 davinci; Orca used GPT-4; Cosmopedia used Mixtral-8x7B-Instruct because the authors could not access GPT-4 at the volume needed. 4 3 5
  • Diversity is the bottleneck. Self-Instruct’s ROUGE-L filter, Cosmopedia’s audience/format multiplication (4 audiences × 3 styles), and Orca’s 16 system messages all exist to prevent the teacher from generating near-duplicate outputs. 4 5 3
  • Small models lose on factual recall, not on reasoning. Phi-3’s authors note that “smaller models lack capacity for extensive factual knowledge,” evident on TriviaQA, and recommend retrieval augmentation. 2
  • The open replication is harder than the closed paper makes it look. Cosmo-1B trained on 25B tokens of Cosmopedia underperforms Phi-1.5 by a notable gap, attributed to teacher-model quality and prompt-engineering depth. 5

Pipeline overview in text (cluster-wide). Training-time: a teacher LLM is prompted (often with a seed bank of human-written examples, a diversity-inducing template, and audience/format/topic conditioners) to generate synthetic training data at scale. The output is filtered (ROUGE-L deduplication, classifier-based educational-value scoring, decontamination against benchmarks) and used either as instruction-tuning data on a smaller student (Self-Instruct, Orca) or as pre-training data on a smaller base model trained from scratch (Phi-1, Phi-3, Cosmo-1B). Inference-time: the student model runs without the teacher; the teacher is a training-time artefact only.
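The decontamination step in the filter stage can be illustrated with a word-level n-gram overlap check. This is a minimal sketch, not the procedure of any of the five sources: the function names, the whitespace tokenisation, and the default `n=8` are all assumptions chosen for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(samples: list, benchmark: list, n: int = 8) -> list:
    """Drop every sample that shares at least one n-gram with a benchmark item.

    n=8 is an illustrative default, not a value taken from the sources.
    """
    banned = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    # Keep a sample only if none of its n-grams appear in the benchmark.
    return [s for s in samples if not (ngrams(s, n) & banned)]
```

A stricter pipeline would likely also normalise punctuation or use embedding similarity, as the Glossary entry on decontamination notes.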

Section 2.5: Glossary

| Term | Plain-English explanation | First appears in |
| --- | --- | --- |
| Synthetic data | Training data written by another model rather than scraped from humans. The teacher model is given prompts and produces text the student is then trained on. | Section 1 |
| Distillation | Training a smaller “student” model to imitate a larger “teacher” model, either through the teacher’s output text or its internal probabilities. | Section 1 |
| Pre-training | The initial, expensive training phase that teaches a model general language ability from massive text, before any task-specific fine-tuning. | Section 2 |
| Instruction-tuning | A second training phase that adapts a pre-trained model to follow user instructions, typically using prompt/response pairs. | Section 2 |
| Token | The basic unit of text a model processes; roughly a word fragment. “3.3T tokens” means 3.3 trillion such units. | Section 2 |
| Pass@1 | A code-benchmark scoring rule: the fraction of problems solved on the first generation attempt; HumanEval and MBPP report pass@1. | Section 2 |
| MMLU | Massive Multitask Language Understanding: a 57-subject multiple-choice benchmark widely used to score “general knowledge and reasoning.” | Section 2 |
| HumanEval | A 164-problem Python coding benchmark from OpenAI; each problem has a function signature and hidden unit tests. | Section 2 |
| MT-Bench | A multi-turn conversation benchmark scored by GPT-4 as judge on a 1-10 scale. | Section 2 |
| BBH | Big-Bench Hard: 23 challenging reasoning tasks selected from the larger BIG-bench suite. | Section 9 |
| ROUGE-L | A text-similarity metric using longest-common-subsequence; Self-Instruct uses ROUGE-L < 0.7 to filter near-duplicate instructions. | Section 6 |
| Decontamination | The process of checking the training set does not contain examples from the evaluation benchmarks, usually via n-gram or embedding overlap. | Section 9 |
| Compute-optimal | The Chinchilla scaling result: for a fixed compute budget, models train best at roughly 20 tokens per parameter. | Section 4 |
| Data-optimal regime | Phi-3’s framing: optimize for data quality given a model size, rather than for compute given a token budget. | Section 4 |
| [Analysis] label | The publication’s own reasoned assessment, distinct from what any of the five sources themselves claim. | Throughout |
| [Reviewer Perspective] label | A critical or speculative assessment that goes beyond what any of the five sources prove. | Sections 11 + 12 |

Section 3: Problem formalisation (cluster-wide)

All five works share a single formal setup. There is a target model size $N$ (parameters), a training token budget $T$, and an evaluation suite $E$ (e.g., MMLU, HumanEval, BBH). Classical pre-training picks a corpus $\mathcal{D}_{\text{web}}$ from web scrape; the loss is the standard next-token autoregressive objective

$$\mathcal{L}_{\text{pre}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}_{\text{web}}}\left[\sum_{t=1}^{|x|} \log \pi_\theta(x_t \mid x_{<t})\right]$$

The synthetic-data line replaces or augments $\mathcal{D}_{\text{web}}$ with a curated and/or model-generated corpus $\mathcal{D}_{\text{syn}}$. The research question becomes: for a fixed $(N, T)$, can $\mathcal{D}_{\text{syn}}$ produce a model that beats one trained on $\mathcal{D}_{\text{web}}$ on $E$?

Notation table (cluster-wide).

| Symbol | Type | Meaning | First appears in |
| --- | --- | --- | --- |
| $N$ | Scalar | Student model parameter count | Section 3 |
| $T$ | Scalar | Training token budget | Section 3 |
| $\mathcal{D}_{\text{web}}$ | Corpus | Web-scrape training corpus (baseline) | Section 3 |
| $\mathcal{D}_{\text{syn}}$ | Corpus | Synthetic-data training corpus | Section 3 |
| $\mathcal{D}_{\text{seed}}$ | Corpus | Small human-written seed set (Self-Instruct: 175 tasks) | Section 7 |
| $\pi_T$ | Distribution | Teacher LLM distribution (GPT-3.5 / GPT-4 / Mixtral) | Section 5 |
| $\pi_\theta$ | Distribution | Student model being trained | Section 3 |
| $L$ | Function | Length-normalised log-prob, used in benchmark scoring | Section 6 |
| $E$ | Set | Evaluation benchmark suite | Section 3 |

Assumptions. The cluster makes three load-bearing assumptions, none formally proven across the five works:

  1. Teacher knowledge dominance. The teacher LLM knows things the student does not, and can convey them in generated text. [Analysis] This is trivially satisfied when the teacher is GPT-4 and the student is 1-13B parameters; it is less obviously satisfied when teacher and student are similar size, which is one reason Cosmopedia’s Mixtral-generated data falls short of the Phi-1.5 data quality. 5
  2. Imitation transfer. Imitating teacher outputs transfers underlying capability, not just surface style. Orca’s contribution is largely to push back against this assumption: its motivation is that prior Vicuna-style instruction-tuning “learns to imitate the style, but not the reasoning process” of the teacher. 3
  3. Synthetic-data composability. Filtered web data and synthetic data compose without negative interference. Phi-3 trains in two phases (web emphasis first, synthetic emphasis second), which on the publication’s reading is precisely an engineering response to this assumption being not quite true. 2

Complexity. The bottleneck is teacher-inference cost, not student training. Cosmopedia reports “over 10,000 GPU hours” to generate 25B tokens through Mixtral-8x7B. 5 [Analysis] At ar5iv-quoted Mixtral throughput numbers, this corresponds to roughly the same cost as pre-training a 1B parameter model from scratch, a substantial up-front investment paid in teacher inference. The student-training compute is small by comparison: Phi-1 reports 4 days on 8 A100s for the 1.3B model on 7B tokens. 1
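The two figures quoted above can be put side by side with a few lines of arithmetic. This is only a rough sketch: it compares raw GPU-hours across two different projects, treats “over 10,000 GPU hours” as a lower bound, and ignores that the GPU generations involved may differ. The variable names are illustrative.

```python
# Figures quoted in the paragraph above.
PHI1_DAYS = 4
PHI1_GPUS = 8
COSMOPEDIA_GPU_HOURS = 10_000  # "over 10,000 GPU hours", so a lower bound

# Student training cost for Phi-1, expressed in GPU-hours.
phi1_gpu_hours = PHI1_DAYS * 24 * PHI1_GPUS

# Teacher-inference cost for Cosmopedia relative to that training run.
ratio = COSMOPEDIA_GPU_HOURS / phi1_gpu_hours

print(f"Phi-1 pre-training: {phi1_gpu_hours} GPU-hours")
print(f"Cosmopedia generation vs Phi-1 training: at least {ratio:.1f}x")
```

On these numbers the teacher-inference bill is more than an order of magnitude larger than the Phi-1 training run, which is the sense in which generation rather than training is the bottleneck.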

Section 4: Motivation and gap

Standard scaling literature gives two reference points. Kaplan et al. (2020) established the original power-law scaling relationship between model size, dataset size, and loss; the prescription was to scale all three roughly together. 9 Chinchilla (Hoffmann et al., 2022) refined this to a compute-optimal ratio of roughly 20 tokens per parameter. 8 Both treat the corpus as a given quantity to be enlarged, not a quality to be improved.

The synthetic-data line argues this is the wrong frame for small models. From the Phi-1 paper: “by crafting ‘textbook quality’ data we were able to train a model that surpasses almost all open-source models on coding benchmarks such as HumanEval and MBPP despite being 10x smaller in model size and 100x smaller in dataset size.” 1 The argument is that the signal density of the training corpus matters as much as its raw token count: a small model trained on educationally-curated content can outpace a much larger model trained on web sludge.

[External comparison] Phi-3’s framing for this is the “data-optimal regime”: rather than fix a compute budget and ask how to spend it (Chinchilla’s question), fix a model size and ask what data composition extracts maximum capability from those parameters. 2 The two regimes are not contradictory, they ask different questions, but the synthetic-data papers consistently argue the data-optimal frame is the right one for sub-15B models targeting deployment on consumer hardware.

The practical stakes are large. Phi-3-mini “quantises to ~1.8GB, enabling deployment on iPhone 14 with A16 Bionic chip at 12+ tokens per second” per the Phi-3 report. 2 A model that occupies under 2GB after quantisation and matches Mixtral 8x7B (a model that does not fit on consumer hardware) on the most-cited general-knowledge benchmark is a deployment unlock, not just a benchmark curiosity.

Position in the broader landscape [External comparison]. The synthetic-data line is one of three competing post-Chinchilla scaling strategies. The other two are: (a) mixture-of-experts scaling (Mixtral, DBRX, Phi-3.5-MoE) which decouples active parameter count from total parameter count; (b) inference-time compute scaling (OpenAI o1, DeepSeek-R1) which trades training cost for test-time reasoning steps. Synthetic-data scaling is the most accessible of the three for academic-budget reproduction, which is why Cosmopedia exists and the others do not have public open-replication counterparts at the same scale.

Section 5: Method overview, per source

5.1 Self-Instruct (Wang et al., 2022)

The method is a four-step bootstrap pipeline operating from a 175-task seed bank curated by the authors at the University of Washington. 4 Plain-English intuition: a small set of human-written task examples is used to prompt a large language model to invent more task examples; the inventions are filtered for diversity and quality; the surviving inventions become training data.

Step 1, instruction generation. Eight in-context examples are sampled per call (six from the seed bank, two from previously-generated tasks). The model generates a new instruction. From the paper: this design “promotes diversity while maintaining quality guidance” through the human-written majority of the in-context examples. 4
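The six-plus-two sampling in Step 1 can be sketched as follows. This is a minimal illustration, not the authors’ code: the function names, the fallback to extra seed tasks while the generated pool is still small, and the exact prompt wording are assumptions of this sketch.

```python
import random


def sample_in_context(seed_pool, generated_pool, rng, n_seed=6, n_generated=2):
    """Pick the in-context tasks for one instruction-generation call.

    Falls back to extra seed tasks while the generated pool is too small
    (an assumption of this sketch for the earliest iterations).
    """
    k_gen = min(n_generated, len(generated_pool))
    k_seed = min(n_seed + (n_generated - k_gen), len(seed_pool))
    picked = rng.sample(seed_pool, k_seed) + rng.sample(generated_pool, k_gen)
    rng.shuffle(picked)  # avoid always placing generated tasks last
    return picked


def build_prompt(examples):
    """Format the examples as a numbered list and leave the next slot open."""
    lines = ["Come up with a series of tasks:"]
    lines += [f"Task {i}: {text}" for i, text in enumerate(examples, start=1)]
    lines.append(f"Task {len(examples) + 1}:")  # the teacher completes this line
    return "\n".join(lines)
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible across runs.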

Step 2, classification identification. The teacher model is prompted few-shot to decide whether the generated instruction is a classification task or a free-form generation task. From the paper: 12 classification and 19 non-classification seed examples are used for the few-shot prompt. 4

Step 3, instance generation. Two variants are used depending on Step 2’s verdict. Input-first: generate input fields conditioned on the instruction, then generate the output. Output-first: for classification tasks, generate possible labels first, then inputs conditioned on the labels. This ordering prevents label-bias degeneration.

Step 4, filtering and post-processing. Instructions with ROUGE-L similarity below 0.7 to all existing instructions are kept (the threshold is for novelty, so low ROUGE-L means kept). Heuristics drop entries containing “image” or “graph” since the text-only teacher cannot ground those.

Design rationale. The pipeline is shaped by what GPT-3 davinci can and cannot do reliably. The two-step instance generation handles classification vs free-form differently because GPT-3 was observed to collapse to the majority label otherwise. The ROUGE-L threshold is empirically tuned; below 0.5 admits irrelevant tasks, above 0.8 admits duplicates. Novelty classification: [New], the four-step bootstrap is the paper’s central methodological contribution and was widely re-used downstream (Alpaca 10 reproduced it almost verbatim).

5.2 Orca (Mukherjee et al., 2023)

Orca is an explanation-tuned distillation of GPT-4 into a 13-billion-parameter LLaMA-based student. 3 Plain-English intuition: rather than train the student on the teacher’s short final answers, the student is trained on the teacher’s step-by-step explanation traces. The student learns to think, not just to answer.

The method begins from the FLAN-v2 collection. From the paper, this is a “5 million examples” corpus of instruction-response pairs spanning hundreds of tasks. The Orca contribution is to augment these prompts with explanation-eliciting system messages. By the publication’s reading of the methodology section, 16 such system messages are hand-crafted, each instructing GPT-4 to “explain like I am 5,” “think step by step,” “justify your steps,” “include a chain of thought,” and similar variants. The resulting GPT-4 responses are several times longer than vanilla outputs and contain explicit reasoning steps.

A second design choice is progressive learning: the student is first trained on ChatGPT-3.5 traces (cheaper, more abundant) before being trained on GPT-4 traces (richer, more expensive). The framing in the paper is that the easier curriculum makes the harder one more learnable. [Reviewer Perspective] This is plausibly also driven by economics. ChatGPT calls were vastly cheaper at the time than GPT-4 calls, and a 5M-example dataset generated entirely by GPT-4 would have been prohibitively expensive.

Novelty classification: [Adapted], the FLAN-v2 base set and the LLaMA student are inherited; the system-message-induced explanation augmentation and the progressive curriculum are the Orca-specific contributions.

5.3 Phi-1 (Gunasekar et al., 2023)

Phi-1 is the first cluster paper to move synthetic data from instruction-tuning into pre-training. The training mix has three components per ar5iv-extracted Section 2: 1

  1. Filtered code-language dataset (~6B tokens). A subset of The Stack and StackOverflow filtered by a classifier trained on ~100K GPT-4-annotated code samples for “educational value to a student learning basic coding concepts.” A CodeGen-embedding-based random forest predicts the educational value of each sample; below-threshold samples are dropped. From the paper, this filtering alone moves a 350M model from 12.19 percent to 17.68 percent on HumanEval.
  2. Synthetic textbook dataset (<1B tokens). GPT-3.5 generates Python textbooks aimed at algorithmic reasoning. Diversity is induced by varying topics and target audience in the generation prompt.
  3. CodeExercises dataset (~180M tokens). GPT-3.5 generates Python exercises with docstring + function-body format. Diversity is induced by varying the function name in the generation prompt.
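The threshold step of the first component can be sketched with an injected scoring function. This is a hedged illustration only: `filter_by_educational_value`, the 0.5 threshold, and `toy_score` are hypothetical names and values, and the toy heuristic is a stand-in for the embedding-based random forest described above, not a reproduction of it.

```python
from typing import Callable, Iterable


def filter_by_educational_value(
    samples: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> list:
    """Keep samples whose predicted educational value meets the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]


def toy_score(code: str) -> float:
    """Hypothetical stand-in scorer: share of non-empty lines that are comments.

    The classifier described in the text is a random forest over code
    embeddings; this heuristic exists only so the sketch runs end to end.
    """
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    commented = sum(ln.startswith("#") for ln in lines)
    return commented / len(lines)
```

Swapping `toy_score` for a trained classifier’s probability output would leave the filtering logic unchanged, which is the point of injecting the scorer.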

The total training corpus is under 7 billion tokens, which the paper describes as ~100x smaller than competing code models’ training corpora. The 1.3B base model is trained from scratch for 4 days on 8 A100 GPUs. From the paper, this is the headline efficiency claim of the work.

Novelty classification: [New], the central contribution is the demonstration that pre-training (not instruction-tuning) can use synthetic data, and that the resulting model can be both small and benchmark-competitive.

5.4 Phi-3 (Abdin et al., 2024)

Phi-3 productionises the Phi-1 recipe at much larger scale. The training mix is, by the authors’ own framing, “a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data.” 2

Three model sizes: phi-3-mini (3.8B parameters, 3.3T training tokens), phi-3-small (7B parameters, 4.8T tokens), phi-3-medium (14B parameters, 4.8T tokens). The architecture is a standard decoder-only transformer with grouped-query attention and SwiGLU; phi-3-mini uses LongRope-style positional interpolation to extend the context window from 4K to 128K tokens in the long-context variant. Post-training uses supervised fine-tuning followed by direct-preference-optimization-style alignment.

The “data-optimal regime” framing the paper introduces: rather than train at compute-optimal ratios (Chinchilla’s ~20 tokens/parameter), Phi-3-mini trains at roughly 870 tokens/parameter, a vast over-training relative to Chinchilla’s prescription. From the paper, this is justified by the observation that for a fixed target deployment size, more high-quality training data continues to improve the model well past Chinchilla’s compute-optimal point. [Analysis] This is a “Chinchilla optimal is wrong for inference-optimal” argument; the same argument shows up in Llama-3 8B’s training (15T tokens on 8B parameters = ~1875 tokens/parameter).
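The tokens-per-parameter ratios quoted above follow directly from the model sizes and token budgets. A short sketch, using only the figures given in this section (the dictionary layout and names are illustrative):

```python
CHINCHILLA_RATIO = 20  # approximate compute-optimal tokens per parameter

# (parameters, training tokens) as quoted in the text above
models = {
    "phi-3-mini": (3.8e9, 3.3e12),
    "phi-3-small": (7e9, 4.8e12),
    "phi-3-medium": (14e9, 4.8e12),
    "Llama-3 8B": (8e9, 15e12),
}

# Tokens seen per parameter for each model.
ratios = {name: tokens / params for name, (params, tokens) in models.items()}

for name, r in ratios.items():
    print(f"{name}: {r:.0f} tokens/param, {r / CHINCHILLA_RATIO:.0f}x the Chinchilla ratio")
```

Every model in the list sits far above the roughly 20:1 prescription, with the smallest models over-trained the most.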

Novelty classification: [Adapted], the synthetic-data philosophy is inherited from Phi-1; the two-phase data curriculum, the LongRope context extension, and the safety post-training are Phi-3-specific contributions.

5.5 Cosmopedia (Hugging Face, 2024)

Cosmopedia is the open replication of (approximately) the Phi-1.5 / Phi-2 training data, generated entirely with the open-weight Mixtral-8x7B-Instruct. 5 6

The methodology multiplies prompt-coverage through audience × format conditioning. Four audiences (young children, high school students, college students, researchers/professionals) crossed with three styles (textbooks, blog posts, WikiHow articles) yield 12x prompt multiplication from a single topic. From the paper: roughly 20 percent of prompts come from curated sources (Stanford courses, Khan Academy, OpenStax, WikiHow) and 80 percent from RefinedWeb topic clusters (145 clusters, 112 retained after filtering, ~23M prompts generated). The final dataset is 25B tokens across 30M+ files, and MinHash deduplication finds fewer than 1 percent duplicates.
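The audience × style multiplication is a Cartesian product over the two lists. A minimal sketch follows; the prompt wording and the function name `expand_topic` are assumptions for illustration, not the prompts Cosmopedia actually used.

```python
from itertools import product

AUDIENCES = [
    "young children",
    "high school students",
    "college students",
    "researchers and professionals",
]
STYLES = ["textbook", "blog post", "WikiHow article"]


def expand_topic(topic: str) -> list:
    """Return one generation prompt per (audience, style) pair for a topic."""
    return [
        f"Write a {style} about '{topic}' for an audience of {audience}."
        for audience, style in product(AUDIENCES, STYLES)
    ]
```

Calling `expand_topic("photosynthesis")` yields 12 distinct prompts, which is the 12x multiplication described above.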

Novelty classification: [Adapted], the synthetic-data-for-pre-training idea is inherited from Phi-1.5; the audience/format multiplication and the open release with code are Cosmopedia-specific contributions.

Figure 2.1 of Gunasekar et al. Textbooks Are All You Need (arXiv:2306.11644) showing pass@1 HumanEval performance of phi-1 (1.3B) against larger code models including StarCoder and Replit-Code, the headline efficiency result that motivated subsequent Phi versions

Section 6: Mathematical contributions

The five papers are predominantly empirical; the math content is concentrated in the loss formulations and the scaling arguments. The three load-bearing math objects across the cluster are below.

MATH ENTRY 1: Next-token pre-training loss (cluster baseline)

  • Source: Standard objective used by all five works; explicit in Phi-1 Section 2 and Phi-3 Section 2.
  • What it is: The expected negative log-probability the model assigns to each next token in the training corpus.
  • Formal definition: $\mathcal{L}_{\text{pre}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{|x|} \log \pi_\theta(x_t \mid x_{<t})\right]$
  • Each term explained AND dimensional/type analysis:
    • $\theta$ is the parameter vector of the student, of size $N$ (e.g., $N = 3.8 \times 10^9$ for phi-3-mini).
    • $x \sim \mathcal{D}$ is a document sampled from the training corpus $\mathcal{D}$; $|x|$ is its length in tokens.
    • $\pi_\theta(x_t \mid x_{<t})$ is a scalar probability in $(0, 1)$.
    • $\log \pi_\theta(\cdot)$ is a scalar in $(-\infty, 0]$.
    • The inner sum adds $|x|$ such terms and is itself a scalar in $(-\infty, 0]$.
    • The outer expectation is a scalar, the loss.
  • Worked numerical example. Take a 4-token sequence $x = [t_1, t_2, t_3, t_4]$. Suppose the model’s per-token probabilities under the current parameters are $[0.5, 0.25, 0.125, 0.1]$. The per-token log-probs (natural log) are $[-0.693, -1.386, -2.079, -2.303]$. The sequence log-likelihood is $-6.461$. The negative-log-likelihood loss for this single sequence is $6.461$ nats, or about $6.461 / (4 \cdot \ln 2) \approx 2.33$ bits per token (perplexity $= 2^{2.33} \approx 5.0$). The expectation in $\mathcal{L}_{\text{pre}}$ averages this over many such sequences sampled from $\mathcal{D}$.
  • Role: The loss every model in the cluster minimises during pre-training. The contribution of the cluster is not to change this loss, but to change $\mathcal{D}$.
  • Edge cases: undefined when the model assigns zero probability to a true token (log of zero). In practice a softmax output is strictly positive in exact arithmetic, and floating-point underflow is avoided by computing log-softmax directly rather than taking the log of a softmax.
  • Novelty: [Adopted], standard since Bengio et al. 2003 neural language models.
  • Transferability: [Analysis] Universally transferable; the math is invariant to data source.
  • Why it matters: This is the lens through which the cluster’s contribution is visible: same loss, different $\mathcal{D}$, dramatically different result.
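The worked numerical example in this entry can be recomputed in a few lines. This sketch uses only the four per-token probabilities from the example; the variable names are illustrative.

```python
import math

# Probability the model assigns to each true token in the 4-token example.
probs = [0.5, 0.25, 0.125, 0.1]

# Sequence-level negative log-likelihood, in nats.
nll_nats = -sum(math.log(p) for p in probs)

# Per-token loss converted from nats to bits.
bits_per_token = nll_nats / (len(probs) * math.log(2))

# Perplexity is the exponential of the mean per-token loss in nats.
perplexity = math.exp(nll_nats / len(probs))

print(f"{nll_nats:.3f} nats, {bits_per_token:.2f} bits/token, perplexity {perplexity:.2f}")
```

Perplexity can be read as the effective number of equally likely choices the model is hedging between at each step.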

MATH ENTRY 2: Chinchilla compute-optimal ratio (the baseline scaling prescription Phi-3 deliberately violates)

  • Source: Hoffmann et al. 2022; Phi-3 Section 1 cites this as the comparison point. 8
  • What it is: The empirical claim that for a fixed compute budget $C$, the optimal model size $N^*$ and dataset size $T^*$ scale together such that roughly $T^* / N^* \approx 20$.
  • Formal definition: Given total compute $C \approx 6 N T$ (the standard forward+backward FLOP estimate), the compute-optimal allocation satisfies $N^* \propto C^{0.5}$ and $T^* \propto C^{0.5}$, i.e., $T^* / N^* \approx \text{const}$.
  • Each term explained AND dimensional/type analysis:
    • $C$ is total training FLOPs, dimensionless.
    • $N$ is parameter count, dimensionless.
    • $T$ is training tokens, dimensionless.
    • The constant 6 is approximate: 2 FLOPs per parameter per token for the forward pass, 4 for the backward.
  • Worked numerical example. A compute budget of $C = 10^{22}$ FLOPs. Chinchilla-optimal: $N \approx \sqrt{C/(6 \cdot 20)} = \sqrt{10^{22} / 120} \approx 9.1 \times 10^9$, so $N \approx 9.1$B parameters; $T = 20 N \approx 1.8 \times 10^{11} = 180$B tokens. By contrast, phi-3-mini’s actual choice is $N = 3.8$B and $T = 3.3$T, so $T / N \approx 870$, well above the Chinchilla 20. The implied compute is $6 \times 3.8 \times 10^9 \times 3.3 \times 10^{12} \approx 7.5 \times 10^{22}$ FLOPs, about 7.5 times the example budget, spent on a much smaller, much more over-trained model.
  • Role: The reference point the synthetic-data cluster argues against for inference-deployment settings.
  • Edge cases: Chinchilla’s measurements were on web-data corpora; the law has not been re-derived for high-quality synthetic-data corpora. Whether the same 20:1 ratio holds when $\mathcal{D}$ is Cosmopedia rather than RefinedWeb is an open question (the cluster does not answer it formally).
  • Novelty: [Adopted]. Chinchilla’s contribution, cited as comparison.
  • Why it matters: Phi-3’s “data-optimal” framing is most precisely understood as choosing a different objective from Chinchilla’s compute-optimal one. They are not in conflict; they answer different questions.
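The allocation in this entry amounts to solving $C = 6NT$ under the constraint $T = 20N$. A minimal sketch, assuming the approximate constants stated in the entry; the function names are illustrative.

```python
import math


def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*T with T = tokens_per_param * N for (N, T)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~ 6*N*T estimate of total training compute."""
    return 6.0 * n_params * n_tokens


# Compute-optimal split of the example budget from this entry.
n_opt, t_opt = chinchilla_allocation(1e22)

# Compute implied by phi-3-mini's actual (N, T) choice.
phi3_mini_flops = training_flops(3.8e9, 3.3e12)

print(f"N* = {n_opt:.2e}, T* = {t_opt:.2e}, phi-3-mini C = {phi3_mini_flops:.2e}")
```

Changing `tokens_per_param` shows how a fixed budget trades model size against data: a higher ratio buys a smaller, longer-trained model for the same FLOPs.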

MATH ENTRY 3: Self-Instruct ROUGE-L filter

  • Source: Wang et al. 2022 Section 2.4. 4
  • What it is: A diversity filter that keeps a newly-generated instruction only if its longest-common-subsequence overlap with every existing instruction is below 0.7.
  • Formal definition: For candidate instruction $i_{\text{new}}$ and pool $\mathcal{I}$, $\text{keep}(i_{\text{new}}) = \mathbb{1}\left[\max_{i \in \mathcal{I}} \text{ROUGE-L}(i_{\text{new}}, i) < 0.7\right]$ where $\text{ROUGE-L}(a, b) = \frac{2 \cdot \text{LCS}(a, b)}{|a| + |b|}$, $\text{LCS}$ is longest common subsequence length, and $|a|$ is token length.
  • Each term explained AND dimensional/type analysis:
    • $\text{LCS}(a, b)$ is a scalar in $[0, \min(|a|, |b|)]$.
    • $\text{ROUGE-L}(a, b)$ is a scalar in $[0, 1]$.
    • The indicator returns 0 or 1.
  • Worked numerical example. Candidate: “Translate the following English sentence to French.” Existing in pool: “Translate the following English text to Spanish.” Tokenising into words, LCS is “Translate the following English to” (length 5); $|a| = 7$, $|b| = 7$, so ROUGE-L $= 2 \cdot 5 / (7+7) \approx 0.71$. This is above the 0.7 threshold so the candidate is rejected. If instead the existing instruction were “Identify the named entities in the following English text,” the LCS is only “the following English” (length 3) against $|b| = 9$, so ROUGE-L $= 2 \cdot 3 / (7+9) \approx 0.38$, below 0.7, and the candidate is kept.
  • Role: The diversity-induction mechanism that prevents Self-Instruct from generating a 52K-instruction set composed of paraphrases of the same 100 instructions.
  • Edge cases: Computational cost is $O(|\mathcal{I}| \cdot L^2)$ per candidate where $L$ is instruction length; the paper amortises this with embedding-based filtering on large pools.
  • Novelty: [Adapted]. ROUGE-L is a 2004 summarisation metric (Lin), re-purposed here as a deduplication filter.
  • Why it matters: The filter is the choke point that determines whether the synthetic dataset has 52K distinct tasks or 52K paraphrases of a small core. Without it, teacher LLMs collapse to high-frequency modes.
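The filter defined in this entry can be sketched with a standard dynamic-programming LCS. This follows the formula as written above ($2 \cdot \text{LCS} / (|a| + |b|)$ on word tokens); the tokeniser (lowercasing, stripping surrounding punctuation) and the function names are assumptions of this sketch rather than the paper’s implementation.

```python
def tokenize(text: str) -> list:
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [w for w in stripped if w]


def lcs_length(a: list, b: list) -> int:
    """Longest-common-subsequence length via the classic O(|a|*|b|) DP."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(a: str, b: str) -> float:
    """ROUGE-L as defined in this entry: 2*LCS / (|a| + |b|)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return 2 * lcs_length(ta, tb) / (len(ta) + len(tb))


def keep(candidate: str, pool: list, threshold: float = 0.7) -> bool:
    """Keep the candidate only if it is below threshold against every pool item."""
    return all(rouge_l(candidate, existing) < threshold for existing in pool)
```

Because `keep` compares against every pool member, cost grows linearly with pool size, which is the scaling concern the edge-case bullet raises.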

Section 7: Algorithmic contributions

The cluster has three load-bearing algorithms. The headline one, Self-Instruct’s bootstrap, is reproduced as a PNG figure below; the others are inlined as MDX code-fence pseudocode.

Reproduced pseudocode of Self-Instruct's four-step bootstrap algorithm: instruction generation, classification identification, instance generation, and ROUGE-L filtering, drawn from Wang et al. 2022 Section 2 and rendered as a code-block image for the article's headline algorithm reproduction

ALGORITHM ENTRY 1: Self-Instruct bootstrap (headline; reproduced above as PNG)

  • Source: Wang et al. 2022 Sections 2.1-2.4. 4
  • Purpose: Expand a 175-task human seed bank into a 52K-task synthetic instruction-tuning corpus using a teacher LLM.
  • Inputs: Seed pool $\mathcal{I}_{\text{seed}}$ (175 hand-written instructions); teacher LLM $\pi_T$ (GPT-3 davinci in the original paper); target pool size $M$ (52K).
  • Outputs: Final instruction pool $\mathcal{I}$ with $|\mathcal{I}| \approx M$; instance set $\{(x, y)\}$ totalling ~82K input/output pairs.
  • Hand-traced example on minimal input. Take a seed pool $\mathcal{I}_{\text{seed}}$ of three instructions: “Write a sentence about a cat,” “Classify the sentiment of a review” (positive/negative/neutral), and “Translate English to French.” Iteration 1: sample 8 in-context examples (but here only 3 exist, so use all 3 plus copies). Teacher generates: “Summarise a paragraph in one sentence.” Step 2 classifier: non-classification. Step 3 input-first: teacher generates input “The fox jumps over the lazy dog. The dog barks. The fox runs away.” then output “A fox jumps over a dog, the dog barks, the fox runs away.” Step 4 filter: ROUGE-L vs existing pool, all are low overlap; instruction is kept. Pool now: 4 instructions. Iteration 2: teacher generates “Translate the following English sentence to French”. Against the existing “Translate English to French” the LCS is 4 words, so ROUGE-L $= 2 \cdot 4 / (7 + 4) \approx 0.73$; filter rejects. Pool stays at 4. Iteration 3: teacher generates “Identify the part of speech of the underlined word.” Classifier: classification. Step 3 output-first: teacher generates labels “noun, verb, adjective, adverb” then inputs conditioned on each label. Filter: kept. Pool now: 5 instructions. Repeat until the pool reaches the 52K target.
  • Complexity: Teacher-call-bound: roughly $4M$ teacher calls (one for instruction generation, one for classification, two for instance generation and filtering). Bottleneck: teacher inference latency. For GPT-3 davinci at 2022 rates, the paper does not report total cost but [External comparison] Stanford Alpaca’s reproduction of this exact pipeline cost roughly $500 in OpenAI credits for 52K instructions. 10
  • Hyperparameters: Number of in-context examples (8: 6 human + 2 generated); ROUGE-L threshold (0.7); classification/non-classification ratio in step-2 prompt (12:19); max instruction length (cut at 256 tokens).
  • Failure modes: Teacher collapses to a small set of high-frequency task types (mitigated by ROUGE-L filter). Generated outputs are wrong even when generated inputs are valid; paper’s manual quality check finds 92 percent of instructions valid but only 58 percent of full input-output pairs correct. Tasks involving images, graphs, or audio cannot be expressed in text and pollute the pool (mitigated by keyword heuristics).
  • Novelty: [New]. The four-step bootstrap is the paper’s central contribution.
  • Transferability: [Analysis] Highly transferable. The same pattern with a stronger teacher (GPT-4 instead of davinci) and a domain-specific seed bank has been used to generate medical, legal, and code-specific instruction sets in dozens of follow-up works.
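The step-4 filter in the hand trace above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes lowercased whitespace tokenisation and the balanced F1 form of ROUGE-L, and the function names (`lcs_length`, `rouge_l_f1`, `try_add`) are hypothetical. Because the score depends on tokenisation and on how precision and recall are weighted, it need not reproduce the 0.71 quoted in the trace, but the translation duplicate still lands above the 0.7 threshold and is rejected.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (an assumed tokenisation)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def try_add(pool, candidate, threshold=0.7):
    """Append candidate only if it scores below threshold against every pooled instruction."""
    if all(rouge_l_f1(candidate, existing) < threshold for existing in pool):
        pool.append(candidate)
        return True
    return False


pool = [
    "Write a sentence about a cat",
    "Classify the sentiment of a review",
    "Translate English to French",
]
print(try_add(pool, "Summarise a paragraph in one sentence"))               # kept
print(try_add(pool, "Translate the following English sentence to French"))  # rejected
print(len(pool))
```

Each candidate is compared against the whole pool, so the filter cost grows with pool size; at 52K instructions that is still cheap next to the teacher calls.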

ALGORITHM ENTRY 2: Orca explanation-tuned data generation

  • Source: Mukherjee et al. 2023 Section 3. 3
  • Purpose: Generate richly-explained training data from GPT-4 on top of FLAN-v2 prompts.
  • Inputs: FLAN-v2 prompt pool $\mathcal{P}$ (~5M examples); 16 system messages $\{s_1, \ldots, s_{16}\}$, each requesting a different explanation style; teacher $\pi_T$ (ChatGPT-3.5 then GPT-4).
  • Outputs: Tuples $(s_i, p, r)$ where $p \in \mathcal{P}$ is a prompt, $s_i$ is a system message, and $r$ is the teacher’s explained response.
# Orca explanation-tuned generation, reconstructed from Section 3
for prompt p in FLAN_v2 pool P:
    sample system message s_i from {s_1, ..., s_16}  # uniform
    response_chatgpt = ChatGPT_3.5(system=s_i, user=p)
    add (s_i, p, response_chatgpt) to D_chatgpt
# After 5M ChatGPT examples, switch teacher to GPT-4
for prompt p in subset_P of P:
    sample system message s_i from {s_1, ..., s_16}
    response_gpt4 = GPT_4(system=s_i, user=p)
    add (s_i, p, response_gpt4) to D_gpt4
# Train Orca progressively: first on D_chatgpt, then on D_gpt4
  • Hand-traced example on minimal input. $p$ = “What is 12 × 13?”; $s_i$ = “Explain your reasoning step by step before giving the final answer.” ChatGPT-3.5 response: “I will multiply 12 by 13. I can decompose: 12 × 13 = 12 × (10 + 3) = 120 + 36 = 156. Final answer: 156.” This response is added to $\mathcal{D}_{\text{chatgpt}}$. Later GPT-4 is prompted with the same prompt and system message; GPT-4 produces a longer, more elaborate reasoning trace. The student is trained first on $\mathcal{D}_{\text{chatgpt}}$ (5M examples) then on $\mathcal{D}_{\text{gpt4}}$ (a smaller subset).
  • Complexity: Teacher-call-bound; GPT-4 call cost is the binding constraint, which is why the GPT-4 set is a subset of the ChatGPT-3.5 set.
  • Hyperparameters: Number of system messages (16); ChatGPT-to-GPT-4 dataset-size ratio (~5x in the paper); learning-rate schedule used in progressive training.
  • Failure modes: Style drift: the student learns the surface tics of step-by-step explanation without internalising the reasoning. The paper’s framing is that this is precisely what prior Vicuna-style training did, and explanation-tuning fixes it; [Reviewer Perspective] independent benchmarking has been mixed on whether Orca-13B really reasons better, or merely talks more like a model that reasons.
  • Novelty: [Adapted]. The FLAN-v2 base is inherited; the system-message augmentation and progressive curriculum are Orca-specific.
  • Transferability: [Analysis] The system-message-augmented prompting pattern is highly transferable; the specific ChatGPT-then-GPT-4 curriculum is dependent on having teacher access to two capability levels, which not every lab has.
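The reconstructed pseudocode for this entry can be made runnable with stub teachers. The sketch below is illustrative only: `stub_chatgpt` and `stub_gpt4` are hypothetical stand-ins for real API calls, the system messages are placeholders, the uniform sampling follows the pseudocode's own assumption, and the 5:1 split mirrors the ~5x dataset-size ratio on a toy pool of ten prompts.

```python
import random


def generate_explanation_data(prompts, system_messages, teacher, rng):
    """Pair each prompt with a uniformly sampled system message and query the teacher."""
    data = []
    for prompt in prompts:
        system = rng.choice(system_messages)
        data.append((system, prompt, teacher(system, prompt)))
    return data


# Hypothetical stand-ins for the two teachers; a real run would call an LLM API here.
def stub_chatgpt(system, user):
    return f"[short explanation] {user}"


def stub_gpt4(system, user):
    return f"[detailed explanation] {user}"


rng = random.Random(0)
system_messages = [f"system message {i}" for i in range(1, 17)]  # 16 explanation styles
flan_prompts = [f"FLAN prompt {i}" for i in range(10)]           # toy prompt pool

d_chatgpt = generate_explanation_data(flan_prompts, system_messages, stub_chatgpt, rng)

# The GPT-4 set covers only a subset of the prompt pool (assumed 5:1 here).
gpt4_prompts = rng.sample(flan_prompts, k=len(flan_prompts) // 5)
d_gpt4 = generate_explanation_data(gpt4_prompts, system_messages, stub_gpt4, rng)

# Progressive curriculum: the student sees D_chatgpt first, then D_gpt4.
curriculum = [("stage 1: ChatGPT traces", d_chatgpt), ("stage 2: GPT-4 traces", d_gpt4)]
```

Passing the teacher in as a callable keeps the generation loop identical for both stages, which is the point of the curriculum: only the teacher and the prompt subset change.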

ALGORITHM ENTRY 3: Phi-1 educational-value classifier filter

  • Source: Gunasekar et al. 2023 Section 2.2. 1
  • Purpose: Filter The Stack + StackOverflow code corpora to retain only educationally-valuable samples for training.
# Phi-1 educational-value filter, reconstructed
# Step 1: annotate seed
sample 100k code snippets from The Stack
for snippet c in sample:
    score = GPT_4(prompt="Rate educational value 1-5 for a student learning Python: {c}")
    add (c, score) to D_annotated

# Step 2: train classifier
embeddings = CodeGen_embed(D_annotated.snippets)
classifier = RandomForest.fit(embeddings, D_annotated.scores)

# Step 3: filter full corpus
for snippet c in The_Stack ∪ StackOverflow:
    e = CodeGen_embed(c)
    if classifier.predict(e) >= threshold:
        keep c

# Step 4: combine
CodeTextbook = (filtered_corpus, GPT_3.5_synthetic_textbooks)
CodeExercises = GPT_3.5_synthetic_python_exercises
training_set = CodeTextbook + CodeExercises  # ≈7B tokens
  • Hand-traced example. Annotate snippet def fib(n): return 1 if n<2 else fib(n-1)+fib(n-2) with GPT-4 → score 4 (clean idiomatic recursion, illustrative). Annotate x=1;y=2;print(x+y) → score 2 (trivial, not pedagogically rich). Annotate <huge auto-generated config file> → score 1. Random forest learns: small idiomatic functions get high scores; auto-generated config gets low scores. Filter the full Stack; 6B tokens survive.
  • Complexity: GPT-4 calls for the 100K annotation pass are the up-front cost (≈$10K at 2023 OpenAI rates [External comparison]); classifier inference scales linearly with corpus size and is cheap.
  • Hyperparameters: Annotation pool size (100K); classifier type (random forest); decision threshold (paper does not specify a numeric value; tuned to retain ~6B tokens).
  • Failure modes: The classifier inherits GPT-4’s biases about “educational value,” which favour conventional textbook-style code over production code. [Reviewer Perspective] Phi-1 is therefore expected to be weaker on production-code patterns (async, generators, framework-specific idioms) than on textbook patterns; the paper does not investigate this.
  • Novelty: [New]. The classifier-based educational-value filter is the Phi-1-specific data-curation contribution.
  • Transferability: [Analysis] Highly transferable: Cosmopedia uses an analogous 1-10 educational-value scoring step on RefinedWeb topic clusters. 5
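Since the decision threshold is not given numerically, one way to picture the "tuned to retain a token budget" step is to sort snippets by classifier score and keep lowering the threshold until the budget is reached. The sketch below is an assumption about how such tuning could work, not a procedure taken from the Phi-1 paper; `pick_threshold` and the toy scores are hypothetical.

```python
from itertools import groupby


def pick_threshold(scored_snippets, token_budget):
    """Return (threshold, kept_tokens): the lowest score cutoff that stays within budget.

    scored_snippets is a list of (classifier_score, token_count) pairs. Snippets with
    equal scores are admitted or rejected together so that `score >= threshold`
    reproduces exactly the kept set.
    """
    kept_tokens = 0
    threshold = float("inf")  # keep nothing if even the top-scored group is too large
    ordered = sorted(scored_snippets, key=lambda s: s[0], reverse=True)
    for score, group in groupby(ordered, key=lambda s: s[0]):
        group_tokens = sum(tokens for _, tokens in group)
        if kept_tokens + group_tokens > token_budget:
            break
        kept_tokens += group_tokens
        threshold = score
    return threshold, kept_tokens


# Toy corpus: (predicted educational value, token count)
scored = [(4.6, 120), (4.1, 300), (3.9, 250), (2.2, 900), (1.0, 5000)]
threshold, kept_tokens = pick_threshold(scored, token_budget=700)
kept_snippets = [s for s in scored if s[0] >= threshold]
print(threshold, kept_tokens, len(kept_snippets))
```

Tuning by budget rather than by a fixed score makes the filter's severity depend on corpus size, which is one reason a numeric threshold from one corpus would not transfer directly to another.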

Section 8: Specialised design contributions

8A. LLM / prompt design. Three of the five works rely on LLM prompting as a load-bearing pipeline component (Self-Instruct, Orca, Cosmopedia); Phi-1 and Phi-3 inherit synthetic data from upstream and do not author novel prompt templates in the published papers.

PROMPT ENTRY 1 (Self-Instruct instruction-generation template, reconstructed).

  • Source: Wang et al. 2022 Appendix A. 4
  • Role in pipeline: Step 1 of the bootstrap.
  • Prompt type: Few-shot (8 in-context examples: 6 human, 2 generated).
  • Input schema: list of 8 (task instruction, optional input, optional output) examples.
  • Output schema: a single new task instruction string.
  • Reconstructed template:
Come up with a series of tasks. Try to be as diverse as possible.
Task: {seed_instruction_1}
Task: {seed_instruction_2}
...
Task: {seed_instruction_6}
Task: {generated_instruction_1}
Task: {generated_instruction_2}
Task: [the model generates here]
  • Failure handling: outputs containing “image,” “graph,” “audio,” “video” are dropped by heuristic.
  • Design rationale: the human-majority in-context examples anchor the teacher to high-quality task formats; the two generated examples introduce diversity beyond the seed bank.
  • Complexity: ~52K calls in the full run.
  • Novelty: [New].
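Assembling the reconstructed template and applying the keyword heuristic might look like the following. This is a sketch under assumptions: the helper names are hypothetical, the fallback to extra human examples when fewer than two generated ones exist is a choice made here, and the word-boundary regex is one way to avoid dropping harmless words such as "paragraph" when blocking "graph".

```python
import random
import re

# Word-boundary match so that "paragraph" does not trigger the "graph" rule.
_BLOCKED = re.compile(r"\b(?:image|graph|audio|video)s?\b", re.IGNORECASE)


def build_instruction_prompt(human_pool, generated_pool, rng, n_human=6, n_generated=2):
    """Assemble the few-shot prompt: human-written tasks, generated tasks, then an open slot."""
    total = n_human + n_generated
    picked_generated = rng.sample(generated_pool, min(n_generated, len(generated_pool)))
    # Early in the bootstrap there may be no generated tasks yet; top up with human ones.
    picked_human = rng.sample(human_pool, min(total - len(picked_generated), len(human_pool)))
    lines = ["Come up with a series of tasks. Try to be as diverse as possible."]
    lines += [f"Task: {task}" for task in picked_human + picked_generated]
    lines.append("Task:")
    return "\n".join(lines)


def passes_keyword_heuristic(instruction):
    """Reject instructions that mention modalities a text-only model cannot handle."""
    return _BLOCKED.search(instruction) is None


rng = random.Random(0)
human_pool = [f"seed instruction {i}" for i in range(10)]
generated_pool = ["generated instruction A", "generated instruction B", "generated instruction C"]

prompt = build_instruction_prompt(human_pool, generated_pool, rng)
print(prompt)
print(passes_keyword_heuristic("Describe the image in one sentence"))
print(passes_keyword_heuristic("Summarise a paragraph in one sentence"))
```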

PROMPT ENTRY 2 (Cosmopedia audience/format conditioning, reconstructed from blog post).

  • Source: Hugging Face Cosmopedia blog Section 3. 5
  • Role: every Mixtral generation call uses an explicit (topic, audience, format) tuple in the system prompt.
  • Reconstructed template:
Write a {format} on the topic {topic} aimed at {audience}.
The text should be informative, accurate, and engaging.
[topic-specific subsection structure injected here]
  • where format ∈ {textbook, blog post, WikiHow article}, audience ∈ {young children, high school students, college students, researchers/professionals}.
  • Design rationale: the 12 combinations multiply prompt diversity 12x from a fixed topic pool; without this, Mixtral was observed to generate near-duplicate outputs from similar topics.
  • Complexity: 30M+ calls in the full run, ~10K GPU hours.
  • Novelty: [New].
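The audience × format multiplication is a plain cross product, which a short sketch makes concrete. The template text follows the reconstruction above (without the topic-specific subsection structure); `expand_topic` is a hypothetical helper, not a function from the Cosmopedia codebase.

```python
from itertools import product

FORMATS = ["textbook", "blog post", "WikiHow article"]
AUDIENCES = [
    "young children",
    "high school students",
    "college students",
    "researchers/professionals",
]

TEMPLATE = (
    "Write a {format} on the topic {topic} aimed at {audience}.\n"
    "The text should be informative, accurate, and engaging."
)


def expand_topic(topic):
    """Return one prompt per (format, audience) pair for a single topic."""
    return [
        TEMPLATE.format(format=fmt, topic=topic, audience=audience)
        for fmt, audience in product(FORMATS, AUDIENCES)
    ]


prompts = expand_topic("photosynthesis")
print(len(prompts))  # 3 formats x 4 audiences = 12
print(prompts[0])
```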

8B. Architecture-specific details. Phi-3-mini follows the Llama-2 block structure (SwiGLU feed-forward, 32 layers, 32 attention heads); per the Phi-3 report, grouped-query attention (4 queries sharing 1 key) is a feature of phi-3-small rather than phi-3-mini. The long-context variant uses LongRoPE positional interpolation to reach 128K tokens. Orca uses the unmodified LLaMA architecture. Phi-1 uses a 24-layer decoder transformer with rotary positional embeddings. None of the five works claim architectural novelty; the contribution is data. 1 2

8C. Training specifics. Phi-1: 4 days on 8 A100s, 1.3B parameters, ~7B tokens, fp16. 1 Phi-3-mini: not fully reported in the paper but [Analysis] estimated at ~10K GPU-days on H100s given 3.3T tokens at 3.8B parameters. Orca: 13B parameters on FLAN-v2 derivative data; the paper does not fully disclose training compute. Cosmo-1B: trained on Cosmopedia 25B tokens; training cost not the bottleneck.
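The article does not show how the Phi-3-mini compute estimate was derived. One common back-of-envelope route is the rule of thumb of roughly 6 FLOPs per parameter per training token for dense transformers; the sketch below applies it and shows how strongly the GPU-day figure depends on the assumed sustained throughput per GPU. The throughput values are illustrative assumptions, not reported numbers.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params, tokens):
    """Dense-transformer rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


def gpu_days(total_flops, sustained_flops_per_gpu):
    """GPU-days needed at a given sustained per-GPU throughput (FLOP/s)."""
    return total_flops / sustained_flops_per_gpu / SECONDS_PER_DAY


flops = training_flops(params=3.8e9, tokens=3.3e12)  # about 7.5e22 FLOPs
for tflops in (100, 200, 400):  # assumed sustained TFLOP/s per GPU
    print(f"{tflops} TFLOP/s -> {gpu_days(flops, tflops * 1e12):,.0f} GPU-days")

# Sustained throughput implied by a 10K GPU-day budget
implied = flops / (10_000 * SECONDS_PER_DAY)
print(f"10K GPU-days implies about {implied / 1e12:.0f} TFLOP/s sustained per GPU")
```

Under this approximation a ~10K GPU-day figure corresponds to fairly conservative utilisation; higher sustained throughput would put the same run at a few thousand GPU-days.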

8D. Inference / deployment specifics. Phi-3-mini is the only model in the cluster targeted explicitly at on-device deployment; the paper reports 1.8GB at 4-bit quantization, running at 12+ tokens/sec on iPhone 14 (A16 Bionic). 2
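The reported footprint is consistent with simple arithmetic on the weight count. The sketch below counts weight storage only; it ignores quantisation scales, the KV cache, and activations, so it is a sanity check on the order of magnitude rather than a reproduction of the paper's measurement.

```python
def quantized_weight_bytes(params, bits_per_weight):
    """Bytes needed to store the weights alone at a given bit width."""
    return params * bits_per_weight / 8


size = quantized_weight_bytes(params=3.8e9, bits_per_weight=4)
print(f"{size / 1e9:.2f} GB (decimal), {size / 2**30:.2f} GiB (binary)")
```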

Section 9: Experiments and results

Datasets. The cluster evaluates against a standard battery: HumanEval and MBPP for code (Phi-1, Phi-3); MMLU, MT-Bench, HellaSwag, ARC, PIQA, TriviaQA for general knowledge (Phi-3, Cosmopedia); BBH and AGIEval for reasoning (Orca); SuperNI and a 252-task user-oriented evaluation set for Self-Instruct.

Baselines. The most common baselines across the cluster: Llama-2 / Llama-3 at matched parameter count, Mixtral 8x7B, GPT-3.5 (which the Phi-3 report treats as a soft target), and prior-generation Phi variants. [Analysis] Notably absent from most cluster baselines: contemporaneous synthetic-data models from other labs (Nemotron, Qwen-1.5, DeepSeek-Coder), which would be the most directly informative comparisons.

Main quantitative results (cluster table). Reproduced for editorial coverage; numbers as reported in the cited primary sources.

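| Model | Params | Training tokens | HumanEval pass@1 | MMLU 5-shot | MT-Bench | Source |
| --- | --- | --- | --- | --- | --- | --- |
| Phi-1 | 1.3B | ~7B | 50.6% | - | - | Phi-1 paper 1 |
| Phi-1-small | 350M | ~7B | 45% | - | - | Phi-1 paper 1 |
| Phi-3-mini | 3.8B | 3.3T | - | 69% | 8.38 | Phi-3 report 2 |
| Phi-3-small | 7B | 4.8T | - | 75% | 8.7 | Phi-3 report 2 |
| Phi-3-medium | 14B | 4.8T | - | 78% | 8.9 | Phi-3 report 2 |
| Mixtral 8x7B | 47B (13B active) | - | - | 68.4% | - | Phi-3 report 2 |
| GPT-3.5 | undisclosed | - | - | 71.4% | - | Phi-3 report 2 |
| Orca 13B | 13B | - | - | - | - | Orca paper 3 |
| Cosmo-1B | 1.8B | 25B | - | - | - | Cosmopedia 5 |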

The Phi-3 report’s headline framing: phi-3-mini’s 69 percent MMLU and 8.38 MT-Bench “rivals models such as Mixtral 8x7B and GPT-3.5” despite a roughly 12x total-parameter gap with Mixtral (47B vs 3.8B; about 3.4x against Mixtral’s 13B active parameters). 2

Orca-specific results. From the abstract and paper Section 6: Orca-13B “surpasses Vicuna-13B by over 100 percent on Big-Bench Hard (BBH),” reaches “42 percent improvement on AGIEval over Vicuna,” and “achieves parity with ChatGPT on the BBH benchmark.” It performs competitively on professional-admissions tests (SAT, LSAT, GRE, GMAT), with a roughly 4-point gap when an optimised system message is used, while still trailing GPT-4. 3

Self-Instruct-specific results. Fine-tuning GPT-3 davinci on the 52K Self-Instruct data yields “33 percent absolute improvement on SuperNI” relative to vanilla GPT-3 and reaches “within 5 percent of InstructGPT-001 on novel user-oriented tasks” per human evaluation. 4 [Analysis] This is the result that triggered Stanford Alpaca’s near-identical reproduction with LLaMA-7B as student, which launched the consumer-LLM-fine-tuning era.

Cosmopedia-specific results. Cosmo-1B (1.8B parameters trained from scratch on 25B Cosmopedia tokens) “outperforms TinyLlama 1.1B on ARC-easy, ARC-challenge, OpenBookQA, and MMLU”; matches Qwen-1.5-1B on some axes; trails Phi-1.5 by a noticeable gap. 5 The blog post attributes the Phi-1.5 gap to Mixtral’s lower generation quality vs GPT-4, plus less mature prompt engineering than the Phi authors deployed.

Ablations. Phi-1 ablates each of the three training-set components and reports the educational-value filter alone moves 350M HumanEval from 12.19 to 17.68 percent; adding CodeExercises moves it further. 1 Self-Instruct ablates the ROUGE-L threshold and reports degradation when set above 0.8 (admits duplicates) or below 0.5 (admits irrelevant tasks). 4 Phi-3 does not publish corresponding ablations in the main text; [Analysis] this is one of the larger transparency gaps in the cluster.

Independent benchmark cross-checks for SOTA claims. Phi-3-mini’s 69 percent MMLU is on the higher end of the contemporary 3-4B-parameter band and has been broadly reproduced; Llama-3-8B’s 66 percent MMLU is the closest comparison and lands in the same ballpark. [Reviewer Perspective] The “rivals GPT-3.5” framing depends on which GPT-3.5 variant: gpt-3.5-turbo-0613 scored 70.0 percent MMLU; gpt-3.5-turbo-0301 scored ~68 percent. Phi-3-mini’s 69 percent is genuinely in the same band as the GPT-3.5 number the paper cites, but small-model-vs-API comparisons carry methodology nuance the paper does not fully disclose.

Reproducibility check. Phi-1: weights released on Hugging Face (microsoft/phi-1); training data NOT released (the synthetic textbooks and exercises remain proprietary). Phi-3: weights released (microsoft/Phi-3-mini-4k-instruct, etc.); training data NOT released. Orca: paper released; weights NOT released by Microsoft Research (community Open-Orca attempts exist as approximations). Self-Instruct: code and 52K-instruction dataset released. Cosmopedia: dataset, code, and Cosmo-1B model all released under Apache 2.0. 5

Experimental scope limits. [Analysis] Three structural gaps across the cluster: (1) no work in the cluster runs the same student architecture and tokens on web-data vs synthetic-data corpora with matched compute, so the headline “synthetic beats web” claim is never cleanly isolated from confounders; (2) no work tests how synthetic-data benefits decay as the student model grows past 14B parameters, so every benchmark gain is at the small-model end; (3) no work characterises what kinds of capability synthetic data fails to teach, beyond Phi-3’s note about factual recall.

Evidence audit.

  • Strongly supported: synthetic data plus aggressive filtering produces small models competitive on standard benchmarks. Confirmed across Phi-1, Phi-3, and Cosmopedia at matched-size comparisons.
  • Partially supported: synthetic data is better than equivalent-token-count filtered web data. The cluster’s lack of controlled head-to-head studies (point 1 above) means this is the consensus framing but not cleanly demonstrated.
  • Narrowly supported: synthetic-data benefits persist at frontier scale. Only Phi-3-medium (14B) tests beyond 7B; no public synthetic-data experiment exists at the 70B+ frontier as of the cluster’s publication dates.
Figure of Phi-3 Technical Report (arXiv:2404.14219) showing phi-3-mini's benchmark performance against Llama-3-8B and Mixtral-8x7B across MMLU, HellaSwag, and other standard benchmarks, demonstrating the parameter-efficiency claim central to the synthetic-data thesis

Section 10: Technical novelty summary

| Component | Type | Novelty level | Justification | Source |
| --- | --- | --- | --- | --- |
| Four-step bootstrap | Pipeline | Fully novel | Self-Instruct’s central contribution | Self-Instruct |
| ROUGE-L diversity filter | Method | Combination novel | Re-purposing 2004 summarisation metric for instruction deduplication | Self-Instruct |
| Explanation-tuned distillation | Method | Incrementally novel | System-message augmentation over FLAN-v2 base | Orca |
| Progressive ChatGPT→GPT-4 curriculum | Method | Combination novel | Curriculum learning + teacher ensemble | Orca |
| Educational-value classifier | Method | Fully novel | First time used at pre-training scale on code | Phi-1 |
| Synthetic textbooks for pre-training | Concept | Fully novel | Phi-1 moves synthetic data from instruction-tuning into pre-training | Phi-1 |
| “Data-optimal regime” framing | Concept | Incrementally novel | Re-articulation of inference-deployment-driven training | Phi-3 |
| Two-phase data curriculum | Method | Incrementally novel | Web-first then synthetic-emphasis training | Phi-3 |
| Audience × format prompt multiplication | Method | Fully novel | Cosmopedia’s 12x diversity-induction trick | Cosmopedia |
| Open-replication of closed pre-training | Artefact | Combination novel | Full open-source stack from data to model | Cosmopedia |

Single most novel contribution (cluster-wide). [Analysis] The publication’s reading is that Phi-1’s transposition of synthetic data from the instruction-tuning stage (where Self-Instruct and Alpaca had established it) into the pre-training stage is the single most consequential novelty in the cluster. Self-Instruct’s bootstrap was a powerful instruction-tuning tool but did not change how anyone pre-trained base models. Phi-1’s “Textbooks Are All You Need” paper was the first published demonstration that a base model’s initial learning could be substantially synthetic, and it triggered the whole Phi-2/Phi-3/Cosmopedia/Nemotron line.

What the cluster does NOT claim to be novel. The transformer architecture (decoder-only, RoPE, GQA, SwiGLU) is inherited from the standard literature; the next-token autoregressive pre-training loss is inherited; the FLAN-v2 base data for Orca is inherited; the standard benchmark suite (MMLU, HumanEval, BBH) is inherited. The contribution across the cluster is consistently data, not model or loss.

Section 11: Situating the work

The synthetic-data line traces back to two distinct streams that converged in 2022-2023. The first is knowledge distillation (Hinton et al. 2015): train a small student to imitate a large teacher’s outputs. The second is instruction-tuning (Wei et al. 2021 FLAN; Sanh et al. 2021 T0): adapt a pre-trained model with prompt/response training examples. Self-Instruct combines them: use a large teacher to generate instruction-tuning data for a smaller student. 4

Contemporaneous related work [External comparison]. Two papers from the same window are essential context:

  1. Stanford Alpaca (Taori et al. 2023) 10 : directly applied Self-Instruct to LLaMA-7B with text-davinci-003 (a GPT-3.5-series model, not the original GPT-3 davinci) as teacher; cost roughly $500 in OpenAI credits and reached “qualitatively similar” performance to text-davinci-003 in the authors’ framing. This was the consumer-scale proof of concept that made Self-Instruct’s influence widespread.
  2. Phi-1.5 (Li et al., arXiv:2309.05463) 7 : the bridge paper between Phi-1 (code) and Phi-3 (general). Phi-1.5 demonstrated that the “textbooks” recipe transfers from code to natural language; this is the paper Cosmopedia most directly tries to reproduce.

Strongest skeptical objection [Reviewer Perspective]. The cluster benefits massively from benchmark choice. MMLU, HumanEval, MT-Bench, and BBH all reward exactly the kind of textbook-shaped competence that synthetic data is designed to instil. Small models trained on synthetic textbooks excel at textbook benchmarks; whether they excel at messy real-world tasks is much less clear and much less studied. The Phi-3 paper’s TriviaQA weakness (which Microsoft calls out openly) is one symptom; agentic and long-horizon tasks where small models repeatedly underperform are another.

Strongest author-side rebuttal [Reviewer Perspective]. Phi-3’s deployment story (1.8GB on an iPhone with roughly 69 percent MMLU) is a genuine production unlock that no amount of benchmark methodology critique can erase. Even if the synthetic-data approach over-optimises for textbook benchmarks, the resulting models are useful at sizes where prior models were not.

What remains unsolved. (a) Whether synthetic data can scale to frontier model sizes; (b) what categories of capability synthetic data fails to teach; (c) whether models trained predominantly on AI-generated text exhibit cumulative distributional drift across generations (the “model collapse” concern from Shumailov et al. 2023).

Three future research directions, grounded in cluster-specific gaps.

  1. Matched-compute web-vs-synthetic ablations. No cluster paper trains the same student on $\mathcal{D}_{\text{web}}$ vs $\mathcal{D}_{\text{syn}}$ at matched token count and compute. A clean controlled comparison is the missing experiment that would settle the cluster’s central claim.
  2. Capability-coverage maps. Independent benchmarking that explicitly maps where synthetic-data models excel and where they degrade, across messy real-world domains (agentic tasks, code in unfamiliar frameworks, multi-document reasoning).
  3. Open-frontier-scale replication. Cosmopedia tops out at the 1-2B scale; a 70B-or-larger fully-open synthetic-data pre-training run remains an open goal for the open-source community.

Section 12: Critical analysis

Strengths (cluster-wide).

  • Reproducible at multiple scales. Self-Instruct → Alpaca → Phi-1 → Cosmopedia is a chain of increasingly accessible reproductions. The original recipe is genuinely portable.
  • Operationally validated. Phi-3-mini is deployed on consumer hardware. The cluster’s claims are not benchmark theatre alone.
  • Open-source replication exists. Cosmopedia plus Cosmo-1B is a fully open stack from data to model weights under permissive license.

Author-stated weaknesses.

  • Phi-3 explicitly notes small-model factual-recall weakness on TriviaQA. 2
  • Self-Instruct’s manual quality audit reports 92 percent valid instructions but only 58 percent correct input-output pairs, which indicates substantial input-output drift even after filtering. 4
  • Cosmopedia’s blog post is candid about the Phi-1.5 quality gap and attributes it to teacher-model quality. 5

Weaknesses not stated or understated [Reviewer Perspective].

  • Benchmark-selection bias. The cluster systematically evaluates on benchmarks that reward the kind of competence synthetic data is engineered to produce. Mixed independent benchmarking on agentic tasks suggests this is not the full picture.
  • Recursive teacher dependence. The pipeline assumes a strong teacher exists. As the open-source frontier closes the gap with closed models, the teacher-student capability gap narrows and the recipe’s headroom shrinks. This is a structural limit the cluster does not discuss.
  • Distributional drift / model collapse risk. As more AI-generated text enters pre-training corpora (Cosmopedia is itself such a corpus), repeated training generations on synthetic data risk distributional collapse per Shumailov et al. 2023. The cluster does not engage with this risk explicitly.

Reproducibility check.

  • Code: Self-Instruct (yes, MIT), Orca (no), Phi-1 (no), Phi-3 (no), Cosmopedia (yes, Apache 2.0).
  • Data: Self-Instruct (yes), Cosmopedia (yes, 92GB on HF), Phi-1/Phi-3 (no), Orca (no).
  • Hyperparameters: fully reported in Self-Instruct and Cosmopedia; partially in Phi-1 and Phi-3 main text (some appendix gaps); minimally in Orca.
  • Compute: reported in Phi-1 (4 days, 8 A100s), Cosmopedia (~10K GPU hours for data); not reported in Phi-3 or Orca.
  • Trained model weights: Phi-1, Phi-3, Cosmo-1B all on Hugging Face; Orca-13B not released by Microsoft.
  • Evaluation set: standard public benchmarks; decontamination procedures disclosed in Phi-1 (yes, detailed) and Cosmopedia (yes), not detailed for Phi-3 main text or Orca.
  • Overall: Self-Instruct and Cosmopedia are fully reproducible; Phi-1 and Phi-3 are partially reproducible (weights yes, data no); Orca is the least reproducible of the cluster.

Methodology

  • Sample size: Self-Instruct 52K instructions / 82K instances; Orca ~5M FLAN-v2 examples plus GPT-4 subset; Phi-1 ~7B training tokens; Phi-3-mini 3.3T tokens; Cosmopedia 25B tokens / 30M+ files.
  • Evaluation set: MMLU, HumanEval, MBPP, MT-Bench, BBH, AGIEval, SuperNI, ARC, HellaSwag, PIQA, TriviaQA, and OpenBookQA, all standard public benchmarks; contamination checks disclosed for Phi-1 (detailed) and Cosmopedia (10-gram MinHash); Phi-3 contamination procedure described in less detail in main text.
  • Baselines: Llama-2/3 at matched parameter count, Mixtral 8x7B, GPT-3.5 / GPT-4 where comparable, prior Phi variants, TinyLlama 1.1B (Cosmopedia baseline), Vicuna-13B (Orca baseline).
  • Hardware/compute: Phi-1: 8 A100s for 4 days. Cosmopedia data generation: ~10K GPU hours. Phi-3 training hardware not reported in the technical-report main text. Orca training hardware not fully reported.
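The contamination checks listed under the evaluation-set bullet rest on detecting shared n-grams between training samples and benchmark items. The sketch below uses exact 10-gram overlap on whitespace tokens as a simplified stand-in; it is not MinHash and not any cluster paper's actual pipeline, and the function names and example strings are hypothetical.

```python
def ngrams(text, n):
    """Set of n-token windows over lowercased whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample, benchmark_items, n=10):
    """Flag a training sample that shares at least one n-gram with any benchmark item."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(item, n) for item in benchmark_items)


benchmark = [
    "which of the following best describes the role of chlorophyll in photosynthesis",
]
leaky = (
    "In this chapter we ask which of the following best describes "
    "the role of chlorophyll in photosynthesis and why"
)
clean = "Chlorophyll absorbs light and drives photosynthesis in green plants"

print(is_contaminated(leaky, benchmark))
print(is_contaminated(clean, benchmark))
```

Exact matching misses paraphrased leaks, which is why approximate methods such as MinHash are used at corpus scale; the set-intersection logic is the same idea in its simplest form.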

Generalisability. [Analysis] The synthetic-data recipe transfers well within the textbook-shaped knowledge domain (code, math, science, structured reasoning) but is undertested on the long-tail (agentic web tasks, fresh code in unfamiliar frameworks, multi-document open-domain QA). Generalisation to scales above 14B is untested in public.

Assumption audit. The three load-bearing assumptions from Section 3 (teacher knowledge dominance, imitation transfer, synthetic-data composability) are well-satisfied at small scale (1-3B parameters) with a strong teacher (GPT-4). They become progressively shakier as student size approaches teacher size and as the closed-vs-open teacher gap narrows. Cosmopedia’s gap to Phi-1.5 is the canonical evidence: Mixtral as teacher is just enough less capable than GPT-4 to leave a measurable quality gap.

What would make the cluster significantly stronger [Analysis]. A single controlled study running matched (student-architecture, parameter-count, training-token-count, compute) experiments on (a) raw web scrape, (b) filtered web scrape, (c) synthetic-only, (d) synthetic + filtered-web mix. The cluster’s headline claim (synthetic data beats raw data for small models) is widely believed but never cleanly demonstrated in any single paper.
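The four-arm design described here can be written down as a small configuration grid in which only the data source varies. The sketch is illustrative: the `RunConfig` fields and the default student size and token budget are hypothetical placeholders, not values proposed by any paper in the cluster.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    data_arm: str
    architecture: str = "decoder-only transformer"  # held fixed across arms
    params: float = 1.3e9                           # hypothetical student size
    train_tokens: float = 50e9                      # hypothetical token budget
    seed: int = 0


ARMS = ("raw_web", "filtered_web", "synthetic_only", "synthetic_plus_filtered_web")

base = RunConfig(data_arm=ARMS[0])
grid = [replace(base, data_arm=arm) for arm in ARMS]

for cfg in grid:
    print(cfg)
```

Deriving every arm from one frozen base config makes it hard to change a confounder (size, tokens, seed) in one arm by accident, which is the property the controlled study needs.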

Cosmopedia dataset composition figure from the Hugging Face blog post showing curated sources (Stanford, Khan Academy, OpenStax, WikiHow) at 20 percent and web-clustered topic prompts at 80 percent, the open-replication attempt at the Phi-1.5 training data

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: Self-Instruct bootstrap pattern. What it is: the 4-step pipeline (generate, classify, instance, filter). Why worth reusing: well-tested, cheap with modern teachers (a 52K-task generation on GPT-4o costs ~$200 [External comparison]), and the ROUGE-L diversity filter is empirically validated. Preconditions: access to a teacher LLM stronger than the target student. Risks: low-diversity collapse if the seed set is too narrow; teacher-bias inheritance.

REUSABLE COMPONENT 2: Orca explanation-message augmentation. What it is: a small set (≈16) of hand-crafted system messages instructing the teacher to produce step-by-step reasoning. Why worth reusing: dramatically improves reasoning-benchmark performance vs flat-imitation training. Preconditions: a teacher capable of step-by-step reasoning at the depth the student should learn. Risks: the student may learn the surface style of reasoning without the substance.

REUSABLE COMPONENT 3: Educational-value classifier (Phi-1). What it is: a small classifier trained on teacher-LLM-annotated samples that filters a large corpus for pedagogical quality. Why worth reusing: works at corpus scale; classifier inference is cheap. Preconditions: ~100K teacher annotations as training set. Risks: classifier inherits the teacher’s pedagogical biases.

REUSABLE COMPONENT 4: Audience × format prompt multiplication (Cosmopedia). What it is: cross-product conditioning that multiplies prompt diversity from a fixed topic pool. Why worth reusing: directly addresses the synthetic-data diversity bottleneck. Preconditions: a teacher that can adapt style and depth to specified audiences and formats. Risks: low-quality format/audience combinations may produce noisy outputs.

Dependency map. Component 1 (Self-Instruct) is independent and can be used standalone. Component 2 (Orca explanations) layers on top of Component 1 by augmenting the instruction-generation prompt. Component 3 (Phi-1 classifier) is independent of 1 and 2; it filters an existing corpus rather than generating one. Component 4 (Cosmopedia conditioning) layers on top of any generation pipeline (1 or 2) and addresses diversity rather than quality.

Recommendation [Analysis]. For a new instruction-tuning study, stack Components 1 + 2 + 4. For a new pre-training study with a custom corpus, stack Components 3 + 4. The Phi-1 educational-value classifier is the most undervalued component in the cluster: it is corpus-agnostic, cheap to train, and substantially improves downstream model quality per the ablations.

Section 14: Known limitations and open problems

Limitations explicitly stated by the authors.

  • Phi-3: small-model factual-recall weakness (TriviaQA); recommends search-engine augmentation. 2
  • Self-Instruct: 58 percent input-output correctness rate among generated tasks; teacher (GPT-3 davinci) biases inherited. 4
  • Orca: the paper notes that prior small-model imitation tends to “learn to imitate the style, but not the reasoning process”; it positions itself as a partial fix but does not claim full resolution. 3
  • Cosmopedia: explicit Phi-1.5 quality gap attributed to teacher-model differences and prompt-engineering depth. 5

Limitations not stated [Reviewer Perspective + Analysis].

  • No matched-compute web-vs-synthetic ablation exists in the published cluster (Section 9 evidence audit).
  • Benchmark-selection bias: all five works evaluate predominantly on benchmarks aligned to the textbook-shaped competence synthetic data is designed to instil.
  • Frontier scale untested: no public synthetic-data pre-training experiment exists at 70B+ parameters.
  • Model-collapse risk untested: the cluster does not engage with recursive generations of synthetic-data training.

Technical root causes. The matched-ablation gap is partly a cost issue (each clean ablation requires a fresh expensive pre-training run) and partly a competitive-secrecy issue (the closed labs releasing Phi-3 have no incentive to publish controlled studies that might isolate which specific data choices drove the result). The benchmark-bias issue is structural to the field: the same benchmarks that defined “small-model success” are the ones synthetic data is best at gaming.

Open problems.

  1. Are synthetic-data benefits monotonic in teacher capability? (Implied yes; not directly measured.)
  2. At what scale does synthetic data stop helping? (Unmeasured above 14B.)
  3. Is there a sustainable open-source teacher loop, or does the recipe inherently depend on closed-frontier-model access? (Cosmopedia’s quality gap suggests the latter as of 2024.)

What a follow-up paper would need to solve. The single most valuable follow-up would be a controlled matched-compute study isolating the contribution of synthetic data vs filtered web data vs raw web data at a single fixed student architecture and parameter count. The cluster has the parts; nobody has yet built the controlled experiment.

Self-Instruct figure from Wang et al. (arXiv:2212.10560) showing the four-step bootstrap pipeline diagrammatically: instruction generation, classification identification, instance generation, and filtering, the seminal open template for synthetic data generation that this cluster builds on

How this article reads at three depths

For the curious high-school reader. Large AI language models are usually trained on most of the internet, which contains a lot of junk. The papers in this article ask whether a much smaller model can do almost as well if it is trained on a much smaller, much cleaner set of text written by another AI for it. The answer turns out to be mostly yes: a 3.8-billion-parameter model trained this way matches one twelve times its size on common tests, and fits on a phone.

For the working developer or ML engineer. The synthetic-data recipe is now a production training ingredient, not a research curiosity. Self-Instruct is the open template for generating instruction-tuning data; Orca extends it with explanation-tuned distillation; Phi-1/Phi-3 prove the recipe transfers into pre-training; Cosmopedia is the open-source reproduction stack. Engineering decisions follow: pick a teacher LLM at least one capability tier above the target student; deduplicate aggressively (ROUGE-L < 0.7 or equivalent); use audience/format/topic conditioning to multiply prompt diversity; expect a 100x speedup in training-data scale vs raw web while paying the cost in teacher-inference compute. Deployment payoff: a 3-4B model with 70 percent MMLU now fits on a phone.
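Two of the engineering decisions above, conditioning and deduplication, can be sketched in a few lines. This is a minimal illustration, not the pipeline from any of the papers: the ROUGE-L score is a simplified LCS-based F1 over lowercase whitespace tokens (a stand-in for a proper ROUGE implementation), and the prompt template, function names, and example inputs are hypothetical.

```python
from itertools import product


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def dedup(candidates, threshold=0.7):
    """Keep a candidate only if it scores below threshold against every kept item."""
    kept = []
    for text in candidates:
        if all(rouge_l_f1(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def conditioned_prompts(topics, audiences, formats):
    """Cross topic x audience x format to multiply prompt diversity."""
    template = "Write a {fmt} about {topic} for {aud}."  # hypothetical template
    return [
        template.format(fmt=f, topic=t, aud=a)
        for t, a, f in product(topics, audiences, formats)
    ]


prompts = conditioned_prompts(
    ["recursion", "sorting"],
    ["high-school students", "professionals"],
    ["textbook chapter", "blog post", "story"],
)
unique = dedup([
    "Write a poem about the sea",
    "Write a poem about the ocean",
    "Sort a list of integers in Python",
])
print(len(prompts), unique)
```

The greedy filter compares each candidate against everything already kept, so its cost grows quadratically with pool size; at dataset scale an approximate method such as MinHash would likely be substituted.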

For the ML researcher. The cluster’s central novelty is Phi-1’s transposition of synthetic data from the instruction-tuning stage into pre-training, plus Phi-3’s “data-optimal regime” reframing of Chinchilla scaling for inference-deployment targets. The load-bearing assumptions are teacher knowledge dominance, imitation transfer, and synthetic-data composability; all three are well-satisfied at sub-15B scale with a frontier teacher and progressively shakier outside that envelope. The single biggest gap is the absence of a matched-compute controlled ablation of web vs synthetic data at fixed student architecture. The single biggest production result is phi-3-mini on consumer hardware at GPT-3.5-comparable MMLU. The single biggest unresolved risk is recursive-synthetic-data distributional drift as the open web is itself increasingly AI-generated.
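To make the "data-optimal regime" framing concrete, the arithmetic below compares phi-3-mini's tokens-per-parameter ratio with the roughly 20 tokens per parameter attributed to Chinchilla in the source list. The 3.3T training-token figure is the one reported in the Phi-3 technical report and is not restated in this section, so treat it as an assumption to check against the paper; the function names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio


def tokens_per_param(train_tokens, params):
    """Training tokens seen per model parameter."""
    return train_tokens / params


def overtraining_factor(train_tokens, params, optimal_ratio=CHINCHILLA_TOKENS_PER_PARAM):
    """How many times past the compute-optimal token budget the run goes."""
    return tokens_per_param(train_tokens, params) / optimal_ratio


# Assumed figures for phi-3-mini: 3.8B parameters, 3.3T training tokens.
ratio = tokens_per_param(3.3e12, 3.8e9)
factor = overtraining_factor(3.3e12, 3.8e9)
print(f"{ratio:.0f} tokens per parameter, about {factor:.0f}x the Chinchilla ratio")
```

Under these assumed figures the model sees several hundred tokens per parameter, far beyond the compute-optimal point, which is the trade the Phi-3 framing accepts in exchange for a smaller model at inference time.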

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Gunasekar et al., "Textbooks Are All You Need" (Phi-1), arXiv:2306.11644, June 2023. Source for all phi-1 claims, training data composition, HumanEval results, and ablations.
  2. Abdin et al., Phi-3 Technical Report, arXiv:2404.14219, April 2024. Source for phi-3-mini/small/medium parameter counts, training token counts, MMLU and MT-Bench results, and the data-optimal regime framing.
  3. Mukherjee et al., "Orca: Progressive Learning from Complex Explanation Traces of GPT-4," arXiv:2306.02707, June 2023. Source for FLAN-v2 5M base, explanation-tuning method, BBH/AGIEval results.
  4. Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions," arXiv:2212.10560, December 2022. Source for 175 seed tasks, 52K final instructions, ROUGE-L filter, four-step pipeline, SuperNI results.
  5. Ben Allal et al., "Cosmopedia: how to create large-scale synthetic data for pre-training," Hugging Face blog, March 2024. Source for 25B tokens, 30M+ files, Mixtral-8x7B generator, audience × format conditioning, Cosmo-1B benchmark results, Phi-1.5 quality gap framing.
  6. Cosmopedia dataset card on Hugging Face. Source for dataset configuration splits (web_samples_v1/v2, stanford, stories, wikihow, openstax, khanacademy, automathtext), 92.2GB size, Apache 2.0 license.
  7. Li et al., "Textbooks Are All You Need II: phi-1.5 technical report," arXiv:2309.05463, September 2023. Cited as bridge paper between Phi-1 and Phi-3 and as Cosmopedia's primary reproduction target.
  8. Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), arXiv:2203.15556. Source for the compute-optimal ratio (~20 tokens/parameter) Phi-3 deliberately violates.
  9. Kaplan et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361. Source for the original neural scaling laws Chinchilla and Phi-3 both reference.
  10. Taori et al., Stanford Alpaca repository. Source for the Self-Instruct-on-LLaMA-7B reproduction at ~$500 in OpenAI credits.
