Neural Tech Daily

ARC-AGI, Five Years On: A Multi-Paper Review of Chollet's Benchmark, the 2024 Prize, o3, and ARC-AGI-2

Multi-paper review of the ARC-AGI benchmark: Chollet's On the Measure of Intelligence (2019), the 2024 Prize Technical Report, o3's 87.5% breakthrough, and ARC-AGI-2.


Section 1: Paper identity and scope

This review covers three primary artefacts that together tell the story of the ARC-AGI benchmark from its 2019 foundation through the December 2024 o3 breakthrough and into the 2025 ARC-AGI-2 redesign, plus the four secondary papers that document the winning 2024 Prize solutions.

Primary papers:

  1. Chollet, On the Measure of Intelligence (arXiv:1911.01547, 2019). 1 The foundational paper that defines intelligence as skill-acquisition efficiency and introduces the Abstraction and Reasoning Corpus.
  2. Chollet, Knoop, Kamradt, Landers, ARC Prize 2024: Technical Report (arXiv:2412.04604, December 2024). 2 Documents the 2024 Kaggle competition, the public leaderboard scores including o3’s breakthrough, and the open-source paper award winners.
  3. Chollet et al., ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems (arXiv:2505.11831, May 2025). 3 Introduces the harder successor benchmark on which frontier reasoning models score below ten percent.

Secondary coverage (cited where load-bearing): Akyurek et al. on test-time training, 9 Li et al. on combining induction and transduction, 10 Greenblatt’s GPT-4o program-synthesis approach, 11 and the ARC Prize blog posts on o1 and o3. 6 7

Retrieval confirmation. The three primary papers were fetched at writer-time from arXiv abstract pages and the ar5iv HTML renders on 2026-05-19. The ARC Prize 2024 Technical Report ar5iv render exposed the full text including the team-by-team approach summaries. The o3 announcement page on arcprize.org was inaccessible to automated fetch during the draft session; the o3 numbers below are cross-verified via multiple secondary sources (VentureBeat, Techmeme aggregating the December 20 2024 announcement, MarkTechPost) and the ARC Prize 2024 Technical Report itself.

Paper classification (cluster): Benchmark · Theoretical (foundational paper) · Evaluation · Reasoning · LLM-based (for the 2024 solutions) · Survey (the technical report functions as a partial survey of the competing approaches).

One-paragraph technical abstract (publication voice). Chollet’s 2019 paper argues that task-specific benchmarks measure skill rather than intelligence, and that skill can be “bought” with unlimited priors or data without any improvement in the underlying generalization machinery. The paper proposes an information-theoretic definition of intelligence as skill-acquisition efficiency over a scope of tasks, with respect to fixed priors and bounded experience, and operationalizes that definition via the Abstraction and Reasoning Corpus, a benchmark of grid-pattern puzzles built to assume only the Core Knowledge priors that human infants possess. ARC-AGI remained unbeaten by frontier models for five years; in 2024 the state-of-the-art on the private evaluation set rose from thirty-three to fifty-five percent through test-time training and program-synthesis hybrid approaches, and in December 2024 OpenAI’s o3 reasoning model crossed the eighty-five percent target on the semi-private evaluation set at very high compute cost. The 2025 ARC-AGI-2 release shifts the goalposts back, introducing compositional, multi-step, and in-context-symbolic tasks on which the same frontier reasoning models score below ten percent.

Primary research question (cluster-level). What does it mean to measure machine intelligence in a way that is not gameable by scale, that admits a fair human-versus-machine comparison under matched priors, and that survives contact with frontier reasoning models trained explicitly to chain-of-thought through problems?

Core technical claim (cluster-level). From the paper (Chollet 2019): intelligence should be measured as the efficiency with which a system acquires new skills, not the level of skill it eventually exhibits. The 2024 Technical Report extends this with an empirical claim: the breakthrough on ARC-AGI-1 came from test-time adaptation rather than scale alone, since “there does not exist any static inference-style transduction solution that scores above 11%.” 2 The ARC-AGI-2 paper adds that the next generation of difficulty requires compositional generalization that frontier reasoning models do not yet exhibit.

Core technical domains. Algorithmic information theory (moderate), measurement theory (moderate), benchmark design (deep), program synthesis (moderate), in-context learning and test-time training (moderate), reasoning-model architectures (surface; the o3 internals are not public).

Reader prerequisites. High-school algebra; familiarity with the broad shape of supervised learning and large language models helpful but not required because the Glossary covers them. No prior knowledge of the ARC-AGI benchmark assumed.

Section 2: TL;DR and executive overview

Three-sentence TL;DR. Most AI benchmarks reward systems for being trained on lots of similar problems, which means a model can score well by memorising rather than by being smart. In 2019 François Chollet proposed ARC-AGI, a puzzle benchmark whose problems are designed to be new to the system at test time and to assume only the kind of common-sense reasoning a human infant has. For five years no AI system could solve even half of it; in December 2024 OpenAI’s o3 model finally crossed the eighty-five-percent threshold by spending a great deal of compute thinking about each puzzle, and Chollet’s team responded with ARC-AGI-2, on which the same frontier models score below ten percent.

One-paragraph executive summary. ARC-AGI is the canonical benchmark for distinguishing memorisation from generalisation in modern AI. The original 2019 paper formalises intelligence as skill-acquisition efficiency: a measure of how much new skill a system gains per unit of new experience, controlling for the priors it started with. The benchmark itself comprises 1,000 grid-puzzle tasks split into public training, public evaluation, semi-private evaluation, and a held-out private evaluation set used to score Kaggle competition entries. The 2024 ARC Prize competition saw the state-of-the-art rise from thirty-three to fifty-five percent on the private set through test-time training (the ARChitects and MindsAI), through inductive-plus-transductive ensembles (Li et al.), and through LLM-guided program synthesis (Greenblatt’s forty-two percent on the public leaderboard). OpenAI’s o3 reasoning model then crossed the eighty-five-percent target on the semi-private evaluation, at a cost of roughly twenty dollars per task at low compute and over three thousand dollars per task at high compute. The 2025 ARC-AGI-2 successor benchmark was designed specifically to defeat brute-force chain-of-thought reasoning and currently sits below ten percent for frontier systems while humans solve all tasks within two attempts.

Five practitioner-relevant takeaways.

  • Skill is not intelligence. From the paper (Chollet 2019): when a system scores well on a benchmark that its developer optimised it for, the score measures developer effort plus compute, not the system’s generalisation capacity. Most reported benchmark wins are skill demonstrations.
  • Static prompting maxes out around eleven percent on ARC-AGI-1. Per the ARC Prize 2024 Technical Report, 2 no static, no-test-time-adaptation transduction approach has cleared eleven percent on the public evaluation set; meaningful scores require test-time training or program-synthesis search.
  • Induction and transduction are complementary. Per Li et al. (arXiv:2411.02272), 10 even when trained on identical synthetic data, inductive program-synthesis models and transductive output-prediction models solve markedly different subsets of tasks; ensembling them is the load-bearing trick behind 2024 paper-award-winning systems.
  • o3’s eighty-seven percent does not mean ARC-AGI is solved. From ARC-AGI-2 (Chollet et al. 2025), 3 the same o3 system scores around three percent on the semi-private evaluation set of ARC-AGI-2 while humans solve every task within two attempts.
  • Cost-per-task matters in any AGI claim. Per ARC Prize reporting, 7 o3 at low compute spent approximately seventeen to twenty dollars per task on ARC-AGI-1 (about thirty-three million tokens across the 100-task semi-private set); the cost of the roughly 172-times-higher compute configuration that hit eighty-seven percent was not directly disclosed but extrapolates to thousands of dollars per puzzle.

Pipeline overview in text. The benchmark itself has no training pipeline in the conventional supervised sense: the public training set is meant for humans to develop intuition about the task format, not for the system to learn from. At inference time the system is shown a small number of demonstration input-output pairs and one or more held-out test inputs, and must produce the exact-match test output. Modern winning approaches add a test-time training pipeline: on receipt of a new task, fine-tune a base model on augmentations of the demonstration pairs, then sample candidate outputs from the fine-tuned model and score them. The 2024 ARC Prize Kaggle leaderboard ran under hard constraints (a single P100 GPU, twelve hours, no internet access), so successful solutions had to fit a base model, a test-time training loop, and inference within those limits.

Section 2.5: Glossary

Term | Plain-English explanation | First appears in
ARC-AGI | The Abstraction and Reasoning Corpus, a benchmark of grid-puzzle tasks designed to test general fluid intelligence rather than task-specific skill. | Section 1
Skill-acquisition efficiency | Chollet’s definition of intelligence: how much new skill the system picks up per unit of new experience, given fixed priors. | Section 1
Core Knowledge priors | The four pre-built assumptions Chollet says all ARC-AGI tasks may rely on: objectness, agentness, basic numbers, and elementary geometry. | Section 3
Generalization difficulty | A measure of how much the new task differs from anything in the training set; high generalization difficulty means the system cannot just retrieve a memorised answer. | Section 3
Test-time training (TTT) | A technique where the model fine-tunes itself on the demonstration examples of each task at inference time, producing a task-specific variant of the model. | Section 5
Induction | The approach of writing a program (e.g., Python code) that explains the demonstration pairs and then running that program on the test input. | Section 5
Transduction | The approach of asking a model to directly predict the test output without going through an explicit program. | Section 5
Private evaluation set | A held-out set of one hundred ARC-AGI tasks never released publicly; used for scoring Kaggle competition entries to prevent over-fitting. | Section 1
Semi-private evaluation set | A set of one hundred tasks shown to selected API-accessed models like o3 but not released publicly; used for public-leaderboard scoring. | Section 1
Pass@k | Evaluation metric: the system gets k attempts per task and counts as correct if any attempt matches; ARC-AGI uses pass@2. | Section 9
Reasoning model | A model like o1 or o3 trained to produce explicit chain-of-thought tokens before its final answer, often with reinforcement learning on correct reasoning traces. | Section 2
Compositional generalization | The ability to combine known building blocks (rules, concepts, operations) in a new arrangement that was never seen during training. | Section 4
[Analysis] label | The publication’s own reasoned assessment, distinct from what the paper itself claims. | Throughout
[Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the paper proves. | Section 11 + 12
[Reconstructed] label | Content the publication faithfully reconstructed because the paper only partially disclosed it. | Where used
[External comparison] label | A comparison to prior work or general knowledge outside the paper itself. | Section 4 + 11
“From the paper:” prefix | Content directly supported by the paper’s text, equations, tables, or figures. | Throughout

Section 3: Problem formalisation

Notation table.

Symbol | Type | Meaning | First appears in
$T$ | task | A single ARC-AGI task; a tuple of demonstration pairs plus held-out test inputs. | Section 3
$\mathrm{IO}_i$ | pair | The $i$-th demonstration input-output pair within task $T$. | Section 3
$x_{\text{test}}, y_{\text{test}}$ | grids | Held-out test input and ground-truth test output for $T$. | Section 3
$\mathrm{IS}$ | system | The intelligent system being evaluated; in formal terms, a learning algorithm with priors plus a curriculum-style trainer. | Section 3
$\mathrm{PE}$ | scalar | Priors expressed as algorithmic complexity (Kolmogorov-style) of the system’s built-in knowledge before any task experience. | Section 6
$\mathrm{E}$ | scalar | Experience: total information content of the curriculum the system received. | Section 6
$\mathrm{GD}$ | scalar | Generalization difficulty: the complexity of the optimal solution program for the task, given the priors and experience. | Section 6
$\mathrm{II}(T)$ | scalar | Intelligence Index for task $T$; the paper’s headline measure. | Section 6
$\theta$ | parameters | Trainable parameters of an LLM or transformer used in a 2024 ARC solution. | Section 5
$\mathcal{D}_T$ | dataset | Augmentations of the demonstration pairs of task $T$ used for test-time training. | Section 5

Formal problem statement. A single ARC-AGI task $T$ is a tuple of three to five demonstration input-output grid pairs plus one or more held-out test inputs. Inputs and outputs are 2D grids of size up to thirty by thirty, each cell taking one of ten color values. The system must infer the latent transformation from the demonstration pairs and apply it to the test input to produce an exact-match output. The benchmark scores binary pass-fail per task with up to two attempts allowed per test input (pass@2). The overall score for a system is the fraction of tasks for which any of its at-most-two attempts exactly matches the ground-truth output.
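The problem statement above can be sketched in a few lines of Python. This is a minimal illustration, not the official data schema or scoring harness: the `Task` container, `is_valid_grid`, `task_solved`, and `benchmark_score` are hypothetical names, and the rule that a task with several test inputs passes only if every one of them is matched is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Grid = List[List[int]]          # rows of cells; each cell is a colour index 0..9
Pair = Tuple[Grid, Grid]        # (input grid, output grid)


@dataclass
class Task:
    """Hypothetical container for one task; not an official schema."""
    demos: List[Pair]           # demonstration pairs shown to the solver
    tests: List[Pair]           # held-out (input, ground-truth output) pairs


def is_valid_grid(grid: Grid) -> bool:
    """Rectangular, at most 30x30, every cell one of ten colour values."""
    if not grid or not grid[0] or len(grid) > 30 or len(grid[0]) > 30:
        return False
    width = len(grid[0])
    return all(len(row) == width and all(0 <= c <= 9 for c in row) for row in grid)


# A solver maps (demonstration pairs, test input) to a ranked list of guesses.
Solver = Callable[[List[Pair], Grid], List[Grid]]


def task_solved(task: Task, solver: Solver, max_attempts: int = 2) -> bool:
    """Assumption: a task passes only if every test input is matched exactly
    by one of the solver's first `max_attempts` guesses."""
    for test_input, truth in task.tests:
        guesses = solver(task.demos, test_input)[:max_attempts]
        if truth not in guesses:
            return False
    return True


def benchmark_score(tasks: Sequence[Task], solver: Solver) -> float:
    """Fraction of tasks solved under pass@2."""
    return sum(task_solved(t, solver) for t in tasks) / len(tasks)


# Toy task: the latent rule is "mirror each row left to right".
toy = Task(
    demos=[([[1, 0]], [[0, 1]]), ([[2, 3, 0]], [[0, 3, 2]])],
    tests=[([[4, 0, 0]], [[0, 0, 4]])],
)


def two_guess_solver(demos: List[Pair], grid: Grid) -> List[Grid]:
    # First guess: copy the input. Second guess: mirror it.
    return [grid, [row[::-1] for row in grid]]


print(benchmark_score([toy], two_guess_solver))  # 1.0: the second attempt matches
```

The toy solver fails on its first guess and succeeds on its second, which is exactly the case pass@2 is meant to admit; with `max_attempts=1` the same task would count as unsolved.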

Explicit assumption list. From the paper (Chollet 2019, Section II.1 and Appendix on Core Knowledge): (a) every ARC-AGI task is solvable from the Core Knowledge priors alone, namely objectness and elementary physics, agentness and goal-directedness, natural numbers and elementary arithmetic, and elementary geometry and topology; (b) no task requires acquired knowledge such as language, learned cultural symbols, or domain-specific vocabulary; (c) tasks are intended to be developer-aware-novel, meaning neither the system nor its creators have seen the specific tasks before. [Analysis] Potentially strong assumption: the claim that the public training set does not leak information about the private set is an empirical claim about task-distribution similarity that the 2024 results partially complicate, because successful systems trained heavily on synthetic ARC-like data (the ARC-Heavy and Re-ARC corpora) and on the public training and evaluation sets.

Formal complexity argument. From the paper (Chollet 2019, Section II.4): the Intelligence Index for a system on a task is defined as

$$\mathrm{II}(T) \;=\; \frac{\mathrm{GD}(T \mid \mathrm{PE})}{\mathrm{PE} + \mathrm{E}(T)}$$

where $\mathrm{GD}(T \mid \mathrm{PE})$ is the algorithmic complexity of the shortest solution program for task $T$ given the system’s priors $\mathrm{PE}$, and $\mathrm{E}(T)$ is the experience the system consumed before attempting $T$. The interpretation: a system that solves a hard task (high $\mathrm{GD}$) with small priors and small experience exhibits high intelligence; a system that solves the same task only after consuming an enormous training corpus exhibits lower intelligence even at identical task performance.

If LLM-based. For the 2024 solutions, the LLM’s role splits into two patterns. Pattern A (test-time training): the LLM is fine-tuned at inference time on $\mathcal{D}_T$, a set of augmentations of the demonstration pairs (rotations, reflections, transpositions, color permutations), then samples candidate outputs, which are selected by voting across the augmented views. Pattern B (program synthesis): the LLM is prompted with the demonstration pairs and asked to produce a Python program; the program is executed; if it matches all demonstration outputs it is run on the test input. Greenblatt’s approach (forty-two percent on the public leaderboard) is the canonical Pattern B implementation; 11 Akyurek et al. is the canonical Pattern A implementation. 9

If theoretical. From the paper (Chollet 2019): the intelligence-index formulation is grounded in Algorithmic Information Theory, drawing on Kolmogorov complexity and Solomonoff’s theory of universal induction. The paper acknowledges that $\mathrm{GD}$ is uncomputable in the strict Kolmogorov sense and uses approximate bounds in practice. There are no formal theorems with proofs; the paper is a position paper plus a benchmark proposal.

Section 4: Motivation and gap

Real-world problem with concrete example. From the paper (Chollet 2019, introduction): in 2017 a deep-RL system reached superhuman performance on the Atari benchmark; in 2018 AlphaZero exceeded human strength at chess and Go from self-play alone. Each result was framed as a step toward general intelligence. The paper argues this framing conflates two different quantities: how well the system performs on a fixed task after training, versus how efficiently the system would acquire performance on a new, unseen task. A chess engine that has seen a billion self-play games and a human grandmaster who has played a few hundred thousand games may have similar skill, but the human exhibits vastly higher intelligence by the skill-acquisition-efficiency definition because their experience was so much smaller.

Existing approaches and their failure modes. From the paper (Chollet 2019, Section I.3): pre-ARC benchmarks fall into three categories that each fail in distinct ways. (i) Game benchmarks like Atari and Go: the developer encodes domain priors directly into the system architecture, so high scores reflect engineering effort plus compute rather than the system’s own generalization. (ii) Supervised-learning benchmarks like ImageNet: train and test draw from the same distribution, so a sufficiently large model can interpolate without generalizing. (iii) Multi-task benchmarks like GLUE: the test tasks were public when the systems were built, so developers could optimise for them.

Gap the paper claims to fill. From the paper: a benchmark for general fluid intelligence needs three properties simultaneously: tasks must be developer-aware-novel (neither system nor developer has seen them); tasks must depend only on a small fixed set of priors (the Core Knowledge priors); and the benchmark must enable apples-to-apples human-versus-machine comparison. ARC-AGI was designed to satisfy all three.

Why prior methods were insufficient. [External comparison] The Winograd Schema Challenge (Levesque et al. 2012) and BIG-Bench (Srivastava et al. 2022) each tried to inject more reasoning into NLP benchmarks, but both failed Chollet’s developer-aware-novel criterion because their test items were public and could end up in LLM training corpora. The 2024 Technical Report 2 argues that the static-prompting eleven-percent ceiling on ARC-AGI-1 is itself evidence that ARC-AGI captures something the other benchmarks did not: when a system, like a human solver, gets no task-specific training examples beyond the demonstration pairs, no amount of pre-training closes the gap unless the system also performs test-time adaptation.

Practical stakes. [Analysis] If Chollet’s framing is correct, then most claims about LLM “reasoning” or “AGI proximity” that rest on standard NLP benchmarks are measuring memorisation in disguise. The practical consequence is that decisions about deploying LLMs as general-purpose reasoners (autonomous agents, scientific assistants, planning systems) should be benchmarked against developer-novel tasks rather than against MMLU or similar curated suites. ARC-AGI is one of very few benchmarks where the developer-novelty property is enforced by the held-out private evaluation set.

Position in the broader landscape. [External comparison] ARC-AGI sits in the lineage of psychometric intelligence testing (Raven’s Progressive Matrices, the Wechsler tests) more than in the lineage of ML benchmarks. The 2024 results pulled it into the ML mainstream because the same techniques that improve frontier LLM performance (test-time compute, chain-of-thought, program synthesis) turned out to also lift ARC-AGI scores, suggesting that the benchmark no longer measures an axis completely orthogonal to what frontier labs are optimising. ARC-AGI-2 deliberately re-orthogonalises by adding compositional and in-context-symbolic tasks.

Section 5: Method overview

The cluster has four method strands: the benchmark design (the foundational paper), the test-time training pipeline (the 2024 winners), the program-synthesis pipeline (Greenblatt and the Li et al. paper), and the frontier reasoning models (o1 and o3). This section walks through each.

Strand 1, The ARC-AGI benchmark itself. From the paper (Chollet 2019, Section III), with the split as updated for the 2024 cycle: the benchmark comprises 1,000 tasks split into 400 public training tasks (curated to be easier and to illustrate the task format), 400 public evaluation tasks (curated to be harder), 100 semi-private evaluation tasks (introduced in the 2024 cycle, shown to specific API-accessed evaluators), and 100 private evaluation tasks (held out, used to score Kaggle submissions). Each task is hand-authored by the ARC team to depend only on Core Knowledge priors. Plain-English intuition: imagine someone showed you three pairs of cartoon grids: a grid with a red square that in the output becomes a red square with a blue outline; another pair with a green triangle becoming a green triangle with a blue outline; a third with a yellow circle becoming a yellow circle with a blue outline. Then they hand you a fourth grid with a purple star and ask what the output should be. You answer “a purple star with a blue outline.” That is the entire genre. Design rationale: each abstraction is generally short to describe, but the space of abstractions is combinatorially diverse, which is the paper’s bet on what “intelligence” tests should look like. Classification: [New] for the benchmark, the priors framework, and the intelligence-index formula.

Strand 2, Test-time training (Akyurek et al. and the ARChitects). Plain-English intuition: when a student is given a new kind of puzzle, they do not just stare at it; they internally rehearse small variations and check which interpretation is consistent with all the examples. Test-time training operationalises that rehearsal: on receipt of a new task, the system generates augmented copies of the demonstration pairs (rotating, reflecting, or color-permuting the input and output grids of each pair in the same way), runs a short fine-tuning loop on the base model using those augmented copies, and then samples test-input outputs from the fine-tuned model. Exact mechanism per Akyurek et al.: 9 start from a base LLM fine-tuned on Re-ARC and ARC-Heavy synthetic data, generate per-task augmentations using rotation, reflection, transposition, and color-permutation symmetries, run a leave-one-out fine-tuning objective in which the model predicts each demonstration pair from the others, then sample candidate outputs and vote across augmented inverse-transformations. From the paper (Akyurek et al.): TTT yields up to six-times higher accuracy than fine-tuned baselines and reaches 53.0% on the public validation set with an 8B-parameter model, rising to 61.9% when combined with program synthesis. Design rationale: per the ARC Prize 2024 Technical Report, 2 “there does not exist any static inference-style transduction solution that scores above 11%,” meaning the test-time adaptation is doing the heavy lifting, not the base model. Classification: [Adapted]. Test-time training itself dates to Sun et al. 2020 (self-supervised TTT for vision distribution shift); applying it with leave-one-out objectives to ARC-AGI is the novel contribution.
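The data-preparation half of that mechanism can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors’ code: the function names are hypothetical, colour 0 is assumed to be a background colour that stays fixed, the number of colour permutations per symmetry is an arbitrary parameter, and the fine-tuning step itself is omitted. It only shows how augmented leave-one-out examples could be built from one task’s demonstration pairs.

```python
import random
from typing import Dict, Iterator, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid a quarter turn clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror(grid: Grid) -> Grid:
    """Reflect a grid left to right."""
    return [row[::-1] for row in grid]


def symmetry(grid: Grid, turns: int, flip: bool) -> Grid:
    """One of the eight square symmetries: quarter turns, then an optional mirror."""
    out = [list(row) for row in grid]
    for _ in range(turns % 4):
        out = rotate90(out)
    return mirror(out) if flip else out


def recolor(grid: Grid, mapping: Dict[int, int]) -> Grid:
    """Apply a colour permutation cell by cell."""
    return [[mapping[c] for c in row] for row in grid]


def leave_one_out(demos: List[Pair]) -> Iterator[Tuple[List[Pair], Grid, Grid]]:
    """Hold out each pair in turn; the remaining pairs become the context."""
    for i, (x, y) in enumerate(demos):
        yield demos[:i] + demos[i + 1:], x, y


def build_ttt_examples(demos: List[Pair], color_perms_per_symmetry: int = 2, seed: int = 0):
    """Return (context pairs, query input, target output) training examples."""
    rng = random.Random(seed)
    examples = []
    for flip in (False, True):
        for turns in range(4):
            for _ in range(color_perms_per_symmetry):
                # Assumption: colour 0 (background) stays fixed, 1..9 are shuffled.
                shuffled = list(range(1, 10))
                rng.shuffle(shuffled)
                mapping = {0: 0, **dict(zip(range(1, 10), shuffled))}
                # Inputs and outputs get the same transform, so all pairs of the
                # augmented task still follow one consistent (transformed) rule.
                augmented = [
                    (recolor(symmetry(x, turns, flip), mapping),
                     recolor(symmetry(y, turns, flip), mapping))
                    for x, y in demos
                ]
                examples.extend(leave_one_out(augmented))
    return examples


# Toy task whose rule is "rotate the grid a quarter turn clockwise".
demos = [([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
         ([[5, 0], [0, 6]], [[0, 5], [6, 0]]),
         ([[7, 7], [0, 8]], [[0, 7], [8, 7]])]
examples = build_ttt_examples(demos)
print(len(examples))  # 8 symmetries * 2 recolourings * 3 held-out pairs = 48
```

In a full pipeline each example would be serialised into a prompt (context pairs plus query input) with the held-out output as the target, and a small adapter would be trained on those examples before sampling answers for the real test input.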

Strand 3, LLM-guided program synthesis (Greenblatt and Li et al.). Plain-English intuition: instead of trying to predict the output directly, ask the LLM to write a short Python program that, when run on the demonstration inputs, produces the demonstration outputs; then run the program on the test input. Exact mechanism per Greenblatt: 11 prompt GPT-4o with a textual description of the demonstration pairs, ask for many candidate Python programs (k=2,048 in the 50% variant), execute each in a sandbox, keep the ones that match the demonstration outputs, and apply them to the test input with majority voting. From Greenblatt: 42% on the ARC-AGI-Pub leaderboard, rising to roughly 50% on the same public eval set with more samples. Li et al. 10 generalise this: train two separate neural models on synthetically generated Python programs derived from the public training tasks: an induction model that emits programs and a transduction model that emits outputs directly. Critical finding from Li et al.: “even when trained on the same data and using the same architecture, induction and transduction excelled at different types of ARC-AGI tasks”; induction excels at precise composable computations, while transduction excels at fuzzy perceptual concepts. Ensembling the two reaches near-human performance on the public eval. Design rationale: by separating the program-emitting subsystem from the output-emitting subsystem, the ensemble covers both crisp and fuzzy task families. Classification: [New] for the induction-transduction-complementarity framing; [Adapted] from prior LLM-program-synthesis work (Codex, AlphaCode).
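The filter-and-vote step of the induction approach can be sketched as below. This is a minimal sketch, not Greenblatt’s or Li et al.’s implementation: the candidate programs are plain Python callables standing in for LLM-sampled source code, no sandboxing is shown, and the names `fits_demos` and `solve_by_program_search` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def fits_demos(program: Program, demos: List[Pair]) -> bool:
    """True if the program reproduces every demonstration output exactly."""
    try:
        return all(program(x) == y for x, y in demos)
    except Exception:
        return False  # a crashing candidate is simply discarded


def solve_by_program_search(candidates: List[Program], demos: List[Pair],
                            test_input: Grid, max_attempts: int = 2) -> List[Grid]:
    """Keep candidates consistent with the demos, then majority-vote their outputs."""
    votes: Counter = Counter()
    outputs = {}
    for program in candidates:
        if not fits_demos(program, demos):
            continue
        try:
            out = program(test_input)
        except Exception:
            continue
        key = repr(out)          # grids are nested lists, so vote on their repr
        votes[key] += 1
        outputs[key] = out
    return [outputs[key] for key, _ in votes.most_common(max_attempts)]


# Stand-ins for sampled programs; a real system would execute generated source.
candidates = [
    lambda g: g,                                   # identity
    lambda g: [row[::-1] for row in g],            # mirror
    lambda g: [list(col) for col in zip(*g)],      # transpose
    lambda g: [[g[r][c] for r in range(len(g))] for c in range(len(g[0]))],  # transpose again
    lambda g: g[5],                                # crashes on small grids
]
demos = [([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
         ([[1, 2, 3]], [[1], [2], [3]])]
print(solve_by_program_search(candidates, demos, [[5, 6], [7, 8]]))
# [[[5, 7], [6, 8]]]
```

Only the two transpose candidates survive the demonstration check, and because they agree on the test input they yield a single voted answer. With more diverse survivors, the two most-voted outputs would fill the two allowed attempts.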

Strand 4, Frontier reasoning models (o1, o3). [External comparison] OpenAI’s o1 and o3 models, per the ARC Prize reporting, 6 7 were not designed specifically for ARC-AGI. They are general reasoning models trained with reinforcement learning on chain-of-thought traces and deployed with very long inference-time reasoning budgets. From ARC Prize reporting: 7 o3 was “trained on the ARC-AGI-1 Public Training set” (which means the system saw the 400 training tasks, not the held-out evaluation sets), and the evaluation used the 100-task Semi-Private Evaluation. o1-preview scored a maximum of 32% per Chollet’s own December 2024 reporting. o3 in the high-efficiency configuration (six samples per task) reached 75.7% at under ten thousand dollars total compute; o3 in the high-compute configuration (1024 samples per task, approximately 172 times more compute) reached 87.5%. Design rationale from outside the system: deeper chain-of-thought + larger sampling budget + reinforcement-learned reasoning heuristics gets the model past the eleven-percent static-prompting ceiling that defeated GPT-4o.
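As a back-of-envelope check on the compute figures quoted in this review, the short sketch below compares the sample-count ratio with the reported compute multiplier and extrapolates a per-task cost. It assumes cost scales linearly with compute and takes twenty dollars per task as the low-compute figure; the result is an extrapolation for illustration, not a disclosed number, and `extrapolated_cost_per_task` is a hypothetical helper.

```python
def extrapolated_cost_per_task(low_cost_per_task: float, compute_multiplier: float) -> float:
    """Assumption: cost grows linearly with compute."""
    return low_cost_per_task * compute_multiplier


low_samples, high_samples = 6, 1024   # samples per task in the two configurations
compute_multiplier = 172              # reported ratio of high to low compute
low_cost_per_task = 20.0              # dollars, upper end of the quoted range

print(round(high_samples / low_samples, 1))                               # 170.7
print(extrapolated_cost_per_task(low_cost_per_task, compute_multiplier))  # 3440.0
```

The sample ratio of about 171 is close to the reported 172-times multiplier, and the linear extrapolation lands in the low thousands of dollars per task.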

Section 6: Mathematical contributions

MATH ENTRY 1: Intelligence Index for a single task.

  • Source: Chollet 2019, Section II.4.
  • What it is: the paper’s headline measure of how much intelligence a system demonstrates when it solves a single task, normalised so that solving hard tasks with small priors and small experience yields a high number, while solving easy tasks after huge training yields a low number.
  • Formal definition:

$$\mathrm{II}(T) \;=\; \frac{\mathrm{GD}(T \mid \mathrm{PE})}{\mathrm{PE} + \mathrm{E}(T)}$$

  • Each term explained:
    • $\mathrm{GD}(T \mid \mathrm{PE})$ is the generalization difficulty of task $T$ given the system’s priors $\mathrm{PE}$. Operationally, it is the algorithmic complexity (length of the shortest program in a universal Turing machine encoding) of the optimal solution for $T$ assuming the priors are free. Larger means harder.
    • $\mathrm{PE}$ is “priors energy”, the algorithmic complexity of the system’s built-in knowledge before any task experience. Includes architectural inductive biases, hand-coded heuristics, and any pre-training. Larger means more prior knowledge baked in.
    • $\mathrm{E}(T)$ is the experience the system consumed before attempting $T$. Operationally, the information content of the curriculum: every example, every gradient step, every demonstration.
  • Worked numerical example: suppose system A is GPT-4o with no test-time training. Set $\mathrm{PE}_A = 10^4$ bits (a coarse proxy for the algorithmic content of its pre-training distribution beyond the universal-Turing-machine baseline). Set $\mathrm{E}_A(T) = 0$ because no per-task fine-tuning happens. Suppose the task’s $\mathrm{GD} = 100$ bits (a hard ARC task requires roughly a 100-bit program to express). Then $\mathrm{II}_A(T) = 100 / 10^4 = 0.01$. Now system B is an 8B model with test-time training that consumes 50 bits of per-task augmentation. Set $\mathrm{PE}_B = 5 \cdot 10^3$ bits (smaller base model). Then $\mathrm{II}_B(T) = 100 / (5 \cdot 10^3 + 50) \approx 0.0198$, almost twice the intelligence index of system A even though both score around the same on the task, because system B did it with less inherent prior. The example illustrates the formula’s direction; the absolute numbers are not the paper’s claim, only the proportionality.
  • Role: the index is what the benchmark aspires to measure indirectly; the actual ARC-AGI score is a proxy because $\mathrm{PE}$, $\mathrm{E}$, and $\mathrm{GD}$ are not directly measurable.
  • Edge cases: when $\mathrm{E}(T) \gg \mathrm{PE}$, the index converges to $\mathrm{GD} / \mathrm{E}$, i.e., experience-dominated systems get penalised. When $\mathrm{PE} \to \infty$ (developer encodes the solution directly), the index goes to zero, i.e., domain-specific engineering does not count as intelligence.
  • Novelty: [New], the formulation is original to the paper, though it draws on Solomonoff and Hutter’s universal-induction tradition.
  • Transferability: [Analysis] the formula transfers to any benchmark, but the difficulty of estimating $\mathrm{GD}$, $\mathrm{PE}$, and $\mathrm{E}$ in practice means most uses default to the ARC-AGI score itself as the empirical proxy.
  • Why it matters: the formula is the formal anchor for the publication’s repeated claim that benchmark scores alone do not measure intelligence.
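The worked example in this entry can be reproduced in a few lines. The sketch uses the index exactly as written in this section; the function name `intelligence_index` and the bit counts are the illustrative values from the example, not measurements.

```python
def intelligence_index(gd_bits: float, priors_bits: float, experience_bits: float) -> float:
    """II(T) = GD / (PE + E), with all quantities expressed in bits."""
    denominator = priors_bits + experience_bits
    if denominator <= 0:
        raise ValueError("priors plus experience must be positive")
    return gd_bits / denominator


# System A: large priors, no per-task experience.
ii_a = intelligence_index(gd_bits=100, priors_bits=1e4, experience_bits=0)
# System B: smaller priors, 50 bits of test-time experience.
ii_b = intelligence_index(gd_bits=100, priors_bits=5e3, experience_bits=50)

print(round(ii_a, 4), round(ii_b, 4), round(ii_b / ii_a, 2))  # 0.01 0.0198 1.98
```

The ratio of about 1.98 is the “almost twice” in the example: halving the priors nearly doubles the index, and the 50 bits of test-time experience only shave a little off.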

MATH ENTRY 2: Aggregate intelligence over a task scope.

  • Source: Chollet 2019, Section II.4 (extension).
  • What it is: how to combine per-task intelligence indices over a benchmark to get a single number for the system.
  • Formal definition:

$$\mathrm{II}(\mathrm{IS}, \mathrm{scope}) \;=\; \mathbb{E}_{T \sim \mathrm{scope}} \big[ \omega_T \cdot \theta_T \cdot \mathrm{II}(T) \big]$$

where $\omega_T$ is a task-specific weight (the paper sets $\omega_T = 1$ uniformly for ARC-AGI tasks) and $\theta_T$ is an indicator of whether the system solved $T$ within budget.

  • Each term explained:
    • $\mathbb{E}_{T \sim \mathrm{scope}}$ is expectation over the distribution of tasks in the benchmark scope.
    • $\omega_T$ is the per-task importance weight.
    • $\theta_T \in \{0, 1\}$ is the binary solve indicator.
    • $\mathrm{II}(T)$ is the per-task intelligence index from MATH ENTRY 1.
  • Worked numerical example: ARC-AGI has 100 private evaluation tasks of varying difficulty. If a system solves 55 of them ($\theta_T = 1$ on 55 tasks, $0$ on 45) with uniform $\omega_T = 1$, and the per-task indices average $\bar{\mathrm{II}} = 0.015$ across the solved tasks (because the system used substantial test-time training), then $\mathrm{II}(\mathrm{IS}) = (55/100) \cdot 0.015 = 0.00825$. A different system that solves the same 55 tasks with no test-time training (lower $\mathrm{E}$, higher per-task index of, say, $0.025$) would score $(55/100) \cdot 0.025 = 0.01375$. Same benchmark score, higher intelligence.
  • Role: this is how the paper proposes to compare two systems with the same raw ARC-AGI score but different generalisation profiles.
  • Edge cases: when \(\theta_T\) is all zeros, the aggregate is zero regardless of \(\mathrm{II}\) per task. This means a system must actually solve tasks to demonstrate intelligence; high-potential-but-zero-skill systems score zero.
  • Novelty: [New], same lineage as MATH ENTRY 1.
  • Transferability: [Analysis] applies to any benchmark with binary scoring, but the \(\mathrm{PE}\) and \(\mathrm{E}\) accounting is benchmark-specific.
  • Why it matters: it is the formal answer to the question “two systems hit 55% on ARC-AGI but used very different amounts of compute, which one is more intelligent?”
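
The aggregate in MATH ENTRY 2 can be checked with a few lines of Python. This is a minimal sketch under the entry's own simplifications: the function name `aggregate_index` and its arguments are hypothetical, the per-task indices are made-up illustrative values taken from the worked example, and the expectation is approximated by a plain mean over the tasks in scope.

```python
from typing import Optional, Sequence


def aggregate_index(
    solved: Sequence[bool],
    per_task_index: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Mean of omega_T * theta_T * II(T) over every task in the scope."""
    n = len(solved)
    if n == 0:
        raise ValueError("scope must contain at least one task")
    if weights is None:
        weights = [1.0] * n  # uniform omega_T = 1, as in the worked example
    total = sum(
        w * (1.0 if s else 0.0) * ii
        for w, s, ii in zip(weights, solved, per_task_index)
    )
    return total / n


# Worked example: 100 tasks, 55 solved, identical raw score for both systems.
solved = [True] * 55 + [False] * 45
with_ttt = aggregate_index(solved, [0.015] * 100)     # heavy test-time training
without_ttt = aggregate_index(solved, [0.025] * 100)  # no test-time training
print(round(with_ttt, 5), round(without_ttt, 5))
```

Unsolved tasks contribute zero whatever their per-task index, which is the edge case the entry describes.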

MATH ENTRY 3: Test-time training leave-one-out objective.

  • Source: Akyurek et al. 2024, Section 3.
  • What it is: the loss function used during the test-time fine-tuning loop on each new ARC-AGI task; it teaches the model to predict each demonstration pair from the others, which is the closest thing to the actual test-time prediction available without the ground-truth test output.
  • Formal definition: given a task \(T\) with demonstration pairs \(\{\mathrm{IO}_1, \ldots, \mathrm{IO}_n\}\) and augmentations \(\mathcal{A}\),

\[
\mathcal{L}_{\text{TTT}}(\theta; T) \;=\; \sum_{i=1}^{n} \sum_{a \in \mathcal{A}} -\log p_\theta\big(a(y_i) \mid a(x_i), \{a(\mathrm{IO}_j)\}_{j \neq i}\big)
\]

where \(a\) ranges over augmentation transforms (rotations, reflections, color permutations), \(x_i\) and \(y_i\) are the input and output of pair \(i\), and \(p_\theta\) is the LLM’s autoregressive likelihood.

  • Each term explained:
    • \(\theta\) denotes the LLM parameters being updated by gradient descent during test-time training (typically a low-rank adapter on a 7-8B base model).
    • \(\mathcal{A}\) is the augmentation set; Akyurek et al. report eight geometric transforms (four rotations, four reflections) plus color permutations, producing roughly \(8 \cdot 5!\) augmented copies per task.
    • The inner \(-\log p_\theta(\cdot)\) is the standard next-token cross-entropy loss conditioned on the other demonstrations.
  • Worked numerical example: suppose the task has \(n = 3\) demonstration pairs and \(|\mathcal{A}| = 8\) augmentation transforms. Then the per-step loss sums over \(3 \cdot 8 = 24\) leave-one-out predictions. With a base model of 8B parameters and a low-rank adapter of rank 16, the gradient flows only through approximately 8B * (16 / hidden_dim) trainable parameters per step, which is roughly 31 million for a hidden dimension of 4096. After 5-10 gradient steps per task on a P100 GPU, the adapter has shifted enough to encode the task’s specific transformation rule. Approximate budget: roughly thirty seconds per task on the competition hardware.
  • Role: this loss is what differentiates test-time-training-based ARC solutions from static-prompting baselines.
  • Edge cases: when the augmentation set is invariant under the task’s transformation (e.g., the task itself involves rotation, so augmenting by rotation does not generate new training signal), TTT can collapse to a single demonstration’s worth of supervision. Akyurek et al. report this as a known failure mode.
  • Novelty: [Adapted] from the Sun et al. 2020 TTT framework, with the leave-one-out objective being the ARC-specific adaptation.
  • Transferability: [Analysis] the LOO objective applies to any few-shot in-context learning task where the test answer is unknown but multiple demonstrations are available; the augmentation set is task-domain-specific.
  • Why it matters: this is the load-bearing equation behind the 2024 jump from thirty-three percent to fifty-five percent on the private evaluation set.
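
The data side of the leave-one-out objective can be sketched without a model. The following is a minimal illustration, not the Akyurek et al. implementation: `dihedral_transforms` and `build_loo_examples` are hypothetical names, color permutations are omitted, and the demonstration grids are toy values. Each returned triple is one term of the double sum: a context of the other augmented pairs, the held-out input, and the target the model would be trained to predict.

```python
import numpy as np


def dihedral_transforms():
    """Eight grid symmetries: four rotations, each with and without a flip."""
    transforms = []
    for k in range(4):
        transforms.append(lambda g, k=k: np.rot90(g, k))
        transforms.append(lambda g, k=k: np.fliplr(np.rot90(g, k)))
    return transforms


def build_loo_examples(demos, transforms):
    """Return (context, query_input, target_output) triples, one per (a, i)."""
    examples = []
    for a in transforms:
        augmented = [(a(x), a(y)) for x, y in demos]
        for i, (x_i, y_i) in enumerate(augmented):
            # Hold out pair i; the remaining pairs form the in-context prompt.
            context = augmented[:i] + augmented[i + 1:]
            examples.append((context, x_i, y_i))
    return examples


# Toy task with n = 3 demonstration pairs (values are illustrative only).
demos = [
    (np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 0]])),
    (np.array([[0, 2], [0, 0]]), np.array([[0, 0], [0, 2]])),
    (np.array([[0, 0], [3, 0]]), np.array([[3, 0], [0, 0]])),
]
examples = build_loo_examples(demos, dihedral_transforms())
print(len(examples))  # n * |A| leave-one-out predictions
```

With three pairs and eight transforms this yields the 24 leave-one-out predictions counted in the worked example; the loss itself would be the model's negative log-likelihood of each target given its context and query.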

MATH ENTRY 4: Compute scaling on ARC-AGI for o3.

  • Source: ARC Prize 2024 Technical Report and ARC Prize blog post on o3. 2 7
  • What it is: an empirical scaling relationship between sampling budget per task and accuracy for o3 on the Semi-Private Evaluation set.
  • Formal definition (empirical, not derived):

\[
\mathrm{accuracy}(N) \;\approx\; \alpha + \beta \cdot \log_2 N
\]

where \(N\) is the number of samples drawn per task and \(\alpha, \beta\) are empirical constants fit to the two reported data points.

  • Each term explained:
    • \(N = 6\) for the high-efficiency configuration yielding 75.7% accuracy.
    • \(N = 1024\) for the high-compute configuration yielding 87.5% accuracy.
    • \(\alpha\) and \(\beta\) are fit numerically by solving the two-point system.
  • Worked numerical example: from the two data points, \(\beta = (87.5 - 75.7) / (\log_2 1024 - \log_2 6) = 11.8 / (10 - 2.585) = 11.8 / 7.415 \approx 1.591\) percentage points per doubling of \(N\), and \(\alpha = 75.7 - 1.591 \cdot \log_2 6 = 75.7 - 1.591 \cdot 2.585 \approx 71.6\) percent. So the fit predicts that at \(N = 64\) samples, accuracy would be roughly \(71.6 + 1.591 \cdot 6 \approx 81.1\) percent. [Reconstructed]: the paper does not publish the fit; the publication reconstructs it from the two disclosed data points and labels the projection accordingly.
  • Role: this is the empirical scaling behind the o3 announcement. The cost implication is dramatic: under this fit, each doubling of the per-task sampling budget buys only about 1.6 percentage points of accuracy past 75%. The reported cost of $17-20 per task at low compute extrapolates to roughly $3,000+ per task at the 172x configuration.
  • Edge cases: the log-linear fit cannot extrapolate past 100% accuracy; the curve almost certainly flattens. [Analysis] On ARC-AGI-2 with the same model, the curve sits below 10% per Chollet et al. 2025, 3 meaning the scaling behaviour is benchmark-specific, not model-specific.
  • Novelty: [External comparison], log-linear sample-budget scaling has been reported for chain-of-thought self-consistency (Wang et al. 2022) and majority voting in code generation (Li et al. AlphaCode 2022); o3’s behaviour fits the same family.
  • Transferability: [Analysis] the curve transfers to any test-time-compute-scaled reasoning model on a benchmark where verification is cheap (in ARC-AGI, checking whether a candidate output exactly matches is trivial).
  • Why it matters: it is the formula behind the cost critique of o3. The system is impressive, but the dollars-per-task figure forces the question of whether scaling has actually solved the benchmark or merely paid for the right answer.
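
The two-point fit in MATH ENTRY 4 is easy to reproduce. The sketch below only restates the arithmetic: `fit_log_linear` and `predict` are hypothetical helper names, the inputs are the two disclosed (samples, accuracy) points, and the cap at 100% is an assumption added to respect the edge case noted in the entry. With only two points the fit is exact by construction, so it says nothing about goodness of fit.

```python
import math


def fit_log_linear(n1, acc1, n2, acc2):
    """Solve accuracy = alpha + beta * log2(N) through two data points."""
    beta = (acc2 - acc1) / (math.log2(n2) - math.log2(n1))
    alpha = acc1 - beta * math.log2(n1)
    return alpha, beta


def predict(alpha, beta, n):
    """Projected accuracy at n samples, capped at 100 percent."""
    return min(100.0, alpha + beta * math.log2(n))


alpha, beta = fit_log_linear(6, 75.7, 1024, 87.5)
pred_64 = predict(alpha, beta, 64)
print(f"alpha={alpha:.2f} beta={beta:.3f} accuracy(64)={pred_64:.1f}")
```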

Section 7: Algorithmic contributions

ALGORITHM ENTRY 1: Test-Time Training pipeline (headline algorithm).

  • Source: Akyurek et al. 2024, Algorithm 1, and the ARChitects’ open-source release per the ARC Prize 2024 Technical Report. 2
  • Purpose: take a base LLM fine-tuned on synthetic ARC-like data and adapt it to a single specific ARC-AGI task at inference time, then sample candidate outputs.
  • Inputs:
    • Base model parameters \(\theta_0\) (typically an 8B-parameter LLM).
    • Task \(T\) with demonstration pairs \(\{(x_i, y_i)\}_{i=1}^n\) and test input \(x_{\text{test}}\).
    • Augmentation set \(\mathcal{A}\).
    • Hyperparameters: learning rate \(\eta\), number of fine-tuning steps \(K\), number of output samples \(S\).
  • Outputs: two candidate outputs \(\hat{y}_{\text{test}}^{(1)}, \hat{y}_{\text{test}}^{(2)}\) for the pass@2 score.
  • Pseudocode:
function TTT_solve(theta_0, T, A, eta, K, S):
  # 1. Build augmented leave-one-out training set
  D_T = []
  for each a in A:
    for each (x_i, y_i) in T.demos:
      # context = the other demos, under the same augmentation
      context = [(a(x_j), a(y_j)) for (x_j, y_j) in T.demos if j != i]
      D_T.append((context, a(x_i), a(y_i)))

  # 2. Test-time fine-tune via LOO objective
  theta = theta_0  # often a LoRA adapter init
  for step in 1..K:
    sample mini-batch B from D_T
    L = sum over (context, x, y) in B of:
          -log p_theta(y | x, context)
    theta = theta - eta * grad(L)

  # 3. Sample candidates under inverse augmentations
  candidates = []
  for each a in A:
    for s in 1..S:
      # demos are augmented with the same a as the test input
      y_aug = sample_from(p_theta(. | a(x_test), a(T.demos)))
      candidates.append(a_inverse(y_aug))

  # 4. Vote and return top-2
  return top_two_by_majority_vote(candidates)
  • Hand-traced example on minimal input: suppose the task is “given a grid, draw a blue outline around every red shape.” Demonstration pairs are three (input, output) grids that illustrate this. The augmentation set \(\mathcal{A}\) contains four rotations (0, 90, 180, 270 degrees). Step 1: \(D_T\) has \(4 \cdot 3 = 12\) augmented pairs. Step 2: starting from \(\theta = \theta_0\), after 10 gradient steps the adapter \(\theta\) has shifted so that on rotated inputs the model now reliably predicts rotated “blue outline” outputs. Step 3: under each of the four rotation augmentations, sample \(S = 8\) candidate outputs, invert the rotation on each, accumulate 32 candidate outputs. Step 4: majority-vote over the 32; if 20 of them agree on a particular output, that becomes attempt #1, and the next-most-common output is attempt #2.
  • Complexity: time is \(O(K \cdot |\mathcal{A}| \cdot n \cdot c_{\text{fwd}})\) for fine-tuning plus \(O(|\mathcal{A}| \cdot S \cdot c_{\text{fwd}})\) for sampling, where \(c_{\text{fwd}}\) is the cost of one LLM forward pass. Space is \(O(|\theta|)\) plus the adapter weights. The bottleneck is the inference pass, which dominates for any reasonable \(S\).
  • Hyperparameters: learning rate (Akyurek et al. report \(5 \times 10^{-5}\)), fine-tuning steps \(K\) (5-20 typical), sample count \(S\) (8-32 typical), adapter rank (16-32 typical).
  • Failure modes: when the augmentation set is invariant under the task’s transformation (the task is “rotate everything 90 degrees” and the augmentation includes rotations), TTT loses signal. Akyurek et al. acknowledge this and use augmentation selection heuristics.
  • Novelty: [Adapted], TTT itself is from Sun et al. 2020; the ARC-specific augmentation strategy and leave-one-out objective are novel.
  • Transferability: [Analysis] applies broadly to few-shot reasoning tasks with multiple demonstrations and natural symmetries; less applicable when demonstrations are heterogeneous in format.
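
Steps 3 and 4 of the pipeline (inverse augmentation, then voting) do not need a model to illustrate. The sketch below is a toy under stated assumptions: `rotation_pairs`, `grid_key`, and `top_two_by_majority_vote` are hypothetical helpers, only the four rotations are used, and the "samples" are hand-built stand-ins for model outputs (three correct and one wrong per augmentation) rather than real generations.

```python
from collections import Counter

import numpy as np


def rotation_pairs():
    """(forward, inverse) callables for rotations by 0, 90, 180, 270 degrees."""
    return [
        (lambda g, k=k: np.rot90(g, k), lambda g, k=k: np.rot90(g, -k))
        for k in range(4)
    ]


def grid_key(grid):
    """Hashable identity for a grid: shape plus flattened cell values."""
    return (grid.shape, tuple(grid.ravel().tolist()))


def top_two_by_majority_vote(candidates):
    """Return up to two distinct grids, most frequent first."""
    counts = Counter()
    representative = {}
    for grid in candidates:
        key = grid_key(grid)
        counts[key] += 1
        representative.setdefault(key, grid)
    return [representative[key] for key, _ in counts.most_common(2)]


y_true = np.array([[1, 2], [3, 4]])
y_wrong = np.zeros((2, 2), dtype=int)

candidates = []
for forward, inverse in rotation_pairs():
    # Stand-in for sampling in augmented space: 3 correct, 1 wrong.
    samples = [forward(y_true)] * 3 + [forward(y_wrong)]
    # Map every sample back to the original orientation before voting.
    candidates.extend(inverse(s) for s in samples)

first, second = top_two_by_majority_vote(candidates)
```

Voting happens only after the inverse transform, so agreement is measured in the original orientation; otherwise the same answer sampled under different rotations would split its votes.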

ALGORITHM ENTRY 2: LLM-guided program synthesis (Greenblatt approach).

  • Source: Greenblatt’s Redwood Research blog post, June 2024. 11
  • Purpose: generate many candidate Python programs that explain the demonstration pairs, then apply the surviving programs to the test input.
  • Inputs: base LLM (GPT-4o in the original), task \(T\), sample count \(k\) (2,048 in the 50% variant).
  • Outputs: top-2 candidate test outputs.
  • Pseudocode:
function PROGSYN_solve(LLM, T, k):
  programs = []
  prompt = build_prompt(T.demos)  # textual description of pairs
  for i in 1..k:
    p = LLM.generate(prompt, sample=True)
    programs.append(p)

  surviving = []
  for p in programs:
    try:
      passes_all = all(
        execute(p, x_i) == y_i for (x_i, y_i) in T.demos
      )
      if passes_all:
        surviving.append(p)
    except: pass

  outputs = []
  for p in surviving:
    try:
      outputs.append(execute(p, T.x_test))
    except: pass  # a survivor may still crash on the unseen test input
  return top_two_by_frequency(outputs)
  • Hand-traced example: suppose the task is “find the largest connected component in the grid and color it blue.” With \(k = 2{,}048\) samples, GPT-4o generates 2,048 candidate Python programs. Of those, perhaps 50 correctly handle all three demonstration pairs (using numpy connected-component labeling, then color replacement). All 50 are applied to the test input; if 45 produce output A and 5 produce output B, attempt #1 is A and attempt #2 is B.
  • Complexity: dominated by \(k\) LLM generation calls. With GPT-4o at roughly one cent per program completion and \(k = 2{,}048\), raw cost is roughly twenty dollars per task. Adding the execution sandbox and intermediate debugging passes pushes the cost higher in the 50% variant.
  • Hyperparameters: kk (sample count), temperature (sampling diversity), maximum program length, execution timeout per candidate.
  • Failure modes: tasks that require non-standard reasoning steps (color-symbol substitution, in-context rule learning) tend to be unreachable by short Python programs; tasks that require numerical iteration or grid arithmetic are reliably solved.
  • Novelty: [Adapted] from earlier LLM program-synthesis work (Codex, AlphaCode); the load-bearing trick is treating the demonstration-pair-verification as the program filter.
  • Transferability: [Analysis] applies broadly to verifiable few-shot tasks; specifically requires that demonstration-input correctness checking is fast and unambiguous.
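
The execution-as-filter step can be shown with hand-written candidate programs standing in for LLM samples. This is a minimal sketch, not Greenblatt's code: `run_program` and `filter_and_vote` are hypothetical names, the candidate sources are invented for the example, and `exec` is used without any sandboxing, which a real system running model-generated code would need.

```python
from collections import Counter


def run_program(source, grid):
    """Exec a candidate source string and call its transform() on a copy."""
    namespace = {}
    exec(source, namespace)  # assumption: trusted toy input, no sandbox
    return namespace["transform"]([row[:] for row in grid])


def filter_and_vote(sources, demos, x_test):
    """Keep programs that reproduce every demo, then vote on test outputs."""
    votes = Counter()
    for source in sources:
        try:
            if all(run_program(source, x) == y for x, y in demos):
                out = run_program(source, x_test)
                votes[tuple(tuple(row) for row in out)] += 1
        except Exception:
            continue  # syntax or runtime errors silently discard the candidate
    return [[list(row) for row in key] for key, _ in votes.most_common(2)]


REVERSE = "def transform(grid):\n    return [row[::-1] for row in grid]"
SWAP_FIRST_TWO = (
    "def transform(grid):\n"
    "    return [[r[1], r[0]] + r[2:] for r in grid]"
)
IDENTITY = "def transform(grid):\n    return grid"
CRASHES = "def transform(grid):\n    return grid[0][99]"
BAD_SYNTAX = "def transform(grid) return grid"

# Both demos are two columns wide, so REVERSE and SWAP_FIRST_TWO both pass.
demos = [([[1, 0]], [[0, 1]]), ([[2, 3], [4, 5]], [[3, 2], [5, 4]])]
x_test = [[1, 2, 3]]

sources = [REVERSE, REVERSE, SWAP_FIRST_TWO, IDENTITY, CRASHES, BAD_SYNTAX]
attempts = filter_and_vote(sources, demos, x_test)
```

The example is built so that two different rules both survive the filter and disagree on the wider test grid, which is the situation the second attempt exists for.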

ALGORITHM ENTRY 3: Induction-plus-transduction ensemble (Li et al.).

  • Source: Li et al. 2024, arXiv:2411.02272. 10
  • Purpose: combine an inductive program-emitting model with a transductive output-emitting model, leveraging their complementary task-coverage profiles.
  • Inputs: two separately trained models \(M_{\text{ind}}, M_{\text{trans}}\), task \(T\), sample budget per model.
  • Outputs: top-2 candidate outputs.
  • Pseudocode (faithfully reconstructed from the paper’s abstract and the ARC Prize 2024 Technical Report):
function ENSEMBLE_solve(M_ind, M_trans, T, k_ind, k_trans):
  ind_outputs = []
  for i in 1..k_ind:
    p = M_ind.generate(T.demos, sample=True)
    if all(execute(p, x_i) == y_i for (x_i, y_i) in T.demos):
      ind_outputs.append(execute(p, T.x_test))

  trans_outputs = []
  for j in 1..k_trans:
    y = M_trans.generate(T.demos + T.x_test, sample=True)
    trans_outputs.append(y)

  all_outputs = ind_outputs + trans_outputs
  return top_two_by_frequency(all_outputs)
  • Hand-traced example: suppose \(k_{\text{ind}} = 1{,}000\) and \(k_{\text{trans}} = 1{,}000\). On a precise computational task (“multiply each cell value by 2 modulo 10”), \(M_{\text{ind}}\) generates roughly 200 correct programs whose outputs all agree; \(M_{\text{trans}}\) generates a noisy distribution of 1,000 candidates of which maybe 100 match. On a fuzzy perceptual task (“color the central object red”), \(M_{\text{ind}}\) generates few correct programs because the natural-language description is hard to operationalise, but \(M_{\text{trans}}\) generates many close matches. Ensembling captures both regimes.
  • Complexity: dominated by the two sampling loops; roughly twice the per-task budget of the single-strand approaches.
  • Hyperparameters: \(k_{\text{ind}}, k_{\text{trans}}\), sampling temperatures for each model.
  • Failure modes: when both models fail on the same task (especially in-context-symbolic tasks that defeat both program synthesis and direct prediction), the ensemble cannot recover. ARC-AGI-2 tasks fall heavily in this regime.
  • Novelty: [New], the complementarity finding is the paper’s central contribution.
  • Transferability: [Analysis] the induction-transduction split applies to any task with both crisp computational and fuzzy perceptual subfamilies; benchmark-specific tuning required.
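
The pooling step of the ensemble can be sketched as follows. This mirrors the reconstructed pseudocode rather than the Li et al. implementation: `ensemble_top_two` and `freeze` are hypothetical names, the inductive samples are plain Python callables standing in for generated programs, and the transductive samples are hand-written grids standing in for model outputs.

```python
from collections import Counter


def freeze(grid):
    """Hashable form of a list-of-lists grid."""
    return tuple(tuple(row) for row in grid)


def ensemble_top_two(ind_programs, trans_outputs, demos, x_test):
    """Pool verified inductive outputs with raw transductive outputs, then vote."""
    votes = Counter()
    for program in ind_programs:
        try:
            # Inductive candidates must reproduce every demonstration pair.
            if all(program(x) == y for x, y in demos):
                votes[freeze(program(x_test))] += 1
        except Exception:
            continue
    for y in trans_outputs:
        # Transductive candidates cannot be verified, so each counts directly.
        votes[freeze(y)] += 1
    return [[list(row) for row in key] for key, _ in votes.most_common(2)]


def double_mod_10(grid):
    return [[(2 * v) % 10 for v in row] for row in grid]


def add_one(grid):
    return [[(v + 1) % 10 for v in row] for row in grid]


demos = [([[1, 6]], [[2, 2]]), ([[5]], [[0]])]
x_test = [[3, 7]]
trans_outputs = [[[6, 4]], [[6, 5]], [[0, 4]]]  # one correct, two noisy

result = ensemble_top_two(
    [double_mod_10, double_mod_10, add_one], trans_outputs, demos, x_test
)
```

The asymmetry is the point of the design: inductive votes are filtered by execution before they count, while transductive votes rely on frequency alone.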

Section 8: Specialised design contributions

Subsection 8A, LLM and prompt design.

PROMPT ENTRY 1: Greenblatt’s GPT-4o program-synthesis prompt.

  • Source: Greenblatt’s Redwood Research blog post. 11
  • Role: convert the demonstration pairs into a textual description that GPT-4o can use to emit a candidate Python program.
  • Prompt type: few-shot with chain-of-thought scaffolding.
  • Components in order: a textual rendering of the demonstration pairs (each grid serialised as a Python list-of-lists with color names), a description of the task format, an instruction to write a Python function transform(grid) that maps input grids to outputs, and a request for the model to think step-by-step before emitting the function body.
  • Reconstructed template with placeholders [Reconstructed]:
You are an expert at solving abstract pattern puzzles. Each puzzle
shows several input-output grid pairs. Your job: figure out the
transformation rule and write a Python function transform(grid)
that maps input grids to outputs.

Demonstration pair 1:
Input: <INPUT_GRID_1>
Output: <OUTPUT_GRID_1>

Demonstration pair 2:
Input: <INPUT_GRID_2>
Output: <OUTPUT_GRID_2>

... [more pairs] ...

Think step-by-step about what transformation explains all pairs.
Then write a Python function transform(grid) -> grid that
implements this transformation. The function will be tested by
running it on the input grids above; only programs that exactly
reproduce all outputs will be kept.

[Not specified in paper: exact wording of the step-by-step
scaffolding; the blog post shows truncated excerpts.]
  • Failure handling: programs that throw runtime errors or fail to match demonstration outputs are silently discarded.
  • Design rationale: the demonstration-pair verification serves as a hard filter; the LLM does not need to be perfectly accurate, only frequent enough that some sample lands on the right program.
  • Complexity: per Greenblatt, k = 2,048 samples for the 50% variant; cost on the order of $20 per task.
  • Novelty: [Adapted], Codex-style program synthesis is older; the load-bearing trick is the execution-as-filter pattern.
  • Transferability: [Analysis] applies to any task where demonstration-pair verification is cheap.
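
Filling the reconstructed template is mechanical. The sketch below is illustrative only: `COLOR_NAMES` is an assumed integer-to-name mapping (the original post's exact names are not given here), `render_grid` and `build_prompt` are hypothetical helpers, and the header and footer text simply follow the reconstructed template above rather than Greenblatt's actual wording.

```python
# Assumed palette; the exact color names used in the original are not specified.
COLOR_NAMES = {
    0: "black", 1: "blue", 2: "red", 3: "green", 4: "yellow",
    5: "grey", 6: "pink", 7: "orange", 8: "teal", 9: "brown",
}

HEADER = (
    "You are an expert at solving abstract pattern puzzles. Each puzzle\n"
    "shows several input-output grid pairs. Your job: figure out the\n"
    "transformation rule and write a Python function transform(grid)\n"
    "that maps input grids to outputs.\n"
)

FOOTER = (
    "Think step-by-step about what transformation explains all pairs.\n"
    "Then write a Python function transform(grid) -> grid that\n"
    "implements this transformation."
)


def render_grid(grid):
    """Serialise a grid as a Python-style list-of-lists of color names."""
    return repr([[COLOR_NAMES[v] for v in row] for row in grid])


def build_prompt(demos):
    """Assemble header, one block per demonstration pair, and footer."""
    parts = [HEADER]
    for i, (x, y) in enumerate(demos, start=1):
        parts.append(
            f"Demonstration pair {i}:\n"
            f"Input: {render_grid(x)}\n"
            f"Output: {render_grid(y)}\n"
        )
    parts.append(FOOTER)
    return "\n".join(parts)


prompt = build_prompt([([[2, 0]], [[0, 2]]), ([[1]], [[1]])])
print(prompt)
```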

Subsection 8B, Architecture-specific details.

The o1 and o3 architectures are not disclosed by OpenAI. [Reconstructed from secondary coverage] o3 appears to use a chain-of-thought reasoning trace with reinforcement-learned verification heuristics; the high-compute configuration runs 172x the inference compute of the low-compute configuration, suggesting heavy parallel sampling and rejection. The ARC Prize blog 7 states that o3 was trained on the ARC-AGI-1 Public Training set, meaning the system saw the 400 training tasks during its RL fine-tuning phase.

Subsection 8C, Training specifics.

For the 2024 Kaggle winners, training spans three regimes: (i) pre-training on Re-ARC and ARC-Heavy synthetic data (400,000+ procedurally generated ARC-like tasks), (ii) fine-tuning on the 400 public training tasks plus the 400 public evaluation tasks, (iii) test-time training on each held-out task at inference. The ARChitects’ setup ran on a single P100 GPU under 12-hour total budget per the Kaggle compute constraint. 2

Subsection 8D, Inference / deployment specifics.

Not applicable to this paper cluster in the deployment sense: ARC-AGI is a benchmark, not a deployable system. The closest “deployment” question is the cost-per-task figure for o3, which makes any o3-style deployment economically prohibitive at current pricing.

Section 9: Experiments and results

Datasets. From the paper (Chollet 2019): the original ARC release contained 400 training tasks plus 400 evaluation tasks. The 2024 cycle added a 100-task semi-private evaluation set and retained the 100-task private evaluation set used by the Kaggle leaderboard since 2020. All tasks are hand-authored by the ARC team; the public training and evaluation tasks are released as JSON-encoded grid lists.

Baselines. From the 2024 Technical Report 2 : GPT-4o scored approximately 5% on the ARC-AGI-1 Semi-Private Evaluation. Claude 3.5 Sonnet scored roughly 21% per the same report. o1-preview scored a maximum of 32% per Chollet’s December 2024 reporting.

Evaluation metric. Pass@2: the system has at most two attempts per test input; the task counts as solved if any attempt exactly matches the ground-truth output. Score is the fraction of tasks solved.
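
The scoring rule is simple enough to state as code. This is a minimal sketch assuming one test input per task (a simplification for the example); `pass_at_2` and the toy task ids are hypothetical, and grids are compared by exact equality as the metric requires.

```python
def pass_at_2(attempts_by_task, truth_by_task):
    """Fraction of tasks where one of the first two attempts matches exactly."""
    if not truth_by_task:
        raise ValueError("need at least one task")
    solved = 0
    for task_id, truth in truth_by_task.items():
        # Only the first two attempts count; missing tasks have no attempts.
        attempts = attempts_by_task.get(task_id, [])[:2]
        if any(attempt == truth for attempt in attempts):
            solved += 1
    return solved / len(truth_by_task)


truth = {"t1": [[1]], "t2": [[2]], "t3": [[3]], "t4": [[4]]}
attempts = {
    "t1": [[[1]]],                # solved on the first attempt
    "t2": [[[0]], [[2]]],         # solved on the second attempt
    "t3": [[[0]], [[9]], [[3]]],  # correct only on a third, uncounted attempt
}                                 # t4 has no attempts at all
score = pass_at_2(attempts, truth)
print(score)
```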

Main quantitative results, ARC-AGI-1. [Reproduced from ARC Prize 2024 Technical Report and the o3 announcement. 2 7 ]

| System | Eval set | Score | Compute notes |
| --- | --- | --- | --- |
| Random | private | 0% | baseline |
| GPT-4o (static) | semi-private | ~5% | API call only |
| Claude 3.5 Sonnet | semi-private | ~21% | API call only |
| Greenblatt + GPT-4o | public | 42-50% | k=2,048+ samples |
| Akyurek et al. | public | 47.5-62.8% | TTT + program synthesis |
| ARChitects (1st place Kaggle) | private | 53.5% | TTT under P100 + 12h |
| MindsAI (ineligible) | private | 55.5% | closed-source, TTT |
| o1-preview | public | up to 32% | reasoning model |
| o3 (low compute) | semi-private | 75.7% | k=6 samples, under 10k USD total |
| o3 (high compute) | semi-private | 87.5% | k=1,024 samples, 172x cost |
| Human (calibrated panel) | varies | ~85% | reported in foundational paper |

Table reproduced from ARC Prize 2024 Technical Report (arXiv:2412.04604) and the o3 announcement blog, accessed 2026-05-19.

Main quantitative results, ARC-AGI-2. From the ARC-AGI-2 paper 3 :

| System | Semi-private score (May 2025) |
| --- | --- |
| o3 (Medium) | ~3.0% |
| o3-mini (High) | ~3.0% |
| o4-mini (Medium) | ~2.4% |
| Claude 3.7 (8K) | ~0.9% |
| Human (two-attempt panel) | 100% (every task solved by at least two testers) |

Table reproduced from ARC-AGI-2 (arXiv:2505.11831), accessed 2026-05-19.

Ablations. From the 2024 Technical Report 2 : removing test-time training drops the ARChitects’ score from 53.5% to below 11%, confirming that TTT is doing the heavy lifting. Removing program-synthesis ensembling from Akyurek et al. drops the 61.9% combined score back to 53.0%.

Hyperparameter sensitivity. [From the paper:] Akyurek et al. report that increasing TTT steps beyond 20 yields diminishing returns. The ARChitects report that the augmentation-stability validation criterion (selecting outputs that are stable across augmentations) is the load-bearing addition over vanilla TTT.

Qualitative results. [From the paper:] the 2024 Technical Report includes example tasks where program synthesis succeeds and transduction fails, and vice versa, supporting the Li et al. complementarity claim.

Experimental scope limits. [Analysis] The Kaggle 12-hour P100 constraint biases the competition toward small base models (3-8B parameter range); systems that depend on much larger models (frontier closed-source LLMs) cannot compete on the Kaggle leaderboard and instead appear only on the ARC-AGI-Pub leaderboard which has a separate $10k total-compute budget cap.

Independent benchmark cross-checks. [External comparison] The o3 87.5% result was cross-reported by VentureBeat, Techmeme, MarkTechPost, and the CIO magazine coverage of the December 20, 2024 announcement. None of these are fully independent reproductions; all relay the OpenAI-disclosed and ARC-Prize-presented numbers. As of the draft date, no independent third-party reproduction of the 87.5% number has been published. The ARC-AGI-2 results are the most adversarial cross-check: the same o3 system that scored 87.5% on ARC-AGI-1 scores roughly 3% on ARC-AGI-2 per Chollet et al. 2025, 3 suggesting that the ARC-AGI-1 score reflects a combination of genuine reasoning capability and benchmark-specific overfitting from training on the Public Training set.

Evidence audit. [Analysis] The strongly supported claims are: (a) test-time training raises ARC-AGI-1 scores from under 11% to above 50%, supported by multiple independent open-source implementations; (b) induction and transduction solve different task families, supported by Li et al.’s controlled ablations; (c) frontier reasoning models scale accuracy with compute budget on ARC-AGI-1, supported by the o1 and o3 reporting. Partially supported: (d) o3 achieves 87.5% on ARC-AGI-1, supported only by ARC Prize-reported numbers from a single OpenAI-administered test; independent reproduction is not yet available. (e) ARC-AGI-2 is a harder benchmark than ARC-AGI-1, supported by the Chollet et al. 2025 numbers, but the benchmark is only about a year old at the draft date and the long-term trajectory is unclear.

Section 10: Technical novelty summary

| Component | Type | Novelty level | Justification | Source |
| --- | --- | --- | --- | --- |
| Intelligence-as-skill-acquisition-efficiency framework | Theoretical | Fully novel | No prior benchmark in ML had operationalised the priors-vs-experience tradeoff this way; the Solomonoff lineage is acknowledged. | Chollet 2019 |
| The ARC-AGI benchmark | Benchmark | Fully novel | First benchmark designed for developer-aware-novelty under fixed Core Knowledge priors. | Chollet 2019 |
| Core Knowledge priors specification | Conceptual | Combination novel | Drawn from Spelke and Kinzler’s developmental psychology work; applied to ML benchmark design for the first time. | Chollet 2019 |
| Test-time training for ARC | Method | Incrementally novel | TTT itself dates to Sun et al. 2020; the ARC-specific augmentation strategy and leave-one-out objective are new. | Akyurek et al. 2024 |
| Induction-transduction complementarity | Empirical finding | Fully novel | First controlled demonstration that the two paradigms solve disjoint task families even at identical training. | Li et al. 2024 |
| ARC-AGI-2 in-context-symbolic tasks | Benchmark design | Fully novel | Tasks where the meaning of a symbol is defined within the task itself, designed specifically to defeat brute-force chain-of-thought. | Chollet et al. 2025 |

Single most novel contribution. [Analysis] The intelligence-as-skill-acquisition-efficiency framework. Most ML benchmark papers propose a new task; Chollet 2019 also proposes a definition of what the benchmark is supposed to be measuring and an information-theoretic accounting of the priors-vs-experience tradeoff. Without that framing, ARC-AGI would be just another grid-puzzle benchmark; with it, the benchmark becomes a probe of how the field defines progress toward general intelligence.

What the paper cluster does NOT claim to be novel. Test-time training itself (Sun et al. 2020), program synthesis with LLMs (Codex / AlphaCode), chain-of-thought reasoning (Wei et al. 2022), reinforcement learning for reasoning traces (multiple lineages culminating in o1 / o3), augmentation strategies for grid-pattern tasks (older OCR / vision lineage), and the algorithmic-complexity definition of intelligence (Hutter / Legg 2007 precede Chollet 2019 by over a decade and are explicitly cited).

Section 11: Situating the work

What prior work did. [External comparison] Pre-2019 AI benchmarks focused on either single-task skill (ImageNet, GLUE, Atari) or on multi-task aggregation that still leaked into training corpora (BIG-Bench post-2022 had visible contamination). Psychometric intelligence tests (Raven’s Progressive Matrices) measured human fluid intelligence in a way that ML benchmarks did not directly attempt to replicate. Legg and Hutter’s 2007 “Universal Intelligence” framework proposed a Kolmogorov-complexity-based intelligence measure but did not operationalise it into a benchmark.

What this cluster changes conceptually. Chollet 2019 makes the case that ML benchmarks should be designed for developer-aware-novelty and matched-prior comparison; ARC-AGI is the proof of concept. The 2024 Technical Report shows that the benchmark is solvable by test-time-adaptation methods that frontier labs were not yet emphasising; ARC-AGI-2 shifts the bar to compositional generalization that frontier reasoning models still cannot reach.

Contemporaneous related papers. [External comparison] Two contemporary lines: (a) Bober-Irizar and Banerjee 2024, “Neural Networks for Abstraction and Reasoning: Towards Broad Generalization in Machines”, adapted Dreamcoder-style program synthesis to ARC with a domain-specific language called PeARL that encodes visual concepts (shape, color, motion, symmetry, alignment) as composable primitives; differs from Li et al. by emphasising hand-designed DSL rather than learned program induction. (b) Wang et al. 2024 on ConceptSearch (arXiv:2412.07322), uses an LLM to search a program space using concept-graph guidance, achieving substantial gains over flat enumeration; differs from Greenblatt by adding structured search instead of pure sampling. Both papers operate in the same Q3-Q4 2024 window as the ARC Prize 2024 entries and represent the program-synthesis-with-LLM line of research that the prize’s first-place paper award also occupies.

Strongest skeptical objection. [Reviewer Perspective] The objection most often raised on social media after the o3 announcement: o3 was trained on the ARC-AGI-1 Public Training set, which means the 87.5% on the Semi-Private Evaluation is not a true developer-aware-novelty result, because the public training tasks gave the system substantial information about the task family. The ARC Prize blog 7 explicitly discloses this; the publication’s reading is that the result is real but its interpretation requires acknowledging that the priors-vs-experience accounting has been shifted by the training-set inclusion.

Strongest author-side rebuttal. [Reviewer Perspective] The 2024 Technical Report and the ARC Prize team’s framing: humans also see the Public Training tasks before attempting evaluation tasks, so the “training set inclusion” is the human-equivalent priors-setting. The fair comparison is human-after-training-tasks versus model-after-training-tasks, and on that comparison o3 at high compute is roughly level with the human panel (~85% calibrated human versus 87.5% o3), though only with a 172x compute multiplier and with no human-equivalent cost figure to set against it.

What remains unsolved. [Analysis] ARC-AGI-2 sits below 10% for the best public systems; the in-context-symbolic tasks defeat both program synthesis and direct prediction; and compute scaling does not appear to be the obvious solution, because o3 (Medium) was reported at roughly 3% on ARC-AGI-2 in the Chollet et al. 2025 paper. 3

Three future research directions. (a) Test-time training adapted to ARC-AGI-2’s in-context-symbolic tasks: the leave-one-out objective in Akyurek et al. assumes the task transformation is stable across augmentations, which is the assumption that ARC-AGI-2’s symbolic-rebinding tasks deliberately violate. (b) Verifier-guided search where the LLM proposes candidate transformations and a separate verifier model checks compositional consistency; preliminary work in this direction includes the ConceptSearch line. (c) Curricula that decompose ARC-AGI-2 tasks into Chollet’s Core Knowledge primitives plus a learned composition operator; this is the line of work the PeARL DSL points toward, and it remains under-explored at frontier-LLM scale.

Section 12: Critical analysis

Strengths with concrete evidence. (i) The benchmark has demonstrably survived five years of LLM scaling: statically prompted frontier models stayed far below the human baseline through 2024 (GPT-4o at roughly 5%, Claude 3.5 Sonnet at roughly 21%) even as model scale grew by orders of magnitude, supporting the paper’s core claim that scale alone is not enough. (ii) The 2024 Technical Report’s ablations show that removing test-time training collapses scores to the static-prompting ceiling, providing controlled evidence that the breakthrough came from test-time adaptation rather than from scale. (iii) The ARC-AGI-2 follow-through demonstrates that the benchmark family can adapt: when ARC-AGI-1 scores approached the human baseline, the team shipped a harder successor within months rather than declaring victory.

Weaknesses stated by the authors. From the paper (Chollet 2019, conclusion): the Intelligence Index involves uncomputable Kolmogorov-complexity terms, and the practical measurement defaults to the ARC-AGI score as a proxy. The paper acknowledges this gap. From the 2024 Technical Report 2 : the Kaggle leaderboard compute constraint excludes frontier closed-source models, and the resulting top scores may underestimate what current systems can achieve at higher compute budgets.

Weaknesses not stated by the authors. [Reviewer Perspective] (a) The “Public Training set” leak issue: once a system has been trained on the 400 public training tasks plus the 400 public evaluation tasks plus 400,000 synthetic ARC-Heavy variations, the developer-aware-novelty property is degraded for the Semi-Private and Private evaluation sets. The 2024 ARC team responded by adding the Semi-Private set, but the entire family of synthetic generators inevitably leaks task-family priors that are not in the original Core Knowledge specification. (b) The pass@2 scoring is a generous evaluation budget. Frontier-model scores would drop noticeably on pass@1, especially the high-compute o3 configuration where sample-budget scaling is the load-bearing trick.

Reproducibility check.

  • Code: open-source for the 2024 Kaggle winners (the ARChitects’ solution is published per the prize requirement); the foundational ARC-AGI dataset is released at github.com/fchollet/ARC-AGI. The 2024 paper award winners (Akyurek et al., Li et al., Bonnet and Macfarlane) have public repos.
  • Data: ARC-AGI-1 training and evaluation sets are public; private evaluation set is held out by design; ARC-AGI-2 follows the same pattern.
  • Hyperparameters: fully disclosed in the open-source paper award entries; not disclosed for o1, o3 (proprietary).
  • Compute: ARC Prize 2024 Technical Report discloses Kaggle constraints; o3 cost figures are approximate.
  • Trained model weights: 8B base models for the 2024 winners are open-source LLaMA derivatives; o3 weights are proprietary.
  • Evaluation set: public training and evaluation sets are released; semi-private and private are held out.
  • Overall: partially reproducible. The 2024 Kaggle-winning solutions are fully reproducible because the prize required open-source release; the o3 result is not reproducible because the model is closed.

Methodology disclosure.

Methodology (cluster)

  • Sample size: ARC-AGI-1 private eval = 100 tasks; semi-private eval = 100 tasks; ARC-AGI-2 task counts are not fully disclosed in the abstract.
  • Evaluation set: held-out private set never released; semi-private set shown only to specific evaluators; held-out / training-distribution split documented; the inclusion of the Public Training set in o3’s training data is disclosed by ARC Prize.
  • Baselines: GPT-4o, Claude 3.5 Sonnet, o1-preview, o1, o3, multiple Kaggle entries, human calibrated panel.
  • Hardware / compute: Kaggle leaderboard = single P100 GPU + 12h + no internet; ARC-AGI-Pub leaderboard = under $10k total compute; o3 high-compute estimate not publicly disclosed beyond the 172x multiplier on the low-compute configuration.

Generalisability. [Analysis] The Intelligence Index framework generalises to other benchmarks in principle but the practical measurement problem (estimating $\mathrm{PE}$, $\mathrm{E}$, $\mathrm{GD}$) means most users default to the raw score. The test-time-training approach generalises beyond ARC-AGI but its effectiveness depends on the task having natural augmentation symmetries; for text-only reasoning tasks the augmentation set is much weaker.

Assumption audit. Revisiting the Section 3 assumptions: (a) Core Knowledge sufficiency: holds for ARC-AGI-1 by construction; ARC-AGI-2 stretches it because in-context-symbolic tasks require a meta-level prior (the concept that a symbol can be defined within a task), which is not in the original four-prior list. (b) Developer-aware-novelty: degraded for systems trained on the Public Training set; substantially preserved by the held-out private set. (c) Cross-encoder-distillation analog: not applicable here.

What would make the cluster stronger. [Analysis] (i) An independent third-party reproduction of o3’s 87.5% on the Semi-Private set, since the only data point comes from OpenAI-administered runs reported by ARC Prize. (ii) A formal accounting of the priors and experience accumulated by each evaluated system, so the Intelligence Index can be approximated quantitatively rather than left abstract. (iii) A larger Semi-Private and Private evaluation set; 100 tasks is small enough that random fluctuations on a few tasks meaningfully shift the score.
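Point (iii) can be quantified with a standard binomial confidence interval. The sketch below uses the Wilson score interval and assumes tasks are independent with equal weight, which is a simplification; the function name `wilson_interval` is ours.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Same observed accuracy (50%), two evaluation-set sizes.
for n in (100, 1000):
    lo, hi = wilson_interval(n // 2, n)
    print(f"n={n}: 95% interval = [{lo:.3f}, {hi:.3f}]")
```

At 100 tasks and 50% accuracy the interval spans roughly ten percentage points either side of the score, so two systems a few points apart cannot be separated on a single run.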

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: The Intelligence Index formula.

  • What it is: $\mathrm{II}(T) = \mathrm{GD}(T \mid \mathrm{PE}) / (\mathrm{PE} + \mathrm{E}(T))$ as a normative measure of intelligence per task.
  • Why worth reusing: provides a principled framework for comparing two systems with the same raw benchmark score but different training costs.
  • Preconditions: a way to estimate or bound the algorithmic complexity terms; can be done via compression-based proxies (e.g., gzip-length of the system’s parameters and its training set).
  • What would need to change: the operationalisation; the paper itself defaults to raw ARC-AGI scores rather than computed indices.
  • Risks: Kolmogorov complexity is uncomputable; any proxy invites criticism about whether it preserves the formula’s spirit.
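The compression-based proxy mentioned under Preconditions can be sketched as follows. This is an illustration only: it substitutes zlib-compressed length for the uncomputable complexity terms, treats the generalisation-difficulty value `gd` as a number supplied by the caller, and uses pseudo-random bytes as stand-ins for real parameters and training data. None of the names below come from the paper.

```python
import random
import zlib


def compressed_bits(blob: bytes) -> int:
    """Crude upper-bound proxy for description length: bits after zlib compression."""
    return 8 * len(zlib.compress(blob, 9))


def intelligence_index_proxy(gd: float, priors: bytes, experience: bytes) -> float:
    """GD / (PE + E) with PE and E replaced by compressed lengths (an assumption)."""
    return gd / (compressed_bits(priors) + compressed_bits(experience))


def pseudo_random_bytes(n: int, seed: int) -> bytes:
    """Deterministic, poorly compressible stand-in for real artefacts."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Hypothetical systems with the same priors and the same gd,
# differing only in how much experience they consumed.
priors = pseudo_random_bytes(2_000, seed=1)
small_experience = pseudo_random_bytes(1_000, seed=2)
large_experience = pseudo_random_bytes(50_000, seed=3)

lean = intelligence_index_proxy(1.0, priors, small_experience)
heavy = intelligence_index_proxy(1.0, priors, large_experience)
print(f"lean system:  {lean:.3e}")
print(f"heavy system: {heavy:.3e}")
```

The proxy preserves only the ordering the formula is meant to express: at equal generalisation difficulty, the system that needed less experience scores higher. Whether compressed length is a fair stand-in for the original terms is exactly the risk listed above.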

REUSABLE COMPONENT 2: The test-time training pipeline.

  • What it is: leave-one-out fine-tuning of a base LLM on augmented demonstration pairs at inference time.
  • Why worth reusing: lifts ARC-AGI-1 scores from under 11% to above 50%, the largest single methodological gain in the benchmark’s history.
  • Preconditions: a base model that can be fine-tuned within the inference budget; a meaningful augmentation set for the task domain.
  • What would need to change: augmentation-set construction for non-grid domains.
  • Risks: tasks whose transformations interact with the augmentation set lose signal.
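The data-construction half of this pipeline can be sketched in plain Python. The sketch assumes three geometric augmentations (identity, a 90-degree rotation, a horizontal flip) and builds leave-one-out examples from a task's demonstration pairs; the fine-tuning step itself is omitted, and the names `AUGMENTATIONS` and `leave_one_out_examples` are illustrative rather than taken from any published repository.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def rot90(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_h(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


AUGMENTATIONS: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "rot90": rot90,
    "flip_h": flip_h,
}


def leave_one_out_examples(demos: List[Pair]) -> List[dict]:
    """For each augmentation, hold out each demo pair in turn as the query."""
    examples = []
    for name, aug in AUGMENTATIONS.items():
        # Apply the same transform to inputs and outputs so the rule is preserved.
        aug_pairs = [(aug(x), aug(y)) for x, y in demos]
        for i, (query_in, query_out) in enumerate(aug_pairs):
            context = aug_pairs[:i] + aug_pairs[i + 1:]
            examples.append({
                "augmentation": name,
                "context": context,        # remaining demos shown in the prompt
                "query_input": query_in,   # held-out input
                "target": query_out,       # held-out output to train against
            })
    return examples


# Toy task (hypothetical): three demonstration pairs.
demos = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 6], [7, 8]], [[6, 5], [8, 7]]),
    ([[0, 9], [9, 0]], [[9, 0], [0, 9]]),
]
examples = leave_one_out_examples(demos)
print(len(examples), "fine-tuning examples from", len(demos), "demos")
```

Applying one transform to both sides of every pair is what ties the sketch to the Risks bullet: if a task's rule is itself orientation-dependent, the augmented pairs no longer illustrate the same rule.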

REUSABLE COMPONENT 3: The induction-transduction-ensemble pattern.

  • What it is: train one neural model to emit programs and another to emit outputs directly, then ensemble.
  • Why worth reusing: empirically pushes scores toward human level by covering complementary task families.
  • Preconditions: a verifier that can check program execution; a way to detect agreement between the two model families.
  • What would need to change: program-emitting model needs a DSL or executable target language.
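One simple way to wire the two model families together is to try induction first and fall back to transduction when no candidate program reproduces the demonstrations. The sketch below assumes that ordering; candidate programs are plain Python callables, and `transduction_predict` is a hypothetical stand-in for the direct-output model.

```python
from typing import Callable, List, Sequence, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def ensemble_predict(demos: Sequence[Pair],
                     test_input: Grid,
                     candidate_programs: Sequence[Program],
                     transduction_predict: Callable[[Sequence[Pair], Grid], Grid],
                     ) -> Tuple[Grid, str]:
    """Return (prediction, source), where source is 'induction' or 'transduction'."""
    for program in candidate_programs:
        try:
            # The verifier: a program is accepted only if it reproduces every demo.
            if all(program(x) == y for x, y in demos):
                return program(test_input), "induction"
        except Exception:
            continue  # crashing candidates are simply discarded
    return transduction_predict(demos, test_input), "transduction"


# Toy setup (hypothetical): the hidden rule is a left-right mirror.
def mirror(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def echo_model(demos: Sequence[Pair], test_input: Grid) -> Grid:
    """Placeholder transduction model that just copies the input."""
    return [row[:] for row in test_input]


demos = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]
test_grid = [[7, 8, 9]]

with_program = ensemble_predict(demos, test_grid, [lambda g: g, mirror], echo_model)
without_program = ensemble_predict(demos, test_grid, [lambda g: g], echo_model)
print(with_program)
print(without_program)
```

The program check acts as the verifier named in the Preconditions bullet; the transduction path has no such check, which is why it is used only as a fallback here.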

Dependency map. The Intelligence Index sits at the top of the conceptual hierarchy; the ARC-AGI benchmark operationalises it; TTT and program-synthesis methods solve the operationalised benchmark; the o3 reasoning approach scales test-time compute on top of all of the above; ARC-AGI-2 redefines the bar after ARC-AGI-1 was approached.

Recommendation. [Analysis] The highest-value reusable component for new studies is the test-time-training pipeline. It transfers more readily than the Intelligence Index (which has uncomputable inputs) and more cleanly than the ensemble pattern (which requires two trained models). For a team building reasoning benchmarks for new domains, the recommended starting position is to specify the priors carefully (Chollet’s framing), build the benchmark around developer-aware-novelty (ARC-AGI’s design pattern), and evaluate baselines under both static prompting and test-time-adaptation conditions to surface the gap that TTT closes.

Section 14: Known limitations and open problems

Stated limitations. Per Chollet 2019: the Intelligence Index has uncomputable components; the Core Knowledge specification is informal rather than formal. Per the 2024 Technical Report 2 : the Kaggle compute constraint excludes frontier closed-source models from the main competition; the Public Training set inclusion in some systems’ training data complicates the developer-novelty claim.

Not stated by the authors. [Reviewer Perspective] (a) The 100-task evaluation-set size is small; per-task variance is significant. (b) The pass@2 scoring is generous; pass@1 would tell a different story for high-sample-count systems. (c) The cost-per-task figure for o3 makes the 87.5% economically unreachable for almost all practical deployments; by the publication’s reading, this is a benchmark-versus-deployment gap that the paper cluster does not yet address.

Open problems. (i) Compositional generalization at scale: ARC-AGI-2 currently sits below 10% for all systems, and no published technique looks like it will close that gap soon. (ii) In-context-symbolic reasoning: defining the meaning of a symbol within a single task and using that definition consistently downstream is a primitive that frontier reasoning models do not exhibit reliably. (iii) Cost-aware scoring: the field still reports accuracy without a parallel cost-per-task figure, which means systems like o3 at 172x compute can claim records without acknowledging the compute differential.
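One possible shape for cost-aware scoring is to report every result as an (accuracy, cost-per-task) pair and keep only the entries that are not dominated on both axes. The sketch below does that; the systems and numbers are invented for illustration and do not correspond to any published leaderboard entry.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    system: str
    accuracy: float            # fraction of tasks solved
    cost_per_task_usd: float   # average spend per task


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on accuracy and cost at once."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost_per_task_usd <= r.cost_per_task_usd
            and (o.accuracy > r.accuracy or o.cost_per_task_usd < r.cost_per_task_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_task_usd)


# Hypothetical entries, not real measurements.
results = [
    Result("system-A", accuracy=0.30, cost_per_task_usd=1.0),
    Result("system-B", accuracy=0.25, cost_per_task_usd=5.0),    # worse and pricier than A
    Result("system-C", accuracy=0.60, cost_per_task_usd=100.0),
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.system}: {r.accuracy:.0%} at ${r.cost_per_task_usd:.2f}/task")
```

Under this reporting style a high-compute record still appears, but next to its cost, so a reader can see which cheaper systems it does and does not displace.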

What a follow-up paper would need to solve. [Analysis] The clearest single-paper opportunity is a system that scores meaningfully above 30% on ARC-AGI-2 within a Kaggle-style compute budget. Such a system would need to combine (a) test-time training at the meta-level, adapting not just to the task transformation but to the symbolic vocabulary introduced within the task; (b) program synthesis with in-context type inference, so candidate programs can use task-defined symbols; (c) a verifier that checks compositional consistency, not just demonstration-pair match.

How this article reads at three depths

For the curious high-school reader. ARC-AGI is a benchmark of grid puzzles built to measure how cleverly an AI system can figure out new patterns rather than how much training it has done. For five years no AI could solve even half of it; in December 2024 OpenAI’s o3 model finally crossed 85% on the version called ARC-AGI-1 by spending enormous compute on each puzzle. Chollet’s team responded with ARC-AGI-2, which the same models cannot yet solve; humans still beat machines on it by a wide margin.

For the working developer or ML engineer. The cluster is the canonical reference for “scale alone does not measure reasoning.” Static-prompting LLMs cap at roughly 11% on ARC-AGI-1, but test-time training plus augmentation-stability voting reaches 53.5% on the private set under a Kaggle 12-hour P100 budget; program synthesis with LLM-emitted Python plus execution filtering reaches 42-50% on the public eval. If you are building a reasoning-heavy system, the practical takeaway is that test-time adaptation matters more than base-model scale in this regime, that induction and transduction solve disjoint subsets of tasks and benefit from ensembling, and that any “AGI proximity” claim that does not show ARC-AGI-2 numbers should be discounted. The o3 87.5% number is real but cost roughly $3,000+ per task at high compute; it does not generalise to ARC-AGI-2 where the same model scores around 3%.

For the ML researcher. The cluster’s load-bearing technical contributions are: (a) the formal definition of intelligence as skill-acquisition efficiency, operationalised as $\mathrm{II}(T) = \mathrm{GD}/(\mathrm{PE} + \mathrm{E})$, uncomputable but conceptually rigorous; (b) the test-time training pipeline with leave-one-out objective on augmented demonstration pairs, the load-bearing trick behind the 2024 jump from 33% to 55%; (c) the induction-transduction complementarity finding from Li et al., controlled evidence that the two paradigms cover disjoint task families even at identical training; (d) ARC-AGI-2’s compositional and in-context-symbolic task design, a deliberate orthogonalisation against frontier reasoning models. The strongest objection is that o3 was trained on the Public Training set, so the 87.5% on the Semi-Private set is not a pure developer-novelty result; the strongest rebuttal is that humans also see the training tasks. A follow-up paper that closes the ARC-AGI-2 gap with a Kaggle-budget compute footprint would be the most impactful next step.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
