Toolformer, Gorilla and the Berkeley Function Calling Leaderboard: A Multi-Paper Review of Tool Learning


19 May 2026 (updated 19 May 2026)

Section 1: Paper identity and scope

This review covers three artefacts that together define how the field thinks about teaching language models to call external tools.

Primary papers (full venue):

Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom, Toolformer: Language Models Can Teach Themselves to Use Tools (NeurIPS 2023, arXiv:2302.04761).¹
Patil, Zhang, Wang, Gonzalez, Gorilla: Large Language Model Connected with Massive APIs (arXiv:2305.15334, NeurIPS 2024).²
Patil, Mao, Yan, Ji, Suresh, Stoica, Gonzalez, The Berkeley Function Calling Leaderboard: From Tool Use to Agentic Evaluation of Large Language Models (ICML 2025).³

Retrieval confirmation. Toolformer and Gorilla were fetched at writer-time from the arXiv abstract pages plus ar5iv HTML renders on 2026-05-19; the multi-source rule was satisfied because the ar5iv render exposed full text, equations, and tables, and the NeurIPS proceedings pages added no detail beyond the arXiv preprint. The BFCL ICML 2025 paper was fetched from OpenReview on the same day; the live leaderboard and the GitHub repository were checked for current top-of-leaderboard scores and the V4 category list. Supplementary material for Toolformer (Appendix A-E with extended prompt templates and ablations) was retrieved from ar5iv. Gorilla’s appendix (Sec. 8.1 and 8.2 in arXiv pagination) covering hyperparameters and APIBench statistics was retrieved.

Paper classification (multi-paper cluster): LLM-based · Training method · Inference method · Data-driven · Application (tool use, function calling) · Benchmark (BFCL).

One-paragraph technical abstract (publication voice). The three papers in this cluster address a single question: how does an autoregressive language model decide when, what and how to call an external tool, and how is that decision graded by the community? Toolformer treats tool use as a token-level supervision problem solved by self-supervision: the model generates candidate API calls, executes them, and keeps the ones that lower the perplexity of subsequent tokens. Gorilla treats tool use as an instruction-tuning problem over a curated catalogue of real-world APIs, with a retriever feeding documentation snippets into the prompt at both training and inference time. BFCL standardises evaluation through a multi-category benchmark with Abstract Syntax Tree (AST) and executable scoring across single-turn, parallel, multi-turn, and agentic settings. Together they trace an arc from “the model invents tool calls during pretraining-style fine-tuning” to “the model is fine-tuned against a curated API catalogue” to “the field grades all of these on the same yardstick.”

Primary research question (cluster-level). What is the minimal supervision signal that lets a language model learn to use external tools usefully, and what evaluation framework lets the community compare progress?

Core technical claim (cluster-level). A modest amount of self-supervised filtering against a perplexity-reduction criterion (Toolformer) suffices for a small set of well-defined APIs, but scaling to thousands of real-world APIs requires a retriever-augmented training pipeline (Gorilla); progress on either is meaningful only against a standardised benchmark that scores parsable structure, executable correctness, and multi-turn state (BFCL).

Core technical domains. Self-supervised learning (deep on Toolformer), instruction tuning (deep on Gorilla), retrieval-augmented generation (moderate), AST-based program analysis (moderate on BFCL), benchmark methodology (deep on BFCL).

Reader prerequisites. High-school algebra. Familiarity with autoregressive language models, cross-entropy loss, and the idea of fine-tuning is helpful but not required, because the Glossary covers them.

Section 2: TL;DR and executive overview

TL;DR (3 sentences). Tool learning is the field of teaching language models to call external functions such as calculators, search engines, or any web API. Toolformer (2023) showed that a model can teach itself to do this by inserting candidate calls into its own training text and keeping the ones that help it predict what comes next, and Gorilla (2023) showed that fine-tuning on a curated catalogue of 1,645 real APIs with a retriever beats GPT-4 at API-calling accuracy in zero-shot tests. The Berkeley Function Calling Leaderboard (BFCL, 2024-2026) is now the standard scoreboard the field uses to compare function-calling models.

One-paragraph executive summary. Modern AI agents work by reasoning over external tools (web search, code interpreters, vendor APIs). The papers in this cluster set the foundations. Toolformer figured out the training signal: keep an API call if and only if it reduces the perplexity of the text that follows it, scoring API usefulness in the same units as language modelling itself. Gorilla figured out the data: build a catalogue of real API documentation, generate instruction-call pairs, fine-tune a small LLaMA-7B model with a retriever in the loop, and beat GPT-4 on a held-out evaluation. BFCL figured out the yardstick: parse generated calls into an AST, match against ground truth, optionally execute, and aggregate across single-turn, parallel, and multi-turn categories. Together they define what “function calling” means in 2026 evaluation reports.

Five practitioner-relevant takeaways:

For small, well-defined tool sets (calculator, calendar, search), Toolformer-style self-supervised filtering against a perplexity-reduction criterion is a viable data-generation pipeline; no human annotation is required.
For large, evolving API catalogues, Gorilla-style retriever-augmented fine-tuning is the production default; the retriever lets you swap API documentation without retraining the model.
The Gorilla “hallucination versus error” distinction (calling an API not in the catalogue versus calling an existing API incorrectly) is the right diagnostic split when debugging an agent.
BFCL’s AST evaluation is the single most useful scoring tool when iterating on a custom function-calling model; it does not require sandboxed execution and scales to thousands of functions.
Single-turn function-calling accuracy is largely solved by frontier and strong open models; the unsolved frontier per BFCL V3+ is multi-turn state, memory, and long-horizon tool sequences.

Pipeline overview in text.

Training-time (Toolformer). Take a base LM. For each tool, write 5-15 in-context examples. Sample candidate API calls into a large unlabeled corpus. Execute the calls. Keep a call at position $i$ only if executing it reduces the LM’s loss on the next $N$ tokens by at least $\tau_f$ nats. Fine-tune the LM on the filtered corpus, where API calls now appear inline as special token sequences.
Training-time (Gorilla). Take LLaMA-7B. Curate APIBench from HuggingFace, TorchHub, TensorFlow Hub. For each API, synthesise instruction-call pairs via Self-Instruct. Optionally prepend a retrieved documentation snippet to each prompt. Fine-tune for 5 epochs on 8x A100 GPUs.
Inference-time (both). The fine-tuned model emits tool calls inline; an outer harness intercepts them, executes the call, returns the result, and the model continues generating.
Evaluation (BFCL). Run the model on the BFCL prompt corpus. Parse each output into an AST. Subtree-match it against the ground-truth call. Aggregate accuracy across categories.
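
The inference-time step in the overview (the harness intercepts a call, executes it, and hands the result back) can be sketched as below. This is a minimal illustration, not the papers' released code: the `[Calculator(...)` inline syntax, the `fake_model` stand-in for a fine-tuned LM, and the helper names are all assumptions made for the example.

```python
import ast
import operator
import re

# Hypothetical inline syntax: the model writes "[Calculator(<expr>)" and pauses.
CALL_PATTERN = re.compile(r"\[Calculator\(([^)]*)\)$")

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def fake_model(text: str) -> str:
    """Stand-in for the fine-tuned LM: returns the next chunk of text."""
    if text.endswith("is "):
        return "[Calculator(13 * 17)"   # the model decides to call the tool
    if text.endswith("] "):
        return "221."                   # the model continues after the result
    return ""                           # nothing more to generate


def generate_with_tools(prompt: str, max_steps: int = 4) -> str:
    text = prompt
    for _ in range(max_steps):
        chunk = fake_model(text)
        if not chunk:
            break
        text += chunk
        call = CALL_PATTERN.search(text)
        if call:                        # the harness intercepts the pending call
            result = safe_eval(call.group(1))
            text += f" -> {result:g}] "  # splice the result back into the context
    return text


output = generate_with_tools("The product of 13 and 17 is ")
print(output)
```

The arithmetic is evaluated by walking a parsed expression tree rather than by `eval`, since model-generated arguments should be treated as untrusted input.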

Section 2.5: Glossary

Term	Plain-English explanation	First appears in
Language model (LM)	A model that predicts the next word given the words so far; the probability it assigns to the correct next word is its prediction quality.	Section 1
Perplexity	A way of summarising how surprised the model is by the actual next word; lower is better. Mathematically it is the exponential of the average loss.	Section 5
Loss function	A number that measures how wrong the model’s predictions are; the algorithm minimises this number.	Section 6
API / tool / function call	A request the model emits to an external system to do something (compute, look up, translate); the result is fed back into the model’s context.	Section 1
Self-supervised	A training signal generated from the data itself, with no human labels; Toolformer uses the model’s own loss change as its signal.	Section 5
Fine-tuning	Continued training of a pretrained model on a smaller, task-specific dataset.	Section 5
Retrieval-augmented	A pattern where the model receives, in its prompt, documentation fetched from an external index before generating an answer.	Section 5
Hallucination (Gorilla sense)	An API call to a function that does not exist in the catalogue; distinct from an incorrectly-parameterised call to a real function.	Section 6
Abstract Syntax Tree (AST)	A tree representation of a piece of code that ignores formatting and captures the call structure (function name, arguments, nesting).	Section 6
Subtree match	A check that the AST of the generated call is contained as a subtree of the AST of the ground-truth call.	Section 6
Multi-turn	An evaluation setting where the model takes several conversational turns, each potentially issuing tool calls, with state persisting between turns.	Section 9
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the paper proves.	Section 11 + 12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the paper only partially disclosed it.	Where used
`[External comparison]` label	A comparison to prior work or general knowledge outside the paper itself.	Section 4 + 11

The “From the paper:” prefix marks claims directly supported by the paper’s text, equations, tables, or figures.

Paper A — Toolformer (Schick et al., 2023)

A.3 Problem formalisation

Notation.

Symbol	Type	Meaning	First appears in
$\mathcal{C}$	corpus	Unlabeled text corpus used as training material.	A.3
$\mathbf{x} = (x_1, \ldots, x_n)$	token sequence	A passage of natural language with $n$ tokens.	A.3
$c(a, i)$	API call	A call to API $a$ inserted between tokens $x_{i-1}$ and $x_i$.	A.3
$r$	API response	The string returned by executing the call.	A.3
$e(c, r)$	linearised call	The textual representation of a call together with its response, embedded in the sequence.	A.5
$L_i^+$	loss	Cross-entropy of the model on tokens $x_i, \ldots, x_n$ when conditioned on the linearised call $e(c, r)$.	A.6
$L_i^-$	loss	Minimum of the model’s loss on the same tokens conditioned on (a) no API call and (b) the call without its result.	A.6
$\tau_f$	scalar	Filtering threshold in nats.	A.6
$w_t$	weight	Position-dependent weight emphasising tokens near the call site.	A.6

Formal problem statement (Toolformer). Given a base LM with parameters $\theta$ and a small set of APIs $\mathcal{A} = \{a_1, \ldots, a_K\}$, augment an unlabeled training corpus $\mathcal{C}$ with API calls so that fine-tuning the LM on the augmented corpus yields a model that, at inference time, emits API calls inline when they help predict subsequent tokens. The augmentation function must be self-supervised (no human-annotated calls) and tool-agnostic (the same procedure applies to any API exposing a text-in, text-out interface).

Assumptions.

Each API is a deterministic function from a text query to a text response. (From the paper:, Section 2.) [Analysis] Potentially strong assumption for stateful APIs such as multi-step web search.
The base LM is strong enough to emit syntactically-plausible API calls from a handful of in-context demonstrations. (From the paper:, Section 2.1.)
A position-level perplexity-reduction signal is a reasonable proxy for “this API call was useful.” [Analysis] Strong assumption: a call whose benefit does not show up as lower loss on the tokens immediately after it (for example, one that only pays off several sentences later) will not register under this signal.

Structural reason the problem is hard. Without ground-truth API call locations, the search space is combinatorial: for a corpus of $|\mathcal{C}|$ tokens and $K$ APIs, the number of (position, API, argument) triples is at least $K \cdot |\mathcal{C}|$ before considering argument space. The paper’s contribution is reducing this to a per-position filtered set via the self-supervised criterion.

Role of the LM. The base LM is both the data generator (sampling candidate calls) and the supervision signal (its loss change scores each candidate). After filtering, the LM is the model being fine-tuned.

A.4 Motivation and gap

Real-world problem with concrete example. A 6.7B-parameter LM is strong at language modelling but answers “What is 13 times 17?” with a confidently-stated wrong number. Tool use solves this by letting the model emit Calculator(13 * 17) -> 221 instead of guessing.

Existing approaches and failure modes. Pre-Toolformer tool use relied on (a) hand-annotated traces (Komeili, Shuster, Weston 2022; Thoppilan et al. 2022; cited in Toolformer Section 5) which are expensive to produce at scale, or (b) reinforcement-learning loops with a reward signal that is hard to define and slow to converge.

Gap. No prior work showed that the LM’s own perplexity could supply the supervision signal for tool use; Toolformer is the first to demonstrate that “did this call help me predict the next words?” suffices.

Practical stakes. If self-supervised tool use works, the cost of adding a new tool collapses from “annotate thousands of traces” to “write five in-context examples.” [External comparison] The same logic motivated the contemporaneous wave of self-instruct and constitutional-AI methods.

[External comparison] Position in the broader landscape. ReAct⁹ framed tool use as a reasoning-and-acting interleave at inference time; Toolformer is the training-time complement.

A.5 Method overview

Three-step pipeline (From the paper:, Section 2):

Sample API calls. For each API $a$, prompt the LM in-context with 5-15 demonstrations of when to call $a$ and what arguments to pass. At each position $i$ in a passage, compute the probability the LM assigns to the API-call start token; keep the positions where it is at least $\tau_s$ (at most the top $k$ of them), and draw up to $m$ candidate calls at each kept position.
Execute API calls. Run each candidate call through the actual API and collect the response $r$.
Filter API calls. Compute the loss with the call and result, the loss without the call, and the loss with the call but no result; retain the call only if the loss reduction exceeds $\tau_f$.
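
The position-selection part of the sampling step can be sketched as follows. This is an illustration under stated assumptions: the per-position API-start probabilities are passed in as a plain list rather than computed by an LM, $k$ is read as a cap on the number of positions kept per passage, and the function name and toy numbers are hypothetical.

```python
from typing import List


def select_positions(api_start_probs: List[float], tau_s: float, k: int) -> List[int]:
    """Return up to k positions (0-based) whose API-start probability is >= tau_s."""
    eligible = [(p, i) for i, p in enumerate(api_start_probs) if p >= tau_s]
    # Highest probability first; ties are broken in favour of the earlier position.
    eligible.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in eligible[:k])


# Toy probabilities for a 9-token passage; index 7 is the slot just before "221".
probs = [0.00, 0.01, 0.00, 0.02, 0.00, 0.03, 0.06, 0.40, 0.01]
chosen = select_positions(probs, tau_s=0.05, k=1)
print(chosen)
```

Raising `k` admits more call sites per passage at the cost of more sampling and execution work downstream.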

Plain-English intuition. The LM is asked, “If you were free to call a calculator right here, what would you ask it?” It tries a few options. Each is executed. The ones whose result actually helps the LM predict the next sentence are kept; the rest are thrown away. The kept ones become training examples.

Design rationale. The filtering criterion ties the supervision signal directly to the language-modelling objective; the augmented data trains the model to insert calls that, by construction, reduce its own loss.

What breaks if removed. Without filtering, the corpus is overwhelmed with low-quality calls (the model invents implausible queries); without sampling guided by in-context demonstrations, the calls drift away from API-syntax-valid forms.

Classification. [New] for the self-supervised filtering criterion; [Adopted] for in-context sampling (a standard 2022 pattern) and for the fine-tuning step, which applies a standard language-modelling objective to the augmented corpus.

A.6 Mathematical contributions

MATH ENTRY 1: Weighted cross-entropy at position $i$

Source: Toolformer Section 2.2, Eq. 1.
What it is: The model’s loss on the tokens starting at position $i$, where tokens immediately after the API call site count more than tokens further away.
Formal definition:

$L_i(\mathbf{z}) = -\sum_{j=i}^{n} w_{j-i} \cdot \log p_\theta(x_j \mid \mathbf{z}, x_{1:j-1})$

Each term explained:
- $\mathbf{z}$ is a prefix that may include the linearised API call (or not, depending on which loss is being computed); type: token sequence.
- $x_j$ is the next-token target at position $j$; type: token id.
- $p_\theta(\cdot \mid \cdot)$ is the LM’s next-token distribution given parameters $\theta$; type: probability over the vocabulary, so a vector of length $|V|$ (vocabulary size, $\sim$50k).
- $w_{j-i}$ is a position weight that decays with distance from $i$; the paper uses $w_t = \tilde{w}_t / \sum_s \tilde{w}_s$ with $\tilde{w}_t = \max(0, 1 - 0.2 \cdot t)$, giving $w_0 = 1/W, w_1 = 0.8/W, \ldots, w_5 = 0$ (after normalisation by $W = \sum_t \tilde{w}_t$).
- The outer sum runs from $j=i$ (the token right after the call) to at most $j = i + 4$ in practice because $w_t$ zeroes out after $t=4$.
Worked numerical example. Take a passage where positions $i, i+1, i+2$ have ground-truth tokens “is”, “221”, “.”. Suppose the LM, given no API call, places probabilities $0.6, 0.001, 0.8$ on those three tokens (it guesses “is” and “.” easily but cannot predict “221”). Then $-\log 0.6 \approx 0.51$, $-\log 0.001 \approx 6.91$, $-\log 0.8 \approx 0.22$. With weights $w_0 = 1/2.4 \approx 0.417, w_1 = 0.8/2.4 \approx 0.333, w_2 = 0.6/2.4 = 0.25$ (using $\tilde{w}_0=1, \tilde{w}_1=0.8, \tilde{w}_2=0.6$ and, for this three-token illustration, normalising over only those three weights so that $W = 2.4$ rather than the full-window value of 3.0), the weighted loss is $0.417 \cdot 0.51 + 0.333 \cdot 6.91 + 0.25 \cdot 0.22 \approx 0.21 + 2.30 + 0.055 = 2.57$. Now suppose the same LM is conditioned on Calculator(13 * 17) -> 221. Probabilities on the three target tokens become $0.6, 0.7, 0.8$ (the call result lets it predict “221” easily). Losses are $0.51, 0.36, 0.22$; weighted total $\approx 0.21 + 0.12 + 0.055 = 0.39$. The loss drop relative to the no-call baseline is $L_i(\epsilon) - L_i^+ = 2.57 - 0.39 = 2.18$ nats, which clears $\tau_f = 1.0$ nat (the paper’s threshold for several tools). MATH ENTRY 2 tightens the baseline to $L_i^-$, the lower of the no-call and call-without-result losses.
Role: defines the per-position score the filter compares against $\tau_f$.
Edge cases: when the LM is already confident about the next tokens with no call, $L_i^-$ is small, the loss reduction is small, and the call is rejected (correctly, because it did not help).
Novelty: [New] for the weighted form tied to the call site; the weighting design appears in Toolformer Eq. 2.
Transferability: [Analysis] The weighted-loss filter transfers to any tool whose response is short relative to the downstream context.
Why it matters: it gives a calibrated, language-modelling-unit score for tool usefulness without external annotation.
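
The weighted loss and the worked numbers in this entry can be checked with a few lines of Python. This is a sketch, not the paper's implementation: the function names are hypothetical, the per-token probabilities are the illustrative values from the worked example, and the weights are normalised over the tokens actually present, which is the convention the worked example uses.

```python
import math
from typing import List, Sequence


def position_weights(num_tokens: int) -> List[float]:
    """w_t proportional to max(0, 1 - 0.2 t), normalised over the tokens available."""
    raw = [max(0.0, 1.0 - 0.2 * t) for t in range(num_tokens)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_loss(token_probs: Sequence[float]) -> float:
    """Weighted negative log-likelihood (in nats) of the tokens after the call site."""
    weights = position_weights(len(token_probs))
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))


no_call = weighted_loss([0.6, 0.001, 0.8])   # "is", "221", "." without any call
with_call = weighted_loss([0.6, 0.7, 0.8])   # same tokens after Calculator -> 221
print(round(no_call, 2), round(with_call, 2), round(no_call - with_call, 2))
```

Because the weights decay linearly, a token five or more positions past the call site contributes nothing, so only the immediate continuation decides the score.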

MATH ENTRY 2: The filtering criterion

Source: Toolformer Section 2.2, Eq. 3.
What it is: The rule that decides whether to keep a candidate API call.
Formal definition:

$L_i^- - L_i^+ \ge \tau_f$

with

$L_i^+ = L_i(e(c, r)), \quad L_i^- = \min\bigl(L_i(\epsilon), L_i(e(c, \epsilon))\bigr)$

Each term explained:
- $L_i^+$ is the loss when the call $c$ and its response $r$ are both injected into the prefix.
- $L_i(\epsilon)$ is the loss with no API call at all (the original passage).
- $L_i(e(c, \epsilon))$ is the loss when the call appears but its response is suppressed (replaced by empty).
- The $\min$ in $L_i^-$ is the conservative choice: the call must beat the better of “no call” and “call but no result,” ensuring the result itself contributes the value.
- $\tau_f$ is a per-tool threshold in nats; the paper reports $\tau_f$ values ranging from 0.05 (Calendar, easy) to 1.0 (Calculator, hard).
Worked numerical example. Continuing the calculator example: $L_i(\epsilon) = 2.57$ (the no-call loss above), $L_i(e(c, \epsilon)) = 2.4$ (the call appears in the prefix but the model does not get to see “221” as the answer), $L_i^+ = 0.39$. Then $L_i^- = \min(2.57, 2.4) = 2.4$, and $L_i^- - L_i^+ = 2.01 \approx 2.0$ nats. With $\tau_f = 1.0$, the call is retained. If a competing candidate had been Calculator(11 * 19), whose result “209” is arithmetically correct but irrelevant to the passage, its $L_i^+$ would be close to $L_i(\epsilon)$, the difference would be near zero, and it would be filtered out.
Role: the entire data-augmentation pipeline reduces to applying this criterion at every position for every candidate call.
Edge cases: if $L_i(e(c, \epsilon)) < L_i(\epsilon)$, the call text itself acts as a useful cue independent of its result; the $\min$ then takes this lower loss as the baseline, so the call is credited only for the further reduction its result provides.
Novelty: [New] in this exact form.
Transferability: [Analysis] The criterion transfers to any tool whose response materially changes next-token predictability; tools whose outputs encode information the LM already knows (e.g., a thesaurus query for a common-word synonym) will fail to clear $\tau_f$.
Why it matters: it is the load-bearing object of the paper; the entire data pipeline reduces to this inequality.
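
The inequality in this entry reduces to a three-input predicate. The sketch below uses the illustrative losses from the worked example; the function name `keep_call` is hypothetical, and the second call's loss of 2.45 is an assumed value for an unhelpful candidate.

```python
def keep_call(loss_no_call: float, loss_call_no_result: float,
              loss_call_with_result: float, tau_f: float) -> bool:
    """Apply L_i^- - L_i^+ >= tau_f with L_i^- = min(no call, call without result)."""
    l_minus = min(loss_no_call, loss_call_no_result)
    l_plus = loss_call_with_result
    return (l_minus - l_plus) >= tau_f


# Calculator example: 2.57 (no call), 2.4 (call, no result), 0.39 (call + result).
helpful = keep_call(2.57, 2.4, 0.39, tau_f=1.0)
# Assumed unhelpful candidate: its result leaves the loss essentially unchanged.
unhelpful = keep_call(2.57, 2.4, 2.45, tau_f=1.0)
print(helpful, unhelpful)
```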

A.7 Algorithmic contributions

ALGORITHM ENTRY 1: Toolformer data augmentation (headline)

Source: Toolformer Algorithm 1 (Section 2; full pseudocode in Appendix A.5 / A.6).
Purpose: produce a tool-augmented training corpus from an unlabeled corpus, a small API set, and a base LM.
Inputs: corpus $\mathcal{C}$ (token sequences); API set $\mathcal{A}$; base LM with parameters $\theta$; per-tool prompt $P_a$ with 5-15 demonstrations; sampling threshold $\tau_s$; filtering threshold $\tau_f$; sample budget $m$.
Outputs: augmented corpus $\mathcal{C}^*$ where some positions carry inline linearised API calls.

for each passage x = (x_1, ..., x_n) in C:
  insertions = {}                      # position i -> linearised call e(c, r)
  for each API a in A:
    # Step 1: collect positions where the LM is likely to start a call
    positions = []
    for i in 1..n:
      if p_theta(<API> | P_a, x_{1:i-1}) >= tau_s:
        positions.append(i)
    # Step 2: sample and execute candidate calls at each position
    for i in positions:
      if i in insertions: continue     # one call per position
      candidates = sample(LM, P_a, x_{1:i-1}, num_samples=m)
      for c in candidates:
        r = execute(a, c)
        Lplus  = L_i(e(c, r), x_{i:n})
        Lminus = min(L_i(epsilon, x_{i:n}),
                     L_i(e(c, epsilon), x_{i:n}))
        # Step 3: filter
        if Lminus - Lplus >= tau_f:
          insertions[i] = e(c, r)
          break                        # keep the first call that passes
  # Apply insertions right to left so that earlier indices stay valid
  for i in sorted(insertions, descending):
    insert insertions[i] between x_{i-1} and x_i in x
  emit augmented passage into C*

Hand-traced example on minimal input. Take $\mathbf{x}$ = “The product of 13 and 17 is 221.” with $n=9$ tokens and $\mathcal{A}$ = {Calculator}. At $i=8$ (just before the token “221”), the API-start probability under the calculator prompt clears $\tau_s = 0.05$. Sample $m=5$ candidates: Calculator(13 * 17), Calculator(13 + 17), Calculator(17 - 13), Calculator(221 / 13), Calculator(13 * 7). Execute: results 221, 30, 4, 17, 91. Compute $L_i^+$ for each, reusing the illustrative numbers from MATH ENTRY 2 for the first candidate: $L_i^+ \approx 0.39$ against $L_i^- = 2.4$, a loss drop of about 2.0 nats. For the second, $L_i^+ \approx 2.45$; assuming its call-without-result loss does not undercut the no-call loss of 2.57, $L_i^- = 2.57$ and the drop is 0.12 nats. The remaining candidates give similar near-zero drops. Filter at $\tau_f = 1.0$: only the first candidate is kept. Insert [Calculator(13 * 17) -> 221] between “is” and “221” in $\mathbf{x}$. The augmented passage becomes “The product of 13 and 17 is [Calculator(13 * 17) -> 221] 221.”; fine-tuning on such passages teaches the model to emit the call itself at inference time.
Complexity: $O(|\mathcal{C}| \cdot K \cdot m \cdot c_{\text{LM}})$ where $c_{\text{LM}}$ is the cost of one LM forward pass; the dominant cost is the candidate-sampling step (Section 2.2 reports “billions of forward passes” for the full augmentation).
Hyperparameters: $\tau_s \in [0.05, 0.2]$ (per-tool, paper Appendix A.6, Table 9); $\tau_f \in [0.05, 1.0]$ ; $m = 5$ ; max samples per tool capped at 25k.
Failure modes: tools whose responses are predictable from context (e.g., calendar queries inside passages that already mention the date) yield few high-loss-drop candidates and the model learns to under-use them; tools whose syntax the base LM cannot generate (e.g., complex JSON arguments) yield few syntactically-valid candidates.
Novelty: [New] for the self-supervised filter pipeline.
Transferability: [Analysis] Transfers cleanly to text-in, text-out tools; transferring to multi-step or stateful tools requires extending the criterion to span multiple call sites jointly.
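
The augmentation loop in this entry can be exercised end to end with toy stand-ins. The sketch below is not the paper's code: the LM, the sampler, and the executor are replaced by hypothetical stub functions that return the illustrative numbers from the hand trace, candidate positions are supplied directly, and the loop keeps the candidate with the largest loss drop at each position, a small departure from the first-hit rule in the pseudocode.

```python
from typing import Callable, Dict, List, Optional

Tokens = List[str]


def linearise(call: str, result: Optional[str]) -> str:
    """Text form of a call, with or without its result."""
    return f"[{call}]" if result is None else f"[{call} -> {result}]"


def augment_passage(tokens: Tokens, positions: List[int],
                    sample_calls: Callable[[Tokens], List[str]],
                    execute: Callable[[str], str],
                    loss: Callable[[str, Tokens], float],
                    tau_f: float) -> Tokens:
    """Insert, at each candidate position, the best call that clears tau_f."""
    insertions: Dict[int, str] = {}
    for i in positions:
        suffix = tokens[i:]
        best_drop, best_text = tau_f, None
        for call in sample_calls(tokens[:i]):
            result = execute(call)
            l_plus = loss(linearise(call, result), suffix)
            l_minus = min(loss("", suffix), loss(linearise(call, None), suffix))
            drop = l_minus - l_plus
            if drop >= best_drop:
                best_drop, best_text = drop, linearise(call, result)
        if best_text is not None:
            insertions[i] = best_text
    out = list(tokens)
    for i in sorted(insertions, reverse=True):  # right to left keeps indices valid
        out.insert(i, insertions[i])
    return out


# --- Toy stand-ins (all hypothetical) ---
def toy_sample_calls(prefix: Tokens) -> List[str]:
    return ["Calculator(13 * 17)", "Calculator(13 + 17)"]


def toy_execute(call: str) -> str:
    return {"Calculator(13 * 17)": "221", "Calculator(13 + 17)": "30"}[call]


def toy_loss(prefix_text: str, suffix: Tokens) -> float:
    # Pretend the LM predicts the suffix well only if its first token was returned.
    if suffix and f"-> {suffix[0]}]" in prefix_text:
        return 0.39
    return 2.4 if prefix_text else 2.57


passage = "The product of 13 and 17 is 221 .".split()
augmented = augment_passage(passage, [7], toy_sample_calls, toy_execute,
                            toy_loss, tau_f=1.0)
print(" ".join(augmented))
```

Passing the sampler, executor, and loss in as callables keeps the loop independent of any particular model or tool, which mirrors the tool-agnostic framing of the method.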

A.9 Experiments and results (Toolformer)

Datasets. CCNet subset for augmentation (From the paper:, Section 2.3); evaluation suites: LAMA T-REx, Google-RE, SQuAD; ASDiv, SVAMP, MAWPS for math; WebQuestions, NaturalQuestions, TriviaQA for QA; MLQA for translation; TempLAMA for temporal.

Baselines. GPT-J 6.7B (the base model); GPT-J fine-tuned on CCNet (no tools); OPT 66B; GPT-3 175B (in-context).

Key results (From the paper:, Table 3 and Table 4).

Task	GPT-J	GPT-J + CC	Toolformer	OPT 66B	GPT-3 175B
LAMA SQuAD	17.8	19.2	33.8	21.6	26.8
LAMA Google-RE	6.8	5.6	11.5	7.3	7.0
LAMA T-REx	31.9	32.9	53.5	35.4	39.8
ASDiv (math)	7.5	7.8	40.4	6.0	14.0
SVAMP (math)	5.2	5.4	29.4	4.9	10.0
MAWPS (math)	9.9	9.6	44.0	7.9	19.8

Selected results from Tables 3 and 4 of Toolformer (arXiv:2302.04761), reproduced for editorial coverage. Values are exact-match accuracy in percent.

Ablations (From the paper:, Section 3.2). Removing the filtering step (training on all sampled calls) drops performance to near GPT-J baseline; removing the per-tool threshold tuning costs 5-10 points on math.

Independent benchmark cross-check. [Analysis] As of 2026-05-19, Toolformer is rarely run as a baseline on contemporary function-calling benchmarks (BFCL, ToolLLM’s ToolBench/ToolEval); its head-to-head versus instruction-tuned tool-callers like Gorilla is not in the literature. The paper’s LAMA and math wins are reproducible from the released code base but are no longer the SOTA on those tasks given subsequent advances in instruction tuning.

Experimental scope limits. Toolformer evaluates one base model (GPT-J 6.7B); scaling behaviour is reported only in a small comparison versus GPT-2 family models in the appendix, showing the method benefits from a sufficiently-capable base LM.

Paper B — Gorilla (Patil et al., 2023)

B.3 Problem formalisation

Notation.

Symbol	Type	Meaning	First appears in
$\mathcal{D}_{\text{API}}$	catalogue	The APIBench corpus of API documentation entries.	B.3
$q$	text	A natural-language instruction.	B.3
$c^*$	API call	The ground-truth API call for instruction $q$.	B.3
$\text{ret}(q)$	function	Retriever that returns API documentation snippets for $q$.	B.3
$\hat{c} = f_\theta(q, \text{ret}(q))$	API call	The model’s predicted call given instruction and retrieval.	B.3

Formal problem statement. Given a catalogue $\mathcal{D}_{\text{API}}$ of API documentation entries and a set of instruction-call training pairs $\{(q_j, c_j^*)\}$, learn a model $f_\theta$ such that, for held-out instructions, the AST of $\hat{c}$ subtree-matches the AST of $c^*$. Optionally support a retriever $\text{ret}$ that conditions the generation on documentation snippets, allowing the catalogue to update without retraining.

Assumptions.

API call correctness can be judged by AST subtree matching against a canonical reference. (From the paper:, Section 3.3.)
Synthetic instruction generation via Self-Instruct produces a sufficiently-diverse training distribution. (From the paper:, Section 3.1.) [Analysis] Potentially strong assumption when the held-out evaluation distribution differs from Self-Instruct’s prior.
The catalogue is representative of real-world deployment. [Analysis] APIBench is biased toward ML model APIs (HuggingFace, TorchHub, TensorFlow Hub); transfer to non-ML APIs is unstudied in the original paper.

B.4 Motivation and gap

Problem. Prompted GPT-4 hallucinates API calls: it invents plausible-sounding function names that do not exist, or invokes real functions with wrong argument names. Gorilla’s authors document this empirically (Figure 2 of the paper, showing GPT-4 inventing a non-existent transformers.AutoModelForCausalLM.from_pretrained('gpt-4') call).

Gap. No prior work had treated the API catalogue as the unit of supervision and trained explicitly to call within it.

[External comparison] ReAct + a retrieval tool can mitigate the issue at inference time; Gorilla’s contribution is to push the supervision into training time and supply a benchmark that grades whether the model stays within the catalogue.

B.5 Method overview

APIBench construction (From the paper:, Section 3.1).

HuggingFace: top 20 models per task category, yielding 925 model entries.
TorchHub: 95 entries (94 valid API calls).
TensorFlow Hub v2: 626 entries (down from 801 in v1 after deduplication).
For each entry, 10 instruction-call pairs are generated via Self-Instruct with GPT-4 as the data-augmentation oracle, giving roughly 16,450 training pairs.
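
The catalogue size and the training-pair count follow from the per-hub figures listed above. A quick check, counting the 94 valid TorchHub calls rather than the 95 raw entries (the variable names are illustrative):

```python
# Valid API entries per hub, as quoted in the list above.
entries = {"HuggingFace": 925, "TorchHub": 94, "TensorFlow Hub": 626}
pairs_per_entry = 10  # Self-Instruct pairs generated per entry

total_apis = sum(entries.values())
total_pairs = total_apis * pairs_per_entry
print(total_apis, total_pairs)
```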

Retriever-aware training (From the paper:, Section 3.2). During training, prepend to each prompt the string “Use this API documentation for reference: <retrieved_API_doc_JSON>” with the documentation snippet for the ground-truth API. At inference, $\text{ret}$ is either an oracle (retrieves the gold entry) or a real retriever over the catalogue (sparse BM25 or dense GPT-Index). The model thus learns to condition on retrieved documentation, so a swapped or updated catalogue can be handled without retraining.
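
The retrieve-then-prompt pattern can be sketched with a small from-scratch Okapi BM25 ranker. Everything here is illustrative: the two-entry catalogue and its one-line descriptions are made up, the tokenizer is deliberately naive, the `k1` and `b` values are common defaults rather than settings taken from the paper, and the documentation is passed as plain text instead of JSON.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().replace("-", " ").replace("_", " ").split()


def bm25_rank(query: str, docs: Dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return doc ids sorted by Okapi BM25 score for the query (best first)."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def build_prompt(instruction: str, doc: str) -> str:
    """Assemble the retriever-aware prompt in the order described above."""
    return (f"Use this API documentation for reference: {doc}\n"
            f"Instruction: {instruction}\nAPI call:")


# Hypothetical two-entry catalogue with made-up descriptions.
catalogue = {
    "Helsinki-NLP/opus-mt-en-fr":
        "translation en to fr pipeline translate english french",
    "distilbert-base-uncased-finetuned-sst-2-english":
        "text classification sentiment pipeline english",
}
query = "Translate this English sentence to French"
top_id = bm25_rank(query, catalogue)[0]
prompt = build_prompt(query, catalogue[top_id])
print(top_id)
```

Swapping the catalogue dictionary for a newer one changes what the model sees at inference without touching its weights, which is the property the retriever-aware setup is meant to provide.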

AST sub-tree matching evaluation (From the paper:, Section 3.3). Parse $\hat{c}$ and $c^*$ into ASTs. Check that the AST of $\hat{c}$ is contained as a subtree of the AST of $c^*$, with the constraint that all required arguments must match and optional Python default arguments are allowed to differ.

Hallucination versus error (From the paper:, Section 3.3, key definition): “A hallucination is an API call that is not a sub-tree of any API in the database; invoking an API incorrectly is an error.”
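
The diagnostic split can be sketched as a three-way classifier. This is a simplification for illustration: catalogue membership is tested on the dotted function name only (not on full subtrees), correctness is an exact AST comparison with no tolerance for optional arguments, and the one-function catalogue and the extra `unparseable` label are assumptions of the sketch.

```python
import ast
from typing import Optional, Set


def called_name(code: str) -> Optional[str]:
    """Dotted name of the outermost call in `code`, or None if it is not a call."""
    try:
        node = ast.parse(code, mode="eval").body
    except SyntaxError:
        return None
    return ast.unparse(node.func) if isinstance(node, ast.Call) else None


def classify(code: str, reference: str, catalogue: Set[str]) -> str:
    name = called_name(code)
    if name is None:
        return "unparseable"
    if name not in catalogue:
        return "hallucination"          # the function is not in the catalogue
    same = (ast.dump(ast.parse(code, mode="eval"))
            == ast.dump(ast.parse(reference, mode="eval")))
    return "correct" if same else "error"  # real function, wrong arguments


catalogue = {"pipeline"}  # hypothetical one-function catalogue
reference = ("pipeline('text-classification', "
             "model='distilbert-base-uncased-finetuned-sst-2-english')")
labels = [
    classify(reference, reference, catalogue),
    classify("pipeline('text-classification', model='bert-base-uncased')",
             reference, catalogue),
    classify("transformers.AutoModelForCausalLM.from_pretrained('gpt-4')",
             reference, catalogue),
]
print(labels)
```

Tracking the two failure labels separately tells you whether to fix retrieval or catalogue coverage (hallucinations) or argument filling (errors).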

Classification. [Adapted] retriever-augmented generation (Lewis et al. 2020) to the API-call setting; [New] the AST-subtree-match-with-default-arguments evaluation; [New] the APIBench catalogue.

B.6 Mathematical contributions

MATH ENTRY 3: AST subtree match indicator

Source: Gorilla Section 3.3; the evaluation function is described in prose, with the indicator formalised here as the publication’s faithful restatement.
What it is: A binary score that captures whether the generated call is a syntactically and semantically valid invocation of the ground-truth API.
Formal definition:

$\text{match}(\hat{c}, c^*) = \mathbb{1}\bigl[\text{AST}(\hat{c}) \sqsubseteq \text{AST}(c^*)\bigr] \cdot \mathbb{1}\bigl[\text{args}_{\text{req}}(\hat{c}) = \text{args}_{\text{req}}(c^*)\bigr]$

where $\sqsubseteq$ denotes the “is a subtree of, modulo optional-default-argument differences” relation.

Each term explained:
- $\text{AST}(\cdot)$ maps a string of code to its abstract syntax tree (Python ast.parse for Python calls).
- $\sqsubseteq$ tolerates differences in optional arguments that have defaults in the API signature: the candidate may omit them or set them explicitly.
- $\text{args}_{\text{req}}$ extracts the set of required (non-default) argument names with their values.
- $\mathbb{1}[\cdot]$ is the indicator function returning 0 or 1.
Worked numerical example. Ground truth: $c^* =$ pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english'). Candidate $\hat{c}_1 =$ pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english', device=-1). The candidate has the extra optional device=-1; the AST of $\hat{c}_1$ subtree-matches $c^*$ when defaults are allowed, and required args (task positionally, model keyword) match exactly. Score: 1. Candidate $\hat{c}_2 =$ pipeline('text-classification', model='bert-base-uncased'). Required-arg values differ on model; score 0. Candidate $\hat{c}_3 =$ transformers.AutoModelForCausalLM.from_pretrained('gpt-4'). The function name does not appear anywhere in $\mathcal{D}_{\text{API}}$; score 0 and the call is classified as a hallucination, not just an error.
Role: the metric reported across all tables of the paper (zero-shot accuracy, oracle-retriever accuracy, hallucination rate) is a function of this indicator.
Edge cases: nested calls (a call as an argument) require recursive AST matching, handled by the paper’s implementation; calls whose arguments contain string literals with API names are not classified as hallucinations.
Novelty: [Adapted] AST matching is standard in compiler tools; the optional-default tolerance and the hallucination definition are new in the API-call evaluation context.
Transferability: [Analysis] The metric transfers to any structured-output task where a reference parser exists; for natural-language tool calls (no formal grammar), it does not apply.
Why it matters: it is the load-bearing scoring function for both Gorilla and (in evolved form) BFCL.
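The indicator above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes each call is a single Python call expression, treats every positional argument as required, and takes the set of required keyword names as a caller-supplied parameter (`required`, like the helper names `parse_call` and `ast_match`, is hypothetical). It needs Python 3.9+ for `ast.unparse`.

```python
import ast

def parse_call(src):
    """Split one call expression into (function name, positional args, keyword args).

    Argument values are stored as ast.dump strings so that literals and nested
    calls compare structurally rather than textually.
    """
    node = ast.parse(src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single call expression")
    name = ast.unparse(node.func)  # e.g. "pipeline" or "pkg.Class.method"
    positional = [ast.dump(a) for a in node.args]
    keywords = {k.arg: ast.dump(k.value) for k in node.keywords}
    return name, positional, keywords

def ast_match(candidate, reference, required):
    """Return 1 if the candidate is an acceptable invocation of the reference, else 0.

    Assumption of this sketch: positional arguments are all required, and
    `required` lists the keyword names whose values must equal the reference.
    Any other keyword (an optional argument with a default) is ignored.
    """
    try:
        c_name, c_pos, c_kw = parse_call(candidate)
    except (SyntaxError, ValueError):
        return 0  # unparseable output cannot match
    r_name, r_pos, r_kw = parse_call(reference)
    if c_name != r_name or c_pos != r_pos:
        return 0
    return int(all(c_kw.get(k) == r_kw.get(k) for k in required))
```

Run on the three candidates from the worked example with `required={"model"}`, the sketch scores the first candidate 1 and the other two 0. Deciding whether a zero is a hallucination or an ordinary error needs the catalogue as well, which this function deliberately leaves out.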

B.7 Algorithmic contributions

ALGORITHM ENTRY 2: Gorilla training pipeline

Source: Gorilla Algorithm 1 (Section 3.2).
Purpose: train a retriever-aware API-calling LLaMA-7B from APIBench.
Inputs: APIBench instruction-call pairs $\{(q_j, c_j^*)\}$; documentation catalogue $\mathcal{D}_{\text{API}}$; base model LLaMA-7B.
Outputs: fine-tuned LLaMA-7B with retriever-aware prompt schema.

for epoch in 1..5:
  for batch in shuffle(APIBench):            # batches of 64 pairs
    batch_loss = 0
    for (q_j, c_j*) in batch:
      doc_j = lookup(D_API, c_j*)            # retrieve gold doc
      prompt = (
        "Use this API documentation for "
        "reference: " + doc_j + "\n" +
        "Instruction: " + q_j + "\n" +
        "API call:"
      )
      target = c_j*
      batch_loss += -log p_theta(target | prompt)
    loss  = batch_loss / len(batch)          # mean over the batch
    theta = theta - eta * grad(loss)         # one update per batch

Inference variant: replace lookup(D_API, c_j*) with BM25_retrieve(q) or GPT_Index_retrieve(q).

Hand-traced example on minimal input. Take $q$ = “Translate this English sentence to French using a small Helsinki model.” Ground-truth call $c^* =$ pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr'). Lookup returns the HuggingFace doc snippet for Helsinki-NLP/opus-mt-en-fr (JSON with the pipeline task name, model id, README excerpt). Prompt becomes “Use this API documentation for reference: {json}\nInstruction: Translate this English sentence to French using a small Helsinki model.\nAPI call:” Target: pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr'). Loss: cross-entropy over the target tokens. After one gradient step, the model’s probability of emitting the correct call given the same prompt rises; after 5 epochs over the full corpus, the model has learned to copy the model id from the retrieved doc into the call argument.
Complexity: $O(N_{\text{pairs}} \cdot E \cdot L \cdot c_{\text{forward}})$, where $N_{\text{pairs}} \approx 16\text{k}$, $E = 5$ epochs, $L = 2048$ is the maximum sequence length, and $c_{\text{forward}}$ is the per-token cost of a forward pass. Reported wall-clock: not explicitly stated in the paper. Hardware: 8x A100 40GB.
Hyperparameters: learning rate $2 \times 10^{-5}$, batch size 64, max sequence length 2048, 5 epochs (From the paper: Section 8.2).
Failure modes: when the retriever returns the wrong documentation snippet at inference, the model copies from the wrong snippet (Gorilla’s Table 3 shows this as the dominant failure mode of the BM25-retriever variant).
Novelty: [Adapted] standard instruction-tuning with retrieval augmentation; the specific data construction and prompt schema are new.
Transferability: [Analysis] Transfers to any tool catalogue with structured documentation entries.
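The prompt schema shared by training and inference can be sketched as follows. This is a minimal illustration under stated assumptions: `build_prompt` and `gold_retriever` are hypothetical names, the catalogue is a plain dict keyed by the gold call string, and the documentation snippet is an invented JSON string standing in for a real APIBench entry. Only the template text comes from the pipeline above.

```python
from typing import Callable, Dict

# Template text taken from the training pipeline above.
PROMPT_TEMPLATE = (
    "Use this API documentation for reference: {doc}\n"
    "Instruction: {instruction}\n"
    "API call:"
)

def build_prompt(instruction: str, retrieve: Callable[[str], str]) -> str:
    """Fill the schema with whatever document the retriever returns for the instruction."""
    return PROMPT_TEMPLATE.format(doc=retrieve(instruction), instruction=instruction)

def gold_retriever(d_api: Dict[str, str], gold_call: str) -> Callable[[str], str]:
    """Training-time retriever: ignores the query and returns the gold call's document."""
    return lambda _query: d_api[gold_call]

# Hypothetical one-entry catalogue for illustration only.
call = "pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr')"
d_api = {call: '{"task": "translation_en_to_fr", "model": "Helsinki-NLP/opus-mt-en-fr"}'}
question = "Translate this English sentence to French using a small Helsinki model."

prompt = build_prompt(question, gold_retriever(d_api, call))
target = call  # the loss is computed on these tokens only
```

Passing the retriever as a function keeps the schema identical across phases: at inference, any callable that maps the query to a document (a BM25 or dense retriever, for instance) can replace `gold_retriever` without touching the template.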

B.9 Experiments and results (Gorilla)

Datasets. APIBench (the paper’s own contribution); 80/20 train/test split per API category (From the paper: Section 4.1).

Baselines. GPT-4 (zero-shot), GPT-3.5 (zero-shot), Claude (zero-shot), LLaMA-7B base.

Key results (From the paper: Table 1, zero-shot accuracy in percent):

Model	TorchHub	HuggingFace	TensorFlow Hub
Gorilla (0-shot)	59.13	71.68	83.79
GPT-4 (0-shot)	38.70	19.80	18.20
GPT-3.5 (0-shot)	48.38	16.81	41.75
Claude (0-shot)	18.81	6.19	9.19
LLaMA-7B (0-shot)	0.00	0.00	0.00

Table 1 of Gorilla (arXiv:2305.15334), reproduced for editorial coverage.

The paper headlines “Gorilla outperforms GPT-4 by 20.43% and ChatGPT by 10.75%” in aggregate accuracy (From the paper: Section 4, abstract restatement).

Retriever-aware results (From the paper: Table 2). With an oracle retriever, Gorilla rises from 71.68% to 91.26% on HuggingFace, a gain of 19.58 absolute points. With a BM25 retriever, it rises modestly above zero-shot. The takeaway: retrieval quality is the load-bearing variable.

Hallucination analysis (From the paper: Table 1 right columns). GPT-4 hallucinations on HuggingFace: 36.55%; Gorilla zero-shot: 11.16%. The drop is largely driven by training-data conditioning: the model learned to stay in the catalogue.

Ablations (From the paper: Section 4.2). Training without the retrieval augmentation drops HuggingFace accuracy to 47.46%, about 24 points below the 71.68% zero-shot figure and about 44 points below the 91.26% oracle-retriever figure, demonstrating that the retrieval-aware prompt schema is what unlocks the full quality.

Independent benchmark cross-check. [Analysis] Gorilla’s training and test sets are constructed from the same Self-Instruct pipeline; the head-to-head against GPT-4 is on Gorilla’s home turf. The contemporaneous ToolLLM paper¹⁰ introduced ToolBench, a benchmark of 16,000+ real-world APIs from RapidAPI, on which GPT-4 closes a substantial fraction of the gap when prompted with adequate retrieval. The SOTA framing in the Gorilla paper is the authors’ framing on their chosen benchmark suite; the broader picture as of 2026 (per BFCL V4) is that frontier closed models match or exceed open fine-tuned models on single-turn function calling.

Paper C — Berkeley Function Calling Leaderboard (Patil et al., 2025)

C.3 Problem formalisation

Notation.

Symbol	Type	Meaning	First appears in
$\mathcal{T}$	benchmark	The full BFCL prompt-answer corpus.	C.3
$\mathcal{T}_k$	category	One of the named evaluation categories (simple AST, parallel, multi-turn, etc.).	C.3
$s_k(\theta)$	scalar	Per-category accuracy of model $\theta$.	C.3
$S(\theta)$	scalar	Aggregate score, a weighted mean of per-category accuracies.	C.3

Formal problem statement. Define a multi-category evaluation $\mathcal{T} = \bigcup_k \mathcal{T}_k$ such that aggregate performance $S(\theta)$ distinguishes models on capabilities that matter for production tool-using agents: single-turn calling, parallel calling, multiple-function selection, relevance detection (knowing when not to call), multi-turn stateful interaction, and agentic settings (web search, memory).

C.5 Method overview (BFCL evaluation categories)

Categories (From the paper: Section 3; live leaderboard categories):

Simple AST. One function, one call expected. AST subtree match.
Multiple. Several function definitions available; model must select the right one.
Parallel. A single user query implies multiple simultaneous calls; the model must emit all of them.
Parallel Multiple. Both above combined.
Relevance / Irrelevance. No function fits the query; the model must abstain.
Executable. The generated call is executed against a sandbox; correctness is judged on the return value.
REST. Calls against a REST API surface.
Java / JavaScript. Function calls in non-Python languages.
Multi-turn (V3 addition). A conversational sequence of calls with persistent state across turns.
V4 additions. Agentic web search, agentic memory management, format sensitivity.
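Two of the categories above differ from Simple AST mainly in what counts as a correct answer, which a short sketch makes concrete. The code below is an illustration under assumptions, not the leaderboard's checker: calls are represented as already-normalised strings, plain string equality stands in for the per-call AST match, and treating parallel calls as order-insensitive is an assumption of this sketch. The function names are hypothetical.

```python
from collections import Counter

def parallel_match(predicted_calls, expected_calls):
    """Parallel: every expected call must be emitted, with nothing extra.

    Calls are compared as a multiset, so emission order does not matter
    (an assumption of this sketch).
    """
    return int(Counter(predicted_calls) == Counter(expected_calls))

def irrelevance_match(predicted_calls):
    """Relevance / Irrelevance: the model scores only if it abstains entirely."""
    return int(len(predicted_calls) == 0)
```

Under these rules a model that emits one of two expected calls scores 0 on the parallel prompt, and a model that emits any call at all scores 0 on an irrelevance prompt.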

Dataset size. Over 2,000 question-function-answer pairs in V3 (From the paper: Section 3.1, and BFCL GitHub repository README); V4 adds new categories on top.

Scoring (From the paper: Section 3.4). Per-category accuracy is the fraction of prompts the model gets right under that category’s match rule; aggregate $S(\theta)$ is a (slightly category-weighted) average. Unevaluated categories count as zero in the overall column.

C.6 Mathematical contributions

MATH ENTRY 4: Aggregate BFCL score

Source: BFCL paper Section 3.4; reconstructed precisely from the live leaderboard scoring code in the GitHub repository.
What it is: A weighted mean of per-category accuracies treated as the headline leaderboard number.
Formal definition:

$S(\theta) = \frac{1}{\sum_k w_k} \sum_{k} w_k \cdot s_k(\theta), \quad s_k(\theta) = \frac{1}{|\mathcal{T}_k|} \sum_{(q, c^*) \in \mathcal{T}_k} \text{match}_k(f_\theta(q), c^*)$

Each term explained:
- $w_k$ is the per-category weight (default 1 for every evaluated category; unevaluated categories enter as $s_k = 0$).
- $\text{match}_k$ is the category-specific match rule: AST subtree for AST categories, return-value equality for Executable, irrelevance-detection accuracy for Relevance, multi-turn state correctness for Multi-Turn.
- $f_\theta(q)$ is the model’s generated call(s) under a category-specific prompt format.
Worked numerical example. Suppose a model is evaluated on three categories with sizes $|\mathcal{T}_1| = 400$ (simple AST), $|\mathcal{T}_2| = 200$ (multiple), $|\mathcal{T}_3| = 200$ (parallel). It gets 320/400, 130/200, 80/200 correct. Per-category accuracies: $s_1 = 0.80, s_2 = 0.65, s_3 = 0.40$. With uniform weights $w_k = 1$ across the three evaluated categories: $S = (0.80 + 0.65 + 0.40)/3 \approx 0.617$. If the model is also subject to multi-turn (size 200) but produces zero correct (or is not evaluated), and multi-turn enters with $w_4 = 1$: $S = (0.80 + 0.65 + 0.40 + 0)/4 = 0.4625$. The example shows how multi-turn becomes the headline-killer for models that ace single-turn.
Role: it is the column in the public leaderboard that the community treats as “the BFCL number.”
Edge cases: categories with very different difficulties get the same weight by default; reading per-category breakdowns is necessary to avoid being misled by aggregates.
Novelty: [Adapted] standard benchmark-aggregation arithmetic; the contribution is the choice of categories and match rules, not the aggregation.
Transferability: [Analysis] The scoring transfers to any sufficiently-categorised tool-use benchmark.
Why it matters: it is the operational definition of “function-calling progress” in 2026.
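The aggregate in this entry is small enough to check by hand, and a short sketch reproduces the worked example. It is an illustration of the formula as written here, not the leaderboard's scoring code: `aggregate_score` is a hypothetical name, and representing an unevaluated category as `None` is an assumption of the sketch.

```python
def aggregate_score(accuracies, weights=None):
    """Weighted mean of per-category accuracies.

    `accuracies` maps category name to s_k; a value of None marks an
    unevaluated category, which enters the mean as 0. Weights default to 1.
    """
    weights = weights or {k: 1.0 for k in accuracies}
    total_weight = sum(weights[k] for k in accuracies)
    weighted_sum = sum(weights[k] * (accuracies[k] or 0.0) for k in accuracies)
    return weighted_sum / total_weight

# Per-category accuracies from the worked example.
single_turn = {"simple_ast": 320 / 400, "multiple": 130 / 200, "parallel": 80 / 200}
with_multi_turn = dict(single_turn, multi_turn=None)

s3 = aggregate_score(single_turn)       # about 0.617
s4 = aggregate_score(with_multi_turn)   # about 0.4625
```

Passing an explicit `weights` dict shows how sensitive the headline number is to the weighting choice, which is why the per-category breakdown matters more than the aggregate.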

C.9 Experiments and results (BFCL leaderboard state)

Current top of leaderboard (as of 2026-04-21, per the third-party BFCL v3 mirror at pricepertoken.com).⁸ GLM 4.5 leads at 76.7%, followed by Qwen3 32B at 75.7%. On BFCL V4, Qwen3.5-397B-A17B leads at 0.729 across the V4-evaluated subset. [Reviewer Perspective] Leaderboard rankings on BFCL shift weekly; readers should consult the live leaderboard rather than relying on any cached snapshot.⁶

What the leaderboard shows in shape. Single-turn AST categories are nearly saturated for frontier and strong open models (top scores in the high 80s and 90s). Multi-turn and agentic categories remain a noticeable gap; V4’s web-search and memory categories separate models more sharply than single-turn ever did. [Analysis] This matches the BFCL paper’s stated finding (Section 4): “models excel at single-turn tasks but struggle with memory, dynamic decision-making, and long-horizon reasoning.”

Independent benchmark cross-check. BFCL is itself the independent benchmark for Gorilla, Toolformer-descendants, and modern function-calling models; no separate cross-check is necessary at the cluster level. The reproducibility check below covers BFCL’s own reproducibility.

Section 10: Technical novelty summary (cluster-level)

Novelty map.

Component	Type	Novelty level	Justification	Source
Self-supervised filtering criterion ($L^- - L^+ \ge \tau_f$)	Loss / data filter	Fully novel	No prior work uses LM-loss-reduction as the supervision signal for tool use.	Toolformer Eq. 3
Weighted-position loss for tool filtering	Loss design	Incrementally novel	Position-weighted CE is a standard pattern; the application is new.	Toolformer Eq. 2
APIBench (HuggingFace + TorchHub + TFHub instruction-call corpus)	Dataset	Fully novel	First curated corpus tying API documentation to instruction-call pairs.	Gorilla Section 3.1
Retriever-aware prompt schema for API calls	Training method	Combination novel	RAG was known; applying it to API documentation at training and inference time as a unified schema is new.	Gorilla Section 3.2
AST subtree match with optional-default-argument tolerance	Evaluation	Incrementally novel	AST matching is standard; the tolerance rule and the hallucination/error distinction are new.	Gorilla Section 3.3
Hallucination versus error definition	Diagnostic	Fully novel	Operationalises a previously-vague phenomenon.	Gorilla Section 3.3
BFCL category set	Benchmark	Combination novel	Each category individually has precedents; the multi-category corpus and unified scoring are new.	BFCL Section 3
BFCL multi-turn and agentic evaluation	Benchmark	Incrementally novel	Single-turn tool-call evaluation existed; multi-turn with state is the BFCL V3+ contribution.	BFCL Section 3

Single most novel contribution per paper.

Toolformer: the loss-reduction filtering criterion as the supervision signal. No human labels, no reward model, no RL: just “did the call help predict the next tokens by at least $\tau_f$ nats.”
Gorilla: APIBench together with the AST-with-defaults evaluation rule. Future tool-calling benchmarks inherit both.
BFCL: the multi-turn and irrelevance categories. Single-turn function calling has become close to saturated; BFCL V3+ is what separates models in 2026.

What the papers do NOT claim as novel. Tool use as a research direction (long predates them); in-context demonstration patterns (standard in 2022); RAG (Lewis et al. 2020); AST parsing (compiler tools); LLaMA fine-tuning (Touvron et al. 2023).

Section 11: Situating the work

Prior work the cluster sits on top of. Komeili, Shuster, Weston 2022 (internet-augmented dialogue); LaMDA (Thoppilan et al. 2022, with hand-annotated tool traces); the ReAct family (Yao et al. 2022 / 2023, framing tool use as a reasoning-and-acting interleave at inference time); RAG (Lewis et al. 2020); HuggingGPT (Shen et al. 2023, prompting GPT-4 to orchestrate HuggingFace models).

What this cluster changes. Tool use moves from “prompt-engineered orchestration at inference time” (HuggingGPT, ReAct) to “supervised at training time” (Toolformer, Gorilla), with a standardised evaluation (BFCL) to compare results. The shift matters because evaluating prompt-engineered tool use is unreliable: small prompt changes swing scores by tens of points. Training the capability in is what made function calling a deployable feature in 2024-2026 APIs.

Contemporaneous related papers.

ReAct (Yao et al., ICLR 2023).⁹ Frames tool use as a reasoning-then-acting trace. Toolformer is the training-time complement: ReAct shows the model how to use tools at inference; Toolformer trains it to want to.
ToolLLM (Qin et al., ICLR 2024).¹⁰ Introduces ToolBench, a corpus of over 16,000 real-world RapidAPI APIs. ToolLLM trains a model (ToolLLaMA) with a DFS-based decision-making tree that is a strict superset of Gorilla’s single-call setting. The two papers are direct contemporaries with overlapping ambition; Gorilla’s contribution is the catalogue-anchored training pipeline and AST-with-defaults evaluation, ToolLLM’s is the agentic multi-call decision tree and scale.

[Reviewer Perspective] Strongest skeptical objections.

Toolformer: the filtering criterion rewards calls that help the LM predict near-call tokens. This is not the same as rewarding calls that are correct for the human user’s task. The two coincide for the paper’s chosen tools (math, calendar) but can diverge for tools whose value is in long-horizon factual correctness rather than next-token surprise.
Gorilla: the train and test distributions are both produced by Self-Instruct with GPT-4 as the oracle; the model’s 91.26% oracle-retriever HuggingFace accuracy partly measures how well a small LM can imitate the GPT-4 instruction-generation distribution. ToolLLM’s broader RapidAPI distribution does not show comparable gains.
BFCL: category weights are uniform by default; this overweights easy categories where models cluster near the ceiling and underweights multi-turn where ranking actually differentiates. Sophisticated readers consult per-category breakdowns; the aggregate column is a coarse summary.

[Reviewer Perspective] Strongest author-side rebuttals (grounded in the papers).

Toolformer: Section 5 explicitly states the next-token-surprise criterion is a proxy, not the goal; the empirical demonstration on LAMA and math validates the proxy in those domains.
Gorilla: Section 4.1 acknowledges Self-Instruct provenance; the held-out test partition is constructed before the model sees any training data and the AST-match evaluation is independent of the data-generation distribution.
BFCL: Section 4 explicitly discusses category weights and provides the per-category breakdown alongside the aggregate.

What remains unsolved.

Tool-use that requires multi-step planning across heterogeneous tools (BFCL multi-turn).
Tool selection in catalogues of $10^5$+ APIs (BFCL V4 starts to surface this; ToolLLM scaled to $\sim$16k).
Calibrated abstention (BFCL Relevance category): models still over-call on irrelevant prompts.
Failure recovery: when a tool returns an error, the model rarely handles it gracefully without specific training.

Three future research directions (each grounded in a paper-specific gap).

Toolformer follow-up. Replace the next-token-loss criterion with a downstream-task criterion (did the call help the model answer the question correctly), retaining the self-supervision spirit but tying it to human-aligned correctness. [Analysis]
Gorilla follow-up. Scale the catalogue to $10^5$+ APIs and study how the retriever’s recall budget interacts with model size; APIBench at $\sim$1.6k entries is small relative to production catalogues. [Analysis]
BFCL follow-up. Introduce a cost-aware aggregate that penalises wasteful or excessive tool calls (over-calling is currently uncosted; saturating the parallel category by always emitting many calls is gameable). [Reviewer Perspective]

Section 12: Critical analysis

Strengths (cluster-level, with evidence).

Toolformer: the empirical demonstration that perplexity reduction supplies a usable supervision signal, with clean improvements on tasks the base LM is known to struggle with (math, factual lookup). Tables 3 and 4 show 10-30 absolute-point gains.
Gorilla: the catalogue-anchored training pipeline produces a model that explicitly stays in-catalogue, dropping HuggingFace hallucinations from 36.55% (GPT-4) to 11.16% (zero-shot Gorilla); the AST-with-defaults evaluation is the practical scoring tool the community adopted.
BFCL: the category structure is the right structure; “single-turn AST, parallel, relevance, multi-turn” maps onto deployment realities and the leaderboard is publicly mirrored and code-released.

Weaknesses stated by the authors.

Toolformer: the tool set is small (5 tools) and the base LM is single-size; scaling behaviour is not characterised in depth (Section 6, Limitations).
Gorilla: APIBench is ML-API-centric; transfer to other API categories is not studied (Section 6).
BFCL: uniform category weights are an editorial choice; readers may prefer per-category views (Section 4).

Weaknesses not stated or understated. [Reviewer Perspective]

The cluster has no shared train-test contamination audit: BFCL evaluation prompts may overlap with Gorilla’s Self-Instruct corpus and with public APIBench data that frontier-model training corpora likely ingested. The community’s de facto reliance on BFCL as the headline number is therefore subject to the same contamination concerns as MMLU and GSM8K.
The Toolformer Calculator-only setting masks a fundamental issue with the criterion: useful tool calls whose results are linguistically common (e.g., a calculator returning “2”) will not produce large next-token-loss drops because the LM was going to predict “2” anyway from context. The filter under-credits high-value-but-language-common tool returns.
Independent commentary on Gorilla’s evaluation: the ToolLLM authors note that AST-subtree-match is permissive (it allows the model to add optional arguments freely), which inflates accuracy compared to exact-match scoring. The paper does not benchmark against exact-match alongside.

Reproducibility check.

Toolformer. Code: released by the authors at the official GitHub repository (linked from the arXiv abstract). Data: CCNet is publicly available; augmented dataset not released in full. Hyperparameters: fully reported in Appendix A.6. Compute: not reported in detail. Weights: not released. Eval set: standard public benchmarks (LAMA, ASDiv, SVAMP, MAWPS, NaturalQuestions, TriviaQA, WebQuestions). Overall: partially reproducible.
Gorilla. Code: released at github.com/ShishirPatil/gorilla. Data: APIBench released. Hyperparameters: fully reported (Sec. 8.2). Compute: 8x A100 40GB, 5 epochs (Sec. 8.2). Weights: Gorilla-7B and Gorilla-Falcon-7B released on Hugging Face. Eval set: APIBench held-out partition. Overall: fully reproducible.
BFCL. Code: released at github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard. Data: V3 prompts public; V4 partially gated. Hyperparameters: not applicable (it is a benchmark, not a trained model). Compute: per-model evaluation costs reported in the BFCL paper Appendix. Weights: not applicable. Eval set: released. Overall: fully reproducible for V3; partially for V4 as new categories continue to be added.

Methodology

Toolformer. Sample size: hundreds of thousands of CCNet passages, augmented to 25k examples per tool. Evaluation set: LAMA T-REx 34k, Google-RE 6k, SQuAD 1.4k; math 2-3k per dataset. Baselines: GPT-J 6.7B (base), GPT-J + CC, OPT 66B, GPT-3 175B. Hardware: not reported in detail.
Gorilla. Sample size: $\sim$16,450 instruction-call training pairs, 80/20 split. Evaluation set: APIBench held-out 20%. Baselines: GPT-4, GPT-3.5, Claude, LLaMA-7B (zero-shot prompted). Hardware: 8x A100 40GB.
BFCL. Sample size: over 2,000 question-function-answer pairs in V3. Evaluation set: BFCL V3 corpus (publicly released, contamination check noted as community concern but not formalised in the paper). Baselines: open and closed frontier models (full list on the live leaderboard). Hardware: per-model API-call costs reported in the appendix.

Generalisability. Toolformer’s filtering criterion transfers cleanly to other text-in, text-out tools; it does not transfer obviously to multimodal tools or stateful tools. Gorilla’s retriever-aware training transfers cleanly to other structured-API catalogues. BFCL’s category structure transfers as a template to domain-specific function-calling benchmarks (e.g., a clinical-API BFCL, a financial-API BFCL).

Assumption audit. [Analysis] The Toolformer assumption that next-token loss is a good proxy for tool usefulness is the most fragile across the cluster; it predicts failure modes that the paper does not fully characterise. The Gorilla assumption that AST-with-defaults is the right evaluation rule is defensible for ML APIs but produces over-permissive scores on APIs with many optional arguments. The BFCL assumption that uniform category weights produce a meaningful aggregate is the most often-flagged objection by readers of the leaderboard.

What would make the papers significantly stronger. [Analysis] (a) A Toolformer follow-up that swaps the next-token loss for a downstream-task reward and reports head-to-head; (b) a Gorilla follow-up scaled to a $10^5$+ API catalogue with the retriever evaluated separately; (c) a BFCL update with category weights chosen by an explicit utility analysis rather than uniform default.

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: Loss-reduction filtering criterion

What it is: The Toolformer rule $L^- - L^+ \ge \tau_f$.
Why worth reusing: It is the cleanest self-supervised signal for tool usefulness; works without human annotation.
Preconditions: A base LM whose loss can be cheaply evaluated on a candidate-augmented prefix; tools with text-in, text-out interfaces.
What would need to change: $\tau_f$ must be tuned per tool; the position-weight schedule may need adjustment for tools whose effect is far from the call site.
Risks: under-credits tool calls whose value is in long-horizon correctness rather than immediate next-token predictability.
Interaction effects: Combines with curriculum-style training (sample easy tools first, then add harder ones once the model can syntactically emit them).
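A minimal sketch of the filtering rule, assuming the per-token log-probabilities of the continuation have already been computed under three prefixes (no call, call without its result, call with its result). The function names and the toy numbers are hypothetical; the position weights follow the decaying schedule $\tilde{w}_t = \max(0, 1 - 0.2t)$ described in the Toolformer paper, normalised to sum to one, and the threshold value used in the example is illustrative only.

```python
def position_weights(n, decay=0.2):
    """Normalised weights max(0, 1 - decay * t) for t = 0..n-1."""
    if n == 0:
        raise ValueError("need at least one continuation token")
    raw = [max(0.0, 1.0 - decay * t) for t in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_loss(token_logprobs):
    """Position-weighted negative log-likelihood of the continuation tokens."""
    weights = position_weights(len(token_logprobs))
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))

def keep_call(lp_no_call, lp_call_no_result, lp_call_with_result, tau_f):
    """Keep the call if seeing its result lowers the loss by at least tau_f.

    L+ is the loss with the call and its result in the prefix; L- is the
    smaller of the losses with no call and with the call but no result.
    """
    l_plus = weighted_loss(lp_call_with_result)
    l_minus = min(weighted_loss(lp_no_call), weighted_loss(lp_call_no_result))
    return l_minus - l_plus >= tau_f

# Toy log-probabilities (hypothetical): the result makes the next tokens far likelier.
helpful = keep_call([-3.0, -2.0], [-2.8, -2.0], [-0.5, -1.0], tau_f=1.0)
# The model already predicted the continuation well, so the result adds little.
redundant = keep_call([-0.1, -0.2], [-0.1, -0.2], [-0.05, -0.2], tau_f=1.0)
```

In the toy data the first call is kept and the second is dropped, even though the second call may have returned a correct and useful value: the criterion measures predictability gain, not correctness.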

REUSABLE COMPONENT 2: AST subtree match with optional-default tolerance

What it is: The Gorilla evaluation rule formalised in MATH ENTRY 3.
Why worth reusing: It is the de facto standard for grading structured-output tool calls.
Preconditions: A reference parser for the target language (Python ast.parse, JavaScript Esprima, Java JavaParser).
What would need to change: The “subtree contained in” relation may need an exact-match variant alongside, for stricter scoring on safety-critical tools.
Risks: over-permissive on APIs with many optional arguments.

REUSABLE COMPONENT 3: Retriever-aware prompt schema

What it is: The Gorilla pattern of prepending “Use this API documentation for reference: <doc_json>” at both training and inference time.
Why worth reusing: Enables the catalogue to evolve without retraining; tracks documentation versions naturally.
Preconditions: Structured documentation per API; a retriever (BM25, dense, or oracle in training).
What would need to change: For very large catalogues, the retriever quality dominates; investing in retriever training pays off more than scaling the call-generation model.
Risks: When retrieval fails (wrong doc returned), the model copies confidently from the wrong source.
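The retrieval-failure risk is easy to reproduce with a toy retriever. The sketch below uses plain token overlap as a stand-in for a real lexical retriever (it is not BM25), and the two-entry catalogue and its keyword descriptions are invented for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenisation into a set of tokens."""
    return set(text.lower().split())

def overlap_retrieve(query, docs):
    """Return the id of the document sharing the most tokens with the query.

    Ties go to the first document in insertion order, as with Python's max().
    """
    query_tokens = tokenize(query)
    return max(docs, key=lambda doc_id: len(query_tokens & tokenize(docs[doc_id])))

# Hypothetical catalogue: model id mapped to a keyword description.
docs = {
    "Helsinki-NLP/opus-mt-en-fr": "translation english french helsinki opus",
    "t5-small": "summarization text summary t5",
}

good = overlap_retrieve("translate this english sentence to french", docs)
bad = overlap_retrieve("give me a summary in french", docs)
```

The second query asks for a summary but shares exactly one token with each entry, so the tie resolves to the translation model. A call generator trained to copy the model id from the retrieved document would then emit a confident, well-formed, wrong call.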

REUSABLE COMPONENT 4: BFCL category structure as benchmark template

What it is: Simple AST, multiple, parallel, parallel multiple, relevance, executable, multi-turn.
Why worth reusing: It is the right product-shaped split; covers the calling patterns deployed agents actually exhibit.
Preconditions: A target API domain with enough function variety to populate each category.
What would need to change: Domain-specific match rules; safety-critical domains may need additional categories (e.g., “destructive-call confirmation”).
Risks: Aggregate uniform-weight scoring is gameable; per-category reporting is the safer default.

REUSABLE COMPONENT 5: Hallucination versus error diagnostic split

What it is: The Gorilla definition: hallucination = call to a non-existent function; error = call to a real function with wrong arguments.
Why worth reusing: It is the right diagnostic split when debugging an agent; the two failure modes have different root causes and different fixes.
Preconditions: An enumerable catalogue against which “non-existent” can be defined.
Risks: The diagnostic does not distinguish “wrong function but right intent” from “wrong function and wrong intent.”
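The split can be sketched as a three-way classifier. This is an illustration under assumptions rather than Gorilla's code: `diagnose` is a hypothetical name, the catalogue is a set of dotted function names, structural equality of the two parsed calls stands in for the full AST-with-defaults match, and unparseable output is counted as an error by choice. It needs Python 3.9+ for `ast.unparse`.

```python
import ast

def function_name(call_src):
    """Return the dotted name of the function invoked by a single call expression."""
    node = ast.parse(call_src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single call expression")
    return ast.unparse(node.func)

def diagnose(candidate, reference, catalogue):
    """Classify a candidate call as 'correct', 'error', or 'hallucination'."""
    try:
        name = function_name(candidate)
        same = ast.dump(ast.parse(candidate, mode="eval")) == ast.dump(
            ast.parse(reference, mode="eval")
        )
    except (SyntaxError, ValueError):
        return "error"  # assumption: malformed output counts as an error
    if name not in catalogue:
        return "hallucination"  # the function does not exist in the catalogue
    return "correct" if same else "error"
```

With a catalogue containing only `pipeline`, a call to `transformers.AutoModelForCausalLM.from_pretrained('gpt-4')` is labelled a hallucination, while a `pipeline(...)` call with the wrong model id is labelled an error. Any in-catalogue mismatch lands in the error bucket, so the sketch inherits the blind spot noted in the risk above.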

Dependency map in text form. Toolformer depends on a base LM (GPT-J) plus a small per-tool prompt set. Gorilla depends on LLaMA-7B plus APIBench plus an optional retriever. BFCL depends on Gorilla’s AST evaluation primitives, extended to additional categories and languages. ToolLLM depends on the broader Gorilla framing but uses its own ToolBench corpus and DFS-based decision tree, not Gorilla’s training corpus.

Recommendation. [Analysis] For a new study targeting production function-calling, the highest-value components are: (1) the retriever-aware training schema from Gorilla (most directly applicable), (2) the AST-with-defaults evaluation from Gorilla (most directly applicable for scoring), and (3) BFCL as the evaluation harness (most directly applicable for benchmarking). The Toolformer loss-reduction criterion is a powerful conceptual tool but applies cleanly only when the candidate tools’ utility is captured by next-token predictability.

[Analysis] What type of new study benefits most. A study that fine-tunes a small open model (3B-8B) for a domain-specific API catalogue (e.g., a medical-records API, a CRM API), uses Gorilla-style retriever-aware training on Self-Instruct-generated instruction-call pairs, and grades on a BFCL-template benchmark customised to that domain.

Section 14: Known limitations and open problems

Limitations explicitly stated by the authors.

Toolformer (Section 6). Small tool set; the criterion under-credits tools whose returns are linguistically predictable; single base LM evaluated; no interactive tools.
Gorilla (Section 6). APIBench is ML-API-centric; the test set is constructed from the same Self-Instruct distribution as training; the model is small (7B) and frontier closed models have since closed parts of the gap on broader benchmarks.
BFCL (Section 4 + Section 5 limitations). Category weights are uniform by default; some categories (multi-turn, agentic) are evaluation-cost-heavy and may not be run on all models; the live leaderboard’s freshness depends on community submissions.

Limitations not stated or understated. [Analysis] and [Reviewer Perspective]:

Contamination. APIBench and BFCL prompts are public; frontier-model training corpora plausibly ingest them. Reported BFCL scores on closed models may be inflated by training-data overlap. The community’s response (per BFCL’s own GitHub README) is to rotate categories more aggressively in V4. Independent commentary in the BFCL OpenReview thread³ raises this as a recurring concern.
Stateful and destructive tools. All three papers treat tools as pure functions. Production agents call APIs that mutate state (POST /payments, DELETE /resources); the evaluation does not capture the cost of incorrect destructive calls.
Cost-of-call as a metric. None of the three papers penalises wasteful tool calls. A model that emits ten redundant calls when one would suffice is graded the same as a parsimonious model. Production deployments care intensely about call budgets.

Technical root cause of each.

Contamination: shared training-data distributions between frontier-model training corpora and public benchmarks.
Stateful tools: the supervision signal in all three papers is structural (call shape, return value), not semantic (effect on world state).
Cost: the evaluation aggregate weights correctness only.

Open problems left behind.

Self-supervised tool use for stateful APIs (Toolformer’s criterion does not generalise).
Catalogue scaling beyond $10^4$ APIs with calibrated abstention (BFCL Relevance and ToolLLM’s setting both point here).
Multi-turn tool use with planning, recovery from tool errors, and graceful abstention.
Cost-aware evaluation (number of calls, latency, dollar cost per task completion).

What a follow-up paper would need to solve. [Analysis] A follow-up addressing the multi-turn gap would need (a) a training signal that credits long-horizon tool sequences (Toolformer’s criterion extended to spans rather than positions, or a downstream-reward variant), (b) a catalogue at $10^5$-scale with a strong retriever evaluated independently of the call-generation model, and (c) an evaluation harness extending BFCL with cost-of-call and destructive-action categories. The closest contemporary candidate is ToolLLM, which solves (a) partially via DFS decision trees but does not satisfy (c).

How this article reads at three depths

For the curious high-school reader. Tool learning is the field that teaches AI models to use external programs like calculators, search engines, or web APIs. Toolformer (2023) showed how a model can teach itself which tools to call by checking whether each candidate call helps it predict what comes next. Gorilla (2023) showed that fine-tuning a smaller model on a curated catalogue of 1,645 real APIs beats GPT-4 at the same task. The Berkeley Function Calling Leaderboard (BFCL) is now the scoreboard the field uses to compare every new function-calling model.

For the working developer or ML engineer. If a small, well-defined tool set is the target (calculator, calendar, in-house search), the Toolformer loss-reduction filter is a usable, label-free data-generation pipeline. For a real API catalogue at $10^3$-$10^4$ scale, the Gorilla pattern is the production-default: structured documentation, Self-Instruct-generated instruction-call pairs, retriever-aware fine-tuning, AST-with-defaults evaluation. Benchmark on BFCL V3 for single-turn and parallel; treat the V3+ multi-turn and V4 agentic categories as the actual differentiators. Closed frontier models are competitive on single-turn AST but the gap narrows or reverses for domain-specialised fine-tuned models. The Gorilla codebase (Apache-2.0) and the BFCL harness are the right starting points for building a custom function-calling stack.

For the ML researcher. The cluster’s enduring contributions are (a) Toolformer’s loss-reduction filtering criterion as a self-supervised signal for tool usefulness, (b) Gorilla’s retriever-aware training and AST-with-defaults evaluation as the catalogue-anchored training and grading recipe, and (c) BFCL’s category structure as the de facto operational definition of function-calling progress. The load-bearing objects are: Toolformer’s filtering inequality, Gorilla’s AST-match indicator, and BFCL’s category-aggregated score. The strongest open objection is that none of the three papers treats tool use as a cost-aware, multi-step, state-modifying process; ToolLLM extends in this direction but does not subsume the cluster’s evaluation framework. A follow-up paper would need to combine a downstream-reward training signal, a $10^5$-scale catalogue with a strong independent retriever, and a cost-and-state-aware evaluation harness; no current paper does all three.
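To make the AST-match indicator concrete, here is a simplified sketch using Python's standard `ast` module. It is an assumption-laden approximation rather than Gorilla's or BFCL's actual checker: the helper names and the `optional` parameter are hypothetical, only literal argument values are supported, and extra keyword arguments in the generated call are ignored to mimic the "with defaults" leniency.

```python
import ast


def parse_call(source: str):
    """Parse a single call expression into (dotted name, positional args, kwargs)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    name = ast.unparse(node.func)  # e.g. "torch.hub.load" (Python 3.9+)
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {
        kw.arg: ast.literal_eval(kw.value)
        for kw in node.keywords
        if kw.arg is not None  # skip **unpacked keyword arguments
    }
    return name, args, kwargs


def ast_match(generated: str, reference: str, optional=frozenset()) -> bool:
    """Return True if the generated call matches the reference call.

    Every reference keyword argument must appear in the generated call with
    the same value, unless it is listed in `optional`, in which case it may
    be omitted. Unparseable output counts as a miss.
    """
    try:
        gen_name, gen_args, gen_kwargs = parse_call(generated)
    except (SyntaxError, ValueError):
        return False
    ref_name, ref_args, ref_kwargs = parse_call(reference)
    if gen_name != ref_name or gen_args != ref_args:
        return False
    for key, value in ref_kwargs.items():
        if key in gen_kwargs:
            if gen_kwargs[key] != value:
                return False
        elif key not in optional:
            return False
    return True
```

Comparing parsed structure rather than raw strings means differences in whitespace, quoting style, or keyword order do not count as errors, which is the main reason to grade on the AST.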

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools (arXiv:2302.04761, NeurIPS 2023) (accessed 2026-05-19)
2. Patil et al., Gorilla: Large Language Model Connected with Massive APIs (arXiv:2305.15334) (accessed 2026-05-19)
3. Patil et al., The Berkeley Function Calling Leaderboard (OpenReview, ICML 2025 poster) (accessed 2026-05-19)
4. ar5iv HTML render of Toolformer: Table 3 LAMA scores; Table 4 math scores; Section 2.2 filtering equations; GPT-J 6.7B base model (accessed 2026-05-19)
5. ar5iv HTML render of Gorilla: Table 1 zero-shot accuracy comparisons; Table 2 oracle-retriever results; Section 3.3 AST-match and hallucination definitions; 8x A100 training hardware (accessed 2026-05-19)
6. Berkeley Function Calling Leaderboard V4: current top-of-leaderboard scores and category list (accessed 2026-05-19)
7. BFCL GitHub repository: TEST_CATEGORIES.md, evaluation harness, V4 release notes (accessed 2026-05-19)
8. BFCL v3 third-party mirror: GLM 4.5 at 76.7% and Qwen3 32B at 75.7% as of 2026-04-21 (accessed 2026-05-19)
9. Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629, ICLR 2023) (accessed 2026-05-19)
10. Qin et al., ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (arXiv:2307.16789, ICLR 2024) (accessed 2026-05-19)
