ReAct vs ReWOO vs Plan-and-Solve: a multi-paper review of agent reasoning patterns

Multi-paper review of ReAct (Yao 2022), ReWOO (Xu 2023), and Plan-and-Solve (Wang 2023): how the three prompting patterns differ, what each measures, where each breaks.

19 May 2026 Updated 19 May 2026 ~49 min read

Reading-register key

From the paper: claims drawn directly from the source paper’s text, equations, tables, or figures.

[Analysis]: the publication’s own reasoned assessment, distinct from any claim the papers themselves make.

[External comparison]: a comparison to prior or contemporaneous work outside the three papers.

[Reviewer Perspective]: a critical or speculative assessment beyond what the papers prove.

Figure 1 of ReAct (arXiv:2210.03629): four-panel comparison of Standard, Chain-of-Thought, Act-only, and ReAct prompting on a HotpotQA question and an ALFWorld household task

Figure 1 of ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629), reproduced for editorial coverage. The diagram contrasts the four prompting patterns that anchor this multi-paper review.

Section 1: Cluster scope

This review covers three papers that propose prompting patterns for LLM-driven reasoning and acting: ReAct (Yao et al., ICLR 2023),¹ ReWOO (Xu et al., 2023),² and Plan-and-Solve (Wang et al., ACL 2023).³ All three were published within an eight-month window in 2022-2023 and have shaped how production LLM agents are structured through 2026.

The papers are linked, not independent. ReAct is the parent of the interleaved-tool-use pattern. ReWOO takes ReAct’s central design choice (interleaving reasoning and observations) and inverts it, planning the entire chain of tool calls up front. Plan-and-Solve sits in a different lane: it is a pure-reasoning prompt for math and commonsense problems, no tool calls, but it shares ReWOO’s “plan first, execute second” structural commitment. [Analysis] Reading them as a cluster reveals two design axes that recur across every modern agent framework: when does the model decide what to do next, and how often does it re-decide.

Section 2: TL;DR for the cluster

Three patterns. ReAct interleaves natural-language thoughts, tool actions, and tool observations in a single trajectory: the model sees the result of every tool call before deciding the next one. From the paper, this raises HotpotQA exact-match from 22.4% (Act-only) to 27.4% (ReAct) on PaLM-540B, and lifts ALFWorld household-task success from 45% to 71%.¹

ReWOO decouples the same problem into three modules. A Planner writes the full chain of tool calls up front with placeholder variables. A Worker executes those calls. A Solver synthesises the evidence into a final answer. The prompt for the model is never re-fed the context after the first call. From the paper, this delivers “5x token efficiency and 4% accuracy improvement on HotpotQA” relative to ReAct.²

Plan-and-Solve is a single-prompt pattern for reasoning without tool calls. The model is asked to first devise a plan, then execute it step by step. The paper analyses Zero-shot Chain-of-Thought’s failure modes (27% semantic misunderstanding, 12% missing steps, 7% calculation) and shows PS+ reduces missing-step errors from 12% to 7% on GSM8K.³

Practitioner takeaways:

ReAct is the default for tool-using agents where the next action genuinely depends on the previous observation (web search, interactive shells, web shopping, embodied tasks).
ReWOO is the right pattern when the reasoning chain is foreseeable (multi-hop QA, table lookups, calculator chains) and the savings of 5x token cost matter for budget or latency.
Plan-and-Solve is the right pattern for pure reasoning (math word problems, symbolic puzzles) on a model that supports zero-shot prompting but not full Chain-of-Thought few-shot exemplars.
None of the three is a strict superset. ReAct can recover from a failed plan mid-trajectory; ReWOO cannot. ReWOO uses fewer tokens; ReAct adapts. Plan-and-Solve is the cheapest of all but never calls tools.
The token-cost gap is real and reproducible. ReWOO’s measured token reduction (HotpotQA 9,795 → 1,986) is consistent with the prompt-redundancy argument and shows up in any framework that adopts the same structure.²

Section 2.5: Glossary

Term	Plain-English explanation	First appears in
Prompt	The text fed to a language model to elicit a response. Includes instructions, examples, and the user query.	Section 3
Chain-of-Thought (CoT)	A prompting pattern that asks the model to write its reasoning step-by-step before the final answer, which raises accuracy on math and multi-hop problems.	Section 3
Zero-shot CoT	The Kojima et al. variant where the model is prompted with the literal phrase “Let’s think step by step” rather than worked-example demonstrations.	Section 4
Few-shot CoT	The standard CoT pattern where the prompt includes several worked-example demonstrations before the user query.	Section 4
Trajectory	The sequence of prompts, model responses, and tool observations produced over the course of one task attempt.	Section 3
Tool / action / API	An external function the agent can invoke (search, calculator, code-runner, retrieval). The model emits a structured token that the framework parses and executes.	Section 3
Observation	The result returned by a tool call, fed back into the model’s next prompt.	Section 3
Thought	A natural-language reasoning step the model emits between actions in ReAct’s pattern. Does not change the environment.	Section 3
Token cost	The total number of tokens fed to and emitted by the LLM across a task. Drives both latency and dollar cost.	Section 2
Exact Match (EM)	Question-answering metric: the predicted answer string matches the reference answer string exactly, after normalisation.	Section 9
HotpotQA	A multi-hop question-answering benchmark requiring the model to combine evidence from multiple Wikipedia paragraphs.	Section 9
ALFWorld	A text-based simulation of household tasks (“put the apple on the desk”) used to evaluate language agents.	Section 9
WebShop	A simulated e-commerce environment used to evaluate language agents on a “buy the product matching this description” task.	Section 9
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the papers themselves claim.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the papers prove.	Section 11 + 12

Section 3: Problem formalisation (cluster-wide)

All three papers operate on the same underlying setup: a language model $\pi$ parameterised as a conditional distribution $\pi(y \mid c)$ over output tokens $y$ given a context $c$ . The context is a prompt, possibly with prior turns concatenated. They differ on what the agent loop around $\pi$ looks like.

ReAct’s formalisation. From the paper, ReAct extends a domain-specific action space $\mathcal{A}$ to an augmented space $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ , where $\mathcal{L}$ is the space of natural-language strings.¹ An action in $\mathcal{L}$ (a “thought”) does not change the environment but updates the context $c$ for subsequent action choices. The agent loop is:

$\text{Thought}_t, \text{Action}_t \sim \pi(\cdot \mid c_t), \quad \text{Observation}_t = \text{env}(\text{Action}_t), \quad c_{t+1} = c_t \oplus \text{Thought}_t \oplus \text{Action}_t \oplus \text{Observation}_t$

where $\oplus$ denotes string concatenation. The context grows by one Thought-Action-Observation triple per step.

ReWOO’s formalisation. From the paper, three modules: Planner $\mathcal{P}$ , Worker $\mathcal{W}$ , Solver $\mathcal{S}$ .² Given question $q$ :

$(p_1, a_1), \ldots, (p_k, a_k) \sim \mathcal{P}(q), \quad e_i = \mathcal{W}(a_i), \quad \text{answer} = \mathcal{S}(q, \{(p_i, a_i, e_i)\}_{i=1}^k)$

Here $p_i$ is a natural-language plan step, $a_i$ is the tool call for step $i$ , and $e_i$ is the evidence string returned by the tool. The Planner emits all $k$ plan-action pairs in a single LLM call before any tool is invoked. The Solver makes one final LLM call after all evidence is gathered. The Worker is not an LLM call.

Plan-and-Solve’s formalisation. No tools, no environment. A single LLM call with a structured prompt of the form “Q: $q$ . A: Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step.”³ The model’s output contains both the plan and the step-by-step execution interleaved as one stream.

Notation.

Symbol	Meaning	First appears
$\pi$	The language model, as a conditional distribution over output tokens.	Section 3
$c_t$	The prompt context at trajectory step $t$ .	Section 3
$\mathcal{A}$	The domain-specific action space (tool calls).	Section 3 (ReAct)
$\mathcal{L}$	The natural-language “thought” space.	Section 3 (ReAct)
$\mathcal{P}, \mathcal{W}, \mathcal{S}$	Planner, Worker, Solver.	Section 3 (ReWOO)
$k$	Number of plan steps.	Section 3
$e_i$	Evidence string returned by the Worker for step $i$ .	Section 3 (ReWOO)
$q$	The user question.	Section 3

Assumptions, shared across all three:

The LLM has been trained well enough that following instructions in plain English yields useful behaviour. None of the three trains the LLM; all three are pure prompting patterns at evaluation time.
For ReAct and ReWOO, tools are reliable enough that observations are interpretable. Both papers run robustness ablations on this assumption.
For ReAct, the few-shot exemplars in the prompt cover the task distribution closely enough that the LLM generalises. The paper uses 6 exemplars on HotpotQA, 2 on ALFWorld.¹
For Plan-and-Solve, the LLM is capable of zero-shot reasoning. The paper benchmarks GPT-3 text-davinci-003 and shows the pattern transfers to other zero-shot-capable backbones.³

Section 4: Motivation and gap, per paper

ReAct (Yao et al., ICLR 2023).¹ The motivation, from the paper, is that pure Chain-of-Thought reasoning hallucinates because the model has no external grounding, and pure action-emission (“Act-only”) wastes the reasoning capability the LLM has. From the paper: “reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources.” The gap is the absence of a prompting pattern that interleaves both. [External comparison] Prior work split into reasoning-only (Chain-of-Thought, Wei et al. 2022⁴) and action-only (WebGPT, SayCan); ReAct’s contribution is the synergy claim.

ReWOO (Xu et al., 2023).² The motivation is the redundancy of ReAct’s prompt structure. Every tool call in ReAct’s trajectory re-feeds the full context (instructions, exemplars, prior thoughts, prior actions, prior observations) back into the LLM. With $k$ steps, the cumulative token cost grows roughly quadratically in $k$ because each step’s prompt is the concatenation of all prior steps. ReWOO’s gap claim: if the chain of tool calls can be planned in advance, this redundancy is avoidable.

Plan-and-Solve (Wang et al., ACL 2023).³ The motivation is the empirical failure modes of Zero-shot Chain-of-Thought (Kojima et al. 2022⁵). The Wang paper analyses 46 GSM8K problems where Zero-shot-CoT fails on GPT-3 and categorises the errors: 27% are semantic misunderstanding of the problem, 12% are missing reasoning steps, 7% are calculation errors, with the rest distributed across other categories. The gap is a zero-shot prompt that explicitly mitigates the missing-step and calculation failure modes.

Practical stakes. [Analysis] All three papers landed during the Q4-2022 to Q2-2023 window when production deployment of LLM agents was just becoming feasible. Each addresses a different operational pain point: ReAct addresses hallucination, ReWOO addresses cost, Plan-and-Solve addresses reasoning quality without paying for few-shot exemplars. The patterns appear together in modern agent frameworks: LangChain’s ReActAgent is the ReAct pattern, LangChain’s PlanAndExecute is ReWOO with a different name, and most zero-shot reasoning prompts now look closer to Plan-and-Solve than to “Let’s think step by step.”

Figure 2 of ReAct (arXiv:2210.03629): the structured action-emission template the model follows during a ReAct trajectory

Figure 2 of ReAct (arXiv:2210.03629), reproduced for editorial coverage. Shows the per-step structured emission the LLM produces at each Thought / Action step of the trajectory.

Section 5: Method overview, per paper

ReAct

The prompting pattern interleaves three element types in the trajectory: Thought:, Action:, Observation:. From the paper, ReAct’s HotpotQA prompt uses three actions (search[entity], lookup[string], and finish[answer]) bound to a Wikipedia API.¹

A typical ReAct trajectory on the HotpotQA question “Aside from the Apple Remote, what other devices can control the program Apple Remote was originally designed to interact with?” looks like:

Thought 1: I need to search Apple Remote and find the program it was
designed to interact with.
Action 1: search[Apple Remote]
Observation 1: The Apple Remote is a remote control introduced in
October 2005 by Apple ... originally designed to control the Front Row
media center program ...
Thought 2: Apple Remote was designed to control the Front Row media
center program. I need to search Front Row next and find what other
devices can control it.
Action 2: search[Front Row]
Observation 2: ... Front Row is a discontinued media center software ...
Thought 3: Front Row is controlled by an Apple Remote or the keyboard
function keys.
Action 3: finish[keyboard function keys]

Each step’s prompt is the original instructions plus exemplars plus everything emitted and observed so far. The model decides Thought $t+1$ and Action $t+1$ jointly from this growing context.

Design rationale. Thoughts are free-form text that does not change the environment but conditions the next action. The synergy claim is empirical: a thought that articulates “I should look up Front Row next” makes the subsequent search[Front Row] more likely than the same prompt without the thought.

What breaks if removed. Removing thoughts gives the Act-only baseline; HotpotQA exact-match drops from 27.4% to 25.7% on PaLM-540B. Removing actions gives Chain-of-Thought; the model hallucinates on questions that require lookup. Removing observations gives a no-grounding setting that the paper does not evaluate.

Classification. [New] for the interleaved Thought-Action-Observation pattern; [Adopted] for the Wikipedia tool actions (the paper acknowledges WebGPT and SayCan as precedents).

ReWOO

The three-module architecture is the design’s defining choice. From the paper:²

The Planner receives the user question $q$ and emits a numbered list of plan steps. Each step has a natural-language description and a tool call, with placeholders #E1, #E2, ... for evidence that the Worker will fill in later. A canonical Planner output for the HotpotQA question above:

Plan: I should find the program Apple Remote was originally designed
to interact with.
#E1 = Wikipedia[Apple Remote]
Plan: I should find what other devices can control that program.
#E2 = Wikipedia[Front Row (software)]

Crucially, #E2’s tool call refers to “Front Row (software)”, a string the Planner emitted without seeing #E1’s actual result. The Planner inferred from the question phrasing what the second step would need to look up. [Analysis] This is the paper’s central engineering bet: that for many tasks, the chain of tool calls is foreseeable from the question alone.

The Worker executes each tool call sequentially. For #E1 = Wikipedia[Apple Remote], the Worker fetches the Wikipedia article and stores its text as the value of #E1. No LLM call.

The Solver receives the original question, the plan, and the evidence map {#E1: ..., #E2: ...} and emits the final answer in one LLM call.

Design rationale. Two LLM calls per task (Planner + Solver) regardless of the number of tool steps. ReAct uses $k$ LLM calls for $k$ steps, each with growing context. ReWOO’s token cost grows linearly in $k$ from the evidence, not quadratically from re-fed context.

What breaks if removed. Removing the Planner reduces to standard tool-use with no advance planning; the Solver cannot recover. Removing the Solver leaves the evidence ungrounded. Removing the Worker is impossible; the architecture requires external tool execution.

Classification. [New] for the three-module separation with evidence-variable binding; [Adapted] from earlier “plan then execute” patterns that did not formalise the evidence-variable referencing scheme.

Plan-and-Solve

The contribution is two specific prompt strings:³

The PS prompt appends to the question:

Let's first understand the problem and devise a plan to solve the
problem. Then, let's carry out the plan and solve the problem step
by step.

The PS+ prompt is more elaborate:

Let's first understand the problem, extract relevant variables and
their corresponding numerals, and make a complete plan. Then, let's
carry out the plan, calculate intermediate variables (pay attention
to correct numerical calculation and commonsense), solve the
problem step by step, and show the answer.

The paper appends a separate “answer extraction” prompt to retrieve the final numeric answer from the reasoning trace, mirroring the Kojima et al. Zero-shot-CoT two-stage pattern.

Design rationale. PS+ explicitly mentions “calculate intermediate variables (pay attention to correct numerical calculation and commonsense)” because the failure-mode analysis identified calculation and missing-step errors as the dominant correctable failures. The phrase “extract relevant variables and their corresponding numerals” is targeted at the semantic-misunderstanding failures, which the paper acknowledges PS+ reduces less effectively (27% before, 27% after on the 100-example error audit).

What breaks if removed. The “make a complete plan” clause is the load-bearing part; removing it reverts to a slightly-wordier Zero-shot-CoT. The “pay attention to correct numerical calculation” clause is the PS+ delta over PS.

Classification. [New] for the specific prompt strings and the failure-mode analysis; [Adapted] from Zero-shot-CoT and Chain-of-Thought.

Figure 1 of ReWOO (arXiv:2305.18323): the Planner / Worker / Solver workflow with #E evidence variables binding plan steps to tool observations

Figure 1 of ReWOO: Decoupling Reasoning from Observations (arXiv:2305.18323), reproduced for editorial coverage. The Planner emits the full chain of plan steps with #E1, #E2, ... evidence variables; the Worker fills those variables by calling tools; the Solver synthesises the answer in one final LLM call.

Section 6: Mathematical contributions

ReAct: action-space augmentation

MATH ENTRY [1]: Action-space augmentation.

Source: Section 2 of arXiv:2210.03629.
What it is: ReAct treats natural-language reasoning steps as members of the same action space the agent samples from, alongside the real environment-affecting actions.
Formal definition: $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ where $\mathcal{A}$ is the original action space and $\mathcal{L}$ is the unbounded space of natural-language strings. The agent’s policy is $\pi(\hat{a}_t \mid c_t)$ where $\hat{a}_t \in \hat{\mathcal{A}}$ .
Each term explained:
- $\mathcal{A}$ is finite or structured; for HotpotQA it has three entries (search, lookup, finish).
- $\mathcal{L}$ is countably infinite; any string up to the model’s context length is a legal “thought”.
- $\hat{\mathcal{A}}$ inherits the language-model’s autoregressive distribution: the model emits tokens that the framework parses as either a thought or an action depending on a prefix string.
Worked numerical example: on a tiny tool set $\mathcal{A} = \\{\mathrm{search}, \mathrm{finish}\\}$ with two integer-id slots and a thought-space of strings up to 50 tokens, the model’s per-step output distribution covers $\mid \mathcal{A}\mid \cdot N_{\mathrm{slots}} + N_{\mathrm{thought}}$ possible outputs, where $N_{\mathrm{thought}}$ for strings up to 50 tokens with a 50,000-token vocabulary is roughly $50{,}000^{50}$ . The combinatorial dominance of $\mathcal{L}$ is the entire point: the model spends most of its decoding budget in the thought space, only occasionally emitting an action.
Role: this is the abstraction that lets a Chain-of-Thought model and a tool-using model be the same model with the same prompt.
Edge cases: if $\mathcal{L}$ is restricted to the empty string, ReAct degenerates to Act-only. If $\mathcal{A}$ is empty, ReAct degenerates to Chain-of-Thought.
Novelty: [New]. The augmented-action-space framing is the paper’s central technical contribution; prior tool-using LMs (WebGPT, SayCan) kept reasoning and acting separated.
Why it matters: the augmentation explains why one LLM with one prompting recipe can subsume both Chain-of-Thought and tool-use. The same decoder, the same context, the same sampling procedure.

ReWOO: token-cost decomposition

MATH ENTRY [2]: ReWOO’s token-cost decomposition.

Source: Section 3 of arXiv:2305.18323, “Token Efficiency Analysis.”
What it is: a closed-form comparison of how many tokens an LLM agent consumes in ReAct versus ReWOO, given the same task and $k$ tool-call steps.
Formal definition: for a trajectory with $k$ tool-call steps, instruction tokens $T_{\mathrm{inst}}$ , per-exemplar tokens $T_{\mathrm{ex}}$ with $n_{\mathrm{ex}}$ exemplars, per-step prompt tokens $T_{\mathrm{step}}$ (the thought, action, observation triple), and final-answer tokens $T_{\mathrm{ans}}$ :

$T_{\text{ReAct}} \approx \sum_{t=1}^{k} \left[ T_{\text{inst}} + n_{\text{ex}} \cdot T_{\text{ex}} + \sum_{s=1}^{t-1} T_{\text{step}}^{(s)} + T_{\text{step}}^{(t)} \right] + T_{\text{ans}}$

The inner sum $\sum_{s=1}^{t-1} T_{\mathrm{step}}^{(s)}$ is the part that grows: at step $t$ , the prompt includes every prior step.

For ReWOO, the same task has two LLM calls (Planner and Solver):

$T_{\text{ReWOO}} \approx \left[ T_{\text{inst,P}} + n_{\text{ex,P}} \cdot T_{\text{ex,P}} + T_q \right] + \left[ T_{\text{inst,S}} + n_{\text{ex,S}} \cdot T_{\text{ex,S}} + T_q + \sum_{i=1}^{k} T_{e_i} \right] + T_{\text{ans}}$

where $T_q$ is the question, $T_{e_i}$ is the evidence string for step $i$ , and the P / S subscripts distinguish Planner and Solver instructions.

Each term explained:
- $T_{\mathrm{inst}}$ is the fixed instruction prefix (around 100-300 tokens for ReAct’s HotpotQA prompt).
- $T_{\mathrm{ex}}$ is one few-shot exemplar (200-800 tokens each in the published prompts).
- $T_{\mathrm{step}}^{(s)}$ is one thought-action-observation triple (50-500 tokens depending on observation length).
- The summation upper bound $k$ is the trajectory length, typically 3-7 for HotpotQA.
Worked numerical example: take $T_{\mathrm{inst}} = 200$ , $n_{\mathrm{ex}} = 6$ , $T_{\mathrm{ex}} = 400$ , $T_{\mathrm{step}} = 200$ , $k = 4$ , $T_{\mathrm{ans}} = 50$ . ReAct’s cost is $\sum_{t=1}^{4} [200 + 6 \cdot 400 + 200(t-1) + 200] + 50 = \sum_{t=1}^{4} [2600 + 200(t-1) + 200] + 50 = (2800 + 3000 + 3200 + 3400) + 50 = 12{,}450$ . ReWOO with the same parameters and $T_{\mathrm{inst,P}} = T_{\mathrm{inst,S}} = 200$ , two exemplars each, $T_{e_i} = 200$ averages, $T_q = 50$ : $(200 + 2 \cdot 400 + 50) + (200 + 2 \cdot 400 + 50 + 4 \cdot 200) + 50 = 1050 + 1850 + 50 = 2{,}950$ . The ratio $12{,}450 / 2{,}950 \approx 4.2\times$ aligns with the paper’s reported “5x token efficiency” within the noise of varying parameters.
Role: this is the analytical backbone of ReWOO’s headline claim. The 5x efficiency is not an empirical accident; it follows from the prompt-redundancy structure of ReAct.
Edge cases: if $k = 1$ (single tool call), ReWOO’s overhead from two LLM calls cancels the savings; the patterns are comparable. If exemplars are removed (zero-shot ReAct on a strong enough model), the inner $n_{\mathrm{ex}} \cdot T_{\mathrm{ex}}$ term vanishes and the gap shrinks proportionally.
Novelty: [New]. No prior work formalised the token-redundancy structure of interleaved tool-use this explicitly. [External comparison] The argument is structurally similar to recurrence-vs-attention complexity arguments in early transformer work, where the prompt-redundancy is the analogue of the recompute-vs-cache trade-off.
Why it matters: this is the math justifying every “plan-then-execute” agent framework that has shipped since.

Plan-and-Solve: failure-mode taxonomy

MATH ENTRY [3]: PS+ failure-mode reduction.

Source: Section 4 and Table 7 of arXiv:2305.04091.
What it is: a quantitative comparison of three error categories before and after applying PS / PS+ to GSM8K Zero-shot reasoning with GPT-3 text-davinci-003.
Formal definition: let $E_{\mathrm{cat}}^{\mathrm{method}}$ be the fraction of incorrect predictions falling in category cat under method. From Table 7:

$E_{\text{calc}}^{\text{CoT}} = 0.07, \quad E_{\text{step}}^{\text{CoT}} = 0.12, \quad E_{\text{sem}}^{\text{CoT}} = 0.27$

$E_{\text{calc}}^{\text{PS+}} = 0.05, \quad E_{\text{step}}^{\text{PS+}} = 0.07, \quad E_{\text{sem}}^{\text{PS+}} = 0.27$

Each term explained:
- $E_{\mathrm{calc}}$ : the prediction is wrong because the model performed arithmetic incorrectly (e.g., “12 + 7 = 21”).
- $E_{\mathrm{step}}$ : the reasoning trace omits a step that the correct solution requires.
- $E_{\mathrm{sem}}$ : the model misunderstood what the question asked.
- Numbers are fractions of the 100-example error audit, not fractions of GSM8K overall.
Worked numerical example: of 100 GSM8K problems sampled for the audit, under Zero-shot-CoT the failure breakdown is 7 calculation errors, 12 missing-step errors, 27 semantic errors, and 54 falling in other categories. Under PS+ the same audit shows 5 + 7 + 27 = 39 in the named categories, suggesting that the load-bearing improvement is in step-completeness (12 → 7) rather than semantic understanding (27 → 27).
Role: this is the empirical evidence behind PS+‘s prompt design. The “make a complete plan” and “calculate intermediate variables” clauses are targeted at the step and calculation categories respectively; the semantic-misunderstanding category is untouched, which the paper acknowledges as a limitation.
Edge cases: error-category boundaries are subjective. The paper audits 100 examples by hand; replicating the audit with a different annotator would shift the numbers within roughly $\pm 5\%$ .
Novelty: [New] failure-mode taxonomy for Zero-shot-CoT in this specific form; [Adapted] from prior error-analysis practice in NLP evaluation.
Why it matters: the taxonomy is the diagnostic that justifies PS+. Without it, the PS+ prompt would be one of dozens of prompt variants; with it, the prompt is targeted at a measured failure distribution.

Section 7: Algorithmic contributions

ALGORITHM ENTRY [1]: ReAct trajectory generation.

Source: Section 2 of arXiv:2210.03629; Appendix prompts.
Purpose: produce a Thought-Action-Observation trajectory that terminates at a final answer.
Inputs: language model $\pi$ , prompt prefix $P$ (instructions + few-shot exemplars), question $q$ , environment env exposing actions in $\mathcal{A}$ , maximum step count $T$ .
Outputs: the final answer string or a failure marker.
Pseudocode:

context := P + q
for t in 1..T:
    output := pi.sample(context)
    parse output into (thought_t, action_t)
    if action_t is finish[answer]:
        return answer
    obs_t := env(action_t)
    context := context + thought_t + action_t + obs_t
return failure

Hand-traced example on minimal input: question $q$ $q$ = “What year was the city where the Eiffel Tower stands founded?”, tool set $\mathcal{A} = \\{\mathrm{search}, \mathrm{finish}\\}$ $A = search, finish$ , $T = 3$ $T = 3$ .
- Step 1: context = $P + q$ . $\pi$ samples Thought: I need to find the city where the Eiffel Tower stands. Action: search[Eiffel Tower]. obs_1 = “The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.”
- Step 2: context now includes the step-1 triple. $\pi$ samples Thought: The Eiffel Tower is in Paris. I need the founding year of Paris. Action: search[Paris founding date]. obs_2 = “Paris was founded by the Parisii tribe in approximately the 3rd century BC.”
- Step 3: context now includes both prior triples. $\pi$ samples Thought: The answer is 3rd century BC. Action: finish[3rd century BC]. Return.
Complexity: $T$ LLM calls. Each call’s prompt length is $O(\sum_{s=1}^{t} T_{\mathrm{step}}^{(s)})$ , so total token cost is $O(T \cdot (T_{\mathrm{prefix}} + T \cdot T_{\mathrm{step}}))$ in the worst case, quadratic in $T$ when steps dominate.
Hyperparameters: $T$ (typically 7-10 for HotpotQA per the paper’s appendix), $n_{\mathrm{ex}}$ (6 for HotpotQA, 2 for ALFWorld), greedy decoding by default.
Failure modes: (a) the model emits a malformed action that cannot be parsed; (b) the trajectory exceeds $T$ steps without emitting finish; (c) repeated identical actions in a loop (the paper notes ReAct can get stuck in repetitive action patterns on certain prompts).
Novelty: [New] for the interleaved trajectory format; [Adopted] for the underlying autoregressive decoding loop.
Transferability: [Analysis] the pattern transfers cleanly to any tool-set with structured-string action representations. It does not transfer to environments with continuous actions or sub-symbolic state.

ALGORITHM ENTRY [2]: ReWOO Planner-Worker-Solver loop.

Source: Section 3 of arXiv:2305.18323; Appendix prompts.
Purpose: produce a final answer from a question using exactly two LLM calls and $k$ tool calls.
Inputs: language model $\pi$ , question $q$ , Planner prompt $P_P$ , Solver prompt $P_S$ , available tools.
Outputs: final answer string.
Pseudocode:

plan_output := pi.sample(P_P + q)
parse plan_output into list of (plan_i, action_i) with evidence
  placeholders #E1, #E2, ...
evidence := {}
for i in 1..k:
    action_resolved := substitute(action_i, evidence)
    e_i := tool.call(action_resolved)
    evidence[#Ei] := e_i
solver_input := P_S + q + format_plan_with_evidence(plan, evidence)
answer := pi.sample(solver_input)
return answer

Hand-traced example on minimal input: same Eiffel Tower question, tool = Wikipedia, $k$ $k$ chosen by Planner.
- Planner call: $\pi$ samples Plan: Find the city where the Eiffel Tower stands. #E1 = Wikipedia[Eiffel Tower]. Plan: Find when that city was founded. #E2 = Wikipedia[#E1's city]. The second action references #E1 symbolically.
- Worker step 1: Wikipedia[Eiffel Tower] returns the article text. evidence[#E1] = the article.
- Worker step 2: the framework either substitutes a derived value from #E1 (typically the Planner pre-binds the second tool argument; in practice ReWOO’s “foreseeable reasoning” framing means the Planner often hard-codes “Paris” because it already inferred the answer to step 1 from the question phrasing) or falls through to a simpler regex lookup. evidence[#E2] = the Paris article.
- Solver call: $\pi$ receives $q$ + the plan + both evidence entries, and emits “3rd century BC”.
Complexity: 2 LLM calls regardless of $k$ . Tool calls: $k$ . Total token cost is linear in $k$ via the evidence accumulation, not quadratic.
Hyperparameters: $k$ is chosen implicitly by the Planner. Few-shot exemplar count is typically 1-3 for the Planner.
Failure modes: (a) the Planner emits a plan whose tool calls don’t actually decompose the question; (b) #E2 references a derived value of #E1 that the framework cannot resolve without an extra LLM call; (c) the Solver fails to synthesise the final answer when evidence is verbose or contradictory.
Novelty: [New] for the Planner-Worker-Solver separation with evidence-variable binding.
Transferability: [Analysis] transfers cleanly to “foreseeable reasoning” tasks (multi-hop QA, table lookup, math with calculator); does not transfer to tasks where the next action genuinely depends on observing the previous one (interactive shell debugging, embodied exploration).

ALGORITHM ENTRY [3]: Plan-and-Solve two-stage prompting.

Source: Section 3 of arXiv:2305.04091.
Purpose: produce a final answer to a math or commonsense question using one or two LLM calls, no tools.
Inputs: language model $\pi$ , question $q$ , plan-prompt string $P_{\mathrm{plan}}$ , answer-extraction prompt string $P_{\mathrm{extract}}$ .
Outputs: final answer.
Pseudocode:

reasoning := pi.sample(q + P_plan)
answer := pi.sample(reasoning + P_extract)
return answer

Hand-traced example on minimal input: $q$ $q$ = “Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?”, $P_{\mathrm{plan}}$ $P_{plan}$ = the PS+ prompt above.
- First call: $\pi$ samples “Variables: Roger’s initial balls = 5; cans bought = 2; balls per can = 3. Plan: Compute total new balls = cans × balls per can. Add to initial. Step 1: 2 × 3 = 6. Step 2: 5 + 6 = 11. The answer is 11.”
- Second call: $\pi$ samples “Therefore, the answer (Arabic numerals) is 11.” Return 11.
Complexity: 2 LLM calls. Token cost is bounded by prompt + reasoning length, roughly 100-500 tokens total per task.
Hyperparameters: temperature 0 for greedy decoding (the paper). Temperature 0.7 with $N = 10$ samples for the self-consistency variant.
Failure modes: (a) the model produces a plan but does not follow it during execution; (b) the answer-extraction step picks up a wrong number from the reasoning trace; (c) the prompt language is sensitive, and the paper notes GPT-3 is sensitive to phrasing.
Novelty: [New] for the specific prompt design; [Adopted] for the two-stage extract-the-answer pattern from Kojima et al.⁵.
Transferability: [Analysis] transfers to any reasoning task without tool calls. Does not address grounding or hallucination; pure-CoT failure modes other than calculation and missing-step remain.

Section 8: Specialised design contributions

Subsection 8A: LLM / prompt design. All three papers’ contribution is at the prompt-design layer. Specific prompts disclosed:

ReAct HotpotQA prompt: 6 worked-example trajectories, three actions (search, lookup, finish). Full prompt in the paper’s appendix; reproduced in the official code at react-lm.github.io.⁸
ReAct ALFWorld prompt: 2 worked-example trajectories per task category, action space inherited from the ALFWorld text environment.
ReAct WebShop prompt: 1 worked-example trajectory, action space search[query], click[button], buy[product].
ReWOO Planner prompt: instruction + few-shot exemplars showing the numbered-plan-with-#E-bindings format. Full prompt in the GitHub repo.⁹
ReWOO Solver prompt: instruction + the plan + the evidence map, with the question repeated.
PS prompt and PS+ prompt: verbatim above. Plus a separate answer-extraction prompt “Therefore, the answer (arabic numerals) is”.

Subsection 8B: Architecture-specific details. Not applicable to this cluster; all three papers are prompting patterns, not architectural changes. The underlying models (PaLM-540B, GPT-3, GPT-3.5, text-davinci-003, LLaMA) are off-the-shelf.

Subsection 8C: Training specifics. ReAct’s paper includes a fine-tuning ablation: a PaLM-8B and PaLM-62B fine-tuned on ReAct trajectories generated from PaLM-540B. From the paper, the fine-tuned PaLM-62B outperforms the prompted PaLM-540B on HotpotQA, showing the pattern transfers to smaller fine-tuned models. ReWOO’s paper includes a parallel distillation experiment: fine-tuning LLaMA-7B on Planner outputs from GPT-3.5 produces a model that matches GPT-3.5’s accuracy on HotpotQA and TriviaQA in the Planner role. Plan-and-Solve does no fine-tuning.

Subsection 8D: Inference / deployment specifics. ReAct is one LLM call per trajectory step. ReWOO is two LLM calls per task regardless of plan length. Plan-and-Solve is one or two LLM calls per task. [Analysis] For production deployment, ReWOO’s call-count advantage compounds when tool calls have non-trivial latency (web search at 200-1000 ms each); the LLM-call budget can be parallelised against tool execution in ways that ReAct’s sequential trajectory cannot.

Figure 3 of ReWOO (arXiv:2305.18323): benchmark accuracy and token-usage comparison across ReWOO, ReAct, CoT, and direct-prompting baselines

Figure 3 of ReWOO (arXiv:2305.18323), reproduced for editorial coverage. Per-benchmark accuracy alongside token usage — the headline 5x token-efficiency claim is visible on the HotpotQA bar.

Section 9: Experiments and results

ReAct results

From the paper, on PaLM-540B with few-shot prompting:¹

Benchmark	Standard	CoT	CoT-SC	Act	ReAct	ReAct → CoT-SC
HotpotQA (EM)	28.7	29.4	33.4	25.7	27.4	35.1
FEVER (Acc)	56.3	56.3	60.4	58.9	60.9	64.6

Table 1 of ReAct (arXiv:2210.03629), reproduced for editorial coverage.

The ReAct → CoT-SC column is a fallback: when ReAct fails to produce an answer within $T$ steps, the model retries with Chain-of-Thought + self-consistency. From the paper, this combined strategy is the strongest setting on both benchmarks. On HotpotQA, ReAct alone trails CoT-SC by 6.0 points (27.4 vs 33.4); the combined strategy gains 1.7 points over CoT-SC alone.

On interactive decision-making:

Benchmark	BUTLER (IL)	Act	ReAct	Best baseline
ALFWorld success rate	37	45	71	37
WebShop success rate	29.1	30.1	40	29.1 (IL) / 28.7 (IL+RL)

Table 3 and Table 4 of ReAct, reproduced for editorial coverage.

From the paper, the ALFWorld jump from Act-only (45%) to ReAct (71%) is the headline result on decision-making tasks: the “absolute success rate of 34%” improvement over BUTLER is the IL baseline trained on $10^5$ expert trajectories. ReAct uses 2 in-context exemplars.

ReWOO results

From the paper, comparing on the same benchmarks against ReAct as the primary baseline:²

Benchmark	Method	Token cost	Accuracy
HotpotQA	ReAct	9,795	38.4%
HotpotQA	ReWOO	1,986	42.4%
TriviaQA	ReAct	4,213	60.7%
TriviaQA	ReWOO	1,341	66.6%
GSM8K	ReAct	1,874	58.0%
GSM8K	ReWOO	1,089	62.4%

Reconstructed from Table 2 and Table 3 of ReWOO (arXiv:2305.18323).

The paper’s average over all evaluated benchmarks: “64% token reduction with 4.4% absolute accuracy gain.” [Analysis] The accuracy gain over ReAct comes from two distinct sources: (a) Plan-then-Solve framing reduces the model’s chance of going off-script mid-trajectory, and (b) the Solver receives all evidence at once, which mimics few-shot QA conditioning more closely than ReAct’s interleaved feed.

The paper also reports the LLaMA-7B distillation: a 7B parameter model fine-tuned on Planner outputs matches GPT-3.5’s Planner accuracy on HotpotQA in zero-shot evaluation.

Plan-and-Solve results

From the paper, on GPT-3 text-davinci-003 (175B):³

Dataset	Zero-shot	Zero-shot-CoT	PS	PS+	Manual-CoT (8-shot)
GSM8K	16.8	56.4	58.2	59.3	58.4
SVAMP	65.7	69.9	72.0	75.7	80.3
MultiArith	32.5	83.8	87.2	91.8	93.6
AddSub	87.0	85.3	88.1	92.2	91.6
AQuA	24.8	38.9	42.5	46.0	48.4
SingleEq	86.6	88.1	89.2	94.7	93.5
CSQA	79.5	65.2	66.2	71.9	78.3
StrategyQA	65.5	63.8	62.0	65.4	71.2
Last Letter	1.5	64.8	65.0	75.2	70.6
Coin Flip	53.3	96.8	99.6	99.6	100.0

Reproduced from Table 2 and Table 3 of Plan-and-Solve Prompting (arXiv:2305.04091).

From the paper, PS+ “consistently outperforms Zero-shot-CoT across all datasets by a large margin” and is competitive with the 8-shot manual CoT baseline. PS+ exceeds 8-shot CoT on AddSub, SingleEq, and Last Letter; it trails on the others by margins of 0.4 to 5.2 points.

Self-consistency variant. From the paper, PS+ with self-consistency (temperature 0.7, $N = 10$ samples) on GSM8K reaches 73.7% (up from 58.7% greedy) and on SVAMP reaches 84.4% (up from 75.7%).

Cross-paper experimental scope

[Analysis] The three papers do not share a common benchmark suite that would allow a direct head-to-head. HotpotQA is the closest overlap: ReAct reports 27.4% EM on PaLM-540B; ReWOO reports 42.4% on GPT-3.5. These are different backbones and not directly comparable. Plan-and-Solve does not evaluate on HotpotQA (no tool use).

Independent benchmark cross-checks. The three patterns appear in widely-used open-source agent frameworks (LangChain, LlamaIndex, Haystack), where practitioners report broadly consistent qualitative trade-offs: ReWOO’s token-cost advantage holds at production scale; ReAct’s recovery-from-failure advantage holds on interactive tasks. [External comparison] Reflexion (Shinn et al. 2023⁶) extends ReAct with a verbal-RL outer loop and reports further gains on the same benchmarks; this is contemporaneous work that builds on ReAct’s interleaved-trajectory framing.

Evidence audit.

Strongly supported claims: ReAct’s hallucination reduction (6% false-positive rate vs 14% for CoT) on FEVER. ReWOO’s token-cost reduction on HotpotQA (9,795 → 1,986). PS+‘s missing-step error reduction on GSM8K (12% → 7%).
Partially supported claims: ReWOO’s “4% accuracy improvement on HotpotQA”. This is GPT-3.5 vs GPT-3.5, but the comparison ReAct baseline was re-implemented by the ReWOO authors and may differ in prompt detail from the original ReAct paper’s PaLM-540B numbers.
Claims relying on narrow evidence: Plan-and-Solve’s error-category boundaries are from a 100-example hand-audit; replication with a different annotator would shift the boundaries.

Section 10: Technical novelty summary

Component	ReAct	ReWOO	Plan-and-Solve
Interleaved Thought-Action-Observation pattern	[New]	[Adapted: replaced with up-front planning]	Not present
Augmented action space $\mathcal{A} \cup \mathcal{L}$	[New]	Not present	Not present
Planner-Worker-Solver separation	Not present	[New]	Not present (no tools)
Evidence-variable binding (`#E1, #E2, ...`)	Not present	[New]	Not present
Token-cost decomposition for prompting	Not present	[New] (formalised)	Not present
Zero-shot CoT failure-mode taxonomy	Not present	Not present	[New]
Specific PS / PS+ prompt strings	Not present	Not present	[New]
Few-shot exemplars for trajectories	[Adopted] from Chain-of-Thought	[Adopted] for Planner	Not present (zero-shot)
Self-consistency aggregation	[Adopted] from Wang et al. 2022	Not present	[Adopted] from Wang et al. 2022

Single most novel contribution per paper:

ReAct: the augmented-action-space framing $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ that lets reasoning and acting share a single autoregressive decode.
ReWOO: the Planner-Worker-Solver decomposition with evidence variables, which formalises the “plan once, execute the chain” pattern with a concrete implementation contract.
Plan-and-Solve: the empirical failure-mode taxonomy of Zero-shot-CoT, combined with prompt strings targeted at the dominant correctable categories.

Figure 1 of Plan-and-Solve (arXiv:2305.04091): error-type distribution across 46 GSM8K failures under Zero-shot-CoT, broken down into calculation, missing-step, and semantic-misunderstanding errors

Figure 1 of Plan-and-Solve Prompting (arXiv:2305.04091), reproduced for editorial coverage. The failure-mode audit (calculation 7%, missing-step 12%, semantic-misunderstanding 27%) that motivated the PS+ prompt design.

Figure 5 of Plan-and-Solve (arXiv:2305.04091): correlation matrix between generated reasoning components (variable definitions, plan presence) and error types (calculation, missing-step)

Figure 5 of Plan-and-Solve Prompting (arXiv:2305.04091), reproduced for editorial coverage. Negative correlation between variable-definition / plan-presence and the two targeted error classes — the empirical evidence behind the PS+ prompt’s instruction expansion.

Section 11: Situating the work

Prior work. Chain-of-Thought (Wei et al. 2022⁴) is the intellectual ancestor of all three papers. Zero-shot-CoT (Kojima et al. 2022⁵) is the direct parent of Plan-and-Solve. Tool-using LMs in 2021-2022 (WebGPT, SayCan, Toolformer per Schick et al. 2023⁷) are the parents of ReAct’s tool-use formalism. None of those parents combined reasoning and tool-use under a single prompt template; ReAct’s contribution is the synergy claim.

[External comparison] Contemporaneous related work. Reflexion (Shinn et al., March 2023⁶) extends ReAct’s trajectory with a self-reflection loop in which the model verbally critiques its own failed trajectory and tries again. Reflexion builds directly on ReAct’s interleaved-trajectory framing; the relationship is “ReAct + verbal-RL outer loop.” Toolformer (Schick et al., February 2023⁷) trains the model to insert tool-call tokens autoregressively, which is an alternative to ReAct’s prompt-based action elicitation; the technical relationship is that Toolformer makes tool-use a learned model behaviour while ReAct makes it a prompted behaviour.

[Reviewer Perspective] The strongest skeptical objection to the cluster is that all three patterns are prompt-design contributions that depend heavily on the underlying LLM’s instruction-following ability. The same prompts on a smaller or worse-instruction-tuned model produce qualitatively different behaviour; the patterns are not portable to arbitrary backbones. The empirical results from each paper would not necessarily replicate on a 7B-parameter open model without significant adaptation.

[Reviewer Perspective] The strongest author-side rebuttal, grounded in the papers’ empirical sections, is that the patterns transfer across at least the major instruction-tuned backbones evaluated (PaLM-540B, GPT-3, GPT-3.5, text-davinci-003, fine-tuned LLaMA-7B). ReWOO’s distillation result and ReAct’s fine-tuning result both show the underlying pattern survives backbone change.

Unsolved problems:

A unified benchmark that compares all three on matched data, model, and tool set.
A theory of when interleaved (ReAct) versus planned (ReWOO) tool-use is optimal, beyond the foreseeability heuristic.
Plan-and-Solve’s semantic-misunderstanding category remains untouched at 27%; no prompt-only technique in the cluster addresses it.

Future research directions:

[Analysis] Hybrid agents that start with a ReWOO plan and fall back to ReAct’s interleaved trajectory when the plan’s evidence variables cannot be resolved. The framing exists in LangChain’s PlanAndExecute agent but no published paper studies it systematically.
[Analysis] Cost-aware planners that decide between ReAct and ReWOO per task based on a predicted foreseeability score. A small classifier could route between the two patterns at the input layer.
[Reviewer Perspective] Plan-and-Solve-style prompts adapted to tool-using agents: a single prompt that asks the model to plan in natural language, then execute via tool calls, then verify. This is the obvious next-step composition of the three patterns and would benefit from a dedicated paper.

Section 12: Critical analysis

Strengths with concrete evidence.

ReAct’s interpretability claim is concrete: the trajectory is readable. From the paper, human raters could correct ReAct trajectories mid-execution on ALFWorld, raising success rate further. This is the most operationally useful property of the pattern.
ReWOO’s token-cost claim is supported analytically (Section 6 above) and empirically (Section 9 above). The argument generalises to any agent framework that interleaves prompts and tool observations.
Plan-and-Solve’s failure-mode taxonomy is methodologically clean: the audit is documented, the categories are operationalised, and the prompt design follows from the diagnosis.

Weaknesses explicitly stated by the authors.

ReAct’s paper acknowledges “limited support of reasoning and acting behaviors” under pure prompting and notes that scaling requires fine-tuning, which itself requires substantial human annotation. The paper also notes ReAct can get stuck in repetitive action loops on certain tasks.
ReWOO’s paper acknowledges that fully environment-dependent tasks requiring exploration (e.g., ALFWorld embodied tasks) require observation-dependent planning and that ReWOO’s “foreseeable reasoning” framing does not apply. The paper does not evaluate on ALFWorld.
Plan-and-Solve’s paper acknowledges sensitivity to prompt wording and that semantic-misunderstanding errors persist despite the approach.

Weaknesses not stated by the authors. [Reviewer Perspective]

ReAct’s hallucination claim (6% vs 14% false-positive rate on FEVER) is measured against Chain-of-Thought on the same backbone. The framing implies grounding through tool use; the more honest framing is that ReAct hallucinates less than CoT on tasks where the tool is informative, but the residual 6% false-positive rate is non-trivial for production fact-checking.
ReWOO’s accuracy gain is reported as “4% absolute on HotpotQA” but the comparison ReAct baseline (38.4% in ReWOO’s reimplementation) differs from the original ReAct paper’s PaLM-540B number (27.4%) and from the ReAct paper’s PaLM-540B-FT (35.1%). The comparisons across papers should be read with this implementation-detail caveat.
Plan-and-Solve’s PS+ over PS delta (1.1 points on GSM8K) is small relative to the prompt-string length difference and is sensitive to wording. [Analysis] A practitioner choosing between PS and PS+ should benchmark on their own data; the gap is within plausible noise for a single 100-example error audit.

Reproducibility check.

Paper	Code	Data	Trained weights	Eval set	Hyperparameters	Hardware/compute reported
ReAct	Yes — react-lm.github.io and GitHub repo	HotpotQA / FEVER / ALFWorld / WebShop all public	Fine-tuned PaLM not released; prompted PaLM not open-weight	Released	Documented	Partially (PaLM via Google)
ReWOO	Yes — github.com/billxbf/ReWOO	HotpotQA / TriviaQA / GSM8K / public benchmarks	Distilled LLaMA-7B Planner partially released	Released	Documented	Partially
Plan-and-Solve	Yes — GitHub repo (AGI-Edgerunners/Plan-and-Solve-Prompting)	All 10 datasets public	Not applicable — no fine-tuning	Released	Documented	OpenAI API

All three are well-reproduced in the open-source ecosystem. The patterns appear in LangChain (ReActAgent, PlanAndExecute), LlamaIndex (ReActAgent), and Haystack. [Analysis] Operational reproducibility is high; exact-number reproducibility against the closed-weight PaLM-540B and GPT-3 numbers is bounded by API drift.

Methodology disclosure.

Sample sizes. ReAct: 500 HotpotQA dev questions, 500 FEVER. ALFWorld: 134 unseen tasks. WebShop: 500 dev instructions. ReWOO: matched evaluation suites (paper does not always report the exact subset). Plan-and-Solve: full test sets of GSM8K (1,319 problems), full SVAMP (1,000), full MultiArith (600), etc.
Evaluation sets. All public benchmarks with documented train/test splits. Contamination risk against GPT-3.5’s training data exists for HotpotQA and GSM8K both; [Reviewer Perspective] this is a non-trivial caveat the ReWOO and Plan-and-Solve papers do not address directly.
Baselines. ReAct vs Standard / CoT / CoT-SC / Act / Imitation Learning. ReWOO vs ReAct / Direct / CoT. Plan-and-Solve vs Zero-shot / Zero-shot-CoT / 8-shot Manual CoT.
Hardware / compute. ReAct: PaLM-540B via Google internal infrastructure. ReWOO: GPT-3.5 via OpenAI API plus LLaMA-7B distillation on academic GPU cluster (cluster not specified to GPU-count level). Plan-and-Solve: GPT-3 text-davinci-003 via OpenAI API; no local compute.

Generalisability. The three patterns are tied to instruction-tuned LLMs and prompt-following behaviour. None is a learned mechanism that survives without prompts; all are at the LLM’s disposal at inference time. [Analysis] As LLMs improve, all three patterns transfer; the question is whether the patterns will be subsumed by future LLMs that plan natively without explicit prompting scaffolds.

Assumption audit. The “foreseeability” assumption ReWOO depends on is realistic for multi-hop QA and table lookups, fragile for interactive tasks. The “instruction-following” assumption all three depend on is realistic for major instruction-tuned models, fragile for base-pretrained models or for prompt formats the model has not seen. The “few-shot exemplars cover the task distribution” assumption ReAct depends on is realistic for narrow benchmarks (HotpotQA), fragile for open-ended tasks.

What would make the cluster significantly stronger. [Analysis] A single benchmark on a single backbone (e.g., GPT-4o or Llama-3-70B) comparing all three patterns plus Reflexion plus a few-shot CoT baseline on a matched evaluation suite would replace the cross-paper number comparisons that are currently the practitioner’s only signal.

Section 13: What is reusable for a new study

REUSABLE COMPONENT [1]: ReAct’s interleaved trajectory format. Any agent framework that needs tool use can adopt the Thought-Action-Observation structure verbatim. Preconditions: the LLM must follow few-shot exemplars and emit parsable action strings. Risks: trajectory loops, malformed actions. Interaction effects: composes with self-consistency (sample multiple trajectories, vote) and with Reflexion (verbal self-critique outer loop).

REUSABLE COMPONENT [2]: ReWOO’s Planner-Worker-Solver separation. Preconditions: the task’s tool-call chain is largely foreseeable from the question. What changes in a different setting: the Planner prompt must be re-engineered per task type; the Worker contract (how tools are invoked and how evidence is named) is reusable as-is. Risks: opaque failure when the Planner emits a plan that cannot be executed.

REUSABLE COMPONENT [3]: ReWOO’s evidence-variable binding (#E1, #E2, ...). Even without the full Planner-Worker-Solver architecture, the convention of letting the Planner refer to future evidence by symbolic name is reusable in any framework that wants to write down a plan before executing tools.

REUSABLE COMPONENT [4]: Plan-and-Solve’s failure-mode taxonomy. The error-category audit methodology (sample 100 errors, classify them by hand, design a prompt targeted at the dominant categories) is reusable for any prompt-design effort. The specific PS+ prompt is not the contribution; the diagnostic discipline that produced it is.

REUSABLE COMPONENT [5]: ReWOO’s token-cost decomposition formula. Any practitioner evaluating an agent framework can plug the formula into the framework’s prompt structure and estimate cost before running.

Dependency map. Components 1, 2, 3 form the agent-framework family; ReAct (Component 1) and ReWOO (Components 2 + 3) are alternatives, not stacked. Component 4 is methodologically independent. Component 5 is an evaluation tool that applies to either framework.

Recommendation. [Analysis] Highest-value adoption for a new agent study: Component 2 (Planner-Worker-Solver) for any task with foreseeable tool chains, because the token-cost savings compound at production scale and the architecture is more amenable to parallel tool execution than ReAct’s sequential trajectory.

[Analysis] A new study that benefits most: a production deployment of an LLM agent on a fixed-budget tool set (web search, calculator, retrieval) would adopt ReWOO as the default and fall back to ReAct only on tasks the Planner cannot decompose.

Section 14: Known limitations and open problems

Limitations explicitly stated by the authors.

ReAct: limited to tasks where actions are representable as structured strings; requires substantial human annotation to scale via fine-tuning; can get stuck in repetitive action loops.¹
ReWOO: not applicable to fully environment-dependent tasks (e.g., ALFWorld embodied exploration); requires the reasoning chain to be foreseeable from the question.²
Plan-and-Solve: sensitive to prompt wording; semantic-misunderstanding errors are not addressed.³

Limitations not stated by the authors. [Reviewer Perspective]

All three patterns are evaluated on instruction-tuned major-vendor backbones; transferability to open-weight 7B-30B models is partial (ReWOO’s LLaMA-7B distillation is the only data point and only for the Planner role).
The benchmarks used (HotpotQA, FEVER, GSM8K, ALFWorld, WebShop) all pre-date the relevant LLMs’ training data; contamination risk is non-trivial and none of the three papers runs a contamination check.
The patterns do not compose cleanly: stacking ReAct inside a ReWOO Solver, or running Plan-and-Solve inside a ReAct thought, are obvious next steps that none of the three papers attempts.

Technical root cause of each.

ReAct’s loops: the autoregressive decoder has no mechanism to detect that the current state is identical to a prior state; the model can re-emit the same action indefinitely.
ReWOO’s foreseeability constraint: the Planner runs once at the start with no environment access; it cannot adapt to evidence that contradicts the plan.
Plan-and-Solve’s semantic gap: prompt-only interventions cannot change the model’s interpretation of the question without changing the question itself.

Open problems.

Adaptive selection between ReAct and ReWOO per task, with cost-aware routing.
Compose Plan-and-Solve with tool use: a “plan, then execute via tools, then verify” prompt pattern.
Address the foreseeability constraint via a re-plan-on-failure mechanism that does not re-run the whole Planner.
Trajectory-loop detection at the framework layer for ReAct.

Most critical open problem. [Analysis] A unified evaluation of all three patterns plus Reflexion plus modern tool-use approaches (Toolformer, function-calling APIs) on a single matched benchmark with a single backbone. The cluster’s biggest weakness is that practitioners deciding between patterns must triangulate from three independent papers with different backbones and different prompt details. A follow-up paper that levels the playing field would crystallise the trade-offs.

How this article reads at three depths

For the curious high-school reader. Language models can write essays, but to actually do things in the world (look up facts on Wikipedia, run calculations, navigate a website) they need to call external tools. Three influential papers asked: how should the model think and act in sequence? ReAct interleaves thinking and acting one step at a time. ReWOO plans all the steps first and executes the plan. Plan-and-Solve is for pure reasoning problems (no tools): it asks the model to write down a plan first, then carry it out. All three patterns show up inside the AI assistants that exist in 2026.

For the working developer or ML engineer. ReAct interleaves Thought / Action / Observation in a single trajectory: every tool call’s prompt re-feeds the full prior context, which causes ~quadratic token growth and ~71% ALFWorld success vs 45% for action-only. ReWOO splits into Planner (one LLM call producing the full chain with #E1, #E2 evidence placeholders), Worker (executes tools), and Solver (one LLM call synthesising the answer): ~5x token efficiency and ~4% accuracy gain on HotpotQA. Plan-and-Solve is a zero-shot prompt for math/commonsense reasoning that lifts GSM8K from 56.4% (Zero-shot-CoT) to 59.3% (PS+) on GPT-3 text-davinci-003. Production rule: ReWOO when the tool chain is foreseeable and cost matters; ReAct when the next action depends on the previous observation; Plan-and-Solve for pure reasoning without tools.

For the ML researcher. The cluster maps two axes of agent design: when the model commits to actions (incrementally in ReAct, up front in ReWOO) and how often it re-decides (every step in ReAct, never in ReWOO after the Planner runs). The augmented-action-space framing $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ from ReAct is the cluster’s most reusable abstraction. ReWOO’s contribution is the formalised token-cost decomposition: quadratic in $k$ for ReAct, linear for ReWOO. Plan-and-Solve’s contribution is methodological: a failure-mode taxonomy of Zero-shot-CoT that lets prompt design target measured failures rather than guess. The strongest objection to the cluster is that none of the three is benchmarked head-to-head on a matched backbone and matched evaluation suite, so practitioners triangulate trade-offs from three papers with different setups. A follow-up paper that ran ReAct, ReWOO, Plan-and-Solve, and Reflexion on the same backbone with the same tools and the same test sets would crystallise the design-space mapping the cluster opened up.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao — ReAct: Synergizing Reasoning and Acting in Language Models (arXiv:2210.03629, ICLR 2023). Introduces the interleaved Thought-Action-Observation trajectory; reports HotpotQA 27.4 EM, FEVER 60.9 Acc on PaLM-540B; ALFWorld 71% success rate; WebShop 40% success rate. Project site at react-lm.github.io with reference code. (accessed 2026-05-19) ↩
2. Xu, Peng, Lei, Mukherjee, Liu, Xu — ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models (arXiv:2305.18323). Introduces the Planner-Worker-Solver architecture with `#E1, #E2` evidence-variable binding; reports 5x token efficiency and 4% absolute accuracy gain on HotpotQA versus ReAct; includes LLaMA-7B distillation of the Planner role. (accessed 2026-05-19) ↩
3. Wang, Xu, Lan, Hu, Lan, Lee, Lim — Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models (arXiv:2305.04091, ACL 2023). Defines PS and PS+ prompts; reports 100-example failure-mode audit on GSM8K showing 7% calculation, 12% missing-step, 27% semantic-misunderstanding errors under Zero-shot-CoT; PS+ reduces missing-step to 7%. (accessed 2026-05-19) ↩
4. Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv:2201.11903). The 2022 paper that introduced few-shot Chain-of-Thought prompting; the parent of both ReAct's reasoning component and Plan-and-Solve's pattern. (accessed 2026-05-19) ↩
5. Kojima, Gu, Reid, Matsuo, Iwasawa — Large Language Models are Zero-Shot Reasoners (arXiv:2205.11916). The "Let's think step by step" Zero-shot-CoT paper; the direct parent of Plan-and-Solve. (accessed 2026-05-19) ↩
6. Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao — Reflexion: Language Agents with Verbal Reinforcement Learning (arXiv:2303.11366). Contemporaneous extension of ReAct that adds a verbal-self-critique outer loop over the interleaved trajectory. (accessed 2026-05-19) ↩
7. Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom — Toolformer: Language Models Can Teach Themselves to Use Tools (arXiv:2302.04761). Contemporaneous work that makes tool-use a learned model behaviour rather than a prompted one; the alternative paradigm to ReAct's prompt-based action elicitation. (accessed 2026-05-19) ↩
8. ReAct project site with reference code, prompt templates, and trajectory examples for HotpotQA, FEVER, ALFWorld, and WebShop evaluations. (accessed 2026-05-19) ↩
9. ReWOO reference implementation including Planner and Solver prompts, tool integrations, and benchmark scripts. (accessed 2026-05-19) ↩

Anonymous · no cookies set

Found this useful? Share it.