SWE-agent, OpenHands, and Claude on SWE-bench — A Multi-Paper Technical Reference

Three software-engineering agent designs read together: SWE-agent's ACI, OpenHands' CodeAct platform, and Anthropic's minimal two-tool harness on SWE-bench Verified.

19 May 2026 Updated 19 May 2026 ~63 min read

Header image from the Anthropic engineering blog post on SWE-bench Verified, used under fair editorial use. The post itself uses more cautious framing.

1. Umbrella scope and cluster identity

Cluster citation. Three artefacts are read together in this article. (1) John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024, arXiv:2405.15793¹. (2) Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang and 19 co-authors. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv:2407.16741². (3) Erik Schluntz. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet. Anthropic engineering blog, 6 January 2025³.

Retrieval status. All three sources were retrieved at writing time. The SWE-agent paper was read from the arXiv abstract, the arXiv PDF, the OpenReview NeurIPS 2024 submission record, and the official SWE-agent project documentation⁴,⁵. OpenHands was read from the arXiv abstract and the arXiv PDF². The Anthropic post was read directly from anthropic.com/engineering³. ar5iv HTML renders for both papers returned conversion errors at retrieval time; figure extraction therefore falls back to direct PDF figure references with attribution, documented in the writer-note.

Classification. All three artefacts sit at the application layer of LLM-based software engineering: Architecture proposal (interface design, not model design) · Inference method · LLM-based · Application · Benchmark (in the experimental dimension). None of the three trains a new model. None of the three modifies a transformer backbone. All three change what surrounds the model, the scaffolding through which the model perceives a code repository and emits actions on it.

Why these three together. SWE-bench¹² is the benchmark the cluster is organised around: 2,294 real GitHub issues from popular Python repositories, each task requiring the agent to read the codebase, locate the bug, write a patch, and have the patch pass a hidden test set. The three artefacts under review here are the three most-cited contemporaneous answers to the question “how should the language model talk to a code repository?”. SWE-agent designs a small set of editor commands and calls them an Agent-Computer Interface (ACI). OpenHands packages the question into an open-source platform with a sandbox runtime, an event-stream abstraction, and an agent-skills library on top of the CodeAct paradigm. The Anthropic post argues, against both, that a model strong enough to drive a bash shell directly needs less scaffolding rather than more. Reading them together exposes where the three answers actually conflict and where they merely package the same primitives differently.

Technical abstract (in the publication’s voice). Three software-engineering agent designs are evaluated against the SWE-bench benchmark family. SWE-agent introduces a custom Agent-Computer Interface (ACI), a small set of commands (open, goto, scroll_up, scroll_down, edit, create, search_file, search_dir, find_file, submit) that constrain what the language model can do at each step and force a windowed view of source files; its headline result is 12.5% pass@1 on SWE-bench full and 18.0% on SWE-bench Lite using GPT-4 Turbo, beating the prior best retrieval-augmented baseline of 3.8%¹. OpenHands generalises the same insight into a platform: a Docker-sandboxed runtime exposing bash, Python (Jupyter), and a Playwright-driven browser; an event-stream abstraction recording every action and observation; an AgentSkills library of reusable Python primitives; and the CodeAct paradigm in which the agent emits executable Python rather than a fixed function-call schema. Its CodeActAgent v1.8 with Claude 3.5 Sonnet reaches 26% on SWE-bench Lite at release time²,¹⁰. The Anthropic post reaches 49% on SWE-bench Verified with a simple two-tool harness, bash and edit, arguing the scaffolding should shrink rather than grow as the underlying model improves³.

Primary research question (cluster). Given a code repository and a natural-language bug description, what is the right interface between the language model and the file system such that the model can read, search, edit, and verify its patch within a single rollout, and which design choices dominate the resulting pass rate?

Core technical claims (one per paper).

SWE-agent: A custom Agent-Computer Interface designed for the language model (small command set, windowed file viewer, linter-on-edit, structured search, guard-rails on invalid commands) lets a GPT-4 Turbo agent reach 12.5% on SWE-bench full, against 3.8% for the prior best retrieval-augmented baseline, demonstrating that interface design, not just model capability, is a load-bearing axis¹.
OpenHands: A generalist software-developer agent should be assembled from three concerns kept clean: a sandboxed runtime (Docker container with bash + IPython + browser), an event-stream of action-observation pairs, and a skill library exposed to the model through CodeAct (executable Python actions). The platform is open-source, MIT-licensed, and has accumulated more than 2,100 contributions from 188 contributors at submission time².
Anthropic post: Once the underlying model is capable enough, the scaffolding should be minimised rather than maximised. Two general-purpose tools (bash, edit) plus a permissive system prompt reach 49% on SWE-bench Verified with Claude 3.5 Sonnet, beating the prior best of 45%, which used substantially more elaborate harnesses³.

Core technical domains.

Domain	Depth required
Agent scaffolding and tool-use design	Deep
Language-model prompting and context management	Moderate
Sandbox and containerisation patterns (Docker, IPython)	Moderate
Benchmark evaluation methodology (pass@1, contamination, leakage)	Moderate
Software-engineering primitives (AST-aware editing, linting)	Surface
Browser-agent interaction (OpenHands only)	Surface

Reader prerequisites. High-school algebra; comfort reading Python code; rough familiarity with what a large language model is (a token-in / token-out function with a context window) and what a GitHub issue and a unit test are. The Glossary in Section 2.5 covers every technical term used in the body, including the claim-taxonomy labels.

2. TL;DR and executive overview

TL;DR. A software-engineering agent is a small program that lets a language model fix bugs in a real codebase, by giving the model a set of commands like “open this file”, “edit lines 30 to 40”, or “run the tests”. This article reviews three influential designs from 2024 and 2025 that argue about how big or small that command set should be, and explains how all three are measured against SWE-bench, a benchmark of 2,294 real GitHub issues where the model has to write a patch that passes hidden tests.

Executive summary (one paragraph). Fixing a real software bug is harder than writing code from scratch because the model has to navigate a large repository, read the right files, understand what is broken, edit in the right place, and verify the fix did not break anything else. SWE-agent (2024) argued the right answer is a small custom interface tuned for language models, with line-numbered windowed file views, a linter that rejects syntactically broken edits, and a focused search command, and showed pass rates roughly triple a retrieval baseline. OpenHands (2024) generalised the idea into an open-source platform combining a sandboxed runtime, an event stream, and a skill library, and reached 26% on SWE-bench Lite. Anthropic’s 2025 post on Claude 3.5 Sonnet pulled in the opposite direction: with a model strong enough to drive a raw bash shell, the scaffolding can shrink to two tools (bash and edit) and reach 49% on SWE-bench Verified. The three together demonstrate that the right interface depends on the underlying model, and that benchmark progress is mostly a story of agent design, not a story of new neural-network architectures.

Five practitioner-relevant takeaways.

Interface design has roughly the same leverage as a model upgrade in this regime. Stronger model on raw shell beats a weaker model on a custom interface; weaker model on raw shell loses to weaker model on a custom interface.
Linter-on-edit (reject syntactically broken patches before the agent sees a failed test) is a near-free pass-rate gain reported in the SWE-agent ablations¹.
Windowed file viewing with line numbers is a near-universal pattern across all three designs; the divergence is whether windowing is enforced (SWE-agent) or available (OpenHands, Anthropic).
Docker sandboxing is necessary, not optional, for any deployment beyond personal experiments; OpenHands ships this as a primitive, whereas the other two designs assume the operator brings their own.
SWE-bench scores depend as much on harness as on model. Cross-paper score comparisons are meaningless without specifying the harness, the SWE-bench split (full / Lite / Verified), and the inference cost.

Pipeline overview. All three designs share a common shape at inference time: the agent receives a task (a GitHub issue and a repo snapshot); the orchestration loop alternates between an LLM call that emits an action (a tool invocation, a command, or a Python snippet) and an environment step that executes the action and returns an observation (file content, search results, test output); the loop continues until the agent submits a patch or runs out of budget; the patch is then evaluated against the hidden test suite. The papers differ in (a) what actions the LLM can take, (b) how environment observations are shaped before re-entering the prompt, and (c) how the loop is bounded.
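
The shared loop shape described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the three projects: `run_episode`, the `"submit"` stop signal, and the callable signatures for `policy` and `env_step` are all hypothetical names chosen for the sketch, and the Anthropic harness in particular has no explicit submit step.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def run_episode(
    policy: Callable[[str, List[Step]], str],
    env_step: Callable[[str], str],
    task: str,
    budget: int,
) -> Tuple[List[Step], bool]:
    """Alternate policy calls and environment steps until submit or budget runs out."""
    history: List[Step] = []
    for _ in range(budget):
        action = policy(task, history)       # stands in for the LLM call
        if action == "submit":               # hypothetical stop signal
            return history, True
        observation = env_step(action)       # stands in for the sandbox
        history.append((action, observation))
    return history, False


# Toy usage: a scripted "policy" and an environment that echoes the action.
script = iter(["open foo.py", "edit 1:1", "submit"])
history, submitted = run_episode(
    policy=lambda task, hist: next(script),
    env_step=lambda action: f"ran: {action}",
    task="issue text",
    budget=10,
)
```

The three designs differ in what `policy` is allowed to emit, how `env_step` shapes its return value, and how `budget` is enforced; the loop itself is common to all of them.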

2.5 Glossary

Term	Plain-English explanation	First appears in
Agent	A loop that alternates between calling a language model and executing the action the model emitted, until a stopping condition is met.	Section 1
Agent-Computer Interface (ACI)	A small set of commands designed specifically for a language model to use, as opposed to a human-designed shell or IDE.	Section 1
SWE-bench	A benchmark of 2,294 real GitHub issues; the agent must produce a patch that passes a hidden test suite.	Section 1
SWE-bench Lite	A 300-instance subset of SWE-bench, easier on average and used for fast iteration.	Section 1
SWE-bench Verified	A 500-instance human-validated subset of SWE-bench where each task has been confirmed solvable from the issue text alone.	Section 1
pass@1	The percentage of tasks solved on the first attempt, with no resampling.	Section 1
Sandbox	A Docker container that isolates the agent’s actions so a wrong shell command does not damage the host system.	Section 1
CodeAct	A prompting paradigm where the agent’s actions are written as executable Python rather than as JSON-shaped function calls.	Section 1
Event stream	A logged sequence of action-observation pairs that records what the agent did and what the environment returned.	Section 1
Linter	A program that checks whether code is syntactically valid before it is executed.	Section 1
Windowed file view	Showing the model only N lines of a file at a time (with scroll commands) rather than dumping the whole file into context.	Section 1
Pass rate	The fraction of benchmark tasks the agent solves correctly, equivalent to pass@1 here.	Section 1
Patch	A diff against the original repository that, if applied, would fix the bug.	Section 1
Token budget	An upper bound on how many tokens the agent can spend (input plus output) on a single task.	Section 1
Harness	The orchestration code wrapping the language model: tool definitions, prompts, the loop, and the environment. Used as a synonym for agent in benchmark-result tables.	Section 1
`[Analysis]` label	The publication’s own reasoned assessment, distinct from what the paper itself claims.	Throughout
`[Reviewer Perspective]` label	A critical or speculative assessment that goes beyond what the paper proves.	Section 11, 12
`[Reconstructed]` label	Content the publication faithfully reconstructed because the source only partially disclosed it.	Where used
`[External comparison]` label	A comparison to prior work or general knowledge outside the source itself.	Section 4, 11
“From the paper:” prefix	Content directly supported by the paper’s text, equations, tables, or figures (or for the Anthropic post, by the blog post’s text).	Throughout

3. Problem formalisation

Notation table.

Symbol	Type	Meaning	First appears in
$\mathcal{T}$	task	A SWE-bench instance: a GitHub issue plus a repo snapshot at the buggy commit.	Section 3
$\mathcal{R}$	repository	The codebase at the pre-fix commit.	Section 3
$\mathcal{I}$	issue text	The natural-language bug description from the issue.	Section 3
$\mathcal{P}$	patch	A unified diff against $\mathcal{R}$ that the agent submits.	Section 3
$\mathcal{H}$	hidden test suite	Tests that the patch must pass to count as correct; not shown to the agent.	Section 3
$a_t$	action	The tool invocation the agent emits at step $t$.	Section 3
$o_t$	observation	The environment’s response to $a_t$.	Section 3
$\pi_{\theta}$	policy	The language model, conditioned on prompt and history.	Section 3
$\tau$	trajectory	The sequence $(a_1, o_1, a_2, o_2, \ldots, a_T, o_T)$ across one task.	Section 3
$B$	budget	Maximum number of LLM calls (or tokens) allowed per task.	Section 3
$\mathcal{A}$	action space	The set of legal commands or tool calls available.	Section 3
$\mathcal{O}$	observation space	The set of possible environment responses.	Section 3

Formal problem statement. Given a task $\mathcal{T} = (\mathcal{R}, \mathcal{I})$, the agent runs a policy $\pi_\theta$ that emits actions $a_t \in \mathcal{A}$ conditioned on the trajectory so far. The environment returns observations $o_t \in \mathcal{O}$. After at most $B$ steps the agent submits a patch $\mathcal{P}$. The task is solved iff the patch passes the hidden test suite:

$\text{solved}(\mathcal{T}) = \mathbb{1}[\mathcal{H}(\text{apply}(\mathcal{P}, \mathcal{R})) = \text{pass}]$

The benchmark score across $N$ tasks is the empirical pass rate:

$\text{pass@1} = \frac{1}{N} \sum_{i=1}^{N} \text{solved}(\mathcal{T}_i)$

Where the three designs differ formally. The three artefacts under review fix $\pi_\theta$ to an external API model (GPT-4 Turbo, Claude 3.5 Sonnet, GPT-4o, etc.) and vary only $\mathcal{A}$, the shape of $o_t$ (how observations are formatted before re-entering the prompt), and the orchestration of the loop. Formally none of the three modifies the policy. From the paper: SWE-agent restricts $\mathcal{A}$ to a hand-designed command set with structured output formats¹; OpenHands keeps $\mathcal{A}$ open-ended via CodeAct, where any executable Python snippet is a legal action²; the Anthropic post takes the middle path with two structured tools that wrap shell access and file editing³.

Assumptions, explicit and implicit.

From the paper (SWE-agent): the issue text $\mathcal{I}$ is sufficient (in principle) to locate the bug; the test suite $\mathcal{H}$ is faithful (passing the hidden tests means the bug is fixed)¹. Both assumptions inherit from SWE-bench itself¹².
From the paper (OpenHands): the sandbox can be reset between tasks without state leakage; the model can write valid Python under the CodeAct paradigm².
From the Anthropic post: the model has sufficient capability to recover from its own errors (re-read a file when an edit fails, re-run tests when output is unexpected) without explicit prompting³.
[Analysis] Potentially strong assumption (all three). Pass@1 on SWE-bench is treated as the load-bearing metric. Real-world software-engineering value is not the same as patch-against-hidden-tests, but the cluster collapses to a single scalar that all three optimise.

Why the problem is hard. Three structural reasons surface in the papers. First, the action space is open-ended: any string the model emits could be a valid command, an invalid command, or syntactically valid but semantically wrong. Second, observations are large and structured: a Python file can be 10,000 lines and only 4 of them matter, and the agent must allocate context to the right region. Third, success is verified only at the end of a long trajectory; intermediate rewards are weak signals at best (a passing test the agent ran does not guarantee the hidden test will pass).

4. Motivation and gap

Real-world example. A developer files a GitHub issue against a popular Python library: “pandas.read_csv raises a confusing error when a column header is duplicated.” The fix requires reading io/parsers/readers.py, locating the dedup logic, modifying it so the error message points at the offending column, and adding a regression test. From the paper (SWE-bench): 2,294 such issues across 12 Python repositories form the benchmark; each comes with the repo state at the buggy commit and a hidden test that the fix must pass¹².

Existing approaches before SWE-agent. Two families dominated pre-2024. (i) Retrieval-augmented generation (RAG): retrieve the file or function most similar to the issue text, hand it to the model along with the issue, and ask for a patch. The best RAG baseline that SWE-agent compares against reached 3.8% pass@1 on SWE-bench full¹. (ii) Bash-only agents: drop the model into a Linux shell inside a Docker container and let it run any command. From the paper (SWE-agent): a shell-only agent on the same model (GPT-4 Turbo) resolves 11.0% of SWE-bench Lite against SWE-agent’s 18.0%, suggesting raw shell access is worse than custom interface design despite being more general¹.

Gap (SWE-agent’s framing). Both prior families ignore that the language model is not a human developer and not a script. RAG over-commits to a single context dump and cannot iterate; raw bash exposes a command surface designed for humans (find, grep, sed, vim) whose ergonomics actively hurt the model. The paper proposes that the interface should be designed for the agent: simpler vocabulary, structured output, and guard-rails on common failure modes¹.

Gap (OpenHands’ framing). SWE-agent solves one half of the problem (the interface) but leaves the platform half as an exercise. There is no shared sandbox abstraction, no shared event-stream pattern, no shared skill library across agent implementations. From the paper (OpenHands): different research groups re-implement the same sandboxing, the same observation formatting, the same multi-tool plumbing; OpenHands consolidates this into an MIT-licensed, community-maintained platform on top of the CodeAct paradigm².

Gap (Anthropic post’s framing). Both prior framings assume the agent harness is doing real work that the model cannot. By 2025, with Claude 3.5 Sonnet, that assumption is no longer load-bearing: a minimal two-tool harness reaches 49% on SWE-bench Verified, beating elaborate harnesses on the previous-generation models. The framing is not that custom interfaces are wrong, but that they should be smaller not larger as the underlying model improves³.

Practical stakes. SWE-bench has become the primary benchmark on which frontier-model providers report coding agent performance. [External comparison] The benchmark is used by Anthropic, OpenAI, Google DeepMind, Meta, and Cognition Labs as a headline number in launch materials. The choice of harness materially changes the number, which means scaffolding research has direct commercial implications.

[External comparison] Position in the broader landscape. Three nearby research lines orbit the cluster. (i) Aider¹³, an open-source CLI pair-programmer that pre-dates SWE-agent and uses repository maps, git-aware diffs, and tree-sitter-based context selection; not a paper but a load-bearing reference implementation. (ii) Agentless (Xia et al., arXiv:2407.01489)¹⁴, which argues the agent loop itself is unnecessary: a three-stage pipeline (file localisation, code localisation, patch generation) with no iteration reaches competitive SWE-bench numbers. (iii) Devin (Cognition Labs), the closed-source commercial product whose launch precipitated the open-source response that became OpenDevin and then OpenHands. The reading is that the cluster sits in a contested design-space where minimalism (Aider, Anthropic post, Agentless) and structured platforms (SWE-agent, OpenHands) co-exist and trade off against the underlying model.

5. Method overview

SWE-agent’s ACI

Name and source. Agent-Computer Interface, SWE-agent Section 3¹.

Plain-English intuition. The model is not a human; do not give it the commands a human would use. Instead, design a small dialect: open a file with line numbers, scroll a window of 100 lines at a time, edit by specifying a line range and a replacement, search files and directories with one command each. Every action has structured output that the model can re-parse on the next turn.

Exact mechanism. From the paper (SWE-agent documentation)⁵, the ACI exposes the following commands:

open <path> [line_number]: opens a file and displays a window of 100 lines centred on the optional line number.
goto <line_number>: moves the window to centre on a line.
scroll_up / scroll_down: moves the window by 100 lines.
create <path>: creates a new file.
edit <start_line>:<end_line> followed by replacement text and end_of_edit: replaces a line range with new content. A linter runs immediately after; if the result is syntactically invalid Python, the edit is rejected and the file reverts.
search_file <pattern> [path]: searches within a file for a pattern.
search_dir <pattern> [path]: searches across a directory tree for a pattern.
find_file <name> [path]: finds files matching a name pattern.
submit: submits the current state as the final patch and ends the trajectory.
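
The linter-on-edit behaviour of the edit command can be illustrated with a small sketch. It assumes, for illustration only, that a syntax check via Python's built-in `ast.parse` stands in for whatever linter the real tool runs; `edit_lines` and its return convention are hypothetical and are not SWE-agent's implementation.

```python
import ast
from typing import Tuple


def edit_lines(source: str, start: int, end: int, replacement: str) -> Tuple[str, str]:
    """Replace 1-indexed inclusive lines start..end; keep the old text if the result is invalid."""
    lines = source.splitlines()
    if not (1 <= start <= end <= len(lines)):
        return source, f"error: line range {start}:{end} is outside 1..{len(lines)}"
    candidate_lines = lines[: start - 1] + replacement.splitlines() + lines[end:]
    candidate = "\n".join(candidate_lines) + "\n"
    try:
        ast.parse(candidate)  # assumption: syntax check as a stand-in for the linter
    except SyntaxError as exc:
        # Reject the edit and hand back the original text (the "revert").
        return source, f"edit rejected: {exc.msg} (line {exc.lineno}); file unchanged"
    return candidate, "edit applied"


src = "def f(x):\n    return x + 1\n"
good_text, good_msg = edit_lines(src, 2, 2, "    return x + 2")
bad_text, bad_msg = edit_lines(src, 2, 2, "    return (x + ")
```

In the sketch a rejected edit returns the original text together with an error message, so the next observation tells the model both that the edit failed and why.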

Connection to the full pipeline. The agent loop is: prompt the LLM with system instructions + ACI command reference + scrollback of recent action-observation pairs; receive the next action; execute it in the sandbox; format the observation (file window with line numbers, or search results, or linter error); append to the scrollback; repeat until submit or budget exhaustion.

Design rationale. From the paper: each command was iterated against pilot runs to remove failure modes. Windowed display prevents context blow-up on large files; linter-on-edit catches typos before a test run; structured search returns a summarised list of matches rather than raw grep output; submit is an explicit termination signal so the agent does not over-iterate¹.

What breaks if removed. From the paper (SWE-agent ablation, Table 6 / Section 5.2): the ablations are reported on SWE-bench Lite, where the full ACI scores 18.0% (the 12.5% figure is for SWE-bench full). Replacing the windowed file viewer with whole-file display (cat-only access) lowers pass@1 by several points, back toward the shell-only baseline. Removing the linter-on-edit causes a smaller but visible drop. Removing structured search reduces the agent’s ability to localise the bug¹. [Analysis] The windowed viewer is the single highest-payoff component in the ablation.

Classification. [New] for the explicit packaging as an Agent-Computer Interface design philosophy and for the specific command-set choices, with each individual command [Adapted] from standard IDE primitives.

OpenHands’ CodeAct platform

Name and source. OpenHands platform (formerly OpenDevin), built on the CodeAct paradigm¹¹ (Wang et al., arXiv:2402.01030, ICML 2024).

Plain-English intuition. Rather than design a custom command vocabulary, give the agent Python and a sandboxed runtime. Anything the agent wants to do, it writes as a Python snippet, the snippet runs in an IPython sandbox, the result comes back as a string. Common operations (open a file, search a directory, run a test) are wrapped as Python functions in an AgentSkills library and exposed to the model through the system prompt.

Exact mechanism. From the paper (OpenHands, Section 3)², the platform has three layers:

Runtime sandbox. A Docker container hosting bash, an IPython server, and a Playwright-controlled Chromium browser. From external sources¹⁰: agents interact via three primary action types, IPythonRunCellAction for Python, CmdRunAction for shell commands, and BrowserInteractiveAction for web navigation.
Event stream. Every action and every observation is appended to a typed event stream. The stream is the canonical agent memory; it is what gets shaped into the next prompt. Multi-agent coordination (one agent delegating sub-tasks to another) is mediated through the event stream.
AgentSkills library. A Python module of reusable primitives (open_file, edit_file, search_in_dir, run_tests, etc.) that the agent imports and calls inside its IPython actions, mirroring the SWE-agent ACI surface but as Python functions rather than custom commands.
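
A toy version of the event-stream and CodeAct ideas follows. It is a sketch under stated assumptions: `Event`, `EventStream`, and `run_python_action` are hypothetical names, not OpenHands classes, and the snippet runs code with an in-process `exec`, which provides no isolation at all. The real platform executes actions inside a Docker sandbox.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str      # "action" or "observation"
    content: str


class EventStream:
    """Append-only log of what the agent did and what came back."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def add(self, kind: str, content: str) -> None:
        self.events.append(Event(kind, content))

    def render(self) -> str:
        # One possible way to shape the log into the next prompt.
        return "\n".join(f"[{e.kind}] {e.content}" for e in self.events)


def run_python_action(code: str, namespace: dict, stream: EventStream) -> str:
    """Execute a Python snippet as the action and record the captured output."""
    stream.add("action", code)
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOT sandboxed; illustration only
        output = buffer.getvalue()
    except Exception as exc:
        output = buffer.getvalue() + f"{type(exc).__name__}: {exc}"
    stream.add("observation", output)
    return output


stream = EventStream()
namespace: dict = {}
first = run_python_action("x = 2\nprint(x * 21)", namespace, stream)
second = run_python_action("print(x + 1)", namespace, stream)  # state persists across actions
failed = run_python_action("1 / 0", namespace, stream)         # errors become observations
```

Because the namespace persists between calls, a later action can build on variables defined by an earlier one, which is the property that lets a code-as-action agent chain operations.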

Connection to the full pipeline. The CodeActAgent reads the event stream, generates a Python action (typically a single function call from AgentSkills or a raw bash command), executes the action against the sandbox, appends the resulting observation to the event stream, and continues until it emits a final-answer action or hits a step limit.

Design rationale. From the paper: the platform layer is what is missing from SWE-agent. Multiple research groups need the same sandbox, the same event abstraction, and the same skill library; consolidating these into an MIT-licensed open-source codebase reduces re-implementation and accumulates contributions from a wider community².

What breaks if removed. Removing the AgentSkills library forces the agent to write shell commands and Python from primitives, recovering roughly the bash-only baseline performance. Removing the Docker sandbox makes the platform unsafe to run on a real machine. Removing CodeAct in favour of structured function calls reduces the action expressiveness; the agent can no longer chain operations in a single action.

Classification. [Combination novel]. CodeAct itself is [Adopted] from Wang et al. 2024¹¹. The Docker sandbox pattern is [Adopted] from prior LLM-agent infrastructure. The novelty is the integration: a single MIT-licensed platform that operationalises all three layers and accumulates community-maintained agent implementations on top.

OpenHands GitHub repository social card. The repository hosts the open-source platform behind arXiv:2407.16741.

Anthropic’s minimal two-tool harness

Name and source. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet, Erik Schluntz, Anthropic engineering blog, 6 January 2025³.

Plain-English intuition. When the underlying model is strong enough, the right scaffolding is almost no scaffolding. Give the model two tools, bash (run any shell command) and edit (view, create, or string-replace in any file). Write the simplest possible system prompt. Trust the model to recover from its own mistakes.

Exact mechanism. From the source³, the harness exposes:

Bash tool. Executes arbitrary shell commands inside the sandbox. The tool description explains escaping rules, available packages, and how background processes behave.
Edit tool. Supports viewing a file, creating a file, and string-replacement edits. Requires absolute paths.
System prompt. Minimal. Permits the model arbitrarily long reasoning (“it’s fine if it’s very long”) and offers suggested steps rather than mandating an order.
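
To make the two-tool surface concrete, here is a minimal sketch of what such tools could look like. The function names, the command strings (`view`, `create`, `str_replace`), the parameter names, and the rule that a replacement must match exactly once are assumptions made for this illustration; it is not Anthropic's implementation, and it runs commands on the host rather than in a sandbox.

```python
import subprocess
from pathlib import Path


def bash_tool(command: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout and stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def edit_tool(command: str, path: str, old_str: str = "",
              new_str: str = "", file_text: str = "") -> str:
    """View, create, or string-replace in a file addressed by absolute path."""
    target = Path(path)
    if not target.is_absolute():
        return "error: path must be absolute"
    if command == "view":
        lines = target.read_text().splitlines()
        return "\n".join(f"{n:6}\t{line}" for n, line in enumerate(lines, start=1))
    if command == "create":
        target.write_text(file_text)
        return f"created {target}"
    if command == "str_replace":
        text = target.read_text()
        count = text.count(old_str)
        if count != 1:  # assumption: ambiguous or missing matches are refused
            return f"error: old_str occurs {count} times; expected exactly once"
        target.write_text(text.replace(old_str, new_str))
        return f"edited {target}"
    return f"error: unknown command {command!r}"
```

Returning error strings instead of raising keeps every failure visible to the model as an ordinary observation, which fits the design's reliance on the model recovering from its own mistakes.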

Connection to the full pipeline. The model alternates between bash calls and edit calls until it believes the patch is correct, then ends the trajectory. There is no explicit submit step; the final state of the filesystem at trajectory end is the patch.

Design rationale. From the source: “give as much control as possible to the language model itself, and keep the scaffolding minimal.” Engineering effort goes into tool descriptions (clarifying behaviour so the model uses tools correctly) rather than into adding more tools³.

What breaks if removed. [Analysis] Removing either of the two tools is roughly equivalent to falling back to a non-functional setup; the harness has no redundancy by design. The interesting failure mode is not removing tools but adding them: the source argues additional tools may not help and may hurt by introducing inconsistent semantics.

Classification. [Adapted]. The two-tool surface is a minimalist version of the SWE-agent ACI and the OpenHands AgentSkills library; the novelty is the empirical demonstration that minimalism beats elaboration at the Claude-3.5-Sonnet capability level.

6. Mathematical contributions

The cluster is light on novel mathematics, since these are systems papers, but several quantities are load-bearing for the experimental claims and deserve formal treatment.

MATH ENTRY [1]: Pass@1 over a benchmark

Source: SWE-bench (Jimenez et al., arXiv:2310.06770), reused by all three artefacts¹².
What it is: The fraction of benchmark tasks the agent solves on its first and only attempt.
Formal definition:

$\text{pass@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\mathcal{H}_i(\text{apply}(\mathcal{P}_i, \mathcal{R}_i)) = \text{pass}\big]$

Each term explained:
- $N$ is the number of tasks in the benchmark split (2,294 for full, 300 for Lite, 500 for Verified).
- $\mathcal{P}_i$ is the patch the agent submits for task $i$, a unified diff against the repository.
- $\mathcal{R}_i$ is the repository snapshot at the buggy commit for task $i$.
- $\mathcal{H}_i$ is the hidden test suite for task $i$, a Python callable that returns pass or fail.
- $\mathbb{1}[\cdot]$ is the indicator function (1 if true, 0 if false).
Worked numerical example. Suppose a small evaluation set of $N=4$ tasks. The agent solves tasks 1 and 3, fails 2 and 4. The indicator values are $[1, 0, 1, 0]$; the sum is 2; the average is $2/4 = 0.5$, so pass@1 = 50%. With $N=300$ (Lite) and 54 solves, pass@1 = $54/300 = 18.0\%$, the SWE-agent headline on GPT-4 Turbo¹.
Role: The single number every harness paper reports and the basis for cross-paper comparisons. If the benchmark tasks are treated as independent draws from an underlying task distribution, it is an unbiased estimator of the true model-plus-harness solve rate on that distribution.
Edge cases: Tasks where the hidden test suite is broken (the SWE-bench Verified split was constructed precisely to remove these); tasks the agent solves “by accident” without actually understanding the bug (patch passes the hidden tests but is semantically wrong).
Novelty: [Adopted] from SWE-bench¹².
Transferability: [Analysis] Pass@k generalises straightforwardly to multi-sample evaluation; pass@1 is the default in the cluster because each task is expensive and resampling is rarely budgeted.
Why it matters: It is the metric all three artefacts optimise and report.
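
The definition reduces to a mean over per-task booleans. The sketch below reproduces the two worked figures from this entry; `pass_at_1` is a hypothetical helper name, and the per-task outcomes are assumed to have been produced already by running the hidden tests.

```python
from typing import Iterable


def pass_at_1(solved: Iterable[bool]) -> float:
    """Fraction of tasks whose single submitted patch passed the hidden tests."""
    outcomes = list(solved)
    if not outcomes:
        raise ValueError("need at least one task")
    return sum(bool(s) for s in outcomes) / len(outcomes)


toy = pass_at_1([True, False, True, False])          # 2 of 4 tasks solved
lite = pass_at_1([True] * 54 + [False] * 246)        # 54 of 300 tasks solved
print(f"toy: {toy:.1%}, lite: {lite:.1%}")
```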

MATH ENTRY [2]: Trajectory cost as a token budget

Source: SWE-agent Section 4 cost analysis¹; Anthropic post discussion of token cost³.
What it is: The total LLM tokens consumed by an agent run on a single task, dominated by the cumulative context that each subsequent LLM call sees.
Formal definition. Let $C_t$ denote the prompt context size at step $t$ . Then:

$\text{tokens}(\tau) = \sum_{t=1}^{T} \big( C_t + |a_t| \big)$

Each term explained:
- $T$ is the number of steps in the trajectory (variable, capped by the budget $B$).
- $C_t$ is the prompt-context length at step $t$, accumulating system prompt + tool definitions + prior actions and observations.
- $|a_t|$ is the number of output tokens the model emits at step $t$ (the action itself).
Worked numerical example. Suppose system + tool prompt is 2,000 tokens; each action averages 150 tokens; each observation averages 500 tokens (one file window of about 100 lines plus formatting). The prompt at step $t$ then holds $C_t = 2000 + 650\,(t-1)$ tokens, so $C_{20} = 14{,}350$, and once step 20's action and observation are appended the context holds $2000 + 20 \times (150 + 500) = 15{,}000$ tokens. The cumulative input across all 20 calls is $\sum_{t=1}^{20} C_t = 2000 \cdot 20 + 650 \cdot \binom{20}{2} = 40{,}000 + 123{,}500 = 163{,}500$ tokens; adding the $20 \times 150 = 3{,}000$ output tokens gives $\text{tokens}(\tau) = 166{,}500$.
Role: From the source³: Anthropic’s successful runs frequently consume more than 100,000 tokens. From the paper¹: SWE-agent’s average run cost was substantially smaller because of windowed observations.
Edge cases: Trajectories that hit the context-window ceiling and require truncation; trajectories that the orchestrator terminates because cost has exceeded budget.
Novelty: [Adopted]; the cost-of-trajectory framing is standard in agent research.
Transferability: [Analysis] The quadratic-in-T input cost is the structural reason every harness in the cluster cares about observation compaction.
Why it matters: A meaningful chunk of SWE-bench-harness research is implicitly trading pass rate against tokens; without this quantity, score comparisons are incomplete.
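
The arithmetic in this entry can be checked with a short script. It is a sketch under the same illustrative assumptions as the worked example (2,000-token fixed prompt, 150-token actions, 500-token observations); the function names are hypothetical.

```python
def context_at_step(t, system_tokens=2000, action_tokens=150, obs_tokens=500):
    """Prompt context C_t seen by the model at step t (1-indexed)."""
    return system_tokens + (t - 1) * (action_tokens + obs_tokens)

def trajectory_tokens(steps, system_tokens=2000, action_tokens=150, obs_tokens=500):
    """Return (cumulative input tokens, cumulative output tokens) for one run."""
    inputs = sum(
        context_at_step(t, system_tokens, action_tokens, obs_tokens)
        for t in range(1, steps + 1)
    )
    outputs = steps * action_tokens
    return inputs, outputs

inputs, outputs = trajectory_tokens(20)
print(inputs, outputs)  # 163500 3000
```

Doubling the run to 40 steps gives 587,000 input tokens, about 3.6 times the 20-step figure, which is the quadratic growth in $T$ at work.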

MATH ENTRY [3]: Observation compaction (windowed file view)

Source: SWE-agent Section 3.1¹.
What it is: The window operator that, given a file of $L$ lines and a focus line $\ell$, returns a contiguous slice of roughly $W$ lines (at most $W + 1$ under the inclusive bounds below) centred on $\ell$, with line-number annotations.
Formal definition. For a file $f$ of length $L$ lines and window size $W$:

$\text{window}(f, \ell, W) = f\big[\max(1, \ell - W/2) : \min(L, \ell + W/2)\big]$

Each term explained:
- $L$ is the number of lines in the file.
- $\ell$ is the focus line (1-indexed).
- $W$ is the window size (100 in SWE-agent).
- The output is a contiguous slice of $f$ with line numbers prepended.
Worked numerical example. A file has $L = 1000$ lines, the agent calls goto 450, and $W = 100$. The window covers lines $[\max(1, 450 - 50), \min(1000, 450 + 50)] = [400, 500]$, returning 101 lines (the inclusive range) with each line prefixed by its 1-indexed line number. The observation token count is roughly $101 \times 10 = 1{,}010$ tokens (assuming ~10 tokens per line average), versus the $\approx 10{,}000$ tokens it would take to dump the whole file.
Role: Bounds the per-step observation size, which in turn bounds the trajectory cost (MATH ENTRY [2]) and prevents the context window from filling up after a few file opens.
Edge cases: Files shorter than $W$ (return the whole file); requests for lines past EOF (clamp at $L$); files in languages where the line-number convention differs (rare in the SWE-bench Python population).
Novelty: [Adapted] from IDE-style scrollback windows; [New] in its explicit framing as an agent-cost-control device.
Transferability: [Analysis] Highly transferable: any agent that touches large files benefits.
Why it matters: The single design choice that the ablation in SWE-agent Table 6 identifies as load-bearing for the 3% → 12.5% jump.
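
The window operator is small enough to write out in full. The sketch below follows the inclusive-range formula in this entry; the function name, the clamping of an out-of-range focus line, and the synthetic 1,000-line file are assumptions for illustration, not SWE-agent's actual implementation.

```python
def window(lines, focus, width=100):
    """Return (line_number, text) pairs for the slice centred on `focus`.

    Uses the inclusive bounds from the formula, so up to width + 1 lines
    come back. `focus` is 1-indexed and is clamped into the file.
    """
    total = len(lines)
    focus = min(max(focus, 1), total)
    start = max(1, focus - width // 2)
    end = min(total, focus + width // 2)
    return [(n, lines[n - 1]) for n in range(start, end + 1)]

source = [f"line {i}" for i in range(1, 1001)]
view = window(source, focus=450)
print(view[0][0], view[-1][0], len(view))  # 400 500 101
```

A file shorter than the window comes back whole, and a focus line past the end of the file is clamped to the last line, matching the edge cases listed above.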

MATH ENTRY [4]: Lint-gated edit

Source: SWE-agent Section 3.2; restated in the ACI background documentation⁶.
What it is: An edit operation that is committed iff the resulting file parses cleanly under a language-aware linter.
Formal definition. Let $f$ be the file, $e = (\ell_1, \ell_2, \text{new\_text})$ the edit, and $f'$ the proposed result:

$\text{commit}(f, e) = \begin{cases} f' & \text{if } \text{lint}(f') = \text{ok} \\ f & \text{otherwise (and surface the lint error to the agent)} \end{cases}$

Each term explained:
- $\ell_1, \ell_2$ are the 1-indexed start and end lines of the edit range.
- $\text{new\_text}$ is the replacement content.
- $\text{lint}(\cdot)$ is a Python-aware linter (flake8 or equivalent) returning ok or an error message.
Worked numerical example. The agent issues edit 30:35 with replacement text that has an unmatched parenthesis. The linter rejects $f'$; the file remains at $f$; the agent sees the lint error and can re-issue the edit. Without lint gating, the agent would proceed to run tests against a syntactically broken file, observe a SyntaxError when the module is imported, and waste several turns recovering.
Role: Cuts trajectory length by removing a common high-cost failure mode (run-tests-then-recover loops) before it starts.
Edge cases: Edits in non-Python files (most SWE-bench tasks are Python; for non-Python files the linter is a no-op); intentional partial edits the agent wants to make in stages (the linter forbids leaving the file mid-edit).
Novelty: [New] in the agent-design context.
Transferability: [Analysis] Highly transferable to any agent that edits structured code; the linter type varies by language.
Why it matters: Quantified in SWE-agent’s ablations as a measurable contribution to the pass-rate jump; conceptually generalises to guard rails against the LLM’s most common errors as a design pattern.
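
The commit rule above can be sketched as follows. Note the swap: this illustration uses Python's built-in `ast.parse` as a stand-in linter, which only catches syntax errors, whereas the entry describes a flake8-style linter. The function names and the two-line example file are hypothetical.

```python
import ast

def lint(source):
    """Stand-in linter: None if the source parses, else an error message."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None

def commit(lines, start, end, new_text):
    """Replace 1-indexed lines start..end (inclusive), keeping the edit only if lint passes.

    Returns (lines, error): the updated lines and None on success, or the
    untouched original lines and the lint message on rejection.
    """
    proposed = lines[:start - 1] + new_text.splitlines() + lines[end:]
    error = lint("\n".join(proposed) + "\n")
    if error is not None:
        return lines, error
    return proposed, None

original = ["def total(xs):", "    return sum(xs)"]
kept, err = commit(original, 2, 2, "    return sum((x for x in xs)")  # unmatched "("
print(kept == original, err is not None)  # True True
updated, err = commit(original, 2, 2, "    return sum(x for x in xs)")
print(updated[1], err)
```

The rejected edit leaves the file untouched and hands the agent an error message in the same turn, which is the behaviour the entry credits with avoiding run-tests-then-recover loops.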

MATH ENTRY [5]: CodeAct’s expressiveness (sketch)

Source: CodeAct paper (Wang et al., arXiv:2402.01030)¹¹, restated in OpenHands Section 3.2².
What it is: An argument that emitting executable Python is strictly more expressive than emitting fixed function-call schemas at equal capability.
Formal definition (paraphrased). Let $\mathcal{A}_{\text{fn}}$ be the set of legal actions under a fixed function-call schema with $k$ functions and at most $m$ arguments each; let $\mathcal{A}_{\text{py}}$ be the set of legal actions under CodeAct (any syntactically valid Python snippet, statements included). Treating each schema call as the equivalent Python call, $\mathcal{A}_{\text{fn}} \subsetneq \mathcal{A}_{\text{py}}$, with the gap including composition (function $f$ called on the result of function $g$ in one action) and control flow (a Python for-loop calling the same function with varying arguments).
Worked example. Under a fixed-schema function-call, the agent must issue two separate calls to read two files: read_file("a.py") then read_file("b.py"). Under CodeAct, the agent emits for path in ["a.py", "b.py"]: print(open(path).read()) as a single action, halving the round-trip cost.
Role: Justifies why OpenHands uses CodeAct rather than the OpenAI-style JSON function-call schema.
Edge cases: Models trained more heavily on JSON function-call data may underperform on CodeAct; the CodeAct paper reports this gap closes with model scale.
Novelty: [Adopted] from Wang et al. 2024.
Transferability: [Analysis] Transferable to any agent platform, with the caveat that CodeAct requires a sandboxed Python runtime to be safe.
Why it matters: It is the load-bearing argument for the headline design choice of the second paper in the cluster.
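
The round-trip saving in the worked example can be demonstrated with a toy executor. This is a sketch only: `run_code_action` is a hypothetical helper, and plain `exec` stands in for the sandboxed runtime that the Transferability note says CodeAct needs.

```python
import contextlib
import io
import os
import tempfile

def run_code_action(code, workdir):
    """Execute one Python action and return its captured stdout as the observation.

    Plain exec() is used for brevity only; a real platform would run the
    action inside an isolated sandbox.
    """
    buffer = io.StringIO()
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    finally:
        os.chdir(previous)
    return buffer.getvalue()

with tempfile.TemporaryDirectory() as repo:
    for name, body in [("a.py", "A = 1\n"), ("b.py", "B = 2\n")]:
        with open(os.path.join(repo, name), "w") as handle:
            handle.write(body)

    # Fixed schema: one read per action, so two round trips to the model.
    schema_actions = ['print(open("a.py").read())', 'print(open("b.py").read())']
    schema_obs = [run_code_action(action, repo) for action in schema_actions]

    # CodeAct: a loop inside a single action, so one round trip.
    codeact_action = 'for path in ["a.py", "b.py"]:\n    print(open(path).read())'
    codeact_obs = run_code_action(codeact_action, repo)

print(len(schema_obs), "round trips under the fixed schema, 1 under CodeAct")
print(codeact_obs == "".join(schema_obs))
```

Both routes surface the same file contents; the difference is how many model calls it takes to get them.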

7. Algorithmic contributions

Headline algorithm: the SWE-agent main loop

ALGORITHM ENTRY [1]: SWE-agent inference loop

Source: SWE-agent Section 3 (paper) and the official code repository⁴.
Purpose: Drive a language model through a SWE-bench task by alternating ACI command emission and sandbox execution until the agent submits a patch.
Inputs:
- task: a SWE-bench instance (repo snapshot at buggy commit + issue text), object.
- policy: the language model API client, callable.
- aci: the Agent-Computer Interface command interpreter, object.
- budget: maximum number of steps, integer.
Outputs:
- patch: a unified diff against the original repo, string.
- trajectory: the full action-observation log, list of pairs.

GitHub repository banner for SWE-agent (SWE-agent/SWE-agent), showing the project's social-card identity. The repository hosts the canonical reference implementation of the Agent-Computer Interface design from arXiv:2405.15793.

SWE-agent GitHub repository social card, the canonical reference implementation accompanying the paper.

Pseudocode for the SWE-agent inference loop:

def swe_agent_loop(task, policy, aci, budget):
    sandbox = aci.init_sandbox(task.repo)
    trajectory = []
    prompt = build_initial_prompt(task.issue, aci.command_reference)
    for step in range(budget):
        action = policy(prompt)            # LLM call: returns one ACI command
        observation = aci.execute(action, sandbox)
        trajectory.append((action, observation))
        if action.is_submit:
            break
        prompt = update_prompt(prompt, action, observation)
    patch = sandbox.diff_against_initial()
    return patch, trajectory

Hand-traced example. Suppose the task is “fix the duplicate-column error in pandas.read_csv”, the budget is 30, and the initial prompt names the ACI commands. Step 1: policy(prompt) returns search_dir "read_csv" ., so the agent looks for the function definition. aci.execute returns a list of files matching, including pandas/io/parsers/readers.py. Step 2: policy(prompt) returns open pandas/io/parsers/readers.py 450 (guessed line number). aci.execute returns lines 400 to 500. Step 3: policy(prompt) returns search_file "duplicate", narrowing to the dedup logic. Step 4: policy(prompt) returns edit 478:482 followed by replacement text and end_of_edit. The linter passes. The file updates. Step 5: policy(prompt) returns a bash invocation running the unit test. The test passes. Step 6: submit. The loop exits and sandbox.diff_against_initial() returns the patch.
Complexity. Time: $\mathcal{O}(T \cdot C_T)$ for the LLM calls, where $T$ is the trajectory length (bounded by budget) and $C_T$ is the final-step context size. Space: $\mathcal{O}(T \cdot \bar{O})$ for the trajectory storage, where $\bar{O}$ is the average observation size; in SWE-agent, $\bar{O}$ is bounded by the window size and is therefore small. The bottleneck is the LLM call: token throughput dominates wall-clock time.
Hyperparameters. window_size = 100 (the file viewer’s window); max_steps = 50 (typical budget in SWE-agent experiments); temperature = 0.0 (deterministic policy in headline results); structured-output enforcement is on by default.
Failure modes. Action parsing errors (the model emits a command the ACI cannot parse) → the ACI surfaces a parsing-error observation and the agent retries; budget exhaustion without submit → the current sandbox diff is returned as the patch; LLM context overflow → the trajectory is truncated, oldest observations first.
Novelty: [New] as a packaged loop, with each individual step [Adopted] from standard ReAct-style agents.
Transferability: [Analysis] The loop shape generalises to any agent + tool-set pair; the load-bearing component is the ACI not the loop.
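
The context-overflow failure mode (oldest observations dropped first) can be sketched as below. This is one plausible reading rather than SWE-agent's actual code: it keeps every action and replaces old observations with a placeholder, and it counts characters as a crude stand-in for tokens. All names are hypothetical.

```python
def truncate_oldest_observations(trajectory, limit, count_tokens=len,
                                 placeholder="[observation elided]"):
    """Elide observations, oldest first, until the trajectory fits in `limit`.

    `trajectory` is a list of (action, observation) string pairs. Actions are
    kept so the model still sees what it did. If eliding every observation is
    not enough, the fully elided trajectory is returned as-is.
    """
    pairs = list(trajectory)

    def size(items):
        return sum(count_tokens(a) + count_tokens(o) for a, o in items)

    for index, (action, _) in enumerate(pairs):
        if size(pairs) <= limit:
            break
        pairs[index] = (action, placeholder)
    return pairs

history = [
    ("open a.py", "x" * 400),
    ("search_file foo", "y" * 300),
    ("edit 3:4", "ok"),
]
trimmed = truncate_oldest_observations(history, limit=400)
print([obs[:20] for _, obs in trimmed])
```

In this example only the first observation is elided, because dropping it already brings the total under the limit.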

ALGORITHM ENTRY [2]: OpenHands CodeAct inference loop

Source: OpenHands Section 3² and CodeAct (Wang et al., arXiv:2402.01030)¹¹.
Purpose: Drive a language model through any task in the OpenHands platform by alternating Python-action emission and sandbox execution.
Inputs: task (typed event), runtime (Docker sandbox), event_stream (typed log), agent (CodeActAgent or other), budget.
Outputs: terminal event (final answer or end-of-task) and the full event stream.

def openhands_loop(task, runtime, event_stream, agent, budget):
    event_stream.append(task)
    for step in range(budget):
        action = agent.step(event_stream)  # emits IPythonRunCellAction,
                                           # CmdRunAction, or BrowserAction
        event_stream.append(action)
        observation = runtime.execute(action)
        event_stream.append(observation)
        if action.is_finish:
            break
    return event_stream.terminal_event(), event_stream

Hand-traced example. Same pandas task. Step 1: agent.step(stream) returns an IPythonRunCellAction containing from agentskills import search_in_dir; search_in_dir("read_csv", "."). runtime.execute runs the Python in IPython and returns the search results. Step 2: another IPythonRunCellAction with open_file("pandas/io/parsers/readers.py", 450). Step 3: an edit_file(...) call. Step 4: a CmdRunAction running pytest. Step 5: an AgentFinishAction. The event stream now contains the full trajectory and is the canonical record of the run.
Complexity. Same asymptotic shape as ALGORITHM ENTRY [1]; the Python-action surface adds constant overhead for the IPython interpreter cycle.
Hyperparameters. max_iterations = 50 (default); runtime_container_image (the Docker image hosting the sandbox); agent_class (CodeActAgent, BrowsingAgent, etc.).
Failure modes. Runtime exceptions (the agent’s Python raises) → the exception traceback is surfaced as the observation; container crashes → the loop terminates with an error event; the agent emits a non-action event (e.g., a message) → the loop continues to the next step.
Novelty: [Combination novel]. CodeAct primitives [Adopted] from Wang et al. 2024¹¹; event-stream pattern [Adapted] from event-sourcing in distributed systems; integration is platform-novel.
Transferability: [Analysis] Adopting OpenHands as a platform requires accepting the Docker sandbox dependency and the CodeAct paradigm; the event-stream pattern is more portable on its own.
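
The event-stream pattern on its own is easy to sketch. The classes below are hypothetical and far simpler than the platform's real event types; they only illustrate the append-only log from which views of a run can be rebuilt.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Event:
    kind: str      # "task", "action", or "observation"
    source: str    # "user", "agent", or "runtime"
    content: str

@dataclass
class EventStream:
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)  # append-only: past events are never rewritten

    def terminal_event(self) -> Optional[Event]:
        return self.events[-1] if self.events else None

    def replay(self, kind: str) -> List[Event]:
        """Rebuild a view of the run by filtering the log."""
        return [e for e in self.events if e.kind == kind]

stream = EventStream()
stream.append(Event("task", "user", "fix duplicate-column error"))
stream.append(Event("action", "agent", "search_in_dir('read_csv', '.')"))
stream.append(Event("observation", "runtime", "pandas/io/parsers/readers.py"))
stream.append(Event("action", "agent", "finish"))
print(len(stream.replay("action")), stream.terminal_event().content)  # 2 finish
```

Because the log is the only state, the same stream can serve as agent memory, audit trail, and replay source, which is why the pattern ports to other harnesses without the rest of the platform.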

ALGORITHM ENTRY [3]: Anthropic two-tool loop

Source: Anthropic engineering blog³.
Purpose: Drive Claude 3.5 Sonnet through a SWE-bench Verified task using only the bash and edit tools.
Inputs: task, policy = claude_3_5_sonnet, tools = [bash, edit], system_prompt (minimal), budget (maximum number of model turns).
Outputs: the diff between the filesystem state at trajectory end and the initial repository snapshot.

def anthropic_loop(task, policy, tools, system_prompt, budget):
    sandbox = init_sandbox(task.repo)
    messages = [system_prompt, user_message(task.issue)]
    for step in range(budget):
        response = policy(messages, tools=tools)
        # Keep the model's full turn (reasoning text plus tool calls) in context.
        messages.append(assistant_message(response))
        if response.is_final:              # no further tool calls: the model is done
            break
        for tool_call in response.tool_calls:
            result = execute(tool_call, sandbox)
            messages.append(tool_result(tool_call, result))
    # The patch is whatever the tools left on disk, diffed against the start state.
    return sandbox.diff_against_initial()

Hand-traced example. Same pandas task. Step 1: Claude emits a bash call running grep -r "read_csv" pandas/io/. Step 2: Claude reads the result and emits an edit call viewing pandas/io/parsers/readers.py lines 400-500. Step 3: Claude emits an edit call doing string-replacement on the dedup logic. Step 4: Claude emits a bash call running the unit test, observes it passes, and ends the trajectory.
Complexity. Same asymptotic shape; tool-call boilerplate per turn is slightly higher than SWE-agent’s ACI but the model’s recovery from errors compensates.
Hyperparameters. From the source³: Anthropic’s harness allows very long reasoning (“it’s fine if it’s very long”); successful runs frequently consume more than 100,000 tokens.
Failure modes. Edit applied to a stale view of the file (the file changed between view and edit); the source flags that the model assumes success without visibility into hidden test cases.
Novelty: [Adapted]. The minimalist two-tool surface is conceptually a strict subset of every prior design; the novelty is the empirical claim that this minimalism beats elaboration at Claude-3.5-Sonnet capability.
Transferability: [Analysis] Directly transferable to any sufficiently capable model; reportedly reverses on smaller / weaker models, where structured interfaces still help.
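
The stale-view failure mode suggests why a string-replacement edit benefits from a guard. The sketch below is a hypothetical `str_replace` helper, not the harness's published tool code; the exactly-one-match rule is an assumption made here for illustration.

```python
def str_replace(text, old, new):
    """Replace `old` with `new` only when `old` occurs exactly once.

    Returns (text, message). On failure the text is returned unchanged and
    the message tells the caller to re-view the file or add context.
    """
    hits = text.count(old)
    if hits == 0:
        return text, "no match: the view may be stale, re-read the file"
    if hits > 1:
        return text, f"{hits} matches: include more surrounding context"
    return text.replace(old, new, 1), "ok"

source = "names = dedup(names)\nprint(names)\n"
print(str_replace(source, "dedup(names)", "dedup(names, keep='first')"))
print(str_replace(source, "names", "cols")[1])
print(str_replace(source, "dedupe(", "x(")[1])
```

A zero-match result is the signal that the model is editing against a view that no longer reflects the file, so the cheapest recovery is to view the file again before retrying.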

8. Specialised design contributions

8A. LLM / prompt design

PROMPT ENTRY [1]: SWE-agent system prompt structure

Source: SWE-agent paper appendix and repository⁴.
Role in pipeline: Sets the agent’s role, names the ACI commands, and provides one or more demonstrations of correct usage.
Prompt type: Few-shot (a demonstration trajectory is included) with structured output expectations.
Components in order: (1) Role statement; (2) ACI command reference (each command + signature + example); (3) demonstration trajectory; (4) output-format reminder (e.g., emit one ACI command per turn, wrapped in code fences).
Input schema: a SWE-bench issue text + a starting working directory. Output schema: a single ACI command per turn, parsed from the model’s response.

[Reconstructed] system-prompt skeleton:

You are an autonomous software-engineering agent. You have access to
the following commands: open, goto, scroll_up, scroll_down, edit,
create, search_file, search_dir, find_file, submit.

Here is one example trajectory:
[DEMONSTRATION TRAJECTORY]

When you are confident the bug is fixed, call submit.
Emit one command per response, wrapped in a code fence.

Failure handling. Malformed commands → ACI returns a parsing-error observation. Re-attempt with corrected syntax.
Design rationale. From the paper: structured output enforcement and demonstration prompts each contribute measurably in pilot studies¹.
Complexity. Prompt length is roughly 2,000-3,000 tokens at start; one demonstration trajectory adds 1,000-2,000 tokens.
Novelty: [Adapted] from ReAct-style prompting.
Transferability: [Analysis] The prompt skeleton transplants directly to any ACI-style agent; only the command names need swapping.
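
The output schema (one fenced command per turn) and the parsing-error path can be sketched as a small parser. This is an illustrative assumption about how such a check might look, not SWE-agent's actual parser; `parse_command` and the error wording are hypothetical.

```python
import re

FENCE = "`" * 3  # built programmatically so the sketch nests cleanly in documents
BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)

def parse_command(response):
    """Return (command, None) on success or (None, error_observation)."""
    blocks = BLOCK.findall(response)
    if len(blocks) != 1:
        return None, f"parsing error: expected 1 fenced command, found {len(blocks)}"
    command = blocks[0].strip()
    if not command:
        return None, "parsing error: the fenced block is empty"
    return command, None

reply = (
    "I will inspect the parser.\n"
    + FENCE + "\nopen pandas/io/parsers/readers.py 450\n" + FENCE
)
print(parse_command(reply))
print(parse_command("no fence here"))
```

Returning the error as an observation, rather than raising, keeps the loop running so the model can retry with corrected syntax on the next turn.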

PROMPT ENTRY [2]: Anthropic minimalist system prompt

Source: Anthropic engineering blog³.
Role in pipeline: Frame the task and announce the two tools.
Prompt type: Zero-shot with permissive reasoning latitude.
Components in order: (1) Brief role statement; (2) tool announcement (bash + edit); (3) permissive guidance (“you may reason at length, it’s fine if it’s very long”); (4) suggested steps (not mandated).

[Reconstructed] prompt skeleton (the post does not publish the exact text):

You are a software engineer fixing a bug. You have two tools:
bash (run shell commands) and edit (view, create, or edit files).
You may reason at length before each action. Suggested approach:
read the issue, locate the relevant code, edit, run tests, repeat.

Failure handling. The source notes the model often assumes success without seeing hidden tests; mitigation is to surface as much test output as possible.
Design rationale. Minimise scaffolding, give the model control.
Complexity. Prompt under 1,000 tokens.
Novelty: [Adapted].
Transferability: [Analysis] Directly transferable to any capable model; the design philosophy is the contribution.

8B. Architecture-specific details

Not applicable to this cluster. None of the three artefacts modifies the underlying transformer architecture. All three call the language model as a black-box API.

8C. Training specifics

Not applicable to this cluster. None of the three trains or fine-tunes a model. The Anthropic post explicitly states that the SWE-bench Verified score is achieved with the same Claude 3.5 Sonnet weights served via the public API³. SWE-agent¹ and OpenHands² both evaluate against multiple external API models without modification.

8D. Inference / deployment specifics

Sandboxing. All three rely on Docker containerisation for safe execution. OpenHands ships the sandbox as a first-class platform component²; SWE-agent provides container images alongside the code⁴; the Anthropic post describes “extensive harness work” to make the sandbox reproducible³.
Test-time compute. The Anthropic post notes that successful runs often exceed 100,000 tokens, a substantial inference cost. SWE-agent runs are cheaper because of observation compaction. OpenHands runs vary by agent implementation.
Caching / retrieval. None of the three uses a retrieval-augmented prior over code embeddings as the primary mechanism; the agent loop itself is the retrieval method. AgentLess (a contemporaneous design, see Section 11) takes the opposite approach.
Multi-agent. OpenHands ships explicit support for multi-agent delegation; SWE-agent and the Anthropic harness do not. From the paper (OpenHands Section 3.4): one agent can spawn a sub-agent via the event stream, useful for delegating browsing or sub-task decomposition².

9. Experiments and results

Datasets

From the paper / source (all three):

SWE-bench full. 2,294 GitHub issues across 12 popular Python repositories¹². Each task: repo at the buggy commit + the issue text. Hidden tests are the regression tests added in the merged fix.
SWE-bench Lite. 300-instance subset, selected for being smaller and faster to iterate against¹².
SWE-bench Verified. 500-instance subset, human-validated for solvability from the issue text alone. From the source³: this is the split Anthropic’s reported 49% is measured on.
HumanEvalFix. A smaller code-repair benchmark, used by SWE-agent as a secondary evaluation. From the paper: SWE-agent reaches 87.7% pass@1 on HumanEvalFix¹.

Baselines

SWE-agent baselines: SWE-bench’s own RAG baseline (3.8%); bash-only agents on the same model (~3%); ReAct-style agents; raw GPT-4-Turbo with the issue and repo dumped into context.
OpenHands baselines: SWE-agent and Aider on the same model; CodeAct without the platform integrations.
Anthropic baselines: the prior state-of-the-art submission on SWE-bench Verified (45%) and contemporaneous OpenAI submissions.

[Analysis] Obvious missing baselines across all three: a meaningful comparison against pure-RAG-no-agent approaches at the same compute budget; a meaningful comparison against contemporaneous smaller-model harnesses to disentangle scaffolding-vs-model effects.

Evaluation metrics

Pass@1 is the primary metric across all three artefacts. SWE-agent additionally reports time cost and API cost per task. OpenHands additionally reports performance on WebArena, GAIA, MINT, and other non-SWE-bench benchmarks. The Anthropic post additionally discusses token counts per successful run.

Reproduced result tables

Table A. SWE-agent headline results, reproduced from SWE-agent (arXiv:2405.15793), Tables 1-2.

Harness	Model	SWE-bench full pass@1	SWE-bench Lite pass@1
RAG baseline	GPT-4 Turbo	3.8%	n/a
Bash-only agent	GPT-4 Turbo	~3%	~3%
SWE-agent	GPT-4 Turbo	12.5%	18.0%
SWE-agent	Claude 2	4.8%	9.0%

Numbers from SWE-agent (arXiv:2405.15793), reproduced for editorial coverage.

Table B. OpenHands platform results, reproduced from OpenHands (arXiv:2407.16741) and the CodeAct 1.0 release post¹⁰.

Agent	Model	Benchmark	Score
CodeActAgent v1.0 (OpenDevin)	GPT-4 / Claude	SWE-Bench Lite	21.0%
CodeActAgent v1.8	Claude 3.5 Sonnet	SWE-Bench Lite	26.0%
BrowsingAgent	GPT-4o	WebArena	(see paper Section 4)

Numbers from OpenHands (arXiv:2407.16741), reproduced for editorial coverage.

Table C. Anthropic minimal-harness results, reproduced from the Anthropic engineering blog.

Harness	Model	SWE-bench Verified pass@1
Prior best (external)	various	45.0%
Anthropic two-tool harness	Claude 3.5 Sonnet (upgraded, Oct 2024)	49.0%

Numbers from the Anthropic engineering blog, reproduced for editorial coverage.

Hugging Face Papers entry for SWE-agent (arXiv:2405.15793) social card, used as a citation marker.

Hugging Face Papers entry for OpenHands (arXiv:2407.16741) social card, used as a citation marker.

Main quantitative results

SWE-agent’s central claim: a custom ACI is the difference between a 3.8% RAG baseline and a 12.5% agent on the same underlying model, with the same available compute¹.
OpenHands’ central claim: the platform supports state-of-the-art results at the time of release across multiple benchmarks, while remaining open-source and community-extensible².
Anthropic’s central claim: a minimal two-tool harness with the right model beats an elaborate harness with a weaker model at SWE-bench Verified by 4 absolute percentage points³.

Ablations

SWE-agent (Table 6 / Section 5.2): removing the file viewer (windowed display) cuts pass@1 by roughly 60% (from 12.5% to ~5%); removing the linter cuts a few absolute points; removing structured search cuts a few points; combining all removals returns the harness to bash-only baseline¹.
OpenHands: ablation studies in Section 4 cover agent class (CodeActAgent vs other registered agents), the effect of multi-agent delegation, and the impact of the AgentSkills library².
Anthropic: the post does not present formal ablations; instead it presents the design philosophy and the headline score.

Hyperparameter sensitivity

SWE-agent reports light sensitivity analysis on temperature and budget. OpenHands’ multi-author repository tracks per-agent hyperparameter changes via the OSS release history rather than in the paper. The Anthropic post mentions tool-description wording as a sensitivity axis (“extensive effort went into tool descriptions”).

Robustness and stress tests

None of the three artefacts presents formal adversarial robustness tests. From the paper (SWE-agent): contamination is acknowledged but not formally tested. SWE-bench Verified (used by Anthropic) was constructed to address task-quality concerns, with human annotators validating that each task is solvable from the issue text and that its tests are fair; that validation does not by itself rule out contamination.

Qualitative results

SWE-agent’s appendix shows qualitative trajectories on representative issues, including failure cases (the agent over-iterates on the wrong file).
OpenHands’ paper shows screenshots of the platform UI and agent traces on browsing tasks.
The Anthropic post discusses individual successful and failed runs in prose, including the failure mode where the model assumes success based on local tests that pass while hidden tests fail.

Experimental scope limits

SWE-agent and OpenHands evaluate on the SWE-bench full / Lite splits; Anthropic evaluates on SWE-bench Verified. Cross-paper comparisons therefore require carrying the split label.
None of the three reports cost-normalised pass-rate curves at constant token budget; all three numbers are at the budget the authors picked.
Python-only: SWE-bench is exclusively Python, so none of the three speaks directly to JavaScript / Rust / Go / C++ codebases.

Independent benchmark cross-checks for SOTA claims

The SWE-bench leaderboard at swebench.com¹¹ is the canonical independent cross-check. As of early 2026, OpenHands + CodeAct on a Claude Opus 4.x model is reported in third-party leaderboards at around 68% on Verified, with the highest single-model harness combinations crossing 80%. Both figures are far above the numbers in the three artefacts at their respective publication dates. [Analysis] The benchmark is not yet saturated but is approaching saturation; future research is expected to migrate to SWE-bench Pro¹¹, which is harder.

Evidence audit

Strongly supported claims. SWE-agent: the ACI ablation is direct evidence for the interface-design thesis. OpenHands: the platform’s open-source numbers (contributions, contributors) are auditable on GitHub. Anthropic: the 49% number is reproducible by anyone with API access and the two-tool harness.
Partially supported claims. SWE-agent: the precise contribution of each ACI command individually is reported as a single ablation per component, not a full factorial.
Narrow-evidence claims. All three: the headline pass rates are single-seed, single-budget snapshots; sensitivity to seed and budget is largely uncharacterised.
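
One way to put a rough scale on the single-seed caveat is the binomial standard error of a pass rate. The sketch below treats each task as an independent pass/fail draw, which is an assumption; it captures sampling noise over tasks only, not run-to-run variance from seeds or budgets. The function name is hypothetical.

```python
from math import sqrt

def pass_rate_standard_error(rate, tasks):
    """Binomial standard error of a pass rate measured on `tasks` instances."""
    return sqrt(rate * (1.0 - rate) / tasks)

for label, rate, tasks in [
    ("Verified, 49.0%", 0.49, 500),
    ("Lite, 18.0%", 0.18, 300),
]:
    se = pass_rate_standard_error(rate, tasks)
    print(f"{label}: +/- {100 * se:.1f} points (1 s.e.)")
```

Under this simple model both headline figures carry roughly two points of task-sampling uncertainty, so gaps of a few points between harnesses are best read alongside the split and run count.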

10. Technical novelty summary

Component	Type	Novelty level	Justification	Source
Agent-Computer Interface (concept)	Design philosophy	Fully novel	Names and operationalises a category that prior agent work treated as implementation detail	SWE-agent §3
Windowed file viewer	Mechanism	Combination novel	IDE concept ported into agent context with explicit cost-control framing	SWE-agent §3.1
Linter-on-edit	Mechanism	Fully novel (in agent context)	Standard linter, novel as a per-edit gate in an agent loop	SWE-agent §3.2
Structured search command	Mechanism	Incrementally novel	grep with ranked, agent-friendly output	SWE-agent §3.3
CodeAct paradigm	Action representation	Adopted	From CodeAct (Wang et al., arXiv:2402.01030)¹¹	OpenHands §3.2
Event-stream abstraction	Memory pattern	Combination novel	Event sourcing applied to agent memory	OpenHands §3.3
AgentSkills library	Skill abstraction	Combination novel	Python-callable skill library exposed to the agent	OpenHands §3.4
Docker runtime with bash + IPython + Playwright	Sandbox	Adopted	Standard primitives; integration is platform-novel	OpenHands §3.1
Multi-agent delegation	Coordination	Incrementally novel	Sub-agent spawning through the event stream	OpenHands §3.4
Two-tool minimal harness	Design philosophy	Adapted	Strict subset of prior surfaces; novelty is the empirical demonstration of minimalism beating elaboration	Anthropic post

Single most novel contribution. [Analysis] The single most novel contribution across the cluster is SWE-agent’s explicit operationalisation of Agent-Computer Interface as a named, ablate-able design axis. Prior agent work treated the question “what commands does the model see?” as an implementation detail. SWE-agent reframes it as a research-grade design space and demonstrates that the design choices within that space dominate the resulting benchmark score by a wider margin than backbone-model upgrades within the same scaffolding family.

What the cluster does NOT claim as novel. The transformer backbone (all three use external API models); the SWE-bench benchmark itself (Jimenez et al., 2023); the CodeAct paradigm (Wang et al., 2024); Docker sandboxing; ReAct-style action-observation loops; tree-sitter / linter primitives; git diff handling. The minimalist two-tool design philosophy in the Anthropic post is explicitly framed as empirical demonstration rather than novel design.

11. Situating the work

What prior work did. Pre-2024 LLM software-engineering work mostly used (a) RAG over code embeddings followed by single-shot patch generation, or (b) ReAct-style agents with raw shell access. Pass rates on SWE-bench were in the 2-5% band.

What this cluster changes conceptually. Reframes the problem as interface design between the model and the file system, with the policy held fixed. The shift moves a substantial fraction of subsequent research effort from “build a bigger code model” to “build a better harness.”

Contemporaneous related papers.

AgentLess (Xia et al., arXiv:2407.01489, July 2024)¹⁴. Argues that the agent loop itself is unnecessary: a three-stage pipeline (file localisation, code localisation, patch generation) with no iteration reaches competitive SWE-bench numbers at substantially lower cost. The relationship to the cluster: AgentLess is the loop-free counter-argument; SWE-agent and OpenHands assume the loop is load-bearing. [External comparison] AgentLess builds the case that for many tasks the model already knows what to do, and the iteration in agent loops is wasted compute.
Aider (Paul Gauthier, ongoing open-source project since 2023)¹³. A CLI pair-programmer that uses repository maps (tree-sitter-based ranking of files by likely relevance), git-aware diffs, and explicit chat-style turns. Pre-dates SWE-agent and is the load-bearing reference implementation for the light-touch interface family. [External comparison] Aider’s design choices (compact repo map, diff-as-output) influenced both SWE-agent’s structured search and OpenHands’ AgentSkills library design.
SWE-Gym (Pan, Wang, Neubig et al., 2025): a training environment for software-engineering agents built on top of OpenHands; takes the platform layer as given and pushes the question downstream to RL-style verifier training.
CodeAct (Wang et al., arXiv:2402.01030, ICML 2024)¹¹: the direct predecessor of OpenHands’ action paradigm; the technical relationship is that OpenHands operationalises CodeAct at platform scale.

Aider project hero image from aider.chat, the CLI pair-programmer that pre-dates SWE-agent and is the load-bearing reference implementation for the light-touch interface family discussed in Section 11.

Hero image from aider.chat, used under fair editorial use as a citation marker for the Aider project referenced in Section 11.

[Reviewer Perspective] Strongest skeptical objection. The interface-design thesis is partly an artefact of the model generation. Once Claude 3.5 Sonnet (and later) can drive a raw shell, the SWE-agent ACI’s ablation gap (~10 absolute pass-rate points) narrows substantially. Anthropic’s 49% with a two-tool harness is the empirical evidence for this objection. SWE-agent’s reply, if forced, would be that even Claude can be improved further with structured interfaces; no published evidence supports this at the Claude 3.5 Sonnet capability level.

[Reviewer Perspective] Strongest author-side rebuttal. From the paper (SWE-agent §6): the ACI was tuned against GPT-4 Turbo. The paper does not claim the gains transfer unchanged to stronger models; it claims they transfer the concept of interface design as a load-bearing axis. The Anthropic post’s minimalism is consistent with this: minimalism is itself an interface-design choice, and the Anthropic team spent substantial effort on tool descriptions (the prompt-side of the interface), validating the broader thesis even while changing the specific surface.

What remains unsolved.

Cost-normalised pass rate. No artefact reports a Pareto curve.
Generalisation beyond Python repositories.
Multi-language sandboxing in OpenHands’ style with comparable safety.
The boundary between “scaffolding helps” and “scaffolding hurts” as a function of model capability.

Three future research directions.

[Analysis] Adaptive interface design. A meta-controller that selects between a SWE-agent-style ACI, an OpenHands-style CodeAct loop, and a minimal two-tool harness per task, based on a quick capability probe. Grounded in the empirical observation that no single interface dominates across the model-capability spectrum.
[Analysis] Verifier-augmented training. Use OpenHands as the environment and SWE-Gym-style verifiers to train a software-engineering specialist model that closes the gap to frontier models at a fraction of the cost. Grounded in OpenHands’ explicit “platform for the community” framing.
[Reviewer Perspective] Cross-language SWE-bench. A SWE-bench equivalent over JavaScript, Rust, and Go repositories would expose how much of the cluster’s design is Python-specific (linting, tree-sitter, pytest) versus model-capability-specific.
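The adaptive-interface direction above can be made concrete with a small selector. This is a minimal sketch under stated assumptions: the `ProbeResult` type, the harness labels, and both thresholds are hypothetical placeholders, not values reported by any of the three artefacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of a cheap capability probe (assumed: a few easy tasks run with a raw shell)."""
    solved: int
    attempted: int

    @property
    def rate(self) -> float:
        return self.solved / self.attempted if self.attempted else 0.0


def select_harness(probe: ProbeResult, needs_browser: bool,
                   strong_threshold: float = 0.8) -> str:
    """Pick an interface per task. The threshold is an illustrative placeholder."""
    if needs_browser:
        # Only the platform-style design ships browser tooling in this cluster.
        return "codeact_platform"
    if probe.rate >= strong_threshold:
        # A model that already copes with a raw shell gets the minimal surface.
        return "two_tool_minimal"
    # Weaker models get the more structured, guard-railed interface.
    return "structured_aci"


print(select_harness(ProbeResult(solved=9, attempted=10), needs_browser=False))
```

In practice the threshold would have to be calibrated empirically, which is exactly the measurement the cluster does not yet provide.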

12. Critical analysis

Strengths

SWE-agent. Clean ablations directly tie the headline 12.5% number to identifiable design choices. The paper is the canonical reference for the Agent-Computer Interface term and frames a research category cleanly. Open-source code at swe-agent.com⁴.
OpenHands. Open-source platform with substantial community contribution (more than 2,100 contributions, 188 contributors at submission)². Covers more than SWE-bench (WebArena, GAIA, MINT) so the platform is genuinely general.
Anthropic post. Crisp empirical demonstration of the minimalism thesis. Approximately reproducible by anyone with API access (see the reproducibility check below for the caveats). Honest about limitations (token costs, hidden-test blind spots).

Weaknesses explicitly stated by the authors

SWE-agent (§6 / limitations): the ACI is tuned against GPT-4 Turbo; transferability to other model families is not guaranteed.
OpenHands: the paper acknowledges open-source platform churn and the difficulty of stabilising benchmark numbers as the underlying agents evolve.
Anthropic post: “Several limitations” are listed explicitly: high token costs (>100k tokens common), grading complexity from environment setup, model assumptions about success without seeing hidden tests, absence of multimodal file viewing³.

Weaknesses not stated or understated

[Reviewer Perspective] SWE-agent: the headline 12.5% number is single-seed. Variance across seeds is not characterised; subsequent third-party reproductions on the SWE-bench leaderboard¹¹ showed measurable run-to-run variance.
[Reviewer Perspective] OpenHands: the platform’s pace of change makes paper-time benchmark numbers stale within months. The 26% Lite figure was rapidly superseded; the artefact’s lasting contribution is the platform, not the number.
[Reviewer Perspective] Anthropic post: the 49% on SWE-bench Verified is single-model. The post’s minimalism thesis would be stronger with a sweep across smaller Claude variants showing the inflection point at which minimalism stops being optimal. AgentLess¹⁴ presents an even stronger counter-argument that the cluster does not engage: the agent loop itself may be the wrong abstraction.

Reproducibility check

SWE-agent. Code released at swe-agent.com⁴. Trajectories released. Hyperparameters fully specified. Compute reported per task. Models are external API (not released by the paper). Overall: fully reproducible given external API access.
OpenHands. Code released on GitHub⁸ under MIT. Sandbox container images released. Evaluation scripts released. Models external API. Overall: fully reproducible.
Anthropic post. Tool definitions described in prose but the exact prompt text and harness code are not fully released. Models (Claude 3.5 Sonnet) accessible via Anthropic API. Overall: partially reproducible; close approximation is possible but the exact harness has been reimplemented by community projects rather than published as code.

Methodology

Sample size. SWE-agent: 2,294 instances (full) and 300 (Lite). OpenHands: 300 (Lite) plus benchmarks across 15 tasks total². Anthropic: 500 (Verified).
Evaluation set. SWE-bench full, Lite, and Verified respectively; held-out hidden tests in all three cases. Contamination is a documented concern for full / Lite; Verified addresses solvability validation but not contamination directly.
Baselines. SWE-agent: RAG, bash-only, ReAct. OpenHands: SWE-agent, Aider, CodeAct without platform. Anthropic: prior best published harness (45%).
Hardware / compute. SWE-agent: cost reported per task in dollars and tokens. OpenHands: not exhaustively reported. Anthropic: > 100,000 tokens common per successful run; hardware not reported beyond “API calls”.
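The sample sizes above bound how precisely a single pass rate can be read. The sketch below computes a Wilson score interval for a pass rate, treating instances as independent trials. It captures sampling uncertainty over instances only, not run-to-run variance, and the count of 245 resolved instances is an assumption derived from 49% of 500 rather than a figure quoted from the post.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96 by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed count: 49% of the 500 Verified instances is 245 resolved.
low, high = wilson_interval(245, 500)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.446 to 0.534
```

On 500 instances the interval is about four points either side, so headline differences of a few points between harnesses are within sampling noise unless runs are repeated.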

Generalisability

To other domains. None of the three is tested on a non-Python software-engineering codebase. OpenHands ships browser tooling that extends the surface to web-agent tasks; SWE-agent and the Anthropic harness are file-system-focused.
To larger scales. SWE-bench is approaching saturation; future generalisability questions are about SWE-bench Pro and live benchmarks (see Section 11).
To different backbones. SWE-agent evaluates with GPT-4 Turbo and Claude 3 Opus (the latter underperforms by roughly 2 absolute points on full and 5 on Lite); OpenHands evaluates across multiple frontier models; the Anthropic post is single-model.

Assumption audit

The SWE-bench-as-proxy assumption is the load-bearing structural assumption (Section 3). It is realistic for the fix-a-bug slice of software engineering and fragile for the design-a-system slice. Failure mode: harnesses optimised for SWE-bench may not transfer to greenfield code generation.

What would make the cluster stronger

[Analysis] A jointly-authored Pareto study, holding the model fixed and varying the harness across the three designs at matched token budgets, would resolve the central debate. The closest existing artefact to this is the SWE-bench leaderboard¹¹, where independent third parties run each harness on each model; the leaderboard is uneven in coverage and the headline numbers are not cost-normalised.
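To illustrate what such a Pareto study would compute, the sketch below extracts the non-dominated (cost, pass-rate) points from a set of runs. The `Run` type is hypothetical and the numbers are made-up placeholders, not measurements from any of the three artefacts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    harness: str
    tokens_per_task: float  # mean tokens spent per task (cost axis)
    pass_rate: float        # fraction of tasks resolved (quality axis)


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep runs that no other run beats on both cost and pass rate."""
    frontier = []
    for r in runs:
        dominated = any(
            o.tokens_per_task <= r.tokens_per_task
            and o.pass_rate >= r.pass_rate
            and (o.tokens_per_task < r.tokens_per_task or o.pass_rate > r.pass_rate)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.tokens_per_task)


# Placeholder data for three anonymous harnesses on one fixed model.
runs = [Run("A", 50_000, 0.30), Run("B", 120_000, 0.45), Run("C", 150_000, 0.40)]
print([r.harness for r in pareto_frontier(runs)])  # ['A', 'B']
```

Harness C is dropped because B is both cheaper and more accurate; A and B both survive because neither beats the other on both axes.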

13. What is reusable for a new study

REUSABLE COMPONENT [1]: Windowed file viewer with line numbers

What it is: An observation-formatting primitive that returns at most $W$ lines of a file centred on a focus line, with line numbers attached.
Why worth reusing: Highest-payoff component in SWE-agent’s ablation.
Preconditions: The agent needs a way to specify focus lines (a goto command or equivalent).
What would change in a different setting: Window size scales with the model’s effective context window; languages with verbose syntax (Java, C++) may benefit from larger windows.
Risks: Over-aggressive windowing hides important context; under-aggressive windowing blows up the prompt.
Interaction effects: Pairs naturally with a structured search command; without search, the agent has no way to pick a focus line.
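A windowed viewer of this kind can be sketched in a few lines. This is an illustrative implementation, not SWE-agent's code: the function name, the header format, and the default width of 100 are assumptions.

```python
def window(lines: list[str], focus: int, width: int = 100) -> str:
    """Return at most `width` lines centred on the 1-indexed `focus` line, numbered."""
    if width <= 0:
        raise ValueError("width must be positive")
    total = len(lines)
    if total == 0:
        return "(empty file)"
    focus = min(max(focus, 1), total)        # clamp the focus into the file
    start = max(1, focus - width // 2)
    end = min(total, start + width - 1)
    start = max(1, end - width + 1)          # re-anchor when the window hits end of file
    header = f"[lines {start}-{end} of {total}]"
    body = [f"{n}:{lines[n - 1]}" for n in range(start, end + 1)]
    return "\n".join([header, *body])


source = [f"line{i}" for i in range(1, 11)]
print(window(source, focus=5, width=3))
```

The header tells the agent how much of the file is hidden, which is what lets it decide whether to scroll or search instead of guessing.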

REUSABLE COMPONENT [2]: Linter-on-edit guard

What it is: Reject any edit that produces a syntactically invalid file; surface the lint error to the agent instead.
Why worth reusing: Removes a high-frequency failure mode.
Preconditions: A language-aware linter that the harness can invoke synchronously.
What would change: The specific linter depends on the language.
Risks: Aggressive lint rules may reject edits the agent wanted to make in stages.
Interaction effects: Reduces test-run cost by catching errors before the test step.
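The guard can be sketched as an edit function that refuses to commit a result that no longer parses. This sketch swaps in Python's built-in `ast.parse` as the syntax check in place of a full linter, so it catches syntax errors only; the function name and the error-message format are assumptions.

```python
import ast
from typing import Optional


def guarded_edit(source: str, start: int, end: int,
                 replacement: str) -> tuple[str, Optional[str]]:
    """Replace 1-indexed inclusive lines start..end; reject edits that break parsing.

    Returns (new_source, None) on success or (original_source, error) on rejection.
    """
    lines = source.splitlines()
    if not (1 <= start <= end <= len(lines)):
        return source, f"invalid line range {start}-{end} for a {len(lines)}-line file"
    candidate = "\n".join(lines[:start - 1] + replacement.splitlines() + lines[end:]) + "\n"
    try:
        ast.parse(candidate)  # syntax check only; a real linter would catch more
    except SyntaxError as exc:
        # Surface the error to the agent and leave the file untouched.
        return source, f"edit rejected: {exc.msg} (line {exc.lineno})"
    return candidate, None


original = "def f():\n    return 1\n"
print(guarded_edit(original, 2, 2, "    return (1")[1])
```

Returning the original source alongside the error keeps the file in a known-good state, so the agent can retry the edit without first undoing a broken one.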

REUSABLE COMPONENT [3]: Docker-sandboxed runtime

What it is: A container hosting bash, optionally Python, optionally a browser, against which the agent’s actions are executed.
Why worth reusing: Mandatory for any non-trivial deployment.
Preconditions: Docker (or equivalent) on the host; image-build infrastructure.
What would change: Image contents vary by task domain.
Risks: Container escape; host-volume leakage. Mitigation by following OpenHands’ container conventions.
Interaction effects: Decouples agent code from host environment, enabling reproducible benchmarks.
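As a hedged illustration of the host-volume and escape risks named above, the sketch below builds a restrictive `docker run` command line. It is not OpenHands' actual runtime configuration: the function name, the resource limits, and the `/workspace` mount point are assumptions, and only standard `docker run` flags are used.

```python
def sandbox_argv(image: str, workspace: str, command: list[str],
                 memory: str = "2g", cpus: str = "2",
                 network: bool = False) -> list[str]:
    """Build a `docker run` argv that mounts only the task workspace."""
    argv = [
        "docker", "run", "--rm",
        "--memory", memory,                     # cap RAM
        "--cpus", cpus,                         # cap CPU
        "--pids-limit", "256",                  # limit process count (fork bombs)
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--volume", f"{workspace}:/workspace",  # expose only the task directory
        "--workdir", "/workspace",
    ]
    if not network:
        argv += ["--network", "none"]           # no network unless the task needs it
    return argv + [image, *command]


print(" ".join(sandbox_argv("python:3.11-slim", "/tmp/task", ["pytest", "-q"])))
```

The resulting list can be passed to `subprocess.run` with a timeout. Building the command as a list rather than a shell string avoids quoting problems when the agent's command contains spaces or metacharacters.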

REUSABLE COMPONENT [4]: Event stream as agent memory

What it is: A typed, append-only log of action-observation pairs that is the canonical input to the next agent step.
Why worth reusing: Cleaner than ad-hoc prompt concatenation; supports multi-agent delegation.
Preconditions: A typed event abstraction; serialisation.
What would change: Event types depend on action types.
Risks: Stream growth without pruning fills the context window.
Interaction effects: Pairs naturally with observation compaction.
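The pattern can be sketched as a small typed log with a budgeted renderer. This is an illustration of the idea, not the OpenHands API: the `Event` and `EventStream` names, the two event kinds, and the character budget (a crude stand-in for a token budget) are all assumptions.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Event:
    kind: Literal["action", "observation"]
    source: str   # who produced the event, e.g. "agent" or "runtime"
    content: str


class EventStream:
    """Append-only log; the rendered tail is the input to the next agent step."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def __len__(self) -> int:
        return len(self._events)

    def render(self, max_chars: int) -> str:
        """Render the newest events that fit the budget, oldest first.

        The newest event is always kept, even if it alone exceeds the budget.
        """
        kept: list[str] = []
        used = 0
        for event in reversed(self._events):
            line = f"[{event.kind}:{event.source}] {event.content}"
            if kept and used + len(line) > max_chars:
                break
            kept.append(line)
            used += len(line)
        return "\n".join(reversed(kept))


stream = EventStream()
stream.append(Event("action", "agent", "ls"))
stream.append(Event("observation", "runtime", "README.md"))
print(stream.render(max_chars=1000))
```

Dropping the oldest events first is the simplest pruning policy; it addresses the stream-growth risk above at the cost of forgetting early context, which is where observation compaction would come in.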

REUSABLE COMPONENT [5]: Two-tool minimal harness (bash + edit)

What it is: A two-tool surface plus a permissive system prompt.
Why worth reusing: Best-known-baseline at high model capability.
Preconditions: A capable model (Claude 3.5 Sonnet level or higher).
What would change: Less capable models need richer scaffolding.
Risks: The model assumes success without visibility into hidden tests; mitigation by surfacing more environment signal.
Interaction effects: Pairs naturally with windowed file viewing inside the edit tool’s view-mode.
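A two-tool surface is small enough to sketch in full. The code below is an illustrative reconstruction, not Anthropic's harness: the function names, return strings, and dispatch table are assumptions, and the edit tool implements only a single exact-match string replacement. `shell=True` is used for brevity and should only ever run inside a sandbox.

```python
import subprocess
from pathlib import Path


def bash_tool(command: str, cwd: str, timeout: int = 60) -> str:
    """Run a shell command and return stdout, stderr, and the exit code as text."""
    proc = subprocess.run(command, shell=True, cwd=cwd,
                          capture_output=True, text=True, timeout=timeout)
    return f"{proc.stdout}{proc.stderr}[exit code {proc.returncode}]"


def edit_tool(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old`; refuse missing or ambiguous matches."""
    file = Path(path)
    text = file.read_text()
    count = text.count(old)
    if count != 1:
        return f"error: expected exactly 1 match for the old string, found {count}"
    file.write_text(text.replace(old, new, 1))
    return "edit applied"


TOOLS = {"bash": bash_tool, "edit": edit_tool}


def dispatch(name: str, **kwargs) -> str:
    """Route a model-emitted tool call to its implementation."""
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**kwargs)
```

Requiring a unique match in the edit tool turns a silent wrong-location edit into an explicit error the model can react to, which matters more when there is no other scaffolding to catch the mistake.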

Dependency map. Sandbox sits at the bottom (everyone depends on it). Above the sandbox: event stream (OpenHands) or scrollback (SWE-agent, Anthropic). Above that: the action interface (ACI / CodeAct / two-tool). Above that: the agent policy (system prompt + LLM call). Linter and windowed viewer are cross-cutting helpers that sit alongside the action interface.

Recommendation. [Analysis] Highest-value components for a new study: (a) the windowed file viewer (REUSABLE COMPONENT [1]) is near-mandatory at any model capability; (b) the Docker-sandboxed runtime (REUSABLE COMPONENT [3]) is non-negotiable for safety; (c) the event-stream pattern (REUSABLE COMPONENT [4]) is the most portable abstraction across agent designs.

[Analysis] What type of new study benefits most. A study proposing a new software-engineering benchmark, a new RL-style training regime over agent trajectories, or a new model fine-tuned specifically for agent action emission. All three would benefit from reusing the harness primitives above rather than re-implementing them.

14. Known limitations and open problems

Limitations stated by the authors.

SWE-agent: tuned to GPT-4 Turbo; absolute numbers expected to drift with model generation¹.
OpenHands: open-source pace of change makes paper-time numbers stale; the platform’s lasting contribution is the integration, not the headline scores².
Anthropic post: high token costs, environment setup complexity, hidden-test blind spots, no multimodal file viewing³.

Limitations not stated.

[Reviewer Perspective] Cross-model transferability of the SWE-agent ACI design is asserted rather than measured.
[Reviewer Perspective] None of the three artefacts reports cost-normalised pass rate. Score comparisons across the cluster are systematically incomplete.
[Reviewer Perspective] All three artefacts evaluate on Python-only SWE-bench. The cluster’s design choices that are Python-specific (pytest, flake8, tree-sitter Python) versus model-general (windowed views, sandboxed runtime, event streams) are not factorised.
[Reviewer Perspective] AgentLess¹⁴ presents a serious counter-argument (no agent loop required) that the cluster does not engage. The cluster’s framing assumes the loop is load-bearing; AgentLess’s evidence suggests it may not be for many tasks.

Technical root cause of the main limitations.

High token costs root-cause: trajectories accumulate context, and observation compaction (windowed views) only partly offsets this.
Hidden-test blind spots root-cause: the agent has no oracle for the hidden test; mitigation requires either richer environment signal or larger model self-criticism budgets.
Python-only root-cause: SWE-bench’s construction; the cluster inherits the limitation.
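The token-cost root cause can be made concrete with a short calculation. The sketch assumes the full history is resent on every step and ignores prompt caching and history truncation; the step count and per-step sizes are hypothetical.

```python
def cumulative_input_tokens(system_tokens: int, per_step_tokens: list[int]) -> int:
    """Total input tokens read when each step resends the whole history so far."""
    total = 0
    history = system_tokens
    for step in per_step_tokens:
        total += history   # the model reads everything accumulated so far
        history += step    # this step's action and observation join the history
    return total


# Hypothetical trajectory: 2,000-token prompt, 30 steps of 1,500 tokens each.
print(cumulative_input_tokens(2_000, [1_500] * 30))  # 712500
# Halving every observation still leaves a six-figure bill.
print(cumulative_input_tokens(2_000, [750] * 30))    # 386250
```

Because the history term grows with every step, total input tokens scale roughly quadratically in trajectory length, which is why compaction alone only partly offsets the cost.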

Open problems left behind.

A SWE-bench-equivalent across multiple programming languages.
A cost-normalised harness comparison.
An adaptive harness that selects interface complexity per task.

What a follow-up paper would need to solve. [Analysis] The most critical open problem the cluster leaves behind is the absence of a cost-normalised comparison. A follow-up that runs SWE-agent, OpenHands, and the Anthropic-style two-tool harness on the same set of models at matched token budgets, on both SWE-bench Verified and a non-Python equivalent, would resolve the largest empirical ambiguity in the area.

How this article reads at three depths

For the curious high-school reader. A software-engineering agent is a program that lets an AI model fix real bugs in real code, by giving it commands like “open this file,” “edit these lines,” and “run the tests.” This article looks at three influential ways to build such an agent. The headline takeaway: as the underlying AI got smarter between 2024 and 2025, the right approach went from “give the AI lots of carefully designed commands” (SWE-agent) to “build it a full open-source platform” (OpenHands) to “give it just two general tools and let it figure things out” (Anthropic). All three are measured on versions of the same test, SWE-bench: the full set contains 2,294 real GitHub issues the AI must fix, and the smaller Lite (300) and Verified (500) subsets are drawn from it.

For the working developer or ML engineer. Read SWE-agent for the ACI ablation: the windowed file viewer plus the linter-on-edit explain most of the 3.8% → 12.5% jump on the full SWE-bench test set at GPT-4-Turbo capability. Read OpenHands for the platform: Docker sandbox with bash + IPython + Playwright, event stream, AgentSkills library, MIT licensed, CodeAct paradigm; it is the right starting point for any new agent project today. Read the Anthropic post for the minimalism counterweight: at Claude-3.5-Sonnet capability, two tools (bash + edit) plus a permissive prompt reach 49% on SWE-bench Verified, with the caveat that successful runs commonly exceed 100k tokens. The practical question is not which is right but which fits the model you have: with a frontier model, lean toward minimalism; with a smaller model, the SWE-agent ACI still earns its keep.

For the ML researcher. SWE-agent formalises Agent-Computer Interface as an ablate-able design axis and demonstrates a roughly 9-absolute-point pass-rate gap from that axis alone, with each individual ACI component contributing measurably. OpenHands operationalises the same insight as a platform, factoring the design space into runtime / event stream / skill library; the contribution is integration, not novel mechanism. The Anthropic post argues empirically that minimalism beats elaboration at sufficient model capability; the supporting evidence is a single-model 49% on SWE-bench Verified. The cluster’s largest unaddressed counter-argument is AgentLess (Xia et al., 2024), which claims the iterative loop itself is unnecessary. The strongest follow-up would be a Pareto study of the three designs plus AgentLess across matched token budgets on both Verified and a non-Python SWE-bench equivalent; without that, the cluster’s debate over interface complexity remains under-quantified.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, Yang et al., NeurIPS 2024, arXiv:2405.15793. (accessed 2026-05-19)
2. OpenHands: An Open Platform for AI Software Developers as Generalist Agents, Wang et al., arXiv:2407.16741. (accessed 2026-05-19)
3. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet, Erik Schluntz, Anthropic engineering blog, 6 January 2025. (accessed 2026-05-19)
4. SWE-agent official site and documentation (project home). (accessed 2026-05-19)
5. SWE-agent NeurIPS 2024 OpenReview submission. (accessed 2026-05-19)
6. SWE-agent ACI background documentation on GitHub. (accessed 2026-05-19)
7. SWE-agent paper PDF. (accessed 2026-05-19)
8. OpenHands GitHub repository (formerly OpenDevin). (accessed 2026-05-19)
9. OpenHands paper PDF. (accessed 2026-05-19)
10. OpenDevin CodeAct 1.0 release blog post, Xingyao Wang, 2024. (accessed 2026-05-19)
11. SWE-bench benchmark official site and leaderboards. (accessed 2026-05-19)
12. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez et al., ICLR 2024, arXiv:2310.06770. (accessed 2026-05-19)
13. Aider: AI pair programming in your terminal, official documentation. (accessed 2026-05-19)
14. AgentLess: Demystifying LLM-based Software Engineering Agents, Xia et al., arXiv:2407.01489. (accessed 2026-05-19)