Neural Tech Daily

Build a Coding Agent from Scratch in Python: End-to-End Tutorial (May 2026)

Hands-on walk-through of an agent loop with Claude tool-use, file/bash tools, a sandbox, and a failing-test exercise comparing against Cursor and Claude Code.

Updated ~17 min read

Image: Anthropic tool-use documentation, used for editorial coverage of the API this tutorial builds against.

What you’ll build

By the end of this tutorial you will have a working coding agent: roughly 200 lines of Python that take a natural-language task, plan with Claude as the reasoner, execute file-system and shell tools, and loop until the task is done. You will use it to fix a failing test in a sample repo, add a basic sandbox plus a command allowlist, and compare its output against Claude Code and Cursor on the same exercise.

The architecture follows the pattern Anthropic describes in “Building effective agents”: an LLM in a loop that decides which tool to call next, with the agent code carrying the state and enforcing the safety rails [1]. Per Anthropic’s framing, this is the agent end of the spectrum, distinct from a fixed workflow where the orchestration path is hard-coded [2]. The loop itself is small. The interesting work is in the tools and the safety boundary around them.

This tutorial uses Claude Sonnet 4.5 via the Anthropic Messages API. The same code pattern works against any model that supports the Anthropic tool-use schema; the per-token cost differs by model, and Anthropic publishes the current tiers on the pricing page [3].

What you’ll need

  • Python 3.10 or later.
  • An Anthropic API key (ANTHROPIC_API_KEY in your shell environment).
  • About $2 of API credit for a relaxed first run, including the benchmark section at the end.
  • Comfort reading and writing Python; no prior agent-framework experience required.
  • A spare directory you don’t mind the agent reading and writing in (the sandbox section explains why).

Budget around 90 minutes start to finish: 15 to install and wire the API client, 30 to build the loop and the four tools, 15 to run the failing-test exercise, 15 to add the sandbox and allowlist, and 15 to benchmark against Claude Code and Cursor.

Step 1: Set up the project

Create a fresh project folder and install the Anthropic Python SDK [4].

mkdir mini-coding-agent && cd mini-coding-agent
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install "anthropic>=0.40" "rich>=13" pytest

The rich package is optional but makes the agent’s trace readable in a terminal; pytest is needed for the failing-test exercise in Step 6. Verify the key is set:

echo $ANTHROPIC_API_KEY | head -c 8

You should see the first eight characters of the key, not an empty line. If it is empty, export the key with export ANTHROPIC_API_KEY=sk-ant-... and retry.

Step 2: The agent loop, conceptually

Per Anthropic’s tool-use documentation, the request-response shape is fixed [5]:

  1. You send a messages.create call with a tools array (each tool is a JSON-schema-like definition: name, description, input_schema).
  2. Claude responds with content blocks. If stop_reason is "tool_use", one or more blocks are tool_use blocks carrying an id, the tool name, and the input JSON.
  3. Your code executes each tool and replies with a user message containing matching tool_result blocks, each referencing the original tool_use_id.
  4. Loop steps 2–3 until stop_reason becomes "end_turn" (the model is done) or a guardrail fires.

That’s the entire loop. The rest of the tutorial is filling in the tools, the safety rails, and the termination logic.
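To make the contract concrete, the sketch below writes out one round trip by hand using plain dictionaries, with no API call involved. The id value toolu_example_01 and the helper name ids_are_paired are invented for illustration; only the field names come from the steps listed above.

```python
# Hand-written transcript for one loop iteration; no network call is made.
# The id "toolu_example_01" is an invented placeholder, not a real API value.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "I'll list the workspace first."},
        {
            "type": "tool_use",
            "id": "toolu_example_01",
            "name": "list_dir",
            "input": {"path": "."},
        },
    ],
}

# The reply is a user message whose tool_result echoes the tool_use id.
user_turn = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_example_01",
            "content": "calculator.py\ntest_calculator.py",
        },
    ],
}


def ids_are_paired(assistant: dict, user: dict) -> bool:
    """Return True when the tool_result ids match the tool_use ids exactly."""
    requested = {b["id"] for b in assistant["content"] if b["type"] == "tool_use"}
    answered = {b["tool_use_id"] for b in user["content"] if b["type"] == "tool_result"}
    return requested == answered


print(ids_are_paired(assistant_turn, user_turn))
```

A check like this is cheap to run before every request while you are debugging the loop, since a missing or mistyped id is easy to introduce when building messages by hand.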

Image: Anthropic tool-use overview, used for editorial coverage of the API contract this tutorial implements.

Step 3: Define the four tools

Create tools.py with the tool definitions Claude sees, separately from the Python functions that implement them. The split matters: the JSON schemas are what the model plans against; the Python functions are what your code runs after the model picks one.

# tools.py
TOOL_SCHEMAS = [
    {
        "name": "read_file",
        "description": (
            "Read the contents of a UTF-8 text file inside the workspace. "
            "Returns the file contents as a string, or an error string starting with 'ERROR:'."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Relative path inside the workspace, e.g. 'src/main.py'.",
                },
            },
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": (
            "Overwrite a UTF-8 text file inside the workspace with the given content. "
            "Creates parent directories if missing. Returns 'OK' on success or 'ERROR: ...'."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path inside the workspace."},
                "content": {"type": "string", "description": "Full file contents to write."},
            },
            "required": ["path", "content"],
        },
    },
    {
        "name": "list_dir",
        "description": (
            "List entries inside a workspace directory, one per line, with a '/' suffix for sub-directories. "
            "Returns the listing string or 'ERROR: ...'."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Relative path inside the workspace; '.' for the root.",
                },
            },
            "required": ["path"],
        },
    },
    {
        "name": "run_bash",
        "description": (
            "Run a shell command from the workspace root. "
            "Returns combined stdout+stderr, truncated to 8000 characters, prefixed with the exit code."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "Single shell command; pipes and redirects allowed.",
                },
            },
            "required": ["command"],
        },
    },
]

The description strings are the most load-bearing thing in this file. Per Anthropic’s tool-use guidance, the description is the primary signal the model uses to pick the right tool and to format its input correctly [6]. Write them as if you were briefing a careful junior engineer who has never seen the codebase: explain inputs, outputs, error formats, and any non-obvious constraints.
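One way to keep the schemas honest as the tool list grows is a small lint pass that runs before the agent starts. The sketch below is an assumption-laden example, not part of the Anthropic SDK: lint_tool_schema is a hypothetical helper, and the 40-character minimum for a description is an arbitrary threshold chosen for illustration.

```python
def lint_tool_schema(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("name", "description", "input_schema"):
        if key not in schema:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    # Arbitrary threshold: very short descriptions rarely explain outputs or errors.
    if len(schema["description"]) < 40:
        problems.append("description is very short; say what the tool returns")

    input_schema = schema["input_schema"]
    if input_schema.get("type") != "object":
        problems.append("input_schema.type should be 'object'")

    properties = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in properties:
            problems.append(f"required field not declared: {name}")
    for name, spec in properties.items():
        if not spec.get("description"):
            problems.append(f"property lacks a description: {name}")
    return problems


example = {
    "name": "read_file",
    "description": "Read a file.",
    "input_schema": {"type": "object", "properties": {}, "required": ["path"]},
}
for problem in lint_tool_schema(example):
    print(problem)
```

Running the same function over every entry in TOOL_SCHEMAS at import time would surface a forgotten property description before the model ever sees the tool.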

Step 4: Implement the tools (no safety yet)

Create executor.py with the Python implementations. This first pass runs with no sandbox; Step 7 adds the safety layer.

# executor.py
import subprocess
from pathlib import Path

WORKSPACE = Path("./workspace").resolve()


def read_file(path: str) -> str:
    try:
        target = (WORKSPACE / path).resolve()
        return target.read_text(encoding="utf-8")
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def write_file(path: str, content: str) -> str:
    try:
        target = (WORKSPACE / path).resolve()
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return "OK"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def list_dir(path: str) -> str:
    try:
        target = (WORKSPACE / path).resolve()
        entries = []
        for entry in sorted(target.iterdir()):
            suffix = "/" if entry.is_dir() else ""
            entries.append(f"{entry.name}{suffix}")
        return "\n".join(entries) if entries else "(empty directory)"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def run_bash(command: str) -> str:
    try:
        proc = subprocess.run(
            command,
            shell=True,
            cwd=str(WORKSPACE),
            capture_output=True,
            text=True,
            timeout=60,
        )
        output = (proc.stdout + proc.stderr)[:8000]
        return f"exit_code={proc.returncode}\n{output}"
    except subprocess.TimeoutExpired:
        return "ERROR: command exceeded 60-second timeout"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


TOOL_FUNCTIONS = {
    "read_file": read_file,
    "write_file": write_file,
    "list_dir": list_dir,
    "run_bash": run_bash,
}

Two notes. The subprocess.run(..., shell=True) call is a known security tradeoff documented in the Python docs [7]: it lets the model use pipes and globs, but it also means the model’s tool input is passed straight to the shell. Step 7 adds an allowlist; for now, run only inside a throwaway directory. The WORKSPACE.resolve() call is the seed of the sandbox; Step 7 builds the containment check on top of it.
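The difference shell=True makes is easiest to see with a harmless command. The sketch below assumes a POSIX system with echo available; the command string is an invented example. It runs the same text once through the shell and once as an argument list.

```python
import shlex
import subprocess

command = "echo safe; echo injected"

# shell=True: the shell parses ';' as a separator and runs two commands.
via_shell = subprocess.run(command, shell=True, capture_output=True, text=True)

# List form: no shell is involved, so ';' is just part of an argument to echo.
via_list = subprocess.run(shlex.split(command), capture_output=True, text=True)

print(repr(via_shell.stdout))
print(repr(via_list.stdout))
```

The first call prints two separate lines because the second echo really ran; the second call prints the text after the first echo verbatim. That is the tradeoff in miniature: the list form cannot be chained, and it also cannot express the pipes and redirects that the run_bash description promises.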

Step 5: Wire the agent loop

Create agent.py. This is the orchestrator that translates the conceptual loop from Step 2 into code.

# agent.py
import json
import os
from anthropic import Anthropic
from rich.console import Console

from tools import TOOL_SCHEMAS
from executor import TOOL_FUNCTIONS

console = Console()
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

MODEL = "claude-sonnet-4-5"
MAX_ITERATIONS = 25
SYSTEM_PROMPT = (
    "You are a careful coding agent. You have four tools: read_file, write_file, "
    "list_dir, run_bash. Plan briefly before each tool call. After each tool result, "
    "decide whether you have enough information to finish or whether you need another step. "
    "When the task is complete, summarise what you changed and stop."
)


def run_tool(name: str, tool_input: dict) -> str:
    func = TOOL_FUNCTIONS.get(name)
    if func is None:
        return f"ERROR: unknown tool {name}"
    return func(**tool_input)


def run_agent(task: str) -> None:
    messages = [{"role": "user", "content": task}]

    for iteration in range(MAX_ITERATIONS):
        response = client.messages.create(
            model=MODEL,
            max_tokens=2048,
            system=SYSTEM_PROMPT,
            tools=TOOL_SCHEMAS,
            messages=messages,
        )

        # Append the assistant turn verbatim so tool_use_ids stay paired.
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            for block in response.content:
                if getattr(block, "type", None) == "text":
                    console.print(f"[green]agent:[/green] {block.text}")
            console.print(f"\n[dim]finished after {iteration + 1} iterations[/dim]")
            return

        if response.stop_reason != "tool_use":
            console.print(f"[yellow]unexpected stop_reason: {response.stop_reason}[/yellow]")
            return

        tool_results = []
        for block in response.content:
            if getattr(block, "type", None) == "text" and block.text.strip():
                console.print(f"[cyan]plan:[/cyan] {block.text}")
            if getattr(block, "type", None) == "tool_use":
                console.print(f"[magenta]call:[/magenta] {block.name}({json.dumps(block.input)[:200]})")
                output = run_tool(block.name, block.input)
                console.print(f"[dim]→ {output[:200]}{'...' if len(output) > 200 else ''}[/dim]")
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": output,
                    }
                )

        messages.append({"role": "user", "content": tool_results})

    console.print(f"[yellow]hit max iterations ({MAX_ITERATIONS}); aborting[/yellow]")


if __name__ == "__main__":
    import sys

    task = " ".join(sys.argv[1:]) or "List the files in '.' and tell me what kind of project this is."
    run_agent(task)

Per the Messages API reference, stop_reason is one of a documented set of values including "end_turn", "tool_use", "max_tokens", and a few others [8]. The two values you care about here are the first two; the fallback branch logs anything else and stops, which is safer than looping on an unexpected state.
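If you later want the loop to react differently to each value, one option is to pull the decision into a small function. The sketch below is a hypothetical refactor, not code from the SDK: next_action and its return labels are names chosen for illustration, and it only covers the three values named above.

```python
def next_action(stop_reason: str) -> str:
    """Map a stop_reason to what the loop should do next (labels are illustrative)."""
    if stop_reason == "end_turn":
        return "finish"
    if stop_reason == "tool_use":
        return "run_tools"
    if stop_reason == "max_tokens":
        # The reply was cut off, so treat any tool call in it as possibly incomplete.
        return "abort_truncated"
    # Anything else is outside what this loop knows how to handle.
    return "abort_unknown"


for reason in ("end_turn", "tool_use", "max_tokens", "something_else"):
    print(reason, "->", next_action(reason))
```

Separating "truncated" from "unknown" makes the log message more useful: the first usually means max_tokens needs raising, while the second means the loop has met a state it was not written for.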

The messages.append({"role": "assistant", "content": response.content}) line is doing the load-bearing work of preserving every tool_use block’s id so the matching tool_result blocks in the next user turn can reference them. Skipping this and rebuilding the assistant turn yourself is the most common bug in first-pass agent code.

Step 6: Run the failing-test exercise

Set up a small sample repo inside ./workspace:

mkdir -p workspace
cat > workspace/calculator.py <<'PY'
def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


def multiply(a, b):
    return a + b   # bug: should be a * b
PY

cat > workspace/test_calculator.py <<'PY'
from calculator import add, subtract, multiply


def test_add():
    assert add(2, 3) == 5


def test_subtract():
    assert subtract(5, 2) == 3


def test_multiply():
    assert multiply(4, 3) == 12
PY

Confirm the failure manually first:

cd workspace && python -m pytest -q
# expect 1 failed, 2 passed
cd ..

Now hand the task to the agent:

python agent.py "Run the pytest suite in this repo. If any test fails, find the bug, fix the source code, and re-run the suite until everything passes."

A working run looks roughly like this trace: the agent calls list_dir("."), then run_bash("python -m pytest -q"), sees the failing assertion, calls read_file("calculator.py"), identifies that multiply returns a + b, calls write_file("calculator.py", ...) with the corrected a * b implementation, re-runs pytest, sees 3 passed, prints a summary, and stops with end_turn.

Image: code-block rendering of a sample agent trace showing the plan / call / tool-result interleave for the failing-test exercise.

If the agent gets stuck, two failure modes recur. It may run cd inside run_bash and then assume the directory persists across calls. Each subprocess.run starts in cwd=WORKSPACE again, so a cd only affects the rest of that one command line: "cd src && pytest" works, while cd src followed by a separate pytest call does not. Sharpen the run_bash tool description to say so explicitly. It may also hallucinate a file that doesn’t exist; the ERROR: prefix on the tool output is what lets Claude recover, so don’t replace it with a silent empty string.
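The working-directory behaviour can be confirmed without the agent at all. The sketch below assumes a POSIX shell and uses a temporary directory in place of the real workspace; the run helper is a stripped-down stand-in for run_bash, not the tutorial’s implementation.

```python
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp).resolve()
    (workspace / "pkg").mkdir()

    def run(command: str) -> str:
        # Like run_bash, every call starts a new shell in the workspace root.
        proc = subprocess.run(
            command, shell=True, cwd=str(workspace), capture_output=True, text=True
        )
        return proc.stdout.strip()

    first = run("cd pkg && pwd -P")  # cd applies within this one command line
    second = run("pwd -P")           # a fresh shell, back at the workspace root

    changed_within_call = first == str(workspace / "pkg")
    persisted_across_calls = second == str(workspace / "pkg")

print(changed_within_call, persisted_across_calls)
```

The first flag comes out true and the second false, which is the sentence worth adding to the run_bash description: chain the cd and the command in one call.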

Step 7: Add the safety rails

The agent as written has two glaring holes: read_file and write_file accept any path, so an absolute path or a ../ sequence reaches files outside the workspace, and run_bash runs arbitrary shell. Add a sandbox check and a command allowlist.

# safety.py
import shlex
from pathlib import Path

WORKSPACE = Path("./workspace").resolve()

ALLOWED_COMMANDS = {
    "python", "python3", "pytest", "pip",
    "ls", "cat", "grep", "find", "head", "tail", "wc",
    "echo", "true", "false", "git",
}


class SandboxViolation(Exception):
    pass


def assert_inside_workspace(path: str) -> Path:
    target = (WORKSPACE / path).resolve()
    try:
        target.relative_to(WORKSPACE)
    except ValueError:
        raise SandboxViolation(f"path escapes workspace: {path}")
    return target


def assert_command_allowed(command: str) -> None:
    try:
        tokens = shlex.split(command)
    except ValueError as exc:
        raise SandboxViolation(f"unparseable command: {exc}")
    if not tokens:
        raise SandboxViolation("empty command")
    program = Path(tokens[0]).name  # strip any path prefix
    if program not in ALLOWED_COMMANDS:
        raise SandboxViolation(f"command not in allowlist: {program}")

Wire it into executor.py:

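# executor.py (patches)
from safety import assert_inside_workspace, assert_command_allowed, SandboxViolation


def read_file(path: str) -> str:
    try:
        target = assert_inside_workspace(path)
        return target.read_text(encoding="utf-8")
    except SandboxViolation as exc:
        return f"ERROR: SandboxViolation: {exc}"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


def write_file(path: str, content: str) -> str:
    try:
        target = assert_inside_workspace(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return "OK"
    except SandboxViolation as exc:
        return f"ERROR: SandboxViolation: {exc}"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"


# list_dir needs the same change: resolve its target with assert_inside_workspace(path).


def run_bash(command: str) -> str:
    try:
        # Reject the command before anything reaches the shell.
        assert_command_allowed(command)
        proc = subprocess.run(
            command,
            shell=True,
            cwd=str(WORKSPACE),
            capture_output=True,
            text=True,
            timeout=60,
        )
        output = (proc.stdout + proc.stderr)[:8000]
        return f"exit_code={proc.returncode}\n{output}"
    except SandboxViolation as exc:
        return f"ERROR: SandboxViolation: {exc}"
    except subprocess.TimeoutExpired:
        return "ERROR: command exceeded 60-second timeout"
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"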

The Path.resolve() then relative_to(WORKSPACE) check is the standard Python pattern for “is this path inside this directory”: resolve() collapses .. segments and symlinks, and relative_to raises ValueError if the target falls outside [9]. Path-traversal inputs such as ../../etc/passwd are caught by this check.
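A quick way to convince yourself the check behaves as described is to run it against a few representative inputs. The sketch below uses a temporary directory as a stand-in workspace; is_inside is a hypothetical boolean wrapper around the same two pathlib calls, and the candidate paths are illustrative.

```python
import tempfile
from pathlib import Path


def is_inside(workspace: Path, candidate: str) -> bool:
    """True when workspace / candidate resolves to a location under workspace."""
    target = (workspace / candidate).resolve()
    try:
        target.relative_to(workspace)
    except ValueError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp).resolve()
    results = {
        # A normal relative path; the file does not need to exist yet.
        "src/main.py": is_inside(workspace, "src/main.py"),
        # Traversal out of the workspace.
        "../../etc/passwd": is_inside(workspace, "../../etc/passwd"),
        # Joining an absolute path discards the workspace prefix entirely.
        "/etc/passwd": is_inside(workspace, "/etc/passwd"),
    }

print(results)
```

The absolute-path case is the one people tend to forget: WORKSPACE / "/etc/passwd" is simply /etc/passwd, so the join alone gives no protection and the relative_to check is what rejects it.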

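The allowlist is intentionally short, and it is weaker than it looks. assert_command_allowed inspects only the first token, while run_bash still hands the whole line to a shell, so a line such as "ls; rm -rf ." passes the check. Allowing python also means the model can run arbitrary code through python -c. Add to the list only after deciding whether each command can be abused inside a shell line; git, for instance, runs hooks and config-defined commands (such as an alias or pager set in .git/config) from the repository it operates on, and write_file can edit that config. For anything beyond a learning exercise, run the agent inside a container or VM with no network and a tight CPU / memory ceiling. The in-process allowlist is one layer, not the whole defence.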
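One way to tighten the check while keeping pipes is to look at every command in the line rather than only the first. The sketch below is a conservative illustration, not a complete shell parser: programs_in and is_allowed are hypothetical helpers, the allowlist is a reduced example, and anything the tokenizer cannot classify (command substitution, newlines, a quoted operator character, forms like 2>&1) is rejected rather than interpreted. It does nothing about what an allowed program such as python can do once it runs.

```python
import shlex
from pathlib import Path

ALLOWED_COMMANDS = {"python", "pytest", "ls", "cat", "grep", "head", "wc"}
PUNCTUATION = set("();<>|&")
CONTROL = set(";|&(")  # characters that start a new command


def programs_in(command: str) -> list[str]:
    """List every program name a shell line would try to run (conservative)."""
    if any(marker in command for marker in ("$(", "`", "\n")):
        raise ValueError("command substitution and newlines are not supported")
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    programs = []
    expect_program, skip_next = True, False
    for token in lexer:
        if token and set(token) <= PUNCTUATION:
            if set(token) & CONTROL:
                expect_program, skip_next = True, False
            else:
                skip_next = True  # a redirect; the next token is a file name
            continue
        if skip_next:
            skip_next = False
            continue
        if expect_program:
            programs.append(Path(token).name)
            expect_program = False
    return programs


def is_allowed(command: str) -> bool:
    """True only when every program in the line is on the allowlist."""
    try:
        programs = programs_in(command)
    except ValueError:
        return False
    return bool(programs) and all(p in ALLOWED_COMMANDS for p in programs)


print(is_allowed("cat notes.txt | wc -l"))
print(is_allowed("ls; rm -rf ."))
```

The design choice here is to fail closed: a legitimate command with an unusual redirect gets refused, which the model can recover from, whereas a missed chained command cannot be undone.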

Per Anthropic’s “Building effective agents” post, the more autonomy an agent has, the more important these guardrails become; the framing there is that the cost of an error in production scales with the blast radius of the tools the model can reach [10].

Step 8: Where Anthropic’s hosted code-execution tool fits

The four tools above run in your local process. Anthropic ships a hosted alternative: the code execution tool, a server-side sandbox where Claude can run Python directly without you implementing the executor at all [11]. Per the tool’s documentation page, you enable it by adding {"type": "code_execution_20250522", "name": "code_execution"} (or the current version string) to the tools array and it runs inside Anthropic’s sandboxed environment.

Two tradeoffs to be aware of, both surfaced on the code-execution docs page:

  • The hosted sandbox is Anthropic-managed; you don’t see the disk and you don’t pick the package list, but you also don’t have to harden anything. For exploration and notebook-style tasks the ergonomics are far better.
  • The local-tools pattern you just built is what you need when the agent must touch your repo, your dependencies, or your network. The hosted sandbox runs against Anthropic’s environment, not yours.

Most production coding agents end up mixing the two: hosted execution for self-contained reasoning steps, locally-implemented tools for repo-touching steps. The mental model from Steps 5–7 transfers either way; the tool schema and the loop are unchanged.

Step 9: Benchmark vs Claude Code and Cursor

Run the same exercise (the failing multiply test fix) through three tools and compare. This is an N=1 micro-benchmark, not a generalisable evaluation; it exists to make the cost / control / convenience tradeoffs concrete.

Composite of brand marks for Claude Code and Cursor used as section visual context. Icons by dashboardicons.com (free to use).

Setup: same workspace/calculator.py and workspace/test_calculator.py files as Step 6, same instruction (“fix the failing test”). Per Anthropic’s Claude Code documentation, Claude Code is a terminal-resident coding agent that ships its own tool set (file edits, shell, search) and runs against the Anthropic API using the user’s own credentials [12]. Per Cursor’s documentation, Cursor is an IDE built on a VS Code fork with an agent mode that drives edits, runs commands, and reads project state from inside the editor [13].

| Axis | Your mini-agent | Claude Code | Cursor |
| --- | --- | --- | --- |
| Setup time | ~30 minutes (this tutorial) | Single install per the Claude Code docs | Install the IDE per the Cursor docs |
| Where tools run | Your laptop, your sandbox | Your laptop, Anthropic-defined tool set | Your laptop, Cursor-defined tool set |
| Control of tool list | Total (you write the schemas) | Limited to what Claude Code exposes | Limited to what Cursor exposes |
| API key billing | Anthropic API, per token | Anthropic API (BYO key on relevant plans) | Cursor billing or BYO key per Cursor’s docs |
| Best for | Learning the loop; custom tool surfaces; embedded use | Day-to-day terminal coding | IDE-resident editing with autocompletion |
| Worst for | Polish and reliability out of the box | Embedding inside another product | CLI / headless contexts |

The failing-multiply exercise resolves in single-digit tool iterations on all three; correctness is identical when the bug is this simple. The interesting result is not “which tool fixed it” but the cost / convenience tradeoffs. The from-scratch agent is the smallest amount of code that lets you see and modify every tool call; Claude Code and Cursor are larger systems optimised for a specific surface (terminal, IDE).

Per Anthropic’s “Building effective agents” framing, the choice is rarely “which agent” and more often “where in your stack does the agent need to live” [14]. If the answer is “inside a custom workflow with tools we control end-to-end”, the from-scratch pattern is the right starting point. If the answer is “as a daily-driver coding assistant”, Claude Code or Cursor will save you the polish work.

Common pitfalls

The biggest first-time mistake is treating every model response as a single tool call. The Anthropic API can, and often does, return multiple tool_use blocks in one assistant turn. Your loop must iterate over response.content, collect every tool_use, run each one, and send back a user message with the matching tool_result blocks in the same list. Per the Anthropic implement-tool-use docs, the tool_use_id is what pairs them; out-of-order results are fine as long as the IDs match [15].
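The collection step can be tested in isolation with a fake tool runner. The sketch below works on plain dictionaries rather than SDK objects, which is a simplifying assumption; collect_tool_results, fake_runner, and the id strings are invented for illustration.

```python
from typing import Callable


def collect_tool_results(
    content: list[dict], run_tool: Callable[[str, dict], str]
) -> list[dict]:
    """Run every tool_use block in one assistant turn and build the reply blocks."""
    results = []
    for block in content:
        if block.get("type") != "tool_use":
            continue  # text blocks carry the plan; they need no reply
        results.append(
            {
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": run_tool(block["name"], block["input"]),
            }
        )
    return results


def fake_runner(name: str, tool_input: dict) -> str:
    # Stand-in for the real executor so the example has no side effects.
    return f"{name} called with {sorted(tool_input)}"


turn = [
    {"type": "text", "text": "Reading both files."},
    {"type": "tool_use", "id": "id_a", "name": "read_file", "input": {"path": "a.py"}},
    {"type": "tool_use", "id": "id_b", "name": "read_file", "input": {"path": "b.py"}},
]
reply = collect_tool_results(turn, fake_runner)
print([block["tool_use_id"] for block in reply])
```

Two tool_use blocks in, two tool_result blocks out, all in a single list that becomes the content of one user message.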

The second most common mistake is setting max_tokens too low on the messages.create call. The API requires the parameter, so omitting it fails immediately; the quieter failure is a low value (under 1024), which regularly causes the model to stop mid-plan with stop_reason="max_tokens". The loop in Step 5 treats that as an unexpected stop and exits rather than continuing.

A subtler bug: the agent works in development but rewrites your real codebase the moment you run it from ~. The executor resolves ./workspace relative to the current directory, so where you launch it decides what it can touch. The fix is the sandbox from Step 7 plus reading the workspace location from an environment variable that always points at a throwaway directory, instead of the hard-coded relative path. Treat the local-process sandbox as a learning aid, not a production safety boundary; a container, a VM, or the hosted code-execution tool is the real isolation layer.
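A minimal sketch of that environment-variable approach follows. The variable name AGENT_WORKSPACE and the function resolve_workspace are assumptions made for this example, not names used elsewhere in the tutorial, and the refusal rule (never the home directory or one of its ancestors) is one reasonable policy among several.

```python
import tempfile
from collections.abc import Mapping
from pathlib import Path


def resolve_workspace(env: Mapping[str, str]) -> Path:
    """Pick the workspace from AGENT_WORKSPACE (a hypothetical variable name)."""
    raw = env.get("AGENT_WORKSPACE")
    if not raw:
        raise RuntimeError("AGENT_WORKSPACE is not set; refusing to guess a directory")
    workspace = Path(raw).expanduser().resolve()
    home = Path.home().resolve()
    # Refuse the home directory itself and any ancestor of it (such as '/').
    if workspace == home or workspace in home.parents:
        raise RuntimeError(f"refusing to use {workspace} as a workspace")
    return workspace


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(resolve_workspace({"AGENT_WORKSPACE": scratch}))
```

In executor.py and safety.py the call would be resolve_workspace(os.environ) in place of Path("./workspace").resolve(), so both modules agree on one location and fail loudly when it is missing.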

System-prompt drift is the slow-burning issue. The prompt in Step 5 is deliberately short; long, prescriptive system prompts make the agent rigid and brittle when the task drifts from the example. Per Anthropic’s “Building effective agents” post, simpler scaffolding generally outperforms heavier prompt engineering once tool descriptions are clear [16]. Lean on tool descriptions before lengthening the system prompt.

When to reach for a framework instead

A from-scratch loop is the right choice when you want full control of every tool call, when you’re embedding the agent inside a larger product, or when you’re learning. Three signs you’ve outgrown it:

  • You need multiple agents to coordinate (planner, executor, reviewer) and the routing logic is getting tangled. LangGraph and CrewAI exist for this.
  • You need durable, replayable execution across machine restarts. Temporal-style workflow engines or Anthropic’s planned long-running primitives are a better fit.
  • You need fine-grained eval harnesses that grade the agent’s tool-call traces. Promptfoo and LangSmith handle this without you writing a trace inspector by hand.

For the failing-test fix from Step 6, none of those apply. The whole point of the from-scratch pattern is that the loop is small enough to keep entirely in your head.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Anthropic Engineering, "Building effective agents": agents defined as LLMs in a loop that decide which tool to call next, with the agent code holding state and guardrails
  2. Anthropic Engineering, "Building effective agents": workflow vs agent distinction, with workflows defined as orchestration via predefined code paths
  3. Anthropic API pricing: per-million-token rates per model tier (verify current rates before estimating spend)
  4. anthropic-sdk-python on PyPI: current Python SDK release line used in this tutorial
  5. Anthropic, Implementing tool use: request shape (`tools` array, `name` / `description` / `input_schema`), response shape (`tool_use` blocks with `id` / `name` / `input`), and the tool_result reply contract
  6. Anthropic, Tool use overview: tool descriptions are the primary signal the model uses to pick and parameterise tools
  7. Python subprocess module: security considerations on `shell=True` and shell-injection exposure
  8. Anthropic Messages API reference: `stop_reason` enumerated values including `end_turn`, `tool_use`, `max_tokens`
  9. Python pathlib documentation: `Path.resolve()` collapses symlinks and `..` segments; `Path.relative_to()` raises `ValueError` when the target is outside the parent
  10. Anthropic Engineering, "Building effective agents": guardrail discussion framing tool blast radius as a function of agent autonomy
  11. Anthropic, Code execution tool: server-side sandbox with the `code_execution` tool type enabled via the `tools` array
  12. Anthropic, Claude Code overview: terminal-resident coding agent with built-in file, shell, and search tools running against the Anthropic API
  13. Cursor documentation home: VS Code fork with agent mode for in-editor edits, command execution, and project-state reading
  14. Anthropic Engineering, "Building effective agents": framing on where in a stack an agent fits vs which agent framework to pick
  15. Anthropic, Implementing tool use: `tool_use_id` pairs each request with its matching `tool_result`; results can be returned in any order
  16. Anthropic Engineering, "Building effective agents": simpler scaffolding outperforms heavier prompt engineering when tool descriptions are well-formed
