Neural Tech Daily

Smolagents tutorial: build a local-LLM research agent in eight steps

End-to-end Smolagents tutorial. Wire Hugging Face's CodeAgent to a local Qwen 3 on Ollama, add two custom arXiv tools, and ship a working research agent on a laptop.

The bottom line

If you want to build an agent without buying into a heavy framework, Smolagents is the path of least resistance. The library is a Hugging Face project that fits its core agent logic in roughly a thousand lines of Python 1 , runs against any LLM through LiteLLM, and treats the local-LLM case as a first-class citizen. This tutorial walks from a clean Python install to a working research agent that searches arXiv and returns paper summaries, using a Qwen 3 model served by Ollama on your own laptop. The whole stack is open-source, with no API key required and no monthly bill.

Smolagents is short for “small agents,” which is Hugging Face’s framing for a library that builds agents using just a few lines of code. CodeAgent is the default agent class: it generates Python code as its action format rather than JSON tool calls, and the launch blog post argues that writing actions in code wins on composability, object management, and generality, because LLMs already see plenty of Python in their training data. Ollama is a local model runner that exposes pulled models on a small HTTP server at port 11434, with an OpenAI-compatible endpoint that any LiteLLM-aware client can talk to.

Image: Smolagents guided tour, showing the CodeAgent versus ToolCallingAgent comparison, with Python code on one side and a JSON tool-call payload on the other.

What you’ll need

  • Python 3.10 or newer (the version floor for current Smolagents releases), with pip available
  • About 6GB of free RAM at runtime for the recommended model (8GB if you want headroom)
  • About 5GB of free disk space for the Qwen 3 model weights
  • A laptop. A discrete GPU is nice but not required; an Apple silicon Mac or a 16GB Linux laptop runs the recommended setup at usable speed
  • 30 minutes for the install, plus a minute or two per agent run on the recommended setup

The tutorial uses the Smolagents 1.x line, current as of 8 May 2026. Check the release page on GitHub when you read this in case the API surface has shifted; the install command and the agent-class names below are stable across the 1.21–1.24 minor releases.

Step 1: Install Smolagents and Ollama

Smolagents is a normal pip install. The two extras you want are litellm (which pulls the LiteLLM dependency for talking to Ollama) and toolkit (which brings the default tools, including the web-search tool covered under “Where to go next”).

pip install 'smolagents[litellm,toolkit]'

Then install Ollama. The canonical commands documented at ollama.com/docs are a single shell-script install on macOS and Linux, and a PowerShell install on Windows.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (PowerShell)
irm https://ollama.com/install.ps1 | iex

On macOS, the installer drops a menu-bar app that runs the local server in the background. The Linux installer registers a systemd unit so the server auto-starts on boot, and the Windows installer behaves the same way. You rarely need to start the server manually; we will confirm it’s running in step 2.

Image: Smolagents API reference, documenting the CodeAgent class with its parameters, including tools, model, max_steps, and additional_authorized_imports.

Step 2: Pull a local model on Ollama

The current recommendation is Qwen 3, the family Alibaba’s Qwen team released in April 2025 under an Apache 2.0 licence and which remains the recommended default in Ollama’s library as of May 2026. Qwen 3 is a useful default for agentic work because the team’s release blog post calls out tool-calling as a focus area. For Smolagents on a typical laptop, the 8B variant is the sweet spot: large enough that the CodeAgent’s Python generation is reliable, small enough to run on a 16GB machine without a discrete GPU.

ollama pull qwen3:8b

The download is about 5.2GB on disk for qwen3:8b 2 . If you’re on an 8GB laptop and the 8B model is too slow, fall back to qwen3:4b at about 2.5GB on disk 3 . The 4B model handles search-and-summarise queries acceptably; it sometimes struggles on multi-step reasoning, which is what step 8 calls out.

Confirm the server is up:

ollama list

You should see qwen3:8b (or whichever variant you pulled) in the listed models. If ollama list reports a connection error, run ollama serve in a separate terminal — that starts the local API on http://localhost:11434.
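The same check can be scripted from Python, which is handy before launching an agent run. The sketch below assumes Ollama's /api/tags endpoint on the default port; the helper names model_names and list_local_models are illustrative and not part of Ollama or Smolagents.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_BASE = "http://localhost:11434"  # default port used throughout this tutorial


def model_names(tags_payload: dict) -> list:
    """Extract model tags from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_BASE, timeout: float = 5.0) -> Optional[list]:
    """Return the pulled model tags, or None when the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:
        # URLError and socket timeouts are OSError subclasses: the server is not up.
        return None
```

Calling list_local_models() should return a list that includes qwen3:8b once the pull has finished, and None if the server still needs ollama serve.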

A note on currency: by the time you read this, a newer Qwen line may have shipped. Check ollama.com/library for the current canonical Qwen tag. The same LiteLLMModel pattern in step 3 works for any Ollama-served chat model; only the model_id string changes.

Image: Ollama Qwen 3 library page, showing the model variants table with disk sizes and context windows for the 0.6B through 235B tags.

Step 3: Configure Smolagents to use the local model

Smolagents talks to Ollama through LiteLLM. The Smolagents docs show the exact pattern, along with one inline warning: Ollama’s default context window is 2,048 tokens, which the docs say “will fail horribly” for agent workflows 4 . Set num_ctx=8192 (or larger if you have RAM headroom) when you initialise the model.

from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="ollama_chat/qwen3:8b",
    api_base="http://localhost:11434",
    api_key="ollama",
    num_ctx=8192,
)

Three things to notice. The ollama_chat/ prefix on model_id is the LiteLLM convention for routing to Ollama’s chat endpoint (/api/chat) rather than the completion-style endpoint that the bare ollama/ prefix targets. The api_key is ignored by a local Ollama server, so a placeholder such as the literal string "ollama" is fine. The api_base points at the default Ollama port 11434 5 , which is what the Ollama OpenAI-compatibility blog post documents.

Sanity-check the model end-to-end before adding tools:

agent = CodeAgent(tools=[], model=model)
result = agent.run("Calculate the sum of numbers from 1 to 10.")
print(result)

The first run takes 10–30 seconds on a Mac M-series laptop (Ollama loads the model into memory) and returns 55. If the run hangs or fails with a connection error, the Ollama server is not up; if the agent returns a code-execution error, check that the LiteLLMModel parameters above match exactly.

Step 4: Define the arXiv search tool

Smolagents’ @tool decorator is the recommended path for simple tools. The docs describe the function as needing three things: a clear name, type hints on inputs and output, and a docstring with an Args: section that the agent uses as its instruction manual. Those elements get baked into the agent’s system prompt at initialisation, so the docstring is load-bearing — the agent decides whether to call your tool based on what the docstring says it does.

import time
import urllib.parse
import urllib.request
from smolagents import tool

ARXIV_BASE = "http://export.arxiv.org/api/query"

@tool
def search_arxiv(query: str, max_results: int = 5) -> str:
    """
    Search arXiv for papers matching the query and return the raw Atom XML response.

    Args:
        query: A free-text search query (e.g., "graph attention networks").
        max_results: How many results to return (default 5, max 30).
    """
    # Enforce the 1-30 range promised in the docstring.
    max_results = max(1, min(int(max_results), 30))
    # Percent-encode the query so spaces and punctuation survive the URL.
    encoded = urllib.parse.quote(query)
    url = (
        f"{ARXIV_BASE}?search_query=all:{encoded}"
        f"&start=0&max_results={max_results}&sortBy=relevance"
    )
    # A timeout keeps a stalled connection from hanging the whole agent run.
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read().decode("utf-8")
    # Pause so back-to-back calls respect arXiv's requested 3-second delay.
    time.sleep(3)
    return body

Two details from the arXiv API user manual matter. The base URL is http://export.arxiv.org/api/query and the response is Atom 1.0 XML, not JSON. The manual asks API clients to “incorporate a 3 second delay” between repeat calls 6 , which is what the time.sleep(3) line at the bottom of the function does. The 3-second pause matters more once the agent starts running multi-query research loops.
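To see what the request URL looks like, and how the delay could be applied only when calls actually arrive back to back, here is a minimal sketch. The names build_arxiv_url and MinInterval are hypothetical helpers for illustration; the query parameters (search_query, start, max_results, sortBy) are the ones the tool above already uses.

```python
import time
import urllib.parse

ARXIV_BASE = "http://export.arxiv.org/api/query"


def build_arxiv_url(query: str, max_results: int = 5, sort_by: str = "relevance") -> str:
    """Build an arXiv API query URL, letting urlencode handle the escaping."""
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max(1, min(max_results, 30)),  # same cap as the docstring
        "sortBy": sort_by,
    }
    return f"{ARXIV_BASE}?{urllib.parse.urlencode(params)}"


class MinInterval:
    """Sleep only as long as needed to keep `seconds` between consecutive calls."""

    def __init__(self, seconds: float = 3.0):
        self.seconds = seconds
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Block if the previous call was too recent; return the time slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.seconds - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Compared with an unconditional time.sleep(3), a limiter like this costs nothing on the first call and still honours the 3-second gap inside a multi-query loop. Whether that trade is worth the extra class is a judgment call for a tutorial-sized tool.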

Image: Smolagents tools tutorial, documenting the @tool decorator pattern (the docstring and type-hints contract, plus a full custom-tool example) used to register the arXiv-fetch and arXiv-parse tools in this build.

Step 5: Define the abstract-parsing tool

The first tool returns Atom XML. Letting the agent parse XML in code is risky on smaller models; a second tool that does the parse and returns a clean Python list is much friendlier. Use feedparser for this. It handles Atom 1.0 cleanly and is the canonical Python choice for the arXiv API surface.

import feedparser
from smolagents import tool

@tool
def parse_arxiv_results(atom_xml: str) -> list:
    """
    Parse the Atom XML returned by search_arxiv and return a list of dicts
    with title, authors, summary, arxiv_id, and link for each paper.

    Args:
        atom_xml: The raw Atom 1.0 XML body returned by search_arxiv.
    """
    feed = feedparser.parse(atom_xml)
    return [
        {
            "title": entry.title,
            "authors": [a.name for a in entry.authors],
            "summary": entry.summary,
            "arxiv_id": entry.id.split("/abs/")[-1],
            "link": entry.link,
        }
        for entry in feed.entries
    ]

You will need feedparser installed:

pip install feedparser

The docstring carries the agent’s mental model of the tool. Two sentences are enough at this scope. If the agent later complains it doesn’t know how to use the tool, expand the docstring before you touch anything else.
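To make the shape of the parsed output concrete without hitting the network, the sketch below runs the same field extraction over a hand-written Atom entry. It deliberately swaps feedparser for the standard library's xml.etree.ElementTree so it runs with no extra install; the sample feed is placeholder data, not a real arXiv record, and parse_entries is a hypothetical name.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace

# Placeholder entry shaped like an arXiv API result (not a real paper).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A Placeholder Paper Title</title>
    <summary>  Placeholder abstract text.  </summary>
    <author><name>Ada Example</name></author>
    <author><name>Bo Sample</name></author>
    <link href="http://arxiv.org/abs/0000.00000v1" rel="alternate" type="text/html"/>
  </entry>
</feed>
"""


def parse_entries(atom_xml: str) -> list:
    """Return one dict per Atom entry with the same keys as parse_arxiv_results."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        entry_id = entry.findtext(f"{ATOM}id", default="")
        link_el = entry.find(f"{ATOM}link[@rel='alternate']")
        papers.append({
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "authors": [a.findtext(f"{ATOM}name", default="")
                        for a in entry.findall(f"{ATOM}author")],
            "summary": entry.findtext(f"{ATOM}summary", default="").strip(),
            # Everything after "/abs/" is the arXiv identifier, version included.
            "arxiv_id": entry_id.split("/abs/")[-1],
            "link": link_el.get("href") if link_el is not None else entry_id,
        })
    return papers
```

For the sample above, parse_entries(SAMPLE_FEED) yields a single dict whose arxiv_id is "0000.00000v1", which is the shape the agent sees when it calls the real tool.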

Step 6: Wire up the CodeAgent with both tools

Here’s where the CodeAgent comes together. Pass both tools, the model, a list of additional Python imports the agent’s sandbox is allowed to use, and a step ceiling.

from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="ollama_chat/qwen3:8b",
    api_base="http://localhost:11434",
    api_key="ollama",
    num_ctx=8192,
)

agent = CodeAgent(
    tools=[search_arxiv, parse_arxiv_results],
    model=model,
    additional_authorized_imports=["feedparser"],
    max_steps=10,
)

Two parameters do real work. The additional_authorized_imports list matters because Smolagents’ Python interpreter blocks imports outside a small safe list by default, so the agent’s generated code fails with an InterpreterError the first time it tries import feedparser without this entry. With the default local executor, the tools themselves run as ordinary Python and are not subject to the allow-list; the entry only covers code the agent writes. The max_steps=10 value caps how many iterations the agent runs before giving up; the framework default is 20 7 , but a tighter cap during development surfaces stuck-agent bugs faster.
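The idea behind an import allow-list can be illustrated in a few lines. This is a toy check written for this tutorial, not Smolagents' interpreter: blocked_imports is a hypothetical helper, and the allowed set passed in is an assumption rather than the library's real default list.

```python
import ast


def blocked_imports(code: str, allowed: set) -> list:
    """Return the sorted module names a snippet imports that are not allow-listed."""
    blocked = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare on the top-level package, so "os.path" is judged as "os".
            if name.split(".")[0] not in allowed:
                blocked.add(name)
    return sorted(blocked)


snippet = "import math\nimport feedparser\nfrom os import path"
print(blocked_imports(snippet, {"math"}))                # feedparser and os are rejected
print(blocked_imports(snippet, {"math", "feedparser"}))  # only os is rejected
```

Adding "feedparser" to the allowed set is the toy equivalent of listing it in additional_authorized_imports.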

Step 7: Run the agent

Now ask the agent a research question that exercises both tools.

result = agent.run(
    "Find the three most recent papers on graph attention networks "
    "and summarise their key contributions in two sentences each."
)
print(result)

What happens, in order: the agent emits a thought-step explaining its plan, generates a Python snippet that calls search_arxiv("graph attention networks", max_results=3), the framework executes the code and feeds the Atom XML back, the agent generates another snippet that calls parse_arxiv_results(...) on that XML, the framework executes again, and the agent reads the parsed list and writes its final answer with final_answer(...). Note that search_arxiv sorts by relevance, so “most recent” here means the top matches rather than strictly the newest submissions; the arXiv API also accepts sortBy=submittedDate with sortOrder=descending if recency matters. On a Mac M-series laptop with qwen3:8b and a cold model, the whole loop takes 60–90 seconds.
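The cycle described above can be mimicked with a scripted stand-in for the model. The sketch below is an illustration of the code-then-observe loop only, not how Smolagents is implemented: run_scripted_agent, the stub tools, and the fixed list of snippets are all hypothetical, and the real framework runs generated code in a restricted interpreter rather than bare exec.

```python
from typing import Any


class _Finished(Exception):
    """Raised by final_answer to break out of the step loop."""

    def __init__(self, value: Any):
        super().__init__("final answer reached")
        self.value = value


def run_scripted_agent(snippets: list, tools: dict, max_steps: int = 10):
    """Execute one snippet per step in a shared namespace until final_answer fires."""

    def final_answer(value: Any) -> None:
        raise _Finished(value)

    # Variables assigned in one step stay visible to later steps.
    namespace = {**tools, "final_answer": final_answer}
    steps_used = 0
    for code in snippets[:max_steps]:
        steps_used += 1
        try:
            exec(code, namespace)  # toy executor: no sandboxing at all
        except _Finished as done:
            return done.value, steps_used
    return None, steps_used  # step budget exhausted without an answer


# Stub tools standing in for the real arXiv tools.
stub_tools = {
    "search_arxiv": lambda query, max_results=5: "<feed/>",
    "parse_arxiv_results": lambda xml: [{"title": "t1"}, {"title": "t2"}],
}
script = [
    'xml = search_arxiv("graph attention networks", max_results=3)',
    "papers = parse_arxiv_results(xml)",
    'final_answer([p["title"] for p in papers])',
]
print(run_scripted_agent(script, stub_tools))
```

Running the same script with max_steps=2 returns no answer, which is the toy version of an agent hitting its step ceiling.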

Image: Smolagents launch blog post header, showing the library’s positioning, the CodeAgent default class, and the “agents that write their actions in code (as opposed to agents being used to write code)” framing. The agent run output described in this section reproduces the same thought-then-code-then-final-answer cycle the blog walks through.

Step 8: Common pitfalls and how to fix them

Four things will trip the first 10 runs on most setups. None are framework bugs; they are the seams where local-LLM agent work tends to break.

The num_ctx failure. If you skipped the num_ctx=8192 line in step 3 and got nonsense output or a model error after a few steps, that’s the docs’ “will fail horribly” warning landing on you. Set it explicitly. 8,192 is enough for the sample query; 16,384 if you ask the agent to read a paper’s full text rather than the abstract.
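A back-of-the-envelope check shows why 2,048 tokens runs out quickly. The sketch below uses a crude assumption of about four characters per token, which is a rule of thumb and not Qwen's tokenizer; estimated_tokens, fits_context, and the 1,024-token output reserve are illustrative choices.

```python
def estimated_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def fits_context(texts: list, num_ctx: int = 8192, reserve_for_output: int = 1024) -> bool:
    """Check whether the prompt pieces plus an output reserve fit in num_ctx."""
    used = sum(estimated_tokens(text) for text in texts)
    return used + reserve_for_output <= num_ctx


# About 8,000 characters of system prompt, tool docstrings, and observations.
prompt_pieces = ["x" * 8000]
print(fits_context(prompt_pieces, num_ctx=2048))  # False under these assumptions
print(fits_context(prompt_pieces, num_ctx=8192))  # True under these assumptions
```

For a real budget, count tokens with the model's own tokenizer; the point here is only that a few tool observations can exceed the default window.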

Vague tool docstrings. If the agent says it doesn’t know how to use one of your tools, the docstring is the first thing to expand, not the model. Add a one-line note about the input format, the output format, and one concrete example of when to call the tool. The Smolagents docs are explicit that the docstring is “an instruction manual for the LLM powering your agent” and that authors should “not neglect” it.

Local-model RAM limits. qwen3:8b needs about 6–8GB of free RAM during inference; if your laptop is paging swap, the agent slows from “slow” to “useless.” Either close other applications, or fall back to qwen3:4b at about 2.5GB on disk. The 4B model handles search-and-summarise queries; it is less reliable on multi-step planning.

Agents that loop without converging. If max_steps=10 runs out and the agent didn’t finish, the usual cause is that one of your tools’ outputs is larger than the model’s context window can hold. Return a smaller payload (truncate the abstract), shorten the prompt, or raise max_steps to 15–20. In production, watch the trace logs for the same tool being called in a near-loop; that’s a sign the agent is stuck and a max_steps bump won’t help.
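Truncating the abstract is the cheapest of those fixes. A minimal sketch, assuming the list-of-dicts shape that parse_arxiv_results returns; compact_papers and the 400-character default are hypothetical choices, not values from the Smolagents docs.

```python
def compact_papers(papers: list, max_chars: int = 400) -> list:
    """Return copies of the paper dicts with whitespace-normalised, capped summaries."""
    compact = []
    for paper in papers:
        # Collapse the hard line breaks that arXiv abstracts usually contain.
        summary = " ".join(paper.get("summary", "").split())
        if len(summary) > max_chars:
            summary = summary[:max_chars].rstrip() + "..."
        # Copy the dict so the caller's original data is left untouched.
        compact.append({**paper, "summary": summary})
    return compact
```

One option is to call this at the end of parse_arxiv_results so the agent never sees the full abstracts; another is to expose max_chars as a tool argument and document it in the docstring.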

Where to go next

Three concrete extensions if the basic agent works for you.

The first is to switch models. Pull qwen3:14b if your laptop has 32GB of RAM and a discrete GPU, or qwen3:4b if you want to run on an 8GB machine. The LiteLLMModel setup is identical; only the model_id string changes.
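The sizing advice in this tutorial can be written down as a small helper. The thresholds below simply restate the guidance above (8GB, 16GB, 32GB with a discrete GPU) and are not benchmarks; pick_qwen3_tag and litellm_model_id are hypothetical names.

```python
def pick_qwen3_tag(ram_gb: float, has_discrete_gpu: bool = False) -> str:
    """Map the tutorial's hardware guidance to an Ollama tag."""
    if ram_gb >= 32 and has_discrete_gpu:
        return "qwen3:14b"
    if ram_gb >= 16:
        return "qwen3:8b"
    return "qwen3:4b"


def litellm_model_id(tag: str) -> str:
    """Prefix an Ollama tag the way the LiteLLMModel setup in step 3 expects."""
    return f"ollama_chat/{tag}"


print(litellm_model_id(pick_qwen3_tag(16)))  # ollama_chat/qwen3:8b
```

The returned string goes straight into the model_id argument; nothing else in the agent setup needs to change.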

A second extension is general web search. The custom arXiv tools above suit academic research, but for broader queries the Smolagents [toolkit] extra ships a WebSearchTool that uses DuckDuckGo, and the framework’s tool-agnostic design means you can drop in Tavily or Serper API tools by writing a 20-line @tool function around their HTTP endpoints.

The third is production hardening. A thin caching layer over search_arxiv keeps repeat queries off the API. Raising the agent’s verbosity_level (or piping agent steps to a tool like Langfuse) gives you observability into long-running agents. If you outgrow local inference because latency or accuracy becomes the binding constraint, the same Smolagents code runs against an OpenAI or Anthropic model with a two-line change.
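A caching layer of that kind can be as small as a dictionary keyed on the normalised query. The sketch below is one possible shape, assuming you wrap the plain fetch function before applying @tool; cached_search is a hypothetical helper and the cache lives only for the lifetime of the process.

```python
from typing import Callable


def cached_search(fetch: Callable[[str, int], str]) -> Callable[..., str]:
    """Wrap a fetch(query, max_results) function with an in-memory cache."""
    cache = {}

    def wrapper(query: str, max_results: int = 5) -> str:
        # Normalise case and whitespace so trivially different queries share a key.
        key = (" ".join(query.lower().split()), max_results)
        if key not in cache:
            cache[key] = fetch(query, max_results)
        return cache[key]

    return wrapper
```

A cache hit also skips the 3-second pause inside the fetch function, which adds up quickly when the agent re-issues the same search across steps. For persistence across runs you would swap the dictionary for a file or a small database, which is outside the scope of this sketch.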

The Hugging Face Agents Course covers Smolagents in dedicated modules and is the right next read if you want depth on retrieval-augmented agents, multi-agent systems, and the security model around additional_authorized_imports. The course names LlamaIndex and LangGraph as alternative open-source agent frameworks worth knowing about, which is useful context if you want to compare the design trade-offs Smolagents makes against the heavier-framework path.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
