Build an arXiv Citation Finder Agent with Claude: End-to-End Python Tutorial
Paste a claim from your paper, and the agent queries the arXiv API, uses Claude Sonnet 4.6 to label supporting and contradicting work, and writes a BibTeX file.
What you’ll build
A small Python project that takes a single claim from a draft paper (something like “contrastive learning improves few-shot text classification on small models”) and runs it through three steps. The agent queries the arXiv API for the most relevant work, asks Claude Sonnet 4.6 to read each abstract and label it as supporting, contradicting, or off-topic, and then writes the supporting and contradicting papers to a BibTeX file you can load with your \bibliography{} command. The pipeline fits in roughly 150 lines of Python.
Per the arXiv API user manual, the public query endpoint at http://export.arxiv.org/api/query accepts field-prefixed search expressions like ti:, au:, abs:, and all: with Boolean operators, and the manual explicitly asks callers to incorporate a three-second delay between requests. 1 The arxiv Python wrapper on PyPI (version 4.0.0, released 17 May 2026) handles pagination, retries, and that delay for you with a Client plus Search API. 2 Claude Sonnet 4.6 is a good fit for the relevance-labelling step because it’s strong at structured-output extraction and is billed at $3 per million input tokens and $15 per million output tokens per the Anthropic pricing page. 3 Assuming a prompt plus abstract of a few hundred tokens and a short JSON reply, that works out to well under a cent per abstract, or a few cents per claim at the default 20 abstracts.
The differentiator over a plain keyword search is the stance label: contradicting papers matter just as much as supporting ones when you’re writing the related-work section, and they’re the work a hostile reviewer is most likely to ask why you didn’t cite.
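As a rough check on the cost figure, the arithmetic can be sketched in a few lines. The prices are the ones quoted above; every token count below is an assumed placeholder rather than a measurement, and call_cost and claim_cost are hypothetical helper names.

```python
# Prices quoted in the article: $3 per million input tokens, $15 per million output tokens.
INPUT_USD_PER_TOKEN = 3.0 / 1_000_000
OUTPUT_USD_PER_TOKEN = 15.0 / 1_000_000


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one Messages API call at the prices above."""
    return input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN


def claim_cost(
    n_abstracts: int = 20,
    label_in: int = 450,     # ASSUMED: labelling prompt plus one abstract
    label_out: int = 120,    # ASSUMED: one short JSON verdict
    rewrite_in: int = 120,   # ASSUMED: query-rewrite prompt plus the claim
    rewrite_out: int = 40,   # ASSUMED: one query string
) -> float:
    """Estimate one claim: one query-rewrite call plus one labelling call per abstract."""
    rewrite = call_cost(rewrite_in, rewrite_out)
    labelling = n_abstracts * call_cost(label_in, label_out)
    return rewrite + labelling


if __name__ == "__main__":
    print(f"per abstract: ${call_cost(450, 120):.5f}")
    print(f"per claim:    ${claim_cost():.4f}")
```

Under these assumed counts a single abstract costs a fraction of a cent and a 20-abstract claim costs a few cents. For real numbers, substitute the input_tokens and output_tokens values reported on each response's usage field.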
Prerequisites
You’ll need:
- Python 3.10 or newer (the arxiv package requires it per the PyPI page). 2
- An Anthropic API key. Set it once in your shell:
export ANTHROPIC_API_KEY="sk-ant-..."
The Anthropic Python SDK reads ANTHROPIC_API_KEY from the environment when you instantiate Anthropic() without arguments, per the SDK’s PyPI page. 4
- A working terminal and any code editor.
Step 1: Install the wrapper, the SDK, and Pydantic
Create a virtual environment and install three dependencies:
python -m venv .venv
source .venv/bin/activate
pip install arxiv anthropic pydantic
arxiv wraps the arXiv API; anthropic is the official Claude SDK; pydantic types the JSON Claude returns so the rest of the script sees a validated object instead of raw text.
Step 2: Query arXiv from Python
Create cite_finder.py and start with the search step. The arxiv package’s Client holds the HTTP transport (with the manual’s three-second courtesy delay baked in as delay_seconds=3.0 by default per the package’s PyPI documentation), 2 and Search describes what to query:
import arxiv
from dataclasses import dataclass


@dataclass
class Paper:
    arxiv_id: str
    title: str
    authors: list[str]
    abstract: str
    published: str
    pdf_url: str


def search_arxiv(query: str, max_results: int = 20) -> list[Paper]:
    """Run an arXiv API query and return a list of Paper records."""
    # The client owns the HTTP transport, the courtesy delay, and retries.
    client = arxiv.Client(page_size=20, delay_seconds=3.0, num_retries=3)
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    papers = []
    for result in client.results(search):
        papers.append(
            Paper(
                # entry_id is a URL; keep only the trailing ID segment.
                arxiv_id=result.entry_id.rsplit("/", 1)[-1],
                title=result.title.strip(),
                authors=[a.name for a in result.authors],
                # Abstracts arrive hard-wrapped; flatten them to one line.
                abstract=result.summary.replace("\n", " ").strip(),
                published=result.published.strftime("%Y-%m-%d"),
                pdf_url=result.pdf_url,
            )
        )
    return papers
Three details earn their place. sort_by=arxiv.SortCriterion.Relevance asks arXiv to rank by its own relevance heuristic rather than submission date; per the arXiv API basics page, the supported sort criteria are relevance, lastUpdatedDate, and submittedDate. 5 page_size=20 matches the default max_results, so the wrapper makes one HTTP call instead of paginating. The result.entry_id field returns a URL like http://arxiv.org/abs/2401.12345v2; splitting on the final / keeps the versioned ID (2401.12345v2) for the BibTeX key later. Old-style identifiers that contain a slash of their own (archive/number, as used before 2007) lose their archive prefix under this split, so treat it as a shortcut for new-style IDs.
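The ID extraction can also be sketched as a standalone helper that separates the version suffix and keeps the archive prefix of old-style identifiers. This is a minimal sketch, not part of the arxiv package: split_arxiv_id is a hypothetical name, and hep-th/9901001 is only an illustrative old-style ID.

```python
import re
from typing import Optional

# Matches a trailing version suffix such as "v2".
_VERSION_RE = re.compile(r"v(\d+)$")


def split_arxiv_id(entry_url: str) -> tuple[str, Optional[int]]:
    """Split an arXiv abs URL into (base_id, version).

    Works for new-style IDs (2401.12345v2) and for old-style IDs that
    contain a slash of their own (hep-th/9901001v1).
    """
    # Everything after "/abs/" is the identifier, slash included.
    short_id = entry_url.split("/abs/", 1)[-1]
    match = _VERSION_RE.search(short_id)
    if match is None:
        return short_id, None
    return short_id[: match.start()], int(match.group(1))


if __name__ == "__main__":
    print(split_arxiv_id("http://arxiv.org/abs/2401.12345v2"))
    print(split_arxiv_id("http://arxiv.org/abs/hep-th/9901001v1"))
```

Splitting on "/abs/" rather than on the final slash is the design choice here: it trades a slightly more specific assumption about the URL shape for correct handling of identifiers that contain a slash.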
Step 3: Turn a draft claim into an arXiv query
The user pastes a sentence; arXiv expects a Boolean keyword expression. Two paths work, and the second is the one this tutorial uses.
Path A (naive): strip stopwords from the claim and feed the rest to arXiv as an all: query. Cheap, works passably for unique technical terms, fails when the claim is phrased in everyday English.
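For comparison, Path A can be sketched in a dozen lines. The stopword list here is a tiny illustrative set, and naive_arxiv_query and its max_terms cap are hypothetical choices rather than anything the arXiv API prescribes.

```python
import re

# A tiny illustrative stopword list; a usable one would be much longer.
STOPWORDS = {
    "a", "an", "and", "the", "of", "on", "in", "for", "to", "with",
    "is", "are", "that", "this", "it", "by", "as", "at", "or",
}


def naive_arxiv_query(claim: str, max_terms: int = 6) -> str:
    """Build an all:-prefixed AND query from the non-stopword terms of a claim."""
    # Lower-case and split on anything that is not a letter or digit.
    words = re.findall(r"[a-z0-9]+", claim.lower())
    terms: list[str] = []
    for word in words:
        if word not in STOPWORDS and word not in terms:
            terms.append(word)
    # Cap the number of terms so the AND chain does not over-constrain the search.
    return " AND ".join(f"all:{term}" for term in terms[:max_terms])


if __name__ == "__main__":
    print(naive_arxiv_query(
        "Contrastive learning improves few-shot text classification on small models."
    ))
```

The sketch keeps generic verbs such as "improves" and splits "few-shot" into two unrelated terms, which illustrates why this path struggles with claims phrased in everyday English.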
Path B (Claude as query writer): ask Claude to rewrite the natural-language claim as an arXiv query string. The model knows the field-prefix syntax (ti:, abs:, all:, Boolean AND / OR / ANDNOT) per the API user manual 1 and is better than a stopword regex at picking the load-bearing technical terms.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QUERY_REWRITE_PROMPT = """\
Rewrite the following research claim as an arXiv API query string.
Use the field prefixes: ti: (title), abs: (abstract), all: (any field).
Combine terms with AND, OR, ANDNOT. Quote multi-word phrases.
Return ONLY the query string. No commentary, no markdown fences.

CLAIM: {claim}
"""


def claim_to_arxiv_query(claim: str) -> str:
    """Ask Claude to rewrite a natural-language claim as an arXiv query string."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": QUERY_REWRITE_PROMPT.format(claim=claim)}],
    )
    # The reply is a single text block containing only the query.
    return message.content[0].text.strip()
claude-sonnet-4-6 is the canonical Sonnet model ID per the Anthropic models overview; 6 max_tokens=200 is generous for a single-line query string and caps cost predictably. The Messages API call shape (single-turn user message, model + max_tokens + messages list) follows the reference. 7
Step 4: Label each abstract as supporting, contradicting, or off-topic
This is the agent’s value-add. For each paper the query returned, send the original claim plus the abstract to Claude and ask for a structured verdict. Pydantic gives the JSON a typed boundary:
import json
from pydantic import BaseModel, Field, ValidationError


class Verdict(BaseModel):
    stance: str = Field(description="supports | contradicts | off-topic")
    confidence: float = Field(ge=0.0, le=1.0)
    quoted_snippet: str = Field(description="A short verbatim phrase from the abstract that justifies the stance, or empty if off-topic.")
    reasoning: str = Field(description="One sentence on why this stance.")


VERDICT_PROMPT = """\
You are reviewing an arXiv abstract against a research claim.
Label the abstract's stance toward the claim:
- "supports": the abstract reports evidence consistent with the claim.
- "contradicts": the abstract reports evidence inconsistent with the claim.
- "off-topic": the abstract is about something else.

Return ONLY a JSON object with this shape, no markdown fences:
{{
  "stance": "supports" | "contradicts" | "off-topic",
  "confidence": 0.0 to 1.0,
  "quoted_snippet": "<short verbatim phrase from the abstract, or empty>",
  "reasoning": "<one sentence>"
}}

CLAIM: {claim}
ABSTRACT: {abstract}
"""


def label_paper(claim: str, paper: Paper) -> Verdict | None:
    """Ask Claude for a stance verdict on one abstract; return None if the reply is malformed."""
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": VERDICT_PROMPT.format(claim=claim, abstract=paper.abstract),
        }],
    )
    raw = message.content[0].text.strip()
    try:
        # Parse and validate the JSON in one step.
        return Verdict.model_validate_json(raw)
    except ValidationError:
        return None
Three things keep this honest. The prompt asks explicitly for no markdown fences, because Claude will sometimes wrap JSON in triple backticks and that wrapper trips the parser. The except ValidationError branch returns None rather than crashing, so a single malformed response doesn’t kill the run. And quoted_snippet is what makes the output useful in a paper: you can quote it in your related-work section to show you actually read the source, not just pattern-matched a title.
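If fenced replies still slip through despite the prompt, one option is to unwrap them before validation instead of discarding the verdict. The following is a minimal sketch using only the standard library; strip_code_fences and parse_json_reply are hypothetical helpers, and the fence marker is built from single backticks purely to keep the listing readable.

```python
import json

FENCE = "`" * 3  # three backticks, i.e. a Markdown code-fence marker


def strip_code_fences(raw: str) -> str:
    """Return the text inside one surrounding Markdown fence, or the input unchanged."""
    text = raw.strip()
    wrapped = text.startswith(FENCE) and text.endswith(FENCE)
    if not wrapped or len(text) < 2 * len(FENCE):
        return text
    inner = text[len(FENCE):-len(FENCE)]
    # Drop an optional language tag such as "json" on the opening fence line.
    first_line, newline, rest = inner.partition("\n")
    if newline and first_line.strip().isalnum():
        inner = rest
    return inner.strip()


def parse_json_reply(raw: str) -> dict:
    """Parse a model reply as JSON after removing any surrounding fence."""
    return json.loads(strip_code_fences(raw))


if __name__ == "__main__":
    reply = FENCE + 'json\n{"stance": "supports", "confidence": 0.9}\n' + FENCE
    print(parse_json_reply(reply))
```

In label_paper, the same idea would mean passing strip_code_fences(raw) to Verdict.model_validate_json rather than raw.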
Step 5: Write supporting papers as BibTeX
BibTeX entries are plain text with a @article{key, field = {value}, ...} shape; per the format reference, common fields for arXiv preprints are title, author, year, eprint, archivePrefix, primaryClass, and url. 8 The arXiv help pages recommend the @article entry type with the eprint field carrying the arXiv ID for citation-style files that recognise it. 9
def to_bibtex(paper: Paper, verdict: Verdict) -> str:
    """Render a Paper plus its verdict snippet as a BibTeX @article entry."""
    # Key: lower-cased first-author surname + year + dot-stripped, version-less arXiv ID.
    first_author_last = paper.authors[0].split()[-1].lower() if paper.authors else "anon"
    year = paper.published[:4]
    key = f"{first_author_last}{year}_{paper.arxiv_id.split('v')[0].replace('.', '')}"
    authors_bib = " and ".join(paper.authors)
    # Label the note with the actual stance so contradicting entries are not marked "Supports".
    stance_label = verdict.stance.capitalize()
    return (
        f"@article{{{key},\n"
        f"  title = {{{paper.title}}},\n"
        f"  author = {{{authors_bib}}},\n"
        f"  year = {{{year}}},\n"
        f"  eprint = {{{paper.arxiv_id}}},\n"
        f"  archivePrefix = {{arXiv}},\n"
        f"  url = {{{paper.pdf_url}}},\n"
        f"  note = {{{stance_label} claim with: \"{verdict.quoted_snippet}\"}},\n"
        f"}}\n"
    )
The note field is optional, but most bibliography styles print it, and it embeds the quoted snippet Claude pulled from the abstract, so when you come back to the BibTeX file three weeks later you remember why the citation was added. The BibTeX key follows the common surnameYEAR_arxivid convention: first author’s last name, lower-cased, plus four-digit year, plus the dot-stripped arXiv ID with its version suffix removed. Keys must be unique across your .bib file. Because the arXiv ID is part of the key, distinct papers get distinct keys in practice; a collision almost always means the same paper was written twice, which the hardening checklist covers.
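One caveat with embedding free text in the note field is that abstracts often contain characters LaTeX treats specially, such as % or &. A minimal sketch of an escaping step follows, assuming the snippet is plain text rather than intentional LaTeX; bibtex_escape is a hypothetical helper, and it deliberately leaves braces and backslashes alone.

```python
# Characters that commonly break a LaTeX run when they appear unescaped in plain text.
_LATEX_SPECIALS = {
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
}


def bibtex_escape(text: str) -> str:
    """Escape a small set of LaTeX special characters, one character at a time.

    Working character by character avoids double-escaping. Braces and
    backslashes are left untouched, so unbalanced braces or snippets that
    contain real LaTeX markup still need manual attention.
    """
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(bibtex_escape("improves F1 by 5% on low_resource & few-shot tasks"))
```

In to_bibtex, this would wrap verdict.quoted_snippet before it is interpolated into the note field. Titles are a different case: arXiv titles sometimes contain intentional math markup, which this escaping would mangle.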
Step 6: Wire the end-to-end run
Tie the pieces together:
from datetime import date


def find_citations(claim: str, max_results: int = 20, out_path: str = "citations.bib") -> None:
    """Rewrite the claim as a query, label each hit, and write a BibTeX file."""
    query = claim_to_arxiv_query(claim)
    print(f"arXiv query: {query}")
    papers = search_arxiv(query, max_results=max_results)
    print(f"Retrieved {len(papers)} papers from arXiv")

    supports: list[tuple[Paper, Verdict]] = []
    contradicts: list[tuple[Paper, Verdict]] = []
    for paper in papers:
        verdict = label_paper(claim, paper)
        if verdict is None:
            continue  # malformed reply; skip this abstract
        # Keep only confident verdicts; off-topic papers are dropped.
        if verdict.stance == "supports" and verdict.confidence >= 0.6:
            supports.append((paper, verdict))
        elif verdict.stance == "contradicts" and verdict.confidence >= 0.6:
            contradicts.append((paper, verdict))

    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(f"% Citations for claim: {claim}\n")
        # Stamp the file with the actual run date rather than a fixed one.
        fh.write(f"% Generated from arXiv API on {date.today().isoformat()}\n\n")
        fh.write("% --- Supporting work ---\n")
        for paper, verdict in supports:
            fh.write(to_bibtex(paper, verdict))
            fh.write("\n")
        fh.write("% --- Contradicting work (cite + address in related-work) ---\n")
        for paper, verdict in contradicts:
            fh.write(to_bibtex(paper, verdict))
            fh.write("\n")
    print(f"Wrote {len(supports)} supporting and {len(contradicts)} contradicting entries to {out_path}")


if __name__ == "__main__":
    import sys

    claim = sys.argv[1] if len(sys.argv) > 1 else (
        "Contrastive learning improves few-shot text classification on small models."
    )
    find_citations(claim)
Run it:
python cite_finder.py "Your claim here in plain English."
You get a citations.bib file with supporting entries first, contradicting entries second, each carrying the quoted snippet in the note field. Drop it next to your LaTeX source and add \bibliography{citations} at the bottom of your paper. The 0.6 confidence threshold filters out Claude’s “I’m not sure but…” labels; raise it to 0.8 for a stricter pass.
Why this beats raw keyword search
A keyword search on Google Scholar or arXiv’s web UI gives you a ranked list of papers; the work of reading each abstract and deciding whether it supports or contradicts your claim still falls to you. The agent automates that second step. Two trade-offs are worth naming:
- Claude can be wrong. A confident “supports” label on a paper whose abstract actually contradicts the claim happens. The quoted_snippet field is the mitigation: you check the snippet against the abstract before citing, and a mismatched snippet exposes the bad label in seconds. Per the Anthropic models overview, Sonnet is recommended for production extraction tasks but isn’t infallible; 6 always read the abstracts of the top hits yourself.
- arXiv is not the whole literature. The script misses paywalled journal-only work, conference papers not posted to arXiv, and older pre-2007 work where arXiv coverage is patchy. For a thorough lit review, run the agent against Semantic Scholar or OpenAlex APIs in parallel and merge the BibTeX outputs.
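The snippet check lends itself to a first automated pass before you read anything yourself. Below is a minimal sketch that treats a snippet as valid only when it occurs verbatim in the abstract, ignoring case and whitespace differences; snippet_in_abstract is a hypothetical helper, and a passing check says nothing about whether the stance label is right.

```python
import re


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace so line wraps do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_in_abstract(snippet: str, abstract: str) -> bool:
    """Return True when a non-empty snippet occurs verbatim in the abstract."""
    needle = _normalize(snippet)
    # An empty snippet (used for off-topic verdicts) never counts as a match.
    return bool(needle) and needle in _normalize(abstract)


if __name__ == "__main__":
    abstract = "We show that contrastive pre-training\nimproves few-shot accuracy."
    print(snippet_in_abstract("contrastive pre-training improves", abstract))
    print(snippet_in_abstract("contrastive learning hurts accuracy", abstract))
```

A verdict whose snippet fails this check is a candidate for dropping or manual review, since the quote was not actually taken from the abstract.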
The strongest case is the first-draft related-work section. You have a list of claims and a deadline. The agent gives you a defensible starting bibliography in five minutes per claim, with a quoted snippet per citation, that you can then refine manually.
Hardening checklist
A toy script is one thing; a project-shaped one needs a few more guards.
- Cache arXiv responses. The API manual notes results refresh once daily at midnight, so calling more than once per day for the same query wastes both your time and arXiv’s bandwidth. 1 Pickle the papers list to disk keyed by the query string.
- Retry the Claude call. The Anthropic Python SDK raises typed exceptions on rate-limit and server errors; 4 wrap client.messages.create(...) in a backoff loop for unattended runs over hundreds of abstracts.
- Pin the model ID. claude-sonnet-4-6 is a snapshot per the models overview; 6 pinning the exact ID rather than an evergreen alias keeps verdicts reproducible across SDK upgrades.
- Respect arXiv’s three-second delay. The arxiv package’s Client(delay_seconds=3.0) enforces it, 2 but if you swap to a raw requests implementation, add the sleep yourself per the API manual’s guidance. 1
- De-duplicate BibTeX keys. If two queries hit the same paper, the second to_bibtex(...) call writes a duplicate @article{key,...} block and LaTeX errors out at compile time. Track keys in a set and skip on collision.
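Three of these guards (caching, retry with backoff, and key de-duplication) can be sketched with the standard library alone. This is one possible shape under stated assumptions, not the article's implementation: all function names are hypothetical, the cache stores JSON instead of a pickle (so records must be plain dicts, for example dataclasses.asdict(paper)), and the retry helper takes the exception types to retry on as a parameter rather than naming SDK classes.

```python
import hashlib
import json
import random
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def cache_path(query: str, cache_dir: str = ".arxiv_cache") -> Path:
    """Map a query string to a stable file name via its SHA-256 digest."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


def cached(query: str, fetch: Callable[[str], list],
           cache_dir: str = ".arxiv_cache",
           max_age_seconds: float = 24 * 3600) -> list:
    """Return cached records for a query if fresh, else call fetch and store the result."""
    path = cache_path(query, cache_dir)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    records = fetch(query)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records), encoding="utf-8")
    return records


def with_backoff(call: Callable[[], T], retry_on: tuple = (Exception,),
                 attempts: int = 4, base_delay: float = 1.0) -> T:
    """Retry call with exponential backoff plus jitter; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            time.sleep(delay + random.uniform(0, delay))
    raise RuntimeError("attempts must be at least 1")


def unique_entries(entries: list[tuple[str, str]]) -> list[str]:
    """Keep the first BibTeX block for each key; entries are (key, block) pairs."""
    seen: set[str] = set()
    kept = []
    for key, block in entries:
        if key not in seen:
            seen.add(key)
            kept.append(block)
    return kept
```

In the article's script, with_backoff would wrap the client.messages.create(...) call in a lambda, with retry_on set to whichever SDK exception types you want to retry. unique_entries assumes to_bibtex is adapted to return the key alongside the rendered block.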
Where to take it next
A few natural extensions:
- Multi-claim batch. Loop over a list of claims (one per paragraph of your related-work draft) and write a single merged citations.bib. Use prompt caching on the system prompt across the loop to drop the per-call cost; cache reads are billed at 10% of base input per the Anthropic pricing page. 3
- Semantic Scholar or OpenAlex merge. Run the same claim through Semantic Scholar’s /paper/search endpoint and merge the BibTeX outputs by DOI to widen coverage past arXiv-only papers.
- PDF-grounded verdicts. The current pipeline reads abstracts. For higher-stakes claims, download the PDF via paper.pdf_url, extract the relevant section with a chunker, and pass that text to Claude alongside the abstract. Cost rises proportionally: a typical 500 KB PDF is roughly 125,000 input tokens per the Anthropic web-fetch guidance. 3
- Inline citation suggestions. Once the .bib file exists, run a second Claude pass over your draft paragraph and ask it to insert \cite{key} markers at the right sentences, with the keys constrained to the ones in your BibTeX file.
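For the inline-citation extension, the key constraint is easy to check after the fact. The sketch below extracts the keys from a .bib file and flags any \cite-style key in the model's output that is not among them. Both function names are hypothetical, and the regular expressions are simplifications that ignore optional arguments such as \citep[see][]{key}.

```python
import re

# Entry header such as "@article{smith2024_240112345," at the start of a line.
_ENTRY_KEY_RE = re.compile(r"^@\w+\{([^,\s]+),", re.MULTILINE)
# \cite{a,b} plus variants like \citep{...} and \citet{...}, without optional arguments.
_CITE_RE = re.compile(r"\\cite[a-zA-Z]*\{([^}]*)\}")


def bib_keys(bib_text: str) -> set[str]:
    """Return the set of entry keys defined in a BibTeX file's text."""
    return set(_ENTRY_KEY_RE.findall(bib_text))


def unknown_cite_keys(latex_text: str, allowed: set[str]) -> set[str]:
    """Return cited keys that do not appear in the allowed set."""
    used: set[str] = set()
    for group in _CITE_RE.findall(latex_text):
        used.update(key.strip() for key in group.split(",") if key.strip())
    return used - allowed


if __name__ == "__main__":
    bib = "@article{smith2024_240112345,\n  title = {Example},\n}\n"
    draft = r"Prior work \cite{smith2024_240112345, invented2020} agrees."
    print(unknown_cite_keys(draft, bib_keys(bib)))
```

A non-empty result means the model cited something that is not in your bibliography, so the paragraph should be rejected or regenerated rather than pasted into the draft.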
The full source for the walkthrough is the code blocks above, in order. Drop them into a single cite_finder.py, set ANTHROPIC_API_KEY, and the end-to-end run takes a claim, queries arXiv, labels each abstract, and writes a structured BibTeX file in one command.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. arXiv API user manual (export.arxiv.org/api/query endpoint, field prefixes ti / au / abs / all, Boolean operators, and the three-second courtesy delay)
- 2. arxiv.py: Python wrapper on PyPI (version 4.0.0 released 17 May 2026; Client with delay_seconds, num_retries, page_size; Python 3.10+ required)
- 3. Anthropic: API pricing (Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output tokens; cache-read multiplier 0.1x)
- 4. Anthropic: Python SDK (anthropic) on PyPI (ANTHROPIC_API_KEY environment variable and typed exceptions)
- 5. arXiv API basics: supported sort criteria (relevance, lastUpdatedDate, submittedDate)
- 6. Anthropic: Models overview (claude-sonnet-4-6 canonical model ID and recommended-tier guidance)
- 7. Anthropic: Messages API reference (single-turn user message shape, model, max_tokens, messages list)
- 8. BibTeX format: Wikipedia entry tracking the de-facto syntax (entry types, common fields, key conventions)
- 9. arXiv: Bibliography styles and citation guidance (eprint, archivePrefix fields for preprints)
Further Reading
- lukasschwab/arxiv.py on GitHub